• Open

    Vertical Federated Image Segmentation
    arXiv:2401.07931v2 Announce Type: replace-cross Abstract: With the popularization of AI solutions for image based problems, there has been a growing concern for both data privacy and acquisition. In a large number of cases, information is located on separate data silos and it can be difficult for a developer to consolidate all of it in a fashion that is appropriate for machine learning model development. Alongside this, a portion of these localized data regions may not have access to a labelled ground truth. This indicates that they have the capacity to reach conclusions numerically, but are not able to assign classifications amid a lack of pertinent information. Such a determination is often negligible, especially when attempting to develop image based solutions that often necessitate this capability. With this being the case, we propose an innovative vertical federated learning (VFL) model architecture that can operate under this common set of conditions. This is the first (and currently the only) implementation of a system that can work under the constraints of a VFL environment and perform image segmentation while maintaining nominal accuracies. We achieved this by utilizing an FCN that boasts the ability to operate on federates that lack labelled data and privately share the respective weights with a central server, that of which hosts the necessary features for classification. Tests were conducted on the CamVid dataset in order to determine the impact of heavy feature compression required for the transfer of information between federates, as well as to reach nominal conclusions about the overall performance metrics when working under such constraints.  ( 3 min )
    A Graph is Worth $K$ Words: Euclideanizing Graph using Pure Transformer
    arXiv:2402.02464v2 Announce Type: replace Abstract: Can we model non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The non-Euclidean property have posed a long term challenge in graph modeling. Despite recent GNN and Graphformer efforts encoding graphs as Euclidean vectors, recovering original graph from the vectors remains a challenge. We introduce GraphsGPT, featuring a Graph2Seq encoder that transforms non-Euclidean graphs into learnable graph words in a Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from graph words to ensure information equivalence. We pretrain GraphsGPT on 100M molecules and yield some interesting findings: (1) Pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on 8/9 graph classification and regression tasks. (2) Pretrained GraphGPT serves as a strong graph generator, demonstrated by its ability to perform both unconditional and conditional graph generation. (3) Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known non-Euclidean challenge. (4) Our proposed novel edge-centric GPT pretraining task is effective in graph fields, underscoring its success in both representation and generation.  ( 2 min )
    Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling
    arXiv:2401.09516v2 Announce Type: replace Abstract: Learning neural operators for solving partial differential equations (PDEs) has attracted great attention due to its high inference efficiency. However, training such operators requires generating a substantial amount of labeled data, i.e., PDE problems together with their solutions. The data generation process is exceptionally time-consuming, as it involves solving numerous systems of linear equations to obtain numerical solutions to the PDEs. Many existing methods solve these systems independently without considering their inherent similarities, resulting in extremely redundant computations. To tackle this problem, we propose a novel method, namely Sorting Krylov Recycling (SKR), to boost the efficiency of solving these systems, thus significantly accelerating data generation for neural operators training. To the best of our knowledge, SKR is the first attempt to address the time-consuming nature of data generation for learning neural operators. The working horse of SKR is Krylov subspace recycling, a powerful technique for solving a series of interrelated systems by leveraging their inherent similarities. Specifically, SKR employs a sorting algorithm to arrange these systems in a sequence, where adjacent systems exhibit high similarities. Then it equips a solver with Krylov subspace recycling to solve the systems sequentially instead of independently, thus effectively enhancing the solving efficiency. Both theoretical analysis and extensive experiments demonstrate that SKR can significantly accelerate neural operator data generation, achieving a remarkable speedup of up to 13.9 times.  ( 3 min )
    Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making
    arXiv:2302.05430v2 Announce Type: replace-cross Abstract: Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems
    arXiv:2301.11187v2 Announce Type: replace-cross Abstract: The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    MCRAGE: Synthetic Healthcare Data for Fairness
    arXiv:2310.18430v2 Announce Type: replace-cross Abstract: In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to samples from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, thereby achieving a more balanced distribution across all classes, which can be used to train an unbiased machine learning model. We measure the performance of MCRAGE versus alternative approaches using Accuracy, F1 score and AUROC. We provide theoretical justification for our method in terms of recent convergence results for DDPMs with minimal assumptions.  ( 3 min )
    Warped geometric information on the optimisation of Euclidean functions
    arXiv:2308.08305v2 Announce Type: replace-cross Abstract: We consider the fundamental task of optimising a real-valued function defined in a potentially high-dimensional Euclidean space, such as the loss function in many machine-learning tasks or the logarithm of the probability distribution in statistical inference. We use Riemannian geometry notions to redefine the optimisation problem of a function on the Euclidean space to a Riemannian manifold with a warped metric, and then find the function's optimum along this manifold. The warped metric chosen for the search domain induces a computational friendly metric-tensor for which optimal search directions associated with geodesic curves on the manifold becomes easier to compute. Performing optimization along geodesics is known to be generally infeasible, yet we show that in this specific manifold we can analytically derive Taylor approximations up to third-order. In general these approximations to the geodesic curve will not lie on the manifold, however we construct suitable retraction maps to pull them back onto the manifold. Therefore, we can efficiently optimize along the approximate geodesic curves. We cover the related theory, describe a practical optimization algorithm and empirically evaluate it on a collection of challenging optimisation benchmarks. Our proposed algorithm, using 3rd-order approximation of geodesics, tends to outperform standard Euclidean gradient-based counterparts in term of number of iterations until convergence.  ( 2 min )
    WHAC: World-grounded Humans and Cameras
    arXiv:2403.12959v1 Announce Type: cross Abstract: Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available.  ( 2 min )
    Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise
    arXiv:2306.00673v2 Announce Type: replace-cross Abstract: The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $({nd}/{\epsilon})^{O(d)}$ and under the Gaussian marginal distribution, PAC learns the class up to error rate $\epsilon$ with $O(\frac{K^{4d}}{\epsilon^{2d}} \cdot \log^{5d} n)$ samples even when an $\eta \leq O(\epsilon^d)$ fraction of them are corrupted by the nasty noise of Bshouty et al. (2002), possibly the strongest corruption model. Prior to this work, attribute-efficient robust algorithms are established only for the special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a structural result that translates the attribute sparsity to a sparsity pattern of the Chow vector under the basis of Hermite polynomials, and 2) a novel attribute-efficient robust Chow vector estimation algorithm which uses exclusively a restricted Frobenius norm to either certify a good approximation or to validate a sparsity-induced degree-$2d$ polynomial as a filter to detect corrupted samples.  ( 2 min )
    D-Cubed: Latent Diffusion Trajectory Optimisation for Dexterous Deformable Manipulation
    arXiv:2403.12861v1 Announce Type: cross Abstract: Mastering dexterous robotic manipulation of deformable objects is vital for overcoming the limitations of parallel grippers in real-world applications. Current trajectory optimisation approaches often struggle to solve such tasks due to the large search space and the limited task information available from a cost function. In this work, we propose D-Cubed, a novel trajectory optimisation method using a latent diffusion model (LDM) trained from a task-agnostic play dataset to solve dexterous deformable object manipulation tasks. D-Cubed learns a skill-latent space that encodes short-horizon actions in the play dataset using a VAE and trains a LDM to compose the skill latents into a skill trajectory, representing a long-horizon action trajectory in the dataset. To optimise a trajectory for a target task, we introduce a novel gradient-free guided sampling method that employs the Cross-Entropy method within the reverse diffusion process. In particular, D-Cubed samples a small number of noisy skill trajectories using the LDM for exploration and evaluates the trajectories in simulation. Then, D-Cubed selects the trajectory with the lowest cost for the subsequent reverse process. This effectively explores promising solution areas and optimises the sampled trajectories towards a target task throughout the reverse diffusion process. Through empirical evaluation on a public benchmark of dexterous deformable object manipulation tasks, we demonstrate that D-Cubed outperforms traditional trajectory optimisation and competitive baseline approaches by a significant margin. We further demonstrate that trajectories found by D-Cubed readily transfer to a real-world LEAP hand on a folding task.  ( 3 min )
    Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling
    arXiv:2401.08876v5 Announce Type: replace-cross Abstract: As deep neural networks are more commonly deployed in high-stakes domains, their black-box nature makes uncertainty quantification challenging. We investigate the effects of presenting conformal prediction sets--a distribution-free class of methods for generating prediction sets with specified coverage--to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. In a pre-registered analysis, we find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or less than Top-1 and Top-k displays for easy images, prediction sets excel at assisting humans in labeling out-of-distribution (OOD) images, especially when the set size is small. Our results empirically pinpoint practical challenges of conformal prediction sets and provide implications on how to incorporate them for real-world decision-making.  ( 2 min )
    Cut Facets and Cube Facets of Lifted Multicut Polytopes
    arXiv:2402.16814v2 Announce Type: replace-cross Abstract: The lifted multicut problem has diverse applications in the field of computer vision. Exact algorithms based on linear programming require an understanding of lifted multicut polytopes. Despite recent progress, two fundamental questions about these polytopes have remained open: Which lower cube inequalities define facets, and which cut inequalities define facets? In this article, we answer the first question by establishing conditions that are necessary, sufficient and efficiently decidable. Toward the second question, we show that deciding facet-definingness of cut inequalities is NP-hard. This completes the analysis of canonical facets of lifted multicut polytopes.  ( 2 min )
    EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models
    arXiv:2308.07269v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners from applying knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video.  ( 3 min )
    Active learning for effective Hamiltonian of super-large-scale atomic structures
    arXiv:2307.08929v2 Announce Type: replace-cross Abstract: The first-principles-based effective Hamiltonian is widely used to predict and simulate the properties of ferroelectrics and relaxor ferroelectrics. However, the parametrization method of the effective Hamiltonian is complicated and hardly can resolve the systems with complex interactions and/or complex components. Here, we developed an on-the-fly active machine learning approach to parametrize the effective Hamiltonian based on Bayesian linear regression. The parametrization is completed in molecular dynamics simulations, with the energy, forces and stress predicted at each step along with their uncertainties. First-principles calculations are executed when the uncertainties are large to retrain the parameters. This approach provides a universal and automatic way to compute the effective Hamiltonian parameters for any considered systems including complex systems which previous methods can not handle. The form of the effective Hamiltonian is also revised to include some complex terms. BaTiO3, CsPbI3 and SrTiO3/PbTiO3 surface are taken as examples to show the accurateness of this approach comparing with conventional first-principles parametrization method.  ( 2 min )
    Bayesian Nonparametrics Meets Data-Driven Robust Optimization
    arXiv:2401.15771v3 Announce Type: replace-cross Abstract: Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet Process) theory and recent decision-theoretic models of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, among which Ridge and LASSO regressions. Then, we theoretically demonstrate the existence of favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet Process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to high-dimensional sparse linear regression, binary classification, and robust location parameter estimation tasks.  ( 2 min )
    Towards Reducing Diagnostic Errors with Interpretable Risk Prediction
    arXiv:2402.10109v2 Announce Type: replace-cross Abstract: Many diagnostic errors occur because clinicians cannot easily access relevant information in patient Electronic Health Records (EHRs). In this work we propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses; our ultimate aim is to increase access to evidence and reduce diagnostic errors. In particular, we propose a Neural Additive Model to make predictions backed by evidence with individualized risk estimates at time-points where clinicians are still uncertain, aiming to specifically mitigate delays in diagnosis and errors stemming from an incomplete differential. To train such a model, it is necessary to infer temporally fine-grained retrospective labels of eventual "true" diagnoses. We do so with LLMs, to ensure that the input text is from before a confident diagnosis can be made. We use an LLM to retrieve an initial pool of evidence, but then refine this set of evidence according to correlations learned by the model. We conduct an in-depth evaluation of the usefulness of our approach by simulating how it might be used by a clinician to decide between a pre-defined list of differential diagnoses.  ( 2 min )
    OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods
    arXiv:2312.08255v2 Announce Type: replace-cross Abstract: Optical coherence tomography (OCT) is a non-invasive imaging technique with extensive clinical applications in ophthalmology. OCT enables the visualization of the retinal layers, playing a vital role in the early detection and monitoring of retinal diseases. OCT uses the principle of light wave interference to create detailed images of the retinal microstructures, making it a valuable tool for diagnosing ocular conditions. This work presents an open-access OCT dataset (OCTDL) comprising over 2000 OCT images labeled according to disease group and retinal pathology. The dataset consists of OCT records of patients with Age-related Macular Degeneration (AMD), Diabetic Macular Edema (DME), Epiretinal Membrane (ERM), Retinal Artery Occlusion (RAO), Retinal Vein Occlusion (RVO), and Vitreomacular Interface Disease (VID). The images were acquired with an Optovue Avanti RTVue XR using raster scanning protocols with dynamic scan length and image resolution. Each retinal b-scan was acquired by centering on the fovea and interpreted and cataloged by an experienced retinal specialist. In this work, we applied Deep Learning classification techniques to this new open-access dataset.  ( 2 min )
    Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector
    arXiv:2402.03094v2 Announce Type: replace-cross Abstract: This paper studies the challenging cross-domain few-shot object detection (CD-FSOD), aiming to develop an accurate object detector for novel domains with minimal labeled examples. While transformer-based open-set detectors, such as DE-ViT, show promise in traditional few-shot object detection, their generalization to CD-FSOD remains unclear: 1) can such open-set detection methods easily generalize to CD-FSOD? 2) If not, how can models be enhanced when facing huge domain gaps? To answer the first question, we employ measures including style, inter-class variance (ICV), and indefinable boundaries (IB) to understand the domain gap. Based on these measures, we establish a new benchmark named CD-FSOD to evaluate object detection methods, revealing that most of the current approaches fail to generalize across domains. Technically, we observe that the performance decline is associated with our proposed measures: style, ICV, and IB. Consequently, we propose several novel modules to address these issues. First, the learnable instance features align initial fixed instances with target categories, enhancing feature distinctiveness. Second, the instance reweighting module assigns higher importance to high-quality instances with slight IB. Third, the domain prompter encourages features resilient to different styles by synthesizing imaginary domains without altering semantic contents. These techniques collectively contribute to the development of the Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO), significantly improving upon the base DE-ViT. Experimental results validate the efficacy of our model. All datasets, codes, and models will be released to the community.  ( 3 min )
    On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm
    arXiv:2312.03526v2 Announce Type: replace-cross Abstract: Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenges of high computational demands. Dataset distillation, as a recent emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggle with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally-efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various neural architectures and datasets demonstrate the advancement of RDED: we can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).  ( 2 min )
    Tuning-Free Stochastic Optimization
    arXiv:2402.07793v2 Announce Type: replace-cross Abstract: Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of "tuning-free" algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show tuning-free matching of SGD is possible and achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result that shows no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.  ( 2 min )
    Guiding Masked Representation Learning to Capture Spatio-Temporal Relationship of Electrocardiogram
    arXiv:2402.09450v3 Announce Type: replace-cross Abstract: Electrocardiograms (ECG) are widely employed as a diagnostic tool for monitoring electrical signals originating from a heart. Recent machine learning research efforts have focused on the application of screening various diseases using ECG signals. However, adapting to the application of screening disease is challenging in that labeled ECG data are limited. Achieving general representation through self-supervised learning (SSL) is a well-known approach to overcome the scarcity of labeled data; however, a naive application of SSL to ECG data, without considering the spatial-temporal relationships inherent in ECG signals, may yield suboptimal results. In this paper, we introduce ST-MEM (Spatio-Temporal Masked Electrocardiogram Modeling), designed to learn spatio-temporal features by reconstructing masked 12-lead ECG data. ST-MEM outperforms other SSL baseline methods in various experimental settings for arrhythmia classification tasks. Moreover, we demonstrate that ST-MEM is adaptable to various lead combinations. Through quantitative and qualitative analysis, we show a spatio-temporal relationship within ECG data. Our code is available at https://github.com/bakqui/ST-MEM.  ( 2 min )
    Quantum Architecture Search with Unsupervised Representation Learning
    arXiv:2401.11576v2 Announce Type: replace-cross Abstract: Utilizing unsupervised representation learning for quantum architecture search (QAS) represents a cutting-edge approach poised to realize potential quantum advantage on Noisy Intermediate-Scale Quantum (NISQ) devices. Most QAS algorithms combine their search space and search algorithms together and thus generally require evaluating a large number of quantum circuits during the search process. Predictor-based QAS algorithms can alleviate this problem by directly estimating the performance of circuits according to their structures. However, a high-performance predictor generally requires very time-consuming labeling to obtain a large number of labeled quantum circuits. Recently, a classical neural architecture search algorithm Arch2vec inspires us by showing that architecture search can benefit from decoupling unsupervised representation learning from the search process. Whether unsupervised representation learning can help QAS without any predictor is still an open topic. In this work, we propose a framework QAS with unsupervised representation learning and visualize how unsupervised architecture representation learning encourages quantum circuit architectures with similar connections and operators to cluster together. Specifically, our framework enables the process of QAS to be decoupled from unsupervised architecture representation learning so that the learned representation can be directly applied to different downstream applications. Furthermore, our framework is predictor-free eliminating the need for a large number of labeled quantum circuits. During the search process, we use two algorithms REINFORCE and Bayesian Optimization to directly search on the latent representation, and compare them with the method Random Search. The results show our framework can more efficiently get well-performing candidate circuits within a limited number of searches.  ( 3 min )
    DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
    arXiv:2311.10081v2 Announce Type: replace-cross Abstract: We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback, they are still prone to generate unhelpful, hallucinated, or harmful responses. Second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these, we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs-- which focuses on LVLMs' ability to refine responses by incorporating feedback in multi-turn interactions. To address the non-differentiable nature of NLF, we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to SOTA LVMLs.  ( 3 min )
    TiC-CLIP: Continual Training of CLIP Models
    arXiv:2310.16226v2 Announce Type: replace-cross Abstract: Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to \emph{continually} train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various \emph{dynamic} evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.  ( 2 min )
    Transductive conformal inference with adaptive scores
    arXiv:2310.18108v2 Announce Type: replace-cross Abstract: Conformal inference is a fundamental and versatile tool that provides distribution-free guarantees for many machine learning tasks. We consider the transductive setting, where decisions are made on a test sample of $m$ new points, giving rise to $m$ conformal $p$-values. While classical results only concern their marginal distribution, we show that their joint distribution follows a P\'olya urn model, and establish a concentration inequality for their empirical distribution function. The results hold for arbitrary exchangeable scores, including adaptive ones that can use the covariates of the test+calibration samples at training stage for increased accuracy. We demonstrate the usefulness of these theoretical results through uniform, in-probability guarantees for two machine learning tasks of current interest: interval prediction for transductive transfer learning and novelty detection based on two-class classification.  ( 2 min )
    T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
    arXiv:2307.03132v2 Announce Type: replace-cross Abstract: Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.  ( 3 min )
    PGA: Personalizing Grasping Agents with Single Human-Robot Interaction
    arXiv:2310.12547v2 Announce Type: replace-cross Abstract: Language-Conditioned Robotic Grasping (LCRG) aims to develop robots that comprehend and grasp objects based on natural language instructions. While the ability to understand personal objects like my wallet facilitates more natural interaction with human users, current LCRG systems only allow generic language instructions, e.g., the black-colored wallet next to the laptop. To this end, we introduce a task scenario GraspMine alongside a novel dataset aimed at pinpointing and grasping personal objects given personal indicators via learning from a single human-robot interaction, rather than a large labeled dataset. Our proposed method, Personalized Grasping Agent (PGA), addresses GraspMine by leveraging the unlabeled image data of the user's environment, called Reminiscence. Specifically, PGA acquires personal object information by a user presenting a personal object with its associated indicator, followed by PGA inspecting the object by rotating it. Based on the acquired information, PGA pseudo-labels objects in the Reminiscence by our proposed label propagation algorithm. Harnessing the information acquired from the interactions and the pseudo-labeled objects in the Reminiscence, PGA adapts the object grounding model to grasp personal objects. This results in significant efficiency while previous LCRG systems rely on resource-intensive human annotations -- necessitating hundreds of labeled data to learn my wallet. Moreover, PGA outperforms baseline methods across all metrics and even shows comparable performance compared to the fully-supervised method, which learns from 9k annotated data samples. We further validate PGA's real-world applicability by employing a physical robot to execute GrsapMine. Code and data are publicly available at https://github.com/JHKim-snu/PGA.  ( 3 min )
    PopulAtion Parameter Averaging (PAPA)
    arXiv:2304.03094v3 Announce Type: replace Abstract: Ensemble methods combine the predictions of multiple models to improve performance, but they require significantly higher computation costs at inference time. To avoid these costs, multiple neural networks can be combined into one by averaging their weights. However, this usually performs significantly worse than ensembling. Weight averaging is only beneficial when different enough to benefit from combining them, but similar enough to average well. Based on this idea, we propose PopulAtion Parameter Averaging (PAPA): a method that combines the generality of ensembling with the efficiency of weight averaging. PAPA leverages a population of diverse models (trained on different data orders, augmentations, and regularizations) while slowly pushing the weights of the networks toward the population average of the weights. We also propose PAPA variants (PAPA-all, and PAPA-2) that average weights rarely rather than continuously; all methods increase generalization, but PAPA tends to perform best. PAPA reduces the performance gap between averaging and ensembling, increasing the average accuracy of a population of models by up to 0.8% on CIFAR-10, 1.9% on CIFAR-100, and 1.6% on ImageNet when compared to training independent (non-averaged) models.  ( 2 min )
    Pointer Networks with Q-Learning for OP Combinatorial Optimization
    arXiv:2311.02629v2 Announce Type: replace Abstract: The Orienteering Problem (OP) presents a unique challenge in Combinatorial Optimization (CO), emphasized by its widespread use in logistics, delivery, and transportation planning. Given the NP-hard nature of OP, obtaining optimal solutions is inherently complex. While Pointer Networks (Ptr-Nets) have exhibited prowess in various combinatorial tasks, their performance in the context of OP, and duties requiring focus on future return or exploration, leaves room for improvement. Recognizing the potency combining Reinforcement Learning (RL) methods with sequence-to-sequence models, this research unveils the Pointer Q-Network (PQN). This method combines Ptr-Nets and Q-learning, which, thanks to its critic only nature, outstands in its capability of capturing relationships within an embedded graph, a fundamental requirement in order to effectively address the specific challenges presented by OP. We explore the architecture and functionality of the PQN system, while showcasing its theoretical and practical advantages in terms of efficiency for combinatorial optimization problems such as the Orienteering Problem.  ( 2 min )
    Deep learning automates Cobb angle measurement compared with multi-expert observers
    arXiv:2403.12115v1 Announce Type: cross Abstract: Scoliosis, a prevalent condition characterized by abnormal spinal curvature leading to deformity, requires precise assessment methods for effective diagnosis and management. The Cobb angle is a widely used scoliosis quantification method that measures the degree of curvature between the tilted vertebrae. Yet, manual measuring of Cobb angles is time-consuming and labor-intensive, fraught with significant interobserver and intraobserver variability. To address these challenges and the lack of interpretability found in certain existing automated methods, we have created fully automated software that not only precisely measures the Cobb angle but also provides clear visualizations of these measurements. This software integrates deep neural network-based spine region detection and segmentation, spine centerline identification, pinpointing the most significantly tilted vertebrae, and direct visualization of Cobb angles on the original images. Upon comparison with the assessments of 7 expert readers, our algorithm exhibited a mean deviation in Cobb angle measurements of 4.17 degrees, notably surpassing the manual approach's average intra-reader discrepancy of 5.16 degrees. The algorithm also achieved intra-class correlation coefficients (ICC) exceeding 0.96 and Pearson correlation coefficients above 0.944, reflecting robust agreement with expert assessments and superior measurement reliability. Through the comprehensive reader study and statistical analysis, we believe this algorithm not only ensures a higher consensus with expert readers but also enhances interpretability and reproducibility during assessments. It holds significant promise for clinical application, potentially aiding physicians in more accurate scoliosis assessment and diagnosis, thereby improving patient care.  ( 3 min )
    Riemannian Stochastic Gradient Method for Nested Composition Optimization
    arXiv:2207.09350v2 Announce Type: replace-cross Abstract: This work considers optimization of composition of functions in a nested form over Riemannian manifolds where each function contains an expectation. This type of problems is gaining popularity in applications such as policy evaluation in reinforcement learning or model customization in meta-learning. The standard Riemannian stochastic gradient methods for non-compositional optimization cannot be directly applied as stochastic approximation of inner functions create bias in the gradients of the outer functions. For two-level composition optimization, we present a Riemannian Stochastic Composition Gradient Descent (R-SCGD) method that finds an approximate stationary point, with expected squared Riemannian gradient smaller than $\epsilon$, in $O(\epsilon^{-2})$ calls to the stochastic gradient oracle of the outer function and stochastic function and gradient oracles of the inner function. Furthermore, we generalize the R-SCGD algorithms for problems with multi-level nested compositional structures, with the same complexity of $O(\epsilon^{-2})$ for the first-order stochastic oracle. Finally, the performance of the R-SCGD method is numerically evaluated over a policy evaluation problem in reinforcement learning.  ( 2 min )
    Improving Human Sequential Decision-Making with Reinforcement Learning
    arXiv:2108.08454v5 Announce Type: replace Abstract: Workers spend a significant amount of time learning how to make good decisions. Evaluating the efficacy of a given decision, however, can be complicated -- e.g., decision outcomes are often long-term and relate to the original decision in complex ways. Surprisingly, even though learning good decision-making strategies is difficult, they can often be expressed in simple and concise forms. Focusing on sequential decision-making, we design a novel machine learning algorithm that is capable of extracting "best practices" from trace data and conveying its insights to humans in the form of interpretable "tips". Our algorithm selects the tip that best bridges the gap between the actions taken by human workers and those taken by the optimal policy in a way that accounts for which actions are consequential for achieving higher performance. We evaluate our approach through a series of randomized controlled experiments where participants manage a virtual kitchen. Our experiments show that the tips generated by our algorithm can significantly improve human performance relative to intuitive baselines. In addition, we discuss a number of empirical insights that can help inform the design of algorithms intended for human-AI interfaces. For instance, we find evidence that participants do not simply blindly follow our tips; instead, they combine them with their own experience to discover additional strategies for improving performance.  ( 3 min )
    Offline Imitation of Badminton Player Behavior via Experiential Contexts and Brownian Motion
    arXiv:2403.12406v1 Announce Type: cross Abstract: In the dynamic and rapid tactic involvements of turn-based sports, badminton stands out as an intrinsic paradigm that requires alter-dependent decision-making of players. While the advancement of learning from offline expert data in sequential decision-making has been witnessed in various domains, how to rally-wise imitate the behaviors of human players from offline badminton matches has remained underexplored. Replicating opponents' behavior benefits players by allowing them to undergo strategic development with direction before matches. However, directly applying existing methods suffers from the inherent hierarchy of the match and the compounding effect due to the turn-based nature of players alternatively taking actions. In this paper, we propose RallyNet, a novel hierarchical offline imitation learning model for badminton player behaviors: (i) RallyNet captures players' decision dependencies by modeling decision-making processes as a contextual Markov decision process. (ii) RallyNet leverages the experience to generate context as the agent's intent in the rally. (iii) To generate more realistic behavior, RallyNet leverages Geometric Brownian Motion (GBM) to model the interactions between players by introducing a valuable inductive bias for learning player behaviors. In this manner, RallyNet links player intents with interaction models with GBM, providing an understanding of interactions for sports analytics. We extensively validate RallyNet with the largest available real-world badminton dataset consisting of men's and women's singles, demonstrating its ability to imitate player behaviors. Results reveal RallyNet's superiority over offline imitation learning methods and state-of-the-art turn-based approaches, outperforming them by at least 16% in mean rule-based agent normalization score. Furthermore, we discuss various practical use cases to highlight RallyNet's applicability.  ( 3 min )
    Deep Generative Design for Mass Production
    arXiv:2403.12098v1 Announce Type: cross Abstract: Generative Design (GD) has evolved as a transformative design approach, employing advanced algorithms and AI to create diverse and innovative solutions beyond traditional constraints. Despite its success, GD faces significant challenges regarding the manufacturability of complex designs, often necessitating extensive manual modifications due to limitations in standard manufacturing processes and the reliance on additive manufacturing, which is not ideal for mass production. Our research introduces an innovative framework addressing these manufacturability concerns by integrating constraints pertinent to die casting and injection molding into GD, through the utilization of 2D depth images. This method simplifies intricate 3D geometries into manufacturable profiles, removing unfeasible features such as non-manufacturable overhangs and allowing for the direct consideration of essential manufacturing aspects like thickness and rib design. Consequently, designs previously unsuitable for mass production are transformed into viable solutions. We further enhance this approach by adopting an advanced 2D generative model, which offer a more efficient alternative to traditional 3D shape generation methods. Our results substantiate the efficacy of this framework, demonstrating the production of innovative, and, importantly, manufacturable designs. This shift towards integrating practical manufacturing considerations into GD represents a pivotal advancement, transitioning from purely inspirational concepts to actionable, production-ready solutions. Our findings underscore usefulness and potential of GD for broader industry adoption, marking a significant step forward in aligning GD with the demands of manufacturing challenges.  ( 2 min )
    Neuron-centric Hebbian Learning
    arXiv:2403.12076v1 Announce Type: cross Abstract: One of the most striking capabilities behind the learning mechanisms of the brain is the adaptation, through structural and functional plasticity, of its synapses. While synapses have the fundamental role of transmitting information across the brain, several studies show that it is the neuron activations that produce changes on synapses. Yet, most plasticity models devised for artificial Neural Networks (NNs), e.g., the ABCD rule, focus on synapses, rather than neurons, therefore optimizing synaptic-specific Hebbian parameters. This approach, however, increases the complexity of the optimization process since each synapse is associated to multiple Hebbian parameters. To overcome this limitation, we propose a novel plasticity model, called Neuron-centric Hebbian Learning (NcHL), where optimization focuses on neuron- rather than synaptic-specific Hebbian parameters. Compared to the ABCD rule, NcHL reduces the parameters from $5W$ to $5N$, being $W$ and $N$ the number of weights and neurons, and usually $N \ll W$. We also devise a "weightless" NcHL model, which requires less memory by approximating the weights based on a record of neuron activations. Our experiments on two robotic locomotion tasks reveal that NcHL performs comparably to the ABCD rule, despite using up to $\sim97$ times less parameters, thus allowing for scalable plasticity.  ( 2 min )
    Evaluating Named Entity Recognition: Comparative Analysis of Mono- and Multilingual Transformer Models on Brazilian Corporate Earnings Call Transcriptions
    arXiv:2403.12212v1 Announce Type: cross Abstract: Named Entity Recognition (NER) is a Natural Language Processing technique for extracting information from textual documents. However, much of the existing research on NER has been centered around English-language documents, leaving a gap in the availability of datasets tailored to the financial domain in Portuguese. This study addresses the need for NER within the financial domain, focusing on Portuguese-language texts extracted from earnings call transcriptions of Brazilian banks. By curating a comprehensive dataset comprising 384 transcriptions and leveraging weak supervision techniques for annotation, we evaluate the performance of monolingual models trained on Portuguese (BERTimbau and PTT5) and multilingual models (mBERT and mT5). Notably, we introduce a novel approach that reframes the token classification task as a text generation problem, enabling fine-tuning and evaluation of T5 models. Following the fine-tuning of the models, we conduct an evaluation on the test dataset, employing performance and error metrics. Our findings reveal that BERT-based models consistently outperform T5-based models. Furthermore, while the multilingual models exhibit comparable macro F1-scores, BERTimbau demonstrates superior performance over PTT5. A manual analysis of sentences generated by PTT5 and mT5 unveils a degree of similarity ranging from 0.89 to 1.0, between the original and generated sentences. However, critical errors emerge as both models exhibit discrepancies, such as alterations to monetary and percentage values, underscoring the importance of accuracy and consistency in the financial domain. Despite these challenges, PTT5 and mT5 achieve impressive macro F1-scores of 98.52% and 98.85%, respectively, with our proposed approach. Furthermore, our study sheds light on notable disparities in memory and time consumption for inference across the models.  ( 3 min )
    Best of Both Worlds: Hybrid SNN-ANN Architecture for Event-based Optical Flow Estimation
    arXiv:2306.02960v2 Announce Type: replace-cross Abstract: In the field of robotics, event-based cameras are emerging as a promising low-power alternative to traditional frame-based cameras for capturing high-speed motion and high dynamic range scenes. This is due to their sparse and asynchronous event outputs. Spiking Neural Networks (SNNs) with their asynchronous event-driven compute, show great potential for extracting the spatio-temporal features from these event streams. In contrast, the standard Analog Neural Networks (ANNs) fail to process event data effectively. However, training SNNs is difficult due to additional trainable parameters (thresholds and leaks), vanishing spikes at deeper layers, and a non-differentiable binary activation function. Furthermore, an additional data structure, membrane potential, responsible for keeping track of temporal information, must be fetched and updated at every timestep in SNNs. To overcome these challenges, we propose a novel SNN-ANN hybrid architecture that combines the strengths of both. Specifically, we leverage the asynchronous compute capabilities of SNN layers to effectively extract the input temporal information. Concurrently, the ANN layers facilitate training and efficient hardware deployment on traditional machine learning hardware such as GPUs. We provide extensive experimental analysis for assigning each layer to be spiking or analog, leading to a network configuration optimized for performance and ease of training. We evaluate our hybrid architecture for optical flow estimation on DSEC-flow and Multi-Vehicle Stereo Event-Camera (MVSEC) datasets. On the DSEC-flow dataset, the hybrid SNN-ANN architecture achieves a 40% reduction in average endpoint error (AEE) with 22% lower energy consumption compared to Full-SNN, and 48% lower AEE compared to Full-ANN, while maintaining comparable energy usage.  ( 3 min )
    Useful Compact Representations for Data-Fitting
    arXiv:2403.12206v1 Announce Type: cross Abstract: For minimization problems without 2nd derivative information, methods that estimate Hessian matrices can be very effective. However, conventional techniques generate dense matrices that are prohibitive for large problems. Limited-memory compact representations express the dense arrays in terms of a low rank representation and have become the state-of-the-art for software implementations on large deterministic problems. We develop new compact representations that are parameterized by a choice of vectors and that reduce to existing well known formulas for special choices. We demonstrate effectiveness of the compact representations for large eigenvalue computations, tensor factorizations and nonlinear regressions.  ( 2 min )
    FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications
    arXiv:2403.12285v1 Announce Type: cross Abstract: There are multiple sources of financial news online which influence market movements and trader's decisions. This highlights the need for accurate sentiment analysis, in addition to having appropriate algorithmic trading techniques, to arrive at better informed trading decisions. Standard lexicon based sentiment approaches have demonstrated their power in aiding financial decisions. However, they are known to suffer from issues related to context sensitivity and word ordering. Large Language Models (LLMs) can also be used in this context, but they are not finance-specific and tend to require significant computational resources. To facilitate a finance specific LLM framework, we introduce a novel approach based on the Llama 2 7B foundational model, in order to benefit from its generative nature and comprehensive language manipulation. This is achieved by fine-tuning the Llama2 7B model on a small portion of supervised financial sentiment analysis data, so as to jointly handle the complexities of financial lexicon and context, and further equipping it with a neural network based decision mechanism. Such a generator-classifier scheme, referred to as FinLlama, is trained not only to classify the sentiment valence but also quantify its strength, thus offering traders a nuanced insight into financial news articles. Complementing this, the implementation of parameter-efficient fine-tuning through LoRA optimises trainable parameters, thus minimising computational and memory requirements, without sacrificing accuracy. Simulation results demonstrate the ability of the proposed FinLlama to provide a framework for enhanced portfolio management decisions and increased market returns. These results underpin the ability of FinLlama to construct high-return portfolios which exhibit enhanced resilience, even during volatile periods and unpredictable market events.  ( 3 min )
    Light Curve Classification with DistClassiPy: a new distance-based classifier
    arXiv:2403.12120v1 Announce Type: cross Abstract: The rise of synoptic sky surveys has ushered in an era of big data in time-domain astronomy, making data science and machine learning essential tools for studying celestial objects. Tree-based (e.g. Random Forests) and deep learning models represent the current standard in the field. We explore the use of different distance metrics to aid in the classification of objects. For this, we developed a new distance metric based classifier called DistClassiPy. The direct use of distance metrics is an approach that has not been explored in time-domain astronomy, but distance-based methods can aid in increasing the interpretability of the classification result and decrease the computational costs. In particular, we classify light curves of variable stars by comparing the distances between objects of different classes. Using 18 distance metrics applied to a catalog of 6,000 variable stars in 10 classes, we demonstrate classification and dimensionality reduction. We show that this classifier meets state-of-the-art performance but has lower computational requirements and improved interpretability. We have made DistClassiPy open-source and accessible at https://pypi.org/project/distclassipy/ with the goal of broadening its applications to other classification scenarios within and beyond astronomy.  ( 2 min )
    Bilevel Hypergraph Networks for Multi-Modal Alzheimer's Diagnosis
    arXiv:2403.12719v1 Announce Type: new Abstract: Early detection of Alzheimer's disease's precursor stages is imperative for significantly enhancing patient outcomes and quality of life. This challenge is tackled through a semi-supervised multi-modal diagnosis framework. In particular, we introduce a new hypergraph framework that enables higher-order relations between multi-modal data, while utilising minimal labels. We first introduce a bilevel hypergraph optimisation framework that jointly learns a graph augmentation policy and a semi-supervised classifier. This dual learning strategy is hypothesised to enhance the robustness and generalisation capabilities of the model by fostering new pathways for information propagation. Secondly, we introduce a novel strategy for generating pseudo-labels more effectively via a gradient-driven flow. Our experimental results demonstrate the superior performance of our framework over current techniques in diagnosing Alzheimer's disease.  ( 2 min )
    Deciphering AutoML Ensembles: cattleia's Assistance in Decision-Making
    arXiv:2403.12664v1 Announce Type: new Abstract: In many applications, model ensembling proves to be better than a single predictive model. Hence, it is the most common post-processing technique in Automated Machine Learning (AutoML). The most popular frameworks use ensembles at the expense of reducing the interpretability of the final models. In our work, we propose cattleia - an application that deciphers the ensembles for regression, multiclass, and binary classification tasks. This tool works with models built by three AutoML packages: auto-sklearn, AutoGluon, and FLAML. The given ensemble is analyzed from different perspectives. We conduct a predictive performance investigation through evaluation metrics of the ensemble and its component models. We extend the validation perspective by introducing new measures to assess the diversity and complementarity of the model predictions. Moreover, we apply explainable artificial intelligence (XAI) techniques to examine the importance of variables. Summarizing obtained insights, we can investigate and adjust the weights with a modification tool to tune the ensemble in the desired way. The application provides the aforementioned aspects through dedicated interactive visualizations, making it accessible to a diverse audience. We believe the cattleia can support users in decision-making and deepen the comprehension of AutoML frameworks.  ( 2 min )
    Stochastic Halpern iteration in normed spaces and applications to reinforcement learning
    arXiv:2403.12338v1 Announce Type: cross Abstract: We analyze the oracle complexity of the stochastic Halpern iteration with variance reduction, where we aim to approximate fixed-points of nonexpansive and contractive operators in a normed finite-dimensional space. We show that if the underlying stochastic oracle is with uniformly bounded variance, our method exhibits an overall oracle complexity of $\tilde{O}(\varepsilon^{-5})$, improving recent rates established for the stochastic Krasnoselskii-Mann iteration. Also, we establish a lower bound of $\Omega(\varepsilon^{-3})$, which applies to a wide range of algorithms, including all averaged iterations even with minibatching. Using a suitable modification of our approach, we derive a $O(\varepsilon^{-2}(1-\gamma)^{-3})$ complexity bound in the case in which the operator is a $\gamma$-contraction. As an application, we propose new synchronous algorithms for average reward and discounted reward Markov decision processes. In particular, for the average reward, our method improves on the best-known sample complexity.  ( 2 min )
    Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales
    arXiv:2403.12403v1 Announce Type: cross Abstract: Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity offered by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address the lack of interpretability, in this paper, we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs and the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of social media hate speech datasets demonstrate: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability.  ( 2 min )
    Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation
    arXiv:2403.12100v1 Announce Type: cross Abstract: Next Point-of-Interests (POIs) recommendation task aims to provide a dynamic ranking of POIs based on users' current check-in trajectories. The recommendation performance of this task is contingent upon a comprehensive understanding of users' personalized behavioral patterns through Location-based Social Networks (LBSNs) data. While prior studies have adeptly captured sequential patterns and transitional relationships within users' check-in trajectories, a noticeable gap persists in devising a mechanism for discerning specialized behavioral patterns during distinct time slots, such as noon, afternoon, or evening. In this paper, we introduce an innovative data structure termed the ``Mobility Tree'', tailored for hierarchically describing users' check-in records. The Mobility Tree encompasses multi-granularity time slot nodes to learn user preferences across varying temporal periods. Meanwhile, we propose the Mobility Tree Network (MTNet), a multitask framework for personalized preference learning based on Mobility Trees. We develop a four-step node interaction operation to propagate feature information from the leaf nodes to the root node. Additionally, we adopt a multitask training strategy to push the model towards learning a robust representation. The comprehensive experimental results demonstrate the superiority of MTNet over ten state-of-the-art next POI recommendation models across three real-world LBSN datasets, substantiating the efficacy of time slot preference learning facilitated by Mobility Tree.  ( 2 min )
    Low-Trace Adaptation of Zero-shot Self-supervised Blind Image Denoising
    arXiv:2403.12382v1 Announce Type: cross Abstract: Deep learning-based denoiser has been the focus of recent development on image denoising. In the past few years, there has been increasing interest in developing self-supervised denoising networks that only require noisy images, without the need for clean ground truth for training. However, a performance gap remains between current self-supervised methods and their supervised counterparts. Additionally, these methods commonly depend on assumptions about noise characteristics, thereby constraining their applicability in real-world scenarios. Inspired by the properties of the Frobenius norm expansion, we discover that incorporating a trace term reduces the optimization goal disparity between self-supervised and supervised methods, thereby enhancing the performance of self-supervised learning. To exploit this insight, we propose a trace-constraint loss function and design the low-trace adaptation Noise2Noise (LoTA-N2N) model that bridges the gap between self-supervised and supervised learning. Furthermore, we have discovered that several existing self-supervised denoising frameworks naturally fall within the proposed trace-constraint loss as subcases. Extensive experiments conducted on natural and confocal image datasets indicate that our method achieves state-of-the-art performance within the realm of zero-shot self-supervised image denoising approaches, without relying on any assumptions regarding the noise.  ( 2 min )
    Variational Approach for Efficient KL Divergence Estimation in Dirichlet Mixture Models
    arXiv:2403.12158v1 Announce Type: cross Abstract: This study tackles the efficient estimation of Kullback-Leibler (KL) Divergence in Dirichlet Mixture Models (DMM), crucial for clustering compositional data. Despite the significance of DMMs, obtaining an analytically tractable solution for KL Divergence has proven elusive. Past approaches relied on computationally demanding Monte Carlo methods, motivating our introduction of a novel variational approach. Our method offers a closed-form solution, significantly enhancing computational efficiency for swift model comparisons and robust estimation evaluations. Validation using real and simulated data showcases its superior efficiency and accuracy over traditional Monte Carlo-based methods, opening new avenues for rapid exploration of diverse DMM models and advancing statistical analyses of compositional data.  ( 2 min )
    An Aligning and Training Framework for Multimodal Recommendations
    arXiv:2403.12384v1 Announce Type: cross Abstract: With the development of multimedia applications, multimodal recommendations are playing an essential role, as they can leverage rich contexts beyond user interactions. Existing methods mainly regard multimodal information as an auxiliary, using them to help learn ID features; however, there exist semantic gaps among multimodal content features and ID features, for which directly using multimodal information as an auxiliary would lead to misalignment in representations of users and items. In this paper, we first systematically investigate the misalignment issue in multimodal recommendations, and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments, namely alignment within contents, alignment between content and categorical ID, and alignment between users and items. Each alignment is characterized by a specific objective function and is integrated into our multimodal recommendation framework. To effectively train our AlignRec, we propose starting from pre-training the first alignment to obtain unified multimodal features and subsequently training the following two alignments together with these features as input. As it is essential to analyze whether each multimodal feature helps in training, we design three new classes of metrics to evaluate intermediate performance. Our extensive experiments on three real-world datasets consistently verify the superiority of AlignRec compared to nine baselines. We also find that the multimodal features generated by AlignRec are better than currently used ones, which are to be open-sourced.  ( 3 min )
    DMAD: Dual Memory Bank for Real-World Anomaly Detection
    arXiv:2403.12362v1 Announce Type: cross Abstract: Training a unified model is considered to be more suitable for practical industrial anomaly detection scenarios due to its generalization ability and storage efficiency. However, this multi-class setting, which exclusively uses normal data, overlooks the few but important accessible annotated anomalies in the real world. To address the challenge of real-world anomaly detection, we propose a new framework named Dual Memory bank enhanced representation learning for Anomaly Detection (DMAD). This framework handles both unsupervised and semi-supervised scenarios in a unified (multi-class) setting. DMAD employs a dual memory bank to calculate feature distance and feature attention between normal and abnormal patterns, thereby encapsulating knowledge about normal and abnormal instances. This knowledge is then used to construct an enhanced representation for anomaly score learning. We evaluated DMAD on the MVTec-AD and VisA datasets. The results show that DMAD surpasses current state-of-the-art methods, highlighting DMAD's capability in handling the complexities of real-world anomaly detection scenarios.  ( 2 min )
    Semisupervised score based matching algorithm to evaluate the effect of public health interventions
    arXiv:2403.12367v1 Announce Type: cross Abstract: Multivariate matching algorithms "pair" similar study units in an observational study to remove potential bias and confounding effects caused by the absence of randomizations. In one-to-one multivariate matching algorithms, a large number of "pairs" to be matched could mean both the information from a large sample and a large number of tasks, and therefore, to best match the pairs, such a matching algorithm with efficiency and comparatively limited auxiliary matching knowledge provided through a "training" set of paired units by domain experts, is practically intriguing. We proposed a novel one-to-one matching algorithm based on a quadratic score function $S_{\beta}(x_i,x_j)= \beta^T (x_i-x_j)(x_i-x_j)^T \beta$. The weights $\beta$, which can be interpreted as a variable importance measure, are designed to minimize the score difference between paired training units while maximizing the score difference between unpaired training units. Further, in the typical but intricate case where the training set is much smaller than the unpaired set, we propose a \underline{s}emisupervised \underline{c}ompanion \underline{o}ne-\underline{t}o-\underline{o}ne \underline{m}atching \underline{a}lgorithm (SCOTOMA) that makes the best use of the unpaired units. The proposed weight estimator is proved to be consistent when the truth matching criterion is indeed the quadratic score function. When the model assumptions are violated, we demonstrate that the proposed algorithm still outperforms some popular competing matching algorithms through a series of simulations. We applied the proposed algorithm to a real-world study to investigate the effect of in-person schooling on community Covid-19 transmission rate for policy making purpose.  ( 3 min )
    Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
    arXiv:2403.12267v1 Announce Type: cross Abstract: Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by \method\ achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy across 11 downstream datasets, of the next best baseline. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.  ( 2 min )
    Adaptive LPD Radar Waveform Design with Generative Deep Learning
    arXiv:2403.12254v1 Announce Type: cross Abstract: We propose a novel, learning-based method for adaptively generating low probability of detection (LPD) radar waveforms that blend into their operating environment. Our waveforms are designed to follow a distribution that is indistinguishable from the ambient radio frequency (RF) background -- while still being effective at ranging and sensing. To do so, we use an unsupervised, adversarial learning framework; our generator network produces waveforms designed to confuse a critic network, which is optimized to differentiate generated waveforms from the background. To ensure our generated waveforms are still effective for sensing, we introduce and minimize an ambiguity function-based loss on the generated waveforms. We evaluate the performance of our method by comparing the single-pulse detectability of our generated waveforms with traditional LPD waveforms using a separately trained detection neural network. We find that our method can generate LPD waveforms that reduce detectability by up to 90% while simultaneously offering improved ambiguity function (sensing) characteristics. Our framework also provides a mechanism to trade-off detectability and sensing performance.  ( 2 min )
    Decomposing Control Lyapunov Functions for Efficient Reinforcement Learning
    arXiv:2403.12210v1 Announce Type: cross Abstract: Recent methods using Reinforcement Learning (RL) have proven to be successful for training intelligent agents in unknown environments. However, RL has not been applied widely in real-world robotics scenarios. This is because current state-of-the-art RL methods require large amounts of data to learn a specific task, leading to unreasonable costs when deploying the agent to collect data in real-world applications. In this paper, we build from existing work that reshapes the reward function in RL by introducing a Control Lyapunov Function (CLF), which is demonstrated to reduce the sample complexity. Still, this formulation requires knowing a CLF of the system, but due to the lack of a general method, it is often a challenge to identify a suitable CLF. Existing work can compute low-dimensional CLFs via a Hamilton-Jacobi reachability procedure. However, this class of methods becomes intractable on high-dimensional systems, a problem that we address by using a system decomposition technique to compute what we call Decomposed Control Lyapunov Functions (DCLFs). We use the computed DCLF for reward shaping, which we show improves RL performance. Through multiple examples, we demonstrate the effectiveness of this approach, where our method finds a policy to successfully land a quadcopter in less than half the amount of real-world data required by the state-of-the-art Soft-Actor Critic algorithm.  ( 2 min )
    Approximation of RKHS Functionals by Neural Networks
    arXiv:2403.12187v1 Announce Type: cross Abstract: Motivated by the abundance of functional data such as time series and images, there has been a growing interest in integrating such data into neural networks and learning maps from function spaces to R (i.e., functionals). In this paper, we study the approximation of functionals on reproducing kernel Hilbert spaces (RKHS's) using neural networks. We establish the universality of the approximation of functionals on the RKHS's. Specifically, we derive explicit error bounds for those induced by inverse multiquadric, Gaussian, and Sobolev kernels. Moreover, we apply our findings to functional regression, proving that neural networks can accurately approximate the regression maps in generalized functional linear models. Existing works on functional learning require integration-type basis function expansions with a set of pre-specified basis functions. By leveraging the interpolating orthogonal projections in RKHS's, our proposed network is much simpler in that we use point evaluations to replace basis function expansions.  ( 2 min )
    Private graphon estimation via sum-of-squares
    arXiv:2403.12213v1 Announce Type: cross Abstract: We develop the first pure node-differentially-private algorithms for learning stochastic block models and for graphon estimation with polynomial running time for any constant number of blocks. The statistical utility guarantees match those of the previous best information-theoretic (exponential-time) node-private mechanisms for these problems. The algorithm is based on an exponential mechanism for a score function defined in terms of a sum-of-squares relaxation whose level depends on the number of blocks. The key ingredients of our results are (1) a characterization of the distance between the block graphons in terms of a quadratic optimization over the polytope of doubly stochastic matrices, (2) a general sum-of-squares convergence result for polynomial optimization over arbitrary polytopes, and (3) a general approach to perform Lipschitz extensions of score functions as part of the sum-of-squares algorithmic paradigm.  ( 2 min )
    GT-Rain Single Image Deraining Challenge Report
    arXiv:2403.12327v1 Announce Type: cross Abstract: This report reviews the results of the GT-Rain challenge on single image deraining at the UG2+ workshop at CVPR 2023. The aim of this competition is to study the rainy weather phenomenon in real world scenarios, provide a novel real world rainy image dataset, and to spark innovative ideas that will further the development of single image deraining methods on real images. Submissions were trained on the GT-Rain dataset and evaluated on an extension of the dataset consisting of 15 additional scenes. Scenes in GT-Rain are comprised of real rainy image and ground truth image captured moments after the rain had stopped. 275 participants were registered in the challenge and 55 competed in the final testing phase.  ( 3 min )
    Bootstrapping Reinforcement Learning with Imitation for Vision-Based Agile Flight
    arXiv:2403.12203v1 Announce Type: cross Abstract: We combine the effectiveness of Reinforcement Learning (RL) and the efficiency of Imitation Learning (IL) in the context of vision-based, autonomous drone racing. We focus on directly processing visual input without explicit state estimation. While RL offers a general framework for learning complex controllers through trial and error, it faces challenges regarding sample efficiency and computational demands due to the high dimensionality of visual inputs. Conversely, IL demonstrates efficiency in learning from visual demonstrations but is limited by the quality of those demonstrations and faces issues like covariate shift. To overcome these limitations, we propose a novel training framework combining RL and IL's advantages. Our framework involves three stages: initial training of a teacher policy using privileged state information, distilling this policy into a student policy using IL, and performance-constrained adaptive RL fine-tuning. Our experiments in both simulated and real-world environments demonstrate that our approach achieves superior performance and robustness than IL or RL alone in navigating a quadrotor through a racing course using only visual information without explicit state estimation.  ( 2 min )
    Reference-based Metrics Disprove Themselves in Question Generation
    arXiv:2403.12242v1 Announce Type: cross Abstract: Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.  ( 2 min )
    Compositional learning of functions in humans and machines
    arXiv:2403.12201v1 Announce Type: cross Abstract: The ability to learn and compose functions is foundational to efficient learning and reasoning in humans, enabling flexible generalizations such as creating new dishes from known cooking processes. Beyond sequential chaining of functions, existing linguistics literature indicates that humans can grasp more complex compositions with interacting functions, where output production depends on context changes induced by different function orderings. Extending the investigation into the visual domain, we developed a function learning paradigm to explore the capacity of humans and neural network models in learning and reasoning with compositional functions under varied interaction conditions. Following brief training on individual functions, human participants were assessed on composing two learned functions, in ways covering four main interaction types, including instances in which the application of the first function creates or removes the context for applying the second function. Our findings indicate that humans can make zero-shot generalizations on novel visual function compositions across interaction conditions, demonstrating sensitivity to contextual changes. A comparison with a neural network model on the same task reveals that, through the meta-learning for compositionality (MLC) approach, a standard sequence-to-sequence Transformer can mimic human generalization patterns in composing functions.  ( 2 min )
    Transfer Learning for T-Cell Response Prediction
    arXiv:2403.12117v1 Announce Type: cross Abstract: We study the prediction of T-cell response for specific given peptides, which could, among other applications, be a crucial step towards the development of personalized cancer vaccines. It is a challenging task due to limited, heterogeneous training data featuring a multi-domain structure; such data entail the danger of shortcut learning, where models learn general characteristics of peptide sources, such as the source organism, rather than specific peptide characteristics associated with T-cell response. Using a transformer model for T-cell response prediction, we show that the danger of inflated predictive performance is not merely theoretical but occurs in practice. Consequently, we propose a domain-aware evaluation scheme. We then study different transfer learning techniques to deal with the multi-domain structure and shortcut learning. We demonstrate a per-source fine tuning approach to be effective across a wide range of peptide sources and further show that our final model outperforms existing state-of-the-art approaches for predicting T-cell responses for human peptides.  ( 2 min )
    FLex: Joint Pose and Dynamic Radiance Fields Optimization for Stereo Endoscopic Videos
    arXiv:2403.12198v1 Announce Type: cross Abstract: Reconstruction of endoscopic scenes is an important asset for various medical applications, from post-surgery analysis to educational training. Neural rendering has recently shown promising results in endoscopic reconstruction with deforming tissue. However, the setup has been restricted to a static endoscope, limited deformation, or required an external tracking device to retrieve camera pose information of the endoscopic camera. With FLex we adress the challenging setup of a moving endoscope within a highly dynamic environment of deforming tissue. We propose an implicit scene separation into multiple overlapping 4D neural radiance fields (NeRFs) and a progressive optimization scheme jointly optimizing for reconstruction and camera poses from scratch. This improves the ease-of-use and allows to scale reconstruction capabilities in time to process surgical videos of 5,000 frames and more; an improvement of more than ten times compared to the state of the art while being agnostic to external tracking information. Extensive evaluations on the StereoMIS dataset show that FLex significantly improves the quality of novel view synthesis while maintaining competitive pose accuracy.  ( 2 min )
    Unsupervised End-to-End Training with a Self-Defined Bio-Inspired Target
    arXiv:2403.12116v1 Announce Type: cross Abstract: Current unsupervised learning methods depend on end-to-end training via deep learning techniques such as self-supervised learning, with high computational requirements, or employ layer-by-layer training using bio-inspired approaches like Hebbian learning, using local learning rules incompatible with supervised learning. Both approaches are problematic for edge AI hardware that relies on sparse computational resources and would strongly benefit from alternating between unsupervised and supervised learning phases - thus leveraging widely available unlabeled data from the environment as well as labeled training datasets. To solve this challenge, in this work, we introduce a 'self-defined target' that uses Winner-Take-All (WTA) selectivity at the network's final layer, complemented by regularization through biologically inspired homeostasis mechanism. This approach, framework-agnostic and compatible with both global (Backpropagation) and local (Equilibrium propagation) learning rules, achieves a 97.6% test accuracy on the MNIST dataset. Furthermore, we demonstrate that incorporating a hidden layer enhances classification accuracy and the quality of learned features across all training methods, showcasing the advantages of end-to-end unsupervised training. Extending to semi-supervised learning, our method dynamically adjusts the target according to data availability, reaching a 96.6% accuracy with just 600 labeled MNIST samples. This result highlights our 'unsupervised target' strategy's efficacy and flexibility in scenarios ranging from abundant to no labeled data availability.  ( 2 min )
    Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification
    arXiv:2403.12151v1 Announce Type: cross Abstract: Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.  ( 2 min )
    Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
    arXiv:2403.12075v1 Announce Type: cross Abstract: With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models. In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies and discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as ``safe'' by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models.  ( 3 min )
    Circular Belief Propagation for Approximate Probabilistic Inference
    arXiv:2403.12106v1 Announce Type: cross Abstract: Belief Propagation (BP) is a simple probabilistic inference algorithm, consisting of passing messages between nodes of a graph representing a probability distribution. Its analogy with a neural network suggests that it could have far-ranging applications for neuroscience and artificial intelligence. Unfortunately, it is only exact when applied to cycle-free graphs, which restricts the potential of the algorithm. In this paper, we propose Circular Belief Propagation (CBP), an extension of BP which limits the detrimental effects of message reverberation caused by cycles by learning to detect and cancel spurious correlations and belief amplifications. We show in numerical experiments involving binary probabilistic graphs that CBP far outperforms BP and reaches good performance compared to that of previously proposed algorithms.  ( 2 min )
    The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported
    arXiv:2403.12082v1 Announce Type: cross Abstract: Recent work arXiv.2310.02238 asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content.'' This claim is shown to be overbroad. A small experiment of less than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett...''  ( 2 min )
    Beyond Quantities: Machine Learning-based Characterization of Inequality in Infrastructure Quality Provision in Cities
    arXiv:2403.12074v1 Announce Type: cross Abstract: The objective of this study is to characterize inequality in infrastructure quality across urban areas. While a growing of body of literature has recognized the importance of characterizing infrastructure inequality in cities and provided quantified metrics to inform urban development plans, the majority of the existing approaches focus primarily on measuring the quantity of infrastructure, assuming that more infrastructure is better. Also, the existing research focuses primarily on index-based approaches in which the status of infrastructure provision in urban areas is determined based on assumed subjective weights. The focus on infrastructure quantity and use of indices obtained from subjective weights has hindered the ability to properly examine infrastructure inequality as it pertains to urban inequality and environmental justice considerations. Recognizing this gap, we propose a machine learning-based approach in which infrastructure features that shape environmental hazard exposure are identified and we use the weights obtained by the model to calculate an infrastructure quality provision for spatial areas of cities and accordingly, quantify the extent of inequality in infrastructure quality. The implementation of the model in five metropolitan areas in the U.S. demonstrates the capability of the proposed approach in characterizing inequality in infrastructure quality and capturing city-specific differences in the weights of infrastructure features. The results also show that areas in which low-income populations reside have lower infrastructure quality provision, suggesting the lower infrastructure quality provision as a determinant of urban disparities. Accordingly, the proposed approach can be effectively used to inform integrated urban design strategies to promote infrastructure equity and environmental justice based on data-driven and machine intelligence-based insights.  ( 3 min )
    Are LLMs Good Cryptic Crossword Solvers?
    arXiv:2403.12094v1 Announce Type: cross Abstract: Cryptic crosswords are puzzles that rely not only on general knowledge but also on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models. However, the abilities of large language models (LLMs) have not yet been tested on this task. In this paper, we establish the benchmark results for three popular LLMs -- LLaMA2, Mistral, and ChatGPT -- showing that their performance on this task is still far from that of humans.  ( 2 min )
    Foundation Models and Information Retrieval in Digital Pathology
    arXiv:2403.12090v1 Announce Type: cross Abstract: The paper reviews the state-of-the-art of foundation models, LLMs, generative AI, information retrieval and CBIR in digital pathology  ( 2 min )
    Beyond Beats: A Recipe to Song Popularity? A machine learning approach
    arXiv:2403.12079v1 Announce Type: cross Abstract: Music popularity prediction has garnered significant attention in both industry and academia, fuelled by the rise of data-driven algorithms and streaming platforms like Spotify. This study aims to explore the predictive power of various machine learning models in forecasting song popularity using a dataset comprising 30,000 songs spanning different genres from 1957 to 2020. Methods: We employ Ordinary Least Squares (OLS), Multivariate Adaptive Regression Splines (MARS), Random Forest, and XGBoost algorithms to analyse song characteristics and their impact on popularity. Results: Ordinary Least Squares (OLS) regression analysis reveals genre as the primary influencer of popularity, with notable trends over time. MARS modelling highlights the complex relationship between variables, particularly with features like instrumentalness and duration. Random Forest and XGBoost models underscore the importance of genre, especially EDM, in predicting popularity. Despite variations in performance, Random Forest emerges as the most effective model, improving prediction accuracy by 7.1% compared to average scores. Despite the importance of genre, predicting song popularity remains challenging, as observed variations in music-related features suggest complex interactions between genre and other factors. Consequently, while certain characteristics like loudness and song duration may impact popularity scores, accurately predicting song success remains elusive.  ( 2 min )
    Floralens: a Deep Learning Model for the Portuguese Native Flora
    arXiv:2403.12072v1 Announce Type: cross Abstract: Machine-learning techniques, namely deep convolutional neural networks, are pivotal for image-based identification of biological species in many Citizen Science platforms. However, the construction of critically sized and sampled datasets to train the networks and the choice of the network architectures itself remains little documented and, therefore, does not lend itself to be easily replicated. In this paper, we develop a streamlined methodology for building datasets for biological taxa from publicly available research-grade datasets and for deriving models from these datasets using off-the-shelf deep convolutional neural networks such as those provided by Google's AutoML Vision cloud service. Our case study is the Portuguese native flora, anchored in a high-quality dataset, provided by the Sociedade Portuguesa de Bot\^anica, scaled up by adding sampled data from iNaturalist, Pl@ntNet, and Observation.org. We find that with a careful dataset design, off-the-shelf machine-learning cloud services produce accurate models with relatively little effort that rival those provided by state-of-the-art citizen science platforms. The best model we derived, dubbed Floralens, has been integrated into the public website of Project Biolens, where we gather models for other taxa as well. The dataset used to train the model and its namesake is publicly available on Zenodo.  ( 2 min )
    Evaluating Terrain-Dependent Performance for Martian Frost Detection in Visible Satellite Observations
    arXiv:2403.12080v1 Announce Type: cross Abstract: Seasonal frosting and defrosting on the surface of Mars is hypothesized to drive both climate processes and the formation and evolution of geomorphological features such as gullies. Past studies have focused on manually analyzing the behavior of the frost cycle in the northern mid-latitude region of Mars using high-resolution visible observations from orbit. Extending these studies globally requires automating the detection of frost using data science techniques such as convolutional neural networks. However, visible indications of frost presence can vary significantly depending on the geologic context on which the frost is superimposed. In this study, we (1) present a novel approach for spatially partitioning data to reduce biases in model performance estimation, (2) illustrate how geologic context affects automated frost detection, and (3) propose mitigations to observed biases in automated frost detection.  ( 2 min )
    Fairness Evaluation for Uplift Modeling in the Absence of Ground Truth
    arXiv:2403.12069v1 Announce Type: cross Abstract: The acceleration in the adoption of AI-based automated decision-making systems poses a challenge for evaluating the fairness of algorithmic decisions, especially in the absence of ground truth. When designing interventions, uplift modeling is used extensively to identify candidates that are likely to benefit from treatment. However, these models remain particularly susceptible to fairness evaluation due to the lack of ground truth on the outcome measure since a candidate cannot be in both treatment and control simultaneously. In this article, we propose a framework that overcomes the missing ground truth problem by generating surrogates to serve as a proxy for counterfactual labels of uplift modeling campaigns. We then leverage the surrogate ground truth to conduct a more comprehensive binary fairness evaluation. We show how to apply the approach in a comprehensive study from a real-world marketing campaign for promotional offers and demonstrate its enhancement for fairness evaluation.  ( 2 min )
    Process mining for self-regulated learning assessment in e-learning
    arXiv:2403.12068v1 Announce Type: cross Abstract: Content assessment has broadly improved in e-learning scenarios in recent decades. However, the eLearning process can give rise to a spatial and temporal gap that poses interesting challenges for assessment of not only content, but also students' acquisition of core skills such as self-regulated learning. Our objective was to discover students' self-regulated learning processes during an eLearning course by using Process Mining Techniques. We applied a new algorithm in the educational domain called Inductive Miner over the interaction traces from 101 university students in a course given over one semester on the Moodle 2.0 platform. Data was extracted from the platform's event logs with 21629 traces in order to discover students' self-regulation models that contribute to improving the instructional process. The Inductive Miner algorithm discovered optimal models in terms of fitness for both Pass and Fail students in this dataset, as well as models at a certain level of granularity that can be interpreted in educational terms, which are the most important achievement in model discovery. We can conclude that although students who passed did not follow the instructors' suggestions exactly, they did follow the logic of a successful self-regulated learning process as opposed to their failing classmates. The Process Mining models also allow us to examine which specific actions the students performed, and it was particularly interesting to see a high presence of actions related to forum-supported collaborative learning in the Pass group and an absence of those in the Fail group.  ( 3 min )
    Ada-NAV: Adaptive Trajectory Length-Based Sample Efficient Policy Learning for Robotic Navigation
    arXiv:2306.06192v4 Announce Type: replace-cross Abstract: Trajectory length stands as a crucial hyperparameter within reinforcement learning (RL) algorithms, significantly contributing to the sample inefficiency in robotics applications. Motivated by the pivotal role trajectory length plays in the training process, we introduce Ada-NAV, a novel adaptive trajectory length scheme designed to enhance the training sample efficiency of RL algorithms in robotic navigation tasks. Unlike traditional approaches that treat trajectory length as a fixed hyperparameter, we propose to dynamically adjust it based on the entropy of the underlying navigation policy. Interestingly, Ada-NAV can be applied to both existing on-policy and off-policy RL methods, which we demonstrate by empirically validating its efficacy on three popular RL methods: REINFORCE, Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). We demonstrate through simulated and real-world robotic experiments that Ada-NAV outperforms conventional methods that employ constant or randomly sampled trajectory lengths. Specifically, for a fixed sample budget, Ada-NAV achieves an 18\% increase in navigation success rate, a 20-38\% reduction in navigation path length, and a 9.32\% decrease in elevation costs. Furthermore, we showcase the versatility of Ada-NAV by integrating it with the Clearpath Husky robot, illustrating its applicability in complex outdoor environments.  ( 3 min )
    Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
    arXiv:2305.16635v2 Announce Type: replace-cross Abstract: We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization, that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior works that rely on an extreme-scale teacher model (e.g., GPT3) or task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained / syntax-controlled paraphrase generation and sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes, even ChatGPT itself. Also, we find that our distilled dataset from 1.5B LMs exhibits higher diversity and fidelity than up to 13 times larger datasets.  ( 2 min )
    Toward a Theory of Causation for Interpreting Neural Code Models
    arXiv:2302.03788v2 Announce Type: replace-cross Abstract: Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.  ( 3 min )
    Deep Recurrent Learning Through Long Short Term Memory and TOPSIS
    arXiv:2301.00693v2 Announce Type: replace-cross Abstract: Enterprise resource planning (ERP) software brings resources, data together to keep software-flow within business processes in a company. However, cloud computing's cheap, easy and quick management promise pushes business-owners for a transition from monolithic to a data-center/cloud based ERP. Since cloud-ERP development involves a cyclic process, namely planning, implementing, testing and upgrading, its adoption is realized as a deep recurrent neural network problem. Eventually, a classification algorithm based on long short term memory (LSTM) and TOPSIS is proposed to identify and rank, respectively, adoption features. Our theoretical model is validated over a reference model by articulating key players, services, architecture, functionalities. Qualitative survey is conducted among users by considering technology, innovation and resistance issues, to formulate hypotheses on key adoption factors.  ( 2 min )
    Non-Coherent Over-the-Air Decentralized Gradient Descent
    arXiv:2211.10777v2 Announce Type: replace-cross Abstract: Decentralized Gradient Descent (DGD) is a popular algorithm used to solve decentralized optimization problems in diverse domains such as remote sensing, distributed inference, multi-agent coordination, and federated learning. Yet, executing DGD over wireless systems affected by noise, fading and limited bandwidth presents challenges, requiring scheduling of transmissions to mitigate interference and the acquisition of topology and channel state information -- complex tasks in wireless decentralized systems. This paper proposes a DGD algorithm tailored to wireless systems. Unlike existing approaches, it operates without inter-agent coordination, topology information, or channel state information. Its core is a Non-Coherent Over-The-Air (NCOTA) consensus scheme, exploiting a noisy energy superposition property of wireless channels. With a randomized transmission strategy to accommodate half-duplex operation, transmitters map local optimization signals to energy levels across subcarriers in an OFDM frame, and transmit concurrently without coordination. It is shown that received energies form a noisy consensus signal, whose fluctuations are mitigated via a consensus stepsize. NCOTA-DGD leverages the channel pathloss for consensus formation, without explicit knowledge of the mixing weights. It is shown that, for the class of strongly-convex problems, the expected squared distance between the local and globally optimum models vanishes with rate $\mathcal O(1/\sqrt{k})$ after $k$ iterations, with a proper design of decreasing stepsizes. Extensions address a broad class of fading models and frequency-selective channels. Numerical results on an image classification task depict faster convergence vis-\`a-vis running time than state-of-the-art schemes, especially in densely deployed networks.  ( 3 min )
    SFPDML: Securer and Faster Privacy-Preserving Distributed Machine Learning based on MKTFHE
    arXiv:2211.09353v2 Announce Type: replace-cross Abstract: In recent years, distributed machine learning has garnered significant attention. However, privacy continues to be an unresolved issue within this field. Multi-key homomorphic encryption over torus (MKTFHE) is one of the promising candidates for addressing this concern. Nevertheless, there may be security risks in the decryption of MKTFHE. Moreover, to our best known, the latest works about MKTFHE only support Boolean operation and linear operation which cannot directly compute the non-linear function like Sigmoid. Therefore, it is still hard to perform common machine learning such as logistic regression and neural networks in high performance. In this paper, we first discover a possible attack on the existing distributed decryption protocol for MKTFHE and subsequently introduce secret sharing to propose a securer one. Next, we design a new MKTFHE-friendly activation function via \emph{homogenizer} and \emph{compare quads}. Finally, we utilize them to implement logistic regression and neural network training in MKTFHE. Comparing the efficiency and accuracy between using Taylor polynomials of Sigmoid and our proposed function as an activation function, the experiments show that the efficiency of our function is 10 times higher than using 7-order Taylor polynomials straightly and the accuracy of the training model is similar to using a high-order polynomial as an activation function scheme.  ( 3 min )
    Towards Lossless ANN-SNN Conversion under Ultra-Low Latency with Dual-Phase Optimization
    arXiv:2205.07473v3 Announce Type: replace-cross Abstract: Spiking neural networks (SNNs) operating with asynchronous discrete events show higher energy efficiency with sparse computation. A popular approach for implementing deep SNNs is ANN-SNN conversion combining both efficient training of ANNs and efficient inference of SNNs. However, the accuracy loss is usually non-negligible, especially under a few time steps, which restricts the applications of SNN on latency-sensitive edge devices greatly. In this paper, we first identify that such performance degradation stems from the misrepresentation of the negative or overflow residual membrane potential in SNNs. Inspired by this, we decompose the conversion error into three parts: quantization error, clipping error, and residual membrane potential representation error. With such insights, we propose a two-stage conversion algorithm to minimize those errors respectively. Besides, We show each stage achieves significant performance gains in a complementary manner. By evaluating on challenging datasets including CIFAR-10, CIFAR- 100 and ImageNet, the proposed method demonstrates the state-of-the-art performance in terms of accuracy, latency and energy preservation. Furthermore, our method is evaluated using a more challenging object detection task, revealing notable gains in regression performance under ultra-low latency when compared to existing spike-based detection algorithms. Codes are available at https://github.com/Windere/snn-cvt-dual-phase.  ( 3 min )
    Unveiling Latent Causal Rules: A Temporal Point Process Approach for Abnormal Event Explanation
    arXiv:2402.05946v2 Announce Type: replace Abstract: In high-stakes systems such as healthcare, it is critical to understand the causal reasons behind unusual events, such as sudden changes in patient's health. Unveiling the causal reasons helps with quick diagnoses and precise treatment planning. In this paper, we propose an automated method for uncovering "if-then" logic rules to explain observational events. We introduce temporal point processes to model the events of interest, and discover the set of latent rules to explain the occurrence of events. To achieve this, we employ an Expectation-Maximization (EM) algorithm. In the E-step, we calculate the likelihood of each event being explained by each discovered rule. In the M-step, we update both the rule set and model parameters to enhance the likelihood function's lower bound. Notably, we optimize the rule set in a differential manner. Our approach demonstrates accurate performance in both discovering rules and identifying root causes. We showcase its promising results using synthetic and real healthcare datasets.  ( 2 min )
    Retrieve to Explain: Evidence-driven Predictions with Language Models
    arXiv:2402.04068v2 Announce Type: replace Abstract: Machine learning models, particularly language models, are notoriously difficult to introspect. Black-box models can mask both issues in model training and harmful biases. For human-in-the-loop processes, opaque predictions can drive lack of trust, limiting a model's impact even when it performs effectively. To address these issues, we introduce Retrieve to Explain (R2E). R2E is a retrieval-based language model that prioritizes amongst a pre-defined set of possible answers to a research question based on the evidence in a document corpus, using Shapley values to identify the relative importance of pieces of evidence to the final prediction. R2E can adapt to new evidence without retraining, and incorporate structured data through templating into natural language. We assess on the use case of drug target identification from published scientific literature, where we show that the model outperforms an industry-standard genetics-based approach on predicting clinical trial outcomes.  ( 2 min )
    Conformal Monte Carlo Meta-learners for Predictive Inference of Individual Treatment Effects
    arXiv:2402.04906v2 Announce Type: replace Abstract: Knowledge of the effect of interventions, called the treatment effect, is paramount for decision-making. Approaches to estimating this treatment effect, e.g. by using Conditional Average Treatment Effect (CATE) estimators, often only provide a point estimate of this treatment effect, while additional uncertainty quantification is frequently desired instead. Therefore, we present a novel method, the Conformal Monte Carlo (CMC) meta-learners, leveraging conformal predictive systems, Monte Carlo sampling, and CATE meta-learners, to instead produce a predictive distribution usable in individualized decision-making. Furthermore, we show how specific assumptions on the noise distribution of the outcome heavily affect these uncertainty predictions. Nonetheless, the CMC framework shows strong experimental coverage while retaining small interval widths to provide estimates of the true individual treatment effect.  ( 2 min )
    Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning
    arXiv:2402.02334v2 Announce Type: replace Abstract: Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer.  ( 2 min )
    Divide and not forget: Ensemble of selectively trained experts in Continual Learning
    arXiv:2401.10191v3 Announce Type: replace Abstract: Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know. A trend in this area is to use a mixture-of-expert technique, where different models work together to solve the task. However, the experts are usually trained all at once using whole task data, which makes them all prone to forgetting and increasing computational burden. To address this limitation, we introduce a novel approach named SEED. SEED selects only one, the most optimal expert for a considered task, and uses data from this task to fine-tune only this expert. For this purpose, each expert represents each class with a Gaussian distribution, and the optimal expert is selected based on the similarity of those distributions. Consequently, SEED increases diversity and heterogeneity within the experts while maintaining the high stability of this ensemble method. The extensive experiments demonstrate that SEED achieves state-of-the-art performance in exemplar-free settings across various scenarios, showing the potential of expert diversification through data in continual learning.  ( 3 min )
    Probabilistic Modeling for Sequences of Sets in Continuous-Time
    arXiv:2312.15045v3 Announce Type: replace Abstract: Neural marked temporal point processes have been a valuable addition to the existing toolbox of statistical parametric models for continuous-time event data. These models are useful for sequences where each event is associated with a single item (a single type of event or a "mark") -- but such models are not suited for the practical situation where each event is associated with a set of items. In this work, we develop a general framework for modeling set-valued data in continuous-time, compatible with any intensity-based recurrent neural point process model. In addition, we develop inference methods that can use such models to answer probabilistic queries such as "the probability of item $A$ being observed before item $B$," conditioned on sequence history. Computing exact answers for such queries is generally intractable for neural models due to both the continuous-time nature of the problem setting and the combinatorially-large space of potential outcomes for each event. To address this, we develop a class of importance sampling methods for querying with set-based sequences and demonstrate orders-of-magnitude improvements in efficiency over direct sampling via systematic experiments with four real-world datasets. We also illustrate how to use this framework to perform model selection using likelihoods that do not involve one-step-ahead prediction.  ( 3 min )
    Fast Cell Library Characterization for Design Technology Co-Optimization Based on Graph Neural Networks
    arXiv:2312.12784v4 Announce Type: replace Abstract: Design technology co-optimization (DTCO) plays a critical role in achieving optimal power, performance, and area (PPA) for advanced semiconductor process development. Cell library characterization is essential in DTCO flow, but traditional methods are time-consuming and costly. To overcome these challenges, we propose a graph neural network (GNN)-based machine learning model for rapid and accurate cell library characterization. Our model incorporates cell structures and demonstrates high prediction accuracy across various process-voltage-temperature (PVT) corners and technology parameters. Validation with 512 unseen technology corners and over one million test data points shows accurate predictions of delay, power, and input pin capacitance for 33 types of cells, with a mean absolute percentage error (MAPE) $\le$ 0.95% and a speed-up of 100X compared with SPICE simulations. Additionally, we investigate system-level metrics such as worst negative slack (WNS), leakage power, and dynamic power using predictions obtained from the GNN-based model on unseen corners. Our model achieves precise predictions, with absolute error $\le$3.0 ps for WNS, percentage errors $\le$0.60% for leakage power, and $\le$0.99% for dynamic power, when compared to golden reference. With the developed model, we further proposed a fine-grained drive strength interpolation methodology to enhance PPA for small-to-medium-scale designs, resulting in an approximate 1-3% improvement.  ( 3 min )
    Stronger Graph Transformer with Regularized Attention Scores
    arXiv:2312.11730v3 Announce Type: replace Abstract: Graph Neural Networks are notorious for its memory consumption. A recent Transformer-based GNN called Graph Transformer is shown to obtain superior performances when long range dependencies exist. However, combining graph data and Transformer architecture led to a combinationally worse memory issue. We propose a novel version of "edge regularization technique" that alleviates the need for Positional Encoding and ultimately alleviate GT's out of memory issue. We observe that it is not clear whether having an edge regularization on top of positional encoding is helpful. However, it seems evident that applying our edge regularization technique indeed stably improves GT's performance compared to GT without Positional Encoding.  ( 2 min )
    TrojFSP: Trojan Insertion in Few-shot Prompt Tuning
    arXiv:2312.10467v3 Announce Type: replace Abstract: Prompt tuning is one of the most effective solutions to adapting a fixed pre-trained language model (PLM) for various downstream tasks, especially with only a few input samples. However, the security issues, e.g., Trojan attacks, of prompt tuning on a few data samples are not well-studied. Transferring established data poisoning attacks directly to few-shot prompt tuning presents multiple challenges. One significant issue is the \textit{poisoned imbalance issue}, where non-target class samples are added to the target class, resulting in a greater number of target-class samples compared to non-target class. While this issue is not critical in regular tuning, it significantly hampers the few-shot prompt tuning, making it difficult to simultaneously achieve a high attack success rate (ASR) and maintain clean data accuracy (CDA). Additionally, few-shot prompting is prone to overfitting in terms of both ASR and CDA. In this paper, we introduce \textit{TrojFSP}, a method designed to address the challenges. To solve the poisoned imbalance issue, we develop a \textit{Target-Class Shrink (TC-Shrink)} technique, which aims to equalize the number of poisoning samples. To combat overfitting, we employ a \textit{Selective Token Poisoning} technique to boost attack performance. Furthermore, we introduce a \textit{Trojan-Trigger Attention} objective function to amplify the attention of the poisoned trojan prompt on triggers. Experiments show that our TrojFSP achieves an ASR of over 99\% while maintaining negligible decreases in CDA across various PLMs and datasets.  ( 3 min )
    Hypergraph-MLP: Learning on Hypergraphs without Message Passing
    arXiv:2312.09778v2 Announce Type: replace Abstract: Hypergraphs are vital in modelling data with higher-order relations containing more than two entities, gaining prominence in machine learning and signal processing. Many hypergraph neural networks leverage message passing over hypergraph structures to enhance node representation learning, yielding impressive performances in tasks like hypergraph node classification. However, these message-passing-based models face several challenges, including oversmoothing as well as high latency and sensitivity to structural perturbations at inference time. To tackle those challenges, we propose an alternative approach where we integrate the information about hypergraph structures into training supervision without explicit message passing, thus also removing the reliance on it at inference. Specifically, we introduce Hypergraph-MLP, a novel learning framework for hypergraph-structured data, where the learning model is a straightforward multilayer perceptron (MLP) supervised by a loss function based on a notion of signal smoothness on hypergraphs. Experiments on hypergraph node classification tasks demonstrate that Hypergraph-MLP achieves competitive performance compared to existing baselines, and is considerably faster and more robust against structural perturbations at inference.  ( 2 min )
    GeoShapley: A Game Theory Approach to Measuring Spatial Effects in Machine Learning Models
    arXiv:2312.03675v2 Announce Type: replace Abstract: This paper introduces GeoShapley, a game theory approach to measuring spatial effects in machine learning models. GeoShapley extends the Nobel Prize-winning Shapley value framework in game theory by conceptualizing location as a player in a model prediction game, which enables the quantification of the importance of location and the synergies between location and other features in a model. GeoShapley is a model-agnostic approach and can be applied to statistical or black-box machine learning models in various structures. The interpretation of GeoShapley is directly linked with spatially varying coefficient models for explaining spatial effects and additive models for explaining non-spatial effects. Using simulated data, GeoShapley values are validated against known data-generating processes and are used for cross-comparison of seven statistical and machine learning models. An empirical example of house price modeling is used to illustrate GeoShapley's utility and interpretation with real world data. The method is available as an open-source Python package named geoshapley.  ( 2 min )
    Learning Multi-Pattern Normalities in the Frequency Domain for Efficient Time Series Anomaly Detection
    arXiv:2311.16191v2 Announce Type: replace Abstract: Anomaly detection significantly enhances the robustness of cloud systems. While neural network-based methods have recently demonstrated strong advantages, they encounter practical challenges in cloud environments: the contradiction between the impracticality of maintaining a unique model for each service and the limited ability to deal with diverse normal patterns by a unified model, as well as issues with handling heavy traffic in real time and short-term anomaly detection sensitivity. Thus, we propose MACE, a multi-normal-pattern accommodated and efficient anomaly detection method in the frequency domain for time series anomaly detection. There are three novel characteristics of it: (i) a pattern extraction mechanism excelling at handling diverse normal patterns with a unified model, which enables the model to identify anomalies by examining the correlation between the data sample and its service normal pattern, instead of solely focusing on the data sample itself; (ii) a dualistic convolution mechanism that amplifies short-term anomalies in the time domain and hinders the reconstruction of anomalies in the frequency domain, which enlarges the reconstruction error disparity between anomaly and normality and facilitates anomaly detection; (iii) leveraging the sparsity and parallelism of frequency domain to enhance model efficiency. We theoretically and experimentally prove that using a strategically selected subset of Fourier bases can not only reduce computational overhead but is also profitable to distinguish anomalies, compared to using the complete spectrum. Moreover, extensive experiments demonstrate MACE's effectiveness in handling diverse normal patterns with a unified model and it achieves state-of-the-art performance with high efficiency.  ( 3 min )
    Accurate and interpretable drug-drug interaction prediction enabled by knowledge subgraph learning
    arXiv:2311.15056v2 Announce Type: replace Abstract: Background: Discovering potential drug-drug interactions (DDIs) is a long-standing challenge in clinical treatments and drug developments. Recently, deep learning techniques have been developed for DDI prediction. However, they generally require a huge number of samples, while known DDIs are rare. Methods: In this work, we present KnowDDI, a graph neural network-based method that addresses the above challenge. KnowDDI enhances drug representations by adaptively leveraging rich neighborhood information from large biomedical knowledge graphs. Then, it learns a knowledge subgraph for each drug-pair to interpret the predicted DDI, where each of the edges is associated with a connection strength indicating the importance of a known DDI or resembling strength between a drug-pair whose connection is unknown. Thus, the lack of DDIs is implicitly compensated by the enriched drug representations and propagated drug similarities. Results: We evaluate KnowDDI on two benchmark DDI datasets. Results show that KnowDDI obtains the state-of-the-art prediction performance with better interpretability. We also find that KnowDDI suffers less than existing works given a sparser knowledge graph. This indicates that the propagated drug similarities play a more important role in compensating for the lack of DDIs when the drug representations are less enriched. Conclusions: KnowDDI nicely combines the efficiency of deep learning techniques and the rich prior knowledge in biomedical knowledge graphs. As an original open-source tool, KnowDDI can help detect possible interactions in a broad range of relevant interaction prediction tasks, such as protein-protein interactions, drug-target interactions and disease-gene interactions, eventually promoting the development of biomedicine and healthcare.  ( 3 min )
    Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data
    arXiv:2311.04829v2 Announce Type: replace Abstract: Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, real-world data is often not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions. Then, we convert each GP into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications. We release the code of FunBaT at \url{https://github.com/xuangu-fang/Functional-Bayesian-Tucker-Decomposition}.  ( 3 min )
    Solving High Frequency and Multi-Scale PDEs with Gaussian Processes
    arXiv:2311.04465v2 Announce Type: replace Abstract: Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example, physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a student $t$ mixture or Gaussian mixture. We apply the inverse Fourier transform to obtain the covariance function (by Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. Next, we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to promote computational efficiency and scalability, without low-rank approximations. We show the advantage of our method in systematic experiments. The code is released at \url{https://github.com/xuangu-fang/Gaussian-Process-Slover-for-High-Freq-PDE}.  ( 3 min )
    One-Shot Strategic Classification Under Unknown Costs
    arXiv:2311.02761v2 Announce Type: replace Abstract: The goal of strategic classification is to learn decision rules which are robust to strategic input manipulation. Earlier works assume that these responses are known; while some recent works handle unknown responses, they exclusively study online settings with repeated model deployments. But there are many domains$\unicode{x2014}$particularly in public policy, a common motivating use case$\unicode{x2014}$where multiple deployments are infeasible, or where even one bad round is unacceptable. To address this gap, we initiate the formal study of one-shot strategic classification under unknown responses, which requires committing to a single classifier once. Focusing on uncertainty in the users' cost function, we begin by proving that for a broad class of costs, even a small mis-estimation of the true cost can entail trivial accuracy in the worst case. In light of this, we frame the task as a minimax problem, with the goal of identifying the classifier with the smallest worst-case risk over an uncertainty set of possible costs. We design efficient algorithms for both the full-batch and stochastic settings, which we prove converge (offline) to the minimax solution at the dimension-independent rate of $\tilde{\mathcal{O}}(T^{-\frac{1}{2}})$. Our theoretical analysis reveals important structure stemming from strategic responses, particularly the value of dual norm regularization with respect to the cost function.  ( 3 min )
    TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications
    arXiv:2311.02971v2 Announce Type: replace Abstract: We introduce TabRepo, a new dataset of tabular model evaluations and predictions. TabRepo contains the predictions and metrics of 1310 models evaluated on 200 classification and regression datasets. We illustrate the benefit of our dataset in multiple ways. First, we show that it allows to perform analysis such as comparing Hyperparameter Optimization against current AutoML systems while also considering ensembling at marginal cost by using precomputed model predictions. Second, we show that our dataset can be readily leveraged to perform transfer-learning. In particular, we show that applying standard transfer-learning techniques allows to outperform current state-of-the-art tabular systems in accuracy, runtime and latency.  ( 2 min )
    Learning Successor Features with Distributed Hebbian Temporal Memory
    arXiv:2310.13391v2 Announce Type: replace Abstract: This paper presents a novel approach to address the challenge of online temporal memory learning for decision-making under uncertainty in non-stationary, partially observable environments. The proposed algorithm, Distributed Hebbian Temporal Memory (DHTM), is based on factor graph formalism and a multicomponent neuron model. DHTM aims to capture sequential data relationships and make cumulative predictions about future observations, forming Successor Features (SF). Inspired by neurophysiological models of the neocortex, the algorithm utilizes distributed representations, sparse transition matrices, and local Hebbian-like learning rules to overcome the instability and slow learning process of traditional temporal memory algorithms like RNN and HMM. Experimental results demonstrate that DHTM outperforms LSTM and a biologically inspired HMM-like algorithm, CSCG, in the case of non-stationary datasets. Our findings suggest that DHTM is a promising approach for addressing the challenges of online sequence learning and planning in dynamic environments.  ( 2 min )
    The Relative Gaussian Mechanism and its Application to Private Gradient Descent
    arXiv:2308.15250v2 Announce Type: replace Abstract: The Gaussian Mechanism (GM), which consists in adding Gaussian noise to a vector-valued query before releasing it, is a standard privacy protection mechanism. In particular, given that the query respects some L2 sensitivity property (the L2 distance between outputs on any two neighboring inputs is bounded), GM guarantees R\'enyi Differential Privacy (RDP). Unfortunately, precisely bounding the L2 sensitivity can be hard, thus leading to loose privacy bounds. In this work, we consider a Relative L2 sensitivity assumption, in which the bound on the distance between two query outputs may also depend on their norm. Leveraging this assumption, we introduce the Relative Gaussian Mechanism (RGM), in which the variance of the noise depends on the norm of the output. We prove tight bounds on the RDP parameters under relative L2 sensitivity, and characterize the privacy loss incurred by using output-dependent noise. In particular, we show that RGM naturally adapts to a latent variable that would control the norm of the output. Finally, we instantiate our framework to show tight guarantees for Private Gradient Descent, a problem that naturally fits our relative L2 sensitivity assumption.  ( 2 min )
    Language Modeling Is Compression
    arXiv:2309.10668v2 Announce Type: replace Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.  ( 2 min )
    A Pre-trained Data Deduplication Model based on Active Learning
    arXiv:2308.00721v2 Announce Type: replace Abstract: In the era of big data, the issue of data quality has become increasingly prominent. One of the main challenges is the problem of duplicate data, which can arise from repeated entry or the merging of multiple data sources. These "dirty data" problems can significantly limit the effective application of big data. To address the issue of data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work that utilizes active learning to address the problem of deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve the deduplication problem as a sequence to classification task, which firstly integrate the transformer with active learning into an end-to-end architecture to select the most valuable data for deduplication model training, and also firstly employ the R-Drop method to perform data augmentation on each round of labeled data, which can reduce the cost of manual labeling and improve the model's performance. Experimental results demonstrate that our proposed model outperforms previous state-of-the-art (SOTA) for deduplicated data identification, achieving up to a 28% improvement in Recall score on benchmark datasets.  ( 2 min )
    FedDRL: A Trustworthy Federated Learning Model Fusion Method Based on Staged Reinforcement Learning
    arXiv:2307.13716v4 Announce Type: replace Abstract: Traditional federated learning uses the number of samples to calculate the weights of each client model and uses this fixed weight value to fusion the global model. However, in practical scenarios, each client's device and data heterogeneity leads to differences in the quality of each client's model. Thus the contribution to the global model is not wholly determined by the sample size. In addition, if clients intentionally upload low-quality or malicious models, using these models for aggregation will lead to a severe decrease in global model accuracy. Traditional federated learning algorithms do not address these issues. To solve this probelm, we propose FedDRL, a model fusion approach using reinforcement learning based on a two staged approach. In the first stage, Our method could filter out malicious models and selects trusted client models to participate in the model fusion. In the second stage, the FedDRL algorithm adaptively adjusts the weights of the trusted client models and aggregates the optimal global model. We also define five model fusion scenarios and compare our method with two baseline algorithms in those scenarios. The experimental results show that our algorithm has higher reliability than other algorithms while maintaining accuracy.  ( 3 min )
    Information decomposition in complex systems via machine learning
    arXiv:2307.04755v2 Announce Type: replace Abstract: One of the fundamental steps toward understanding a complex system is identifying variation at the scale of the system's components that is most relevant to behavior on a macroscopic scale. Mutual information provides a natural means of linking variation across scales of a system due to its independence of functional relationship between observables. However, characterizing the manner in which information is distributed across a set of observables is computationally challenging and generally infeasible beyond a handful of measurements. Here we propose a practical and general methodology that uses machine learning to decompose the information contained in a set of measurements by jointly optimizing a lossy compression of each measurement. Guided by the distributed information bottleneck as a learning objective, the information decomposition identifies the variation in the measurements of the system state most relevant to specified macroscale behavior. We focus our analysis on two paradigmatic complex systems: a Boolean circuit and an amorphous material undergoing plastic deformation. In both examples, the large amount of entropy of the system state is decomposed, bit by bit, in terms of what is most related to macroscale behavior. The identification of meaningful variation in data, with the full generality brought by information theory, is made practical for studying the connection between micro- and macroscale structure in complex systems.  ( 3 min )
    Layer-wise Linear Mode Connectivity
    arXiv:2307.06966v3 Announce Type: replace Abstract: Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models. It is most prominently used in federated learning. If models are averaged at the end of training, this can only lead to a good performing model if the loss surface of interest is very particular, i.e., the loss in the midpoint between the two models needs to be sufficiently low. This is impossible to guarantee for the non-convex losses of state-of-the-art networks. For averaging models trained on vastly different datasets, it was proposed to average only the parameters of particular layers or combinations of layers, resulting in better performing models. To get a better understanding of the effect of layer-wise averaging, we analyse the performance of the models that result from averaging single layers, or groups of layers. Based on our empirical and theoretical investigation, we introduce a novel notion of the layer-wise linear connectivity, and show that deep networks do not have layer-wise barriers between them.  ( 2 min )
    Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML
    arXiv:2306.05109v4 Announce Type: replace Abstract: Medical applications of machine learning (ML) have experienced a surge in popularity in recent years. The intensive care unit (ICU) is a natural habitat for ML given the abundance of available data from electronic health records. Models have been proposed to address numerous ICU prediction tasks like the early detection of complications. While authors frequently report state-of-the-art performance, it is challenging to verify claims of superiority. Datasets and code are not always published, and cohort definitions, preprocessing pipelines, and training setups are difficult to reproduce. This work introduces Yet Another ICU Benchmark (YAIB), a modular framework that allows researchers to define reproducible and comparable clinical ML experiments; we offer an end-to-end solution from cohort definition to model evaluation. The framework natively supports most open-access ICU datasets (MIMIC III/IV, eICU, HiRID, AUMCdb) and is easily adaptable to future ICU datasets. Combined with a transparent preprocessing pipeline and extensible training code for multiple ML and deep learning models, YAIB enables unified model development. Our benchmark comes with five predefined established prediction tasks (mortality, acute kidney injury, sepsis, kidney function, and length of stay) developed in collaboration with clinicians. Adding further tasks is straightforward by design. Using YAIB, we demonstrate that the choice of dataset, cohort definition, and preprocessing have a major impact on the prediction performance - often more so than model class - indicating an urgent need for YAIB as a holistic benchmarking tool. We provide our work to the clinical ML community to accelerate method development and enable real-world clinical implementations. Software Repository: https://github.com/rvandewater/YAIB.  ( 3 min )
    Adversarial Training Should Be Cast as a Non-Zero-Sum Game
    arXiv:2306.11035v2 Announce Type: replace Abstract: One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting.  ( 2 min )
    Cross or Wait? Predicting Pedestrian Interaction Outcomes at Unsignalized Crossings
    arXiv:2304.08260v2 Announce Type: replace Abstract: Predicting pedestrian behavior when interacting with vehicles is one of the most critical challenges in the field of automated driving. Pedestrian crossing behavior is influenced by various interaction factors, including time to arrival, pedestrian waiting time, the presence of zebra crossing, and the properties and personality traits of both pedestrians and drivers. However, these factors have not been fully explored for use in predicting interaction outcomes. In this paper, we use machine learning to predict pedestrian crossing behavior including pedestrian crossing decision, crossing initiation time (CIT), and crossing duration (CD) when interacting with vehicles at unsignalized crossings. Distributed simulator data are utilized for predicting and analyzing the interaction factors. Compared with the logistic regression baseline model, our proposed neural network model improves the prediction accuracy and F1 score by 4.46% and 3.23%, respectively. Our model also reduces the root mean squared error (RMSE) for CIT and CD by 21.56% and 30.14% compared with the linear regression model. Additionally, we have analyzed the importance of interaction factors, and present the results of models using fewer factors. This provides information for model selection in different scenarios with limited input features.  ( 3 min )
    The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold
    arXiv:2305.01604v3 Announce Type: replace Abstract: We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures, sizes, trained using different optimization methods, regularization techniques, data augmentation techniques, and weight initializations lie on the same manifold in the prediction space. We study the details of this manifold to find that networks with different architectures follow distinguishable trajectories but other factors have a minimal influence; larger networks train along a similar manifold as that of smaller networks, just faster; and networks initialized at very different parts of the prediction space converge to the solution along a similar manifold.  ( 2 min )
    On student-teacher deviations in distillation: does it pay to disobey?
    arXiv:2301.12923v3 Announce Type: replace Abstract: Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo than the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. Finally, we tie these two observations together: we demonstrate that the exaggerated bias of KD can simultaneously result in both (a) the exaggeration of confidence and (b) the improved generalization of the student, thus offering a resolution to the apparent paradox. Our analysis brings existing theory and practice closer by considering the role of gradient descent in KD and by demonstrating the exaggerated bias effect in both theoretical and empirical settings.  ( 3 min )
    Behave-XAI: Deep Explainable Learning of Behavioral Representational Data
    arXiv:2301.00016v2 Announce Type: replace Abstract: According to the latest trend of artificial intelligence, AI-systems needs to clarify regarding general,specific decisions,services provided by it. Only consumer is satisfied, with explanation , for example, why any classification result is the outcome of any given time. This actually motivates us using explainable or human understandable AI for a behavioral mining scenario, where users engagement on digital platform is determined from context, such as emotion, activity, weather, etc. However, the output of AI-system is not always systematically correct, and often systematically correct, but apparently not-perfect and thereby creating confusions, such as, why the decision is given? What is the reason underneath? In this context, we first formulate the behavioral mining problem in deep convolutional neural network architecture. Eventually, we apply a recursive neural network due to the presence of time-series data from users physiological and environmental sensor-readings. Once the model is developed, explanations are presented with the advent of XAI models in front of users. This critical step involves extensive trial with users preference on explanations over conventional AI, judgement of credibility of explanation.  ( 2 min )
    When Layers Play the Lottery, all Tickets Win at Initialization
    arXiv:2301.10835v2 Announce Type: replace Abstract: Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network exists sparse subnetworks (tickets) able to achieve similar accuracy (i.e., win the lottery - winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization from the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Leveraged by this observation, we propose to discover these winning tickets at initialization, eliminating the requirement of heavy computational resources for training the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce up to 51% of carbon emission, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization while tickets from filter removal (the standard structured LTH) hardly become winning tickets.  ( 3 min )
    Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach
    arXiv:2210.12624v2 Announce Type: replace Abstract: Machine learning problems with multiple objective functions appear either in learning with multiple criteria where learning has to make a trade-off between multiple performance metrics such as fairness, safety and accuracy; or, in multi-task learning where multiple tasks are optimized jointly, sharing inductive bias between them. This problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and its variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased noisy gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it can guarantee convergence without increasing the batch size even in the non-convex setting. Simulations on multi-task supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.  ( 2 min )
    Copula Conformal Prediction for Multi-step Time Series Forecasting
    arXiv:2212.03281v4 Announce Type: replace Abstract: Accurate uncertainty measurement is a key step to building robust and reliable machine learning systems. Conformal prediction is a distribution-free uncertainty quantification algorithm popular for its ease of implementation, statistical coverage guarantees, and versatility for underlying forecasters. However, existing conformal prediction algorithms for time series are limited to single-step prediction without considering the temporal dependency. In this paper, we propose a Copula Conformal Prediction algorithm for multivariate, multi-step Time Series forecasting, CopulaCPTS. We prove that CopulaCPTS has finite sample validity guarantee. On several synthetic and real-world multivariate time series datasets, we show that CopulaCPTS produces more calibrated and sharp confidence intervals for multi-step prediction tasks than existing techniques.  ( 2 min )
    LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
    arXiv:2403.12968v1 Announce Type: cross Abstract: This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.  ( 3 min )
    TexTile: A Differentiable Metric for Texture Tileability
    arXiv:2403.12961v1 Announce Type: cross Abstract: We introduce TexTile, a novel differentiable metric to quantify the degree upon which a texture image can be concatenated with itself without introducing repeating artifacts (i.e., the tileability). Existing methods for tileable texture synthesis focus on general texture quality, but lack explicit analysis of the intrinsic repeatability properties of a texture. In contrast, our TexTile metric effectively evaluates the tileable properties of a texture, opening the door to more informed synthesis and analysis of tileable textures. Under the hood, TexTile is formulated as a binary classifier carefully built from a large dataset of textures of different styles, semantics, regularities, and human annotations.Key to our method is a set of architectural modifications to baseline pre-train image classifiers to overcome their shortcomings at measuring tileability, along with a custom data augmentation and training regime aimed at increasing robustness and accuracy. We demonstrate that TexTile can be plugged into different state-of-the-art texture synthesis methods, including diffusion-based strategies, and generate tileable textures while keeping or even improving the overall texture quality. Furthermore, we show that TexTile can objectively evaluate any tileable texture synthesis method, whereas the current mix of existing metrics produces uncorrelated scores which heavily hinders progress in the field.  ( 2 min )
    Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts
    arXiv:2403.12918v1 Announce Type: cross Abstract: Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen subnetwork on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on a suboptimal criteria for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of task-specific weight and pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets.  ( 2 min )
    Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
    arXiv:2403.12952v1 Announce Type: cross Abstract: Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes due to domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, a pioneering approach designed to adapt VLMs to test datasets using unlabeled test inputs. Our method is based on the notion of modulating per-class prototypes in the shared embedding space. By pre-computing and caching prototypes generated with the pre-trained text encoder, TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering. At test-time, TPS dynamically learns shift vectors for each prototype based solely on the given test sample, effectively bridging the domain gap and enhancing classification accuracy. A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods. Extensive evaluations across 15 datasets involving natural distribution shifts and cross-dataset generalization demonstrate TPS's superior performance, achieving state-of-the-art results while reducing resource requirements.  ( 2 min )
    Yell At Your Robot: Improving On-the-Fly from Language Corrections
    arXiv:2403.12910v1 Announce Type: cross Abstract: Hierarchical policies that combine language and low-level control have been shown to perform impressively long-horizon robotic tasks, by leveraging either zero-shot high-level planners like pretrained language and vision-language models (LLMs/VLMs) or models trained on annotated robotic demonstrations. However, for complex and dexterous skills, attaining high success rates on long-horizon tasks still represents a major challenge -- the longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this paper, we make the following observation: high-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements ("move a bit to the left"), can be effectively incorporated into high-level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions. This framework enables robots not only to rapidly adapt to real-time language feedback, but also incorporate this feedback into an iterative training scheme that improves the high-level policy's ability to correct errors in both low-level execution and high-level decision-making purely from verbal feedback. Our evaluation on real hardware shows that this leads to significant performance improvement in long-horizon, dexterous manipulation tasks without the need for any additional teleoperation. Videos and code are available at https://yay-robot.github.io/.  ( 3 min )
    Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model Inference
    arXiv:2403.12900v1 Announce Type: cross Abstract: The rapid advancement of Generative Artificial Intelligence (GenAI) across diverse sectors raises significant environmental concerns, notably the carbon emissions from their cloud and high performance computing (HPC) infrastructure. This paper presents Sprout, an innovative framework designed to address these concerns by reducing the carbon footprint of generative Large Language Model (LLM) inference services. Sprout leverages the innovative concept of "generation directives" to guide the autoregressive generation process, thereby enhancing carbon efficiency. Our proposed method meticulously balances the need for ecological sustainability with the demand for high-quality generation outcomes. Employing a directive optimizer for the strategic assignment of generation directives to user prompts and an original offline quality evaluator, Sprout demonstrates a significant reduction in carbon emissions by over 40% in real-world evaluations using the Llama2 LLM and global electricity grid data. This research marks a critical step toward aligning AI technology with sustainable practices, highlighting the potential for mitigating environmental impacts in the rapidly expanding domain of generative artificial intelligence.  ( 2 min )
    Primal Methods for Variational Inequality Problems with Functional Constraints
    arXiv:2403.12859v1 Announce Type: cross Abstract: Constrained variational inequality problems are recognized for their broad applications across various fields including machine learning and operations research. First-order methods have emerged as the standard approach for solving these problems due to their simplicity and scalability. However, they typically rely on projection or linear minimization oracles to navigate the feasible set, which becomes computationally expensive in practical scenarios featuring multiple functional constraints. Existing efforts to tackle such functional constrained variational inequality problems have centered on primal-dual algorithms grounded in the Lagrangian function. These algorithms along with their theoretical analysis often require the existence and prior knowledge of the optimal Lagrange multipliers. In this work, we propose a simple primal method, termed Constrained Gradient Method (CGM), for addressing functional constrained variational inequality problems, without necessitating any information on the optimal Lagrange multipliers. We establish a non-asymptotic convergence analysis of the algorithm for variational inequality problems with monotone operators under smooth constraints. Remarkably, our algorithms match the complexity of projection-based methods in terms of operator queries for both monotone and strongly monotone settings, while utilizing significantly cheaper oracles based on quadratic programming. Furthermore, we provide several numerical examples to evaluate the efficacy of our algorithms.  ( 2 min )
    A Physics-embedded Deep Learning Framework for Cloth Simulation
    arXiv:2403.12820v1 Announce Type: cross Abstract: Delicate cloth simulations have long been desired in computer graphics. Various methods were proposed to improve engaged force interactions, collision handling, and numerical integrations. Deep learning has the potential to achieve fast and real-time simulation, but common neural network structures often demand many parameters to capture cloth dynamics. This paper proposes a physics-embedded learning framework that directly encodes physical features of cloth simulation. The convolutional neural network is used to represent spatial correlations of the mass-spring system, after which three branches are designed to learn linear, nonlinear, and time derivate features of cloth physics. The framework can also integrate with other external forces and collision handling through either traditional simulators or sub neural networks. The model is tested across different cloth animation cases, without training with new data. Agreement with baselines and predictive realism successfully validate its generalization ability. Inference efficiency of the proposed model also defeats traditional physics simulation. This framework is also designed to easily integrate with other visual refinement techniques like wrinkle carving, which leaves significant chances to incorporate prevailing macing learning techniques in 3D cloth amination.  ( 2 min )
    Tighter Confidence Bounds for Sequential Kernel Regression
    arXiv:2403.12732v1 Announce Type: cross Abstract: Confidence bounds are an essential tool for rigorously quantifying the uncertainty of predictions. In this capacity, they can inform the exploration-exploitation trade-off and form a core component in many sequential learning and decision-making algorithms. Tighter confidence bounds give rise to algorithms with better empirical performance and better performance guarantees. In this work, we use martingale tail bounds and finite-dimensional reformulations of infinite-dimensional convex programs to establish new confidence bounds for sequential kernel regression. We prove that our new confidence bounds are always tighter than existing ones in this setting. We apply our confidence bounds to the kernel bandit problem, where future actions depend on the previous history. When our confidence bounds replace existing ones, the KernelUCB (GP-UCB) algorithm has better empirical performance, a matching worst-case performance guarantee and comparable computational cost. Our new confidence bounds can be used as a generic tool to design improved algorithms for other kernelised learning and decision-making problems.  ( 2 min )
    Addressing Source Scale Bias via Image Warping for Domain Adaptation
    arXiv:2403.12712v1 Announce Type: cross Abstract: In visual recognition, scale bias is a key challenge due to the imbalance of object and image size distribution inherent in real scene datasets. Conventional solutions involve injecting scale invariance priors, oversampling the dataset at different scales during training, or adjusting scale at inference. While these strategies mitigate scale bias to some extent, their ability to adapt across diverse datasets is limited. Besides, they increase computational load during training and latency during inference. In this work, we use adaptive attentional processing -- oversampling salient object regions by warping images in-place during training. Discovering that shifting the source scale distribution improves backbone features, we developed a instance-level warping guidance aimed at object region sampling to mitigate source scale bias in domain adaptation. Our approach improves adaptation across geographies, lighting and weather conditions, is agnostic to the task, domain adaptation algorithm, saliency guidance, and underlying model architecture. Highlights include +6.1 mAP50 for BDD100K Clear $\rightarrow$ DENSE Foggy, +3.7 mAP50 for BDD100K Day $\rightarrow$ Night, +3.0 mAP50 for BDD100K Clear $\rightarrow$ Rainy, and +6.3 mIoU for Cityscapes $\rightarrow$ ACDC. Our approach adds minimal memory during training and has no additional latency at inference time. Please see Appendix for more results and analysis.  ( 2 min )
    Posterior Uncertainty Quantification in Neural Networks using Data Augmentation
    arXiv:2403.12729v1 Announce Type: cross Abstract: In this paper, we approach the problem of uncertainty quantification in deep learning through a predictive framework, which captures uncertainty in model parameters by specifying our assumptions about the predictive distribution of unseen future data. Under this view, we show that deep ensembling (Lakshminarayanan et al., 2017) is a fundamentally mis-specified model class, since it assumes that future data are supported on existing observations only -- a situation rarely encountered in practice. To address this limitation, we propose MixupMP, a method that constructs a more realistic predictive distribution using popular data augmentation techniques. MixupMP operates as a drop-in replacement for deep ensembles, where each ensemble member is trained on a random simulation from this predictive distribution. Grounded in the recently-proposed framework of Martingale posteriors (Fong et al., 2023), MixupMP returns samples from an implicitly defined Bayesian posterior. Our empirical analysis showcases that MixupMP achieves superior predictive performance and uncertainty quantification on various image classification datasets, when compared with existing Bayesian and non-Bayesian approaches.  ( 2 min )
    Federated Semi-supervised Learning for Medical Image Segmentation with intra-client and inter-client Consistency
    arXiv:2403.12695v1 Announce Type: cross Abstract: Medical image segmentation plays a vital role in clinic disease diagnosis and medical image analysis. However, labeling medical images for segmentation task is tough due to the indispensable domain expertise of radiologists. Furthermore, considering the privacy and sensitivity of medical images, it is impractical to build a centralized segmentation dataset from different medical institutions. Federated learning aims to train a shared model of isolated clients without local data exchange which aligns well with the scarcity and privacy characteristics of medical data. To solve the problem of labeling hard, many advanced semi-supervised methods have been proposed in a centralized data setting. As for federated learning, how to conduct semi-supervised learning under this distributed scenario is worth investigating. In this work, we propose a novel federated semi-supervised learning framework for medical image segmentation. The intra-client and inter-client consistency learning are introduced to smooth predictions at the data level and avoid confirmation bias of local models. They are achieved with the assistance of a Variational Autoencoder (VAE) trained collaboratively by clients. The added VAE model plays three roles: 1) extracting latent low-dimensional features of all labeled and unlabeled data; 2) performing a novel type of data augmentation in calculating intra-client consistency loss; 3) utilizing the generative ability of itself to conduct inter-client consistency distillation. The proposed framework is compared with other federated semi-supervised or self-supervised learning methods. The experimental results illustrate that our method outperforms the state-of-the-art method while avoiding a lot of computation and communication overhead.  ( 3 min )
    Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition
    arXiv:2403.12710v1 Announce Type: cross Abstract: Concerns for the privacy of individuals captured in public imagery have led to privacy-preserving action recognition. Existing approaches often suffer from issues arising through obfuscation being applied globally and a lack of interpretability. Global obfuscation hides privacy sensitive regions, but also contextual regions important for action recognition. Lack of interpretability erodes trust in these new technologies. We highlight the limitations of current paradigms and propose a solution: Human selected privacy templates that yield interpretability by design, an obfuscation scheme that selectively hides attributes and also induces temporal consistency, which is important in action recognition. Our approach is architecture agnostic and directly modifies input imagery, while existing approaches generally require architecture training. Our approach offers more flexibility, as no retraining is required, and outperforms alternatives on three widely used datasets.  ( 2 min )
    Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision
    arXiv:2403.12687v1 Announce Type: cross Abstract: This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition. We propose a novel audio-visual method for compound expression recognition. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions regarding the prediction of compound expressions are based on predefined rules. Notably, our method does not use any training data specific to the target task. The method is evaluated in multi-corpus training and cross-corpus validation setups. Our findings from the challenge demonstrate that the proposed method can potentially form a basis for development of intelligent tools for annotating audio-visual data in the context of human's basic and compound emotions. The source code is publicly available.  ( 2 min )
    Zeolite Adsorption Property Prediction using Deep Learning
    arXiv:2403.12659v1 Announce Type: cross Abstract: The ability to efficiently predict adsorption properties of zeolites can be of large benefit in accelerating the design process of novel materials. The existing configuration space for these materials is wide, while existing molecular simulation methods are computationally expensive. In this work, we propose a model which is 4 to 5 orders of magnitude faster at adsorption properties compared to molecular simulations. To validate the model, we generated datasets containing various aluminium configurations for the MOR, MFI, RHO and ITW zeolites along with their heat of adsorptions and Henry coefficients for CO$_2$, obtained from Monte Carlo simulations. The predictions obtained from the Machine Learning model are in agreement with the values obtained from the Monte Carlo simulations, confirming that the model can be used for property prediction. Furthermore, we show that the model can be used for identifying adsorption sites. Finally, we evaluate the capability of our model for generating novel zeolite configurations by using it in combination with a genetic algorithm.  ( 2 min )
    Adaptive Multilevel Neural Networks for Parametric PDEs with Error Estimation
    arXiv:2403.12650v1 Announce Type: cross Abstract: To solve high-dimensional parameter-dependent partial differential equations (pPDEs), a neural network architecture is presented. It is constructed to map parameters of the model data to corresponding finite element solutions. To improve training efficiency and to enable control of the approximation error, the network mimics an adaptive finite element method (AFEM). It outputs a coarse grid solution and a series of corrections as produced in an AFEM, allowing a tracking of the error decay over successive layers of the network. The observed errors are measured by a reliable residual based a posteriori error estimator, enabling the reduction to only few parameters for the approximation in the output of the network. This leads to a problem adapted representation of the solution on locally refined grids. Furthermore, each solution of the AFEM is discretized in a hierarchical basis. For the architecture, convolutional neural networks (CNNs) are chosen. The hierarchical basis then allows to handle sparse images for finely discretized meshes. Additionally, as corrections on finer levels decrease in amplitude, i.e., importance for the overall approximation, the accuracy of the network approximation is allowed to decrease successively. This can either be incorporated in the number of generated high fidelity samples used for training or the size of the network components responsible for the fine grid outputs. The architecture is described and preliminary numerical examples are presented.  ( 3 min )
    Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance
    arXiv:2403.12599v1 Announce Type: cross Abstract: Rental assistance programs provide individuals with financial assistance to prevent housing instabilities caused by evictions and avert homelessness. Since these programs operate under resource constraints, they must decide who to prioritize. Typically, funding is distributed by a reactive or first-come-first serve allocation process that does not systematically consider risk of future homelessness. We partnered with Allegheny County, PA to explore a proactive allocation approach that prioritizes individuals facing eviction based on their risk of future homelessness. Our ML system that uses state and county administrative data to accurately identify individuals in need of support outperforms simpler prioritization approaches by at least 20% while being fair and equitable across race and gender. Furthermore, our approach would identify 28% of individuals who are overlooked by the current process and end up homeless. Beyond improvements to the rental assistance program in Allegheny County, this study can inform the development of evidence-based decision support tools in similar contexts, including lessons about data needs, model design, evaluation, and field validation.  ( 2 min )
    Confidence Self-Calibration for Multi-Label Class-Incremental Learning
    arXiv:2403.12559v1 Announce Type: cross Abstract: The partial label challenge in Multi-Label Class-Incremental Learning (MLCIL) arises when only the new classes are labeled during training, while past and future labels remain unavailable. This issue leads to a proliferation of false-positive errors due to erroneously high confidence multi-label predictions, exacerbating catastrophic forgetting within the disjoint label space. In this paper, we aim to refine multi-label confidence calibration in MLCIL and propose a Confidence Self-Calibration (CSC) approach. Firstly, for label relationship calibration, we introduce a class-incremental graph convolutional network that bridges the isolated label spaces by constructing learnable, dynamically extended label relationship graph. Then, for confidence calibration, we present a max-entropy regularization for each multi-label increment, facilitating confidence self-calibration through the penalization of over-confident output distributions. Our approach attains new state-of-the-art results in MLCIL tasks on both MS-COCO and PASCAL VOC datasets, with the calibration of label confidences confirmed through our methodology.  ( 2 min )
    Machine Learning of the Prime Distribution
    arXiv:2403.12588v1 Announce Type: cross Abstract: In the present work we use maximum entropy methods to derive several theorems in probabilistic number theory, including a version of the Hardy-Ramanujan Theorem. We also provide a theoretical argument explaining the experimental observations of Y.-H. He about the learnability of primes, and posit that the Erd\H{o}s-Kac law would very unlikely be discovered by current machine learning techniques. Numerical experiments that we perform corroborate our theoretical findings.  ( 2 min )
    Generalized Consistency Trajectory Models for Image Manipulation
    arXiv:2403.12510v1 Announce Type: cross Abstract: Diffusion-based generative models excel in unconditional generation, as well as on applied tasks such as image editing and restoration. The success of diffusion models lies in the iterative nature of diffusion: diffusion breaks down the complex process of mapping noise to data into a sequence of simple denoising tasks. Moreover, we are able to exert fine-grained control over the generation process by injecting guidance terms into each denoising step. However, the iterative process is also computationally intensive, often taking from tens up to thousands of function evaluations. Although consistency trajectory models (CTMs) enable traversal between any time points along the probability flow ODE (PFODE) and score inference with a single function evaluation, CTMs only allow translation from Gaussian noise to data. Thus, this work aims to unlock the full potential of CTMs by proposing generalized CTMs (GCTMs), which translate between arbitrary distributions via ODEs. We discuss the design space of GCTMs and demonstrate their efficacy in various image manipulation tasks such as image-to-image translation, restoration, and editing. Code: \url{https://github.com/1202kbs/GCTM}  ( 2 min )
    Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices
    arXiv:2403.12503v1 Announce Type: cross Abstract: Large language models (LLMs) have significantly transformed the landscape of Natural Language Processing (NLP). Their impact extends across a diverse spectrum of tasks, revolutionizing how we approach language understanding and generations. Nevertheless, alongside their remarkable utility, LLMs introduce critical security and risk considerations. These challenges warrant careful examination to ensure responsible deployment and safeguard against potential vulnerabilities. This research paper thoroughly investigates security and privacy concerns related to LLMs from five thematic perspectives: security and privacy concerns, vulnerabilities against adversarial attacks, potential harms caused by misuses of LLMs, mitigation strategies to address these challenges while identifying limitations of current strategies. Lastly, the paper recommends promising avenues for future research to enhance the security and risk management of LLMs.  ( 2 min )
    When Do "More Contexts" Help with Sarcasm Recognition?
    arXiv:2403.12469v1 Announce Type: cross Abstract: Sarcasm recognition is challenging because it needs an understanding of the true intention, which is opposite to or different from the literal meaning of the words. Prior work has addressed this challenge by developing a series of methods that provide richer $contexts$, e.g., sentiment or cultural nuances, to models. While shown to be effective individually, no study has systematically evaluated their collective effectiveness. As a result, it remains unclear to what extent additional contexts can improve sarcasm recognition. In this work, we explore the improvements that existing methods bring by incorporating more contexts into a model. To this end, we develop a framework where we can integrate multiple contextual cues and test different approaches. In evaluation with four approaches on three sarcasm recognition benchmarks, we achieve existing state-of-the-art performances and also demonstrate the benefits of sequentially adding more contexts. We also identify inherent drawbacks of using more contexts, highlighting that in the pursuit of even better results, the model may need to adopt societal biases.  ( 2 min )
    A Trainable Feature Extractor Module for Deep Neural Networks and Scanpath Classification
    arXiv:2403.12493v1 Announce Type: cross Abstract: Scanpath classification is an area in eye tracking research with possible applications in medicine, manufacturing as well as training systems for students in various domains. In this paper we propose a trainable feature extraction module for deep neural networks. The purpose of this module is to transform a scanpath into a feature vector which is directly useable for the deep neural network architecture. Based on the backpropagated error of the deep neural network, the feature extraction module adapts its parameters to improve the classification performance. Therefore, our feature extraction module is jointly trainable with the deep neural network. The motivation to this feature extraction module is based on classical histogram-based approaches which usually compute distributions over a scanpath. We evaluated our module on three public datasets and compared it to the state of the art approaches.  ( 2 min )
    TransformMix: Learning Transformation and Mixing Strategies from Data
    arXiv:2403.12429v1 Announce Type: cross Abstract: Data augmentation improves the generalization power of deep learning models by synthesizing more training samples. Sample-mixing is a popular data augmentation approach that creates additional data by combining existing samples. Recent sample-mixing methods, like Mixup and Cutmix, adopt simple mixing operations to blend multiple inputs. Although such a heuristic approach shows certain performance gains in some computer vision tasks, it mixes the images blindly and does not adapt to different datasets automatically. A mixing strategy that is effective for a particular dataset does not often generalize well to other datasets. If not properly configured, the methods may create misleading mixed images, which jeopardize the effectiveness of sample-mixing augmentations. In this work, we propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data. In particular, TransformMix applies learned transformations and mixing masks to create compelling mixed images that contain correct and important information for the target tasks. We demonstrate the effectiveness of TransformMix on multiple datasets in transfer learning, classification, object detection, and knowledge distillation settings. Experimental results show that our method achieves better performance as well as efficiency when compared with strong sample-mixing baselines.  ( 2 min )
    On Predictive planning and counterfactual learning in active inference
    arXiv:2403.12417v1 Announce Type: cross Abstract: Given the rapid advancement of artificial intelligence, understanding the foundations of intelligent behaviour is increasingly important. Active inference, regarded as a general theory of behaviour, offers a principled approach to probing the basis of sophistication in planning and decision-making. In this paper, we examine two decision-making schemes in active inference based on 'planning' and 'learning from experience'. Furthermore, we also introduce a mixed model that navigates the data-complexity trade-off between these strategies, leveraging the strengths of both to facilitate balanced decision-making. We evaluate our proposed model in a challenging grid-world scenario that requires adaptability from the agent. Additionally, our model provides the opportunity to analyze the evolution of various parameters, offering valuable insights and contributing to an explainable framework for intelligent decision-making.  ( 2 min )
    FedFisher: Leveraging Fisher Information for One-Shot Federated Learning
    arXiv:2403.12329v1 Announce Type: new Abstract: Standard federated learning (FL) algorithms typically require multiple rounds of communication between the server and the clients, which has several drawbacks, including requiring constant network connectivity, repeated investment of computational resources, and susceptibility to privacy attacks. One-Shot FL is a new paradigm that aims to address this challenge by enabling the server to train a global model in a single round of communication. In this work, we present FedFisher, a novel algorithm for one-shot FL that makes use of Fisher information matrices computed on local client models, motivated by a Bayesian perspective of FL. First, we theoretically analyze FedFisher for two-layer over-parameterized ReLU neural networks and show that the error of our one-shot FedFisher global model becomes vanishingly small as the width of the neural networks and amount of local training at clients increases. Next, we propose practical variants of FedFisher using the diagonal Fisher and K-FAC approximation for the full Fisher and highlight their communication and compute efficiency for FL. Finally, we conduct extensive experiments on various datasets, which show that these variants of FedFisher consistently improve over competing baselines.  ( 2 min )
    Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion
    arXiv:2403.12950v1 Announce Type: new Abstract: In dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined in terms of its suboptimality to a winner arm. The more challenging and practically motivated non-stationary variant of dueling bandits, where preferences change over time, has been the focus of several recent works (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023). The goal is to design algorithms without foreknowledge of the amount of change. The bulk of known results here studies the Condorcet winner setting, where an arm preferred over any other exists at all times. Yet, such a winner may not exist and, to contrast, the Borda version of this problem (which is always well-defined) has received little attention. In this work, we establish the first optimal and adaptive Borda dynamic regret upper bound, which highlights fundamental differences in the learnability of severe non-stationarity between Condorcet vs. Borda regret objectives in dueling bandits. Surprisingly, our techniques for non-stationary Borda dueling bandits also yield improved rates within the Condorcet winner setting, and reveal new preference models where tighter notions of non-stationarity are adaptively learnable. This is accomplished through a novel generalized Borda score framework which unites the Borda and Condorcet problems, thus allowing reduction of Condorcet regret to a Borda-like task. Such a generalization was not previously known and is likely to be of independent interest.  ( 3 min )
    Policy Bifurcation in Safe Reinforcement Learning
    arXiv:2403.12847v1 Announce Type: new Abstract: Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume continuity in policy functions, where policies map states to actions in a smooth, uninterrupted manner; however, our research finds that in some scenarios, the feasible policy should be discontinuous or multi-valued, interpolating between discontinuous local optima can inevitably lead to constraint violations. We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of policy bifurcation in safe RL, which corresponds to the contractibility of the reachable tuple. Our theorem reveals that in scenarios where the obstacle-free state space is non-simply connected, a feasible policy is required to be bifurcated, meaning its output action needs to change abruptly in response to the varying state. To train such a bifurcated policy, we propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output. The bifurcated behavior can be achieved by selecting the Gaussian component with the highest mixing coefficient. Besides, MUPO also integrates spectral normalization and forward KL divergence to enhance the policy's capability of exploring different modes. Experiments with vehicle control tasks show that our algorithm successfully learns the bifurcated policy and ensures satisfying safety, while a continuous policy suffers from inevitable constraint violations.  ( 3 min )
    A Comparison of Deep Learning Architectures for Spacecraft Anomaly Detection
    arXiv:2403.12864v1 Announce Type: new Abstract: Spacecraft operations are highly critical, demanding impeccable reliability and safety. Ensuring the optimal performance of a spacecraft requires the early detection and mitigation of anomalies, which could otherwise result in unit or mission failures. With the advent of deep learning, a surge of interest has been seen in leveraging these sophisticated algorithms for anomaly detection in space operations. This study aims to compare the efficacy of various deep learning architectures in detecting anomalies in spacecraft data. The deep learning models under investigation include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformer-based architectures. Each of these models was trained and validated using a comprehensive dataset sourced from multiple spacecraft missions, encompassing diverse operational scenarios and anomaly types. Initial results indicate that while CNNs excel in identifying spatial patterns and may be effective for some classes of spacecraft data, LSTMs and RNNs show a marked proficiency in capturing temporal anomalies seen in time-series spacecraft telemetry. The Transformer-based architectures, given their ability to focus on both local and global contexts, have showcased promising results, especially in scenarios where anomalies are subtle and span over longer durations. Additionally, considerations such as computational efficiency, ease of deployment, and real-time processing capabilities were evaluated. While CNNs and LSTMs demonstrated a balance between accuracy and computational demands, Transformer architectures, though highly accurate, require significant computational resources. In conclusion, the choice of deep learning architecture for spacecraft anomaly detection is highly contingent on the nature of the data, the type of anomalies, and operational constraints.  ( 3 min )
    Adapting SAM for Volumetric X-Ray Data-sets of Arbitrary Sizes
    arXiv:2403.12066v1 Announce Type: cross Abstract: Objective: We propose a new approach for volumetric instance segmentation in X-ray Computed Tomography (CT) data for Non-Destructive Testing (NDT) by combining the Segment Anything Model (SAM) with tile-based Flood Filling Networks (FFN). Our work evaluates the performance of SAM on volumetric NDT data-sets and demonstrates its effectiveness to segment instances in challenging imaging scenarios. Methods: We implemented and evaluated techniques to extend the image-based SAM algorithm fo the use with volumetric data-sets, enabling the segmentation of three-dimensional objects using FFN's spatially adaptability. The tile-based approach for SAM leverages FFN's capabilities to segment objects of any size. We also explore the use of dense prompts to guide SAM in combining segmented tiles for improved segmentation accuracy. Results: Our research indicates the potential of combining SAM with FFN for volumetric instance segmentation tasks, particularly in NDT scenarios and segmenting large entities and objects. Conclusion: While acknowledging remaining limitations, our study provides insights and establishes a foundation for advancements in instance segmentation in NDT scenarios.  ( 2 min )
    On Safety in Safe Bayesian Optimization
    arXiv:2403.12948v1 Announce Type: new Abstract: Optimizing an unknown function under safety constraints is a central task in robotics, biomedical engineering, and many other disciplines, and increasingly safe Bayesian Optimization (BO) is used for this. Due to the safety critical nature of these applications, it is of utmost importance that theoretical safety guarantees for these algorithms translate into the real world. In this work, we investigate three safety-related issues of the popular class of SafeOpt-type algorithms. First, these algorithms critically rely on frequentist uncertainty bounds for Gaussian Process (GP) regression, but concrete implementations typically utilize heuristics that invalidate all safety guarantees. We provide a detailed analysis of this problem and introduce Real-\b{eta}-SafeOpt, a variant of the SafeOpt algorithm that leverages recent GP bounds and thus retains all theoretical guarantees. Second, we identify assuming an upper bound on the reproducing kernel Hilbert space (RKHS) norm of the target function, a key technical assumption in SafeOpt-like algorithms, as a central obstacle to real-world usage. To overcome this challenge, we introduce the Lipschitz-only Safe Bayesian Optimization (LoSBO) algorithm, which guarantees safety without an assumption on the RKHS bound, and empirically show that this algorithm is not only safe, but also exhibits superior performance compared to the state-of-the-art on several function classes. Third, SafeOpt and derived algorithms rely on a discrete search space, making them difficult to apply to higher-dimensional problems. To widen the applicability of these algorithms, we introduce Lipschitz-only GP-UCB (LoS-GP-UCB), a variant of LoSBO applicable to moderately high-dimensional problems, while retaining safety.  ( 3 min )
    Consistency Models Improve Diffusion Inverse Solvers
    arXiv:2403.12063v1 Announce Type: cross Abstract: Diffusion inverse solvers (DIS) aim to find an image $x$ that lives on the diffusion prior while satisfying the constraint $f(x) = y$, given an operator $f(.)$ and measurement $y$. Most non-linear DIS use posterior mean $\hat{x}_{0|t}=\mathbb{E}[x_0|x_t]$ to evaluate $f(.)$ and minimize the distance $||f(\hat{x}_{0|t})-y||^2$. Previous works show that posterior mean-based distance is biased; instead, posterior sample $x_{0|t}\sim p_{\theta}(x_0|x_t)$ promises a better candidate. In this paper, we first clarify when is posterior sample better: $1)$ When $f(.)$ is linear, the distance with posterior mean is as good as single posterior sample, thus preferable as it does not require Monte Carlo; $2)$ When $f(.)$ is non-linear, the distance using posterior sample is better. As previous approximations to posterior sample do not look like a real image, we propose to use consistency model (CM) as a high quality approximation. In addition, we propose a new family of DIS using pure CM. Empirically, we show that replacing posterior mean by CM improves DIS performance on non-linear $f(.)$ (e.g. semantic segmentation, image captioning). Further, our pure CM inversion works well for both linear and non-linear $f(.)$.  ( 2 min )
    Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes
    arXiv:2403.12946v1 Announce Type: new Abstract: In offline reinforcement learning (RL), the absence of active exploration calls for attention on the model robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action space, this paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs) with an uncertainty set characterized by the total variation distance using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which outperforms prior art by at least $\tilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-designed variance estimator.  ( 2 min )
    Neural Differential Algebraic Equations
    arXiv:2403.12938v1 Announce Type: new Abstract: Differential-Algebraic Equations (DAEs) describe the temporal evolution of systems that obey both differential and algebraic constraints. Of particular interest are systems that contain implicit relationships between their components, such as conservation relationships. Here, we present Neural Differential-Algebraic Equations (NDAEs) suitable for data-driven modeling of DAEs. This methodology is built upon the concept of the Universal Differential Equation; that is, a model constructed as a system of Neural Ordinary Differential Equations informed by theory from particular science domains. In this work, we show that the proposed NDAEs abstraction is suitable for relevant system-theoretic data-driven modeling tasks. Presented examples include (i) the inverse problem of tank-manifold dynamics and (ii) discrepancy modeling of a network of pumps, tanks, and pipes. Our experiments demonstrate the proposed method's robustness to noise and extrapolation ability to (i) learn the behaviors of the system components and their interaction physics and (ii) disambiguate between data trends and mechanistic relationships contained in the system.  ( 2 min )
    Mobile Application for Oral Disease Detection using Federated Learning
    arXiv:2403.12044v1 Announce Type: cross Abstract: The mouth, often regarded as a window to the internal state of the body, plays an important role in reflecting one's overall health. Poor oral hygiene has far-reaching consequences, contributing to severe conditions like heart disease, cancer, and diabetes, while inadequate care leads to discomfort, pain, and costly treatments. Federated Learning (FL) for object detection can be utilized for this use case due to the sensitivity of the oral image data of the patients. FL ensures data privacy by storing the images used for object detection on the local device and trains the model on the edge. The updated weights are federated to a central server where all the collected weights are updated via The Federated Averaging algorithm. Finally, we have developed a mobile app named OralH which provides user-friendly solutions, allowing people to conduct self-assessments through mouth scans and providing quick oral health insights. Upon detection of the issues, the application alerts the user about potential oral health concerns or diseases and provides details about dental clinics in the user's locality. Designed as a Progressive Web Application (PWA), the platform ensures ubiquitous access, catering to users across devices for a seamless experience. The application aims to provide state-of-the-art segmentation and detection techniques, leveraging the YOLOv8 object detection model to identify oral hygiene issues and diseases. This study deals with the benefits of leveraging FL in healthcare with promising real-world results.  ( 3 min )
    LNPT: Label-free Network Pruning and Training
    arXiv:2403.12690v1 Announce Type: new Abstract: Pruning before training enables the deployment of neural networks on smart devices. By retaining weights conducive to generalization, pruned networks can be accommodated on resource-constrained smart devices. It is commonly held that the distance on weight norms between the initialized and the fully-trained networks correlates with generalization performance. However, as we have uncovered, inconsistency between this metric and generalization during training processes, which poses an obstacle to determine the pruned structures on smart devices in advance. In this paper, we introduce the concept of the learning gap, emphasizing its accurate correlation with generalization. Experiments show that the learning gap, in the form of feature maps from the penultimate layer of networks, aligns with variations of generalization performance. We propose a novel learning framework, LNPT, which enables mature networks on the cloud to provide online guidance for network pruning and learning on smart devices with unlabeled data. Our results demonstrate the superiority of this approach over supervised training.  ( 2 min )
    On the Effectiveness of Heterogeneous Ensemble Methods for Re-identification
    arXiv:2403.12606v1 Announce Type: new Abstract: In this contribution, we introduce a novel ensemble method for the re-identification of industrial entities, using images of chipwood pallets and galvanized metal plates as dataset examples. Our algorithms replace commonly used, complex siamese neural networks with an ensemble of simplified, rudimentary models, providing wider applicability, especially in hardware-restricted scenarios. Each ensemble sub-model uses different types of extracted features of the given data as its input, allowing for the creation of effective ensembles in a fraction of the training duration needed for more complex state-of-the-art models. We reach state-of-the-art performance at our task, with a Rank-1 accuracy of over 77% and a Rank-10 accuracy of over 99%, and introduce five distinct feature extraction approaches, and study their combination using different ensemble methods.  ( 2 min )
    Short-Term Solar Irradiance Forecasting Under Data Transmission Constraints
    arXiv:2403.12873v1 Announce Type: new Abstract: We report a data-parsimonious machine learning model for short-term forecasting of solar irradiance. The model inputs include sky camera images that are reduced to scalar features to meet data transmission constraints. The output irradiance values are transformed to focus on unknown short-term dynamics. Inspired by control theory, a noise input is used to reflect unmeasured variables and is shown to improve model predictions, often considerably. Five years of data from the NREL Solar Radiation Research Laboratory were used to create three rolling train-validate sets and determine the best representations for time, the optimal span of input measurements, and the most impactful model input data (features). For the chosen test data, the model achieves a mean absolute error of 74.34 $W/m^2$ compared to a baseline 134.35 $W/m^2$ using the persistence of cloudiness model.  ( 2 min )
    Has Approximate Machine Unlearning been evaluated properly? From Auditing to Side Effects
    arXiv:2403.12830v1 Announce Type: new Abstract: The growing concerns surrounding data privacy and security have underscored the critical necessity for machine unlearning--aimed at fully removing data lineage from machine learning models. MLaaS providers expect this to be their ultimate safeguard for regulatory compliance. Despite its critical importance, the pace at which privacy communities have been developing and implementing strong methods to verify the effectiveness of machine unlearning has been disappointingly slow, with this vital area often receiving insufficient focus. This paper seeks to address this shortfall by introducing well-defined and effective metrics for black-box unlearning auditing tasks. We transform the auditing challenge into a question of non-membership inference and develop efficient metrics for auditing. By relying exclusively on the original and unlearned models--eliminating the need to train additional shadow models--our approach simplifies the evaluation of unlearning at the individual data point level. Utilizing these metrics, we conduct an in-depth analysis of current approximate machine unlearning algorithms, identifying three key directions where these approaches fall short: utility, resilience, and equity. Our aim is that this work will greatly improve our understanding of approximate machine unlearning methods, taking a significant stride towards converting the theoretical right to data erasure into a auditable reality.  ( 2 min )
    Equivariant Ensembles and Regularization for Reinforcement Learning in Map-based Path Planning
    arXiv:2403.12856v1 Announce Type: new Abstract: In reinforcement learning (RL), exploiting environmental symmetries can significantly enhance efficiency, robustness, and performance. However, ensuring that the deep RL policy and value networks are respectively equivariant and invariant to exploit these symmetries is a substantial challenge. Related works try to design networks that are equivariant and invariant by construction, limiting them to a very restricted library of components, which in turn hampers the expressiveness of the networks. This paper proposes a method to construct equivariant policies and invariant value functions without specialized neural network components, which we term equivariant ensembles. We further add a regularization term for adding inductive bias during training. In a map-based path planning case study, we show how equivariant ensembles and regularization benefit sample efficiency and performance.  ( 2 min )
    Wildfire danger prediction optimization with transfer learning
    arXiv:2403.12871v1 Announce Type: new Abstract: Convolutional Neural Networks (CNNs) have proven instrumental across various computer science domains, enabling advancements in object detection, classification, and anomaly detection. This paper explores the application of CNNs to analyze geospatial data specifically for identifying wildfire-affected areas. Leveraging transfer learning techniques, we fine-tuned CNN hyperparameters and integrated the Canadian Fire Weather Index (FWI) to assess moisture conditions. The study establishes a methodology for computing wildfire risk levels on a scale of 0 to 5, dynamically linked to weather patterns. Notably, through the integration of transfer learning, the CNN model achieved an impressive accuracy of 95\% in identifying burnt areas. This research sheds light on the inner workings of CNNs and their practical, real-time utility in predicting and mitigating wildfires. By combining transfer learning and CNNs, this study contributes a robust approach to assess burnt areas, facilitating timely interventions and preventative measures against conflagrations.  ( 2 min )
    Neural Parameter Regression for Explicit Representations of PDE Solution Operators
    arXiv:2403.12764v1 Announce Type: new Abstract: We introduce Neural Parameter Regression (NPR), a novel framework specifically developed for learning solution operators in Partial Differential Equations (PDEs). Tailored for operator learning, this approach surpasses traditional DeepONets (Lu et al., 2021) by employing Physics-Informed Neural Network (PINN, Raissi et al., 2019) techniques to regress Neural Network (NN) parameters. By parametrizing each solution based on specific initial conditions, it effectively approximates a mapping between function spaces. Our method enhances parameter efficiency by incorporating low-rank matrices, thereby boosting computational efficiency and scalability. The framework shows remarkable adaptability to new initial and boundary conditions, allowing for rapid fine-tuning and inference, even in cases of out-of-distribution examples.  ( 2 min )
    Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport
    arXiv:2403.12887v1 Announce Type: new Abstract: We study the convergence of gradient flow for the training of deep neural networks. If Residual Neural Networks are a popular example of very deep architectures, their training constitutes a challenging optimization problem due notably to the non-convexity and the non-coercivity of the objective. Yet, in applications, those tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a ``mean-field'' model of infinitely deep and arbitrarily wide ResNet, parameterized by probability measures over the product set of layers and parameters and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have proven to benefit from simplified loss-landscapes and good theoretical guarantees when trained with gradient flow for the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional Optimal Transport distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak-\L{}ojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges towards a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets.  ( 3 min )
    SUN Team's Contribution to ABAW 2024 Competition: Audio-visual Valence-Arousal Estimation and Expression Recognition
    arXiv:2403.12609v1 Announce Type: new Abstract: As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention in the last two decades. While multimodal systems enjoy high performances on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely 'in-the-wild' data. This work investigates audiovisual deep learning approaches for emotion recognition in-the-wild problem. We particularly explore the effectiveness of architectures based on fine-tuned Convolutional Neural Networks (CNN) and Public Dimensional Emotion Model (PDEM), for video and audio modality, respectively. We compare alternative temporal modeling and fusion strategies using the embeddings from these multi-stage trained modality-specific Deep Neural Networks (DNN). We report results on the AffWild2 dataset under Affective Behavior Analysis in-the-Wild 2024 (ABAW'24) challenge protocol.  ( 2 min )
    Prompt-fused framework for Inductive Logical Query Answering
    arXiv:2403.12646v1 Announce Type: new Abstract: Answering logical queries on knowledge graphs (KG) poses a significant challenge for machine reasoning. The primary obstacle in this task stems from the inherent incompleteness of KGs. Existing research has predominantly focused on addressing the issue of missing edges in KGs, thereby neglecting another aspect of incompleteness: the emergence of new entities. Furthermore, most of the existing methods tend to reason over each logical operator separately, rather than comprehensively analyzing the query as a whole during the reasoning process. In this paper, we propose a query-aware prompt-fused framework named Pro-QE, which could incorporate existing query embedding methods and address the embedding of emerging entities through contextual information aggregation. Additionally, a query prompt, which is generated by encoding the symbolic query, is introduced to gather information relevant to the query from a holistic perspective. To evaluate the efficacy of our model in the inductive setting, we introduce two new challenging benchmarks. Experimental results demonstrate that our model successfully handles the issue of unseen entities in logical queries. Furthermore, the ablation study confirms the efficacy of the aggregator and prompt components.  ( 2 min )
    MELTing point: Mobile Evaluation of Language Transformers
    arXiv:2403.12844v1 Announce Type: new Abstract: Transformers have revolutionized the machine learning landscape, gradually making their way into everyday tasks and equipping our computers with ``sparks of intelligence''. However, their runtime requirements have prevented them from being broadly deployed on mobile. As personal devices become increasingly powerful and prompt privacy becomes an ever more pressing issue, we explore the current state of mobile execution of Large Language Models (LLMs). To achieve this, we have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device, supporting different models, devices and frameworks, including Android, iOS and Nvidia Jetson devices. We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance, tracing their memory and energy requirements along the way. Our analysis is the first systematic study of on-device LLM execution, quantifying performance, energy efficiency and accuracy across various state-of-the-art models and showcases the state of on-device intelligence in the era of hyperscale models. Results highlight the performance heterogeneity across targets and corroborates that LLM inference is largely memory-bound. Quantization drastically reduces memory requirements and renders execution viable, but at a non-negligible accuracy cost. Drawing from its energy footprint and thermal behavior, the continuous execution of LLMs remains elusive, as both factors negatively affect user experience. Last, our experience shows that the ecosystem is still in its infancy, and algorithmic as well as hardware breakthroughs can significantly shift the execution cost. We expect NPU acceleration, and framework-hardware co-design to be the biggest bet towards efficient standalone execution, with the alternative of offloading tailored towards edge deployments.  ( 3 min )
    Pretraining Codomain Attention Neural Operators for Solving Multiphysics PDEs
    arXiv:2403.12553v1 Announce Type: new Abstract: Existing neural operator architectures face challenges when solving multiphysics problems with coupled partial differential equations (PDEs), due to complex geometries, interactions between physical variables, and the lack of large amounts of high-resolution training data. To address these issues, we propose Codomain Attention Neural Operator (CoDA-NO), which tokenizes functions along the codomain or channel space, enabling self-supervised learning or pretraining of multiple PDE systems. Specifically, we extend positional encoding, self-attention, and normalization layers to the function space. CoDA-NO can learn representations of different PDE systems with a single model. We evaluate CoDA-NO's potential as a backbone for learning multiphysics PDEs over multiple systems by considering few-shot learning settings. On complex downstream tasks with limited data, such as fluid flow simulations and fluid-structure interactions, we found CoDA-NO to outperform existing methods on the few-shot learning task by over $36\%$. The code is available at https://github.com/ashiq24/CoDA-NO.  ( 2 min )
    FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
    arXiv:2403.12821v1 Announce Type: new Abstract: The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution. Thus, considerable efforts have been made to quickly and accurately estimate the performances of neural architectures, without full training or evaluation, for given tasks and datasets. Neural architecture encoding has played a crucial role in the estimation, and graphbased methods, which treat an architecture as a graph, have shown prominent performance. For enhanced representation learning of neural architectures, we introduce FlowerFormer, a powerful graph transformer that incorporates the information flows within a neural architecture. FlowerFormer consists of two key components: (a) bidirectional asynchronous message passing, inspired by the flows; (b) global attention built on flow-based masking. Our extensive experiments demonstrate the superiority of FlowerFormer over existing neural encoding methods, and its effectiveness extends beyond computer vision models to include graph neural networks and auto speech recognition models. Our code is available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.  ( 2 min )
    Dynamic Survival Analysis for Early Event Prediction
    arXiv:2403.12818v1 Announce Type: new Abstract: This study advances Early Event Prediction (EEP) in healthcare through Dynamic Survival Analysis (DSA), offering a novel approach by integrating risk localization into alarm policies to enhance clinical event metrics. By adapting and evaluating DSA models against traditional EEP benchmarks, our research demonstrates their ability to match EEP models on a time-step level and significantly improve event-level metrics through a new alarm prioritization scheme (up to 11% AuPRC difference). This approach represents a significant step forward in predictive healthcare, providing a more nuanced and actionable framework for early event prediction and management.  ( 2 min )
    A Practical Guide to Statistical Distances for Evaluating Generative Models in Science
    arXiv:2403.12636v1 Announce Type: new Abstract: Generative models are invaluable in many fields of science because of their ability to capture high-dimensional and complicated distributions, such as photo-realistic images, protein structures, and connectomes. How do we evaluate the samples these models generate? This work aims to provide an accessible entry point to understanding popular notions of statistical distances, requiring only foundational knowledge in mathematics and statistics. We focus on four commonly used notions of statistical distances representing different methodologies: Using low-dimensional projections (Sliced-Wasserstein; SW), obtaining a distance using classifiers (Classifier Two-Sample Tests; C2ST), using embeddings through kernels (Maximum Mean Discrepancy; MMD), or neural networks (Fr\'echet Inception Distance; FID). We highlight the intuition behind each distance and explain their merits, scalability, complexity, and pitfalls. To demonstrate how these distances are used in practice, we evaluate generative models from different scientific domains, namely a model of decision making and a model generating medical images. We showcase that distinct distances can give different results on similar data. Through this guide, we aim to help researchers to use, interpret, and evaluate statistical distances for generative models in science.  ( 3 min )
    Automated Contrastive Learning Strategy Search for Time Series
    arXiv:2403.12641v1 Announce Type: new Abstract: In recent years, Contrastive Learning (CL) has become a predominant representation learning paradigm for time series. Most existing methods in the literature focus on manually building specific Contrastive Learning Strategies (CLS) by human heuristics for certain datasets and tasks. However, manually developing CLS usually require excessive prior knowledge about the datasets and tasks, e.g., professional cognition of the medical time series in healthcare, as well as huge human labor and massive experiments to determine the detailed learning configurations. In this paper, we present an Automated Machine Learning (AutoML) practice at Microsoft, which automatically learns to contrastively learn representations for various time series datasets and tasks, namely Automated Contrastive Learning (AutoCL). We first construct a principled universal search space of size over 3x1012, covering data augmentation, embedding transformation, contrastive pair construction and contrastive losses. Further, we introduce an efficient reinforcement learning algorithm, which optimizes CLS from the performance on the validation tasks, to obtain more effective CLS within the space. Experimental results on various real-world tasks and datasets demonstrate that AutoCL could automatically find the suitable CLS for a given dataset and task. From the candidate CLS found by AutoCL on several public datasets/tasks, we compose a transferable Generally Good Strategy (GGS), which has a strong performance for other datasets. We also provide empirical analysis as a guidance for future design of CLS.  ( 3 min )
    Forward Gradient-Based Frank-Wolfe Optimization for Memory Efficient Deep Neural Network Training
    arXiv:2403.12511v1 Announce Type: new Abstract: Training a deep neural network using gradient-based methods necessitates the calculation of gradients at each level. However, using backpropagation or reverse mode differentiation, to calculate the gradients necessities significant memory consumption, rendering backpropagation an inefficient method for computing gradients. This paper focuses on analyzing the performance of the well-known Frank-Wolfe algorithm, a.k.a. conditional gradient algorithm by having access to the forward mode of automatic differentiation to compute gradients. We provide in-depth technical details that show the proposed Algorithm does converge to the optimal solution with a sub-linear rate of convergence by having access to the noisy estimate of the true gradient obtained in the forward mode of automated differentiation, referred to as the Projected Forward Gradient. In contrast, the standard Frank-Wolfe algorithm, when provided with access to the Projected Forward Gradient, fails to converge to the optimal solution. We demonstrate the convergence attributes of our proposed algorithms using a numerical example.  ( 2 min )
    SEVEN: Pruning Transformer Model by Reserving Sentinels
    arXiv:2403.12688v1 Announce Type: new Abstract: Large-scale Transformer models (TM) have demonstrated outstanding performance across various tasks. However, their considerable parameter size restricts their applicability, particularly on mobile devices. Due to the dynamic and intricate nature of gradients on TM compared to Convolutional Neural Networks, commonly used pruning methods tend to retain weights with larger gradient noise. This results in pruned models that are sensitive to sparsity and datasets, exhibiting suboptimal performance. Symbolic Descent (SD) is a general approach for training and fine-tuning TM. In this paper, we attempt to describe the noisy batch gradient sequences on TM through the cumulative process of SD. We utilize this design to dynamically assess the importance scores of weights.SEVEN is introduced by us, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise. These weights are tended to be preserved by SEVEN. Extensive experiments on various TM in natural language, question-answering, and image classification domains are conducted to validate the effectiveness of SEVEN. The results demonstrate significant improvements of SEVEN in multiple pruning scenarios and across different sparsity levels. Additionally, SEVEN exhibits robust performance under various fine-tuning strategies. The code is publicly available at https://github.com/xiaojinying/SEVEN.  ( 2 min )
    AffineQuant: Affine Transformation Quantization for Large Language Models
    arXiv:2403.12544v1 Announce Type: new Abstract: The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly minimizing quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. To illustrate, we attain a C4 perplexity of 15.76 (2.26 lower vs 18.02 in OmniQuant) on the LLaMA2-7B model of W4A4 quantization without overhead. On zero-shot tasks, AffineQuant achieves an average of 58.61 accuracy (1.98 lower vs 56.63 in OmniQuant) when using 4/4-bit quantization for LLaMA-30B, which setting a new state-of-the-art benchmark for PTQ in LLMs.  ( 3 min )
    Improving Interpretability of Scores in Anomaly Detection Based on Gaussian-Bernoulli Restricted Boltzmann Machine
    arXiv:2403.12672v1 Announce Type: new Abstract: Gaussian-Bernoulli restricted Boltzmann machines (GBRBMs) are often used for semi-supervised anomaly detection, where they are trained using only normal data points. In GBRBM-based anomaly detection, normal and anomalous data are classified based on a score that is identical to an energy function of the marginal GBRBM. However, the classification threshold is difficult to set to an appropriate value, as this score cannot be interpreted. In this study, we propose a measure that improves score's interpretability based on its cumulative distribution, and establish a guideline for setting the threshold using the interpretable measure. The results of numerical experiments show that the guideline is reasonable when setting the threshold solely using normal data points. Moreover, because identifying the measure involves computationally infeasible evaluation of the minimum score value, we also propose an evaluation method for the minimum score based on simulated annealing, which is widely used for optimization problems. The proposed evaluation method was also validated using numerical experiments.  ( 2 min )
    Non-negative Contrastive Learning
    arXiv:2403.12459v1 Announce Type: new Abstract: Deep representations have shown promising performance when transferred to downstream tasks in a black-box manner. Yet, their inherent lack of interpretability remains a significant challenge, as these features are often opaque to human understanding. In this paper, we propose Non-negative Contrastive Learning (NCL), a renaissance of Non-negative Matrix Factorization (NMF) aimed at deriving interpretable features. The power of NCL lies in its enforcement of non-negativity constraints on features, reminiscent of NMF's capability to extract features that align closely with sample clusters. NCL not only aligns mathematically well with an NMF objective but also preserves NMF's interpretability attributes, resulting in a more sparse and disentangled representation compared to standard contrastive learning (CL). Theoretically, we establish guarantees on the identifiability and downstream generalization of NCL. Empirically, we show that these advantages enable NCL to outperform CL significantly on feature disentanglement, feature selection, as well as downstream classification tasks. At last, we show that NCL can be easily extended to other learning scenarios and benefit supervised learning as well. Code is available at https://github.com/PKU-ML/non_neg.  ( 2 min )
    TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer
    arXiv:2403.12481v1 Announce Type: new Abstract: Detecting fake news has received a lot of attention. Many previous methods concatenate independently encoded unimodal data, ignoring the benefits of integrated multimodal information. Also, the absence of specialized feature extraction for text and images further limits these methods. This paper introduces an end-to-end model called TT-BLIP that applies the bootstrapping language-image pretraining for unified vision-language understanding and generation (BLIP) for three types of information: BERT and BLIP\textsubscript{Txt} for text, ResNet and BLIP\textsubscript{Img} for images, and bidirectional BLIP encoders for multimodal information. The Multimodal Tri-Transformer fuses tri-modal features using three types of multi-head attention mechanisms, ensuring integrated modalities for enhanced representations and improved multimodal data analysis. The experiments are performed using two fake news datasets, Weibo and Gossipcop. The results indicate TT-BLIP outperforms the state-of-the-art models.  ( 2 min )
    U-Net Kalman Filter (UNetKF): An Example of Machine Learning-assisted Ensemble Data Assimilation
    arXiv:2403.12366v1 Announce Type: new Abstract: Machine learning techniques have seen a tremendous rise in popularity in weather and climate sciences. Data assimilation (DA), which combines observations and numerical models, has great potential to incorporate machine learning and artificial intelligence (ML/AI) techniques. In this paper, we use U-Net, a type of convolutional neutral network (CNN), to predict the localized ensemble covariances for the Ensemble Kalman Filter (EnKF) algorithm. Using a 2-layer quasi-geostrophic model, U-Nets are trained using data from EnKF DA experiments. The trained U-Nets are then used to predict the flow-dependent localized error covariance matrices in U-Net Kalman Filter (UNetKF) experiments, which are compared to traditional 3-dimensional variational (3DVar), ensemble 3DVar (En3DVar) and EnKF methods. The performance of UNetKF can match or exceed that of 3DVar, En3DVar or EnKF. We also demonstrate that trained U-Nets can be transferred to a higher-resolution model for UNetKF implementation, which again performs competitively to 3DVar and EnKF, particularly for small ensemble sizes.  ( 2 min )
    Finding the Missing Data: A BERT-inspired Approach Against Package Loss in Wireless Sensing
    arXiv:2403.12400v1 Announce Type: new Abstract: Despite the development of various deep learning methods for Wi-Fi sensing, package loss often results in noncontinuous estimation of the Channel State Information (CSI), which negatively impacts the performance of the learning models. To overcome this challenge, we propose a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) for CSI recovery, named CSI-BERT. CSI-BERT can be trained in an self-supervised manner on the target dataset without the need for additional data. Furthermore, unlike traditional interpolation methods that focus on one subcarrier at a time, CSI-BERT captures the sequential relationships across different subcarriers. Experimental results demonstrate that CSI-BERT achieves lower error rates and faster speed compared to traditional interpolation methods, even when facing with high loss rates. Moreover, by harnessing the recovered CSI obtained from CSI-BERT, other deep learning models like Residual Network and Recurrent Neural Network can achieve an average increase in accuracy of approximately 15\% in Wi-Fi sensing tasks. The collected dataset WiGesture and code for our model are publicly available at https://github.com/RS2002/CSI-BERT.  ( 2 min )
    Equity through Access: A Case for Small-scale Deep Learning
    arXiv:2403.12562v1 Announce Type: new Abstract: The recent advances in deep learning (DL) have been accelerated by access to large-scale data and compute. These large-scale resources have been used to train progressively larger models which are resource intensive in terms of compute, data, energy, and carbon emissions. These costs are becoming a new type of entry barrier to researchers and practitioners with limited access to resources at such scale, particularly in the Global South. In this work, we take a comprehensive look at the landscape of existing DL models for vision tasks and demonstrate their usefulness in settings where resources are limited. To account for the resource consumption of DL models, we introduce a novel measure to estimate the performance per resource unit, which we call the PePR score. Using a diverse family of 131 unique DL architectures (spanning 1M to 130M trainable parameters) and three medical image datasets, we capture trends about the performance-resource trade-offs. In applications like medical image analysis, we argue that small-scale, specialized models are better than striving for large-scale models. Furthermore, we show that using pretrained models can significantly reduce the computational resources and data required. We hope this work will encourage the community to focus on improving AI equity by developing methods and models with smaller resource footprints.  ( 3 min )
    Do Generated Data Always Help Contrastive Learning?
    arXiv:2403.12448v1 Announce Type: new Abstract: Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-equality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles that stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena via deriving its generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf.  ( 2 min )
    FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization
    arXiv:2403.12474v1 Announce Type: new Abstract: Despite the remarkable success of graph neural networks (GNNs) in modeling graph-structured data, like other machine learning models, GNNs are also susceptible to making biased predictions based on sensitive attributes, such as race and gender. For fairness consideration, recent state-of-the-art (SOTA) methods propose to filter out sensitive information from inputs or representations, e.g., edge dropping or feature masking. However, we argue that such filtering-based strategies may also filter out some non-sensitive feature information, leading to a sub-optimal trade-off between predictive performance and fairness. To address this issue, we unveil an innovative neutralization-based paradigm, where additional Fairness-facilitating Features (F3) are incorporated into node features or representations before message passing. The F3 are expected to statistically neutralize the sensitive bias in node representations and provide additional nonsensitive information. We also provide theoretical explanations for our rationale, concluding that F3 can be realized by emphasizing the features of each node's heterogeneous neighbors (neighbors with different sensitive attributes). We name our method as FairSIN, and present three implementation variants from both data-centric and model-centric perspectives. Experimental results on five benchmark datasets with three different GNN backbones show that FairSIN significantly improves fairness metrics while maintaining high prediction accuracies.  ( 2 min )
    Electioneering the Network: Dynamic Multi-Step Adversarial Attacks for Community Canvassing
    arXiv:2403.12399v1 Announce Type: new Abstract: The problem of online social network manipulation for community canvassing is of real concern in today's world. Motivated by the study of voter models, opinion and polarization dynamics on networks, we model community canvassing as a dynamic process over a network enabled via gradient-based attacks on GNNs. Existing attacks on GNNs are all single-step and do not account for the dynamic cascading nature of information diffusion in networks. We consider the realistic scenario where an adversary uses a GNN as a proxy to predict and manipulate voter preferences, especially uncertain voters. Gradient-based attacks on the GNN inform the adversary of strategic manipulations that can be made to proselytize targeted voters. In particular, we explore $\textit{minimum budget attacks for community canvassing}$ (MBACC). We show that the MBACC problem is NP-Hard and propose Dynamic Multi-Step Adversarial Community Canvassing (MAC) to address it. MAC makes dynamic local decisions based on the heuristic of low budget and high second-order influence to convert and perturb target voters. MAC is a dynamic multi-step attack that discovers low-budget and high-influence targets from which efficient cascading attacks can happen. We evaluate MAC against single-step baselines on the MBACC problem with multiple underlying networks and GNN models. Our experiments show the superiority of MAC which is able to discover efficient multi-hop attacks for adversarial community canvassing. Our code implementation and data is available at https://github.com/saurabhsharma1993/mac.  ( 3 min )
    NTK-Guided Few-Shot Class Incremental Learning
    arXiv:2403.12486v1 Announce Type: new Abstract: While anti-amnesia FSCIL learners often excel in incremental sessions, they tend to prioritize mitigating knowledge attrition over harnessing the model's potential for knowledge acquisition. In this paper, we delve into the foundations of model generalization in FSCIL through the lens of the Neural Tangent Kernel (NTK). Our primary design focus revolves around ensuring optimal NTK convergence and NTK-related generalization error, serving as the theoretical bedrock for exceptional generalization. To attain globally optimal NTK convergence, we employ a meta-learning mechanism grounded in mathematical principles to guide the optimization process within an expanded network. Furthermore, to reduce the NTK-related generalization error, we commence from the foundational level, optimizing the relevant factors constituting its generalization loss. Specifically, we initiate self-supervised pre-training on the base session to shape the initial network weights. Then they are carefully refined through curricular alignment, followed by the application of dual NTK regularization tailored specifically for both convolutional and linear layers. Through the combined effects of these measures, our network acquires robust NTK properties, significantly enhancing its foundational generalization. On popular FSCIL benchmark datasets, our NTK-FSCIL surpasses contemporary state-of-the-art approaches, elevating end-session accuracy by 2.9% to 8.7%.  ( 2 min )
    Contextualized Messages Boost Graph Representations
    arXiv:2403.12529v1 Announce Type: new Abstract: Graph neural networks (GNNs) have gained significant interest in recent years due to their ability to handle arbitrarily structured data represented as graphs. GNNs generally follow the message-passing scheme to locally update node feature representations. A graph readout function is then employed to create a representation for the entire graph. Several studies proposed different GNNs by modifying the aggregation and combination strategies of the message-passing framework, often inspired by heuristics. Nevertheless, several studies have begun exploring GNNs from a theoretical perspective based on the graph isomorphism problem which inherently assumes countable node feature representations. Yet, there are only a few theoretical works exploring GNNs with uncountable node feature representations. This paper presents a new perspective on the representational capabilities of GNNs across all levels - node-level, neighborhood-level, and graph-level - when the space of node feature representation is uncountable. From the results, a novel soft-isomorphic relational graph convolution network (SIR-GCN) is proposed that emphasizes non-linear and contextualized transformations of neighborhood feature representations. The mathematical relationship of SIR-GCN and three widely used GNNs is explored to highlight the contribution. Validation on synthetic datasets then demonstrates that SIR-GCN outperforms comparable models even in simple node and graph property prediction tasks.  ( 2 min )
    Temporally-Consistent Koopman Autoencoders for Forecasting Dynamical Systems
    arXiv:2403.12335v1 Announce Type: new Abstract: Absence of sufficiently high-quality data often poses a key challenge in data-driven modeling of high-dimensional spatio-temporal dynamical systems. Koopman Autoencoders (KAEs) harness the expressivity of deep neural networks (DNNs), the dimension reduction capabilities of autoencoders, and the spectral properties of the Koopman operator to learn a reduced-order feature space with simpler, linear dynamics. However, the effectiveness of KAEs is hindered by limited and noisy training datasets, leading to poor generalizability. To address this, we introduce the Temporally-Consistent Koopman Autoencoder (tcKAE), designed to generate accurate long-term predictions even with constrained and noisy training data. This is achieved through a consistency regularization term that enforces prediction coherence across different time steps, thus enhancing the robustness and generalizability of tcKAE over existing models. We provide analytical justification for this approach based on Koopman spectral theory and empirically demonstrate tcKAE's superior performance over state-of-the-art KAE models across a variety of test cases, including simple pendulum oscillations, kinetic plasmas, fluid flows, and sea surface temperature data.  ( 2 min )
    STG-Mamba: Spatial-Temporal Graph Learning via Selective State Space Model
    arXiv:2403.12418v1 Announce Type: new Abstract: Spatial-Temporal Graph (STG) data is characterized as dynamic, heterogenous, and non-stationary, leading to the continuous challenge of spatial-temporal graph learning. In the past few years, various GNN-based methods have been proposed to solely focus on mimicking the relationships among node individuals of the STG network, ignoring the significance of modeling the intrinsic features that exist in STG system over time. In contrast, modern Selective State Space Models (SSSMs) present a new approach which treat STG Network as a system, and meticulously explore the STG system's dynamic state evolution across temporal dimension. In this work, we introduce Spatial-Temporal Graph Mamba (STG-Mamba) as the first exploration of leveraging the powerful selective state space models for STG learning by treating STG Network as a system, and employing the Graph Selective State Space Block (GS3B) to precisely characterize the dynamic evolution of STG networks. STG-Mamba is formulated as an Encoder-Decoder architecture, which takes GS3B as the basic module, for efficient sequential data modeling. Furthermore, to strengthen GNN's ability of modeling STG data under the setting of SSSMs, we propose Kalman Filtering Graph Neural Networks (KFGN) for adaptive graph structure upgrading. KFGN smoothly fits in the context of selective state space evolution, and at the same time keeps linear complexity. Extensive empirical studies are conducted on three benchmark STG forecasting datasets, demonstrating the performance superiority and computational efficiency of STG-Mamba. It not only surpasses existing state-of-the-art methods in terms of STG forecasting performance, but also effectively alleviate the computational bottleneck of large-scale graph networks in reducing the computational cost of FLOPs and test inference time.  ( 3 min )
    Understanding Training-free Diffusion Guidance: Mechanisms and Limitations
    arXiv:2403.12404v1 Announce Type: new Abstract: Adding additional control to pretrained diffusion models has become an increasingly popular research area, with extensive applications in computer vision, reinforcement learning, and AI for science. Recently, several studies have proposed training-free diffusion guidance by using off-the-shelf networks pretrained on clean images. This approach enables zero-shot conditional generation for universal control formats, which appears to offer a free lunch in diffusion guidance. In this paper, we aim to develop a deeper understanding of the operational mechanisms and fundamental limitations of training-free guidance. We offer a theoretical analysis that supports training-free guidance from the perspective of optimization, distinguishing it from classifier-based (or classifier-free) guidance. To elucidate their drawbacks, we theoretically demonstrate that training-free methods are more susceptible to adversarial gradients and exhibit slower convergence rates compared to classifier guidance. We then introduce a collection of techniques designed to overcome the limitations, accompanied by theoretical rationale and empirical evidence. Our experiments in image and motion generation confirm the efficacy of these techniques.  ( 2 min )
    Transfer in Sequential Multi-armed Bandits via Reward Samples
    arXiv:2403.12428v1 Announce Type: new Abstract: We consider a sequential stochastic multi-armed bandit problem where the agent interacts with bandit over multiple episodes. The reward distribution of the arms remain constant throughout an episode but can change over different episodes. We propose an algorithm based on UCB to transfer the reward samples from the previous episodes and improve the cumulative regret performance over all the episodes. We provide regret analysis and empirical results for our algorithm, which show significant improvement over the standard UCB algorithm without transfer.  ( 2 min )
    FairSTG: Countering performance heterogeneity via collaborative sample-level optimization
    arXiv:2403.12391v1 Announce Type: new Abstract: Spatiotemporal learning plays a crucial role in mobile computing techniques to empower smart cites. While existing research has made great efforts to achieve accurate predictions on the overall dataset, they still neglect the significant performance heterogeneity across samples. In this work, we designate the performance heterogeneity as the reason for unfair spatiotemporal learning, which not only degrades the practical functions of models, but also brings serious potential risks to real-world urban applications. To fix this gap, we propose a model-independent Fairness-aware framework for SpatioTemporal Graph learning (FairSTG), which inherits the idea of exploiting advantages of well-learned samples to challenging ones with collaborative mix-up. Specifically, FairSTG consists of a spatiotemporal feature extractor for model initialization, a collaborative representation enhancement for knowledge transfer between well-learned samples and challenging ones, and fairness objectives for immediately suppressing sample-level performance heterogeneity. Experiments on four spatiotemporal datasets demonstrate that our FairSTG significantly improves the fairness quality while maintaining comparable forecasting accuracy. Case studies show FairSTG can counter both spatial and temporal performance heterogeneity by our sample-level retrieval and compensation, and our work can potentially alleviate the risks on spatiotemporal resource allocation for underrepresented urban regions.  ( 2 min )
    Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
    arXiv:2403.12422v1 Announce Type: new Abstract: Pretraining transformers are generally time-consuming. Fully quantized training (FQT) is a promising approach to speed up pretraining. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal speedup and significant performance degradation when used in transformers due to the high memory access overheads and low-precision computations. In this work, we propose Jetfire, an efficient and accurate INT8 training method specific to transformers. Our method features an INT8 data flow to optimize memory access and a per-block quantization method to maintain the accuracy of pretrained transformers. Extensive experiments demonstrate that our INT8 FQT method achieves comparable accuracy to the FP16 training baseline and outperforms the existing INT8 training works for transformers. Moreover, for a standard transformer block, our method offers an end-to-end training speedup of 1.42x and a 1.49x memory reduction compared to the FP16 baseline.  ( 2 min )
    Methods for Generating Drift in Text Streams
    arXiv:2403.12328v1 Announce Type: new Abstract: Systems and individuals produce data continuously. On the Internet, people share their knowledge, sentiments, and opinions, provide reviews about services and products, and so on. Automatically learning from these textual data can provide insights to organizations and institutions, thus preventing financial impacts, for example. To learn from textual data over time, the machine learning system must account for concept drift. Concept drift is a frequent phenomenon in real-world datasets and corresponds to changes in data distribution over time. For instance, a concept drift occurs when sentiments change or a word's meaning is adjusted over time. Although concept drift is frequent in real-world applications, benchmark datasets with labeled drifts are rare in the literature. To bridge this gap, this paper provides four textual drift generation methods to ease the production of datasets with labeled drifts. These methods were applied to Yelp and Airbnb datasets and tested using incremental classifiers respecting the stream mining paradigm to evaluate their ability to recover from the drifts. Results show that all methods have their performance degraded right after the drifts, and the incremental SVM is the fastest to run and recover the previous performance levels regarding accuracy and Macro F1-Score.  ( 2 min )
    Learning Transferable Time Series Classifier with Cross-Domain Pre-training from Language Model
    arXiv:2403.12372v1 Announce Type: new Abstract: Advancements in self-supervised pre-training (SSL) have significantly advanced the field of learning transferable time series representations, which can be very useful in enhancing the downstream task. Despite being effective, most existing works struggle to achieve cross-domain SSL pre-training, missing valuable opportunities to integrate patterns and features from different domains. The main challenge lies in the significant differences in the characteristics of time-series data across different domains, such as variations in the number of channels and temporal resolution scales. To address this challenge, we propose CrossTimeNet, a novel cross-domain SSL learning framework to learn transferable knowledge from various domains to largely benefit the target downstream task. One of the key characteristics of CrossTimeNet is the newly designed time series tokenization module, which could effectively convert the raw time series into a sequence of discrete tokens based on a reconstruction optimization process. Besides, we highlight that predicting a high proportion of corrupted tokens can be very helpful for extracting informative patterns across different domains during SSL pre-training, which has been largely overlooked in past years. Furthermore, unlike previous works, our work treats the pre-training language model (PLM) as the initialization of the encoder network, investigating the feasibility of transferring the knowledge learned by the PLM to the time series area. Through these efforts, the path to cross-domain pre-training of a generic time series model can be effectively paved. We conduct extensive experiments in a real-world scenario across various time series classification domains. The experimental results clearly confirm CrossTimeNet's superior performance.  ( 3 min )
    Sim2Real in Reconstructive Spectroscopy: Deep Learning with Augmented Device-Informed Data Simulation
    arXiv:2403.12354v1 Announce Type: new Abstract: This work proposes a deep learning (DL)-based framework, namely Sim2Real, for spectral signal reconstruction in reconstructive spectroscopy, focusing on efficient data sampling and fast inference time. The work focuses on the challenge of reconstructing real-world spectral signals under the extreme setting where only device-informed simulated data are available for training. Such device-informed simulated data are much easier to collect than real-world data but exhibit large distribution shifts from their real-world counterparts. To leverage such simulated data effectively, a hierarchical data augmentation strategy is introduced to mitigate the adverse effects of this domain shift, and a corresponding neural network for the spectral signal reconstruction with our augmented data is designed. Experiments using a real dataset measured from our spectrometer device demonstrate that Sim2Real achieves significant speed-up during the inference while attaining on-par performance with the state-of-the-art optimization-based methods.  ( 2 min )
    Advancing Time Series Classification with Multimodal Language Modeling
    arXiv:2403.12371v1 Announce Type: new Abstract: For the advancements of time series classification, scrutinizing previous studies, most existing methods adopt a common learning-to-classify paradigm - a time series classifier model tries to learn the relation between sequence inputs and target label encoded by one-hot distribution. Although effective, this paradigm conceals two inherent limitations: (1) encoding target categories with one-hot distribution fails to reflect the comparability and similarity between labels, and (2) it is very difficult to learn transferable model across domains, which greatly hinder the development of universal serving paradigm. In this work, we propose InstructTime, a novel attempt to reshape time series classification as a learning-to-generate paradigm. Relying on the powerful generative capacity of the pre-trained language model, the core idea is to formulate the classification of time series as a multimodal understanding task, in which both task-specific instructions and raw time series are treated as multimodal inputs while the label information is represented by texts. To accomplish this goal, three distinct designs are developed in the InstructTime. Firstly, a time series discretization module is designed to convert continuous time series into a sequence of hard tokens to solve the inconsistency issue across modal inputs. To solve the modality representation gap issue, for one thing, we introduce an alignment projected layer before feeding the transformed token of time series into language models. For another, we highlight the necessity of auto-regressive pre-training across domains, which can facilitate the transferability of the language model and boost the generalization performance. Extensive experiments are conducted over benchmark datasets, whose results uncover the superior performance of InstructTime and the potential for a universal foundation model in time series classification.  ( 3 min )
    Friendly Sharpness-Aware Minimization
    arXiv:2403.12350v1 Announce Type: new Abstract: Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness. Despite the practical success, the mechanisms behind SAM's generalization enhancements remain elusive, limiting its progress in deep learning optimization. In this work, we investigate SAM's core components for generalization improvement and introduce "Friendly-SAM" (F-SAM) to further enhance SAM's generalization. Our investigation reveals the key role of batch-specific stochastic gradient noise within the adversarial perturbation, i.e., the current minibatch gradient, which significantly influences SAM's generalization performance. By decomposing the adversarial perturbation in SAM into full gradient and stochastic gradient noise components, we discover that relying solely on the full gradient component degrades generalization while excluding it leads to improved performance. The possible reason lies in the full gradient component's increase in sharpness loss for the entire dataset, creating inconsistencies with the subsequent sharpness minimization step solely on the current minibatch data. Inspired by these insights, F-SAM aims to mitigate the negative effects of the full gradient component. It removes the full gradient estimated by an exponentially moving average (EMA) of historical stochastic gradients, and then leverages stochastic gradient noise for improved generalization. Moreover, we provide theoretical validation for the EMA approximation and prove the convergence of F-SAM on non-convex problems. Extensive experiments demonstrate the superior generalization performance and robustness of F-SAM over vanilla SAM. Code is available at https://github.com/nblt/F-SAM.  ( 2 min )
    Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts
    arXiv:2403.12326v1 Announce Type: new Abstract: Generative models have demonstrated remarkable potential in generating visually impressive content from textual descriptions. However, training these models on unfiltered internet data poses the risk of learning and subsequently propagating undesirable concepts, such as copyrighted or unethical content. In this paper, we propose a novel method to remove undesirable concepts from text-to-image generative models by incorporating a learnable prompt into the cross-attention module. This learnable prompt acts as additional memory to transfer the knowledge of undesirable concepts into it and reduce the dependency of these concepts on the model parameters and corresponding textual inputs. Because of this knowledge transfer into the prompt, erasing these undesirable concepts is more stable and has minimal negative impact on other concepts. We demonstrate the effectiveness of our method on the Stable Diffusion model, showcasing its superiority over state-of-the-art erasure methods in terms of removing undesirable content while preserving other unrelated elements.  ( 2 min )
    Reinforcement Learning from Delayed Observations via World Models
    arXiv:2403.12309v1 Announce Type: new Abstract: In standard Reinforcement Learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of RL algorithms. In this paper, we focus on addressing observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or even degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to %30. Moreover, we evaluate our methods on visual input based delayed environment, for the first time showcasing delay-aware reinforcement learning on visual observations.  ( 2 min )
    Enhanced Detection of Transdermal Alcohol Levels Using Hyperdimensional Computing on Embedded Devices
    arXiv:2403.12323v1 Announce Type: new Abstract: Alcohol consumption has a significant impact on individuals' health, with even more pronounced consequences when consumption becomes excessive. One approach to promoting healthier drinking habits is implementing just-in-time interventions, where timely notifications indicating intoxication are sent during heavy drinking episodes. However, the complexity or invasiveness of an intervention mechanism may deter an individual from using them in practice. Previous research tackled this challenge using collected motion data and conventional Machine Learning (ML) algorithms to classify heavy drinking episodes, but with impractical accuracy and computational efficiency for mobile devices. Consequently, we have elected to use Hyperdimensional Computing (HDC) to design a just-in-time intervention approach that is practical for smartphones, smart wearables, and IoT deployment. HDC is a framework that has proven results in processing real-time sensor data efficiently. This approach offers several advantages, including low latency, minimal power consumption, and high parallelism. We explore various HDC encoding designs and combine them with various HDC learning models to create an optimal and feasible approach for mobile devices. Our findings indicate an accuracy rate of 89\%, which represents a substantial 12\% improvement over the current state-of-the-art.  ( 2 min )
    Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training
    arXiv:2403.12320v1 Announce Type: new Abstract: Efficient and biologically plausible alternatives to backpropagation in neural network training remain a challenge due to issues such as high computational complexity and additional assumptions about neural networks, which limit scalability to deeper networks. The likelihood ratio method offers a promising gradient estimation strategy but is constrained by significant memory consumption, especially when deploying multiple copies of data to reduce estimation variance. In this paper, we introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation. By exploiting the natural parallelism during the backward pass using LR, we further provide a high-performance training strategy, which pipelines both the forward and backward pass, to make it more suitable for the computation on specialized hardware. Extensive experiments demonstrate the effectiveness of the approximation technique in neural network training. This work underscores the potential of the likelihood ratio method in achieving high-performance neural network training, suggesting avenues for further exploration.  ( 2 min )
    Improving LoRA in Privacy-preserving Federated Learning
    arXiv:2403.12313v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) is one of the most popular task-specific parameter-efficient fine-tuning (PEFT) methods on pre-trained language models for its good performance and computational efficiency. LoRA injects a product of two trainable rank decomposition matrices over the top of each frozen pre-trained model module. However, when applied in the setting of privacy-preserving federated learning (FL), LoRA may become unstable due to the following facts: 1) the effects of data heterogeneity and multi-step local updates are non-negligible, 2) additive noise enforced on updating gradients to guarantee differential privacy (DP) can be amplified and 3) the final performance is susceptible to hyper-parameters. A key factor leading to these phenomena is the discordance between jointly optimizing the two low-rank matrices by local clients and separately aggregating them by the central server. Thus, this paper proposes an efficient and effective version of LoRA, Federated Freeze A LoRA (FFA-LoRA), to alleviate these challenges and further halve the communication cost of federated fine-tuning LLMs. The core idea of FFA-LoRA is to fix the randomly initialized non-zero matrices and only fine-tune the zero-initialized matrices. Compared to LoRA, FFA-LoRA is motivated by practical and theoretical benefits in privacy-preserved FL. Our experiments demonstrate that FFA-LoRA provides more consistent performance with better computational efficiency over vanilla LoRA in various FL tasks.  ( 2 min )
    Molecular Classification Using Hyperdimensional Graph Classification
    arXiv:2403.12307v1 Announce Type: new Abstract: Our work introduces an innovative approach to graph learning by leveraging Hyperdimensional Computing. Graphs serve as a widely embraced method for conveying information, and their utilization in learning has gained significant attention. This is notable in the field of chemoinformatics, where learning from graph representations plays a pivotal role. An important application within this domain involves the identification of cancerous cells across diverse molecular structures. We propose an HDC-based model that demonstrates comparable Area Under the Curve results when compared to state-of-the-art models like Graph Neural Networks (GNNs) or the Weisfieler-Lehman graph kernel (WL). Moreover, it outperforms previously proposed hyperdimensional computing graph learning methods. Furthermore, it achieves noteworthy speed enhancements, boasting a 40x acceleration in the training phase and a 15x improvement in inference time compared to GNN and WL models. This not only underscores the efficacy of the HDC-based method, but also highlights its potential for expedited and resource-efficient graph learning.  ( 2 min )
    Stochastic Rounding Implicitly Regularizes Tall-and-Thin Matrices
    arXiv:2403.12278v1 Announce Type: new Abstract: Motivated by the popularity of stochastic rounding in the context of machine learning and the training of large-scale deep neural network models, we consider stochastic nearness rounding of real matrices $\mathbf{A}$ with many more rows than columns. We provide novel theoretical evidence, supported by extensive experimental evaluation that, with high probability, the smallest singular value of a stochastically rounded matrix is well bounded away from zero -- regardless of how close $\mathbf{A}$ is to being rank deficient and even if $\mathbf{A}$ is rank-deficient. In other words, stochastic rounding \textit{implicitly regularizes} tall and skinny matrices $\mathbf{A}$ so that the rounded version has full column rank. Our proofs leverage powerful results in random matrix theory, and the idea that stochastic rounding errors do not concentrate in low-dimensional column spaces.  ( 2 min )
    Efficient Transformer-based Hyper-parameter Optimization for Resource-constrained IoT Environments
    arXiv:2403.12237v1 Announce Type: new Abstract: The hyper-parameter optimization (HPO) process is imperative for finding the best-performing Convolutional Neural Networks (CNNs). The automation process of HPO is characterized by its sizable computational footprint and its lack of transparency; both important factors in a resource-constrained Internet of Things (IoT) environment. In this paper, we address these problems by proposing a novel approach that combines transformer architecture and actor-critic Reinforcement Learning (RL) model, TRL-HPO, equipped with multi-headed attention that enables parallelization and progressive generation of layers. These assumptions are founded empirically by evaluating TRL-HPO on the MNIST dataset and comparing it with state-of-the-art approaches that build CNN models from scratch. The results show that TRL-HPO outperforms the classification results of these approaches by 6.8% within the same time frame, demonstrating the efficiency of TRL-HPO for the HPO process. The analysis of the results identifies the main culprit for performance degradation attributed to stacking fully connected layers. This paper identifies new avenues for improving RL-based HPO processes in resource-constrained environments.  ( 2 min )
    Improving Generalization via Meta-Learning on Hard Samples
    arXiv:2403.12236v1 Announce Type: new Abstract: Learned reweighting (LRW) approaches to supervised learning use an optimization criterion to assign weights for training instances, in order to maximize performance on a representative validation dataset. We pose and formalize the problem of optimized selection of the validation set used in LRW training, to improve classifier generalization. In particular, we show that using hard-to-classify instances in the validation set has both a theoretical connection to, and strong empirical evidence of generalization. We provide an efficient algorithm for training this meta-optimized model, as well as a simple train-twice heuristic for careful comparative study. We demonstrate that LRW with easy validation data performs consistently worse than LRW with hard validation data, establishing the validity of our meta-optimization problem. Our proposed algorithm outperforms a wide range of baselines on a range of datasets and domain shift challenges (Imagenet-1K, CIFAR-100, Clothing-1M, CAMELYON, WILDS, etc.), with ~1% gains using VIT-B on Imagenet. We also show that using naturally hard examples for validation (Imagenet-R / Imagenet-A) in LRW training for Imagenet improves performance on both clean and naturally hard test instances by 1-2%. Secondary analyses show that using hard validation data in an LRW framework improves margins on test data, hinting at the mechanism underlying our empirical gains. We believe this work opens up new research directions for the meta-optimization of meta-learning in a supervised learning context.  ( 2 min )
    Large-scale flood modeling and forecasting with FloodCast
    arXiv:2403.12226v1 Announce Type: new Abstract: Large-scale hydrodynamic models generally rely on fixed-resolution spatial grids and model parameters as well as incurring a high computational cost. This limits their ability to accurately forecast flood crests and issue time-critical hazard warnings. In this work, we build a fast, stable, accurate, resolution-invariant, and geometry-adaptative flood modeling and forecasting framework that can perform at large scales, namely FloodCast. The framework comprises two main modules: multi-satellite observation and hydrodynamic modeling. In the multi-satellite observation module, a real-time unsupervised change detection method and a rainfall processing and analysis tool are proposed to harness the full potential of multi-satellite observations in large-scale flood prediction. In the hydrodynamic modeling module, a geometry-adaptive physics-informed neural solver (GeoPINS) is introduced, benefiting from the absence of a requirement for training data in physics-informed neural networks and featuring a fast, accurate, and resolution-invariant architecture with Fourier neural operators. GeoPINS demonstrates impressive performance on popular PDEs across regular and irregular domains. Building upon GeoPINS, we propose a sequence-to-sequence GeoPINS model to handle long-term temporal series and extensive spatial domains in large-scale flood modeling. Next, we establish a benchmark dataset in the 2022 Pakistan flood to assess various flood prediction methods. Finally, we validate the model in three dimensions - flood inundation range, depth, and transferability of spatiotemporal downscaling. Traditional hydrodynamics and sequence-to-sequence GeoPINS exhibit exceptional agreement during high water levels, while comparative assessments with SAR-based flood depth data show that sequence-to-sequence GeoPINS outperforms traditional hydrodynamics, with smaller prediction errors.  ( 3 min )
    GCAM: Gaussian and causal-attention model of food fine-grained recognition
    arXiv:2403.12109v1 Announce Type: new Abstract: Currently, most food recognition relies on deep learning for category classification. However, these approaches struggle to effectively distinguish between visually similar food samples, highlighting the pressing need to address fine-grained issues in food recognition. To mitigate these challenges, we propose the adoption of a Gaussian and causal-attention model for fine-grained object recognition.In particular, we train to obtain Gaussian features over target regions, followed by the extraction of fine-grained features from the objects, thereby enhancing the feature mapping capabilities of the target regions. To counteract data drift resulting from uneven data distributions, we employ a counterfactual reasoning approach. By using counterfactual interventions, we analyze the impact of the learned image attention mechanism on network predictions, enabling the network to acquire more useful attention weights for fine-grained image recognition. Finally, we design a learnable loss strategy to balance training stability across various modules, ultimately improving the accuracy of the final target recognition. We validate our approach on four relevant datasets, demonstrating its excellent performance across these four datasets.We experimentally show that GCAM surpasses state-of-the-art methods on the ETH-FOOD101, UECFOOD256, and Vireo-FOOD172 datasets. Furthermore, our approach also achieves state-of-the-art performance on the CUB-200 dataset.  ( 2 min )
    PETScML: Second-order solvers for training regression problems in Scientific Machine Learning
    arXiv:2403.12188v1 Announce Type: new Abstract: In recent years, we have witnessed the emergence of scientific machine learning as a data-driven tool for the analysis, by means of deep-learning techniques, of data produced by computational science and engineering applications. At the core of these methods is the supervised training algorithm to learn the neural network realization, a highly non-convex optimization problem that is usually solved using stochastic gradient methods. However, distinct from deep-learning practice, scientific machine-learning training problems feature a much larger volume of smooth data and better characterizations of the empirical risk functions, which make them suited for conventional solvers for unconstrained optimization. We introduce a lightweight software framework built on top of the Portable and Extensible Toolkit for Scientific computation to bridge the gap between deep-learning software and conventional solvers for unconstrained minimization. We empirically demonstrate the superior efficacy of a trust region method based on the Gauss-Newton approximation of the Hessian in improving the generalization errors arising from regression tasks when learning surrogate models for a wide range of scientific machine-learning techniques and test cases. All the conventional second-order solvers tested, including L-BFGS and inexact Newton with line-search, compare favorably, either in terms of cost or accuracy, with the adaptive first-order methods used to validate the surrogate models.  ( 2 min )
    Graph Neural Networks for Learning Equivariant Representations of Neural Networks
    arXiv:2403.12143v1 Announce Type: new Abstract: Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. The source code is open-sourced at https://github.com/mkofinas/neural-graphs.  ( 2 min )
    The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection
    arXiv:2403.12166v1 Announce Type: new Abstract: As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly larger models. While this has led to advancements in accuracy, it has also escalated computational costs to unsustainable levels. Addressing this, our work aims to strike a delicate balance between computational efficiency and model accuracy, a persisting challenge in the field. We introduce a novel method that employs core subset selection for reweighting, effectively optimizing both computational time and model performance. By focusing on a strategically selected coreset, our approach offers a robust representation, as it efficiently minimizes the influence of outliers. The re-calibrated weights are then mapped back to and propagated across the entire dataset. Our experimental results substantiate the effectiveness of this approach, underscoring its potential as a scalable and precise solution for model training.  ( 2 min )
  • Open

    Probabilistic Modeling for Sequences of Sets in Continuous-Time
    arXiv:2312.15045v3 Announce Type: replace-cross Abstract: Neural marked temporal point processes have been a valuable addition to the existing toolbox of statistical parametric models for continuous-time event data. These models are useful for sequences where each event is associated with a single item (a single type of event or a "mark") -- but such models are not suited for the practical situation where each event is associated with a set of items. In this work, we develop a general framework for modeling set-valued data in continuous-time, compatible with any intensity-based recurrent neural point process model. In addition, we develop inference methods that can use such models to answer probabilistic queries such as "the probability of item $A$ being observed before item $B$," conditioned on sequence history. Computing exact answers for such queries is generally intractable for neural models due to both the continuous-time nature of the problem setting and the combinatorially-large space of potential outcomes for each event. To address this, we develop a class of importance sampling methods for querying with set-based sequences and demonstrate orders-of-magnitude improvements in efficiency over direct sampling via systematic experiments with four real-world datasets. We also illustrate how to use this framework to perform model selection using likelihoods that do not involve one-step-ahead prediction.  ( 3 min )
    Tuning-Free Stochastic Optimization
    arXiv:2402.07793v2 Announce Type: replace-cross Abstract: Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of "tuning-free" algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show tuning-free matching of SGD is possible and achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result that shows no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.  ( 2 min )
    Conformal Monte Carlo Meta-learners for Predictive Inference of Individual Treatment Effects
    arXiv:2402.04906v2 Announce Type: replace-cross Abstract: Knowledge of the effect of interventions, called the treatment effect, is paramount for decision-making. Approaches to estimating this treatment effect, e.g. by using Conditional Average Treatment Effect (CATE) estimators, often only provide a point estimate of this treatment effect, while additional uncertainty quantification is frequently desired instead. Therefore, we present a novel method, the Conformal Monte Carlo (CMC) meta-learners, leveraging conformal predictive systems, Monte Carlo sampling, and CATE meta-learners, to instead produce a predictive distribution usable in individualized decision-making. Furthermore, we show how specific assumptions on the noise distribution of the outcome heavily affect these uncertainty predictions. Nonetheless, the CMC framework shows strong experimental coverage while retaining small interval widths to provide estimates of the true individual treatment effect.  ( 2 min )
    Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data
    arXiv:2311.04829v2 Announce Type: replace-cross Abstract: Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, real-world data is often not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions. Then, we convert each GP into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications. We release the code of FunBaT at \url{https://github.com/xuangu-fang/Functional-Bayesian-Tucker-Decomposition}.  ( 3 min )
    Adversarial Training Should Be Cast as a Non-Zero-Sum Game
    arXiv:2306.11035v2 Announce Type: replace-cross Abstract: One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting.  ( 2 min )
    One-Shot Strategic Classification Under Unknown Costs
    arXiv:2311.02761v2 Announce Type: replace-cross Abstract: The goal of strategic classification is to learn decision rules which are robust to strategic input manipulation. Earlier works assume that these responses are known; while some recent works handle unknown responses, they exclusively study online settings with repeated model deployments. But there are many domains$\unicode{x2014}$particularly in public policy, a common motivating use case$\unicode{x2014}$where multiple deployments are infeasible, or where even one bad round is unacceptable. To address this gap, we initiate the formal study of one-shot strategic classification under unknown responses, which requires committing to a single classifier once. Focusing on uncertainty in the users' cost function, we begin by proving that for a broad class of costs, even a small mis-estimation of the true cost can entail trivial accuracy in the worst case. In light of this, we frame the task as a minimax problem, with the goal of identifying the classifier with the smallest worst-case risk over an uncertainty set of possible costs. We design efficient algorithms for both the full-batch and stochastic settings, which we prove converge (offline) to the minimax solution at the dimension-independent rate of $\tilde{\mathcal{O}}(T^{-\frac{1}{2}})$. Our theoretical analysis reveals important structure stemming from strategic responses, particularly the value of dual norm regularization with respect to the cost function.  ( 3 min )
    GeoShapley: A Game Theory Approach to Measuring Spatial Effects in Machine Learning Models
    arXiv:2312.03675v2 Announce Type: replace-cross Abstract: This paper introduces GeoShapley, a game theory approach to measuring spatial effects in machine learning models. GeoShapley extends the Nobel Prize-winning Shapley value framework in game theory by conceptualizing location as a player in a model prediction game, which enables the quantification of the importance of location and the synergies between location and other features in a model. GeoShapley is a model-agnostic approach and can be applied to statistical or black-box machine learning models in various structures. The interpretation of GeoShapley is directly linked with spatially varying coefficient models for explaining spatial effects and additive models for explaining non-spatial effects. Using simulated data, GeoShapley values are validated against known data-generating processes and are used for cross-comparison of seven statistical and machine learning models. An empirical example of house price modeling is used to illustrate GeoShapley's utility and interpretation with real world data. The method is available as an open-source Python package named geoshapley.  ( 2 min )
    On student-teacher deviations in distillation: does it pay to disobey?
    arXiv:2301.12923v3 Announce Type: replace-cross Abstract: Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo than the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. Finally, we tie these two observations together: we demonstrate that the exaggerated bias of KD can simultaneously result in both (a) the exaggeration of confidence and (b) the improved generalization of the student, thus offering a resolution to the apparent paradox. Our analysis brings existing theory and practice closer by considering the role of gradient descent in KD and by demonstrating the exaggerated bias effect in both theoretical and empirical settings.  ( 3 min )
    Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise
    arXiv:2306.00673v2 Announce Type: replace-cross Abstract: The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $({nd}/{\epsilon})^{O(d)}$ and under the Gaussian marginal distribution, PAC learns the class up to error rate $\epsilon$ with $O(\frac{K^{4d}}{\epsilon^{2d}} \cdot \log^{5d} n)$ samples even when an $\eta \leq O(\epsilon^d)$ fraction of them are corrupted by the nasty noise of Bshouty et al. (2002), possibly the strongest corruption model. Prior to this work, attribute-efficient robust algorithms are established only for the special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a structural result that translates the attribute sparsity to a sparsity pattern of the Chow vector under the basis of Hermite polynomials, and 2) a novel attribute-efficient robust Chow vector estimation algorithm which uses exclusively a restricted Frobenius norm to either certify a good approximation or to validate a sparsity-induced degree-$2d$ polynomial as a filter to detect corrupted samples.  ( 2 min )
    Private graphon estimation via sum-of-squares
    arXiv:2403.12213v1 Announce Type: cross Abstract: We develop the first pure node-differentially-private algorithms for learning stochastic block models and for graphon estimation with polynomial running time for any constant number of blocks. The statistical utility guarantees match those of the previous best information-theoretic (exponential-time) node-private mechanisms for these problems. The algorithm is based on an exponential mechanism for a score function defined in terms of a sum-of-squares relaxation whose level depends on the number of blocks. The key ingredients of our results are (1) a characterization of the distance between the block graphons in terms of a quadratic optimization over the polytope of doubly stochastic matrices, (2) a general sum-of-squares convergence result for polynomial optimization over arbitrary polytopes, and (3) a general approach to perform Lipschitz extensions of score functions as part of the sum-of-squares algorithmic paradigm.  ( 2 min )
    Selecting informative conformal prediction sets with false coverage rate control
    arXiv:2403.12295v1 Announce Type: cross Abstract: In supervised learning, including regression and classification, conformal methods provide prediction sets for the outcome/label with finite sample coverage for any machine learning predictors. We consider here the case where such prediction sets come after a selection process. The selection process requires that the selected prediction sets be `informative' in a well defined sense. We consider both the classification and regression settings where the analyst may consider as informative only the sample with prediction label sets or prediction intervals small enough, excluding null values, or obeying other appropriate `monotone' constraints. While this covers many settings of possible interest in various applications, we develop a unified framework for building such informative conformal prediction sets while controlling the false coverage rate (FCR) on the selected sample. While conformal prediction sets after selection have been the focus of much recent literature in the field, the new introduced procedures, called InfoSP and InfoSCOP, are to our knowledge the first ones providing FCR control for informative prediction sets. We show the usefulness of our resulting procedures on real and simulated data.  ( 2 min )
    Clustered Mallows Model
    arXiv:2403.12880v1 Announce Type: cross Abstract: Rankings are a type of preference elicitation that arise in experiments where assessors arrange items, for example, in decreasing order of utility. Orderings of n items labelled {1,...,n} denoted are permutations that reflect strict preferences. For a number of reasons, strict preferences can be unrealistic assumptions for real data. For example, when items share common traits it may be reasonable to attribute them equal ranks. Also, there can be different importance attributions to decisions that form the ranking. In a situation with, for example, a large number of items, an assessor may wish to rank at top a certain number items; to rank other items at the bottom and to express indifference to all others. In addition, when aggregating opinions, a judging body might be decisive about some parts of the rank but ambiguous for others. In this paper we extend the well-known Mallows (Mallows, 1957) model (MM) to accommodate item indifference, a phenomenon that can be in place for a variety of reasons, such as those above mentioned.The underlying grouping of similar items motivates the proposed Clustered Mallows Model (CMM). The CMM can be interpreted as a Mallows distribution for tied ranks where ties are learned from the data. The CMM provides the flexibility to combine strict and indifferent relations, achieving a simpler and robust representation of rank collections in the form of ordered clusters. Bayesian inference for the CMM is in the class of doubly-intractable problems since the model's normalisation constant is not available in closed form. We overcome this challenge by sampling from the posterior with a version of the exchange algorithm \citep{murray2006}. Real data analysis of food preferences and results of Formula 1 races are presented, illustrating the CMM in practical situations.  ( 3 min )
    Equity through Access: A Case for Small-scale Deep Learning
    arXiv:2403.12562v1 Announce Type: cross Abstract: The recent advances in deep learning (DL) have been accelerated by access to large-scale data and compute. These large-scale resources have been used to train progressively larger models which are resource intensive in terms of compute, data, energy, and carbon emissions. These costs are becoming a new type of entry barrier to researchers and practitioners with limited access to resources at such scale, particularly in the Global South. In this work, we take a comprehensive look at the landscape of existing DL models for vision tasks and demonstrate their usefulness in settings where resources are limited. To account for the resource consumption of DL models, we introduce a novel measure to estimate the performance per resource unit, which we call the PePR score. Using a diverse family of 131 unique DL architectures (spanning 1M to 130M trainable parameters) and three medical image datasets, we capture trends about the performance-resource trade-offs. In applications like medical image analysis, we argue that small-scale, specialized models are better than striving for large-scale models. Furthermore, we show that using pretrained models can significantly reduce the computational resources and data required. We hope this work will encourage the community to focus on improving AI equity by developing methods and models with smaller resource footprints.  ( 3 min )
    Improving Interpretability of Scores in Anomaly Detection Based on Gaussian-Bernoulli Restricted Boltzmann Machine
    arXiv:2403.12672v1 Announce Type: cross Abstract: Gaussian-Bernoulli restricted Boltzmann machines (GBRBMs) are often used for semi-supervised anomaly detection, where they are trained using only normal data points. In GBRBM-based anomaly detection, normal and anomalous data are classified based on a score that is identical to an energy function of the marginal GBRBM. However, the classification threshold is difficult to set to an appropriate value, as this score cannot be interpreted. In this study, we propose a measure that improves score's interpretability based on its cumulative distribution, and establish a guideline for setting the threshold using the interpretable measure. The results of numerical experiments show that the guideline is reasonable when setting the threshold solely using normal data points. Moreover, because identifying the measure involves computationally infeasible evaluation of the minimum score value, we also propose an evaluation method for the minimum score based on simulated annealing, which is widely used for optimization problems. The proposed evaluation method was also validated using numerical experiments.  ( 2 min )
    Community detection by spectral methods in multi-layer networks
    arXiv:2403.12540v1 Announce Type: cross Abstract: Community detection in multi-layer networks is a crucial problem in network analysis. In this paper, we analyze the performance of two spectral clustering algorithms for community detection within the multi-layer degree-corrected stochastic block model (MLDCSBM) framework. One algorithm is based on the sum of adjacency matrices, while the other utilizes the debiased sum of squared adjacency matrices. We establish consistency results for community detection using these methods under MLDCSBM as the size of the network and/or the number of layers increases. Our theorems demonstrate the advantages of utilizing multiple layers for community detection. Moreover, our analysis indicates that spectral clustering with the debiased sum of squared adjacency matrices is generally superior to spectral clustering with the sum of adjacency matrices. Numerical simulations confirm that our algorithm, employing the debiased sum of squared adjacency matrices, surpasses existing methods for community detection in multi-layer networks. Finally, the analysis of several real-world multi-layer networks yields meaningful insights.  ( 2 min )
    On Safety in Safe Bayesian Optimization
    arXiv:2403.12948v1 Announce Type: cross Abstract: Optimizing an unknown function under safety constraints is a central task in robotics, biomedical engineering, and many other disciplines, and increasingly safe Bayesian Optimization (BO) is used for this. Due to the safety critical nature of these applications, it is of utmost importance that theoretical safety guarantees for these algorithms translate into the real world. In this work, we investigate three safety-related issues of the popular class of SafeOpt-type algorithms. First, these algorithms critically rely on frequentist uncertainty bounds for Gaussian Process (GP) regression, but concrete implementations typically utilize heuristics that invalidate all safety guarantees. We provide a detailed analysis of this problem and introduce Real-\b{eta}-SafeOpt, a variant of the SafeOpt algorithm that leverages recent GP bounds and thus retains all theoretical guarantees. Second, we identify assuming an upper bound on the reproducing kernel Hilbert space (RKHS) norm of the target function, a key technical assumption in SafeOpt-like algorithms, as a central obstacle to real-world usage. To overcome this challenge, we introduce the Lipschitz-only Safe Bayesian Optimization (LoSBO) algorithm, which guarantees safety without an assumption on the RKHS bound, and empirically show that this algorithm is not only safe, but also exhibits superior performance compared to the state-of-the-art on several function classes. Third, SafeOpt and derived algorithms rely on a discrete search space, making them difficult to apply to higher-dimensional problems. To widen the applicability of these algorithms, we introduce Lipschitz-only GP-UCB (LoS-GP-UCB), a variant of LoSBO applicable to moderately high-dimensional problems, while retaining safety.  ( 3 min )
    Primal Methods for Variational Inequality Problems with Functional Constraints
    arXiv:2403.12859v1 Announce Type: cross Abstract: Constrained variational inequality problems are recognized for their broad applications across various fields including machine learning and operations research. First-order methods have emerged as the standard approach for solving these problems due to their simplicity and scalability. However, they typically rely on projection or linear minimization oracles to navigate the feasible set, which becomes computationally expensive in practical scenarios featuring multiple functional constraints. Existing efforts to tackle such functional constrained variational inequality problems have centered on primal-dual algorithms grounded in the Lagrangian function. These algorithms along with their theoretical analysis often require the existence and prior knowledge of the optimal Lagrange multipliers. In this work, we propose a simple primal method, termed Constrained Gradient Method (CGM), for addressing functional constrained variational inequality problems, without necessitating any information on the optimal Lagrange multipliers. We establish a non-asymptotic convergence analysis of the algorithm for variational inequality problems with monotone operators under smooth constraints. Remarkably, our algorithms match the complexity of projection-based methods in terms of operator queries for both monotone and strongly monotone settings, while utilizing significantly cheaper oracles based on quadratic programming. Furthermore, we provide several numerical examples to evaluate the efficacy of our algorithms.  ( 2 min )
    An Alternative Graphical Lasso Algorithm for Precision Matrices
    arXiv:2403.12357v1 Announce Type: cross Abstract: The Graphical Lasso (GLasso) algorithm is fast and widely used for estimating sparse precision matrices (Friedman et al., 2008). Its central role in the literature of high-dimensional covariance estimation rivals that of Lasso regression for sparse estimation of the mean vector. Some mysteries regarding its optimization target, convergence, positive-definiteness and performance have been unearthed, resolved and presented in Mazumder and Hastie (2011), leading to a new/improved (dual-primal) DP-GLasso. Using a new and slightly different reparametriztion of the last column of a precision matrix we show that the regularized normal log-likelihood naturally decouples into a sum of two easy to minimize convex functions one of which is a Lasso regression problem. This decomposition is the key in developing a transparent, simple iterative block coordinate descent algorithm for computing the GLasso updates with performance comparable to DP-GLasso. In particular, our algorithm has the precision matrix as its optimization target right at the outset, and retains all the favorable properties of the DP-GLasso algorithm.  ( 2 min )
    FedFisher: Leveraging Fisher Information for One-Shot Federated Learning
    arXiv:2403.12329v1 Announce Type: cross Abstract: Standard federated learning (FL) algorithms typically require multiple rounds of communication between the server and the clients, which has several drawbacks, including requiring constant network connectivity, repeated investment of computational resources, and susceptibility to privacy attacks. One-Shot FL is a new paradigm that aims to address this challenge by enabling the server to train a global model in a single round of communication. In this work, we present FedFisher, a novel algorithm for one-shot FL that makes use of Fisher information matrices computed on local client models, motivated by a Bayesian perspective of FL. First, we theoretically analyze FedFisher for two-layer over-parameterized ReLU neural networks and show that the error of our one-shot FedFisher global model becomes vanishingly small as the width of the neural networks and amount of local training at clients increases. Next, we propose practical variants of FedFisher using the diagonal Fisher and K-FAC approximation for the full Fisher and highlight their communication and compute efficiency for FL. Finally, we conduct extensive experiments on various datasets, which show that these variants of FedFisher consistently improve over competing baselines.  ( 2 min )
    Non-negative Contrastive Learning
    arXiv:2403.12459v1 Announce Type: cross Abstract: Deep representations have shown promising performance when transferred to downstream tasks in a black-box manner. Yet, their inherent lack of interpretability remains a significant challenge, as these features are often opaque to human understanding. In this paper, we propose Non-negative Contrastive Learning (NCL), a renaissance of Non-negative Matrix Factorization (NMF) aimed at deriving interpretable features. The power of NCL lies in its enforcement of non-negativity constraints on features, reminiscent of NMF's capability to extract features that align closely with sample clusters. NCL not only aligns mathematically well with an NMF objective but also preserves NMF's interpretability attributes, resulting in a more sparse and disentangled representation compared to standard contrastive learning (CL). Theoretically, we establish guarantees on the identifiability and downstream generalization of NCL. Empirically, we show that these advantages enable NCL to outperform CL significantly on feature disentanglement, feature selection, as well as downstream classification tasks. At last, we show that NCL can be easily extended to other learning scenarios and benefit supervised learning as well. Code is available at https://github.com/PKU-ML/non_neg.  ( 2 min )
    Stochastic Halpern iteration in normed spaces and applications to reinforcement learning
    arXiv:2403.12338v1 Announce Type: cross Abstract: We analyze the oracle complexity of the stochastic Halpern iteration with variance reduction, where we aim to approximate fixed-points of nonexpansive and contractive operators in a normed finite-dimensional space. We show that if the underlying stochastic oracle is with uniformly bounded variance, our method exhibits an overall oracle complexity of $\tilde{O}(\varepsilon^{-5})$, improving recent rates established for the stochastic Krasnoselskii-Mann iteration. Also, we establish a lower bound of $\Omega(\varepsilon^{-3})$, which applies to a wide range of algorithms, including all averaged iterations even with minibatching. Using a suitable modification of our approach, we derive a $O(\varepsilon^{-2}(1-\gamma)^{-3})$ complexity bound in the case in which the operator is a $\gamma$-contraction. As an application, we propose new synchronous algorithms for average reward and discounted reward Markov decision processes. In particular, for the average reward, our method improves on the best-known sample complexity.  ( 2 min )
    The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection
    arXiv:2403.12166v1 Announce Type: cross Abstract: As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly larger models. While this has led to advancements in accuracy, it has also escalated computational costs to unsustainable levels. Addressing this, our work aims to strike a delicate balance between computational efficiency and model accuracy, a persisting challenge in the field. We introduce a novel method that employs core subset selection for reweighting, effectively optimizing both computational time and model performance. By focusing on a strategically selected coreset, our approach offers a robust representation, as it efficiently minimizes the influence of outliers. The re-calibrated weights are then mapped back to and propagated across the entire dataset. Our experimental results substantiate the effectiveness of this approach, underscoring its potential as a scalable and precise solution for model training.  ( 2 min )
    Bayesian Nonparametrics Meets Data-Driven Robust Optimization
    arXiv:2401.15771v3 Announce Type: replace Abstract: Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet Process) theory and recent decision-theoretic models of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, among which Ridge and LASSO regressions. Then, we theoretically demonstrate the existence of favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet Process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to high-dimensional sparse linear regression, binary classification, and robust location parameter estimation tasks.  ( 2 min )
    Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach
    arXiv:2210.12624v2 Announce Type: replace-cross Abstract: Machine learning problems with multiple objective functions appear either in learning with multiple criteria where learning has to make a trade-off between multiple performance metrics such as fairness, safety and accuracy; or, in multi-task learning where multiple tasks are optimized jointly, sharing inductive bias between them. This problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and its variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased noisy gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it can guarantee convergence without increasing the batch size even in the non-convex setting. Simulations on multi-task supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.  ( 2 min )
    Mean-field neural networks-based algorithms for McKean-Vlasov control problems *
    arXiv:2212.11518v2 Announce Type: replace-cross Abstract: This paper is devoted to the numerical resolution of McKean-Vlasov control problems via the class of mean-field neural networks introduced in our companion paper [25] in order to learn the solution on the Wasserstein space. We propose several algorithms either based on dynamic programming with control learning by policy or value iteration, or backward SDE from stochastic maximum principle with global or local loss functions. Extensive numerical results on different examples are presented to illustrate the accuracy of each of our eight algorithms. We discuss and compare the pros and cons of all the tested methods.  ( 2 min )
    Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making
    arXiv:2302.05430v2 Announce Type: replace Abstract: Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.  ( 2 min )
    MCRAGE: Synthetic Healthcare Data for Fairness
    arXiv:2310.18430v2 Announce Type: replace Abstract: In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to samples from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, thereby achieving a more balanced distribution across all classes, which can be used to train an unbiased machine learning model. We measure the performance of MCRAGE versus alternative approaches using Accuracy, F1 score and AUROC. We provide theoretical justification for our method in terms of recent convergence results for DDPMs with minimal assumptions.  ( 3 min )
    Warped geometric information on the optimisation of Euclidean functions
    arXiv:2308.08305v2 Announce Type: replace Abstract: We consider the fundamental task of optimising a real-valued function defined in a potentially high-dimensional Euclidean space, such as the loss function in many machine-learning tasks or the logarithm of the probability distribution in statistical inference. We use Riemannian geometry notions to redefine the optimisation problem of a function on the Euclidean space to a Riemannian manifold with a warped metric, and then find the function's optimum along this manifold. The warped metric chosen for the search domain induces a computational friendly metric-tensor for which optimal search directions associated with geodesic curves on the manifold becomes easier to compute. Performing optimization along geodesics is known to be generally infeasible, yet we show that in this specific manifold we can analytically derive Taylor approximations up to third-order. In general these approximations to the geodesic curve will not lie on the manifold, however we construct suitable retraction maps to pull them back onto the manifold. Therefore, we can efficiently optimize along the approximate geodesic curves. We cover the related theory, describe a practical optimization algorithm and empirically evaluate it on a collection of challenging optimisation benchmarks. Our proposed algorithm, using 3rd-order approximation of geodesics, tends to outperform standard Euclidean gradient-based counterparts in term of number of iterations until convergence.  ( 2 min )
    Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion
    arXiv:2403.12950v1 Announce Type: cross Abstract: In dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined in terms of its suboptimality to a winner arm. The more challenging and practically motivated non-stationary variant of dueling bandits, where preferences change over time, has been the focus of several recent works (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023). The goal is to design algorithms without foreknowledge of the amount of change. The bulk of known results here studies the Condorcet winner setting, where an arm preferred over any other exists at all times. Yet, such a winner may not exist and, to contrast, the Borda version of this problem (which is always well-defined) has received little attention. In this work, we establish the first optimal and adaptive Borda dynamic regret upper bound, which highlights fundamental differences in the learnability of severe non-stationarity between Condorcet vs. Borda regret objectives in dueling bandits. Surprisingly, our techniques for non-stationary Borda dueling bandits also yield improved rates within the Condorcet winner setting, and reveal new preference models where tighter notions of non-stationarity are adaptively learnable. This is accomplished through a novel generalized Borda score framework which unites the Borda and Condorcet problems, thus allowing reduction of Condorcet regret to a Borda-like task. Such a generalization was not previously known and is likely to be of independent interest.  ( 3 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems
    arXiv:2301.11187v2 Announce Type: replace Abstract: The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Tighter Confidence Bounds for Sequential Kernel Regression
    arXiv:2403.12732v1 Announce Type: new Abstract: Confidence bounds are an essential tool for rigorously quantifying the uncertainty of predictions. In this capacity, they can inform the exploration-exploitation trade-off and form a core component in many sequential learning and decision-making algorithms. Tighter confidence bounds give rise to algorithms with better empirical performance and better performance guarantees. In this work, we use martingale tail bounds and finite-dimensional reformulations of infinite-dimensional convex programs to establish new confidence bounds for sequential kernel regression. We prove that our new confidence bounds are always tighter than existing ones in this setting. We apply our confidence bounds to the kernel bandit problem, where future actions depend on the previous history. When our confidence bounds replace existing ones, the KernelUCB (GP-UCB) algorithm has better empirical performance, a matching worst-case performance guarantee and comparable computational cost. Our new confidence bounds can be used as a generic tool to design improved algorithms for other kernelised learning and decision-making problems.  ( 2 min )
    Do Generated Data Always Help Contrastive Learning?
    arXiv:2403.12448v1 Announce Type: cross Abstract: Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-equality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles that stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena via deriving its generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf.  ( 2 min )
    A Practical Guide to Statistical Distances for Evaluating Generative Models in Science
    arXiv:2403.12636v1 Announce Type: cross Abstract: Generative models are invaluable in many fields of science because of their ability to capture high-dimensional and complicated distributions, such as photo-realistic images, protein structures, and connectomes. How do we evaluate the samples these models generate? This work aims to provide an accessible entry point to understanding popular notions of statistical distances, requiring only foundational knowledge in mathematics and statistics. We focus on four commonly used notions of statistical distances representing different methodologies: Using low-dimensional projections (Sliced-Wasserstein; SW), obtaining a distance using classifiers (Classifier Two-Sample Tests; C2ST), using embeddings through kernels (Maximum Mean Discrepancy; MMD), or neural networks (Fr\'echet Inception Distance; FID). We highlight the intuition behind each distance and explain their merits, scalability, complexity, and pitfalls. To demonstrate how these distances are used in practice, we evaluate generative models from different scientific domains, namely a model of decision making and a model generating medical images. We showcase that distinct distances can give different results on similar data. Through this guide, we aim to help researchers to use, interpret, and evaluate statistical distances for generative models in science.  ( 3 min )
    Transfer in Sequential Multi-armed Bandits via Reward Samples
    arXiv:2403.12428v1 Announce Type: cross Abstract: We consider a sequential stochastic multi-armed bandit problem where the agent interacts with bandit over multiple episodes. The reward distribution of the arms remain constant throughout an episode but can change over different episodes. We propose an algorithm based on UCB to transfer the reward samples from the previous episodes and improve the cumulative regret performance over all the episodes. We provide regret analysis and empirical results for our algorithm, which show significant improvement over the standard UCB algorithm without transfer.  ( 2 min )
    Posterior Uncertainty Quantification in Neural Networks using Data Augmentation
    arXiv:2403.12729v1 Announce Type: new Abstract: In this paper, we approach the problem of uncertainty quantification in deep learning through a predictive framework, which captures uncertainty in model parameters by specifying our assumptions about the predictive distribution of unseen future data. Under this view, we show that deep ensembling (Lakshminarayanan et al., 2017) is a fundamentally mis-specified model class, since it assumes that future data are supported on existing observations only -- a situation rarely encountered in practice. To address this limitation, we propose MixupMP, a method that constructs a more realistic predictive distribution using popular data augmentation techniques. MixupMP operates as a drop-in replacement for deep ensembles, where each ensemble member is trained on a random simulation from this predictive distribution. Grounded in the recently-proposed framework of Martingale posteriors (Fong et al., 2023), MixupMP returns samples from an implicitly defined Bayesian posterior. Our empirical analysis showcases that MixupMP achieves superior predictive performance and uncertainty quantification on various image classification datasets, when compared with existing Bayesian and non-Bayesian approaches.  ( 2 min )
    Graph Neural Networks for Learning Equivariant Representations of Neural Networks
    arXiv:2403.12143v1 Announce Type: cross Abstract: Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. The source code is open-sourced at https://github.com/mkofinas/neural-graphs.  ( 2 min )
    Variational Approach for Efficient KL Divergence Estimation in Dirichlet Mixture Models
    arXiv:2403.12158v1 Announce Type: new Abstract: This study tackles the efficient estimation of Kullback-Leibler (KL) Divergence in Dirichlet Mixture Models (DMM), crucial for clustering compositional data. Despite the significance of DMMs, obtaining an analytically tractable solution for KL Divergence has proven elusive. Past approaches relied on computationally demanding Monte Carlo methods, motivating our introduction of a novel variational approach. Our method offers a closed-form solution, significantly enhancing computational efficiency for swift model comparisons and robust estimation evaluations. Validation using real and simulated data showcases its superior efficiency and accuracy over traditional Monte Carlo-based methods, opening new avenues for rapid exploration of diverse DMM models and advancing statistical analyses of compositional data.  ( 2 min )
    Semisupervised score based matching algorithm to evaluate the effect of public health interventions
    arXiv:2403.12367v1 Announce Type: new Abstract: Multivariate matching algorithms "pair" similar study units in an observational study to remove potential bias and confounding effects caused by the absence of randomizations. In one-to-one multivariate matching algorithms, a large number of "pairs" to be matched could mean both the information from a large sample and a large number of tasks, and therefore, to best match the pairs, such a matching algorithm with efficiency and comparatively limited auxiliary matching knowledge provided through a "training" set of paired units by domain experts, is practically intriguing. We proposed a novel one-to-one matching algorithm based on a quadratic score function $S_{\beta}(x_i,x_j)= \beta^T (x_i-x_j)(x_i-x_j)^T \beta$. The weights $\beta$, which can be interpreted as a variable importance measure, are designed to minimize the score difference between paired training units while maximizing the score difference between unpaired training units. Further, in the typical but intricate case where the training set is much smaller than the unpaired set, we propose a \underline{s}emisupervised \underline{c}ompanion \underline{o}ne-\underline{t}o-\underline{o}ne \underline{m}atching \underline{a}lgorithm (SCOTOMA) that makes the best use of the unpaired units. The proposed weight estimator is proved to be consistent when the truth matching criterion is indeed the quadratic score function. When the model assumptions are violated, we demonstrate that the proposed algorithm still outperforms some popular competing matching algorithms through a series of simulations. We applied the proposed algorithm to a real-world study to investigate the effect of in-person schooling on community Covid-19 transmission rate for policy making purpose.  ( 3 min )
    Approximation of RKHS Functionals by Neural Networks
    arXiv:2403.12187v1 Announce Type: new Abstract: Motivated by the abundance of functional data such as time series and images, there has been a growing interest in integrating such data into neural networks and learning maps from function spaces to R (i.e., functionals). In this paper, we study the approximation of functionals on reproducing kernel Hilbert spaces (RKHS's) using neural networks. We establish the universality of the approximation of functionals on the RKHS's. Specifically, we derive explicit error bounds for those induced by inverse multiquadric, Gaussian, and Sobolev kernels. Moreover, we apply our findings to functional regression, proving that neural networks can accurately approximate the regression maps in generalized functional linear models. Existing works on functional learning require integration-type basis function expansions with a set of pre-specified basis functions. By leveraging the interpolating orthogonal projections in RKHS's, our proposed network is much simpler in that we use point evaluations to replace basis function expansions.  ( 2 min )
  • Open

    Why does my basic neural network occasionally fail to "learn" but usually produces an accuracy of ~95% ?
    I am following this simple tutorial, but subbing in this dataset and adapting accordingly. I've got it working (most of the time), and the network can predict whether the URLs are phishing or legitimate with ~95% accuracy (when it works). Occasionally, maybe 1 in 5 times, the weights produced end up being totally useless for prediction. Is this a common issue with neural network training? I can provide code if desired, but the code is nearly identical to what's in the tutorial, except for some changes to the number of input and output neurons and obviously the different treatment of different data. It's like it hickups somewhere in the beginning of the training process, and can't escape the calamity it's found itself in. Has anyone encountered something similar in their work? I'm super new to all this, so apologies if this question is naive. submitted by /u/allthenine [link] [comments]

  • Open

    [Discussion]: Seeking Advice on Automated Tagging for Items in an Ecommerce Marketplace
    Hi everyone, I'm currently working on an ecommerce marketplace app where users can sell items. We want to improve the searchability of items by associating them with relevant tags or keywords. While users can currently provide tags when uploading items, we're exploring ways to automate this process to reduce human error and improve efficiency. Specifically, we're looking for a service or solution that can automatically generate tags or associations for uploaded items. Our goal is to enhance the search experience for users by ensuring that items are tagged accurately and comprehensively. One approach we're considering is categorizing items into predefined categories (e.g., Sports & Outdoors, Electronics, Clothing) during the upload process. However, we recognize that this may not fully address the need for detailed tags and associations. As someone with limited experience in machine learning (ML) and artificial intelligence (AI), I'm seeking advice, recommendations, and insights from the community. Are there existing services or tools that can help automate the tagging process for items in our ecommerce platform? How can we effectively implement this solution while leveraging our existing database infrastructure (Postgres)? Additionally, I'm open to discussing alternative approaches, best practices, and real-world experiences related to automated tagging in ecommerce marketplaces. Any help or guidance would be greatly appreciated. Thank you! submitted by /u/tha_ghost_007 [link] [comments]
    [D] What are you go to resources for learning more about score matching?
    Hello guys! I’m really interested in the topic of score matching, but the papers around it seem so hard to understand. I was just wondering if anyone has some cool resources on this matter. submitted by /u/CapableCheesecake219 [link] [comments]
    [D] In the AlphaFold 2 paper, can someone explain figure 2 for me?
    Paper here: https://www.nature.com/articles/s41586-021-03819-2 Figure 2 here: https://www.nature.com/articles/s41586-021-03819-2/figures/2 In Figure 2, they show a series of charts but I don't understand what is the main point the authors want to make. There's a brief paragraph that mentions its: we observe high side-chain accuracy when the backbone prediction is accurate (Fig. 2b) and we show that our confidence measure, the predicted local-distance difference test (pLDDT), reliably predicts the Cα local-distance difference test (lDDT-Cα) accuracy of the corresponding prediction (Fig. 2c). We also find that the global superposition metric template modelling score (TM-score)27 can be accurately estimated (Fig. 2d). Overall, these analyses validate that the high accuracy and reliabilit…
    [D] Comparing LLM Tokens/Second: Gemma 7 Bn vs. Llama-2 7Bn vs. Mistral 7 Bn ( Independent Analysis)
    Hey everyone! I've been working on a detailed benchmark analysis that explores the performance of three leading Language Learning Models (LLMs): Gemma 7B, Llama-2 7B, and Mistral 7B, across a variety of libraries including Text Generation Inference, vLLM, DeepSpeed Mii, CTranslate2, Triton with vLLM Backend, and TensorRT-LLM. We put these models through their paces, testing: Input tokens ranging from 20 to 5,000. Output tokens set at 100, 200, and 500. Used A100 80GB Machine on Azure Key Findings: Mistral 7B paired with TensorRT-LLM reached the pinnacle of efficiency at 93.63 tokens/sec for configurations of 20 input/200 output tokens, narrowly surpassing vLLM by 5.10%. https://preview.redd.it/ih4di4vysjpc1.png?width=1600&format=png&auto=webp&s=52b215ea899b7c043991087281d598…
    [D] PullRequestBenchmark - Gauging Progress Towards Programming Automation
    Greetings, /r/machinelearning! We're excited to share PullRequestBenchmark, a project aimed at advancing the automation of programming through Large Language Models (LLMs). While it might remind you of SWE-bench at first glance, PullRequestBenchmark uniquely focuses on evaluating LLMs' abilities to review PRs, a critical aspect of software development. This benchmark not only tests decision-making in PR reviews but also hints at the potential for LLMs to autonomously generate complex PRs, possibly redefining traditional programming roles. Our approach includes assessing LLMs against a wide range of real-world PR scenarios, from minor adjustments to major architectural changes, using comprehensive inputs such as the entire Git history, PR titles, descriptions, and changesets. We believe that PullRequestBenchmark marks a significant step towards fully automating programming. Your contributions to expanding this benchmark are vital and warmly welcomed. For more details on how to contribute and what distinguishes PullRequestBenchmark from SWE-bench, visit our GitHub repository: PullRequestBenchmark We're eager to see how this community can help drive the project forward. For inquiries or suggestions, don't hesitate to reach out! submitted by /u/gvatte [link] [comments]
    [D] AWS SageMaker warm start latencies: serverless vs real-time
    Hello! I'm looking into deploying a ResNet-like computer vision model on AWS SageMaker and deciding between the serverless and real-time options. The inference workload will be bursty, and I can predict the burst times in advance. I do need low latency, however. Reading (some of!) the docs for Serverless vs Real-Time makes it clear that cold starts are an issue for serverless (though this isn't an issue as I can schedule capacity in advance, since I can predict the workload volume) and makes vague references to real-time being better for low-latency applications, but doesn't exactly specify or quantify the difference. Has anyone had experience with serverless vs real-time, and has an opinion about how much slower serverless inference is *from a warm start* compared to real-time? Thank you! submitted by /u/kilgore_teh_trout [link] [comments]
    [D] NLP - Sequence Classification/labelling model
    Hi peeps! I have an NLP project where I have to create a sequence classification model from a given dataset. I have to test it under different environments like by comparing different algorithms for one experiment and comparing vectorization methods for another etc. Do you guys have any good leads where I can start from? I am still trying to understand the whole cycle of nlp so i'd really appreciate it. P.S: I have 4 other modules and dissertation so I cannot give all my time to this project. I'd appreciate to the point and quick sources. submitted by /u/friendsbase [link] [comments]
    [D] State of the art of Document Parsing OCR
    Hello, I want to build a model to parse some documents. However, I do not find what is the current SOTA for OCR + Parsing. I currently found Donut - LayoutLM - PaddleOCR... But I didn't understand which is the best currently, or if there are better/newer models. I'm working with french documents. Thank you for advance if you can help me ! submitted by /u/ez613 [link] [comments]
    [D] SMV for identifying 'quality content'?
    Could a Support Vector Machine be used to identify 'quality content' for SEO purposes. Asking for a friend... submitted by /u/Downtown_Quality_322 [link] [comments]
    [D] Peaks on accuracy and loss graph
    Hello, im training a model with a dataset of around 1000 images and 4 classes, and as you can see, there are a lot of peaks, what does this mean? The model has transfer learning of a VGG16 model and achieved a good accuracy with other datasets, is it the size of the data-set the only reason for the peaks or there could be another factor for this to happen? https://preview.redd.it/yn3axu34djpc1.png?width=1189&format=png&auto=webp&s=8420d6c4ef00e0f39caf2a813d3f8fe974ab071c submitted by /u/Icy_Dependent9199 [link] [comments]
    [D] What methods/tools do you use in your team to fine-tune LLMs for your domain?
    How do you evaluate your data quality, and domain-specific performance for your custom LLM, and fine-tune for best outputs? submitted by /u/metalvendetta [link] [comments]
    [D] Deanonymized paper accepted at ICLR 2024
    I want to start a discussion about a particular case at this year's ICLR that I have observed. - A paper is submitted to ICLR double-blind review with the full name of the first author and full acknowledgements (with several high-profile references) in the main paper body, right above the bibliography. The authors explicitly confirm the false statement "No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review." at submission time. - The four reviewers and the AC avoid mentioning this and review the paper with the bias of knowing its authors, as if it does not violate the basic submission rules listed in the call for papers. - The decision released for the paper mid-January is "Desk Reject" (presumably due to the above violation, confirming that the PCs are aware of it). This is not publicly archived. - This is changed early February to "Oral" with no further justification, for unclear reasons. An anonymity rule violation is first ignored by the reviewers and then silently allowed via an exception by the program committee, presumably after a direct complaint. In my opinion, this should be unacceptable for any conference, but especially a top conference in the field like ICLR. Why do we have the double-blind process if we do not enforce it and allow some authors to opt for revealing their names and bias the reviewers? Would the same exception be made if the authors were not prominent, or if they do research in a less privileged place, or come from a marginalized community? What about other desk rejected papers at this year's ICLR that did not get the opportunity for an exception? The issues around bias and integrity in academia are in my experience often raised in private conversations but almost never openly discussed, so I am interested in hearing what the community thinks of this case. The program chairs have declined to comment on the final decision. submitted by /u/Melodic-Foundation47 [link] [comments]
    [D] Reuse documents vectors for a standard search bar, good idea or not?
    Hi, I embed JSON objects, then I do RAG for a virtual assistant and getting N top relevant objects to put into the context so the LLM can determine if it's relevant or not and use the information. So even if the N top are not a perfect match, the LLM is smart enough to not mention them. The point is, I thought about reusing the vectors for a directory search since almost everything is already setup (experimenting, so maybe that's a bad idea), so it could return XX items. I put a fixed threshold on the matching score (distance) to be sure not having weird things listed (I adjusted the value after multiple attempts), but depending on the user input (how many words...), the threshold sometimes cuts items that should be part of the list, and sometimes it lets all bad items pass. Is this use case viable? I'm wondering if some of you have experimented a flex threshold? Let's say as distance I have 2 cases: For a Query "A", [0.1, 0.2, 0.9] as distances with a threshold of 0.3 : I get the 2 things expected For a Query "B", [0.1, 0.28, 0.29, 0.3] with still a threshold of 0.3 : unfortunately due to variation in the query, this time the threshold is not good and all the `0.28, 0.29, 0.3` are just garbage. Is there any known technique to say "if there is a significant gap" between distances, probably the threshold should be fixed there? Sorry if it's unclear. In any case I can still use a basic DB search on the original document content (but thought it would be more powerful than the DB vectors...). Thank you, submitted by /u/sneko7 [link] [comments]
    [D] How much will Nvidia's newest Blackwell GPU's cut down training and inference time/price?
    Apparently it cost ~$5M to train Llama-2. How much do you think these new GPUs will reduce training/inference cost for similar model? I'm more curious than anything, since I will most likely grab an Apple product soon. submitted by /u/dittospin [link] [comments]
    [Project] Sparse Mixture of Experts Language Model from Scratch in less than 300 lines of Python + pytorch
    Hi all,I implemented a sparse mixture of experts language model (basically a tiny version of what's used in Mixtral, Grok-1 and supposedly GPT-4) from scratch in pure pytorch and trained it on tiny Shakespeare. This is based largely on makemore from Andrej Karpathy (an autoregressive character-level decoder only transformer model). My goal is for this to be a hackable implementation that people use to understand how this really works and improve upon. I foresee more and more of these models coming out throughout the year. The repo is here: https://github.com/AviSoori1x/makeMoE I created this a couple of months ago and shared on Localllama but figured I'd share here as well, since I made some updates like adding expert capacity and consolidating the entire implementation to less than 300 lines of readable python + pytorch. The blog that steps through this is here: https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch. Hope this is helpful! submitted by /u/avi1x [link] [comments]
    [D]Searching for a GUI for LLMs APIs (openrouter, openai et simila), with plug-ins and RAG support.
    hey everyone, I'm searching for a user-friendly GUI that supports OpenAI's ChatGPT API (or compatible like OpenRouter) and allows for plugins and RAG. I use both opens source models (mixtral, Hermes 34b) and private models, like claude-opus and perplexity-sonnet someone here have some tips? thank you in advance for your help! edit: Im wondering if exist any platforms that offer the ability to modify chat history or run code based on chat events. submitted by /u/Distinct-Target7503 [link] [comments]
    [D] 0 Reviews submitted for ICML 2024 - someone else facing this?
    Hi all, I submitted a paper to ICML 2024. The author rebuttal phase should start tomorrow, but till today it seems, no reviewer has submitted a review yet. Anyone facing a similat issue? How will it proceed if no one reviews? submitted by /u/Tigmib [link] [comments]
    Fine tune multiple LLM [D]
    Is there any framework which can fine tune different llm like llama2, mxtral, titan and getting the output for test data. Provided I have training and test data. Basically some framework that can help me fine tune multiple model quickly submitted by /u/IntentionNo5258 [link] [comments]
    [D] Which Vectorizer Fits the best?
    I have a dataset of Dutch texts. Am trying to align the texts to elements of a taxonomy using descriptions of such elements. I want to exploit cosine similarity. Which is the best model to embed texts in this case? I'm thinking of paraphrase-multilingual-mpnet-base-v2 from SentenceTransformers given that it's multingual and its impressive performances. Also I'm dealing with texts that will exceed the maximum number of tokens in input, what are the best practices in this case to avoid truncation? submitted by /u/purpleoptimum [link] [comments]
    [D] Why the readability of academic papers are continuously bad?
    Some of these expectations are inevitable. Papers absolutely have to assume extensive background knowledge and vocabulary specific to the field—including a basic introduction to the field with each paper would be excessive. Sometimes papers assume too much background, but that's hard to judge as a writer or a reader, not least because the set of things people know changes over time and in different contexts. Reference During the past 3 years, to broaden my understanding of ML field, I have read a lot of papers without considering focusing on a specific problem in a specific field. This leads to allowing me to understand the structure of an academic writing, basic concepts of ML, and relatively precious insights compared to previous knowledge in the past, but when I encounter new papers, some papers are still verbose, annoying me due to the readability of nature of academic papers. This strongly requires me to skim through some contents, which is harmful for purely concentrating on reading documents compared to non-academic books (I don't assume that all of the books are great like personal best seller, but I want to emphasize the difference of the structure between academic and non-academic writing.) The question is, why this convention cannot be smoothly changed for the better direction, which is helpful for existed and newer researchers? submitted by /u/Mundane_Definition_8 [link] [comments]
    [Discussion] GNN applications
    I’ve heard about the fairly standard ways that graph neural networks are applied (chem, traffic, social networks), but they are such a cool class of models, what applications of GNNs are you familiar with that aren’t the standard ones mentioned above? submitted by /u/Radiant_Walrus3007 [link] [comments]
    [discussion] Why increase channel size in CNN layers?
    In most CNN architecture I see, they increase the channels from 1 or 3 to like 16, 32 or even 64. Isn’t the point to decrease the dimensionality? submitted by /u/Radiant_Walrus3007 [link] [comments]
    [D] Is it common for recent "LLM engineers" to not have a background in NLP?
    The past few weeks I've attended a few Meetups and networking events where I met a lot of people claiming they "work with LLMs." I personally don't have that much experience with them and have done research in more "classic" NLP (ELMo and BERT were big announcements when I was doing research) and have now been in industry working mostly as an engineer. I noticed very often that when I try to talk about connections between LLM research patterns or applications and those I dubbed classical approaches people often don't seem to know what I'm talking about. I'm not talking about researchers, obviously if you're doing actual research with LLMs I'm assuming that you've been in the field for a while. These days it just seems like LLM and NLP are being treated separately. Curious what others think. submitted by /u/Seankala [link] [comments]
    [D] Conv2D on 3D point clouds
    Hi! I'm working with point cloud regression for pose prediction. The original implementations of PointNet, PointNet++, DGCNN, etc, use Conv2D operations in every convolutional layer. My question is: why use 2D convolutions instead of 1D convolutions in this case? What are the reasons and advantages for doing that? Thanks in advance! submitted by /u/Professional-Act-163 [link] [comments]
  • Open

    Computer-aided diagnosis for lung cancer screening
    Posted by Atilla Kiraly, Software Engineer, and Rory Pilgrim, Product Manager, Google Research Lung cancer is the leading cause of cancer-related deaths globally with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans. The United States Preventive Services Task Force recently expanded lung cancer screening recommendations by roughly 80%, which is expected to increase screening access for women and ra…  ( 93 min )
    Using AI to expand global access to reliable flood forecasts
    Posted by Yossi Matias, VP Engineering & Research, and Grey Nearing, Research Scientist, Google Research Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-related disasters has more than doubled since the year 2000 partly due to climate change. Nearly 1.5 billion people, making up 19% of the world’s population, are exposed to substantial risks from severe flood events. Upgrading early warning systems to make accurate and timely information accessible to these populations can save thousands of lives per year. Driven by the potential impact of reliable flood forecasting on people’s lives globally, we started our flood forecasting effort in 2017. Through this multi-year journey, we advanced…  ( 94 min )
  • Open

    Reset Function in RL : Initial State or Current State?
    Hey everyone, I'm new to Reinforcement Learning. I'm currently working on implementing a RL environment to control the temperature of a room using a thermostat. I'm at a crossroads regarding the reset function and would love some insights. The environment represents a room with a thermostat, where the goal is to maintain the temperature at a desired setpoint while minimizing energy consumption. The state of the environment is represented by the current temperature and the energy consumption, and actions include adjusting the thermostat settings to increase or decrease the temperature setpoint. I thought of 2 approaches : Reset to Initial State: This option involves resetting the environment to its initial configuration at the beginning of each episode, where the temperature is set to a predefined starting value. (Setting to the minimal state). Reset to Current State: Alternatively, the reset function could return the environment to the current state of the room.. I'm particularly drawn to the idea of resetting to the current state because it aligns with my approach in the agent's decision-making process. In each step, when the agent chooses an action, I'm planning to check if the agent can perform that action based on the current state represented in the state vector then give positive reward, else give negative reward. Question: Given this environment, which approach do you think would be more suitable for the reset function in my RL algorithm? Are there any additional considerations I should take into account? Thanks in advance for your help! submitted by /u/Few-Papaya101 [link] [comments]
    Struggling with duelling+c51, need a second pair of eyes.
    Dear RL Community, After trying for many many hours to implement c51+dueling in my framework I am finally ready to give up. As my last ditch effort I am posting this in hopes some genius can help me find a mistake in my algorithm. I'm trying to make a DRL library in C# with TorchSharp and I've been trying to also offer Rainbow. It turns out its really difficult (or I am not ready yet) Long story short my loss doesn't decrease and I am not sure where am I making mistake with implementing the algo, I tried many things and can't figure it out. Could someone kindly have a look and try to spot my blunder? My code here: Network: https://github.com/asieradzk/RL_Matrix/blob/master/src/RLMatrix/Agents/DQN/NN/Variants/DuelingDNQ_C51_Noisy.cs Agent: https://github.com/asieradzk/RL_Matrix/blob/master/src/RLMatrix/Agents/DQN/Variants/DQNRainbow.cs Python translation for those who hate C#: https://github.com/asieradzk/RL_Matrix/blob/master/src/RLMatrix/Agents/DQN/Variants/PythonRainbowTranslation.txt Moral support also welcome. submitted by /u/DotNetEvangeliser [link] [comments]
    DQN from SB3 doesn't work
    I have a custom boid flocking environment and have implemented DQN using SB3. However, it keeps running and outputs nothing. Can't seem to figure out the bug. Code: https://drive.google.com/drive/folders/1zoQSrLOVO13TBGtoJhkg5LwVoKEmM2gT?usp=sharing submitted by /u/Sadboi1010 [link] [comments]
    Trying to implement crossQ in PyTorch does not work
    Hello, I am trying to implement: https://openreview.net/pdf?id=PczQtTsTIX And can't seem to get good results. I have tried it in Ant-v4 With the following change in my SAC code in the update bellman equation: def update_critic(self, state_batch: torch.tensor, action_batch: torch.tensor, reward_batch: torch.tensor, next_state_batch: torch.tensor, mask_batch: torch.tensor ) -> torch.tensor: """ Updates critic with Soft Bellman equation :param state_batch: the state batch extracted from memory :param action_batch: the action batch extracted from memory :param reward_batch: the reward batch extracted from memory :param next_state_batch: the next state batch batch extracted from memory :param mask_batch: mask of dones :return: float loss for Q1, float loss for Q2 """ next_state_action_batch…
  • Open

    Semi-Supervised and Self-Supervised Learning
    submitted by /u/Neurosymbolic [link] [comments]
  • Open

    Google DeepMind Reveals TacticAI - an AI tool for analyzing soccer tactics
    Google recently launched TacticAI, an AI system aimed at providing professional tips, especially for corner kicks, showcasing how technology can transform sports. Is Google adopting an aggressive strategy with new model releases? Just a few days ago, they introduced Vlogger, and now TacticAI. What do you think? submitted by /u/ImpressiveContest283 [link] [comments]
    What would be the best platform/llm to output an entire book?
    Let's say I wanted to write a book - my memoirs for example. My idea is to record my stories and insights by voice and have an ai transcribe the text. Probably this would be done in small pieces day by day, so I want the AI to collate the trasnscibed data and when finished rewrite everything into a finished product in book format. What I've seen is ChatGPT is highly restricted in its output terms, getting any more than a few hundred words is difficult. Would Gemini or Claude be any better at inputting large amounts of banks of data and collating all together into one single large output? Any insights on how this can be done? submitted by /u/zascar [link] [comments]
    Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images — Stability AI
    submitted by /u/tedbarney12 [link] [comments]
    Perplexity AI, a hyped Silicon Valley AI startup that claimed to take on Google, was found out copying Google results directly
    submitted by /u/Healthy_Moment_1804 [link] [comments]
    Looking for a Live Translation app that runs constantly across my screen?
    Lately, i've been harrowed by many Japanese games that look very fun, but don't have english translations, nor patches. Google Lens and other apps let you translate things by pointing a camera at the screen or taking screenshots, but you can imagine this is extremely inconvenient for video games. i'm thinking a live translation app that runs in the background, scans the screen for text, then translates said text and formats new translated text over it. i'm sure AI technology will be able to do this eventually, but is there anything along the lines of this now? i'm still new to AI so forgive me if i sound like a newb submitted by /u/Caesar_The_Mighty [link] [comments]
    One-Minute Daily AI News 3/19/2024
    DeepMind co-founder Mustafa Suleyman will become the head of a newly created consumer AI unit at Microsoft, the scientist and entrepreneur said on Tuesday.[1] Nvidia’s Jensen Huang says AI hallucinations are solvable, artificial general intelligence is 5 years away.[2] Saudi Arabia Plans $40 Billion Push Into Artificial Intelligence.[3] Today at GDC, Roblox unveiled a pair of AI technologies aiming to streamline 3D modeling for the platform’s boundless legion of content creators.[4] Sources: [1] https://www.nbc39.com/news/national/microsoft-hires-deepmind-co-founder-to-lead-ai-unit/article_7a920643-ad5e-52e8-8f56-fd81ae785d2b.html [2] https://techcrunch.com/2024/03/19/agi-and-hallucinations/ [3] https://www.nytimes.com/2024/03/19/business/saudi-arabia-investment-artificial-intelligence.html [4] https://www.pcgamer.com/software/platforms/roblox-reveals-new-ai-powered-3d-modeling-tools-to-empower-the-creation-of-anything-anywhere-by-anyone-which-were-probably-not-meant-to-read-too-closely/ submitted by /u/Excellent-Target-847 [link] [comments]
    How AI can save one industry in the USA about 135 billion+ dollars per year.
    ​ One industry that has slowly being replaced with AI is customer service. Most current chatbots don't use a llm or at least a very primitive one, but a locally integrated GPT or llama engine that gets regular updates with a local business knowledge base would save a ton of money. There are about 3 million customer service jobs in the united states alone, which presents an opportunity of currently about 135 billion dollars a year in potential savings. This is just employee salary alone, but there are a whole breadth of expenses estimated around 200 billion total if you include taxes, health premiums etc that will be saved. One company alone, Charter communications, has about 100000 employees, and at about 45k a year salary on average, we're looking at about a savings of 1 billion alone …
  • Open

    Research Focus: Week of March 18, 2024
    Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Large language models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. In a recent paper: Fewer is More: Boosting LLM Reasoning with Reinforced Context […] The post Research Focus: Week of March 18, 2024 appeared first on Microsoft Research.  ( 11 min )
  • Open

    AI Decoded From GTC: The Latest Developer Tools and Apps Accelerating AI on PC and Workstation
    NVIDIA’s RTX AI platform includes tools and software development kits that help Windows developers create cutting-edge generative AI features to deliver the best performance on AI PCs and workstations.  ( 9 min )
  • Open

    MOFI: Learning Image Representations from Noisy Entity Annotated Images
    arXiv:2306.07952v3 Announce Type: replace-cross Abstract: We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. It's a simple, cost-effective method that can scale to handle billions of web-mined image-text pairs. Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes like supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations. We release our code and model weights at https://github.com/apple/ml-mofi.  ( 3 min )
    Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation
    arXiv:2311.16199v2 Announce Type: replace Abstract: We present Symphony, an $E(3)$-equivariant autoregressive generative model for 3D molecular geometries that iteratively builds a molecule from molecular fragments. Existing autoregressive models such as G-SchNet and G-SphereNet for molecules utilize rotationally invariant features to respect the 3D symmetries of molecules. In contrast, Symphony uses message-passing with higher-degree $E(3)$-equivariant features. This allows a novel representation of probability distributions via spherical harmonic signals to efficiently model the 3D geometry of molecules. We show that Symphony is able to accurately generate small molecules from the QM9 dataset, outperforming existing autoregressive models and approaching the performance of diffusion models.  ( 2 min )
    Skill Machines: Temporal Logic Skill Composition in Reinforcement Learning
    arXiv:2205.12532v2 Announce Type: replace Abstract: It is desirable for an agent to be able to solve a rich variety of problems that can be specified through language in the same environment. A popular approach towards obtaining such agents is to reuse skills learned in prior tasks to generalise compositionally to new ones. However, this is a challenging problem due to the curse of dimensionality induced by the combinatorially large number of ways high-level goals can be combined both logically and temporally in language. To address this problem, we propose a framework where an agent first learns a sufficient set of skill primitives to achieve all high-level goals in its environment. The agent can then flexibly compose them both logically and temporally to provably achieve temporal logic specifications in any regular language, such as regular fragments of linear temporal logic. This provides the agent with the ability to map from complex temporal logic task specifications to near-optimal behaviours zero-shot. We demonstrate this experimentally in a tabular setting, as well as in a high-dimensional video game and continuous control environment. Finally, we also demonstrate that the performance of skill machines can be improved with regular off-policy reinforcement learning algorithms when optimal behaviours are desired.  ( 3 min )
    Predicting the long-term collective behaviour of fish pairs with deep learning
    arXiv:2302.06839v2 Announce Type: replace Abstract: Modern computing has enhanced our understanding of how social interactions shape collective behaviour in animal societies. Although analytical models dominate in studying collective behaviour, this study introduces a deep learning model to assess social interactions in the fish species Hemigrammus rhodostomus. We compare the results of our deep learning approach to experiments and to the results of a state-of-the-art analytical model. To that end, we propose a systematic methodology to assess the faithfulness of a collective motion model, exploiting a set of stringent individual and collective spatio-temporal observables. We demonstrate that machine learning models of social interactions can directly compete with their analytical counterparts in reproducing subtle experimental observables. Moreover, this work emphasises the need for consistent validation across different timescales, and identifies key design aspects that enable our deep learning approach to capture both short- and long-term dynamics. We also show that our approach can be extended to larger groups without any retraining, and to other fish species, while retaining the same architecture of the deep learning network. Finally, we discuss the added value of machine learning in the context of the study of collective motion in animal groups and its potential as a complementary approach to analytical models.  ( 3 min )
    Compressing LLMs: The Truth is Rarely Pure and Never Simple
    arXiv:2310.01382v2 Announce Type: replace-cross Abstract: Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50 - 60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts and perplexity fail to capture subtle change in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity in knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at $\geq 50$% sparsity are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically access compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. The reproduced codes are available at https://github.com/VITA-Group/llm-kick.  ( 3 min )
    Towards Robust Fidelity for Evaluating Explainability of Graph Neural Networks
    arXiv:2310.01820v2 Announce Type: replace Abstract: Graph Neural Networks (GNNs) are neural models that leverage the dependency structure in graphical data via message passing among the graph nodes. GNNs have emerged as pivotal architectures in analyzing graph-structured data, and their expansive application in sensitive domains requires a comprehensive understanding of their decision-making processes -- necessitating a framework for GNN explainability. An explanation function for GNNs takes a pre-trained GNN along with a graph as input, to produce a `sufficient statistic' subgraph with respect to the graph label. A main challenge in studying GNN explainability is to provide fidelity measures that evaluate the performance of these explanation functions. This paper studies this foundational challenge, spotlighting the inherent limitations of prevailing fidelity metrics, including $Fid_+$, $Fid_-$, and $Fid_\Delta$. Specifically, a formal, information-theoretic definition of explainability is introduced and it is shown that existing metrics often fail to align with this definition across various statistical scenarios. The reason is due to potential distribution shifts when subgraphs are removed in computing these fidelity measures. Subsequently, a robust class of fidelity measures are introduced, and it is shown analytically that they are resilient to distribution shift issues and are applicable in a wide range of scenarios. Extensive empirical analysis on both synthetic and real datasets are provided to illustrate that the proposed metrics are more coherent with gold standard metrics. The source code is available at https://trustai4s-lab.github.io/fidelity.  ( 3 min )
    Hybrid-Task Meta-Learning: A Graph Neural Network Approach for Scalable and Transferable Bandwidth Allocation
    arXiv:2401.10253v2 Announce Type: replace-cross Abstract: In this paper, we develop a deep learning-based bandwidth allocation policy that is: 1) scalable with the number of users and 2) transferable to different communication scenarios, such as non-stationary wireless channels, different quality-of-service (QoS) requirements, and dynamically available resources. To support scalability, the bandwidth allocation policy is represented by a graph neural network (GNN), with which the number of training parameters does not change with the number of users. To enable the generalization of the GNN, we develop a hybrid-task meta-learning (HML) algorithm that trains the initial parameters of the GNN with different communication scenarios during meta-training. Next, during meta-testing, a few samples are used to fine-tune the GNN with unseen communication scenarios. Simulation results demonstrate that our HML approach can improve the initial performance by $8.79\%$, and sampling efficiency by $73\%$, compared with existing benchmarks. After fine-tuning, our near-optimal GNN-based policy can achieve close to the same reward with much lower inference complexity compared to the optimal policy obtained using iterative optimization.  ( 2 min )
    Enhancing Predictive Capabilities in Data-Driven Dynamical Modeling with Automatic Differentiation: Koopman and Neural ODE Approaches
    arXiv:2310.06790v2 Announce Type: replace Abstract: Data-driven approximations of the Koopman operator are promising for predicting the time evolution of systems characterized by complex dynamics. Among these methods, the approach known as extended dynamic mode decomposition with dictionary learning (EDMD-DL) has garnered significant attention. Here we present a modification of EDMD-DL that concurrently determines both the dictionary of observables and the corresponding approximation of the Koopman operator. This innovation leverages automatic differentiation to facilitate gradient descent computations through the pseudoinverse. We also address the performance of several alternative methodologies. We assess a 'pure' Koopman approach, which involves the direct time-integration of a linear, high-dimensional system governing the dynamics within the space of observables. Additionally, we explore a modified approach where the system alternates between spaces of states and observables at each time step -- this approach no longer satisfies the linearity of the true Koopman operator representation. For further comparisons, we also apply a state space approach (neural ODEs). We consider systems encompassing two and three-dimensional ordinary differential equation systems featuring steady, oscillatory, and chaotic attractors, as well as partial differential equations exhibiting increasingly complex and intricate behaviors. Our framework significantly outperforms EDMD-DL. Furthermore, the state space approach offers superior performance compared to the 'pure' Koopman approach where the entire time evolution occurs in the space of observables. When the temporal evolution of the Koopman approach alternates between states and observables at each time step, however, its predictions become comparable to those of the state space approach.  ( 3 min )
    DeforestVis: Behavior Analysis of Machine Learning Models with Surrogate Decision Stumps
    arXiv:2304.00133v4 Announce Type: replace Abstract: As the complexity of machine learning (ML) models increases and their application in different (and critical) domains grows, there is a strong demand for more interpretable and trustworthy ML. A direct, model-agnostic, way to interpret such models is to train surrogate models-such as rule sets and decision trees-that sufficiently approximate the original ones while being simpler and easier-to-explain. Yet, rule sets can become very lengthy, with many if-else statements, and decision tree depth grows rapidly when accurately emulating complex ML models. In such cases, both approaches can fail to meet their core goal-providing users with model interpretability. To tackle this, we propose DeforestVis, a visual analytics tool that offers summarization of the behaviour of complex ML models by providing surrogate decision stumps (one-level decision trees) generated with the Adaptive Boosting (AdaBoost) technique. DeforestVis helps users to explore the complexity versus fidelity trade-off by incrementally generating more stumps, creating attribute-based explanations with weighted stumps to justify decision making, and analysing the impact of rule overriding on training instance allocation between one or more stumps. An independent test set allows users to monitor the effectiveness of manual rule changes and form hypotheses based on case-by-case analyses. We show the applicability and usefulness of DeforestVis with two use cases and expert interviews with data analysts and model developers.  ( 3 min )
    From Words and Exercises to Wellness: Farsi Chatbot for Self-Attachment Technique
    arXiv:2310.09362v2 Announce Type: replace-cross Abstract: In the wake of the post-pandemic era, marked by social isolation and surging rates of depression and anxiety, conversational agents based on digital psychotherapy can play an influential role compared to traditional therapy sessions. In this work, we develop a voice-capable chatbot in Farsi to guide users through Self-Attachment (SAT), a novel, self-administered, holistic psychological technique based on attachment theory. Our chatbot uses a dynamic array of rule-based and classification-based modules to comprehend user input throughout the conversation and navigates a dialogue flowchart accordingly, recommending appropriate SAT exercises that depend on the user's emotional and mental state. In particular, we collect a dataset of over 6,000 utterances and develop a novel sentiment-analysis module that classifies user sentiment into 12 classes, with accuracy above 92%. To keep the conversation novel and engaging, the chatbot's responses are retrieved from a large dataset of utterances created with the aid of Farsi GPT-2 and a reinforcement learning approach, thus requiring minimal human annotation. Our chatbot also offers a question-answering module, called SAT Teacher, to answer users' questions about the principles of Self-Attachment. Finally, we design a cross-platform application as the bot's user interface. We evaluate our platform in a ten-day human study with N=52 volunteers from the non-clinical population, who have had over 2,000 dialogues in total with the chatbot. The results indicate that the platform was engaging to most users (75%), 72% felt better after the interactions, and 74% were satisfied with the SAT Teacher's performance.  ( 3 min )
    Deep Neural Networks Tend To Extrapolate Predictably
    arXiv:2310.00873v2 Announce Type: replace Abstract: Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work reassesses this assumption for neural networks with high-dimensional inputs. Rather than extrapolating in arbitrary ways, we observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD. Moreover, we find that this value often closely approximates the optimal constant solution (OCS), i.e., the prediction that minimizes the average loss over the training data without observing the input. We present results showing this phenomenon across 8 datasets with different distributional shifts (including CIFAR10-C and ImageNet-R, S), different loss functions (cross entropy, MSE, and Gaussian NLL), and different architectures (CNNs and transformers). Furthermore, we present an explanation for this behavior, which we first validate empirically and then study theoretically in a simplified setting involving deep homogeneous networks with ReLU activations. Finally, we show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.  ( 2 min )
    Is Mamba Effective for Time Series Forecasting?
    arXiv:2403.11144v1 Announce Type: new Abstract: In the realm of time series forecasting (TSF), the Transformer has consistently demonstrated robust performance due to its ability to focus on the global context and effectively capture long-range dependencies within time, as well as discern correlations between multiple variables. However, due to the inefficiencies of the Transformer model and questions surrounding its ability to capture dependencies, ongoing efforts to refine the Transformer architecture persist. Recently, state space models (SSMs), e.g. Mamba, have gained traction due to their ability to capture complex dependencies in sequences, similar to the Transformer, while maintaining near-linear complexity. In text and image tasks, Mamba-based models can improve performance and cost savings, creating a win-win situation. This has piqued our interest in exploring SSM's potential in TSF tasks. In this paper, we introduce two straightforward SSM-based models for TSF, S-Mamba and D-Mamba, both employing the Mamba Block to extract variate correlations. Remarkably, S-Mamba and D-Mamba achieve superior performance while saving GPU memory and training time. Furthermore, we conduct extensive experiments to delve deeper into the potential of Mamba compared to the Transformer in the TSF, aiming to explore a new research direction for this field. Our code is available at https://github.com/wzhwzhwzh0921/S-D-Mamba.  ( 2 min )
    Investigating the Benefits of Projection Head for Representation Learning
    arXiv:2403.11391v1 Announce Type: new Abstract: An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.  ( 2 min )
    Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes
    arXiv:2310.16597v3 Announce Type: replace-cross Abstract: The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that allows a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call PSEUDO-IID), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training. Moreover, they enable the posterior distribution of Bayesian Neural Networks to be tractable across these various initialization schemes.  ( 2 min )
    Representation Deficiency in Masked Language Modeling
    arXiv:2302.02060v2 Announce Type: replace-cross Abstract: Masked Language Modeling (MLM) has been one of the most prominent approaches for pretraining bidirectional text encoders due to its simplicity and effectiveness. One notable concern about MLM is that the special $\texttt{[MASK]}$ symbol causes a discrepancy between pretraining data and downstream data as it is present only in pretraining but not in fine-tuning. In this work, we offer a new perspective on the consequence of such a discrepancy: We demonstrate empirically and theoretically that MLM pretraining allocates some model dimensions exclusively for representing $\texttt{[MASK]}$ tokens, resulting in a representation deficiency for real tokens and limiting the pretrained model's expressiveness when it is adapted to downstream data without $\texttt{[MASK]}$ tokens. Motivated by the identified issue, we propose MAE-LM, which pretrains the Masked Autoencoder architecture with MLM where $\texttt{[MASK]}$ tokens are excluded from the encoder. Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.  ( 2 min )
    Modular Neural Networks for Time Series Forecasting: Interpretability and Feature Selection using Attention
    arXiv:2311.16834v3 Announce Type: replace Abstract: Multivariate time series have many applications, from healthcare and meteorology to life science. Although deep learning models have shown excellent predictive performance for time series, they have been criticised for being "black-boxes" or non-interpretable. This paper proposes a novel modular neural network model for multivariate time series prediction that is interpretable by construction. A recurrent neural network learns the temporal dependencies in the data while an attention-based feature selection component selects the most relevant features and suppresses redundant features used in the learning of the temporal dependencies. A modular deep network is trained from the selected features independently to show the users how features influence outcomes, making the model interpretable. Experimental results show that this approach can outperform state-of-the-art interpretable Neural Additive Models (NAM) and variations thereof in both regression and classification of time series tasks, achieving a predictive performance that is comparable to the top non-interpretable methods for time series, LSTM and XGBoost.  ( 2 min )
    Solvent-Aware 2D NMR Prediction: Leveraging Multi-Tasking Training and Iterative Self-Training Strategies
    arXiv:2403.11353v1 Announce Type: new Abstract: Nuclear magnetic resonance (NMR) spectroscopy plays a pivotal role in various scientific fields, offering insights into structural information, electronic properties and dynamic behaviors of molecules. Accurate NMR spectrum prediction efficiently produces candidate molecules, enabling chemists to compare them with actual experimental spectra. This process aids in confirming molecular structures or pinpointing discrepancies, guiding further investigation. Machine Learning (ML) has then emerged as a promising alternative approach for predicting atomic NMR chemical shits of molecules given their structures. Although significant progresses have been made in predicting one-dimensional (1D) NMR, two-dimensional (2D) NMR prediction via ML remains a challenge due to the lack of annotated NMR training datasets. To address this gap, we propose an iterative self-training (IST) approach to train a deep learning model for predicting atomic 2DNMR shifts and assigning peaks in experimental spectra. Our model undergoes an initial pre-training phase employing a Multi-Task Training (MTT) approach, which simultaneously leverages annotated 1D NMR datasets of both $^{1}\text{H}$ and $^{13}\text{C}$ spectra to enhance its understanding of NMR spectra. Subsequently, the pre-trained model is utilized to generate pseudo-annotations for unlabelled 2D NMR spectra, which are subsequently used to refine the 2D NMR prediction model. Our approach iterates between annotated unlabelled 2D NMR data and refining our 2D NMR prediction model until convergence. Finally, our model is able to not only accurately predict 2D NMR but also annotate peaks in experimental 2D NMR spectra. Experimental results show that our model is capable of accurately handling medium-sized and large molecules, including polysaccharides, underscoring its effectiveness.  ( 3 min )
    Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint
    arXiv:2403.10585v1 Announce Type: cross Abstract: Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task specific model fine-tuning. To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method by viewing the intermediate noisy images as policies and the target image as the states selected by the policy. Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting into a higher image restoration quality on FFHQ, ImageNet and LSUN datasets.  ( 2 min )
    Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
    arXiv:2402.05808v2 Announce Type: replace-cross Abstract: In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R$^3$ overcomes these limitations by learning from correct demonstrations. Specifically, R$^3$ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R$^3$ establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses RL baseline on eight reasoning tasks by $4.1$ points on average. Notebaly, in program-based reasoning on GSM8K, it exceeds the baseline by $4.2$ points across three backbone models, and without any extra data, Codellama-7B + R$^3$ performs comparable to larger models or closed-source models.  ( 3 min )
    Discovering Invariant Neighborhood Patterns for Heterophilic Graphs
    arXiv:2403.10572v1 Announce Type: new Abstract: This paper studies the problem of distribution shifts on non-homophilous graphs Mosting existing graph neural network methods rely on the homophilous assumption that nodes from the same class are more likely to be linked. However, such assumptions of homophily do not always hold in real-world graphs, which leads to more complex distribution shifts unaccounted for in previous methods. The distribution shifts of neighborhood patterns are much more diverse on non-homophilous graphs. We propose a novel Invariant Neighborhood Pattern Learning (INPL) to alleviate the distribution shifts problem on non-homophilous graphs. Specifically, we propose the Adaptive Neighborhood Propagation (ANP) module to capture the adaptive neighborhood information, which could alleviate the neighborhood pattern distribution shifts problem on non-homophilous graphs. We propose Invariant Non-Homophilous Graph Learning (INHGL) module to constrain the ANP and learn invariant graph representation on non-homophilous graphs. Extensive experimental results on real-world non-homophilous graphs show that INPL could achieve state-of-the-art performance for learning on large non-homophilous graphs.  ( 2 min )
    Robust Counterfactual Explanations for Neural Networks With Probabilistic Guarantees
    arXiv:2305.11997v3 Announce Type: replace-cross Abstract: There is an emerging interest in generating robust counterfactual explanations that would remain valid if the model is updated or changed even slightly. Towards finding robust counterfactuals, existing literature often assumes that the original model $m$ and the new model $M$ are bounded in the parameter space, i.e., $\|\text{Params}(M){-}\text{Params}(m)\|{<}\Delta$. However, models can often change significantly in the parameter space with little to no change in their predictions or accuracy on the given dataset. In this work, we introduce a mathematical abstraction termed $\textit{naturally-occurring}$ model change, which allows for arbitrary changes in the parameter space such that the change in predictions on points that lie on the data manifold is limited. Next, we propose a measure -- that we call $\textit{Stability}$ -- to quantify the robustness of counterfactuals to potential model changes for differentiable models, e.g., neural networks. Our main contribution is to show that counterfactuals with sufficiently high value of $\textit{Stability}$ as defined by our measure will remain valid after potential $\textit{naturally-occurring}$ model changes with high probability (leveraging concentration bounds for Lipschitz function of independent Gaussians). Since our quantification depends on the local Lipschitz constant around a data point which is not always available, we also examine practical relaxations of our proposed measure and demonstrate experimentally how they can be incorporated to find robust counterfactuals for neural networks that are close, realistic, and remain valid after potential model changes. This work also has interesting connections with model multiplicity, also known as, the Rashomon effect.  ( 3 min )
    Phased Data Augmentation for Training a Likelihood-Based Generative Model with Limited Data
    arXiv:2305.12681v2 Announce Type: replace-cross Abstract: Generative models excel in creating realistic images, yet their dependency on extensive datasets for training presents significant challenges, especially in domains where data collection is costly or challenging. Current data-efficient methods largely focus on GAN architectures, leaving a gap in training other types of generative models. Our study introduces "phased data augmentation" as a novel technique that addresses this gap by optimizing training in limited data scenarios without altering the inherent data distribution. By limiting the augmentation intensity throughout the learning phases, our method enhances the model's ability to learn from limited data, thus maintaining fidelity. Applied to a model integrating PixelCNNs with VQ-VAE-2, our approach demonstrates superior performance in both quantitative and qualitative evaluations across diverse datasets. This represents an important step forward in the efficient training of likelihood-based models, extending the usefulness of data augmentation techniques beyond just GANs.  ( 2 min )
    Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
    arXiv:2310.04971v2 Announce Type: replace Abstract: Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved a remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: \emph{intra-class contrasting}, which allows the model to learn features with a high variance, and \emph{inter-class feature sharing}, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data to overshadow the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP models on MSCOCO/Conceptual Captions and evaluating them on shifted ImageNets.  ( 2 min )
    Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
    arXiv:2401.14405v2 Announce Type: replace-cross Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.  ( 3 min )
    CRISP: Hybrid Structured Sparsity for Class-aware Model Pruning
    arXiv:2311.14272v2 Announce Type: replace-cross Abstract: Machine learning pipelines for classification tasks often train a universal model to achieve accuracy across a broad range of classes. However, a typical user encounters only a limited selection of classes regularly. This disparity provides an opportunity to enhance computational efficiency by tailoring models to focus on user-specific classes. Existing works rely on unstructured pruning, which introduces randomly distributed non-zero values in the model, making it unsuitable for hardware acceleration. Alternatively, some approaches employ structured pruning, such as channel pruning, but these tend to provide only minimal compression and may lead to reduced model accuracy. In this work, we propose CRISP, a novel pruning framework leveraging a hybrid structured sparsity pattern that combines both fine-grained N:M structured sparsity and coarse-grained block sparsity. Our pruning strategy is guided by a gradient-based class-aware saliency score, allowing us to retain weights crucial for user-specific classes. CRISP achieves high accuracy with minimal memory consumption for popular models like ResNet-50, VGG-16, and MobileNetV2 on ImageNet and CIFAR-100 datasets. Moreover, CRISP delivers up to 14$\times$ reduction in latency and energy consumption compared to existing pruning methods while maintaining comparable accuracy. Our code is available at https://github.com/shivmgg/CRISP/.  ( 2 min )
    Uncertainty-aware Surrogate Models for Airfoil Flow Simulations with Denoising Diffusion Probabilistic Models
    arXiv:2312.05320v2 Announce Type: replace-cross Abstract: Leveraging neural networks as surrogate models for turbulence simulation is a topic of growing interest. At the same time, embodying the inherent uncertainty of simulations in the predictions of surrogate models remains very challenging. The present study makes a first attempt to use denoising diffusion probabilistic models (DDPMs) to train an uncertainty-aware surrogate model for turbulence simulations. Due to its prevalence, the simulation of flows around airfoils with various shapes, Reynolds numbers, and angles of attack is chosen as the learning objective. Our results show that DDPMs can successfully capture the whole distribution of solutions and, as a consequence, accurately estimate the uncertainty of the simulations. The performance of DDPMs is also compared with varying baselines in the form of Bayesian neural networks and heteroscedastic models. Experiments demonstrate that DDPMs outperform the other methods regarding a variety of accuracy metrics. Besides, it offers the advantage of providing access to the complete distributions of uncertainties rather than providing a set of parameters. As such, it can yield realistic and detailed samples from the distribution of solutions.  ( 2 min )
    Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data
    arXiv:2311.00136v4 Announce Type: replace-cross Abstract: State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an autoregressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pretrained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.  ( 3 min )
    VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
    arXiv:2403.12034v1 Announce Type: cross Abstract: This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time.  ( 2 min )
    Decentralized Stochastic Subgradient Methods for Nonsmooth Nonconvex Optimization
    arXiv:2403.11565v1 Announce Type: cross Abstract: In this paper, we concentrate on decentralized optimization problems with nonconvex and nonsmooth objective functions, especially on the decentralized training of nonsmooth neural networks. We introduce a unified framework, named DSM, to analyze the global convergence of decentralized stochastic subgradient methods. We prove the global convergence of our proposed framework under mild conditions, by establishing that the generated sequence asymptotically approximates the trajectories of its associated differential inclusion. Furthermore, we establish that our proposed framework encompasses a wide range of existing efficient decentralized subgradient methods, including decentralized stochastic subgradient descent (DSGD), DSGD with gradient-tracking technique (DSGD-T), and DSGD with momentum (DSGDm). In addition, we introduce SignSGD employing the sign map to regularize the update directions in DSGDm, and show it is enclosed in our proposed framework. Consequently, our convergence results establish, for the first time, global convergence of these methods when applied to nonsmooth nonconvex objectives. Preliminary numerical experiments demonstrate that our proposed framework yields highly efficient decentralized subgradient methods with convergence guarantees in the training of nonsmooth neural networks.  ( 2 min )
    Limits of Approximating the Median Treatment Effect
    arXiv:2403.10618v1 Announce Type: new Abstract: Average Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\mathbf{a}, \mathbf{b}$, much of the prior work focused on estimating median$(\mathbf{a}) -$ median$(\mathbf{b})$, where median($\mathbf x$) denotes the median value in the sorted ordering of all the values in vector $\mathbf x$. It is known that estimating the difference of medians is easier than the desired estimand of median$(\mathbf{a-b})$, called the Median Treatment Effect (MTE). The fundamental problem of causal inference -- for every individual $i$, we can only observe one of the potential outcome values, i.e., either the value $a_i$ or $b_i$, but not both, makes estimating MTE particularly challenging. In this work, we argue that MTE is not estimable and detail a novel notion of approximation that relies on the sorted order of the values in $\mathbf{a-b}$. Next, we identify a quantity called variability that exactly captures the complexity of MTE estimation. By drawing connections to instance-optimality studied in theoretical computer science, we show that every algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability. Finally, we provide a simple linear time algorithm for computing the variability exactly. Unlike much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., take one of $k$ discrete values.  ( 3 min )
    Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
    arXiv:2403.11780v1 Announce Type: cross Abstract: Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io .  ( 2 min )
    Online Multi-Task Learning with Recursive Least Squares and Recursive Kernel Methods
    arXiv:2308.01938v2 Announce Type: replace-cross Abstract: This paper introduces two novel approaches for Online Multi-Task Learning (MTL) Regression Problems. We employ a high performance graph-based MTL formulation and develop two alternative recursive versions based on the Weighted Recursive Least Squares (WRLS) and the Online Sparse Least Squares Support Vector Regression (OSLSSVR) strategies. Adopting task-stacking transformations, we demonstrate the existence of a single matrix incorporating the relationship of multiple tasks and providing structural information to be embodied by the MT-WRLS method in its initialization procedure and by the MT-OSLSSVR in its multi-task kernel function. Contrasting the existing literature, which is mostly based on Online Gradient Descent (OGD) or cubic inexact approaches, we achieve exact and approximate recursions with quadratic per-instance cost on the dimension of the input space (MT-WRLS) or on the size of the dictionary of instances (MT-OSLSSVR). We compare our online MTL methods to other contenders in a real-world wind speed forecasting case study, evidencing the significant gain in performance of both proposed approaches.  ( 2 min )
    Inherently Interpretable Time Series Classification via Multiple Instance Learning
    arXiv:2311.10049v3 Announce Type: replace Abstract: Conventional Time Series Classification (TSC) methods are often black boxes that obscure inherent interpretation of their decision-making processes. In this work, we leverage Multiple Instance Learning (MIL) to overcome this issue, and propose a new framework called MILLET: Multiple Instance Learning for Locally Explainable Time series classification. We apply MILLET to existing deep learning TSC models and show how they become inherently interpretable without compromising (and in some cases, even improving) predictive performance. We evaluate MILLET on 85 UCR TSC datasets and also present a novel synthetic dataset that is specially designed to facilitate interpretability evaluation. On these datasets, we show MILLET produces sparse explanations quickly that are of higher quality than other well-known interpretability methods. To the best of our knowledge, our work with MILLET, which is available on GitHub (https://github.com/JAEarly/MILTimeSeriesClassification), is the first to develop general MIL methods for TSC and apply them to an extensive variety of domains  ( 2 min )
    Divide-and-Conquer Posterior Sampling for Denoising Diffusion Priors
    arXiv:2403.11407v1 Announce Type: cross Abstract: Interest in the use of Denoising Diffusion Models (DDM) as priors for solving inverse Bayesian problems has recently increased significantly. However, sampling from the resulting posterior distribution poses a challenge. To solve this problem, previous works have proposed approximations to bias the drift term of the diffusion. In this work, we take a different approach and utilize the specific structure of the DDM prior to define a set of intermediate and simpler posterior sampling problems, resulting in a lower approximation error compared to previous methods. We empirically demonstrate the reconstruction capability of our method for general linear inverse problems using synthetic examples and various image restoration tasks.  ( 2 min )
    The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents
    arXiv:2402.03220v2 Announce Type: replace-cross Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.  ( 2 min )
    Co-orchestration of Multiple Instruments to Uncover Structure-Property Relationships in Combinatorial Libraries
    arXiv:2402.02198v2 Announce Type: replace-cross Abstract: The rapid growth of automated and autonomous instrumentations brings forth an opportunity for the co-orchestration of multimodal tools, equipped with multiple sequential detection methods, or several characterization tools to explore identical samples. This can be exemplified by the combinatorial libraries that can be explored in multiple locations by multiple tools simultaneously, or downstream characterization in automated synthesis systems. In the co-orchestration approaches, information gained in one modality should accelerate the discovery of other modalities. Correspondingly, the orchestrating agent should select the measurement modality based on the anticipated knowledge gain and measurement cost. Here, we propose and implement a co-orchestration approach for conducting measurements with complex observables such as spectra or images. The method relies on combining dimensionality reduction by variational autoencoders with representation learning for control over the latent space structure, and integrated into iterative workflow via multi-task Gaussian Processes (GP). This approach further allows for the native incorporation of the system's physics via a probabilistic model as a mean function of the GP. We illustrated this method for different modalities of piezoresponse force microscopy and micro-Raman on combinatorial $Sm-BiFeO_3$ library. However, the proposed framework is general and can be extended to multiple measurement modalities and arbitrary dimensionality of measured signals. The analysis code that supports the funding is publicly available at https://github.com/Slautin/2024_Co-orchestration.  ( 3 min )
    DCNFIS: Deep Convolutional Neuro-Fuzzy Inference System
    arXiv:2308.06378v3 Announce Type: replace-cross Abstract: A key challenge in eXplainable Artificial Intelligence is the well-known tradeoff between the transparency of an algorithm (i.e., how easily a human can directly understand the algorithm, as opposed to receiving a post-hoc explanation), and its accuracy. We report on the design of a new deep network that achieves improved transparency without sacrificing accuracy. We design a deep convolutional neuro-fuzzy inference system (DCNFIS) by hybridizing fuzzy logic and deep learning models and show that DCNFIS performs as accurately as existing convolutional neural networks on four well-known datasets and 3 famous architectures. Our performance comparison with available fuzzy methods show that DCNFIS is now state-of-the-art fuzzy system and outperforms other shallow and deep fuzzy methods to the best of our knowledge. At the end, we exploit the transparency of fuzzy logic by deriving explanations, in the form of saliency maps, from the fuzzy rules encoded in the network to take benefit of fuzzy logic upon regular deep learning methods. We investigate the properties of these explanations in greater depth using the Fashion-MNIST dataset.  ( 2 min )
    Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances
    arXiv:2312.12741v2 Announce Type: replace Abstract: We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, an arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develops a lower bound for the probability of misidentifying the best arm. They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates variances during an adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW) strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity, and the gap between the expected rewards of two arms approaches zero (small-gap regime). Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variance, is asymptotically optimal even when the variances are unknown.  ( 3 min )
    Identifying Three-Dimensional Radiative Patterns Associated with Early Tropical Cyclone Intensification
    arXiv:2401.09493v3 Announce Type: replace-cross Abstract: Cloud radiative feedback impacts early tropical cyclone (TC) intensification, but limitations in existing diagnostic frameworks make them unsuitable for studying asymmetric or transient radiative heating. We propose a linear Variational Encoder-Decoder (VED) to learn the hidden relationship between radiation and the surface intensification of realistic simulated TCs. Limiting VED model inputs enables using its uncertainty to identify periods when radiation has more importance for intensification. A close examination of the extracted 3D radiative structures suggests that longwave radiative forcing from inner core deep convection and shallow clouds both contribute to intensification, with the deep convection having the most impact overall. We find that deep convection downwind of the shallow clouds is critical to the intensification of Haiyan. Our work demonstrates that machine learning can discover thermodynamic-kinematic relationships without relying on axisymmetric or deterministic assumptions, paving the way towards the objective discovery of processes leading to TC intensification in realistic conditions.  ( 2 min )
    Evaluating Language Model Agency through Negotiations
    arXiv:2401.04536v2 Announce Type: replace-cross Abstract: We introduce an approach to evaluate language model (LM) agency using negotiation games. This approach better reflects real-world use cases and addresses some of the shortcomings of alternative LM benchmarks. Negotiation games enable us to study multi-turn, and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage. We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings. Noteworthy findings include: (i) only closed-source models tested here were able to complete these tasks; (ii) cooperative bargaining games proved to be most challenging to the models; and (iii) even the most powerful models sometimes "lose" to weaker opponents  ( 2 min )
    Covid-19 detection from CT scans using EfficientNet and Attention mechanism
    arXiv:2403.11505v1 Announce Type: cross Abstract: Manual diagnosis and analysis of COVID-19 through the examination of lung Computed Tomography (CT) scan images by physicians tends to result in inefficiency, especially with high patient volumes and numerous images per patient. We address the need for automation by developing a deep learning model-based pipeline for COVID-19 detection from CT scan images of the lungs. The Domain adaptation, Explainability, and Fairness in AI for Medical Image Analysis Workshop and COVID-19 Diagnosis Competition (DEF-AI-MIA COV19D) provides an opportunity to assess our designed pipeline for COVID-19 detection from CT scan images. The proposed pipeline incorporates EfficientNet with an Attention mechanism with a pre-processing step. Our pipeline outperforms last year's teams on the validation set of the competition dataset.  ( 2 min )
    Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization
    arXiv:2311.03351v4 Announce Type: replace Abstract: Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which utilizes an on-policy objective for both offline and online learning. Owning to the alignment of objectives in two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase, specifically, Uni-o4 leverages diverse ensemble policies to address the mismatch issues between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that by employing the method above, the fusion of these two paradigms can yield superior offline initialization as well as stable and rapid online fine-tuning capabilities. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations using numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline and offline-to-online fine-tuning learning. Our website: https://lei-kun.github.io/uni-o4/ .  ( 3 min )
    Curriculum Learning for ab initio Deep Learned Refractive Optics
    arXiv:2302.01089v3 Announce Type: replace-cross Abstract: Deep optical optimization has recently emerged as a new paradigm for designing computational imaging systems using only the output image as the objective. However, it has been limited to either simple optical systems consisting of a single element such as a diffractive optical element (DOE) or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a DeepLens design method based on curriculum learning, which is able to learn optical designs of compound lenses ab initio from randomly initialized surfaces without human intervention, therefore overcoming the need for a good initial design. We demonstrate the effectiveness of our approach by fully automatically designing both classical imaging lenses and a large field-of-view extended depth-of-field computational lens in a cellphone-style form factor, with highly aspheric surfaces and a short back focal length.  ( 2 min )
    Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey
    arXiv:2311.07914v2 Announce Type: replace-cross Abstract: The contemporary LLMs are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. To address this critical limitation, researchers employ diverse strategies to augment the LLMs by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. Among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. In this survey, we comprehensively review these knowledge-graph-based augmentation techniques in LLMs, focusing on their efficacy in mitigating hallucinations. We systematically categorize these methods into three overarching groups, offering methodological comparisons and performance evaluations. Lastly, this survey explores the current trends and challenges associated with these techniques and outlines potential avenues for future research in this emerging field.  ( 2 min )
    Identifying Policy Gradient Subspaces
    arXiv:2401.06604v3 Announce Type: replace Abstract: Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace. In this paper, we conduct a thorough evaluation of this phenomenon for two popular deep policy gradient methods on various simulated benchmark tasks. Our results demonstrate the existence of such gradient subspaces despite the continuously changing data distribution inherent to reinforcement learning. These findings reveal promising directions for future work on more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.  ( 2 min )
    Pointing the Way: Refining Radar-Lidar Localization Using Learned ICP Weights
    arXiv:2309.08731v2 Announce Type: replace-cross Abstract: This paper presents a novel deep-learning-based approach to improve localizing radar measurements against lidar maps. Although the state of the art for localization is matching lidar data to lidar maps, radar has been considered as a promising alternative. This is largely due to radar being more resilient against adverse weather such as precipitation and heavy fog. To make use of existing high-quality lidar maps, while maintaining performance in adverse weather, it is of interest to match radar data to lidar maps. However, owing in part to the unique artefacts present in radar measurements, radar-lidar localization has struggled to achieve comparable performance to lidar-lidar systems, preventing it from being viable for autonomous driving. This work builds on an ICP-based radar-lidar localization system by including a learned preprocessing step that weights radar points based on high-level scan information. Combining a proven analytical approach with a learned weight reduces localization errors in radar-lidar ICP results run on real-world autonomous driving data by up to 54.94% in translation and 68.39% in rotation, while maintaining interpretability and robustness.  ( 2 min )
    Cartesian atomic cluster expansion for machine learning interatomic potentials
    arXiv:2402.07472v2 Announce Type: replace-cross Abstract: Machine learning interatomic potentials are revolutionizing large-scale, accurate atomistic modelling in material science and chemistry. Many potentials use atomic cluster expansion or equivariant message passing frameworks. Such frameworks typically use spherical harmonics as angular basis functions, and then use Clebsch-Gordan contraction to maintain rotational symmetry, which may introduce redundancies in representations and computational overhead. We propose an alternative: a Cartesian-coordinates-based atomic density expansion. This approach provides a complete set of polynormially indepedent features of atomic environments while maintaining interaction body orders. Additionally, we integrate low-dimensional embeddings of various chemical elements and inter-atomic message passing. The resulting potential, named Cartesian Atomic Cluster Expansion (CACE), exhibits good accuracy, stability, and generalizability. We validate its performance in diverse systems, including bulk water, small molecules, and 25-element high-entropy alloys.  ( 2 min )
    DreamLLM: Synergistic Multimodal Comprehension and Creation
    arXiv:2309.11499v2 Announce Type: replace-cross Abstract: This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy. Project page: https://dreamllm.github.io.  ( 2 min )
    Fourier Transporter: Bi-Equivariant Robotic Manipulation in 3D
    arXiv:2401.12046v2 Announce Type: replace-cross Abstract: Many complex robotic manipulation tasks can be decomposed as a sequence of pick and place actions. Training a robotic agent to learn this sequence over many different starting conditions typically requires many iterations or demonstrations, especially in 3D environments. In this work, we propose Fourier Transporter (FourTran) which leverages the two-fold SE(d)xSE(d) symmetry in the pick-place problem to achieve much higher sample efficiency. FourTran is an open-loop behavior cloning method trained using expert demonstrations to predict pick-place actions on new environments. FourTran is constrained to incorporate symmetries of the pick and place actions independently. Our method utilizes a fiber space Fourier transformation that allows for memory-efficient construction. We test our proposed network on the RLbench benchmark and achieve state-of-the-art results across various tasks.  ( 2 min )
    Sheaf4Rec: Sheaf Neural Networks for Graph-based Recommender Systems
    arXiv:2304.09097v3 Announce Type: replace-cross Abstract: Recent advancements in Graph Neural Networks (GNN) have facilitated their widespread adoption in various applications, including recommendation systems. GNNs have proven to be effective in addressing the challenges posed by recommendation systems by efficiently modeling graphs in which nodes represent users or items and edges denote preference relationships. However, current GNN techniques represent nodes by means of a single static vector, which may inadequately capture the intricate complexities of users and items. To overcome these limitations, we propose a solution integrating a cutting-edge model inspired by category theory: Sheaf4Rec. Unlike single vector representations, Sheaf Neural Networks and their corresponding Laplacians represent each node (and edge) using a vector space. Our approach takes advantage from this theory and results in a more comprehensive representation that can be effectively exploited during inference, providing a versatile method applicable to a wide range of graph-related tasks and demonstrating unparalleled performance. Our proposed model exhibits a noteworthy relative improvement of up to 8.53% on F1-Score@10 and an impressive increase of up to 11.29% on NDCG@10, outperforming existing state-of-the-art models such as Neural Graph Collaborative Filtering (NGCF), KGTORe and other recently developed GNN-based models. In addition to its superior predictive capabilities, Sheaf4Rec shows remarkable improvements in terms of efficiency: we observe substantial runtime improvements ranging from 2.5% up to 37% when compared to other GNN-based competitor models, indicating a more efficient way of handling information while achieving better performance. Code is available at https://github.com/antoniopurificato/Sheaf4Rec.  ( 3 min )
    MUBen: Benchmarking the Uncertainty of Molecular Representation Models
    arXiv:2306.10060v3 Announce Type: replace-cross Abstract: Large molecular representation models pre-trained on massive unlabeled data have shown great success in predicting molecular properties. However, these models may tend to overfit the fine-tuning data, resulting in over-confident predictions on test data that fall outside of the training distribution. To address this issue, uncertainty quantification (UQ) methods can be used to improve the models' calibration of predictions. Although many UQ approaches exist, not all of them lead to improved performance. While some studies have included UQ to improve molecular pre-trained models, the process of selecting suitable backbone and UQ methods for reliable molecular uncertainty estimation remains underexplored. To address this gap, we present MUBen, which evaluates different UQ methods for state-of-the-art backbone molecular representation models to investigate their capabilities. By fine-tuning various backbones using different molecular descriptors as inputs with UQ methods from different categories, we critically assess the influence of architectural decisions and training strategies. Our study offers insights for selecting UQ for backbone models, which can facilitate research on uncertainty-critical applications in fields such as materials science and drug discovery.  ( 2 min )
    Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning
    arXiv:2310.18247v2 Announce Type: replace Abstract: In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented experience. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.  ( 3 min )
    PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts
    arXiv:2308.01313v3 Announce Type: replace-cross Abstract: Vision-language models like CLIP are widely used in zero-shot image classification due to their ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better performance is still an open question. This paper draws inspiration from the human visual perception process: when classifying an object, humans first infer contextual attributes (e.g., background and orientation) which help separate the foreground object from the background, and then classify the object based on this information. Inspired by it, we observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interoperability. Our code is available at https://github.com/umd-huang-lab/perceptionCLIP  ( 2 min )
    A Novel Federated Learning-Based IDS for Enhancing UAVs Privacy and Security
    arXiv:2312.04135v2 Announce Type: replace-cross Abstract: Unmanned aerial vehicles (UAVs) operating within Flying Ad-hoc Networks (FANETs) encounter security challenges due to the dynamic and distributed nature of these networks. Previous studies predominantly focused on centralized intrusion detection, assuming a central entity responsible for storing and analyzing data from all devices.However, these approaches face challenges including computation and storage costs, along with a single point of failure risk, threatening data privacy and availability. The widespread dispersion of data across interconnected devices underscores the necessity for decentralized approaches. This paper introduces the Federated Learning-based Intrusion Detection System (FL-IDS), addressing challenges encountered by centralized systems in FANETs. FL-IDS reduces computation and storage costs for both clients and the central server, crucial for resource-constrained UAVs. Operating in a decentralized manner, FL-IDS enables UAVs to collaboratively train a global intrusion detection model without sharing raw data, thus avoiding the delay in decisions based on collected data, as is often the case with traditional methods. Experimental results demonstrate FL-IDS's competitive performance with Central IDS (C-IDS) while mitigating privacy concerns, with the Bias Towards Specific Clients (BTSC) method further enhancing FL-IDS performance even at lower attacker ratios. Comparative analysis with traditional intrusion detection methods, including Local IDS (L-IDS), sheds light on FL-IDS's strengths. This study significantly contributes to UAV security by introducing a privacy-aware, decentralized intrusion detection approach tailored to UAV networks. Moreover, by introducing a realistic dataset for FANETs and federated learning, our approach differs from others lacking high dynamism and 3D node movements or accurate federated data federations.  ( 3 min )
    MARVEL: Multi-Agent Reinforcement-Learning for Large-Scale Variable Speed Limits
    arXiv:2310.12359v2 Announce Type: replace-cross Abstract: Variable Speed Limit (VSL) control acts as a promising highway traffic management strategy with worldwide deployment, which can enhance traffic safety by dynamically adjusting speed limits according to real-time traffic conditions. Most of the deployed VSL control algorithms so far are rule-based, lacking generalizability under varying and complex traffic scenarios. In this work, we propose MARVEL (Multi-Agent Reinforcement-learning for large-scale Variable spEed Limits), a novel framework for large-scale VSL control on highway corridors with real-world deployment settings. MARVEL utilizes only sensing information observable in the real world as state input and learns through a reward structure that incorporates adaptability to traffic conditions, safety, and mobility, thereby enabling multi-agent coordination. With parameter sharing among all VSL agents, the proposed framework scales to cover corridors with many agents. The policies are trained in a microscopic traffic simulation environment, focusing on a short freeway stretch with 8 VSL agents spanning 7 miles. For testing, these policies are applied to a more extensive network with 34 VSL agents spanning 17 miles of I-24 near Nashville, TN, USA. MARVEL-based method improves traffic safety by 63.4% compared to the no control scenario and enhances traffic mobility by 58.6% compared to a state-of-the-practice algorithm that has been deployed on I-24. Besides, we conduct an explainability analysis to examine the decision-making process of the agents and explore the learned policy under different traffic conditions. Finally, we test the response of the policy learned from the simulation-based experiments with real-world data collected from I-24 and illustrate its deployment capability.  ( 3 min )
    Inverse Problems with Learned Forward Operators
    arXiv:2311.12528v2 Announce Type: replace-cross Abstract: Solving inverse problems requires the knowledge of the forward operator, but accurate models can be computationally expensive and hence cheaper variants that do not compromise the reconstruction quality are desired. This chapter reviews reconstruction methods in inverse problems with learned forward operators that follow two different paradigms. The first one is completely agnostic to the forward operator and learns its restriction to the subspace spanned by the training data. The framework of regularisation by projection is then used to find a reconstruction. The second one uses a simplified model of the physics of the measurement process and only relies on the training data to learn a model correction. We present the theory of these two approaches and compare them numerically. A common theme emerges: both methods require, or at least benefit from, training data not only for the forward operator, but also for its adjoint.  ( 2 min )
    Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation and Elaboration
    arXiv:2306.13769v3 Announce Type: replace-cross Abstract: In recent years, AI-assisted drug design methods have been proposed to generate molecules given the pockets' structures of target proteins. Most of them are atom-level-based methods, which consider atoms as basic components and generate atom positions and types. In this way, however, it is hard to generate realistic fragments with complicated structures. To solve this, we propose D3FG, a functional-group-based diffusion model for pocket-specific molecule generation and elaboration. D3FG decomposes molecules into two categories of components: functional groups defined as rigid bodies and linkers as mass points. And the two kinds of components can together form complicated fragments that enhance ligand-protein interactions. To be specific, in the diffusion process, D3FG diffuses the data distribution of the positions, orientations, and types of the components into a prior distribution; In the generative process, the noise is gradually removed from the three variables by denoisers parameterized with designed equivariant graph neural networks. In the experiments, our method can generate molecules with more realistic 3D structures, competitive affinities toward the protein targets, and better drug properties. Besides, D3FG as a solution to a new task of molecule elaboration, could generate molecules with high affinities based on existing ligands and the hotspots of target proteins.  ( 3 min )
    Can LLM-Generated Misinformation Be Detected?
    arXiv:2309.13788v3 Announce Type: replace-cross Abstract: The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating misinformation with LLMs. Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery on combating misinformation in the age of LLMs and the countermeasures.  ( 2 min )
    Tangent Bundle Convolutional Learning: from Manifolds to Cellular Sheaves and Back
    arXiv:2303.11323v2 Announce Type: replace-cross Abstract: In this work we introduce a convolution operation over the tangent bundle of Riemann manifolds in terms of exponentials of the Connection Laplacian operator. We define tangent bundle filters and tangent bundle neural networks (TNNs) based on this convolution operation, which are novel continuous architectures operating on tangent bundle signals, i.e. vector fields over the manifolds. Tangent bundle filters admit a spectral representation that generalizes the ones of scalar manifold filters, graph filters and standard convolutional filters in continuous time. We then introduce a discretization procedure, both in the space and time domains, to make TNNs implementable, showing that their discrete counterpart is a novel principled variant of the very recently introduced sheaf neural networks. We formally prove that this discretized architecture converges to the underlying continuous TNN. Finally, we numerically evaluate the effectiveness of the proposed architecture on various learning tasks, both on synthetic and real data.  ( 2 min )
    Quantifying the biomimicry gap in biohybrid robot-fish pairs
    arXiv:2308.08978v2 Announce Type: replace-cross Abstract: Biohybrid systems in which robotic lures interact with animals have become compelling tools for probing and identifying the mechanisms underlying collective animal behavior. One key challenge lies in the transfer of social interaction models from simulations to reality, using robotics to validate the modeling hypotheses. This challenge arises in bridging what we term the "biomimicry gap", which is caused by imperfect robotic replicas, communication cues and physics constraints not incorporated in the simulations, that may elicit unrealistic behavioral responses in animals. In this work, we used a biomimetic lure of a rummy-nose tetra fish (Hemigrammus rhodostomus) and a neural network (NN) model for generating biomimetic social interactions. Through experiments with a biohybrid pair comprising a fish and the robotic lure, a pair of real fish, and simulations of pairs of fish, we demonstrate that our biohybrid system generates social interactions mirroring those of genuine fish pairs. Our analyses highlight that: 1) the lure and NN maintain minimal deviation in real-world interactions compared to simulations and fish-only experiments, 2) our NN controls the robot efficiently in real-time, and 3) a comprehensive validation is crucial to bridge the biomimicry gap, ensuring realistic biohybrid systems.  ( 2 min )
    Respiratory Disease Classification and Biometric Analysis Using Biosignals from Digital Stethoscopes
    arXiv:2309.07183v2 Announce Type: replace-cross Abstract: Respiratory diseases remain a leading cause of mortality worldwide, highlighting the need for faster and more accurate diagnostic tools. This work presents a novel approach leveraging digital stethoscope technology for automatic respiratory disease classification and biometric analysis. Our approach has the potential to significantly enhance traditional auscultation practices. By leveraging one of the largest publicly available medical database of respiratory sounds, we train machine learning models to classify various respiratory health conditions. Our method differs from conventional methods by using Empirical Mode Decomposition (EMD) and spectral analysis techniques to isolate clinically relevant biosignals embedded within acoustic data captured by digital stethoscopes. This approach focuses on information closely tied to cardiovascular and respiratory patterns within the acoustic data. Spectral analysis and filtering techniques isolate Intrinsic Mode Functions (IMFs) strongly correlated with these physiological phenomena. These biosignals undergo a comprehensive feature extraction process for predictive modeling. These features then serve as input to train several machine learning models for both classification and regression tasks. Our approach achieves high accuracy in both binary classification (89% balanced accuracy for healthy vs. diseased) and multi-class classification (72% balanced accuracy for specific diseases like pneumonia and COPD). For the first time, this work introduces regression models capable of estimating age and body mass index (BMI) based solely on acoustic data, as well as a model for sex classification. Our findings underscore the potential of intelligent digital stethoscopes to significantly enhance assistive and remote diagnostic capabilities, contributing to advancements in digital health, telehealth, and remote patient monitoring.  ( 3 min )
    Compact: Approximating Complex Activation Functions for Secure Computation
    arXiv:2309.04664v2 Announce Type: replace-cross Abstract: Secure multi-party computation (MPC) techniques can be used to provide data privacy when users query deep neural network (DNN) models hosted on a public cloud. State-of-the-art MPC techniques can be directly leveraged for DNN models that use simple activation functions such as ReLU. However, these techniques are ineffective and/or inefficient for the complex and highly non-linear activation functions used in cutting-edge DNN models. We present Compact, which produces piece-wise polynomial approximations of complex AFs to enable their efficient use with state-of-the-art MPC techniques. Compact neither requires nor imposes any restriction on model training and results in near-identical model accuracy. To achieve this, we design Compact with input density awareness and use an application-specific simulated annealing type optimization to generate computationally more efficient approximations of complex AFs. We extensively evaluate Compact on four different machine-learning tasks with DNN architectures that use popular complex AFs silu, gelu, and mish. Our experimental results show that Compact incurs negligible accuracy loss while being 2x-5x computationally more efficient than state-of-the-art approaches for DNN models with large number of hidden layers. Our work accelerates easy adoption of MPC techniques to provide user data privacy even when the queried DNN models consist of a number of hidden layers and trained over complex AFs.  ( 2 min )
    BCQQ: Batch-Constraint Quantum Q-Learning with Cyclic Data Re-uploading
    arXiv:2305.00905v2 Announce Type: replace-cross Abstract: Deep reinforcement learning (DRL) often requires a large number of data and environment interactions, making the training process time-consuming. This challenge is further exacerbated in the case of batch RL, where the agent is trained solely on a pre-collected dataset without environment interactions. Recent advancements in quantum computing suggest that quantum models might require less data for training compared to classical methods. In this paper, we investigate this potential advantage by proposing a batch RL algorithm that utilizes VQC as function approximators within the discrete batch-constraint deep Q-learning (BCQ) algorithm. Additionally, we introduce a novel data re-uploading scheme by cyclically shifting the order of input variables in the data encoding layers. We evaluate the efficiency of our algorithm on the OpenAI CartPole environment and compare its performance to the classical neural network-based discrete BCQ.  ( 2 min )
    On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing
    arXiv:2306.05524v2 Announce Type: replace-cross Abstract: With ChatGPT under the spotlight, utilizing large language models (LLMs) to assist academic writing has drawn a significant amount of debate in the community. In this paper, we aim to present a comprehensive study of the detectability of ChatGPT-generated content within the academic literature, particularly focusing on the abstracts of scientific papers, to offer holistic support for the future development of LLM applications and policies in academia. Specifically, we first present GPABench2, a benchmarking dataset of over 2.8 million comparative samples of human-written, GPT-written, GPT-completed, and GPT-polished abstracts of scientific writing in computer science, physics, and humanities and social sciences. Second, we explore the methodology for detecting ChatGPT content. We start by examining the unsatisfactory performance of existing ChatGPT detecting tools and the challenges faced by human evaluators (including more than 240 researchers or students). We then test the hand-crafted linguistic features models as a baseline and develop a deep neural framework named CheckGPT to better capture the subtle and deep semantic and linguistic patterns in ChatGPT written literature. Last, we conduct comprehensive experiments to validate the proposed CheckGPT framework in each benchmarking task over different disciplines. To evaluate the detectability of ChatGPT content, we conduct extensive experiments on the transferability, prompt engineering, and robustness of CheckGPT.  ( 3 min )
    Magnushammer: A Transformer-Based Approach to Premise Selection
    arXiv:2303.04488v3 Announce Type: replace Abstract: This paper presents a novel approach to premise selection, a crucial reasoning task in automated theorem proving. Traditionally, symbolic methods that rely on extensive domain knowledge and engineering effort are applied to this task. In contrast, this work demonstrates that contrastive training with the transformer architecture can achieve higher-quality retrieval of relevant premises, without the engineering overhead. Our method, Magnushammer, outperforms the most advanced and widely used automation tool in interactive theorem proving called Sledgehammer. On the PISA and miniF2F benchmarks Magnushammer achieves $59.5\%$ (against $38.3\%$) and $34.0\%$ (against $20.9\%$) success rates, respectively. By combining \method with a language-model-based automated theorem prover, we further improve the state-of-the-art proof success rate from $57.0\%$ to $71.0\%$ on the PISA benchmark using $4$x fewer parameters. Moreover, we develop and open source a novel dataset for premise selection, containing textual representations of (proof state, relevant premise) pairs. To the best of our knowledge, this is the largest available premise selection dataset, and the first one for the Isabelle proof assistant.  ( 2 min )
    CaVE: A Cone-Aligned Approach for Fast Predict-then-optimize with Binary Linear Programs
    arXiv:2312.07718v2 Announce Type: replace Abstract: The end-to-end predict-then-optimize framework, also known as decision-focused learning, has gained popularity for its ability to integrate optimization into the training procedure of machine learning models that predict the unknown cost (objective function) coefficients of optimization problems from contextual instance information. Naturally, most of the problems of interest in this space can be cast as integer linear programs. In this work, we focus on binary linear programs (BLPs) and propose a new end-to-end training method to predict-then-optimize. Our method, Cone-aligned Vector Estimation (CaVE), aligns the predicted cost vectors with the normal cone corresponding to the true optimal solution of a training instance. When the predicted cost vector lies inside the cone, the optimal solution to the linear relaxation of the binary problem is optimal. This alignment not only produces decision-aware learning models but also dramatically reduces training time as it circumvents the need to solve BLPs to compute a loss function with its gradients. Experiments across multiple datasets show that our method exhibits a favorable trade-off between training time and solution quality, particularly with large-scale optimization problems such as vehicle routing, a hard BLP that has yet to benefit from predict-then-optimize methods in the literature due to its difficulty.  ( 3 min )
    A low latency attention module for streaming self-supervised speech representation learning
    arXiv:2302.13451v2 Announce Type: replace-cross Abstract: The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) represents a popular use-case for the transformer architecture. Due to transformers' acausal behavior, the use of transformers for SSRL has been predominantly focused on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The attention module proposed in this paper includes two components, streaming attention (SA) and low-latency streaming attention (LLSA). The SA represents our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as the masked acausal attention (MAA), guaranteeing a latency equal to one layer even when multiple layers are stacked. We present a comparative analysis between the vanilla attention, which we will refer here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL with automatic speech recognition as downstream task. When training on librispeech-clean-100 and testing on librispeech-test-clean, our low-latency attention module has a word error rate (WER) of 5.84%, which represents a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers, but also enables latency characteristics that make it applicable to real-time streaming applications.  ( 3 min )
    EcoVal: An Efficient Data Valuation Framework for Machine Learning
    arXiv:2402.09288v3 Announce Type: replace Abstract: Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley value based frameworks for data valuation in machine learning are computationally expensive as they require considerable amount of repeated training of the model to obtain the Shapley value. In this paper, we introduce an efficient data valuation framework EcoVal, to estimate the value of data for machine learning models in a fast and practical manner. Instead of directly working with individual data sample, we determine the value of a cluster of similar data points. This value is further propagated amongst all the member cluster points. We show that the overall data value can be determined by estimating the intrinsic and extrinsic value of each data. This is enabled by formulating the performance of a model as a \textit{production function}, a concept which is popularly used to estimate the amount of output based on factors like labor and capital in a traditional free economic market. We provide a formal proof of our valuation technique and elucidate the principles and mechanisms that enable its accelerated performance. We demonstrate the real-world applicability of our method by showcasing its effectiveness for both in-distribution and out-of-sample data. This work addresses one of the core challenges of efficient data valuation at scale in machine learning models.  ( 3 min )
    A Bayesian Gaussian Process-Based Latent Discriminative Generative Decoder (LDGD) Model for High-Dimensional Data
    arXiv:2401.16497v2 Announce Type: replace Abstract: Extracting meaningful information from high-dimensional data poses a formidable modeling challenge, particularly when the data is obscured by noise or represented through different modalities. This research proposes a novel non-parametric modeling approach, leveraging the Gaussian process (GP), to characterize high-dimensional data by mapping it to a latent low-dimensional manifold. This model, named the latent discriminative generative decoder (LDGD), employs both the data and associated labels in the manifold discovery process. We derive a Bayesian solution to infer the latent variables, allowing LDGD to effectively capture inherent stochasticity in the data. We demonstrate applications of LDGD on both synthetic and benchmark datasets. Not only does LDGD infer the manifold accurately, but its accuracy in predicting data points' labels surpasses state-of-the-art approaches. In the development of LDGD, we have incorporated inducing points to reduce the computational complexity of Gaussian processes for large datasets, enabling batch training for enhanced efficient processing and scalability. Additionally, we show that LDGD can robustly infer manifold and precisely predict labels for scenarios in that data size is limited, demonstrating its capability to efficiently characterize high-dimensional data with limited samples. These collective attributes highlight the importance of developing non-parametric modeling approaches to analyze high-dimensional data.  ( 3 min )
    Towards 3D Molecule-Text Interpretation in Language Models
    arXiv:2401.13923v2 Announce Type: replace Abstract: Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation, and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector, bridging the 3D molecular encoder's representation space and the LM's input space. Moreover, to enhance 3D-MoLM's ability of cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset -- 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of 3D molecular encoder and LM. It significantly surpasses existing baselines on downstream tasks, including molecule-text retrieval, molecule captioning, and more challenging open-text molecular QA tasks, especially focusing on 3D-dependent properties. We release our codes and datasets at https://github.com/lsh0520/3D-MoLM.  ( 2 min )
    SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules
    arXiv:2403.11887v1 Announce Type: cross Abstract: Low-rank adaptation (LoRA) and its variants are widely employed in fine-tuning large models, including large language models for natural language processing and diffusion models for computer vision. This paper proposes a generalized framework called SuperLoRA that unifies and extends different LoRA variants, which can be realized under different hyper-parameter settings. Introducing grouping, folding, shuffling, projecting, and tensor factoring, SuperLoRA offers high flexibility compared with other LoRA variants and demonstrates superior performance for transfer learning tasks especially in the extremely few-parameter regimes.  ( 2 min )
    Learning to design protein-protein interactions with enhanced generalization
    arXiv:2310.18515v3 Announce Type: replace Abstract: Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We fine-tune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.  ( 2 min )
    Looped Transformers are Better at Learning Learning Algorithms
    arXiv:2311.12424v3 Announce Type: replace Abstract: Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utilization of looped transformer architecture and its associated training methodology, with the aim of incorporating iterative characteristics into the transformer architectures. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10% of the parameter count.  ( 2 min )
    Causal Modeling with Stationary Diffusions
    arXiv:2310.17405v2 Announce Type: replace Abstract: We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.  ( 2 min )
    REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning
    arXiv:2311.00519v3 Announce Type: replace Abstract: The success of self-supervised contrastive learning hinges on identifying positive data pairs, such that when they are pushed together in embedding space, the space encodes useful information for subsequent downstream tasks. Constructing positive pairs is non-trivial as the pairing must be similar enough to reflect a shared semantic meaning, but different enough to capture within-class variation. Classical approaches in vision use augmentations to exploit well-established invariances to construct positive pairs, but invariances in the time-series domain are much less obvious. In our work, we propose a novel method of using a learned measure for identifying positive pairs. Our Retrieval-Based Reconstruction (REBAR) measure measures the similarity between two sequences as the reconstruction error that results from reconstructing one sequence with retrieved information from the other. Then, if the two sequences have high REBAR similarity, we label them as a positive pair. Through validation experiments, we show that the REBAR error is a predictor of mutual class membership. Once integrated into a contrastive learning framework, our REBAR method learns an embedding that achieves state-of-the-art performance on downstream tasks across various modalities.  ( 2 min )
    Dual-Encoders for Extreme Multi-Label Classification
    arXiv:2310.10636v2 Announce Type: replace Abstract: Dual-encoder (DE) models are widely used in retrieval tasks, most commonly studied on open QA benchmarks that are often characterized by multi-class and limited training data. In contrast, their performance in multi-label and data-rich retrieval settings like extreme multi-label classification (XMC), remains under-explored. Current empirical evidence indicates that DE models fall significantly short on XMC benchmarks, where SOTA methods linearly scale the number of learnable parameters with the total number of classes (documents in the corpus) by employing per-class classification head. To this end, we first study and highlight that existing multi-label contrastive training losses are not appropriate for training DE models on XMC tasks. We propose decoupled softmax loss - a simple modification to the InfoNCE loss - that overcomes the limitations of existing contrastive losses. We further extend our loss design to a soft top-k operator-based loss which is tailored to optimize top-k prediction performance. When trained with our proposed loss functions, standard DE models alone can match or outperform SOTA methods by up to 2% at Precision@1 even on the largest XMC datasets while being 20x smaller in terms of the number of trainable parameters. This leads to more parameter-efficient and universally applicable solutions for retrieval tasks. Our code and models are publicly available at https://github.com/nilesh2797/dexml.  ( 3 min )
    Enhancing the Antidote: Improved Pointwise Certifications against Poisoning Attacks
    arXiv:2308.07553v2 Announce Type: replace Abstract: Poisoning attacks can disproportionately influence model behaviour by making small changes to the training corpus. While defences against specific poisoning attacks do exist, they in general do not provide any guarantees, leaving them potentially countered by novel attacks. In contrast, by examining worst-case behaviours Certified Defences make it possible to provide guarantees of the robustness of a sample against adversarial attacks modifying a finite number of training samples, known as pointwise certification. We achieve this by exploiting both Differential Privacy and the Sampled Gaussian Mechanism to ensure the invariance of prediction for each testing instance against finite numbers of poisoned examples. In doing so, our model provides guarantees of adversarial robustness that are more than twice as large as those provided by prior certifications.  ( 2 min )
    Zeroth-Order Hard-Thresholding: Gradient Error vs. Expansivity
    arXiv:2210.05279v2 Announce Type: replace Abstract: $\ell_0$ constrained optimization is prevalent in machine learning, particularly for high-dimensional problems, because it is a fundamental approach to achieve sparse learning. Hard-thresholding gradient descent is a dominant technique to solve this problem. However, first-order gradients of the objective function may be either unavailable or expensive to calculate in a lot of real-world problems, where zeroth-order (ZO) gradients could be a good surrogate. Unfortunately, whether ZO gradients can work with the hard-thresholding operator is still an unsolved problem. To solve this puzzle, in this paper, we focus on the $\ell_0$ constrained black-box stochastic optimization problems, and propose a new stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm with a general ZO gradient estimator powered by a novel random support sampling. We provide the convergence analysis of SZOHT under standard assumptions. Importantly, we reveal a conflict between the deviation of ZO estimators and the expansivity of the hard-thresholding operator, and provide a theoretical minimal value of the number of random directions in ZO gradients. In addition, we find that the query complexity of SZOHT is independent or weakly dependent on the dimensionality under different settings. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks.  ( 2 min )
    PyGOD: A Python Library for Graph Outlier Detection
    arXiv:2204.12095v2 Announce Type: replace Abstract: PyGOD is an open-source Python library for detecting outliers in graph data. As the first comprehensive library of its kind, PyGOD supports a wide array of leading graph-based methods for outlier detection under an easy-to-use, well-documented API designed for use by both researchers and practitioners. PyGOD provides modularized components of the different detectors implemented so that users can easily customize each detector for their purposes. To ease the construction of detection workflows, PyGOD offers numerous commonly used utility functions. To scale computation to large graphs, PyGOD supports functionalities for deep models such as sampling and mini-batch processing. PyGOD uses best practices in fostering code reliability and maintainability, including unit testing, continuous integration, and code coverage. To facilitate accessibility, PyGOD is released under a BSD 2-Clause license at https://pygod.org and at the Python Package Index (PyPI).  ( 2 min )
    From explainable to interpretable deep learning for natural language processing in healthcare: how far from reality?
    arXiv:2403.11894v1 Announce Type: cross Abstract: Deep learning (DL) has substantially enhanced healthcare research by addressing various natural language processing (NLP) tasks. Yet, the increasing complexity of DL-based NLP methods necessitates transparent model interpretability, or at least explainability, for reliable decision-making. This work presents a thorough scoping review on explainable and interpretable DL in healthcare NLP. The term "XIAI" (eXplainable and Interpretable Artificial Intelligence) was introduced to distinguish XAI from IAI. Methods were further categorized based on their functionality (model-, input-, output-based) and scope (local, global). Our analysis shows that attention mechanisms were the most dominant emerging IAI. Moreover, IAI is increasingly used against XAI. The major challenges identified are that most XIAI do not explore "global" modeling processes, the lack of best practices, and the unmet need for systematic evaluation and benchmarks. Important opportunities were raised such as using "attention" to enhance multi-modal XIAI for personalized medicine and combine DL with causal reasoning. Our discussion encourages the integration of XIAI in LLMs and domain-specific smaller models. Our review can stimulate further research and benchmarks toward improving inherent IAI and engaging complex NLP in healthcare.  ( 2 min )
    SIFU: Sequential Informed Federated Unlearning for Efficient and Provable Client Unlearning in Federated Optimization
    arXiv:2211.11656v5 Announce Type: replace Abstract: Machine Unlearning (MU) is an increasingly important topic in machine learning safety, aiming at removing the contribution of a given data point from a training procedure. Federated Unlearning (FU) consists in extending MU to unlearn a given client's contribution from a federated training routine. While several FU methods have been proposed, we currently lack a general approach providing formal unlearning guarantees to the FedAvg routine, while ensuring scalability and generalization beyond the convex assumption on the clients' loss functions. We aim at filling this gap by proposing SIFU (Sequential Informed Federated Unlearning), a new FU method applying to both convex and non-convex optimization regimes. SIFU naturally applies to FedAvg without additional computational cost for the clients and provides formal guarantees on the quality of the unlearning task. We provide a theoretical analysis of the unlearning properties of SIFU, and practically demonstrate its effectiveness as compared to a panel of unlearning methods from the state-of-the-art.  ( 2 min )
    Diffusion-Based Environment-Aware Trajectory Prediction
    arXiv:2403.11643v1 Announce Type: cross Abstract: The ability to predict the future trajectories of traffic participants is crucial for the safe and efficient operation of autonomous vehicles. In this paper, a diffusion-based generative model for multi-agent trajectory prediction is proposed. The model is capable of capturing the complex interactions between traffic participants and the environment, accurately learning the multimodal nature of the data. The effectiveness of the approach is assessed on large-scale datasets of real-world traffic scenarios, showing that our model outperforms several well-established methods in terms of prediction accuracy. By the incorporation of differential motion constraints on the model output, we illustrate that our model is capable of generating a diverse set of realistic future trajectories. Through the use of an interaction-aware guidance signal, we further demonstrate that the model can be adapted to predict the behavior of less cooperative agents, emphasizing its practical applicability under uncertain traffic conditions.  ( 2 min )
    Language Model Knowledge Distillation for Efficient Question Answering in Spanish
    arXiv:2312.04193v2 Announce Type: replace-cross Abstract: Recent advances in the development of pre-trained Spanish language models has led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could be proven to be highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, whilst attaining negligible performance sacrifice. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speedup. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.  ( 2 min )
    FEDORA: Flying Event Dataset fOr Reactive behAvior
    arXiv:2305.14392v2 Announce Type: replace-cross Abstract: The ability of resource-constrained biological systems such as fruitflies to perform complex and high-speed maneuvers in cluttered environments has been one of the prime sources of inspiration for developing vision-based autonomous systems. To emulate this capability, the perception pipeline of such systems must integrate information cues from tasks including optical flow and depth estimation, object detection and tracking, and segmentation, among others. However, the conventional approach of employing slow, synchronous inputs from standard frame-based cameras constrains these perception capabilities, particularly during high-speed maneuvers. Recently, event-based sensors have emerged as low latency and low energy alternatives to standard frame-based cameras for capturing high-speed motion, effectively speeding up perception and hence navigation. For coherence, all the perception tasks must be trained on the same input data. However, present-day datasets are curated mainly for a single or a handful of tasks and are limited in the rate of the provided ground truths. To address these limitations, we present Flying Event Dataset fOr Reactive behAviour (FEDORA) - a fully synthetic dataset for perception tasks, with raw data from frame-based cameras, event-based cameras, and Inertial Measurement Units (IMU), along with ground truths for depth, pose, and optical flow at a rate much higher than existing datasets.  ( 3 min )
    LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
    arXiv:2402.02544v3 Announce Type: replace-cross Abstract: The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align, and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs' abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.  ( 2 min )
    Visuo-Tactile Pretraining for Cable Plugging
    arXiv:2403.11898v1 Announce Type: cross Abstract: Tactile information is a critical tool for fine-grain manipulation. As humans, we rely heavily on tactile information to understand objects in our environments and how to interact with them. We use touch not only to perform manipulation tasks but also to learn how to perform these tasks. Therefore, to create robotic agents that can learn to complete manipulation tasks at a human or super-human level of performance, we need to properly incorporate tactile information into both skill execution and skill learning. In this paper, we investigate how we can incorporate tactile information into imitation learning platforms to improve performance on complex tasks. To do this, we tackle the challenge of plugging in a USB cable, a dexterous manipulation task that relies on fine-grain visuo-tactile serving. By incorporating tactile information into imitation learning frameworks, we are able to train a robotic agent to plug in a USB cable - a first for imitation learning. Additionally, we explore how tactile information can be used to train non-tactile agents through a contrastive-loss pretraining process. Our results show that by pretraining with tactile information, the performance of a non-tactile agent can be significantly improved, reaching a level on par with visuo-tactile agents. For demonstration videos and access to our codebase, see the project website: https://sites.google.com/andrew.cmu.edu/visuo-tactile-cable-plugging/home  ( 2 min )
    Spike-based Neuromorphic Computing for Next-Generation Computer Vision
    arXiv:2310.09692v2 Announce Type: replace-cross Abstract: Neuromorphic Computing promises orders of magnitude improvement in energy efficiency compared to traditional von Neumann computing paradigm. The goal is to develop an adaptive, fault-tolerant, low-footprint, fast, low-energy intelligent system by learning and emulating brain functionality which can be realized through innovation in different abstraction layers including material, device, circuit, architecture and algorithm. As the energy consumption in complex vision tasks keep increasing exponentially due to larger data set and resource-constrained edge devices become increasingly ubiquitous, spike-based neuromorphic computing approaches can be viable alternative to deep convolutional neural network that is dominating the vision field today. In this book chapter, we introduce neuromorphic computing, outline a few representative examples from different layers of the design stack (devices, circuits and algorithms) and conclude with a few exciting applications and future research directions that seem promising for computer vision in the near future.  ( 2 min )
    Deep Nonnegative Matrix Factorization with Beta Divergences
    arXiv:2309.08249v3 Announce Type: replace Abstract: Deep Nonnegative Matrix Factorization (deep NMF) has recently emerged as a valuable technique for extracting multiple layers of features across different scales. However, all existing deep NMF models and algorithms have primarily centered their evaluation on the least squares error, which may not be the most appropriate metric for assessing the quality of approximations on diverse datasets. For instance, when dealing with data types such as audio signals and documents, it is widely acknowledged that $\beta$-divergences offer a more suitable alternative. In this paper, we develop new models and algorithms for deep NMF using some $\beta$-divergences, with a focus on the Kullback-Leibler divergence. Subsequently, we apply these techniques to the extraction of facial features, the identification of topics within document collections, and the identification of materials within hyperspectral images.  ( 2 min )
    Multi-Objective Evolutionary Neural Architecture Search for Recurrent Neural Networks
    arXiv:2403.11173v1 Announce Type: cross Abstract: Artificial neural network (NN) architecture design is a nontrivial and time-consuming task that often requires a high level of human expertise. Neural architecture search (NAS) serves to automate the design of NN architectures and has proven to be successful in automatically finding NN architectures that outperform those manually designed by human experts. NN architecture performance can be quantified based on multiple objectives, which include model accuracy and some NN architecture complexity objectives, among others. The majority of modern NAS methods that consider multiple objectives for NN architecture performance evaluation are concerned with automated feed forward NN architecture design, which leaves multi-objective automated recurrent neural network (RNN) architecture design unexplored. RNNs are important for modeling sequential datasets, and prominent within the natural language processing domain. It is often the case in real world implementations of machine learning and NNs that a reasonable trade-off is accepted for marginally reduced model accuracy in favour of lower computational resources demanded by the model. This paper proposes a multi-objective evolutionary algorithm-based RNN architecture search method. The proposed method relies on approximate network morphisms for RNN architecture complexity optimisation during evolution. The results show that the proposed method is capable of finding novel RNN architectures with comparable performance to state-of-the-art manually designed RNN architectures, but with reduced computational demand.  ( 2 min )
    Solution of the Probabilistic Lambert Problem: Connections with Optimal Mass Transport, Schr\"odinger Bridge and Reaction-Diffusion PDEs
    arXiv:2401.07961v3 Announce Type: replace-cross Abstract: Lambert's problem concerns with transferring a spacecraft from a given initial to a given terminal position within prescribed flight time via velocity control subject to a gravitational force field. We consider a probabilistic variant of the Lambert problem where the knowledge of the endpoint constraints in position vectors are replaced by the knowledge of their respective joint probability density functions. We show that the Lambert problem with endpoint joint probability density constraints is a generalized optimal mass transport (OMT) problem, thereby connecting this classical astrodynamics problem with a burgeoning area of research in modern stochastic control and stochastic machine learning. This newfound connection allows us to rigorously establish the existence and uniqueness of solution for the probabilistic Lambert problem. The same connection also helps to numerically solve the probabilistic Lambert problem via diffusion regularization, i.e., by leveraging further connection of the OMT with the Schr\"odinger bridge problem (SBP). This also shows that the probabilistic Lambert problem with additive dynamic process noise is in fact a generalized SBP, and can be solved numerically using the so-called Schr\"odinger factors, as we do in this work. We explain how the resulting analysis leads to solving a boundary-coupled system of reaction-diffusion PDEs where the nonlinear gravitational potential appears as the reaction rate. We propose novel algorithms for the same, and present illustrative numerical results. Our analysis and the algorithmic framework are nonparametric, i.e., we make neither statistical (e.g., Gaussian, first few moments, mixture or exponential family, finite dimensionality of the sufficient statistic) nor dynamical (e.g., Taylor series) approximations.  ( 3 min )
    VLLaVO: Mitigating Visual Gap through LLMs
    arXiv:2401.03253v2 Announce Type: replace-cross Abstract: Recent advances achieved by deep learning models rely on the independent and identically distributed assumption, hindering their applications in real-world scenarios with domain shifts. To tackle this issue, cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data. However, in visual cross-domain learning, traditional methods concentrate solely on the image modality, disregarding the potential benefits of incorporating the text modality. In this work, we propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners. VLLaVO uses vision-language models to convert images into detailed textual descriptions. A large language model is then finetuned on textual descriptions of the source/target domain generated by a designed instruction template. Extensive experimental results under domain generalization and unsupervised domain adaptation settings demonstrate the effectiveness of the proposed method.  ( 2 min )
    Dataset Distillation via the Wasserstein Metric
    arXiv:2311.18531v2 Announce Type: replace-cross Abstract: Dataset Distillation (DD) emerges as a powerful strategy to encapsulate the expansive information of large datasets into significantly smaller, synthetic equivalents, thereby preserving model performance with reduced computational overhead. Pursuing this objective, we introduce the Wasserstein distance, a metric grounded in optimal transport theory, to enhance distribution matching in DD. Our approach employs the Wasserstein barycenter to provide a geometrically meaningful method for quantifying distribution differences and capturing the centroid of distribution sets efficiently. By embedding synthetic data in the feature spaces of pretrained classification models, we facilitate effective distribution matching that leverages prior knowledge inherent in these models. Our method not only maintains the computational advantages of distribution matching-based techniques but also achieves new state-of-the-art performance across a range of high-resolution datasets. Extensive testing demonstrates the effectiveness and adaptability of our method, underscoring the untapped potential of Wasserstein metrics in dataset distillation.  ( 2 min )
    UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
    arXiv:2311.15599v2 Announce Type: replace-cross Abstract: Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%), demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. All the code and models are publicly available on GitHub and Huggingface.  ( 3 min )
    Multinomial belief networks
    arXiv:2311.16909v2 Announce Type: replace-cross Abstract: A Bayesian approach to machine learning is attractive when we need to quantify uncertainty, deal with missing observations, when samples are scarce, or when the data is sparse. All of these commonly apply when analysing healthcare data. To address these analytical requirements, we propose a deep generative model for multinomial count data where both the weights and hidden units of the network are Dirichlet distributed. A Gibbs sampling procedure is formulated that takes advantage of a series of augmentation relations, analogous to the Zhou--Cong--Chen model. We apply the model on small handwritten digits, and a large experimental dataset of DNA mutations in cancer, and we show how the model is able to extract biologically meaningful meta-signatures in a fully data-driven way.  ( 2 min )
    Revealing behavioral impact on mobility prediction networks through causal interventions
    arXiv:2311.11749v2 Announce Type: replace-cross Abstract: Deep neural networks are increasingly utilized in mobility prediction tasks, yet their intricate internal workings pose challenges for interpretability, especially in comprehending how various aspects of mobility behavior affect predictions. This study introduces a causal intervention framework to assess the impact of mobility-related factors on neural networks designed for next location prediction -- a task focusing on predicting the immediate next location of an individual. To achieve this, we employ individual mobility models to generate synthetic location visit sequences and control behavior dynamics by intervening in their data generation process. We evaluate the interventional location sequences using mobility metrics and input them into well-trained networks to analyze performance variations. The results demonstrate the effectiveness in producing location sequences with distinct mobility behaviors, thereby facilitating the simulation of diverse yet realistic spatial and temporal changes. These changes result in performance fluctuations in next location prediction networks, revealing impacts of critical mobility behavior factors, including sequential patterns in location transitions, proclivity for exploring new locations, and preferences in location choices at population and individual levels. The gained insights hold significant value for the real-world application of mobility prediction networks, and the framework is expected to promote the use of causal inference for enhancing the interpretability and robustness of neural networks in mobility applications.  ( 3 min )
    Unit Commitment Predictor With a Performance Guarantee: A Support Vector Machine Classifier
    arXiv:2310.08601v2 Announce Type: replace-cross Abstract: The system operators usually need to solve large-scale unit commitment problems within limited time frame for computation. This paper provides a pragmatic solution, showing how by learning and predicting the on/off commitment decisions of conventional units, there is a potential for system operators to warm start their solver and speed up their computation significantly. For the prediction, we train linear and kernelized support vector machine classifiers, providing an out-of-sample performance guarantee if properly regularized, converting to distributionally robust classifiers. For the unit commitment problem, we solve a mixed-integer second-order cone problem. Our results based on the IEEE 6- and 118-bus test systems show that the kernelized SVM with proper regularization outperforms other classifiers, reducing the computational time by a factor of 1.7. In addition, if there is a tight computational limit, while the unit commitment problem without warm start is far away from the optimal solution, its warmly-started version can be solved to (near) optimality within the time limit.  ( 2 min )
    Generalization in diffusion models arises from geometry-adaptive harmonic representations
    arXiv:2310.02557v2 Announce Type: replace-cross Abstract: Deep neural networks (DNNs) trained for image denoising are able to generate high-quality samples with score-based reverse diffusion algorithms. These impressive capabilities seem to imply an escape from the curse of dimensionality, but recent reports of memorization of the training set raise the question of whether these networks are learning the "true" continuous density of the data. Here, we show that two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, when the number of training images is large enough. In this regime of strong generalization, diffusion-generated images are distinct from the training set, and are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density. We analyze the learned denoising functions and show that the inductive biases give rise to a shrinkage operation in a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous regions. We demonstrate that trained denoisers are inductively biased towards these geometry-adaptive harmonic bases since they arise not only when the network is trained on photographic images, but also when it is trained on image classes supported on low-dimensional manifolds for which the harmonic basis is suboptimal. Finally, we show that when trained on regular image classes for which the optimal basis is known to be geometry-adaptive and harmonic, the denoising performance of the networks is near-optimal.  ( 3 min )
    Knolling Bot: Learning Robotic Object Arrangement from Tidy Demonstrations
    arXiv:2310.04566v2 Announce Type: replace-cross Abstract: Addressing the challenge of organizing scattered items in domestic spaces is complicated by the diversity and subjective nature of tidiness. Just as the complexity of human language allows for multiple expressions of the same idea, household tidiness preferences and organizational patterns vary widely, so presetting object locations would limit the adaptability to new objects and environments. Inspired by advancements in natural language processing (NLP), this paper introduces a self-supervised learning framework that allows robots to understand and replicate the concept of tidiness from demonstrations of well-organized layouts, akin to using conversational datasets to train Large Language Models(LLM). We leverage a transformer neural network to predict the placement of subsequent objects. We demonstrate a ``knolling'' system with a robotic arm and an RGB camera to organize items of varying sizes and quantities on a table. Our method not only trains a generalizable concept of tidiness, enabling the model to provide diverse solutions and adapt to different numbers of objects, but it can also incorporate human preferences to generate customized tidy tables without explicit target positions for each object.  ( 3 min )
    L2MAC: Large Language Model Automatic Computer for Extensive Code Generation
    arXiv:2310.02003v3 Announce Type: replace-cross Abstract: Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and coherent outputs. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long output generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based stored-program automatic computer (von Neumann architecture) framework, an LLM-based multi-agent system, for long and consistent output generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction in turn is executed by a separate LLM agent, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate extensive outputs, bypassing the constraints of the finite context window while producing outputs that fulfill a complex user-specified task. We empirically demonstrate that L2MAC achieves state-of-the-art performance in generating large codebases for system design tasks, significantly outperforming other coding methods in implementing the detailed user-specified task, and we provide valuable insights into the reasons for this performance gap.  ( 3 min )
    Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning
    arXiv:2310.00229v4 Announce Type: replace-cross Abstract: Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods.  ( 2 min )
    Learning to Make Adherence-Aware Advice
    arXiv:2310.00817v2 Announce Type: replace-cross Abstract: As artificial intelligence (AI) systems play an increasingly prominent role in human decision-making, challenges surface in the realm of human-AI interactions. One challenge arises from the suboptimal AI policies due to the inadequate consideration of humans disregarding AI recommendations, as well as the need for AI to provide advice selectively when it is most pertinent. This paper presents a sequential decision-making model that (i) takes into account the human's adherence level (the probability that the human follows/rejects machine advice) and (ii) incorporates a defer option so that the machine can temporarily refrain from making advice. We provide learning algorithms that learn the optimal advice policy and make advice only at critical time stamps. Compared to problem-agnostic reinforcement learning algorithms, our specialized learning algorithms not only enjoy better theoretical convergence properties but also show strong empirical performance.  ( 2 min )
    Sparse Index Tracking: Simultaneous Asset Selection and Capital Allocation via $\ell_0$-Constrained Portfolio
    arXiv:2309.10152v3 Announce Type: replace-cross Abstract: Sparse index tracking is a prominent passive portfolio management strategy that constructs a sparse portfolio to track a financial index. A sparse portfolio is preferable to a full portfolio in terms of reducing transaction costs and avoiding illiquid assets. To achieve portfolio sparsity, conventional studies have utilized $\ell_p$-norm regularizations as a continuous surrogate of the $\ell_0$-norm regularization. Although these formulations can construct sparse portfolios, their practical application is challenging due to the intricate and time-consuming process of tuning parameters to define the precise upper limit of assets in the portfolio. In this paper, we propose a new problem formulation of sparse index tracking using an $\ell_0$-norm constraint that enables easy control of the upper bound on the number of assets in the portfolio. Moreover, our approach offers a choice between constraints on portfolio and turnover sparsity, further reducing transaction costs by limiting asset updates at each rebalancing interval. Furthermore, we develop an efficient algorithm for solving this problem based on a primal-dual splitting method. Finally, we illustrate the effectiveness of the proposed method through experiments on the S&P500 and Russell3000 index datasets.  ( 3 min )
    Temporal Interest Network for User Response Prediction
    arXiv:2308.08487v2 Announce Type: replace-cross Abstract: User response prediction is essential in industrial recommendation systems, such as online display advertising. Among all the features in recommendation models, user behaviors are among the most critical. Many works have revealed that a user's behavior reflects her interest in the candidate item, owing to the semantic or temporal correlation between behaviors and the candidate. While the literature has individually examined each of these correlations, researchers have yet to analyze them in combination, that is, the semantic-temporal correlation. We empirically measure this correlation and observe intuitive yet robust patterns. We then examine several popular user interest models and find that, surprisingly, none of them learn such correlation well. To fill this gap, we propose a Temporal Interest Network (TIN) to capture the semantic-temporal correlation simultaneously between behaviors and the target. We achieve this by incorporating target-aware temporal encoding, in addition to semantic encoding, to represent behaviors and the target. Furthermore, we conduct explicit 4-way interaction by deploying target-aware attention and target-aware representation to capture both semantic and temporal correlation. We conduct comprehensive evaluations on two popular public datasets, and our proposed TIN outperforms the best-performing baselines by 0.43% and 0.29% on GAUC, respectively. During online A/B testing in Tencent's advertising platform, TIN achieves 1.65% cost lift and 1.93% GMV lift over the base model. It has been successfully deployed in production since October 2023, serving the WeChat Moments traffic. We have released our code at https://github.com/zhouxy1003/TIN.  ( 3 min )
    Online learning in bandits with predicted context
    arXiv:2307.13916v3 Announce Type: replace-cross Abstract: We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.  ( 2 min )
    Properties of Discrete Sliced Wasserstein Losses
    arXiv:2307.10352v3 Announce Type: replace-cross Abstract: The Sliced Wasserstein (SW) distance has become a popular alternative to the Wasserstein distance for comparing probability measures. Widespread applications include image processing, domain adaptation and generative modelling, where it is common to optimise some parameters in order to minimise SW, which serves as a loss function between discrete probability measures (since measures admitting densities are numerically unattainable). All these optimisation problems bear the same sub-problem, which is minimising the Sliced Wasserstein energy. In this paper we study the properties of $\mathcal{E}: Y \longmapsto \mathrm{SW}_2^2(\gamma_Y, \gamma_Z)$, i.e. the SW distance between two uniform discrete measures with the same amount of points as a function of the support $Y \in \mathbb{R}^{n \times d}$ of one of the measures. We investigate the regularity and optimisation properties of this energy, as well as its Monte-Carlo approximation $\mathcal{E}_p$ (estimating the expectation in SW using only $p$ samples) and show convergence results on the critical points of $\mathcal{E}_p$ to those of $\mathcal{E}$, as well as an almost-sure uniform convergence. Finally, we show that in a certain sense, Stochastic Gradient Descent methods minimising $\mathcal{E}$ and $\mathcal{E}_p$ converge towards (Clarke) critical points of these energies.  ( 2 min )
    In-context Autoencoder for Context Compression in a Large Language Model
    arXiv:2307.06945v3 Announce Type: replace-cross Abstract: We propose the In-context Autoencoder (ICAE), leveraging the power of a large language models (LLM) to compress a long context into short compact memory slots that can be directly conditioned on by the LLM for various purposes. ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context; Then, it is fine-tuned on instruction data for producing desirable responses to various prompts. Experiments demonstrate that our lightweight ICAE, introducing about 1% additional parameters, effectively achieves $4\times$ context compression based on Llama, offering advantages in both improved latency and GPU memory cost during inference, and showing an interesting insight in memorization as well as potential for scalability. These promising results imply a novel perspective on the connection between working memory in cognitive science and representation learning in LLMs, revealing ICAE's significant implications in addressing the long context problem and suggesting further research in LLM context management. Our data, code and models are available at https://github.com/getao/icae.  ( 2 min )
    The Expressive Leaky Memory Neuron: an Efficient and Expressive Phenomenological Neuron Model Can Solve Long-Horizon Tasks
    arXiv:2306.16922v3 Announce Type: replace-cross Abstract: Biological cortical neurons are remarkably sophisticated computational devices, temporally integrating their vast synaptic input over an intricate dendritic tree, subject to complex, nonlinearly interacting internal biological processes. A recent study proposed to characterize this complexity by fitting accurate surrogate models to replicate the input-output relationship of a detailed biophysical cortical pyramidal neuron model and discovered it needed temporal convolutional networks (TCN) with millions of parameters. Requiring these many parameters, however, could stem from a misalignment between the inductive biases of the TCN and cortical neuron's computations. In light of this, and to explore the computational implications of leaky memory units and nonlinear dendritic processing, we introduce the Expressive Leaky Memory (ELM) neuron model, a biologically inspired phenomenological model of a cortical neuron. Remarkably, by exploiting such slowly decaying memory-like hidden states and two-layered nonlinear integration of synaptic input, our ELM neuron can accurately match the aforementioned input-output relationship with under ten thousand trainable parameters. To further assess the computational ramifications of our neuron design, we evaluate it on various tasks with demanding temporal structures, including the Long Range Arena (LRA) datasets, as well as a novel neuromorphic dataset based on the Spiking Heidelberg Digits dataset (SHD-Adding). Leveraging a larger number of memory units with sufficiently long timescales, and correspondingly sophisticated synaptic integration, the ELM neuron displays substantial long-range processing capabilities, reliably outperforming the classic Transformer or Chrono-LSTM architectures on LRA, and even solving the Pathfinder-X task with over 70% accuracy (16k context length).  ( 3 min )
    Protein Discovery with Discrete Walk-Jump Sampling
    arXiv:2306.12360v2 Announce Type: replace-cross Abstract: We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.  ( 2 min )
    Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space
    arXiv:2306.02574v3 Announce Type: replace-cross Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state space $\mathcal X=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queuing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.  ( 3 min )
    Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
    arXiv:2305.15852v3 Announce Type: replace-cross Abstract: Large language models (large LMs) are susceptible to producing text that contains hallucinated content. An important instance of this problem is self-contradiction, where the LM generates two contradictory sentences within the same context. In this work, we present a comprehensive investigation into self-contradiction for various instruction-tuned LMs, covering evaluation, detection, and mitigation. Our primary evaluation task is open-domain text generation, but we also demonstrate the applicability of our approach to shorter question answering. Our analysis reveals the prevalence of self-contradictions, e.g., in 17.7% of all sentences produced by ChatGPT. We then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. Our detector achieves high accuracy, e.g., around 80% F1 score when prompting ChatGPT. The mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. Importantly, our entire framework is applicable to black-box LMs and does not require retrieval of external knowledge. Rather, our method complements retrieval-based methods, as a large portion of self-contradictions (e.g., 35.2% for ChatGPT) cannot be verified using online text. Our approach is practically effective and has been released as a push-button tool to benefit the public at https://chatprotect.ai/.  ( 2 min )
    Differentially Private Latent Diffusion Models
    arXiv:2305.15759v4 Announce Type: replace-cross Abstract: Diffusion models (DMs) are widely used for generating high-quality high-dimensional images in a non-differentially private manner. To address this challenge, recent papers suggest pre-training DMs with public data, then fine-tuning them with private data using DP-SGD for a relatively short period. In this paper, we further improve the current state of DMs with DP by adopting the Latent Diffusion Models (LDMs). LDMs are equipped with powerful pre-trained autoencoders that map the high-dimensional pixels into lower-dimensional latent representations, in which DMs are trained, yielding a more efficient and fast training of DMs. In our algorithm, DP-LDMs, rather than fine-tuning the entire DMs, we fine-tune only the attention modules of LDMs at varying layers with privacy-sensitive data, reducing the number of trainable parameters by roughly 90% and achieving a better accuracy, compared to fine-tuning the entire DMs. The smaller parameter space to fine-tune with DP-SGD helps our algorithm to achieve new state-of-the-art results in several public-private benchmark data pairs.Our approach also allows us to generate more realistic, high-dimensional images (256x256) and those conditioned on text prompts with differential privacy, which have not been attempted before us, to the best of our knowledge. Our approach provides a promising direction for training more powerful, yet training-efficient differentially private DMs, producing high-quality high-dimensional DP images.  ( 3 min )
    LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation
    arXiv:2305.11490v5 Announce Type: replace-cross Abstract: Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO. This direction of research is particularly relevant to medical imaging because medical image analysis and generation consist of reasoning based on a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs; but presumably, in order to achieve maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks. The code is at https://github.com/hyn2028/llm-cxr.  ( 3 min )
    Evaluating how interactive visualizations can assist in finding samples where and how computer vision models make mistakes
    arXiv:2305.11927v2 Announce Type: replace-cross Abstract: Creating Computer Vision (CV) models remains a complex practice, despite their ubiquity. Access to data, the requirement for ML expertise, and model opacity are just a few points of complexity that limit the ability of end-users to build, inspect, and improve these models. Interactive ML perspectives have helped address some of these issues by considering a teacher in the loop where planning, teaching, and evaluating tasks take place. We present and evaluate two interactive visualizations in the context of Sprite, a system for creating CV classification and detection models for images originating from videos. We study how these visualizations help Sprite's users identify (evaluate) and select (plan) images where a model is struggling and can lead to improved performance, compared to a baseline condition where users used a query language. We found that users who had used the visualizations found more images across a wider set of potential types of model errors.  ( 3 min )
    The Update-Equivalence Framework for Decision-Time Planning
    arXiv:2304.13138v2 Announce Type: replace-cross Abstract: The process of revising (or constructing) a policy at execution time -- known as decision-time planning -- has been key to achieving superhuman performance in perfect-information games like chess and Go. A recent line of work has extended decision-time planning to imperfect-information games, leading to superhuman performance in poker. However, these methods involve solving subgames whose sizes grow quickly in the amount of non-public information, making them unhelpful when the amount of non-public information is large. Motivated by this issue, we introduce an alternative framework for decision-time planning that is not based on solving subgames, but rather on update equivalence. In this update-equivalence framework, decision-time planning algorithms replicate the updates of last-iterate algorithms, which need not rely on public information. This facilitates scalability to games with large amounts of non-public information. Using this framework, we derive a provably sound search algorithm for fully cooperative games based on mirror descent and a search algorithm for adversarial games based on magnetic mirror descent. We validate the performance of these algorithms in cooperative and adversarial domains, notably in Hanabi, the standard benchmark for search in fully cooperative imperfect-information games. Here, our mirror descent approach exceeds or matches the performance of public information-based search while using two orders of magnitude less search time. This is the first instance of a non-public-information-based algorithm outperforming public-information-based approaches in a domain they have historically dominated.  ( 3 min )
    CoSMo: a Framework to Instantiate Conditioned Process Simulation Models
    arXiv:2303.17879v3 Announce Type: replace-cross Abstract: Process simulation is gaining attention for its ability to assess potential performance improvements and risks associated with business process changes. The existing literature presents various techniques, generally grounded in process models discovered from event logs or built upon deep learning algorithms. These techniques have specific strengths and limitations. Traditional approaches rooted in process models offer increased interpretability, while those using deep learning excel at generalizing changes across large event logs. However, the practical application of deep learning faces challenges related to managing stochasticity and integrating information for what-if analysis. This paper introduces a novel recurrent neural architecture tailored to discover COnditioned process Simulation MOdels (CoSMo) based on user-based constraints or any other nature of a-priori knowledge. This architecture facilitates the simulation of event logs that adhere to specific constraints by incorporating declarative-based rules into the learning phase as an attempt to fill the gap of incorporating information into deep learning models to perform what-if analysis. Experimental validation illustrates CoSMo's efficacy in simulating event logs while adhering to predefined declarative conditions, emphasizing both control-flow and data-flow perspectives.  ( 2 min )
    Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
    arXiv:2302.02257v4 Announce Type: replace-cross Abstract: In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.  ( 2 min )
    Copula-based transferable models for synthetic population generation
    arXiv:2302.09193v2 Announce Type: replace-cross Abstract: Population synthesis involves generating synthetic yet realistic representations of a target population of micro-agents for behavioral modeling and simulation. Traditional methods, often reliant on target population samples, such as census data or travel surveys, face limitations due to high costs and small sample sizes, particularly at smaller geographical scales. We propose a novel framework based on copulas to generate synthetic data for target populations where only empirical marginal distributions are known. This method utilizes samples from different populations with similar marginal dependencies, introduces a spatial component into population synthesis, and considers various information sources for more realistic generators. Concretely, the process involves normalizing the data and treat it as realizations of a given copula, and then training a generative model before incorporating the information on the marginals of the target population. Utilizing American Community Survey data, we assess our framework's performance through standardized root mean squared error (SRMSE) and so-called sampled zeros. We focus on its capacity to transfer a model learned from one population to another. Our experiments include transfer tests between regions at the same geographical level as well as to lower geographical levels, hence evaluating the framework's adaptability in varied spatial contexts. We compare Bayesian Networks, Variational Autoencoders, and Generative Adversarial Networks, both individually and combined with our copula framework. Results show that the copula enhances machine learning methods in matching the marginals of the reference data. Furthermore, it consistently surpasses Iterative Proportional Fitting in terms of SRMSE in the transferability experiments, while introducing unique observations not found in the original training sample.  ( 3 min )
    On Large-Scale Multiple Testing Over Networks: An Asymptotic Approach
    arXiv:2211.16059v4 Announce Type: replace-cross Abstract: This work concerns developing communication- and computation-efficient methods for large-scale multiple testing over networks, which is of interest to many practical applications. We take an asymptotic approach and propose two methods, proportion-matching and greedy aggregation, tailored to distributed settings. The proportion-matching method achieves the global BH performance yet only requires a one-shot communication of the (estimated) proportion of true null hypotheses as well as the number of p-values at each node. By focusing on the asymptotic optimal power, we go beyond the BH procedure by providing an explicit characterization of the asymptotic optimal solution. This leads to the greedy aggregation method that effectively approximates the optimal rejection regions at each node, while computation efficiency comes from the greedy-type approach naturally. Moreover, for both methods, we provide the rate of convergence for both the FDR and power. Extensive numerical results over a variety of challenging settings are provided to support our theoretical findings.  ( 2 min )
    Local and Regional Counterfactual Rules: Summarized and Robust Recourses
    arXiv:2209.14568v3 Announce Type: replace-cross Abstract: Counterfactual Explanations (CE) face several unresolved challenges, such as ensuring stability, synthesizing multiple CEs, and providing plausibility and sparsity guarantees. From a more practical point of view, recent studies [Pawelczyk et al., 2022] show that the prescribed counterfactual recourses are often not implemented exactly by individuals and demonstrate that most state-of-the-art CE algorithms are very likely to fail in this noisy environment. To address these issues, we propose a probabilistic framework that gives a sparse local counterfactual rule for each observation, providing rules that give a range of values capable of changing decisions with high probability. These rules serve as a summary of diverse counterfactual explanations and yield robust recourses. We further aggregate these local rules into a regional counterfactual rule, identifying shared recourses for subgroups of the data. Our local and regional rules are derived from the Random Forest algorithm, which offers statistical guarantees and fidelity to data distribution by selecting recourses in high-density regions. Moreover, our rules are sparse as we first select the smallest set of variables having a high probability of changing the decision. We have conducted experiments to validate the effectiveness of our counterfactual rules in comparison to standard CE and recent similar attempts. Our methods are available as a Python package.  ( 3 min )
    A review of predictive uncertainty estimation with machine learning
    arXiv:2209.08307v2 Announce Type: replace-cross Abstract: Predictions and forecasts of machine learning models should take the form of probability distributions, aiming to increase the quantity of information communicated to end users. Although applications of probabilistic prediction and forecasting with machine learning models in academia and industry are becoming more frequent, related concepts and methods have not been formalized and structured under a holistic view of the entire field. Here, we review the topic of predictive uncertainty estimation with machine learning algorithms, as well as the related metrics (consistent scoring functions and proper scoring rules) for assessing probabilistic predictions. The review covers a time period spanning from the introduction of early statistical (linear regression and time series models, based on Bayesian statistics or quantile regression) to recent machine learning algorithms (including generalized additive models for location, scale and shape, random forests, boosting and deep learning algorithms) that are more flexible by nature. The review of the progress in the field, expedites our understanding on how to develop new algorithms tailored to users' needs, since the latest advancements are based on some fundamental concepts applied to more complex algorithms. We conclude by classifying the material and discussing challenges that are becoming a hot topic of research.  ( 3 min )
    Statistical Properties of the log-cosh Loss Function Used in Machine Learning
    arXiv:2208.04564v4 Announce Type: replace-cross Abstract: This paper analyzes a popular loss function used in machine learning called the log-cosh loss function. A number of papers have been published using this loss function but, to date, no statistical analysis has been presented in the literature. In this paper, we present the distribution function from which the log-cosh loss arises. We compare it to a similar distribution, called the Cauchy distribution, and carry out various statistical procedures that characterize its properties. In particular, we examine its associated pdf, cdf, likelihood function and Fisher information. Side-by-side we consider the Cauchy and Cosh distributions as well as the MLE of the location parameter with asymptotic bias, asymptotic variance, and confidence intervals. We also provide a comparison of robust estimators from several other loss functions, including the Huber loss function and the rank dispersion function. Further, we examine the use of the log-cosh function for quantile regression. In particular, we identify a quantile distribution function from which a maximum likelihood estimator for quantile regression can be derived. Finally, we compare a quantile M-estimator based on log-cosh with robust monotonicity against another approach to quantile regression based on convolutional smoothing.  ( 3 min )
    Random Forest Weighted Local Fr\'echet Regression with Random Objects
    arXiv:2202.04912v4 Announce Type: replace-cross Abstract: Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and M\"uller (2019) established a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method uses these weights as the local average to solve the conditional Fr\'echet mean, while the second method performs local linear Fr\'echet regression, both significantly improving existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order $M_{m_n}$-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data.  ( 3 min )
    Digital Twin Data Modelling by Randomized Orthogonal Decomposition and Deep Learning
    arXiv:2206.08659v2 Announce Type: replace-cross Abstract: A digital twin is a surrogate model that has the main feature to mirror the original process behavior. Associating the dynamical process with a digital twin model of reduced complexity has the significant advantage to map the dynamics with high accuracy and reduced costs in CPU time and hardware to timescales over which that suffers significantly changes and so it is difficult to explore. This paper introduces a new framework for creating efficient digital twin models of fluid flows. We introduce a novel algorithm that combines the advantages of Krylov based dynamic mode decomposition with proper orthogonal decomposition and outperforms the selection of the most influential modes. We prove that randomized orthogonal decomposition algorithm provides several advantages over SVD empirical orthogonal decomposition methods and mitigates the projection error formulating a multiobjective optimization problem.We involve the state-of-the-art artificial intelligence Deep Learning (DL) to perform a real-time adaptive calibration of the digital twin model, with increasing fidelity. The output is a high-fidelity DIGITAL TWIN DATA MODEL of the fluid flow dynamics, with the advantage of a reduced complexity. The new modelling tools are investigated in the numerical simulation of three wave phenomena with increasing complexity. We show that the outputs are consistent with the original source data.We perform a thorough assessment of the performance of the new digital twin data models, in terms of numerical accuracy and computational efficiency, including a time simulation response feature study.  ( 3 min )
    Enhancing Continuous Domain Adaptation with Multi-Path Transfer Curriculum
    arXiv:2402.16681v2 Announce Type: replace Abstract: Addressing the large distribution gap between training and testing data has long been a challenge in machine learning, giving rise to fields such as transfer learning and domain adaptation. Recently, Continuous Domain Adaptation (CDA) has emerged as an effective technique, closing this gap by utilizing a series of intermediate domains. This paper contributes a novel CDA method, W-MPOT, which rigorously addresses the domain ordering and error accumulation problems overlooked by previous studies. Specifically, we construct a transfer curriculum over the source and intermediate domains based on Wasserstein distance, motivated by theoretical analysis of CDA. Then we transfer the source model to the target domain through multiple valid paths in the curriculum using a modified version of continuous optimal transport. A bidirectional path consistency constraint is introduced to mitigate the impact of accumulated mapping errors during continuous transfer. We extensively evaluate W-MPOT on multiple datasets, achieving up to 54.1\% accuracy improvement on multi-session Alzheimer MR image classification and 94.7\% MSE reduction on battery capacity estimation.  ( 2 min )
    Revisiting Recurrent Reinforcement Learning with Memory Monoids
    arXiv:2402.09900v2 Announce Type: replace Abstract: Memory models such as Recurrent Neural Networks (RNNs) and Transformers address Partially Observable Markov Decision Processes (POMDPs) by mapping trajectories to latent Markov states. Neither model scales particularly well to long sequences, especially compared to an emerging class of memory models sometimes called linear recurrent models. We discover that we can model the recurrent update of these models using a monoid, leading us to reformulate existing models using a novel memory monoid framework. We revisit the traditional approach to batching in recurrent RL, highlighting both theoretical and empirical deficiencies. We leverage the properties of memory monoids to propose a batching method that improves sample efficiency, increases the return, and simplifies the implementation of recurrent loss functions in RL.  ( 2 min )
    A Novel Gaussian Min-Max Theorem and its Applications
    arXiv:2402.07356v2 Announce Type: replace Abstract: A celebrated result by Gordon allows one to compare the min-max behavior of two Gaussian processes if certain inequality conditions are met. The consequences of this result include the Gaussian min-max (GMT) and convex Gaussian min-max (CGMT) theorems which have had far-reaching implications in high-dimensional statistics, machine learning, non-smooth optimization, and signal processing. Both theorems rely on a pair of Gaussian processes, first identified by Slepian, that satisfy Gordon's comparison inequalities. To date, no other pair of Gaussian processes satisfying these inequalities has been discovered. In this paper, we identify such a new pair. The resulting theorems extend the classical GMT and CGMT Theorems from the case where the underlying Gaussian matrix in the primary process has iid rows to where it has independent but non-identically-distributed ones. The new CGMT is applied to the problems of multi-source Gaussian regression, as well as to binary classification of general Gaussian mixture models.  ( 2 min )
    Hidden in Plain Sight: Undetectable Adversarial Bias Attacks on Vulnerable Patient Populations
    arXiv:2402.05713v2 Announce Type: replace Abstract: The proliferation of artificial intelligence (AI) in radiology has shed light on the risk of deep learning (DL) models exacerbating clinical biases towards vulnerable patient populations. While prior literature has focused on quantifying biases exhibited by trained DL models, demographically targeted adversarial bias attacks on DL models and its implication in the clinical environment remains an underexplored field of research in medical imaging. In this work, we demonstrate that demographically targeted label poisoning attacks can introduce undetectable underdiagnosis bias in DL models. Our results across multiple performance metrics and demographic groups like sex, age, and their intersectional subgroups show that adversarial bias attacks demonstrate high-selectivity for bias in the targeted group by degrading group model performance without impacting overall model performance. Furthermore, our results indicate that adversarial bias attacks result in biased DL models that propagate prediction bias even when evaluated with external datasets.  ( 2 min )
    How Free is Parameter-Free Stochastic Optimization?
    arXiv:2402.03126v2 Announce Type: replace Abstract: We study the problem of parameter-free stochastic optimization, inquiring whether, and under what conditions, do fully parameter-free methods exist: these are methods that achieve convergence rates competitive with optimally tuned methods, without requiring significant knowledge of the true problem parameters. Existing parameter-free methods can only be considered ``partially'' parameter-free, as they require some non-trivial knowledge of the true problem parameters, such as a bound on the stochastic gradient norms, a bound on the distance to a minimizer, etc. In the non-convex setting, we demonstrate that a simple hyperparameter search technique results in a fully parameter-free method that outperforms more sophisticated state-of-the-art algorithms. We also provide a similar result in the convex setting with access to noisy function values under mild noise assumptions. Finally, assuming only access to stochastic gradients, we establish a lower bound that renders fully parameter-free stochastic convex optimization infeasible, and provide a method which is (partially) parameter-free up to the limit indicated by our lower bound.  ( 2 min )
    EdgeOL: Efficient in-situ Online Learning on Edge Devices
    arXiv:2401.16694v2 Announce Type: replace Abstract: Emerging applications, such as robot-assisted eldercare and object recognition, generally employ deep learning neural networks (DNNs) and naturally require: i) handling streaming-in inference requests and ii) adapting to possible deployment scenario changes. Online model fine-tuning is widely adopted to satisfy these needs. However, an inappropriate fine-tuning scheme could involve significant energy consumption, making it challenging to deploy on edge devices. In this paper, we propose EdgeOL, an edge online learning framework that optimizes inference accuracy, fine-tuning execution time, and energy efficiency through both inter-tuning and intra-tuning optimizations. Experimental results show that, on average, EdgeOL reduces overall fine-tuning execution time by 64%, energy consumption by 52%, and improves average inference accuracy by 1.75% over the immediate online learning strategy.  ( 2 min )
    Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction
    arXiv:2401.15603v2 Announce Type: replace Abstract: In recent years, spectral graph neural networks, characterized by polynomial filters, have garnered increasing attention and have achieved remarkable performance in tasks such as node classification. These models typically assume that eigenvalues for the normalized Laplacian matrix are distinct from each other, thus expecting a polynomial filter to have a high fitting ability. However, this paper empirically observes that normalized Laplacian matrices frequently possess repeated eigenvalues. Moreover, we theoretically establish that the number of distinguishable eigenvalues plays a pivotal role in determining the expressive power of spectral graph neural networks. In light of this observation, we propose an eigenvalue correction strategy that can free polynomial filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy enhances the uniform distribution of eigenvalues, thus mitigating repeated eigenvalues, and improving the fitting capacity and expressive power of polynomial filters. Extensive experimental results on both synthetic and real-world datasets demonstrate the superiority of our method.  ( 2 min )
    Understanding Domain Generalization: A Noise Robustness Perspective
    arXiv:2401.14846v2 Announce Type: replace Abstract: Despite the rapid development of machine learning algorithms for domain generalization (DG), there is no clear empirical evidence that the existing DG algorithms outperform the classic empirical risk minimization (ERM) across standard benchmarks. To better understand this phenomenon, we investigate whether there are benefits of DG algorithms over ERM through the lens of label noise. Specifically, our finite-sample analysis reveals that label noise exacerbates the effect of spurious correlations for ERM, undermining generalization. Conversely, we illustrate that DG algorithms exhibit implicit label-noise robustness during finite-sample training even when spurious correlation is present. Such desirable property helps mitigate spurious correlations and improve generalization in synthetic experiments. However, additional comprehensive experiments on real-world benchmark datasets indicate that label-noise robustness does not necessarily translate to better performance compared to ERM. We conjecture that the failure mode of ERM arising from spurious correlations may be less pronounced in practice.  ( 2 min )
    Ricci flow-guided autoencoders in learning time-dependent dynamics
    arXiv:2401.14591v5 Announce Type: replace Abstract: We present a manifold-based autoencoder method for learning nonlinear dynamics in time, notably partial differential equations (PDEs), in which the manifold latent space evolves according to Ricci flow. This can be accomplished by simulating Ricci flow in a physics-informed setting, and manifold quantities can be matched so that Ricci flow is empirically achieved. With our methodology, the manifold is learned as part of the training procedure, so ideal geometries may be discerned, while the evolution simultaneously induces a more accommodating latent representation over static methods. We present our method on a range of numerical experiments consisting of PDEs that encompass desirable characteristics such as periodicity and randomness, remarking error on in-distribution and extrapolation scenarios.  ( 2 min )
    Graph-Informed Neural Networks for Sparse Grid-Based Discontinuity Detectors
    arXiv:2401.13652v3 Announce Type: replace Abstract: In this paper, we present a novel approach for detecting the discontinuity interfaces of a discontinuous function. This approach leverages Graph-Informed Neural Networks (GINNs) and sparse grids to address discontinuity detection also in domains of dimension larger than 3. GINNs, trained to identify troubled points on sparse grids, exploit graph structures built on the grids to achieve efficient and accurate discontinuity detection performances. We also introduce a recursive algorithm for general sparse grid-based detectors, characterized by convergence properties and easy applicability. Numerical experiments on functions with dimensions n = 2 and n = 4 demonstrate the efficiency and robust generalization of GINNs in detecting discontinuity interfaces. Notably, the trained GINNs offer portability and versatility, allowing integration into various algorithms and sharing among users.  ( 2 min )
    Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis
    arXiv:2401.10383v2 Announce Type: replace Abstract: In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arrival at each node, agents observe a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture some transformation of the reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(\gamma N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$ and $\gamma$ a boundedness parameter associated with the weight functions. Lastly, we numerically test our algorithm by comparing it to alternative methods.  ( 2 min )
    The Power of Training: How Different Neural Network Setups Influence the Energy Demand
    arXiv:2401.01851v2 Announce Type: replace Abstract: This work offers a heuristic evaluation of the effects of variations in machine learning training regimes and learning paradigms on the energy consumption of computing, especially HPC hardware with a life-cycle aware perspective. While increasing data availability and innovation in high-performance hardware fuels the training of sophisticated models, it also fosters the fading perception of energy consumption and carbon emission. Therefore, the goal of this work is to raise awareness about the energy impact of general training parameters and processes, from learning rate over batch size to knowledge transfer. Multiple setups with different hyperparameter configurations are evaluated on three different hardware systems. Among many results, we have found out that even with the same model and hardware to reach the same accuracy, improperly set training hyperparameters consume up to 5 times the energy of the optimal setup. We also extensively examined the energy-saving benefits of learning paradigms including recycling knowledge through pretraining and sharing knowledge through multitask training.  ( 2 min )
    Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
    arXiv:2312.14567v2 Announce Type: replace Abstract: Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical property is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.  ( 2 min )
    Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review
    arXiv:2401.01519v3 Announce Type: replace Abstract: This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research. It discusses the impact of LLMs across various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology, highlighting their potential to simulate aspects of human cognition and behavior. The paper delves into the capabilities of these models to emulate human-like text generation, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology. While LLMs are essential in advancing research methodologies in psychology, the paper also cautions about their technical and ethical challenges. There are issues like data privacy, the ethical implications of using LLMs in psychological research, and the need for a deeper understanding of these models' limitations. Researchers should responsibly use LLMs in psychological studies, adhering to ethical standards and considering the potential consequences of deploying these technologies in sensitive areas. Overall, the article provides a comprehensive overview of the current state of LLMs in psychology, exploring potential benefits and challenges. It serves as a call to action for researchers to leverage LLMs' advantages responsibly while addressing associated risks.  ( 3 min )
    Diffusion Reward: Learning Rewards via Conditional Video Diffusion
    arXiv:2312.14134v2 Announce Type: replace Abstract: Learning rewards from expert videos offers an affordable and effective solution to specify the intended behaviors for reinforcement learning tasks. In this work, we propose Diffusion Reward, a novel framework that learns rewards from expert videos via conditional video diffusion models for solving complex visual RL problems. Our key insight is that lower generative diversity is observed when conditioned on expert trajectories. Diffusion Reward is accordingly formalized by the negative of conditional entropy that encourages productive exploration of expert-like behaviors. We show the efficacy of our method over 10 robotic manipulation tasks from MetaWorld and Adroit with visual input and sparse reward. Moreover, Diffusion Reward could even solve unseen tasks successfully and effectively, largely surpassing baseline methods. Project page and code: https://diffusion-reward.github.io/.  ( 2 min )
    Multi-Objective Reinforcement Learning-based Approach for Pressurized Water Reactor Optimization
    arXiv:2312.10194v3 Announce Type: replace Abstract: A novel method, the Pareto Envelope Augmented with Reinforcement Learning (PEARL), has been developed to address the challenges posed by multi-objective problems, particularly in the field of engineering where the evaluation of candidate solutions can be time-consuming. PEARL distinguishes itself from traditional policy-based multi-objective Reinforcement Learning methods by learning a single policy, eliminating the need for multiple neural networks to independently solve simpler sub-problems. Several versions inspired from deep learning and evolutionary techniques have been crafted, catering to both unconstrained and constrained problem domains. Curriculum Learning is harnessed to effectively manage constraints in these versions. PEARL's performance is first evaluated on classical multi-objective benchmarks. Additionally, it is tested on two practical PWR core Loading Pattern optimization problems to showcase its real-world applicability. The first problem involves optimizing the Cycle length and the rod-integrated peaking factor as the primary objectives, while the second problem incorporates the mean average enrichment as an additional objective. Furthermore, PEARL addresses three types of constraints related to boron concentration, peak pin burnup, and peak pin power. The results are systematically compared against conventional approaches. Notably, PEARL, specifically the PEARL-NdS variant, efficiently uncovers a Pareto front without necessitating additional efforts from the algorithm designer, as opposed to a single optimization with scaled objectives. It also outperforms the classical approach across multiple performance metrics, including the Hyper-volume.  ( 3 min )
    A Comprehensive Evaluation of Augmentations for Robust OOD Self-Supervised Contrastive Phonocardiogram Representation Learning
    arXiv:2312.00502v2 Announce Type: replace Abstract: Despite the recent increase in research activity, deep-learning models have not yet been widely accepted in several real-world settings, such as medicine. The shortage of high-quality annotated data often hinders the development of robust and generalizable models, which do not suffer from degraded effectiveness when presented with newly-collected, out-of-distribution (OOD) datasets. Contrastive Self-Supervised Learning (SSL) offers a potential solution to labeled data scarcity, as it takes advantage of unlabeled data to increase model effectiveness and robustness. In this research, we propose applying contrastive SSL for detecting abnormalities in 1D phonocardiogram (PCG) samples by learning a generalized representation of the signal. Specifically, we perform an extensive comparative evaluation of a wide range of audio-based augmentations, evaluate trained classifiers on multiple datasets across different downstream tasks, and finally report on the impact of each augmentation in model training. We experimentally demonstrate that, depending on its training distribution, the effectiveness of a fully-supervised model can degrade up to 32% when evaluated on unseen data, while SSL models only lose up to 10% or even improve in some cases. We argue and experimentally demonstrate that, contrastive SSL pretraining can assist in providing robust classifiers which can generalize to unseen, OOD data, without relying on time- and labor-intensive annotation processes by medical experts. Furthermore, the proposed extensive evaluation protocol sheds light on the most promising and appropriate augmentations for robust PCG signal processing, by calculating their effect size on model training. Finally, we provide researchers and practitioners with a roadmap towards producing robust models for PCG classification, in addition to an open-source codebase for developing novel approaches.  ( 3 min )
    Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning
    arXiv:2311.15487v3 Announce Type: replace Abstract: We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions; one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry.  ( 2 min )
    Flooding Regularization for Stable Training of Generative Adversarial Networks
    arXiv:2311.00318v2 Announce Type: replace Abstract: Generative Adversarial Networks (GANs) have shown remarkable performance in image generation. However, GAN training suffers from the problem of instability. One of the main approaches to address this problem is to modify the loss function, often using regularization terms in addition to changing the type of adversarial losses. This paper focuses on directly regularizing the adversarial loss function. We propose a method that applies flooding, an overfitting suppression method in supervised learning, to GANs to directly prevent the discriminator's loss from becoming excessively low. Flooding requires tuning the flood level, but when applied to GANs, we propose that the appropriate range of flood level settings is determined by the adversarial loss function, supported by theoretical analysis of GANs using the binary cross entropy loss. We experimentally verify that flooding stabilizes GAN training and can be combined with other stabilization techniques. We also show that by restricting the discriminator's loss to be no less than the flood level, the training proceeds stably even when the flood level is somewhat high.  ( 2 min )
    PubDef: Defending Against Transfer Attacks From Public Models
    arXiv:2310.17645v2 Announce Type: replace Abstract: Adversarial attacks have been a looming and unaddressed threat in the industry. However, through a decade-long history of the robustness evaluation literature, we have learned that mounting a strong or optimal attack is challenging. It requires both machine learning and domain expertise. In other words, the white-box threat model, religiously assumed by a large majority of the past literature, is unrealistic. In this paper, we propose a new practical threat model where the adversary relies on transfer attacks through publicly available surrogate models. We argue that this setting will become the most prevalent for security-sensitive applications in the future. We evaluate the transfer attacks in this setting and propose a specialized defense method based on a game-theoretic perspective. The defenses are evaluated under 24 public models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and ImageNet). Under this threat model, our defense, PubDef, outperforms the state-of-the-art white-box adversarial training by a large margin with almost no loss in the normal accuracy. For instance, on ImageNet, our defense achieves 62% accuracy under the strongest transfer attack vs only 36% of the best adversarially trained model. Its accuracy when not under attack is only 2% lower than that of an undefended model (78% vs 80%). We release our code at https://github.com/wagner-group/pubdef.  ( 3 min )
    Understanding when Dynamics-Invariant Data Augmentations Benefit Model-Free Reinforcement Learning Updates
    arXiv:2310.17786v2 Announce Type: replace Abstract: Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency. While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates, it is not well-understood when a particular DA strategy will improve data efficiency. In this paper, we seek to identify general aspects of DA responsible for observed learning improvements. Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training. Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.  ( 2 min )
    The Expressive Power of Low-Rank Adaptation
    arXiv:2310.17513v3 Announce Type: replace Abstract: Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.  ( 2 min )
    Tokenizer Choice For LLM Training: Negligible or Crucial?
    arXiv:2310.08754v4 Announce Type: replace Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model's downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multi-lingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.  ( 3 min )
    Memory-Consistent Neural Networks for Imitation Learning
    arXiv:2310.06171v2 Announce Type: replace Abstract: Imitation learning considerably simplifies policy synthesis compared to alternative approaches by exploiting access to expert demonstrations. For such imitation policies, errors away from the training samples are particularly critical. Even rare slip-ups in the policy action outputs can compound quickly over time, since they lead to unfamiliar future states where the policy is still more likely to err, eventually causing task failures. We revisit simple supervised ``behavior cloning'' for conveniently training the policy from nothing more than pre-recorded demonstrations, but carefully design the model class to counter the compounding error phenomenon. Our ``memory-consistent neural network'' (MCNN) outputs are hard-constrained to stay within clearly specified permissible regions anchored to prototypical ``memory'' training samples. We provide a guaranteed upper bound for the sub-optimality gap induced by MCNN policies. Using MCNNs on 10 imitation learning tasks, with MLP, Transformer, and Diffusion backbones, spanning dexterous robotic manipulation and driving, proprioceptive inputs and visual inputs, and varying sizes and types of demonstration data, we find large and consistent gains in performance, validating that MCNNs are better-suited than vanilla deep neural networks for imitation learning applications. Website: https://sites.google.com/view/mcnn-imitation  ( 2 min )
    Robust-GBDT: GBDT with Nonconvex Loss for Tabular Classification in the Presence of Label Noise and Class Imbalance
    arXiv:2310.05067v2 Announce Type: replace Abstract: Dealing with label noise in tabular classification tasks poses a persistent challenge in machine learning. While robust boosting methods have shown promise in binary classification, their effectiveness in complex, multi-class scenarios is often limited. Additionally, issues like imbalanced datasets, missing values, and computational inefficiencies further complicate their practical utility. This study introduces Robust-GBDT, a groundbreaking approach that combines the power of Gradient Boosted Decision Trees (GBDT) with the resilience of nonconvex loss functions against label noise. By leveraging local convexity within specific regions, Robust-GBDT demonstrates unprecedented robustness, challenging conventional wisdom. Through seamless integration of advanced GBDT with a novel Robust Focal Loss tailored for class imbalance, Robust-GBDT significantly enhances generalization capabilities, particularly in noisy and imbalanced datasets. Notably, its user-friendly design facilitates integration with existing open-source code, enhancing computational efficiency and scalability. Extensive experiments validate Robust-GBDT's superiority over other noise-robust methods, establishing a new standard for accurate classification amidst label noise. This research heralds a paradigm shift in machine learning, paving the way for a new era of robust and precise classification across diverse real-world applications.  ( 2 min )
    PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels
    arXiv:2310.01655v3 Announce Type: replace Abstract: The quadratic time and memory complexity inherent to self-attention mechanisms, with respect to sequence length, presents a critical computational bottleneck in the training and deployment of large-scale Transformer-based language models. Recent theoretical results indicate the intractability of sub-quadratic softmax attention approximation under reasonable complexity assumptions. This paper addresses this challenge by first demonstrating that polynomial attention with high degree can effectively replace softmax without sacrificing model quality. Next, we develop polynomial sketching techniques from numerical linear algebra to achieve linear-time polynomial attention with approximation guarantees. Crucially, our approach achieves this speedup without requiring the sparsification of attention matrices. We also present a block-based algorithm to apply causal masking efficiently. Combining these techniques, we provide \emph{PolySketchFormer}, a practical linear-time Transformer architecture for language modeling that offers provable guarantees. We validate PolySketchFormer empirically by training language models capable of handling long contexts. These experiments utilize both synthetic and real-world datasets (PG19, Wikipedia and C4) on Google Cloud TPUs. For context lengths of 32k and GPT-2 style models, our model achieves a 2.5-4x speedup in training compared to FlashAttention, with no observed degradation in quality across our experiments.  ( 2 min )
    Unconstrained Stochastic CCA: Unifying Multiview and Self-Supervised Learning
    arXiv:2310.01012v3 Announce Type: replace Abstract: The Canonical Correlation Analysis (CCA) family of methods is foundational in multiview learning. Regularised linear CCA methods can be seen to generalise Partial Least Squares (PLS) and be unified with a Generalized Eigenvalue Problem (GEP) framework. However, classical algorithms for these linear methods are computationally infeasible for large-scale data. Extensions to Deep CCA show great promise, but current training procedures are slow and complicated. First we propose a novel unconstrained objective that characterizes the top subspace of GEPs. Our core contribution is a family of fast algorithms for stochastic PLS, stochastic CCA, and Deep CCA, simply obtained by applying stochastic gradient descent (SGD) to the corresponding CCA objectives. Our algorithms show far faster convergence and recover higher correlations than the previous state-of-the-art on all standard CCA and Deep CCA benchmarks. These improvements allow us to perform a first-of-its-kind PLS analysis of an extremely large biomedical dataset from the UK Biobank, with over 33,000 individuals and 500,000 features. Finally, we apply our algorithms to match the performance of `CCA-family' Self-Supervised Learning (SSL) methods on CIFAR-10 and CIFAR-100 with minimal hyper-parameter tuning, and also present theory to clarify the links between these methods and classical CCA, laying the groundwork for future insights.  ( 3 min )
    The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing
    arXiv:2309.16883v4 Announce Type: replace Abstract: Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. The certified radius in this context is a crucial indicator of the robustness of models. However how to design an efficient classifier with an associated certified radius? Randomized smoothing provides a promising framework by relying on noise injection into the inputs to obtain a smoothed and robust classifier. In this paper, we first show that the variance introduced by the Monte-Carlo sampling in the randomized smoothing procedure estimate closely interacts with two other important properties of the classifier, \textit{i.e.} its Lipschitz constant and margin. More precisely, our work emphasizes the dual impact of the Lipschitz constant of the base classifier, on both the smoothed classifier and the empirical variance. To increase the certified robust radius, we introduce a different way to convert logits to probability vectors for the base classifier to leverage the variance-margin trade-off. We leverage the use of Bernstein's concentration inequality along with enhanced Lipschitz bounds for randomized smoothing. Experimental results show a significant improvement in certified accuracy compared to current state-of-the-art methods. Our novel certification procedure allows us to use pre-trained models with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.  ( 3 min )
    Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
    arXiv:2310.00754v2 Announce Type: replace Abstract: Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.  ( 2 min )
    From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity
    arXiv:2309.16512v3 Announce Type: replace Abstract: In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.  ( 2 min )
    Convergence and Recovery Guarantees of Unsupervised Neural Networks for Inverse Problems
    arXiv:2309.12128v3 Announce Type: replace Abstract: Neural networks have become a prominent approach to solve inverse problems in recent years. While a plethora of such methods was developed to solve inverse problems empirically, we are still lacking clear theoretical guarantees for these methods. On the other hand, many works proved convergence to optimal solutions of neural networks in a more general setting using overparametrization as a way to control the Neural Tangent Kernel. In this work we investigate how to bridge these two worlds and we provide deterministic convergence and recovery guarantees for the class of unsupervised feedforward multilayer neural networks trained to solve inverse problems. We also derive overparametrization bounds under which a two-layers Deep Inverse Prior network with smooth activation function will benefit from our guarantees.  ( 2 min )
    Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization
    arXiv:2309.10370v2 Announce Type: replace Abstract: In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow neural networks through the explicit construction of upper bounds, without any use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider shallow neural networks with one hidden layer, a ReLU activation function, an ${\mathcal L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input space ${\mathbb R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size $N>QM$ that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal to noise ratio of training inputs. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.  ( 3 min )
    Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
    arXiv:2309.09258v2 Announce Type: replace Abstract: In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions" and thus be able to build on recent progress with analyzing SGD on such objectives.  ( 2 min )
    Uncovering the Hidden Cost of Model Compression
    arXiv:2308.14969v3 Announce Type: replace Abstract: In an age dominated by resource-intensive foundation models, the ability to efficiently adapt to downstream tasks is crucial. Visual Prompting (VP), drawing inspiration from the prompting techniques employed in Large Language Models (LLMs), has emerged as a pivotal method for transfer learning in the realm of computer vision. As the importance of efficiency continues to rise, research into model compression has become indispensable in alleviating the computational burdens associated with training and deploying over-parameterized neural networks. A primary objective in model compression is to develop sparse and/or quantized models capable of matching or even surpassing the performance of their over-parameterized, full-precision counterparts. Although previous studies have explored the effects of model compression on transfer learning, its impact on visual prompting-based transfer remains unclear. This study aims to bridge this gap, shedding light on the fact that model compression detrimentally impacts the performance of visual prompting-based transfer, particularly evident in scenarios with low data volume. Furthermore, our findings underscore the adverse influence of sparsity on the calibration of downstream visual-prompted models. However, intriguingly, we also illustrate that such negative effects on calibration are not present when models are compressed via quantization. This empirical investigation underscores the need for a nuanced understanding beyond mere accuracy in sparse and quantized settings, thereby paving the way for further exploration in Visual Prompting techniques tailored for sparse and quantized models.  ( 3 min )
    OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
    arXiv:2308.13137v3 Announce Type: replace Abstract: Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM, they hand-craft quantization parameters, leading to low performance, especially in extremely low-bit quantization. To tackle this issue, we introduce an Omnidirectionally calibrated Quantization (\textbf{OmniQuant}) technique for LLMs, which achieves good performance in diverse quantization settings while maintaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters. OmniQuant comprises two innovative components including Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). LWC modulates the extreme values of weights by optimizing the clipping threshold. Meanwhile, LET tackles activation outliers by shifting the challenge of quantization from activations to weights. Operating within a differentiable framework using block-wise error minimization, OmniQuant can optimize the quantization process efficiently for both weight-only and weight-activation quantization. For instance, the LLaMA-2 model family size 7-70B can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4 (4-bit weight, 4-bit activation), W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. Codes are available at \url{https://github.com/OpenGVLab/OmniQuant}.  ( 3 min )
    Large Language Models and Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges
    arXiv:2308.06668v4 Announce Type: replace Abstract: The past decade has witnessed the rapid development and adoption of ML & DL methodologies in agricultural systems, showcased by great successes in agricultural applications. However, these conventional ML/DL models have certain limitations: they heavily rely on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored for specific tasks, thus lacking generalizability. Recently, large pre-trained models, also known as FMs, have demonstrated remarkable successes in language, vision, and decision-making tasks across various domains. These models are trained on a large amount of data from multiple domains and modalities. Once trained, they can accomplish versatile tasks with just minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture AI. Thus, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, conceptual tools and technical background are presented to help the understanding of the problem space and uncover new research directions. To this end, recent FMs in the general CS domain are reviewed, and the models are categorized into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Then, the steps of developing agriculture FMs (AFMs) are outlined and potential applications in smart agriculture are discussed. Moreover, challenges and risks associated with developing AFMs are discussed, including model training, validation, and deployment. In summary, the advancement of AI in agriculture is explored by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.  ( 3 min )
    Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses
    arXiv:2307.11714v3 Announce Type: replace Abstract: Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has seen uses for training generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed practically in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent works on convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap, and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of (sub)-gradient flow equations as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.  ( 2 min )
    Deep learning for dynamic graphs: models and benchmarks
    arXiv:2307.06104v3 Announce Type: replace Abstract: Recent progress in research on Deep Graph Networks (DGNs) has led to a maturation of the domain of learning on graphs. Despite the growth of this research field, there are still important challenges that are yet unsolved. Specifically, there is an urge of making DGNs suitable for predictive tasks on realworld systems of interconnected entities, which evolve over time. With the aim of fostering research in the domain of dynamic graphs, at first, we survey recent advantages in learning both temporal and spatial information, providing a comprehensive overview of the current state-of-the-art in the domain of representation learning for dynamic graphs. Secondly, we conduct a fair performance comparison among the most popular proposed approaches on node and edge-level tasks, leveraging rigorous model selection and assessment for all the methods, thus establishing a sound baseline for evaluating new architectures and approaches  ( 2 min )
    Proximity-Informed Calibration for Deep Neural Networks
    arXiv:2306.04590v2 Announce Type: replace Abstract: Confidence calibration is central to providing accurate and interpretable uncertainty estimates, especially under safety-critical scenarios. However, we find that existing calibration algorithms often overlook the issue of *proximity bias*, a phenomenon where models tend to be more overconfident in low proximity data (i.e., data lying in the sparse region of the data distribution) compared to high proximity samples, and thus suffer from inconsistent miscalibration across different proximity samples. We examine the problem over 504 pretrained ImageNet models and observe that: 1) Proximity bias exists across a wide variety of model architectures and sizes; 2) Transformer-based models are relatively more susceptible to proximity bias than CNN-based models; 3) Proximity bias persists even after performing popular calibration algorithms like temperature scaling; 4) Models tend to overfit more heavily on low proximity samples than on high proximity samples. Motivated by the empirical findings, we propose ProCal, a plug-and-play algorithm with a theoretical guarantee to adjust sample confidence based on proximity. To further quantify the effectiveness of calibration algorithms in mitigating proximity bias, we introduce proximity-informed expected calibration error (PIECE) with theoretical analysis. We show that ProCal is effective in addressing proximity bias and improving calibration on balanced, long-tail, and distribution-shift settings under four metrics over various model architectures. We believe our findings on proximity bias will guide the development of *fairer and better-calibrated* models, contributing to the broader pursuit of trustworthy AI. Our code is available at: https://github.com/MiaoXiong2320/ProximityBias-Calibration.  ( 3 min )
    Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX
    arXiv:2306.09884v2 Announce Type: replace Abstract: Open-source reinforcement learning (RL) environments have played a crucial role in driving progress in the development of AI algorithms. In modern RL research, there is a need for simulated environments that are performant, scalable, and modular to enable their utilization in a wider range of potential real-world applications. Therefore, we present Jumanji, a suite of diverse RL environments specifically designed to be fast, flexible, and scalable. Jumanji provides a suite of environments focusing on combinatorial problems frequently encountered in industry, as well as challenging general decision-making tasks. By leveraging the efficiency of JAX and hardware accelerators like GPUs and TPUs, Jumanji enables rapid iteration of research ideas and large-scale experimentation, ultimately empowering more capable agents. Unlike existing RL environment suites, Jumanji is highly customizable, allowing users to tailor the initial state distribution and problem complexity to their needs. Furthermore, we provide actor-critic baselines for each environment, accompanied by preliminary findings on scaling and generalization scenarios. Jumanji aims to set a new standard for speed, adaptability, and scalability of RL environments.  ( 3 min )
    Improving Offline RL by Blending Heuristics
    arXiv:2306.00321v2 Announce Type: replace Abstract: We propose Heuristic Blending (HUBL), a simple performance-improving technique for a broad class of offline RL algorithms based on value bootstrapping. HUBL modifies the Bellman operators used in these algorithms, partially replacing the bootstrapped values with heuristic ones that are estimated with Monte-Carlo returns. For trajectories with higher returns, HUBL relies more on the heuristic values and less on bootstrapping; otherwise, it leans more heavily on bootstrapping. HUBL is very easy to combine with many existing offline RL implementations by relabeling the offline datasets with adjusted rewards and discount factors. We derive a theory that explains HUBL's effect on offline RL as reducing offline RL's complexity and thus increasing its finite-sample performance. Furthermore, we empirically demonstrate that HUBL consistently improves the policy quality of four state-of-the-art bootstrapping-based offline RL algorithms (ATAC, CQL, TD3+BC, and IQL), by 9% on average over 27 datasets of the D4RL and Meta-World benchmarks.  ( 2 min )
    Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo
    arXiv:2305.18246v2 Announce Type: replace Abstract: We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL, by using Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.  ( 3 min )
    Hang-Time HAR: A Benchmark Dataset for Basketball Activity Recognition using Wrist-Worn Inertial Sensors
    arXiv:2305.13124v2 Announce Type: replace Abstract: We present a benchmark dataset for evaluating physical human activity recognition methods from wrist-worn sensors, for the specific setting of basketball training, drills, and games. Basketball activities lend themselves well for measurement by wrist-worn inertial sensors, and systems that are able to detect such sport-relevant activities could be used in applications toward game analysis, guided training, and personal physical activity tracking. The dataset was recorded for two teams from separate countries (USA and Germany) with a total of 24 players who wore an inertial sensor on their wrist, during both repetitive basketball training sessions and full games. Particular features of this dataset include an inherent variance through cultural differences in game rules and styles as the data was recorded in two countries, as well as different sport skill levels, since the participants were heterogeneous in terms of prior basketball experience. We illustrate the dataset's features in several time-series analyses and report on a baseline classification performance study with two state-of-the-art deep learning architectures.  ( 3 min )
    Distributional Reinforcement Learning with Dual Expectile-Quantile Regression
    arXiv:2305.16877v2 Announce Type: replace Abstract: Distributional reinforcement learning (RL) has proven useful in multiple benchmarks as it enables approximating the full distribution of returns and makes a better use of environment samples. The commonly used quantile regression approach to distributional RL -- based on asymmetric $L_1$ losses -- provides a flexible and effective way of learning arbitrary return distributions. In practice, it is often improved by using a more efficient, hybrid asymmetric $L_1$-$L_2$ Huber loss for quantile regression. However, by doing so, distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Indeed, asymmetric $L_2$ losses, corresponding to expectile regression, cannot be readily used for distributional temporal difference learning. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning while keeping an estimate of the full distribution of returns. We prove that our approach approximately learns the correct return distribution, and we benchmark a practical implementation on a toy example and at scale. On the Atari benchmark, our approach matches the performance of the Huber-based IQN-1 baseline after $200$M training frames but avoids distributional collapse and keeps estimates of the full distribution of returns.  ( 3 min )
    Energy-guided Entropic Neural Optimal Transport
    arXiv:2304.06094v4 Announce Type: replace Abstract: Energy-based models (EBMs) are known in the Machine Learning community for decades. Since the seminal works devoted to EBMs dating back to the noughties, there have been a lot of efficient methods which solve the generative modelling problem by means of energy potentials (unnormalized likelihood functions). In contrast, the realm of Optimal Transport (OT) and, in particular, neural OT solvers is much less explored and limited by few recent works (excluding WGAN-based approaches which utilize OT as a loss function and do not model OT maps themselves). In our work, we bridge the gap between EBMs and Entropy-regularized OT. We present a novel methodology which allows utilizing the recent developments and technical improvements of the former in order to enrich the latter. From the theoretical perspective, we prove generalization bounds for our technique. In practice, we validate its applicability in toy 2D and image domains. To showcase the scalability, we empower our method with a pre-trained StyleGAN and apply it to high-res AFHQ $512\times 512$ unpaired I2I translation. For simplicity, we choose simple short- and long-run EBMs as a backbone of our Energy-guided Entropic OT approach, leaving the application of more sophisticated EBMs for future research. Our code is available at: https://github.com/PetrMokrov/Energy-guided-Entropic-OT  ( 3 min )
    Interpretable statistical representations of neural population dynamics and geometry
    arXiv:2304.03376v3 Announce Type: replace Abstract: The dynamics of neuron populations during many behavioural tasks evolve on low-dimensional manifolds. However, it remains challenging to discover latent representations from neural recordings that are interpretable and consistently decodable across individuals and conditions without explicitly relying on behavioural information. Here, we introduce MARBLE, a fully unsupervised geometric deep learning framework for the data-driven representation of non-linear dynamics based on statistical distributions of local dynamical features. Using both in silico examples from non-linear dynamical systems and recurrent neural networks and in vivo recordings from primates and rodents, we demonstrate that MARBLE can infer latent representations that are highly interpretable in terms of global system variables such as decision-thresholds, kinematics or internal states. We also show that MARBLE representations are consistent across neural networks and animals so that they can be used to compare cognitive computations or train universal decoders. Through extensive benchmarking, we show that unsupervised MARBLE provides best-in-class within- and across-animal decoding accuracy, comparable to or significantly better than current supervised approaches, yet without the need for behavioural labels. Our results suggest that using the manifold structure in conjunction with the temporal information of neural dynamics provides a common framework to develop better decoding algorithms and assimilate data across experiments.  ( 3 min )
    EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models
    arXiv:2303.05656v2 Announce Type: replace Abstract: Electronic health records (EHR) contain a wealth of biomedical information, serving as valuable resources for the development of precision medicine systems. However, privacy concerns have resulted in limited access to high-quality and large-scale EHR data for researchers, impeding progress in methodological development. Recent research has delved into synthesizing realistic EHR data through generative modeling techniques, where a majority of proposed methods relied on generative adversarial networks (GAN) and their variants for EHR synthesis. Despite GAN-based methods attaining state-of-the-art performance in generating EHR data, these approaches are difficult to train and prone to mode collapse. Recently introduced in generative modeling, diffusion models have established cutting-edge performance in image generation, but their efficacy in EHR data synthesis remains largely unexplored. In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data, protecting private information in the meanwhile.  ( 2 min )
    Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization
    arXiv:2302.04552v3 Announce Type: replace Abstract: Stochastically Extended Adversarial (SEA) model is introduced by Sachs et al. [2022] as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition, they demonstrate that the expected regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. They also provide a slightly weaker bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, without the convexity requirement of individual functions. For strongly convex and smooth functions, we establish an $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log (\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ result. For exp-concave and smooth functions, we achieve a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Owing to the OMD framework, we broaden our work to study dynamic regret minimization and scenarios where the online functions are non-smooth. We establish the first dynamic regret guarantee for the SEA model with convex and smooth functions, which is more favorable than static regret bounds in non-stationary scenarios. Furthermore, to deal with non-smooth and convex functions in the SEA model, we propose novel algorithms building on optimistic OMD with an implicit update, which provably attain static regret and dynamic regret guarantees without smoothness conditions.  ( 3 min )
    Koopman-based generalization bound: New aspect for full-rank weights
    arXiv:2302.05825v3 Announce Type: replace Abstract: We propose a new bound for generalization of neural networks using Koopman operators. Whereas most of existing works focus on low-rank weight matrices, we focus on full-rank weight matrices. Our bound is tighter than existing norm-based bounds when the condition numbers of weight matrices are small. Especially, it is completely independent of the width of the network if the weight matrices are orthogonal. Our bound does not contradict to the existing bounds but is a complement to the existing bounds. As supported by several existing empirical results, low-rankness is not the only reason for generalization. Furthermore, our bound can be combined with the existing bounds to obtain a tighter bound. Our result sheds new light on understanding generalization of neural networks with full-rank weight matrices, and it provides a connection between operator-theoretic analysis and generalization of neural networks.  ( 2 min )
    Improving Domain Generalization with Domain Relations
    arXiv:2302.02609v2 Announce Type: replace Abstract: Distribution shift presents a significant challenge in machine learning, where models often underperform during the test stage when faced with a different distribution than the one they were trained on. This paper focuses on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on, and propose a new approach called D$^3$G. Unlike previous methods that aim to learn a single model that is domain invariant, D$^3$G leverages domain similarities based on domain metadata to learn domain-specific models. Concretely, D$^3$G learns a set of training-domain-specific functions during the training stage and reweights them based on domain relations during the test stage. These domain relations can be directly obtained and learned from domain metadata. Under mild assumptions, we theoretically prove that using domain relations to reweight training-domain-specific functions achieves stronger out-of-domain generalization compared to the conventional averaging approach. Empirically, we evaluate the effectiveness of D$^3$G using real-world datasets for tasks such as temperature regression, land use classification, and molecule-protein binding affinity prediction. Our results show that D$^3$G consistently outperforms state-of-the-art methods.  ( 2 min )
    Provably Bounding Neural Network Preimages
    arXiv:2302.01404v4 Announce Type: replace Abstract: Most work on the formal verification of neural networks has focused on bounding the set of outputs that correspond to a given set of inputs (for example, bounded perturbations of a nominal input). However, many use cases of neural network verification require solving the inverse problem, or over-approximating the set of inputs that lead to certain outputs. We present the INVPROP algorithm for verifying properties over the preimage of a linearly constrained output set, which can be combined with branch-and-bound to increase precision. Contrary to other approaches, our efficient algorithm is GPU-accelerated and does not require a linear programming solver. We demonstrate our algorithm for identifying safe control regions for a dynamical system via backward reachability analysis, verifying adversarial robustness, and detecting out-of-distribution inputs to a neural network. Our results show that in certain settings, we find over-approximations over 2500x tighter than prior work while being 2.5x faster. By strengthening robustness verification with output constraints, we consistently verify more properties than the previous state-of-the-art on multiple benchmarks, including a large model with 167k neurons in VNN-COMP 2023. Our algorithm has been incorporated into the $\alpha,\!\beta$-CROWN verifier, available at https://abcrown.org.  ( 3 min )
    Agnostic Visual Recommendation Systems: Open Challenges and Future Directions
    arXiv:2302.00569v2 Announce Type: replace Abstract: Visualization Recommendation Systems (VRSs) are a novel and challenging field of study aiming to help generate insightful visualizations from data and support non-expert users in information discovery. Among the many contributions proposed in this area, some systems embrace the ambitious objective of imitating human analysts to identify relevant relationships in data and make appropriate design choices to represent these relationships with insightful charts. We denote these systems as "agnostic" VRSs since they do not rely on human-provided constraints and rules but try to learn the task autonomously. Despite the high application potential of agnostic VRSs, their progress is hindered by several obstacles, including the absence of standardized datasets to train recommendation algorithms, the difficulty of learning design rules, and defining quantitative criteria for evaluating the perceptual effectiveness of generated plots. This paper summarizes the literature on agnostic VRSs and outlines promising future research directions.  ( 2 min )
    Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence
    arXiv:2301.11476v4 Announce Type: replace Abstract: Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leilbler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which use the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q >1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.  ( 2 min )
    MechProNet: Machine Learning Prediction of Mechanical Properties in Metal Additive Manufacturing
    arXiv:2209.12605v2 Announce Type: replace Abstract: Predicting mechanical properties in metal additive manufacturing (MAM) is essential for ensuring the performance and reliability of printed parts, as well as their suitability for specific applications. However, conducting experiments to estimate mechanical properties in MAM processes can be laborious and expensive, and they are often limited to specific materials and processes. Machine learning (ML) methods offer a more flexible and cost-effective approach to predicting mechanical properties based on processing parameters and material properties. In this study, we introduce a comprehensive framework for benchmarking ML models for predicting mechanical properties. We compiled an extensive experimental dataset from over 90 MAM articles and data sheets from a diverse range of sources, encompassing 140 different MAM data sheets. This dataset includes information on MAM processing conditions, machines, materials, and resulting mechanical properties such as yield strength, ultimate tensile strength, elastic modulus, elongation, hardness, and surface roughness. Our framework incorporates physics-aware featurization specific to MAM, adjustable ML models, and tailored evaluation metrics to construct a comprehensive learning framework for predicting mechanical properties. Additionally, we explore the Explainable AI method, specifically SHAP analysis, to elucidate and interpret the predicted values of ML models for mechanical properties. Furthermore, data-driven explicit models were developed to estimate mechanical properties based on processing parameters and material properties, offering enhanced interpretability compared to conventional ML models.  ( 3 min )
    Graph Neural Modeling of Network Flows
    arXiv:2209.05208v3 Announce Type: replace Abstract: Network flow problems, which involve distributing traffic such that the underlying infrastructure is used effectively, are ubiquitous in transportation and logistics. Among them, the general Multi-Commodity Network Flow (MCNF) problem concerns the distribution of multiple flows of different sizes between several sources and sinks, while achieving effective utilization of the links. Due to the appeal of data-driven optimization, these problems have increasingly been approached using graph learning methods. In this paper, we propose a novel graph learning architecture for network flow problems called Per-Edge Weights (PEW). This method builds on a Graph Attention Network and uses distinctly parametrized message functions along each link. We extensively evaluate the proposed solution through an Internet flow routing case study using $17$ Service Provider topologies and $2$ routing schemes. We show that PEW yields substantial gains over architectures whose global message function constrains the routing unnecessarily. We also find that an MLP is competitive with other standard architectures. Furthermore, we analyze the relationship between graph structure and predictive performance for data-driven routing of flows, an aspect that has not been considered by existing work in the area.  ( 2 min )
    Accelerating Asynchronous Federated Learning Convergence via Opportunistic Mobile Relaying
    arXiv:2206.04742v2 Announce Type: replace Abstract: This paper presents a study on asynchronous Federated Learning (FL) in a mobile network setting. The majority of FL algorithms assume that communication between clients and the server is always available, however, this is not the case in many real-world systems. To address this issue, the paper explores the impact of mobility on the convergence performance of asynchronous FL. By exploiting mobility, the study shows that clients can indirectly communicate with the server through another client serving as a relay, creating additional communication opportunities. This enables clients to upload local model updates sooner or receive fresher global models. We propose a new FL algorithm, called FedMobile, that incorporates opportunistic relaying and addresses key questions such as when and how to relay. We prove that FedMobile achieves a convergence rate $O(\frac{1}{\sqrt{NT}})$, where $N$ is the number of clients and $T$ is the number of communication slots, and show that the optimal design involves an interesting trade-off on the best timing of relaying. The paper also presents an extension that considers data manipulation before relaying to reduce the cost and enhance privacy. Experiment results on a synthetic dataset and two real-world datasets verify our theoretical findings.  ( 2 min )
    Shift Aggregate Extract Networks
    arXiv:1703.05537v2 Announce Type: replace Abstract: We introduce an architecture based on deep hierarchical decompositions to learn effective representations of large graphs. Our framework extends classic R-decompositions used in kernel methods, enabling nested part-of-part relations. Unlike recursive neural networks, which unroll a template on input graphs directly, we unroll a neural network template over the decomposition hierarchy, allowing us to deal with the high degree variability that typically characterize social network graphs. Deep hierarchical decompositions are also amenable to domain compression, a technique that reduces both space and time complexity by exploiting symmetries. We show empirically that our approach is able to outperform current state-of-the-art graph classification methods on large social network datasets, while at the same time being competitive on small chemobiological benchmark datasets.  ( 2 min )
    One-Step Image Translation with Text-to-Image Models
    arXiv:2403.12036v1 Announce Type: cross Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.  ( 2 min )
    Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning
    arXiv:2403.12030v1 Announce Type: cross Abstract: Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Despite the strong performance of Pre-Trained Models (PTMs) in CIL, a critical issue persists: learning new classes often results in the overwriting of old ones. Excessive modification of the network causes forgetting, while minimal adjustments lead to an inadequate fit for new classes. As a result, it is desired to figure out a way of efficient model updating without harming former knowledge. In this paper, we propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. To enable model updating without conflict, we train a distinct lightweight adapter module for each new task, aiming to create task-specific subspaces. These adapters span a high-dimensional feature space, enabling joint decision-making across multiple subspaces. As data evolves, the expanding subspaces render the old class classifiers incompatible with new-stage spaces. Correspondingly, we design a semantic-guided prototype complement strategy that synthesizes old classes' new features without using any old class instance. Extensive experiments on seven benchmark datasets verify EASE's state-of-the-art performance. Code is available at: https://github.com/sun-hailong/CVPR24-Ease  ( 2 min )
    Align and Distill: Unifying and Improving Domain Adaptive Object Detection
    arXiv:2403.12029v1 Announce Type: cross Abstract: Object detectors often perform poorly on data that differs from their training set. Domain adaptive object detection (DAOD) methods have recently demonstrated strong results on addressing this challenge. Unfortunately, we identify systemic benchmarking pitfalls that call past results into question and hamper further progress: (a) Overestimation of performance due to underpowered baselines, (b) Inconsistent implementation practices preventing transparent comparisons of methods, and (c) Lack of generality due to outdated backbones and lack of diversity in benchmarks. We address these problems by introducing: (1) A unified benchmarking and implementation framework, Align and Distill (ALDI), enabling comparison of DAOD methods and supporting future development, (2) A fair and modern training and evaluation protocol for DAOD that addresses benchmarking pitfalls, (3) A new DAOD benchmark dataset, CFC-DAOD, enabling evaluation on diverse real-world data, and (4) A new method, ALDI++, that achieves state-of-the-art results by a large margin. ALDI++ outperforms the previous state-of-the-art by +3.5 AP50 on Cityscapes to Foggy Cityscapes, +5.7 AP50 on Sim10k to Cityscapes (where ours is the only method to outperform a fair baseline), and +2.0 AP50 on CFC Kenai to Channel. Our framework, dataset, and state-of-the-art method offer a critical reset for DAOD and provide a strong foundation for future research. Code and data are available: https://github.com/justinkay/aldi and https://github.com/visipedia/caltech-fish-counting.  ( 3 min )
    FlexCap: Generating Rich, Localized, and Flexible Captions in Images
    arXiv:2403.12026v1 Announce Type: cross Abstract: We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io .  ( 2 min )
    EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents
    arXiv:2403.12014v1 Announce Type: cross Abstract: Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at? We propose EnvGen, a novel framework to address this question. First, we prompt an LLM to generate training environments that allow agents to quickly learn different tasks in parallel. Concretely, the LLM is given the task description and simulator objectives that the agents should learn and is then asked to generate a set of environment configurations (e.g., different terrains, items given to agents, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, we enable the LLM to continuously adapt the generated environments to progressively improve the skills that the agent is weak at, by providing feedback to the LLM in the form of the agent's performance. We demonstrate the usefulness of EnvGen with comprehensive experiments in Crafter and Heist environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We show qualitatively how the LLM adapts training environments to help improve RL agents' weaker skills over time. Additionally, EnvGen is substantially more efficient as it only uses a small number of LLM calls (e.g., 4 in total), whereas LLM agents require thousands of LLM calls. Lastly, we present detailed ablation studies for our design choices.  ( 3 min )
    A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models
    arXiv:2403.12025v1 Announce Type: cross Abstract: Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.  ( 3 min )
    Convergence of Kinetic Langevin Monte Carlo on Lie groups
    arXiv:2403.12012v1 Announce Type: cross Abstract: Explicit, momentum-based dynamics for optimizing functions defined on Lie groups was recently constructed, based on techniques such as variational optimization and left trivialization. We appropriately add tractable noise to the optimization dynamics to turn it into a sampling dynamics, leveraging the advantageous feature that the momentum variable is Euclidean despite that the potential function lives on a manifold. We then propose a Lie-group MCMC sampler, by delicately discretizing the resulting kinetic-Langevin-type sampling dynamics. The Lie group structure is exactly preserved by this discretization. Exponential convergence with explicit convergence rate for both the continuous dynamics and the discrete sampler are then proved under W2 distance. Only compactness of the Lie group and geodesically L-smoothness of the potential function are needed. To the best of our knowledge, this is the first convergence result for kinetic Langevin on curved spaces, and also the first quantitative result that requires no convexity or, at least not explicitly, any common relaxation such as isoperimetry.  ( 2 min )
    Defining Effective Engagement For Enhancing Cancer Patients' Well-being with Mobile Digital Behavior Change Interventions
    arXiv:2403.12007v1 Announce Type: cross Abstract: Digital Behavior Change Interventions (DBCIs) are supporting development of new health behaviors. Evaluating their effectiveness is crucial for their improvement and understanding of success factors. However, comprehensive guidance for developers, particularly in small-scale studies with ethical constraints, is limited. Building on the CAPABLE project, this study aims to define effective engagement with DBCIs for supporting cancer patients in enhancing their quality of life. We identify metrics for measuring engagement, explore the interest of both patients and clinicians in DBCIs, and propose hypotheses for assessing the impact of DBCIs in such contexts. Our findings suggest that clinician prescriptions significantly increase sustained engagement with mobile DBCIs. In addition, while one weekly engagement with a DBCI is sufficient to maintain well-being, transitioning from extrinsic to intrinsic motivation may require a higher level of engagement.  ( 2 min )
    Diffusion Denoising as a Certified Defense against Clean-label Poisoning
    arXiv:2403.11981v1 Announce Type: cross Abstract: We present a certified defense to clean-label poisoning attacks. These attacks work by injecting a small number of poisoning samples (e.g., 1%) that contain $p$-norm bounded adversarial perturbations into the training data to induce a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by $denoised$ $smoothing$, we show how an off-the-shelf diffusion model can sanitize the tampered training data. We extensively test our defense against seven clean-label poisoning attacks and reduce their attack success to 0-16% with only a negligible drop in the test time accuracy. We compare our defense with existing countermeasures against clean-label poisoning, showing that the defense reduces the attack success the most and offers the best model utility. Our results highlight the need for future work on developing stronger clean-label attacks and using our certified yet practical defense as a strong baseline to evaluate these attacks.  ( 2 min )
    Visualization for Trust in Machine Learning Revisited: The State of the Field in 2023
    arXiv:2403.12005v1 Announce Type: cross Abstract: Visualization for explainable and trustworthy machine learning remains one of the most important and heavily researched fields within information visualization and visual analytics with various application domains, such as medicine, finance, and bioinformatics. After our 2020 state-of-the-art report comprising 200 techniques, we have persistently collected peer-reviewed articles describing visualization techniques, categorized them based on the previously established categorization schema consisting of 119 categories, and provided the resulting collection of 542 techniques in an online survey browser. In this survey article, we present the updated findings of new analyses of this dataset as of fall 2023 and discuss trends, insights, and eight open challenges for using visualizations in machine learning. Our results corroborate the rapidly growing trend of visualization techniques for increasing trust in machine learning models in the past three years, with visualization found to help improve popular model explainability methods and check new deep learning architectures, for instance.  ( 2 min )
    Enhanced Event-Based Video Reconstruction with Motion Compensation
    arXiv:2403.11961v1 Announce Type: cross Abstract: Deep neural networks for event-based video reconstruction often suffer from a lack of interpretability and have high memory demands. A lightweight network called CISTA-LSTC has recently been introduced showing that high-quality reconstruction can be achieved through the systematic design of its architecture. However, its modelling assumption that input signals and output reconstructed frame share the same sparse representation neglects the displacement caused by motion. To address this, we propose warping the input intensity frames and sparse codes to enhance reconstruction quality. A CISTA-Flow network is constructed by integrating a flow network with CISTA-LSTC for motion compensation. The system relies solely on events, in which predicted flow aids in reconstruction and then reconstructed frames are used to facilitate flow estimation. We also introduce an iterative training framework for this combined system. Results demonstrate that our approach achieves state-of-the-art reconstruction accuracy and simultaneously provides reliable dense flow estimation. Furthermore, our model exhibits flexibility in that it can integrate different flow networks, suggesting its potential for further performance enhancement.  ( 2 min )
    Learning Dynamical Systems Encoding Non-Linearity within Space Curvature
    arXiv:2403.11948v1 Announce Type: cross Abstract: Dynamical Systems (DS) are an effective and powerful means of shaping high-level policies for robotics control. They provide robust and reactive control while ensuring the stability of the driving vector field. The increasing complexity of real-world scenarios necessitates DS with a higher degree of non-linearity, along with the ability to adapt to potential changes in environmental conditions, such as obstacles. Current learning strategies for DSs often involve a trade-off, sacrificing either stability guarantees or offline computational efficiency in order to enhance the capabilities of the learned DS. Online local adaptation to environmental changes is either not taken into consideration or treated as a separate problem. In this paper, our objective is to introduce a method that enhances the complexity of the learned DS without compromising efficiency during training or stability guarantees. Furthermore, we aim to provide a unified approach for seamlessly integrating the initially learned DS's non-linearity with any local non-linearities that may arise due to changes in the environment. We propose a geometrical approach to learn asymptotically stable non-linear DS for robotics control. Each DS is modeled as a harmonic damped oscillator on a latent manifold. By learning the manifold's Euclidean embedded representation, our approach encodes the non-linearity of the DS within the curvature of the space. Having an explicit embedded representation of the manifold allows us to showcase obstacle avoidance by directly inducing local deformations of the space. We demonstrate the effectiveness of our methodology through two scenarios: first, the 2D learning of synthetic vector fields, and second, the learning of 3D robotic end-effector motions in real-world settings.  ( 3 min )
    State space representations of the Roesser type for convolutional layers
    arXiv:2403.11938v1 Announce Type: cross Abstract: From the perspective of control theory, convolutional layers (of neural networks) are 2-D (or N-D) linear time-invariant dynamical systems. The usual representation of convolutional layers by the convolution kernel corresponds to the representation of a dynamical system by its impulse response. However, many analysis tools from control theory, e.g., involving linear matrix inequalities, require a state space representation. For this reason, we explicitly provide a state space representation of the Roesser type for 2-D convolutional layers with $c_\mathrm{in}r_1 + c_\mathrm{out}r_2$ states, where $c_\mathrm{in}$/$c_\mathrm{out}$ is the number of input/output channels of the layer and $r_1$/$r_2$ characterizes the width/length of the convolution kernel. This representation is shown to be minimal for $c_\mathrm{in} = c_\mathrm{out}$. We further construct state space representations for dilated, strided, and N-D convolutions.  ( 2 min )
    Explainable Reinforcement Learning-based Home Energy Management Systems using Differentiable Decision Trees
    arXiv:2403.11947v1 Announce Type: cross Abstract: With the ongoing energy transition, demand-side flexibility has become an important aspect of the modern power grid for providing grid support and allowing further integration of sustainable energy sources. Besides traditional sources, the residential sector is another major and largely untapped source of flexibility, driven by the increased adoption of solar PV, home batteries, and EVs. However, unlocking this residential flexibility is challenging as it requires a control framework that can effectively manage household energy consumption, and maintain user comfort while being readily scalable across different, diverse houses. We aim to address this challenging problem and introduce a reinforcement learning-based approach using differentiable decision trees. This approach integrates the scalability of data-driven reinforcement learning with the explainability of (differentiable) decision trees. This leads to a controller that can be easily adapted across different houses and provides a simple control policy that can be explained to end-users, further improving user acceptance. As a proof-of-concept, we analyze our method using a home energy management problem, comparing its performance with commercially available rule-based baseline and standard neural network-based RL controllers. Through this preliminary study, we show that the performance of our proposed method is comparable to standard RL-based controllers, outperforming baseline controllers by ~20% in terms of daily cost savings while being straightforward to explain.  ( 2 min )
    Distill2Explain: Differentiable decision trees for explainable reinforcement learning in energy application controllers
    arXiv:2403.11907v1 Announce Type: cross Abstract: Demand-side flexibility is gaining importance as a crucial element in the energy transition process. Accounting for about 25% of final energy consumption globally, the residential sector is an important (potential) source of energy flexibility. However, unlocking this flexibility requires developing a control framework that (1) easily scales across different houses, (2) is easy to maintain, and (3) is simple to understand for end-users. A potential control framework for such a task is data-driven control, specifically model-free reinforcement learning (RL). Such RL-based controllers learn a good control policy by interacting with their environment, learning purely based on data and with minimal human intervention. Yet, they lack explainability, which hampers user acceptance. Moreover, limited hardware capabilities of residential assets forms a hurdle (e.g., using deep neural networks). To overcome both those challenges, we propose a novel method to obtain explainable RL policies by using differentiable decision trees. Using a policy distillation approach, we train these differentiable decision trees to mimic standard RL-based controllers, leading to a decision tree-based control policy that is data-driven and easy to explain. As a proof-of-concept, we examine the performance and explainability of our proposed approach in a battery-based home energy management system to reduce energy costs. For this use case, we show that our proposed approach can outperform baseline rule-based policies by about 20-25%, while providing simple, explainable control policies. We further compare these explainable policies with standard RL policies and examine the performance trade-offs associated with this increased explainability.  ( 3 min )
    CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification
    arXiv:2403.11904v1 Announce Type: cross Abstract: Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a tf-idf representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting.  ( 2 min )
    Deep Bayesian Future Fusion for Self-Supervised, High-Resolution, Off-Road Mapping
    arXiv:2403.11876v1 Announce Type: cross Abstract: The limited sensing resolution of resource-constrained off-road vehicles poses significant challenges towards reliable off-road autonomy. To overcome this limitation, we propose a general framework based on fusing the future information (i.e. future fusion) for self-supervision. Recent approaches exploit this future information alongside the hand-crafted heuristics to directly supervise the targeted downstream tasks (e.g. traversability estimation). However, in this paper, we opt for a more general line of development - time-efficient completion of the highest resolution (i.e. 2cm per pixel) BEV map in a self-supervised manner via future fusion, which can be used for any downstream tasks for better longer range prediction. To this end, first, we create a high-resolution future-fusion dataset containing pairs of (RGB / height) raw sparse and noisy inputs and map-based dense labels. Next, to accommodate the noise and sparsity of the sensory information, especially in the distal regions, we design an efficient realization of the Bayes filter onto the vanilla convolutional network via the recurrent mechanism. Equipped with the ideas from SOTA generative models, our Bayesian structure effectively predicts high-quality BEV maps in the distal regions. Extensive evaluation on both the quality of completion and downstream task on our future-fusion dataset demonstrates the potential of our approach.  ( 2 min )
    The Real Tropical Geometry of Neural Networks
    arXiv:2403.11871v1 Announce Type: cross Abstract: We consider a binary classifier defined as the sign of a tropical rational function, that is, as the difference of two convex piecewise linear functions. The parameter space of ReLU neural networks is contained as a semialgebraic set inside the parameter space of tropical rational functions. We initiate the study of two different subdivisions of this parameter space: a subdivision into semialgebraic sets, on which the combinatorial type of the decision boundary is fixed, and a subdivision into a polyhedral fan, capturing the combinatorics of the partitions of the dataset. The sublevel sets of the 0/1-loss function arise as subfans of this classification fan, and we show that the level-sets are not necessarily connected. We describe the classification fan i) geometrically, as normal fan of the activation polytope, and ii) combinatorially through a list of properties of associated bipartite graphs, in analogy to covector axioms of oriented matroids and tropical oriented matroids. Our findings extend and refine the connection between neural networks and tropical geometry by observing structures established in real tropical geometry, such as positive tropicalizations of hypersurfaces and tropical semialgebraic sets.  ( 2 min )
    NuGraph2: A Graph Neural Network for Neutrino Physics Event Reconstruction
    arXiv:2403.11872v1 Announce Type: cross Abstract: Liquid Argon Time Projection Chamber (LArTPC) detector technology offers a wealth of high-resolution information on particle interactions, and leveraging that information to its full potential requires sophisticated automated reconstruction techniques. This article describes NuGraph2, a Graph Neural Network (GNN) for low-level reconstruction of simulated neutrino interactions in a LArTPC detector. Simulated neutrino interactions in the MicroBooNE detector geometry are described as heterogeneous graphs, with energy depositions on each detector plane forming nodes on planar subgraphs. The network utilizes a multi-head attention message-passing mechanism to perform background filtering and semantic labelling on these graph nodes, identifying those associated with the primary physics interaction with 98.0\% efficiency and labelling them according to particle type with 94.9\% efficiency. The network operates directly on detector observables across multiple 2D representations, but utilizes a 3D-context-aware mechanism to encourage consistency between these representations. Model inference takes 0.12 s/event on a CPU, and 0.005 s/event batched on a GPU. This architecture is designed to be a general-purpose solution for particle reconstruction in neutrino physics, with the potential for deployment across a broad range of detector technologies, and offers a core convolution engine that can be leveraged for a variety of tasks beyond the two described in this article.  ( 3 min )
    Towards Understanding the Relationship between In-context Learning and Compositional Generalization
    arXiv:2403.11834v1 Announce Type: cross Abstract: According to the principle of compositional generalization, the meaning of a complex expression can be understood as a function of the meaning of its parts and of how they are combined. This principle is crucial for human language processing and also, arguably, for NLP models in the face of out-of-distribution data. However, many neural network models, including Transformers, have been shown to struggle with compositional generalization. In this paper, we hypothesize that forcing models to in-context learn can provide an inductive bias to promote compositional generalization. To test this hypothesis, we train a causal Transformer in a setting that renders ordinary learning very difficult: we present it with different orderings of the training instance and shuffle instance labels. This corresponds to training the model on all possible few-shot learning problems attainable from the dataset. The model can solve the task, however, by utilizing earlier examples to generalize to later ones (i.e. in-context learning). In evaluations on the datasets, SCAN, COGS, and GeoQuery, models trained in this manner indeed show improved compositional generalization. This indicates the usefulness of in-context learning problems as an inductive bias for generalization.  ( 2 min )
    Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data
    arXiv:2403.11841v1 Announce Type: cross Abstract: In real-world scenarios, datasets collected from randomized experiments are often constrained by size, due to limitations in time and budget. As a result, leveraging large observational datasets becomes a more attractive option for achieving high-quality policy learning. However, most existing offline reinforcement learning (RL) methods depend on two key assumptions--unconfoundedness and positivity--which frequently do not hold in observational data contexts. Recognizing these challenges, we propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL). We utilize the mediator variable based on front-door criterion to remove the confounding bias; additionally, we adopt the pessimistic principle to address the distributional shift between the action distributions induced by candidate policies, and the behavior policy that generates the observational data. Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm, by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.  ( 2 min )
    Sound Event Detection and Localization with Distance Estimation
    arXiv:2403.11827v1 Announce Type: cross Abstract: Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection, Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core - a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.  ( 2 min )
    SSCAE -- Semantic, Syntactic, and Context-aware natural language Adversarial Examples generator
    arXiv:2403.11833v1 Announce Type: cross Abstract: Machine learning models are vulnerable to maliciously crafted Adversarial Examples (AEs). Training a machine learning model with AEs improves its robustness and stability against adversarial attacks. It is essential to develop models that produce high-quality AEs. Developing such models has been much slower in natural language processing (NLP) than in areas such as computer vision. This paper introduces a practical and efficient adversarial attack model called SSCAE for \textbf{S}emantic, \textbf{S}yntactic, and \textbf{C}ontext-aware natural language \textbf{AE}s generator. SSCAE identifies important words and uses a masked language model to generate an early set of substitutions. Next, two well-known language models are employed to evaluate the initial set in terms of semantic and syntactic characteristics. We introduce (1) a dynamic threshold to capture more efficient perturbations and (2) a local greedy search to generate high-quality AEs. As a black-box method, SSCAE generates humanly imperceptible and context-aware AEs that preserve semantic consistency and the source language's syntactical and grammatical requirements. The effectiveness and superiority of the proposed SSCAE model are illustrated with fifteen comparative experiments and extensive sensitivity analysis for parameter optimization. SSCAE outperforms the existing models in all experiments while maintaining a higher semantic consistency with a lower query number and a comparable perturbation rate.  ( 3 min )
    CapsLorentzNet: Integrating Physics Inspired Features with Graph Convolution
    arXiv:2403.11826v1 Announce Type: cross Abstract: With the advent of advanced machine learning techniques, boosted object tagging has witnessed significant progress. In this article, we take this field further by introducing novel architectural modifications compatible with a wide array of Graph Neural Network (GNN) architectures. Our approach advocates for integrating capsule layers, replacing the conventional decoding blocks in standard GNNs. These capsules are a group of neurons with vector activations. The orientation of these vectors represents important properties of the objects under study, with their magnitude characterizing whether the object under study belongs to the class represented by the capsule. Moreover, capsule networks incorporate a regularization by reconstruction mechanism, facilitating the seamless integration of expert-designed high-level features into the analysis. We have studied the usefulness of our architecture with the LorentzNet architecture for quark-gluon tagging. Here, we have replaced the decoding block of LorentzNet with a capsulated decoding block and have called the resulting architecture CapsLorentzNet. Our new architecture can enhance the performance of LorentzNet by 20 \% for the quark-gluon tagging task.  ( 2 min )
    Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
    arXiv:2403.11757v1 Announce Type: cross Abstract: In this paper, we present the solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of 6th Affective Behavior Analysis in-the-wild (ABAW) Competition.The EMI Estimation challenge task aims to evaluate the emotional intensity of seed videos by assessing them from a set of predefined emotion categories (i.e., "Admiration," "Amusement," "Determination," "Empathic Pain," "Excitement," and "Joy").  ( 2 min )
    Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms
    arXiv:2403.11778v1 Announce Type: cross Abstract: Deepfake audio poses a rising threat in communication platforms, necessitating real-time detection for audio stream integrity. Unlike traditional non-real-time approaches, this study assesses the viability of employing static deepfake audio detection models in real-time communication platforms. An executable software is developed for cross-platform compatibility, enabling real-time execution. Two deepfake audio detection models based on Resnet and LCNN architectures are implemented using the ASVspoof 2019 dataset, achieving benchmark performances compared to ASVspoof 2019 challenge baselines. The study proposes strategies and frameworks for enhancing these models, paving the way for real-time deepfake audio detection in communication platforms. This work contributes to the advancement of audio stream security, ensuring robust detection capabilities in dynamic, real-time communication scenarios.  ( 2 min )
    Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
    arXiv:2403.11755v1 Announce Type: cross Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively  ( 2 min )
    LSKNet: A Foundation Lightweight Backbone for Remote Sensing
    arXiv:2403.11735v1 Announce Type: cross Abstract: Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.  ( 2 min )
    Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models
    arXiv:2403.11706v1 Announce Type: cross Abstract: Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks: generating a set of coherent sources, creating accompaniments, and performing source separation. Despite their versatility, they require estimating the joint distribution over the sources, necessitating pre-separated musical data, which is rarely available, and fixing the number and type of sources at training time. This paper generalizes MSDM to arbitrary time-domain diffusion models conditioned on text embeddings. These models do not require separated data as they are trained on mixtures, can parameterize an arbitrary number of sources, and allow for rich semantic control. We propose an inference procedure enabling the coherent generation of sources and accompaniments. Additionally, we adapt the Dirac separator of MSDM to perform source separation. We experiment with diffusion models trained on Slakh2100 and MTG-Jamendo, showcasing competitive generation and separation results in a relaxed data setting.  ( 2 min )
    Learning General Policies for Classical Planning Domains: Getting Beyond C$_2$
    arXiv:2403.11734v1 Announce Type: cross Abstract: GNN-based approaches for learning general policies across planning domains are limited by the expressive power of $C_2$, namely; first-order logic with two variables and counting. This limitation can be overcomed by transitioning to $k$-GNNs, for $k=3$, wherein object embeddings are substituted with triplet embeddings. Yet, while $3$-GNNs have the expressive power of $C_3$, unlike $1$- and $2$-GNNs that are confined to $C_2$, they require quartic time for message exchange and cubic space for embeddings, rendering them impractical. In this work, we introduce a parameterized version of relational GNNs. When $t$ is infinity, R-GNN[$t$] approximates $3$-GNNs using only quadratic space for embeddings. For lower values of $t$, such as $t=1$ and $t=2$, R-GNN[$t$] achieves a weaker approximation by exchanging fewer messages, yet interestingly, often yield the $C_3$ features required in several planning domains. Furthermore, the new R-GNN[$t$] architecture is the original R-GNN architecture with a suitable transformation applied to the input states only. Experimental results illustrate the clear performance gains of R-GNN[$1$] and R-GNN[$2$] over plain R-GNNs, and also over edge transformers that also approximate $3$-GNNs.  ( 2 min )
    Coarsening of chiral domains in itinerant electron magnets: A machine learning force field approach
    arXiv:2403.11705v1 Announce Type: cross Abstract: Frustrated itinerant magnets often exhibit complex noncollinear or noncoplanar magnetic orders which support topological electronic structures. A canonical example is the anomalous quantum Hall state with a chiral spin order stabilized by electron-spin interactions on a triangular lattice. While a long-range magnetic order cannot survive thermal fluctuations in two dimensions, the chiral order which results from the breaking of a discrete Ising symmetry persists even at finite temperatures. We present a scalable machine learning (ML) framework to model the complex electron-mediated spin-spin interactions that stabilize the chiral magnetic domains in a triangular lattice. Large-scale dynamical simulations, enabled by the ML force-field models, are performed to investigate the coarsening of chiral domains after a thermal quench. While the chiral phase is described by a broken $Z_2$ Ising-type symmetry, we find that the characteristic size of chiral domains increases linearly with time, in stark contrast to the expected Allen-Cahn domain growth law for a non-conserved Ising order parameter field. The linear growth of the chiral domains is attributed to the orientational anisotropy of domain boundaries. Our work also demonstrates the promising potential of ML models for large-scale spin dynamics of itinerant magnets.  ( 2 min )
    Nonsmooth Implicit Differentiation: Deterministic and Stochastic Convergence Rates
    arXiv:2403.11687v1 Announce Type: cross Abstract: We study the problem of efficiently computing the derivative of the fixed-point of a parametric non-differentiable contraction map. This problem has wide applications in machine learning, including hyperparameter optimization, meta-learning and data poisoning attacks. We analyze two popular approaches: iterative differentiation (ITD) and approximate implicit differentiation (AID). A key challenge behind the nonsmooth setting is that the chain rule does not hold anymore. Building upon the recent work by Bolte et al. (2022), who proved the linear convergence of non-differentiable ITD, we provide refined linear convergence rates for both ITD and AID in the deterministic case. We further introduce NSID, a new method to compute the implicit derivative when the fixed point is defined as the composition of an outer map and an inner map which is accessible only through a stochastic unbiased estimator. We establish rates for the convergence of NSID to the true derivative, encompassing the best available rates in the smooth setting. We present illustrative experiments confirming our analysis.  ( 2 min )
    HDLdebugger: Streamlining HDL debugging with Large Language Models
    arXiv:2403.11671v1 Announce Type: cross Abstract: In the domain of chip design, Hardware Description Languages (HDLs) play a pivotal role. However, due to the complex syntax of HDLs and the limited availability of online resources, debugging HDL codes remains a difficult and time-intensive task, even for seasoned engineers. Consequently, there is a pressing need to develop automated HDL code debugging models, which can alleviate the burden on hardware engineers. Despite the strong capabilities of Large Language Models (LLMs) in generating, completing, and debugging software code, their utilization in the specialized field of HDL debugging has been limited and, to date, has not yielded satisfactory results. In this paper, we propose an LLM-assisted HDL debugging framework, namely HDLdebugger, which consists of HDL debugging data generation via a reverse engineering approach, a search engine for retrieval-augmented generation, and a retrieval-augmented LLM fine-tuning approach. Through the integration of these components, HDLdebugger can automate and streamline HDL debugging for chip design. Our comprehensive experiments, conducted on an HDL code dataset sourced from Huawei, reveal that HDLdebugger outperforms 13 cutting-edge LLM baselines, displaying exceptional effectiveness in HDL code debugging.  ( 2 min )
    Exploring 3D-aware Latent Spaces for Efficiently Learning Numerous Scenes
    arXiv:2403.11678v1 Announce Type: cross Abstract: We present a method enabling the scaling of NeRFs to learn a large number of semantically-similar scenes. We combine two techniques to improve the required training time and memory cost per scene. First, we learn a 3D-aware latent space in which we train Tri-Plane scene representations, hence reducing the resolution at which scenes are learned. Moreover, we present a way to share common information across scenes, hence allowing for a reduction of model complexity to learn a particular scene. Our method reduces effective per-scene memory costs by 44% and per-scene time costs by 86% when training 1000 scenes. Our project page can be found at https://3da-ae.github.io .  ( 2 min )
    Guiding the generation of counterfactual explanations through temporal background knowledge for Predictive Process Monitoring
    arXiv:2403.11642v1 Announce Type: cross Abstract: Counterfactual explanations suggest what should be different in the input instance to change the outcome of an AI system. When dealing with counterfactual explanations in the field of Predictive Process Monitoring, however, control flow relationships among events have to be carefully considered. A counterfactual, indeed, should not violate control flow relationships among activities (temporal background knowledege). Within the field of Explainability in Predictive Process Monitoring, there have been a series of works regarding counterfactual explanations for outcome-based predictions. However, none of them consider the inclusion of temporal background knowledge when generating these counterfactuals. In this work, we adapt state-of-the-art techniques for counterfactual generation in the domain of XAI that are based on genetic algorithms to consider a series of temporal constraints at runtime. We assume that this temporal background knowledge is given, and we adapt the fitness function, as well as the crossover and mutation operators, to maintain the satisfaction of the constraints. The proposed methods are evaluated with respect to state-of-the-art genetic algorithms for counterfactual generation and the results are presented. We showcase that the inclusion of temporal background knowledge allows the generation of counterfactuals more conformant to the temporal background knowledge, without however losing in terms of the counterfactual traditional quality metrics.  ( 2 min )
    Fair Distributed Cooperative Bandit Learning on Networks for Intelligent Internet of Things Systems (Technical Report)
    arXiv:2403.11603v1 Announce Type: cross Abstract: In intelligent Internet of Things (IoT) systems, edge servers within a network exchange information with their neighbors and collect data from sensors to complete delivered tasks. In this paper, we propose a multiplayer multi-armed bandit model for intelligent IoT systems to facilitate data collection and incorporate fairness considerations. In our model, we establish an effective communication protocol that helps servers cooperate with their neighbors. Then we design a distributed cooperative bandit algorithm, DC-ULCB, enabling servers to collaboratively select sensors to maximize data rates while maintaining fairness in their choices. We conduct an analysis of the reward regret and fairness regret of DC-ULCB, and prove that both regrets have logarithmic instance-dependent upper bounds. Additionally, through extensive simulations, we validate that DC-ULCB outperforms existing algorithms in maximizing reward and ensuring fairness.  ( 2 min )
    Dual-Channel Multiplex Graph Neural Networks for Recommendation
    arXiv:2403.11624v1 Announce Type: cross Abstract: Efficient recommender systems play a crucial role in accurately capturing user and item attributes that mirror individual preferences. Some existing recommendation techniques have started to shift their focus towards modeling various types of interaction relations between users and items in real-world recommendation scenarios, such as clicks, marking favorites, and purchases on online shopping platforms. Nevertheless, these approaches still grapple with two significant shortcomings: (1) Insufficient modeling and exploitation of the impact of various behavior patterns formed by multiplex relations between users and items on representation learning, and (2) ignoring the effect of different relations in the behavior patterns on the target relation in recommender system scenarios. In this study, we introduce a novel recommendation framework, Dual-Channel Multiplex Graph Neural Network (DCMGNN), which addresses the aforementioned challenges. It incorporates an explicit behavior pattern representation learner to capture the behavior patterns composed of multiplex user-item interaction relations, and includes a relation chain representation learning and a relation chain-aware encoder to discover the impact of various auxiliary relations on the target relation, the dependencies between different relations, and mine the appropriate order of relations in a behavior pattern. Extensive experiments on three real-world datasets demonstrate that our \model surpasses various state-of-the-art recommendation methods. It outperforms the best baselines by 10.06\% and 12.15\% on average across all datasets in terms of R@10 and N@10 respectively.  ( 2 min )
    A physics-informed neural network method for the approximation of slow invariant manifolds for the general class of stiff systems of ODEs
    arXiv:2403.11591v1 Announce Type: cross Abstract: We present a physics-informed neural network (PINN) approach for the discovery of slow invariant manifolds (SIMs), for the most general class of fast/slow dynamical systems of ODEs. In contrast to other machine learning (ML) approaches that construct reduced order black box surrogate models using simple regression, and/or require a priori knowledge of the fast and slow variables, our approach, simultaneously decomposes the vector field into fast and slow components and provides a functional of the underlying SIM in a closed form. The decomposition is achieved by finding a transformation of the state variables to the fast and slow ones, which enables the derivation of an explicit, in terms of fast variables, SIM functional. The latter is obtained by solving a PDE corresponding to the invariance equation within the Geometric Singular Perturbation Theory (GSPT) using a single-layer feedforward neural network with symbolic differentiation. The performance of the proposed physics-informed ML framework is assessed via three benchmark problems: the Michaelis-Menten, the target mediated drug disposition (TMDD) reaction model and a fully competitive substrate-inhibitor(fCSI) mechanism. We also provide a comparison with other GPST methods, namely the quasi steady state approximation (QSSA), the partial equilibrium approximation (PEA) and CSP with one and two iterations. We show that the proposed PINN scheme provides SIM approximations, of equivalent or even higher accuracy, than those provided by QSSA, PEA and CSP, especially close to the boundaries of the underlying SIMs.  ( 3 min )
    Semantic Prompting with Image-Token for Continual Learning
    arXiv:2403.11537v1 Announce Type: cross Abstract: Continual learning aims to refine model parameters for new tasks while retaining knowledge from previous tasks. Recently, prompt-based learning has emerged to leverage pre-trained models to be prompted to learn subsequent tasks without the reliance on the rehearsal buffer. Although this approach has demonstrated outstanding results, existing methods depend on preceding task-selection process to choose appropriate prompts. However, imperfectness in task-selection may lead to negative impacts on the performance particularly in the scenarios where the number of tasks is large or task distributions are imbalanced. To address this issue, we introduce I-Prompt, a task-agnostic approach focuses on the visual semantic information of image tokens to eliminate task prediction. Our method consists of semantic prompt matching, which determines prompts based on similarities between tokens, and image token-level prompting, which applies prompts directly to image tokens in the intermediate layers. Consequently, our method achieves competitive performance on four benchmarks while significantly reducing training time compared to state-of-the-art methods. Moreover, we demonstrate the superiority of our method across various scenarios through extensive experiments.  ( 2 min )
    Advancing Neuromorphic Computing: Mixed-Signal Design Techniques Leveraging Brain Code Units and Fundamental Code Units
    arXiv:2403.11563v1 Announce Type: cross Abstract: This paper introduces a groundbreaking digital neuromorphic architecture that innovatively integrates Brain Code Unit (BCU) and Fundamental Code Unit (FCU) using mixedsignal design methodologies. Leveraging open-source datasets and the latest advances in materials science, our research focuses on enhancing the computational efficiency, accuracy, and adaptability of neuromorphic systems. The core of our approach lies in harmonizing the precision and scalability of digital systems with the robustness and energy efficiency of analog processing. Through experimentation, we demonstrate the effectiveness of our system across various metrics. The BCU achieved an accuracy of 88.0% and a power efficiency of 20.0 GOP/s/W, while the FCU recorded an accuracy of 86.5% and a power efficiency of 18.5 GOP/s/W. Our mixed-signal design approach significantly improved latency and throughput, achieving a latency as low as 0.75 ms and throughput up to 213 TOP/s. These results firmly establish the potential of our architecture in neuromorphic computing, providing a solid foundation for future developments in this domain. Our study underscores the feasibility of mixedsignal neuromorphic systems and their promise in advancing the field, particularly in applications requiring high efficiency and adaptability  ( 2 min )
    OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System
    arXiv:2403.11536v1 Announce Type: cross Abstract: Automatic optical inspection (AOI) plays a pivotal role in the manufacturing process, predominantly leveraging high-resolution imaging instruments for scanning purposes. It detects anomalies by analyzing image textures or patterns, making it an essential tool in industrial manufacturing and quality control. Despite its importance, the deployment of models for AOI often faces challenges. These include limited sample sizes, which hinder effective feature learning, variations among source domains, and sensitivities to changes in lighting and camera positions during imaging. These factors collectively compromise the accuracy of model predictions. Traditional AOI often fails to capitalize on the rich mechanism-parameter information from machines or inside images, including statistical parameters, which typically benefit AOI classification. To address this, we introduce an external modality-guided data mining framework, primarily rooted in optical character recognition (OCR), to extract statistical features from images as a second modality to enhance performance, termed OANet (Ocr-Aoi-Net). A key aspect of our approach is the alignment of external modality features, extracted using a single modality-aware model, with image features encoded by a convolutional neural network. This synergy enables a more refined fusion of semantic representations from different modalities. We further introduce feature refinement and a gating function in our OANet to optimize the combination of these features, enhancing inference and decision-making capabilities. Experimental outcomes show that our methodology considerably boosts the recall rate of the defect detection model and maintains high robustness even in challenging scenarios.  ( 3 min )
    Out-of-Distribution Detection Should Use Conformal Prediction (and Vice-versa?)
    arXiv:2403.11532v1 Announce Type: cross Abstract: Research on Out-Of-Distribution (OOD) detection focuses mainly on building scores that efficiently distinguish OOD data from In Distribution (ID) data. On the other hand, Conformal Prediction (CP) uses non-conformity scores to construct prediction sets with probabilistic coverage guarantees. In this work, we propose to use CP to better assess the efficiency of OOD scores. Specifically, we emphasize that in standard OOD benchmark settings, evaluation metrics can be overly optimistic due to the finite sample size of the test dataset. Based on the work of (Bates et al., 2022), we define new conformal AUROC and conformal FRP@TPR95 metrics, which are corrections that provide probabilistic conservativeness guarantees on the variability of these metrics. We show the effect of these corrections on two reference OOD and anomaly detection benchmarks, OpenOOD (Yang et al., 2022) and ADBench (Han et al., 2022). We also show that the benefits of using OOD together with CP apply the other way around by using OOD scores as non-conformity scores, which results in improving upon current CP methods. One of the key messages of these contributions is that since OOD is concerned with designing scores and CP with interpreting these scores, the two fields may be inherently intertwined.  ( 2 min )
    A Data-driven Approach for Rapid Detection of Aeroelastic Modes from Flutter Flight Test Based on Limited Sensor Measurements
    arXiv:2403.11521v1 Announce Type: cross Abstract: Flutter flight test involves the evaluation of the airframes aeroelastic stability by applying artificial excitation on the aircraft lifting surfaces. The subsequent responses are captured and analyzed to extract the frequencies and damping characteristics of the system. However, noise contamination, turbulence, non-optimal excitation of modes, and sensor malfunction in one or more sensors make it time-consuming and corrupt the extraction process. In order to expedite the process of identifying and analyzing aeroelastic modes, this study implements a time-delay embedded Dynamic Mode Decomposition technique. This approach is complemented by Robust Principal Component Analysis methodology, and a sparsity promoting criterion which enables the automatic and optimal selection of sparse modes. The anonymized flutter flight test data, provided by the fifth author of this research paper, is utilized in this implementation. The methodology assumes no knowledge of the input excitation, only deals with the responses captured by accelerometer channels, and rapidly identifies the aeroelastic modes. By incorporating a compressed sensing algorithm, the methodology gains the ability to identify aeroelastic modes, even when the number of available sensors is limited. This augmentation greatly enhances the methodology's robustness and effectiveness, making it an excellent choice for real-time implementation during flutter test campaigns.  ( 3 min )
    LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
    arXiv:2403.11522v1 Announce Type: cross Abstract: While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of this approach. While such a proof-of-concept has shown promise, it still has significant limitations. State-of-the-art polyhedral compilers that use a deep-learning cost model only support a small subset of affine transformations, limiting their ability to apply complex code transformations. They also only support simple programs that have a single loop nest and a rectangular iteration domain, limiting their applicability to many programs. These limitations significantly impact the generality of such compilers and autoschedulers and put into question the whole approach. In this paper, we introduce LOOPer, the first polyhedral autoscheduler that uses a deep-learning based cost model and covers a large set of affine transformations and programs. It supports the exploration of a large set of affine transformations, allowing the application of complex sequences of polyhedral transformations. It also supports the optimization of programs with multiple loop nests and with rectangular and non-rectangular iteration domains, allowing the optimization of an extensive set of programs. We implement and evaluate LOOPer and show that it achieves speedups over the state-of-the-art. On the Polybench benchmark, LOOPer achieves a geometric mean speedup of 1.59x over Tiramisu. LOOPer also achieves competitive speedups with a geometric mean speedup of 1.34x over Pluto, a state-of-the-art polyhedral compiler that does not use a machine-learning based cost model.  ( 3 min )
    MLVICX: Multi-Level Variance-Covariance Exploration for Chest X-ray Self-Supervised Representation Learning
    arXiv:2403.11504v1 Announce Type: cross Abstract: Self-supervised learning (SSL) is potentially useful in reducing the need for manual annotation and making deep learning models accessible for medical image analysis tasks. By leveraging the representations learned from unlabeled data, self-supervised models perform well on tasks that require little to no fine-tuning. However, for medical images, like chest X-rays, which are characterized by complex anatomical structures and diverse clinical conditions, there arises a need for representation learning techniques that can encode fine-grained details while preserving the broader contextual information. In this context, we introduce MLVICX (Multi-Level Variance-Covariance Exploration for Chest X-ray Self-Supervised Representation Learning), an approach to capture rich representations in the form of embeddings from chest X-ray images. Central to our approach is a novel multi-level variance and covariance exploration strategy that empowers the model to detect diagnostically meaningful patterns while reducing redundancy effectively. By enhancing the variance and covariance of the learned embeddings, MLVICX promotes the retention of critical medical insights by adapting both global and local contextual details. We demonstrate the performance of MLVICX in advancing self-supervised chest X-ray representation learning through comprehensive experiments. The performance enhancements we observe across various downstream tasks highlight the significance of the proposed approach in enhancing the utility of chest X-ray embeddings for precision medical diagnosis and comprehensive image analysis. For pertaining, we used the NIH-Chest X-ray dataset, while for downstream tasks, we utilized NIH-Chest X-ray, Vinbig-CXR, RSNA pneumonia, and SIIM-ACR Pneumothorax datasets. Overall, we observe more than 3% performance gains over SOTA SSL approaches in various downstream tasks.  ( 3 min )
    LLM Guided Evolution - The Automation of Models Advancing Models
    arXiv:2403.11446v1 Announce Type: cross Abstract: In the realm of machine learning, traditional model development and automated approaches like AutoML typically rely on layers of abstraction, such as tree-based or Cartesian genetic programming. Our study introduces "Guided Evolution" (GE), a novel framework that diverges from these methods by utilizing Large Language Models (LLMs) to directly modify code. GE leverages LLMs for a more intelligent, supervised evolutionary process, guiding mutations and crossovers. Our unique "Evolution of Thought" (EoT) technique further enhances GE by enabling LLMs to reflect on and learn from the outcomes of previous mutations. This results in a self-sustaining feedback loop that augments decision-making in model evolution. GE maintains genetic diversity, crucial for evolutionary algorithms, by leveraging LLMs' capability to generate diverse responses from expertly crafted prompts and modulate model temperature. This not only accelerates the evolution process but also injects expert like creativity and insight into the process. Our application of GE in evolving the ExquisiteNetV2 model demonstrates its efficacy: the LLM-driven GE autonomously produced variants with improved accuracy, increasing from 92.52% to 93.34%, without compromising model compactness. This underscores the potential of LLMs to accelerate the traditional model design pipeline, enabling models to autonomously evolve and enhance their own designs.  ( 2 min )
    Do CLIPs Always Generalize Better than ImageNet Models?
    arXiv:2403.11497v1 Announce Type: cross Abstract: Large vision language models, such as CLIPs, have revolutionized modern machine learning. CLIPs have demonstrated great generalizability under distribution shifts, supported by an increasing body of literature. However, the evaluation datasets for CLIPs are variations primarily designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., pre-trained on LAION, robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group: comprising animals on common backgrounds, and b) the counter group: including animals on unusual backgrounds. The performance drops from the common to counter groups quantify the reliance of models on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and one needs to be cautious about test setups when evaluating foundation models pre-trained on a significantly different scale and distribution.  ( 2 min )
    Demystifying Deep Reinforcement Learning-Based Autonomous Vehicle Decision-Making
    arXiv:2403.11432v1 Announce Type: cross Abstract: With the advent of universal function approximators in the domain of reinforcement learning, the number of practical applications leveraging deep reinforcement learning (DRL) has exploded. Decision-making in automated driving tasks has emerged as a chief application among them, taking the sensor data or the higher-order kinematic variables as the input and providing a discrete choice or continuous control output. However, the black-box nature of the models presents an overwhelming limitation that restricts the real-world deployment of DRL in autonomous vehicles (AVs). Therefore, in this research work, we focus on the interpretability of an attention-based DRL framework. We use a continuous proximal policy optimization-based DRL algorithm as the baseline model and add a multi-head attention framework in an open-source AV simulation environment. We provide some analytical techniques for discussing the interpretability of the trained models in terms of explainability and causality for spatial and temporal correlations. We show that the weights in the first head encode the positions of the neighboring vehicles while the second head focuses on the leader vehicle exclusively. Also, the ego vehicle's action is causally dependent on the vehicles in the target lane spatially and temporally. Through these findings, we reliably show that these techniques can help practitioners decipher the results of the DRL algorithms.  ( 2 min )
    Neural network representation of quantum systems
    arXiv:2403.11420v1 Announce Type: cross Abstract: It has been proposed that random wide neural networks near Gaussian process are quantum field theories around Gaussian fixed points. In this paper, we provide a novel map with which a wide class of quantum mechanical systems can be cast into the form of a neural network with a statistical summation over network parameters. Our simple idea is to use the universal approximation theorem of neural networks to generate arbitrary paths in the Feynman's path integral. The map can be applied to interacting quantum systems / field theories, even away from the Gaussian limit. Our findings bring machine learning closer to the quantum world.  ( 2 min )
    DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation
    arXiv:2403.11415v1 Announce Type: cross Abstract: Reverse sampling and score-distillation have emerged as main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode-collapsing. To address these limitations and leverage the strengths of both approaches, here we introduce a novel framework called {\em DreamSampler}, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score-distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction and etc, we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications.  ( 2 min )
    Stochastic approach for elliptic problems in perforated domains
    arXiv:2403.11385v1 Announce Type: cross Abstract: A wide range of applications in science and engineering involve a PDE model in a domain with perforations, such as perforated metals or air filters. Solving such perforated domain problems suffers from computational challenges related to resolving the scale imposed by the geometries of perforations. We propose a neural network-based mesh-free approach for perforated domain problems. The method is robust and efficient in capturing various configuration scales, including the averaged macroscopic behavior of the solution that involves a multiscale nature induced by small perforations. The new approach incorporates the derivative-free loss method that uses a stochastic representation or the Feynman-Kac formulation. In particular, we implement the Neumann boundary condition for the derivative-free loss method to handle the interface between the domain and perforations. A suite of stringent numerical tests is provided to support the proposed method's efficacy in handling various perforation scales.  ( 2 min )
    Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction
    arXiv:2403.11375v1 Announce Type: cross Abstract: For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic (e.g., bulk RNA-seq) for quantifying gene expressions. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) Valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal ``Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, encouraging that both modalities are sufficiently trained. Evaluated on two TCGA(The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.  ( 2 min )
    Robustness of the data-driven approach in limited angle tomography
    arXiv:2403.11350v1 Announce Type: cross Abstract: The limited angle Radon transform is notoriously difficult to invert due to the ill-posedness. In this work, we give a mathematical explanation that the data-driven approach based on deep neural networks can reconstruct more information in a stable way compared to traditional methods.  ( 2 min )
    An SDP-based Branch-and-Cut Algorithm for Biclustering
    arXiv:2403.11351v1 Announce Type: cross Abstract: Biclustering, also called co-clustering, block clustering, or two-way clustering, involves the simultaneous clustering of both the rows and columns of a data matrix into distinct groups, such that the rows and columns within a group display similar patterns. As a model problem for biclustering, we consider the $k$-densest-disjoint biclique problem, whose goal is to identify $k$ disjoint complete bipartite subgraphs (called bicliques) of a given weighted complete bipartite graph such that the sum of their densities is maximized. To address this problem, we present a tailored branch-and-cut algorithm. For the upper bound routine, we consider a semidefinite programming relaxation and propose valid inequalities to strengthen the bound. We solve this relaxation in a cutting-plane fashion using a first-order method. For the lower bound, we design a maximum weight matching rounding procedure that exploits the solution of the relaxation solved at each node. Computational results on both synthetic and real-world instances show that the proposed algorithm can solve instances approximately 20 times larger than those handled by general-purpose solvers.  ( 2 min )
    Ensembling and Test Augmentation for Covid-19 Detection and Covid-19 Domain Adaptation from 3D CT-Scans
    arXiv:2403.11338v1 Announce Type: cross Abstract: Since the emergence of Covid-19 in late 2019, medical image analysis using artificial intelligence (AI) has emerged as a crucial research area, particularly with the utility of CT-scan imaging for disease diagnosis. This paper contributes to the 4th COV19D competition, focusing on Covid-19 Detection and Covid-19 Domain Adaptation Challenges. Our approach centers on lung segmentation and Covid-19 infection segmentation employing the recent CNN-based segmentation architecture PDAtt-Unet, which simultaneously segments lung regions and infections. Departing from traditional methods, we concatenate the input slice (grayscale) with segmented lung and infection, generating three input channels akin to color channels. Additionally, we employ three 3D CNN backbones Customized Hybrid-DeCoVNet, along with pretrained 3D-Resnet-18 and 3D-Resnet-50 models to train Covid-19 recognition for both challenges. Furthermore, we explore ensemble approaches and testing augmentation to enhance performance. Comparison with baseline results underscores the substantial efficiency of our approach, with a significant margin in terms of F1-score (14 %). This study advances the field by presenting a comprehensive methodology for accurate Covid-19 detection and adaptation, leveraging cutting-edge AI techniques in medical image analysis.  ( 3 min )
    Potential of Domain Adaptation in Machine Learning in Ecology and Hydrology to Improve Model Extrapolability
    arXiv:2403.11331v1 Announce Type: cross Abstract: Due to the heterogeneity of the global distribution of ecological and hydrological ground-truth observations, machine learning models can have limited adaptability when applied to unknown locations, which is referred to as weak extrapolability. Domain adaptation techniques have been widely used in machine learning domains such as image classification, which can improve the model generalization ability by adjusting the difference or inconsistency of the domain distribution between the training and test sets. However, this approach has rarely been used explicitly in machine learning models in ecology and hydrology at the global scale, although these models have often been questioned due to geographic extrapolability issues. This paper briefly describes the shortcomings of current machine learning models of ecology and hydrology in terms of the global representativeness of the distribution of observations and the resulting limitations of the lack of extrapolability and suggests that future related modelling efforts should consider the use of domain adaptation techniques to improve extrapolability.  ( 2 min )
    SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
    arXiv:2403.11299v1 Announce Type: cross Abstract: Recent advancements in the vision-language model have shown notable generalization in vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language models becomes the whole network's bottleneck. To improve cross-modality alignment, existing works usually consider more visual instruction data covering a broader range of vision tasks to fine-tune the model for question-answering, which are costly to obtain. However, the image contains rich contextual information that has been largely under-explored. This paper first attempts to harness this overlooked context within visual instruction data, training the model to self-supervised `learning' how to ask high-quality questions. In this way, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data shows a consistent performance improvement compared with traditional visual-instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.  ( 2 min )
    Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback
    arXiv:2403.11330v1 Announce Type: cross Abstract: We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals. At a high level, our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI} multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the standard RHLF pipeline improve an LLM-based dialog agent. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.  ( 2 min )
    A Modified Word Saliency-Based Adversarial Attack on Text Classification Models
    arXiv:2403.11297v1 Announce Type: cross Abstract: This paper introduces a novel adversarial attack method targeting text classification models, termed the Modified Word Saliency-based Adversarial At-tack (MWSAA). The technique builds upon the concept of word saliency to strategically perturb input texts, aiming to mislead classification models while preserving semantic coherence. By refining the traditional adversarial attack approach, MWSAA significantly enhances its efficacy in evading detection by classification systems. The methodology involves first identifying salient words in the input text through a saliency estimation process, which prioritizes words most influential to the model's decision-making process. Subsequently, these salient words are subjected to carefully crafted modifications, guided by semantic similarity metrics to ensure that the altered text remains coherent and retains its original meaning. Empirical evaluations conducted on diverse text classification datasets demonstrate the effectiveness of the proposed method in generating adversarial examples capable of successfully deceiving state-of-the-art classification models. Comparative analyses with existing adversarial attack techniques further indicate the superiority of the proposed approach in terms of both attack success rate and preservation of text coherence.  ( 2 min )
    Advanced Knowledge Extraction of Physical Design Drawings, Translation and conversion to CAD formats using Deep Learning
    arXiv:2403.11291v1 Announce Type: cross Abstract: The maintenance, archiving and usage of the design drawings is cumbersome in physical form in different industries for longer period. It is hard to extract information by simple scanning of drawing sheets. Converting them to their digital formats such as Computer-Aided Design (CAD), with needed knowledge extraction can solve this problem. The conversion of these machine drawings to its digital form is a crucial challenge which requires advanced techniques. This research proposes an innovative methodology utilizing Deep Learning methods. The approach employs object detection model, such as Yolov7, Faster R-CNN, to detect physical drawing objects present in the images followed by, edge detection algorithms such as canny filter to extract and refine the identified lines from the drawing region and curve detection techniques to detect circle. Also ornaments (complex shapes) within the drawings are extracted. To ensure comprehensive conversion, an Optical Character Recognition (OCR) tool is integrated to identify and extract the text elements from the drawings. The extracted data which includes the lines, shapes and text is consolidated and stored in a structured comma separated values(.csv) file format. The accuracy and the efficiency of conversion is evaluated. Through this, conversion can be automated to help organizations enhance their productivity, facilitate seamless collaborations and preserve valuable design information in a digital format easily accessible. Overall, this study contributes to the advancement of CAD conversions, providing accurate results from the translating process. Future research can focus on handling diverse drawing types, enhanced accuracy in shape and line detection and extraction.  ( 3 min )
    Cheap Ways of Extracting Clinical Markers from Texts
    arXiv:2403.11227v1 Announce Type: cross Abstract: This paper describes the work of the UniBuc Archaeology team for CLPsych's 2024 Shared Task, which involved finding evidence within the text supporting the assigned suicide risk level. Two types of evidence were required: highlights (extracting relevant spans within the text) and summaries (aggregating evidence into a synthesis). Our work focuses on evaluating Large Language Models (LLM) as opposed to an alternative method that is much more memory and resource efficient. The first approach employs a good old-fashioned machine learning (GOML) pipeline consisting of a tf-idf vectorizer with a logistic regression classifier, whose representative features are used to extract relevant highlights. The second, more resource intensive, uses an LLM for generating the summaries and is guided by chain-of-thought to provide sequences of text indicating clinical markers.  ( 2 min )
    Simple 2D Convolutional Neural Network-based Approach for COVID-19 Detection
    arXiv:2403.11230v1 Announce Type: cross Abstract: This study explores the use of deep learning techniques for analyzing lung Computed Tomography (CT) images. Classic deep learning approaches face challenges with varying slice counts and resolutions in CT images, a diversity arising from the utilization of assorted scanning equipment. Typically, predictions are made on single slices which are then combined for a comprehensive outcome. Yet, this method does not incorporate learning features specific to each slice, leading to a compromise in effectiveness. To address these challenges, we propose an advanced Spatial-Slice Feature Learning (SSFL++) framework specifically tailored for CT scans. It aims to filter out out-of-distribution (OOD) data within the entire CT scan, allowing us to select essential spatial-slice features for analysis by reducing data redundancy by 70\%. Additionally, we introduce a Kernel-Density-based slice Sampling (KDS) method to enhance stability during training and inference phases, thereby accelerating convergence and enhancing overall performance. Remarkably, our experiments reveal that our model achieves promising results with a simple EfficientNet-2D (E2D) model. The effectiveness of our approach is confirmed on the COVID-19-CT-DB datasets provided by the DEF-AI-MIA workshop.  ( 3 min )
    CPA-Enhancer: Chain-of-Thought Prompted Adaptive Enhancer for Object Detection under Unknown Degradations
    arXiv:2403.11220v1 Announce Type: cross Abstract: Object detection methods under known single degradations have been extensively investigated. However, existing approaches require prior knowledge of the degradation type and train a separate model for each, limiting their practical applications in unpredictable environments. To address this challenge, we propose a chain-of-thought (CoT) prompted adaptive enhancer, CPA-Enhancer, for object detection under unknown degradations. Specifically, CPA-Enhancer progressively adapts its enhancement strategy under the step-by-step guidance of CoT prompts, that encode degradation-related information. To the best of our knowledge, it's the first work that exploits CoT prompting for object detection tasks. Overall, CPA-Enhancer is a plug-and-play enhancement model that can be integrated into any generic detectors to achieve substantial gains on degraded images, without knowing the degradation type priorly. Experimental results demonstrate that CPA-Enhancer not only sets the new state of the art for object detection but also boosts the performance of other downstream vision tasks under unknown degradations.  ( 2 min )
    usfAD Based Effective Unknown Attack Detection Focused IDS Framework
    arXiv:2403.11180v1 Announce Type: cross Abstract: The rapid expansion of varied network systems, including the Internet of Things (IoT) and Industrial Internet of Things (IIoT), has led to an increasing range of cyber threats. Ensuring robust protection against these threats necessitates the implementation of an effective Intrusion Detection System (IDS). For more than a decade, researchers have delved into supervised machine learning techniques to develop IDS to classify normal and attack traffic. However, building effective IDS models using supervised learning requires a substantial number of benign and attack samples. To collect a sufficient number of attack samples from real-life scenarios is not possible since cyber attacks occur occasionally. Further, IDS trained and tested on known datasets fails in detecting zero-day or unknown attacks due to the swift evolution of attack patterns. To address this challenge, we put forth two strategies for semi-supervised learning based IDS where training samples of attacks are not required: 1) training a supervised machine learning model using randomly and uniformly dispersed synthetic attack samples; 2) building a One Class Classification (OCC) model that is trained exclusively on benign network traffic. We have implemented both approaches and compared their performances using 10 recent benchmark IDS datasets. Our findings demonstrate that the OCC model based on the state-of-art anomaly detection technique called usfAD significantly outperforms conventional supervised classification and other OCC based techniques when trained and tested considering real-life scenarios, particularly to detect previously unseen attacks.  ( 3 min )
    Prior-dependent analysis of posterior sampling reinforcement learning with function approximation
    arXiv:2403.11175v1 Announce Type: cross Abstract: This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation; and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of ${\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This signifies a methodological enhancement by optimizing the $\mathcal{O}(\sqrt{\log T})$ factor over the previous benchmark (Osband and Van Roy, 2014) specified to linear mixture MDPs. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses reliant on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.  ( 2 min )
    Pencil: Private and Extensible Collaborative Learning without the Non-Colluding Assumption
    arXiv:2403.11166v1 Announce Type: cross Abstract: The escalating focus on data privacy poses significant challenges for collaborative neural network training, where data ownership and model training/deployment responsibilities reside with distinct entities. Our community has made substantial contributions to addressing this challenge, proposing various approaches such as federated learning (FL) and privacy-preserving machine learning based on cryptographic constructs like homomorphic encryption (HE) and secure multiparty computation (MPC). However, FL completely overlooks model privacy, and HE has limited extensibility (confined to only one data provider). While the state-of-the-art MPC frameworks provide reasonable throughput and simultaneously ensure model/data privacy, they rely on a critical non-colluding assumption on the computing servers, and relaxing this assumption is still an open problem. In this paper, we present Pencil, the first private training framework for collaborative learning that simultaneously offers data privacy, model privacy, and extensibility to multiple data providers, without relying on the non-colluding assumption. Our fundamental design principle is to construct the n-party collaborative training protocol based on an efficient two-party protocol, and meanwhile ensuring that switching to different data providers during model training introduces no extra cost. We introduce several novel cryptographic protocols to realize this design principle and conduct a rigorous security and privacy analysis. Our comprehensive evaluations of Pencil demonstrate that (i) models trained in plaintext and models trained privately using Pencil exhibit nearly identical test accuracies; (ii) The training overhead of Pencil is greatly reduced: Pencil achieves 10 ~ 260x higher throughput and 2 orders of magnitude less communication than prior art; (iii) Pencil is resilient against both existing and adaptive (white-box) attacks.  ( 3 min )
    A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques
    arXiv:2403.11163v1 Announce Type: cross Abstract: This paper presents a selective review of statistical computation methods for massive data analysis. A huge amount of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation, where the dataset size is too huge to be comfortably handled by one single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns about the situation, where the sample size of dataset is small enough to be placed on one single computer but too large to be easily processed by its memory as a whole. The last class of literature studies those minibatch gradient related optimization techniques, which have been extensively used for optimizing various deep learning models.  ( 2 min )
    Machine learning-based system reliability analysis with Gaussian Process Regression
    arXiv:2403.11125v1 Announce Type: cross Abstract: Machine learning-based reliability analysis methods have shown great advancements for their computational efficiency and accuracy. Recently, many efficient learning strategies have been proposed to enhance the computational performance. However, few of them explores the theoretical optimal learning strategy. In this article, we propose several theorems that facilitates such exploration. Specifically, cases that considering and neglecting the correlations among the candidate design samples are well elaborated. Moreover, we prove that the well-known U learning function can be reformulated to the optimal learning function for the case neglecting the Kriging correlation. In addition, the theoretical optimal learning strategy for sequential multiple training samples enrichment is also mathematically explored through the Bayesian estimate with the corresponding lost functions. Simulation results show that the optimal learning strategy considering the Kriging correlation works better than that neglecting the Kriging correlation and other state-of-the art learning functions from the literatures in terms of the reduction of number of evaluations of performance function. However, the implementation needs to investigate very large computational resource.  ( 2 min )
    CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion
    arXiv:2403.11162v1 Announce Type: cross Abstract: Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot generation where a pretrained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success, concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response, we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pretrained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image, which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication, surpassing alternative validation techniques. Code implementation is available at https://github.com/Nicholas0228/Revelio.  ( 2 min )
    ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models
    arXiv:2403.11103v1 Announce Type: cross Abstract: Although Large Language Models (LLMs) exhibit remarkable adaptability across domains, these models often fall short in structured knowledge extraction tasks such as named entity recognition (NER). This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets. Our approach diverges from the basic class-conditional prompts by instructing LLMs to self-reflect on the specific domain, thereby generating domain-relevant attributes (such as category and emotions for movie reviews), which are utilized for creating attribute-rich training data. Furthermore, we preemptively generate entity terms and then develop NER context data around these entities, effectively bypassing the LLMs' challenges with complex structures. Our experiments across both general and niche domains reveal significant performance enhancements over conventional data generation methods while being more cost-effective than existing alternatives.  ( 2 min )
    Learning-Based Pricing and Matching for Two-Sided Queues
    arXiv:2403.11093v1 Announce Type: cross Abstract: We consider a dynamic system with multiple types of customers and servers. Each type of waiting customer or server joins a separate queue, forming a bipartite graph with customer-side queues and server-side queues. The platform can match the servers and customers if their types are compatible. The matched pairs then leave the system. The platform will charge a customer a price according to their type when they arrive and will pay a server a price according to their type. The arrival rate of each queue is determined by the price according to some unknown demand or supply functions. Our goal is to design pricing and matching algorithms to maximize the profit of the platform with unknown demand and supply functions, while keeping queue lengths of both customers and servers below a predetermined threshold. This system can be used to model two-sided markets such as ride-sharing markets with passengers and drivers. The difficulties of the problem include simultaneous learning and decision making, and the tradeoff between maximizing profit and minimizing queue length. We use a longest-queue-first matching algorithm and propose a learning-based pricing algorithm, which combines gradient-free stochastic projected gradient ascent with bisection search. We prove that our proposed algorithm yields a sublinear regret $\tilde{O}(T^{5/6})$ and queue-length bound $\tilde{O}(T^{2/3})$, where $T$ is the time horizon. We further establish a tradeoff between the regret bound and the queue-length bound: $\tilde{O}(T^{1-\gamma/4})$ versus $\tilde{O}(T^{\gamma})$ for $\gamma \in (0, 2/3].$  ( 2 min )
    RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning
    arXiv:2403.11082v1 Announce Type: cross Abstract: Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, Our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51\% to 38.81\%). The framework also yields improvements of 1.59\% and 0.23\% in semantic textual similarity tasks and various transfer tasks, respectively.  ( 2 min )
    Brain-on-Switch: Towards Advanced Intelligent Network Data Plane via NN-Driven Traffic Analysis at Line-Speed
    arXiv:2403.11090v1 Announce Type: cross Abstract: The emerging programmable networks sparked significant research on Intelligent Network Data Plane (INDP), which achieves learning-based traffic analysis at line-speed. Prior art in INDP focus on deploying tree/forest models on the data plane. We observe a fundamental limitation in tree-based INDP approaches: although it is possible to represent even larger tree/forest tables on the data plane, the flow features that are computable on the data plane are fundamentally limited by hardware constraints. In this paper, we present BoS to push the boundaries of INDP by enabling Neural Network (NN) driven traffic analysis at line-speed. Many types of NNs (such as Recurrent Neural Network (RNN), and transformers) that are designed to work with sequential data have advantages over tree-based models, because they can take raw network data as input without complex feature computations on the fly. However, the challenge is significant: the recurrent computation scheme used in RNN inference is fundamentally different from the match-action paradigm used on the network data plane. BoS addresses this challenge by (i) designing a novel data plane friendly RNN architecture that can execute unlimited RNN time steps with limited data plane stages, effectively achieving line-speed RNN inference; and (ii) complementing the on-switch RNN model with an off-switch transformer-based traffic analysis module to further boost the overall performance. We implement a prototype of BoS using a P4 programmable switch as our data plane, and extensively evaluate it over multiple traffic analysis tasks. The results show that BoS outperforms state-of-the-art in both analysis accuracy and scalability.  ( 3 min )
    Bridging Expert Knowledge with Deep Learning Techniques for Just-In-Time Defect Prediction
    arXiv:2403.11079v1 Announce Type: cross Abstract: Just-In-Time (JIT) defect prediction aims to automatically predict whether a commit is defective or not, and has been widely studied in recent years. In general, most studies can be classified into two categories: 1) simple models using traditional machine learning classifiers with hand-crafted features, and 2) complex models using deep learning techniques to automatically extract features from commit contents. Hand-crafted features used by simple models are based on expert knowledge but may not fully represent the semantic meaning of the commits. On the other hand, deep learning-based features used by complex models represent the semantic meaning of commits but may not reflect useful expert knowledge. Simple models and complex models seem complementary to each other to some extent. To utilize the advantages of both simple and complex models, we propose a model fusion framework that adopts both early fusions on the feature level and late fusions on the decision level. We propose SimCom++ by adopting the best early and late fusion strategies. The experimental results show that SimCom++ can significantly outperform the baselines by 5.7--26.9\%. In addition, our experimental results confirm that the simple model and complex model are complementary to each other.  ( 2 min )
    JustQ: Automated Deployment of Fair and Accurate Quantum Neural Networks
    arXiv:2403.11048v1 Announce Type: cross Abstract: Despite the success of Quantum Neural Networks (QNNs) in decision-making systems, their fairness remains unexplored, as the focus primarily lies on accuracy. This work conducts a design space exploration, unveiling QNN unfairness, and highlighting the significant influence of QNN deployment and quantum noise on accuracy and fairness. To effectively navigate the vast QNN deployment design space, we propose JustQ, a framework for deploying fair and accurate QNNs on NISQ computers. It includes a complete NISQ error model, reinforcement learning-based deployment, and a flexible optimization objective incorporating both fairness and accuracy. Experimental results show JustQ outperforms previous methods, achieving superior accuracy and fairness. This work pioneers fair QNN design on NISQ computers, paving the way for future investigations.  ( 2 min )
    Topologically faithful multi-class segmentation in medical images
    arXiv:2403.11001v1 Announce Type: cross Abstract: Topological accuracy in medical image segmentation is a highly important property for downstream applications such as network analysis and flow modeling in vessels or cell counting. Recently, significant methodological advancements have brought well-founded concepts from algebraic topology to binary segmentation. However, these approaches have been underexplored in multi-class segmentation scenarios, where topological errors are common. We propose a general loss function for topologically faithful multi-class segmentation extending the recent Betti matching concept, which is based on induced matchings of persistence barcodes. We project the N-class segmentation problem to N single-class segmentation tasks, which allows us to use 1-parameter persistent homology making training of neural networks computationally feasible. We validate our method on a comprehensive set of four medical datasets with highly variant topological characteristics. Our loss formulation significantly enhances topological correctness in cardiac, cell, artery-vein, and Circle of Willis segmentation.  ( 2 min )
    Regulating Chatbot Output via Inter-Informational Competition
    arXiv:2403.11046v1 Announce Type: cross Abstract: The advent of ChatGPT has sparked over a year of regulatory frenzy. However, few existing studies have rigorously questioned the assumption that, if left unregulated, AI chatbot's output would inflict tangible, severe real harm on human affairs. Most researchers have overlooked the critical possibility that the information market itself can effectively mitigate these risks and, as a result, they tend to use regulatory tools to address the issue directly. This Article develops a yardstick for reevaluating both AI-related content risks and corresponding regulatory proposals by focusing on inter-informational competition among various outlets. The decades-long history of regulating information and communications technologies indicates that regulators tend to err too much on the side of caution and to put forward excessive regulatory measures when encountering the uncertainties brought about by new technologies. In fact, a trove of empirical evidence has demonstrated that market competition among information outlets can effectively mitigate most risks and that overreliance on regulation is not only unnecessary but detrimental, as well. This Article argues that sufficient competition among chatbots and other information outlets in the information marketplace can sufficiently mitigate and even resolve most content risks posed by generative AI technologies. This renders certain loudly advocated regulatory strategies, like mandatory prohibitions, licensure, curation of datasets, and notice-and-response regimes, truly unnecessary and even toxic to desirable competition and innovation throughout the AI industry. Ultimately, the ideas that I advance in this Article should pour some much-needed cold water on the regulatory frenzy over generative AI and steer the issue back to a rational track.  ( 3 min )
    N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
    arXiv:2403.10997v1 Announce Type: cross Abstract: Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.  ( 2 min )
    A Scalable and Parallelizable Digital Twin Framework for Sustainable Sim2Real Transition of Multi-Agent Reinforcement Learning Systems
    arXiv:2403.10996v1 Announce Type: cross Abstract: This work presents a sustainable multi-agent deep reinforcement learning framework capable of selectively scaling parallelized training workloads on-demand, and transferring the trained policies from simulation to reality using minimal hardware resources. We introduce AutoDRIVE Ecosystem as an enabling digital twin framework to train, deploy, and transfer cooperative as well as competitive multi-agent reinforcement learning policies from simulation to reality. Particularly, we first investigate an intersection traversal problem of 4 cooperative vehicles (Nigel) that share limited state information in single as well as multi-agent learning settings using a common policy approach. We then investigate an adversarial autonomous racing problem of 2 vehicles (F1TENTH) using an individual policy approach. In either set of experiments, a decentralized learning architecture was adopted, which allowed robust training and testing of the policies in stochastic environments. The agents were provided with realistically sparse observation spaces, and were restricted to sample control actions that implicitly satisfied the imposed kinodynamic and safety constraints. The experimental results for both problem statements are reported in terms of quantitative metrics and qualitative remarks for training as well as deployment phases. We also discuss agent and environment parallelization techniques adopted to efficiently accelerate MARL training, while analyzing their computational performance. Finally, we demonstrate a resource-aware transition of the trained policies from simulation to reality using the proposed digital twin framework.  ( 3 min )
    SelfIE: Self-Interpretation of Large Language Model Embeddings
    arXiv:2403.10949v1 Announce Type: cross Abstract: How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond inquiry about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions on hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control that erases harmful knowledge in LLM without supervision targets.  ( 2 min )
    Enhancing IoT Security Against DDoS Attacks through Federated Learning
    arXiv:2403.10968v1 Announce Type: cross Abstract: The rapid proliferation of the Internet of Things (IoT) has ushered in transformative connectivity between physical devices and the digital realm. Nonetheless, the escalating threat of Distributed Denial of Service (DDoS) attacks jeopardizes the integrity and reliability of IoT networks. Conventional DDoS mitigation approaches are ill-equipped to handle the intricacies of IoT ecosystems, potentially compromising data privacy. This paper introduces an innovative strategy to bolster the security of IoT networks against DDoS attacks by harnessing the power of Federated Learning that allows multiple IoT devices or edge nodes to collaboratively build a global model while preserving data privacy and minimizing communication overhead. The research aims to investigate Federated Learning's effectiveness in detecting and mitigating DDoS attacks in IoT. Our proposed framework leverages IoT devices' collective intelligence for real-time attack detection without compromising sensitive data. This study proposes innovative deep autoencoder approaches for data dimensionality reduction, retraining, and partial selection to enhance the performance and stability of the proposed model. Additionally, two renowned aggregation algorithms, FedAvg and FedAvgM, are employed in this research. Various metrics, including true positive rate, false positive rate, and F1-score, are employed to evaluate the model. The dataset utilized in this research, N-BaIoT, exhibits non-IID data distribution, where data categories are distributed quite differently. The negative impact of these distribution disparities is managed by employing retraining and partial selection techniques, enhancing the final model's stability. Furthermore, evaluation results demonstrate that the FedAvgM aggregation algorithm outperforms FedAvg, indicating that in non-IID datasets, FedAvgM provides better stability and performance.  ( 3 min )
    The Fallacy of Minimizing Local Regret in the Sequential Task Setting
    arXiv:2403.10946v1 Announce Type: cross Abstract: In the realm of Reinforcement Learning (RL), online RL is often conceptualized as an optimization problem, where an algorithm interacts with an unknown environment to minimize cumulative regret. In a stationary setting, strong theoretical guarantees, like a sublinear ($\sqrt{T}$) regret bound, can be obtained, which typically implies the convergence to an optimal policy and the cessation of exploration. However, these theoretical setups often oversimplify the complexities encountered in real-world RL implementations, where tasks arrive sequentially with substantial changes between tasks and the algorithm may not be allowed to adaptively learn within certain tasks. We study the changes beyond the outcome distributions, encompassing changes in the reward designs (mappings from outcomes to rewards) and the permissible policy spaces. Our results reveal the fallacy of myopically minimizing regret within each task: obtaining optimal regret rates in the early tasks may lead to worse rates in the subsequent ones, even when the outcome distributions stay the same. To realize the optimal cumulative regret bound across all the tasks, the algorithm has to overly explore in the earlier tasks. This theoretical insight is practically significant, suggesting that due to unanticipated changes (e.g., rapid technological development or human-in-the-loop involvement) between tasks, the algorithm needs to explore more than it would in the usual stationary setting within each task. Such implication resonates with the common practice of using clipped policies in mobile health clinical trials and maintaining a fixed rate of $\epsilon$-greedy exploration in robotic learning.  ( 3 min )
    ViSaRL: Visual Reinforcement Learning Guided by Human Saliency
    arXiv:2403.10940v1 Announce Type: cross Abstract: Training robots to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is sample-inefficient, because image observations are comprised primarily of task-irrelevant information. By contrast, humans are able to visually attend to task-relevant objects and areas. Based on this insight, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL). Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent on diverse tasks including DeepMind Control benchmark, robot manipulation in simulation and on a real robot. We present approaches for incorporating saliency into both CNN and Transformer-based encoders. We show that visual representations learned using ViSaRL are robust to various sources of visual perturbations including perceptual noise and scene variations. ViSaRL nearly doubles success rate on the real-robot tasks compared to the baseline which does not use saliency.  ( 2 min )
    Function-space Parameterization of Neural Networks for Sequential Learning
    arXiv:2403.10929v1 Announce Type: cross Abstract: Sequential learning paradigms pose challenges for gradient-based deep learning due to difficulties incorporating new data and retaining prior knowledge. While Gaussian processes elegantly tackle these problems, they struggle with scalability and handling rich inputs, such as images. To address these issues, we introduce a technique that converts neural networks from weight space to function space, through a dual parameterization. Our parameterization offers: (i) a way to scale function-space methods to large data sets via sparsification, (ii) retention of prior knowledge when access to past data is limited, and (iii) a mechanism to incorporate new data without retraining. Our experiments demonstrate that we can retain knowledge in continual learning and incorporate new data efficiently. We further show its strengths in uncertainty quantification and guiding exploration in model-based RL. Further information and code is available on the project website.  ( 2 min )
    Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR
    arXiv:2403.10937v1 Announce Type: cross Abstract: This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages where the baseline language model is insufficient for generating inclusive lattices. We minimally augment the baseline language model with word unigram counts that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with such an augmented baseline language model are more comprehensive. We obtain 21.8% (Telugu) and 41.8% (Kannada) relative word error reduction with our proposed method. This reduction in word error rate is comparable to 21.5% (Telugu) and 45.9% (Kannada) relative word error reduction obtained by decoding with full Wikipedia text augmented language mode while our approach consumes only 1/8th the memory. We demonstrate that our method is comparable with various text selection-based language model augmentation and also consistent for data sets of different sizes. Our approach is applicable for training speech recognition systems under low resource conditions where speech data and compute resources are insufficient, while there is a large text corpus that is available in the target language. Our research involves addressing the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple and yet computationally less expensive.  ( 3 min )
    Distributed Multi-Objective Dynamic Offloading Scheduling for Air-Ground Cooperative MEC
    arXiv:2403.10927v1 Announce Type: cross Abstract: Utilizing unmanned aerial vehicles (UAVs) with edge server to assist terrestrial mobile edge computing (MEC) has attracted tremendous attention. Nevertheless, state-of-the-art schemes based on deterministic optimizations or single-objective reinforcement learning (RL) cannot reduce the backlog of task bits and simultaneously improve energy efficiency in highly dynamic network environments, where the design problem amounts to a sequential decision-making problem. In order to address the aforementioned problems, as well as the curses of dimensionality introduced by the growing number of terrestrial terrestrial users, this paper proposes a distributed multi-objective (MO) dynamic trajectory planning and offloading scheduling scheme, integrated with MORL and the kernel method. The design of n-step return is also applied to average fluctuations in the backlog. Numerical results reveal that the n-step return can benefit the proposed kernel-based approach, achieving significant improvement in the long-term average backlog performance, compared to the conventional 1-step return design. Due to such design and the kernel-based neural network, to which decision-making features can be continuously added, the kernel-based approach can outperform the approach based on fully-connected deep neural network, yielding improvement in energy consumption and the backlog performance, as well as a significant reduction in decision-making and online learning time.  ( 2 min )
    Automatic location detection based on deep learning
    arXiv:2403.10912v1 Announce Type: cross Abstract: The proliferation of digital images and the advancements in deep learning have paved the way for innovative solutions in various domains, especially in the field of image classification. Our project presents an in-depth study and implementation of an image classification system specifically tailored to identify and classify images of Indian cities. Drawing from an extensive dataset, our model classifies images into five major Indian cities: Ahmedabad, Delhi, Kerala, Kolkata, and Mumbai to recognize the distinct features and characteristics of each city/state. To achieve high precision and recall rates, we adopted two approaches. The first, a vanilla Convolutional Neural Network (CNN) and then we explored the power of transfer learning by leveraging the VGG16 model. The vanilla CNN achieved commendable accuracy and the VGG16 model achieved a test accuracy of 63.6%. Evaluations highlighted the strengths and potential areas of improvement, positioning our model as not only competitive but also scalable for broader applications. With an emphasis on open-source ethos, our work aims to contribute to the community, encouraging further development and diverse applications. Our findings demonstrate the potential applications in tourism, urban planning, and even real-time location identification systems, among others.  ( 2 min )
    FedQNN: Federated Learning using Quantum Neural Networks
    arXiv:2403.10861v1 Announce Type: cross Abstract: In this study, we explore the innovative domain of Quantum Federated Learning (QFL) as a framework for training Quantum Machine Learning (QML) models via distributed networks. Conventional machine learning models frequently grapple with issues about data privacy and the exposure of sensitive information. Our proposed Federated Quantum Neural Network (FedQNN) framework emerges as a cutting-edge solution, integrating the singular characteristics of QML with the principles of classical federated learning. This work thoroughly investigates QFL, underscoring its capability to secure data handling in a distributed environment and facilitate cooperative learning without direct data sharing. Our research corroborates the concept through experiments across varied datasets, including genomics and healthcare, thereby validating the versatility and efficacy of our FedQNN framework. The results consistently exceed 86% accuracy across three distinct datasets, proving its suitability for conducting various QML tasks. Our research not only identifies the limitations of classical paradigms but also presents a novel framework to propel the field of QML into a new era of secure and collaborative innovation.  ( 2 min )
    stMCDI: Masked Conditional Diffusion Model with Graph Neural Network for Spatial Transcriptomics Data Imputation
    arXiv:2403.10863v1 Announce Type: cross Abstract: Spatially resolved transcriptomics represents a significant advancement in single-cell analysis by offering both gene expression data and their corresponding physical locations. However, this high degree of spatial resolution entails a drawback, as the resulting spatial transcriptomic data at the cellular level is notably plagued by a high incidence of missing values. Furthermore, most existing imputation methods either overlook the spatial information between spots or compromise the overall gene expression data distribution. To address these challenges, our primary focus is on effectively utilizing the spatial location information within spatial transcriptomic data to impute missing values, while preserving the overall data distribution. We introduce \textbf{stMCDI}, a novel conditional diffusion model for spatial transcriptomics data imputation, which employs a denoising network trained using randomly masked data portions as guidance, with the unmasked data serving as conditions. Additionally, it utilizes a GNN encoder to integrate the spatial position information, thereby enhancing model performance. The results obtained from spatial transcriptomics datasets elucidate the performance of our methods relative to existing approaches.  ( 2 min )
    Neural-Kernel Conditional Mean Embeddings
    arXiv:2403.10859v1 Announce Type: cross Abstract: Kernel conditional mean embeddings (CMEs) offer a powerful framework for representing conditional distribution, but they often face scalability and expressiveness challenges. In this work, we propose a new method that effectively combines the strengths of deep learning with CMEs in order to address these challenges. Specifically, our approach leverages the end-to-end neural network (NN) optimization framework using a kernel-based objective. This design circumvents the computationally expensive Gram matrix inversion required by current CME methods. To further enhance performance, we provide efficient strategies to optimize the remaining kernel hyperparameters. In conditional density estimation tasks, our NN-CME hybrid achieves competitive performance and often surpasses existing deep learning-based methods. Lastly, we showcase its remarkable versatility by seamlessly integrating it into reinforcement learning (RL) contexts. Building on Q-learning, our approach naturally leads to a new variant of distributional RL methods, which demonstrates consistent effectiveness across different environments.  ( 2 min )
    SF(DA)$^2$: Source-free Domain Adaptation Through the Lens of Data Augmentation
    arXiv:2403.10834v1 Announce Type: cross Abstract: In the face of the deep learning model's vulnerability to domain shift, source-free domain adaptation (SFDA) methods have been proposed to adapt models to new, unseen target domains without requiring access to source domain data. Although the potential benefits of applying data augmentation to SFDA are attractive, several challenges arise such as the dependence on prior knowledge of class-preserving transformations and the increase in memory and computational requirements. In this paper, we propose Source-free Domain Adaptation Through the Lens of Data Augmentation (SF(DA)$^2$), a novel approach that leverages the benefits of data augmentation without suffering from these challenges. We construct an augmentation graph in the feature space of the pretrained model using the neighbor relationships between target features and propose spectral neighborhood clustering to identify partitions in the prediction space. Furthermore, we propose implicit feature augmentation and feature disentanglement as regularization loss functions that effectively utilize class semantic information within the feature space. These regularizers simulate the inclusion of an unlimited number of augmented target features into the augmentation graph while minimizing computational and memory demands. Our method shows superior adaptation performance in SFDA scenarios, including 2D image and 3D point cloud datasets and a highly imbalanced dataset.  ( 2 min )
    From Words to Routes: Applying Large Language Models to Vehicle Routing
    arXiv:2403.10795v1 Announce Type: cross Abstract: LLMs have shown impressive progress in robotics (e.g., manipulation and navigation) with natural language task descriptions. The success of LLMs in these tasks leads us to wonder: What is the ability of LLMs to solve vehicle routing problems (VRPs) with natural language task descriptions? In this work, we study this question in three steps. First, we construct a dataset with 21 types of single- or multi-vehicle routing problems. Second, we evaluate the performance of LLMs across four basic prompt paradigms of text-to-code generation, each involving different types of text input. We find that the basic prompt paradigm, which generates code directly from natural language task descriptions, performs the best for GPT-4, achieving 56% feasibility, 40% optimality, and 53% efficiency. Third, based on the observation that LLMs may not be able to provide correct solutions at the initial attempt, we propose a framework that enables LLMs to refine solutions through self-reflection, including self-debugging and self-verification. With GPT-4, our proposed framework achieves a 16% increase in feasibility, a 7% increase in optimality, and a 15% increase in efficiency. Moreover, we examine the sensitivity of GPT-4 to task descriptions, specifically focusing on how its performance changes when certain details are omitted from the task descriptions, yet the core meaning is preserved. Our findings reveal that such omissions lead to a notable decrease in performance: 4% in feasibility, 4% in optimality, and 5% in efficiency. Website: https://sites.google.com/view/words-to-routes/  ( 3 min )
    Efficient Pruning of Large Language Model with Adaptive Estimation Fusion
    arXiv:2403.10799v1 Announce Type: cross Abstract: Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and finegrained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the endto-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B,Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.  ( 2 min )
    Diffusion-Reinforcement Learning Hierarchical Motion Planning in Adversarial Multi-agent Games
    arXiv:2403.10794v1 Announce Type: cross Abstract: Reinforcement Learning- (RL-)based motion planning has recently shown the potential to outperform traditional approaches from autonomous navigation to robot manipulation. In this work, we focus on a motion planning task for an evasive target in a partially observable multi-agent adversarial pursuit-evasion games (PEG). These pursuit-evasion problems are relevant to various applications, such as search and rescue operations and surveillance robots, where robots must effectively plan their actions to gather intelligence or accomplish mission tasks while avoiding detection or capture themselves. We propose a hierarchical architecture that integrates a high-level diffusion model to plan global paths responsive to environment data while a low-level RL algorithm reasons about evasive versus global path-following behavior. Our approach outperforms baselines by 51.2% by leveraging the diffusion model to guide the RL algorithm for more efficient exploration and improves the explanability and predictability.  ( 2 min )
    QuantumLeak: Stealing Quantum Neural Networks from Cloud-based NISQ Machines
    arXiv:2403.10790v1 Announce Type: cross Abstract: Variational quantum circuits (VQCs) have become a powerful tool for implementing Quantum Neural Networks (QNNs), addressing a wide range of complex problems. Well-trained VQCs serve as valuable intellectual assets hosted on cloud-based Noisy Intermediate Scale Quantum (NISQ) computers, making them susceptible to malicious VQC stealing attacks. However, traditional model extraction techniques designed for classical machine learning models encounter challenges when applied to NISQ computers due to significant noise in current devices. In this paper, we introduce QuantumLeak, an effective and accurate QNN model extraction technique from cloud-based NISQ machines. Compared to existing classical model stealing techniques, QuantumLeak improves local VQC accuracy by 4.99\%$\sim$7.35\% across diverse datasets and VQC architectures.  ( 2 min )
    From Melting Pots to Misrepresentations: Exploring Harms in Generative AI
    arXiv:2403.10776v1 Announce Type: cross Abstract: With the widespread adoption of advanced generative models such as Gemini and GPT, there has been a notable increase in the incorporation of such models into sociotechnical systems, categorized under AI-as-a-Service (AIaaS). Despite their versatility across diverse sectors, concerns persist regarding discriminatory tendencies within these models, particularly favoring selected `majority' demographics across various sociodemographic dimensions. Despite widespread calls for diversification of media representations, marginalized racial and ethnic groups continue to face persistent distortion, stereotyping, and neglect within the AIaaS context. In this work, we provide a critical summary of the state of research in the context of social harms to lead the conversation to focus on their implications. We also present open-ended research questions, guided by our discussion, to help define future research pathways.  ( 2 min )
    ContourDiff: Unpaired Image Translation with Contour-Guided Diffusion Models
    arXiv:2403.10786v1 Announce Type: cross Abstract: Accurately translating medical images across different modalities (e.g., CT to MRI) has numerous downstream clinical and machine learning applications. While several methods have been proposed to achieve this, they often prioritize perceptual quality with respect to output domain features over preserving anatomical fidelity. However, maintaining anatomy during translation is essential for many tasks, e.g., when leveraging masks from the input domain to develop a segmentation model with images translated to the output domain. To address these challenges, we propose ContourDiff, a novel framework that leverages domain-invariant anatomical contour representations of images. These representations are simple to extract from images, yet form precise spatial constraints on their anatomical content. We introduce a diffusion model that converts contour representations of images from arbitrary input domains into images in the output domain of interest. By applying the contour as a constraint at every diffusion sampling step, we ensure the preservation of anatomical content. We evaluate our method by training a segmentation model on images translated from CT to MRI with their original CT masks and testing its performance on real MRIs. Our method outperforms other unpaired image translation methods by a significant margin, furthermore without the need to access any input domain information during training.  ( 2 min )
    A Primal-Dual Algorithm for Faster Distributionally Robust Optimization
    arXiv:2403.10763v1 Announce Type: cross Abstract: We consider the penalized distributionally robust optimization (DRO) problem with a closed, convex uncertainty set, a setting that encompasses the $f$-DRO, Wasserstein-DRO, and spectral/$L$-risk formulations used in practice. We present Drago, a stochastic primal-dual algorithm that achieves a state-of-the-art linear convergence rate on strongly convex-strongly concave DRO problems. The method combines both randomized and cyclic components with mini-batching, which effectively handles the unique asymmetric nature of the primal and dual problems in DRO. We support our theoretical results with numerical benchmarks in classification and regression.  ( 2 min )
    Scheduling Drone and Mobile Charger via Hybrid-Action Deep Reinforcement Learning
    arXiv:2403.10761v1 Announce Type: cross Abstract: Recently there has been a growing interest in industry and academia, regarding the use of wireless chargers to prolong the operational longevity of unmanned aerial vehicles (commonly knowns as drones). In this paper we consider a charger-assisted drone application: a drone is deployed to observe a set points of interest, while a charger can move to recharge the drone's battery. We focus on the route and charging schedule of the drone and the mobile charger, to obtain high observation utility with the shortest possible time, while ensuring the drone remains operational during task execution. Essentially, this proposed drone-charger scheduling problem is a multi-stage decision-making process, in which the drone and the mobile charger act as two agents who cooperate to finish a task. The discrete-continuous hybrid action space of the two agents poses a significant challenge in our problem. To address this issue, we present a hybrid-action deep reinforcement learning framework, called HaDMC, which uses a standard policy learning algorithm to generate latent continuous actions. Motivated by representation learning, we specifically design and train an action decoder. It involves two pipelines to convert the latent continuous actions into original discrete and continuous actions, by which the drone and the charger can directly interact with environment. We embed a mutual learning scheme in model training, emphasizing the collaborative rather than individual actions. We conduct extensive numerical experiments to evaluate HaDMC and compare it with state-of-the-art deep reinforcement learning approaches. The experimental results show the effectiveness and efficiency of our solution.  ( 3 min )
    Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation
    arXiv:2403.10731v1 Announce Type: cross Abstract: Recent years have seen significant progress in human image generation, particularly with the advancements in diffusion models. However, existing diffusion methods encounter challenges when producing consistent hand anatomy and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body out-painting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. A novel blending technique is introduced to preserve the hand details during the second stage that combines the results of both stages in a coherent way. This involves sequential expansion of the out-painted region while fusing the latent representations, to ensure a seamless and cohesive synthesis of the final image. Experimental evaluations demonstrate the superiority of our proposed method over state-of-the-art techniques, in both pose accuracy and image quality, as validated on the HaGRID dataset. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code of the proposed approach is available at https://github.com/apelykh/hand-to-diffusion.  ( 3 min )
    A Comprehensive Review of Latent Space Dynamics Identification Algorithms for Intrusive and Non-Intrusive Reduced-Order-Modeling
    arXiv:2403.10748v1 Announce Type: cross Abstract: Numerical solvers of partial differential equations (PDEs) have been widely employed for simulating physical systems. However, the computational cost remains a major bottleneck in various scientific and engineering applications, which has motivated the development of reduced-order models (ROMs). Recently, machine-learning-based ROMs have gained significant popularity and are promising for addressing some limitations of traditional ROM methods, especially for advection dominated systems. In this chapter, we focus on a particular framework known as Latent Space Dynamics Identification (LaSDI), which transforms the high-fidelity data, governed by a PDE, to simpler and low-dimensional latent-space data, governed by ordinary differential equations (ODEs). These ODEs can be learned and subsequently interpolated to make ROM predictions. Each building block of LaSDI can be easily modulated depending on the application, which makes the LaSDI framework highly flexible. In particular, we present strategies to enforce the laws of thermodynamics into LaSDI models (tLaSDI), enhance robustness in the presence of noise through the weak form (WLaSDI), select high-fidelity training data efficiently through active learning (gLaSDI, GPLaSDI), and quantify the ROM prediction uncertainty through Gaussian processes (GPLaSDI). We demonstrate the performance of different LaSDI approaches on Burgers equation, a non-linear heat conduction problem, and a plasma physics problem, showing that LaSDI algorithms can achieve relative errors of less than a few percent and up to thousands of times speed-ups.  ( 3 min )
    Uncovering Latent Themes of Messaging on Social Media by Integrating LLMs: A Case Study on Climate Campaigns
    arXiv:2403.10707v1 Announce Type: cross Abstract: This paper introduces a novel approach to uncovering and analyzing themes in social media messaging. Recognizing the limitations of traditional topic-level analysis, which tends to capture only the overarching patterns, this study emphasizes the need for a finer-grained, theme-focused exploration. Conventional methods of theme discovery, involving manual processes and a human-in-the-loop approach, are valuable but face challenges in scalability, consistency, and resource intensity in terms of time and cost. To address these challenges, we propose a machine-in-the-loop approach that leverages the advanced capabilities of Large Language Models (LLMs). This approach allows for a deeper investigation into the thematic aspects of social media discourse, enabling us to uncover a diverse array of themes, each with unique characteristics and relevance, thereby offering a comprehensive understanding of the nuances present within broader topics. Furthermore, this method efficiently maps the text and the newly discovered themes, enhancing our understanding of the thematic nuances in social media messaging. We employ climate campaigns as a case study and demonstrate that our methodology yields more accurate and interpretable results compared to traditional topic models. Our results not only demonstrate the effectiveness of our approach in uncovering latent themes but also illuminate how these themes are tailored for demographic targeting in social media contexts. Additionally, our work sheds light on the dynamic nature of social media, revealing the shifts in the thematic focus of messaging in response to real-world events.  ( 3 min )
    On the low-shot transferability of [V]-Mamba
    arXiv:2403.10696v1 Announce Type: cross Abstract: The strength of modern large-scale neural networks lies in their ability to efficiently adapt to new tasks with few examples. Although extensive research has investigated the transferability of Vision Transformers (ViTs) to various downstream tasks under diverse constraints, this study shifts focus to explore the transfer learning potential of [V]-Mamba. We compare its performance with ViTs across different few-shot data budgets and efficient transfer methods. Our analysis yields three key insights into [V]-Mamba's few-shot transfer performance: (a) [V]-Mamba demonstrates superior or equivalent few-shot learning capabilities compared to ViTs when utilizing linear probing (LP) for transfer, (b) Conversely, [V]-Mamba exhibits weaker or similar few-shot learning performance compared to ViTs when employing visual prompting (VP) as the transfer method, and (c) We observe a weak positive correlation between the performance gap in transfer via LP and VP and the scale of the [V]-Mamba model. This preliminary analysis lays the foundation for more comprehensive studies aimed at furthering our understanding of the capabilities of [V]-Mamba variants and their distinctions from ViTs.  ( 2 min )
    Latent Object Characteristics Recognition with Visual to Haptic-Audio Cross-modal Transfer Learning
    arXiv:2403.10689v1 Announce Type: cross Abstract: Recognising the characteristics of objects while a robot handles them is crucial for adjusting motions that ensure stable and efficient interactions with containers. Ahead of realising stable and efficient robot motions for handling/transferring the containers, this work aims to recognise the latent unobservable object characteristics. While vision is commonly used for object recognition by robots, it is ineffective for detecting hidden objects. However, recognising objects indirectly using other sensors is a challenging task. To address this challenge, we propose a cross-modal transfer learning approach from vision to haptic-audio. We initially train the model with vision, directly observing the target object. Subsequently, we transfer the latent space learned from vision to a second module, trained only with haptic-audio and motor data. This transfer learning framework facilitates the representation of object characteristics using indirect sensor data, thereby improving recognition accuracy. For evaluating the recognition accuracy of our proposed learning framework we selected shape, position, and orientation as the object characteristics. Finally, we demonstrate online recognition of both trained and untrained objects using the humanoid robot Nextage Open.  ( 2 min )
    MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
    arXiv:2403.10691v1 Announce Type: cross Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.  ( 2 min )
    AutoHLS: Learning to Accelerate Design Space Exploration for HLS Designs
    arXiv:2403.10686v1 Announce Type: cross Abstract: High-level synthesis (HLS) is a design flow that leverages modern language features and flexibility, such as complex data structures, inheritance, templates, etc., to prototype hardware designs rapidly. However, exploring various design space parameters can take much time and effort for hardware engineers to meet specific design specifications. This paper proposes a novel framework called AutoHLS, which integrates a deep neural network (DNN) with Bayesian optimization (BO) to accelerate HLS hardware design optimization. Our tool focuses on HLS pragma exploration and operation transformation. It utilizes integrated DNNs to predict synthesizability within a given FPGA resource budget. We also investigate the potential of emerging quantum neural networks (QNNs) instead of classical DNNs for the AutoHLS pipeline. Our experimental results demonstrate up to a 70-fold speedup in exploration time.  ( 2 min )
    Evaluation of GlassNet for physics-informed machine learning of glass stability and glass-forming ability
    arXiv:2403.10682v1 Announce Type: cross Abstract: Glasses form the basis of many modern applications and also hold great potential for future medical and environmental applications. However, their structural complexity and large composition space make design and optimization challenging for certain applications. Of particular importance for glass processing is an estimate of a given composition's glass-forming ability (GFA). However, there remain many open questions regarding the physical mechanisms of glass formation, especially in oxide glasses. It is apparent that a proxy for GFA would be highly useful in glass processing and design, but identifying such a surrogate property has proven itself to be difficult. Here, we explore the application of an open-source pre-trained NN model, GlassNet, that can predict the characteristic temperatures necessary to compute glass stability (GS) and assess the feasibility of using these physics-informed ML (PIML)-predicted GS parameters to estimate GFA. In doing so, we track the uncertainties at each step of the computation - from the original ML prediction errors, to the compounding of errors during GS estimation, and finally to the final estimation of GFA. While GlassNet exhibits reasonable accuracy on all individual properties, we observe a large compounding of error in the combination of these individual predictions for the prediction of GS, finding that random forest models offer similar accuracy to GlassNet. We also breakdown the ML performance on different glass families and find that the error in GS prediction is correlated with the error in crystallization peak temperature prediction. Lastly, we utilize this finding to assess the relationship between top-performing GS parameters and GFA for two ternary glass systems: sodium borosilicate and sodium iron phosphate glasses. We conclude that to obtain true ML predictive capability of GFA, significantly more data needs to be collected.  ( 3 min )
    Hessian-Free Laplace in Bayesian Deep Learning
    arXiv:2403.10671v1 Announce Type: cross Abstract: The Laplace approximation (LA) of the Bayesian posterior is a Gaussian distribution centered at the maximum a posteriori estimate. Its appeal in Bayesian deep learning stems from the ability to quantify uncertainty post-hoc (i.e., after standard network parameter optimization), the ease of sampling from the approximate posterior, and the analytic form of model evidence. However, an important computational bottleneck of LA is the necessary step of calculating and inverting the Hessian matrix of the log posterior. The Hessian may be approximated in a variety of ways, with quality varying with a number of factors including the network, dataset, and inference task. In this paper, we propose an alternative framework that sidesteps Hessian calculation and inversion. The Hessian-free Laplace (HFL) approximation uses curvature of both the log posterior and network prediction to estimate its variance. Only two point estimates are needed: the standard maximum a posteriori parameter and the optimal parameter under a loss regularized by the network prediction. We show that, under standard assumptions of LA in Bayesian deep learning, HFL targets the same variance as LA, and can be efficiently amortized in a pre-trained network. Experiments demonstrate comparable performance to that of exact and approximate Hessians, with excellent coverage for in-between uncertainty.  ( 2 min )
    Riemannian Flow Matching Policy for Robot Motion Learning
    arXiv:2403.10672v1 Announce Type: cross Abstract: We introduce Riemannian Flow Matching Policies (RFMP), a novel model for learning and synthesizing robot visuomotor policies. RFMP leverages the efficient training and inference capabilities of flow matching methods. By design, RFMP inherits the strengths of flow matching: the ability to encode high-dimensional multimodal distributions, commonly encountered in robotic tasks, and a very simple and fast inference process. We demonstrate the applicability of RFMP to both state-based and vision-conditioned robot motion policies. Notably, as the robot state resides on a Riemannian manifold, RFMP inherently incorporates geometric awareness, which is crucial for realistic robotic tasks. To evaluate RFMP, we conduct two proof-of-concept experiments, comparing its performance against Diffusion Policies. Although both approaches successfully learn the considered tasks, our results show that RFMP provides smoother action trajectories with significantly lower inference times.  ( 2 min )
    Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data
    arXiv:2403.10663v1 Announce Type: cross Abstract: With the increasing prevalence of Machine Learning as a Service (MLaaS) platforms, there is a growing focus on deep neural network (DNN) watermarking techniques. These methods are used to facilitate the verification of ownership for a target DNN model to protect intellectual property. One of the most widely employed watermarking techniques involves embedding a trigger set into the source model. Unfortunately, existing methodologies based on trigger sets are still susceptible to functionality-stealing attacks, potentially enabling adversaries to steal the functionality of the source model without a reliable means of verifying ownership. In this paper, we first introduce a novel perspective on trigger set-based watermarking methods from a feature learning perspective. Specifically, we demonstrate that by selecting data exhibiting multiple features, also referred to as $\textit{multi-view data}$, it becomes feasible to effectively defend functionality stealing attacks. Based on this perspective, we introduce a novel watermarking technique based on Multi-view dATa, called MAT, for efficiently embedding watermarks within DNNs. This approach involves constructing a trigger set with multi-view data and incorporating a simple feature-based regularization method for training the source model. We validate our method across various benchmarks and demonstrate its efficacy in defending against model extraction attacks, surpassing relevant baselines by a significant margin.  ( 2 min )
    InterLUDE: Interactions between Labeled and Unlabeled Data to Enhance Semi-Supervised Learning
    arXiv:2403.10658v1 Announce Type: cross Abstract: Semi-supervised learning (SSL) seeks to enhance task performance by training on both labeled and unlabeled data. Mainstream SSL image classification methods mostly optimize a loss that additively combines a supervised classification objective with a regularization term derived solely from unlabeled data. This formulation neglects the potential for interaction between labeled and unlabeled images. In this paper, we introduce InterLUDE, a new approach to enhance SSL made of two parts that each benefit from labeled-unlabeled interaction. The first part, embedding fusion, interpolates between labeled and unlabeled embeddings to improve representation learning. The second part is a new loss, grounded in the principle of consistency regularization, that aims to minimize discrepancies in the model's predictions between labeled versus unlabeled inputs. Experiments on standard closed-set SSL benchmarks and a medical SSL task with an uncurated unlabeled set show clear benefits to our approach. On the STL-10 dataset with only 40 labels, InterLUDE achieves 3.2% error rate, while the best previous method reports 14.9%.  ( 2 min )
    LightIt: Illumination Modeling and Control for Diffusion Models
    arXiv:2403.10615v1 Announce Type: cross Abstract: We introduce LightIt, a method for explicit illumination control for image generation. Recent generative methods lack lighting control, which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations, we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading, which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then, we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally, we use our generated dataset to train an identity-preserving relighting model, conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable, consistent lighting and performs on par with specialized relighting state-of-the-art methods.  ( 2 min )
    PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time Adaptation
    arXiv:2403.10650v1 Announce Type: cross Abstract: Real-world vision models in dynamic environments face rapid shifts in domain distributions, leading to decreased recognition performance. Continual test-time adaptation (CTTA) directly adjusts a pre-trained source discriminative model to these changing domains using test data. A highly effective CTTA method involves applying layer-wise adaptive learning rates, and selectively adapting pre-trained layers. However, it suffers from the poor estimation of domain shift and the inaccuracies arising from the pseudo-labels. In this work, we aim to overcome these limitations by identifying layers through the quantification of model prediction uncertainty without relying on pseudo-labels. We utilize the magnitude of gradients as a metric, calculated by backpropagating the KL divergence between the softmax output and a uniform distribution, to select layers for further adaptation. Subsequently, for the parameters exclusively belonging to these selected layers, with the remaining ones frozen, we evaluate their sensitivity in order to approximate the domain shift, followed by adjusting their learning rates accordingly. Overall, this approach leads to a more robust and stable optimization than prior approaches. We conduct extensive image classification experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C and demonstrate the efficacy of our method against standard benchmarks and prior methods.  ( 2 min )
    SurvRNC: Learning Ordered Representations for Survival Prediction using Rank-N-Contrast
    arXiv:2403.10603v1 Announce Type: cross Abstract: Predicting the likelihood of survival is of paramount importance for individuals diagnosed with cancer as it provides invaluable information regarding prognosis at an early stage. This knowledge enables the formulation of effective treatment plans that lead to improved patient outcomes. In the past few years, deep learning models have provided a feasible solution for assessing medical images, electronic health records, and genomic data to estimate cancer risk scores. However, these models often fall short of their potential because they struggle to learn regression-aware feature representations. In this study, we propose Survival Rank-N Contrast (SurvRNC) method, which introduces a loss function as a regularizer to obtain an ordered representation based on the survival times. This function can handle censored data and can be incorporated into any survival model to ensure that the learned representation is ordinal. The model was extensively evaluated on a HEad \& NeCK TumOR (HECKTOR) segmentation and the outcome-prediction task dataset. We demonstrate that using the SurvRNC method for training can achieve higher performance on different deep survival models. Additionally, it outperforms state-of-the-art methods by 3.6% on the concordance index. The code is publicly available on https://github.com/numanai/SurvRNC  ( 2 min )
    Large Language Model-informed ECG Dual Attention Network for Heart Failure Risk Prediction
    arXiv:2403.10581v1 Announce Type: cross Abstract: Heart failure (HF) poses a significant public health challenge due to its rising global mortality rate. Addressing this issue through early diagnosis and prevention could significantly reduce the disease's impact. This work introduces a methodology for HF risk prediction using clinically acquired 12-lead electrocardiograms (ECGs). We present a novel, lightweight dual-attention ECG network designed to capture complex ECG features essential for early HF prediction, despite the notable imbalance between low and high-risk groups. The network features a cross-lead attention module and twelve lead-specific temporal attention modules to capture cross-lead interactions and local temporal dynamics within each lead. To prevent model overfitting from limited training data, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study, focusing on patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI). Our findings show that LLM-informed pretraining significantly improves the network's HF risk prediction capability in these cohorts. Moreover, the dual-attention mechanism enhances interpretability and predictive performance, ensuring a transparent and reliable prediction process. The method outperforms existing models, achieving average C-index scores of 0.6349 and 0.5805 on the UKB-HYP and UKB-MI test sets, respectively. This performance demonstrates our approach's effectiveness in managing complex clinical ECG data and its potential to improve HF risk assessment across various populations.  ( 3 min )
    How Suboptimal is Training rPPG Models with Videos and Targets from Different Body Sites?
    arXiv:2403.10582v1 Announce Type: cross Abstract: Remote camera measurement of the blood volume pulse via photoplethysmography (rPPG) is a compelling technology for scalable, low-cost, and accessible assessment of cardiovascular information. Neural networks currently provide the state-of-the-art for this task and supervised training or fine-tuning is an important step in creating these models. However, most current models are trained on facial videos using contact PPG measurements from the fingertip as targets/ labels. One of the reasons for this is that few public datasets to date have incorporated contact PPG measurements from the face. Yet there is copious evidence that the PPG signals at different sites on the body have very different morphological features. Is training a facial video rPPG model using contact measurements from another site on the body suboptimal? Using a recently released unique dataset with synchronized contact PPG and video measurements from both the hand and face, we can provide precise and quantitative answers to this question. We obtain up to 40 % lower mean squared errors between the waveforms of the predicted and the ground truth PPG signals using state-of-the-art neural models when using PPG signals from the forehead compared to using PPG signals from the fingertip. We also show qualitatively that the neural models learn to predict the morphology of the ground truth PPG signal better when trained on the forehead PPG signals. However, while models trained from the forehead PPG produce a more faithful waveform, models trained from a finger PPG do still learn the dominant frequency (i.e., the heart rate) well.  ( 3 min )
    Generative Modelling of Stochastic Rotating Shallow Water Noise
    arXiv:2403.10578v1 Announce Type: cross Abstract: In recent work, the authors have developed a generic methodology for calibrating the noise in fluid dynamics stochastic partial differential equations where the stochasticity was introduced to parametrize subgrid-scale processes. The stochastic parameterization of sub-grid scale processes is required in the estimation of uncertainty in weather and climate predictions, to represent systematic model errors arising from subgrid-scale fluctuations. The previous methodology used a principal component analysis (PCA) technique based on the ansatz that the increments of the stochastic parametrization are normally distributed. In this paper, the PCA technique is replaced by a generative model technique. This enables us to avoid imposing additional constraints on the increments. The methodology is tested on a stochastic rotating shallow water model with the elevation variable of the model used as input data. The numerical simulations show that the noise is indeed non-Gaussian. The generative modelling technology gives good RMSE, CRPS score and forecast rank histogram results.  ( 2 min )
    Ignore Me But Don't Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain
    arXiv:2403.10576v1 Announce Type: cross Abstract: Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging. For such text domains that involve high levels of expertise, pretraining on in-domain corpora has been a popular method for language models to obtain domain expertise. However, cybersecurity texts often contain non-linguistic elements (such as URLs and hash values) that could be unsuitable with the established pretraining methodologies. Previous work in other domains have removed or filtered such text as noise, but the effectiveness of these methods have not been investigated, especially in the cybersecurity domain. We propose different pretraining methodologies and evaluate their effectiveness through downstream tasks and probing tasks. Our proposed strategy (selective MLM and jointly training NLE token classification) outperforms the commonly taken approach of replacing non-linguistic elements (NLEs). We use our domain-customized methodology to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity PLMs on most tasks.  ( 2 min )
    JaxDecompiler: Redefining Gradient-Informed Software Design
    arXiv:2403.10571v1 Announce Type: cross Abstract: Among numerical libraries capable of computing gradient descent optimization, JAX stands out by offering more features, accelerated by an intermediate representation known as Jaxpr language. However, editing the Jaxpr code is not directly possible. This article introduces JaxDecompiler, a tool that transforms any JAX function into an editable Python code, especially useful for editing the JAX function generated by the gradient function. JaxDecompiler simplifies the processes of reverse engineering, understanding, customizing, and interoperability of software developed by JAX. We highlight its capabilities, emphasize its practical applications especially in deep learning and more generally gradient-informed software, and demonstrate that the decompiled code speed performance is similar to the original.  ( 2 min )
    Medical Unlearnable Examples: Securing Medical Data from Unauthorized Traning via Sparsity-Aware Local Masking
    arXiv:2403.10573v1 Announce Type: cross Abstract: With the rapid growth of artificial intelligence (AI) in healthcare, there has been a significant increase in the generation and storage of sensitive medical data. This abundance of data, in turn, has propelled the advancement of medical AI technologies. However, concerns about unauthorized data exploitation, such as training commercial AI models, often deter researchers from making their invaluable datasets publicly available. In response to the need to protect this hard-to-collect data while still encouraging medical institutions to share it, one promising solution is to introduce imperceptible noise into the data. This method aims to safeguard the data against unauthorized training by inducing degradation in model generalization. Although existing methods have shown commendable data protection capabilities in general domains, they tend to fall short when applied to biomedical data, mainly due to their failure to account for the sparse nature of medical images. To address this problem, we propose the Sparsity-Aware Local Masking (SALM) method, a novel approach that selectively perturbs significant pixel regions rather than the entire image as previous strategies have done. This simple-yet-effective approach significantly reduces the perturbation search space by concentrating on local regions, thereby improving both the efficiency and effectiveness of data protection for biomedical datasets characterized by sparse features. Besides, we have demonstrated that SALM maintains the essential characteristics of the data, ensuring its clinical utility remains uncompromised. Our extensive experiments across various datasets and model architectures demonstrate that SALM effectively prevents unauthorized training of deep-learning models and outperforms previous state-of-the-art data protection methods.  ( 3 min )
    Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial Attacks
    arXiv:2403.10562v1 Announce Type: cross Abstract: Our paper presents a novel defence against black box attacks, where attackers use the victim model as an oracle to craft their adversarial examples. Unlike traditional preprocessing defences that rely on sanitizing input samples, our stateless strategy counters the attack process itself. For every query we evaluate a counter-sample instead, where the counter-sample is the original sample optimized against the attacker's objective. By countering every black box query with a targeted white box optimization, our strategy effectively introduces an asymmetry to the game to the defender's advantage. This defence not only effectively misleads the attacker's search for an adversarial example, it also preserves the model's accuracy on legitimate inputs and is generic to multiple types of attacks. We demonstrate that our approach is remarkably effective against state-of-the-art black box attacks and outperforms existing defences for both the CIFAR-10 and ImageNet datasets. Additionally, we also show that the proposed defence is robust against strong adversaries as well.  ( 2 min )
    Adaptive Hybrid Masking Strategy for Privacy-Preserving Face Recognition Against Model Inversion Attack
    arXiv:2403.10558v1 Announce Type: cross Abstract: The utilization of personal sensitive data in training face recognition (FR) models poses significant privacy concerns, as adversaries can employ model inversion attacks (MIA) to infer the original training data. Existing defense methods, such as data augmentation and differential privacy, have been employed to mitigate this issue. However, these methods often fail to strike an optimal balance between privacy and accuracy. To address this limitation, this paper introduces an adaptive hybrid masking algorithm against MIA. Specifically, face images are masked in the frequency domain using an adaptive MixUp strategy. Unlike the traditional MixUp algorithm, which is predominantly used for data augmentation, our modified approach incorporates frequency domain mixing. Previous studies have shown that increasing the number of images mixed in MixUp can enhance privacy preservation but at the expense of reduced face recognition accuracy. To overcome this trade-off, we develop an enhanced adaptive MixUp strategy based on reinforcement learning, which enables us to mix a larger number of images while maintaining satisfactory recognition accuracy. To optimize privacy protection, we propose maximizing the reward function (i.e., the loss function of the FR system) during the training of the strategy network. While the loss function of the FR network is minimized in the phase of training the FR network. The strategy network and the face recognition network can be viewed as antagonistic entities in the training process, ultimately reaching a more balanced trade-off. Experimental results demonstrate that our proposed hybrid masking scheme outperforms existing defense algorithms in terms of privacy preservation and recognition accuracy against MIA.  ( 3 min )
    Robust Second-Order Nonconvex Optimization and Its Application to Low Rank Matrix Sensing
    arXiv:2403.10547v1 Announce Type: cross Abstract: Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings. In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with \emph{dimension-independent} accuracy guarantees, using $\widetilde{O}({D^2}/{\epsilon})$ samples where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints. As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.  ( 2 min )
    On-Device Domain Learning for Keyword Spotting on Low-Power Extreme Edge Embedded Systems
    arXiv:2403.10549v1 Announce Type: cross Abstract: Keyword spotting accuracy degrades when neural networks are exposed to noisy environments. On-site adaptation to previously unseen noise is crucial to recovering accuracy loss, and on-device learning is required to ensure that the adaptation process happens entirely on the edge device. In this work, we propose a fully on-device domain adaptation system achieving up to 14% accuracy gains over already-robust keyword spotting models. We enable on-device learning with less than 10 kB of memory, using only 100 labeled utterances to recover 5% accuracy after adapting to the complex speech noise. We demonstrate that domain adaptation can be achieved on ultra-low-power microcontrollers with as little as 806 mJ in only 14 s on always-on, battery-operated devices.  ( 2 min )
    Distinguishing Neighborhood Representations Through Reverse Process of GNNs for Heterophilic Graphs
    arXiv:2403.10543v1 Announce Type: cross Abstract: Graph Neural Network (GNN) resembles the diffusion process, leading to the over-smoothing of learned representations when stacking many layers. Hence, the reverse process of message passing can sharpen the node representations by inverting the forward message propagation. The sharpened representations can help us to better distinguish neighboring nodes with different labels, such as in heterophilic graphs. In this work, we apply the design principle of the reverse process to the three variants of the GNNs. Through the experiments on heterophilic graph data, where adjacent nodes need to have different representations for successful classification, we show that the reverse process significantly improves the prediction performance in many cases. Additional analysis reveals that the reverse mechanism can mitigate the over-smoothing over hundreds of layers.  ( 2 min )
    MATADOR: Automated System-on-Chip Tsetlin Machine Design Generation for Edge Applications
    arXiv:2403.10538v1 Announce Type: cross Abstract: System-on-Chip Field-Programmable Gate Arrays (SoC-FPGAs) offer significant throughput gains for machine learning (ML) edge inference applications via the design of co-processor accelerator systems. However, the design effort for training and translating ML models into SoC-FPGA solutions can be substantial and requires specialist knowledge aware trade-offs between model performance, power consumption, latency and resource utilization. Contrary to other ML algorithms, Tsetlin Machine (TM) performs classification by forming logic proposition between boolean actions from the Tsetlin Automata (the learning elements) and boolean input features. A trained TM model, usually, exhibits high sparsity and considerable overlapping of these logic propositions both within and among the classes. The model, thus, can be translated to RTL-level design using a miniscule number of AND and NOT gates. This paper presents MATADOR, an automated boolean-to-silicon tool with GUI interface capable of implementing optimized accelerator design of the TM model onto SoC-FPGA for inference at the edge. It offers automation of the full development pipeline: model training, system level design generation, design verification and deployment. It makes use of the logic sharing that ensues from propositional overlap and creates a compact design by effectively utilizing the TM model's sparsity. MATADOR accelerator designs are shown to be up to 13.4x faster, up to 7x more resource frugal and up to 2x more power efficient when compared to the state-of-the-art Quantized and Binary Deep Neural Network implementations.  ( 2 min )
    ROUTERBENCH: A Benchmark for Multi-LLM Routing System
    arXiv:2403.12031v1 Announce Type: new Abstract: As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present ROUTERBENCH, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through ROUTERBENCH, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.  ( 2 min )
    Robust Alzheimer's Progression Modeling using Cross-Domain Self-Supervised Deep Learning
    arXiv:2211.08559v2 Announce Type: cross Abstract: Developing successful artificial intelligence systems in practice depends on both robust deep learning models and large, high-quality data. However, acquiring and labeling data can be prohibitively expensive and time-consuming in many real-world applications, such as clinical disease models. Self-supervised learning has demonstrated great potential in increasing model accuracy and robustness in small data regimes. In addition, many clinical imaging and disease modeling applications rely heavily on regression of continuous quantities. However, the applicability of self-supervised learning for these medical-imaging regression tasks has not been extensively studied. In this study, we develop a cross-domain self-supervised learning approach for disease prognostic modeling as a regression problem using medical images as input. We demonstrate that self-supervised pretraining can improve the prediction of Alzheimer's Disease progression from brain MRI. We also show that pretraining on extended (but not labeled) brain MRI data outperforms pretraining on natural images. We further observe that the highest performance is achieved when both natural images and extended brain-MRI data are used for pretraining.  ( 2 min )
    Supervised Fine-Tuning as Inverse Reinforcement Learning
    arXiv:2403.12017v1 Announce Type: new Abstract: The prevailing approach to aligning Large Language Models (LLMs) typically relies on human or AI feedback and assumes access to specific types of preference datasets. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic. We build a sequential decision-making framework to formulate the problem of aligning LLMs using demonstration datasets. Drawing insights from inverse reinforcement learning and imitation learning, we introduce various approaches for divergence minimization in the LLM alignment tasks. Our analysis highlights the mass-covering and mode-seeking behaviors of these different approaches. Inclusively, we examine the pros and cons of the classical supervised fine-tuning method, elaborating on scenarios where different methods shine.  ( 2 min )
    Learning Useful Representations of Recurrent Neural Network Weight Matrices
    arXiv:2403.11998v1 Announce Type: new Abstract: Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality -- specifically, its input-output mapping. We consider several mechanistic approaches for RNN weights and adapt the permutation equivariant Deep Weight Space layer for RNNs. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. We develop a theoretical framework that demonstrates conditions under which the functionalist approach can generate rich representations that help determine RNN behavior. We create and release the first two 'model zoo' datasets for RNN weight representation learning. One consists of generative models of a class of formal languages, and the other one of classifiers of sequentially processed MNIST digits. With the help of an emulation-based self-supervised learning technique we compare and evaluate the different RNN weight encoding techniques on multiple downstream applications. On the most challenging one, namely predicting which exact task the RNN was trained on, functionalist approaches show clear superiority.  ( 2 min )
    Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory
    arXiv:2403.11968v1 Announce Type: new Abstract: Conditional diffusion models serve as the foundation of modern image synthesis and find extensive application in fields like computational biology and reinforcement learning. In these applications, conditional diffusion models incorporate various conditional information, such as prompt input, to guide the sample generation towards desired properties. Despite the empirical success, theory of conditional diffusion models is largely missing. This paper bridges this gap by presenting a sharp statistical theory of distribution estimation using conditional diffusion models. Our analysis yields a sample complexity bound that adapts to the smoothness of the data distribution and matches the minimax lower bound. The key to our theoretical development lies in an approximation result for the conditional score function, which relies on a novel diffused Taylor approximation technique. Moreover, we demonstrate the utility of our statistical theory in elucidating the performance of conditional diffusion models across diverse applications, including model-based transition kernel estimation in reinforcement learning, solving inverse problems, and reward conditioned sample generation.  ( 2 min )
    Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning
    arXiv:2403.11996v1 Announce Type: new Abstract: Using generative Artificial Intelligence (AI), we transformed a set of 1,000 scientific papers in the area of biological materials into detailed ontological knowledge graphs, revealing their inherently scale-free nature. Using graph traversal path detection between dissimilar concepts based on combinatorial ranking of node similarity and betweenness centrality, we reveal deep insights into unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, and propose never-before-seen material designs and their behaviors. One comparison revealed detailed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. The algorithm further created an innovative hierarchical mycelium-based composite that incorporates joint synthesis of graph sampling with principles extracted from Kandinsky's Composition VII painting, where the resulting composite reflects a balance of chaos and order, with features like adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across physical, biological, and artistic spheres, revealing a nuanced ontology of immanence and material flux that resonates with postmodern philosophy, and positions these interconnections within a heterarchical framework. Our findings reveal the dynamic, context-dependent interplay of entities beyond traditional hierarchical paradigms, emphasizing the significant role of individual components and their fluctuative relationships within the system. Our predictions achieve a far higher degree of novelty, technical detail and explorative capacity than conventional generative AI methods. The approach establishes a widely useful framework for innovation by revealing hidden connections that facilitate discovery.  ( 3 min )
    Informed Spectral Normalized Gaussian Processes for Trajectory Prediction
    arXiv:2403.11966v1 Announce Type: new Abstract: Prior parameter distributions provide an elegant way to represent prior expert and world knowledge for informed learning. Previous work has shown that using such informative priors to regularize probabilistic deep learning (DL) models increases their performance and data-efficiency. However, commonly used sampling-based approximations for probabilistic DL models can be computationally expensive, requiring multiple inference passes and longer training times. Promising alternatives are compute-efficient last layer kernel approximations like spectral normalized Gaussian processes (SNGPs). We propose a novel regularization-based continual learning method for SNGPs, which enables the use of informative priors that represent prior knowledge learned from previous tasks. Our proposal builds upon well-established methods and requires no rehearsal memory or parameter expansion. We apply our informed SNGP model to the trajectory prediction problem in autonomous driving by integrating prior drivability knowledge. On two public datasets, we investigate its performance under diminishing training data and across locations, and thereby demonstrate an increase in data-efficiency and robustness to location-transfers over non-informed and informed baselines.  ( 2 min )
    Probabilistic Calibration by Design for Neural Network Regression
    arXiv:2403.11964v1 Announce Type: new Abstract: Generating calibrated and sharp neural network predictive distributions for regression problems is essential for optimal decision-making in many real-world applications. To address the miscalibration issue of neural networks, various methods have been proposed to improve calibration, including post-hoc methods that adjust predictions after training and regularization methods that act during training. While post-hoc methods have shown better improvement in calibration compared to regularization methods, the post-hoc step is completely independent of model training. We introduce a novel end-to-end model training procedure called Quantile Recalibration Training, integrating post-hoc calibration directly into the training process without additional parameters. We also present a unified algorithm that includes our method and other post-hoc and regularization methods, as particular cases. We demonstrate the performance of our method in a large-scale experiment involving 57 tabular regression datasets, showcasing improved predictive accuracy while maintaining calibration. We also conduct an ablation study to evaluate the significance of different components within our proposed method, as well as an in-depth analysis of the impact of the base model and different hyperparameters on predictive accuracy.  ( 2 min )
    CASPER: Causality-Aware Spatiotemporal Graph Neural Networks for Spatiotemporal Time Series Imputation
    arXiv:2403.11960v1 Announce Type: new Abstract: Spatiotemporal time series is the foundation of understanding human activities and their impacts, which is usually collected via monitoring sensors placed at different locations. The collected data usually contains missing values due to various failures, which have significant impact on data analysis. To impute the missing values, a lot of methods have been introduced. When recovering a specific data point, most existing methods tend to take into consideration all the information relevant to that point regardless of whether they have a cause-and-effect relationship. During data collection, it is inevitable that some unknown confounders are included, e.g., background noise in time series and non-causal shortcut edges in the constructed sensor network. These confounders could open backdoor paths between the input and output, in other words, they establish non-causal correlations between the input and output. Over-exploiting these non-causal correlations could result in overfitting and make the model vulnerable to noises. In this paper, we first revisit spatiotemporal time series imputation from a causal perspective, which shows the causal relationships among the input, output, embeddings and confounders. Next, we show how to block the confounders via the frontdoor adjustment. Based on the results of the frontdoor adjustment, we introduce a novel Causality-Aware SPatiotEmpoRal graph neural network (CASPER), which contains a novel Spatiotemporal Causal Attention (SCA) and a Prompt Based Decoder (PBD). PBD could reduce the impact of confounders and SCA could discover the sparse causal relationships among embeddings. Theoretical analysis reveals that SCA discovers causal relationships based on the values of gradients. We evaluate Casper on three real-world datasets, and the experimental results show that Casper outperforms the baselines and effectively discovers causal relationships.  ( 3 min )
    Transfer Learning Beyond Bounded Density Ratios
    arXiv:2403.11963v1 Announce Type: new Abstract: We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$. A standard change of measure argument implies that transfer learning happens when the density ratio $dQ/dP$ is bounded. Yet, prior thought-provoking works by Kpotufe and Martinet (COLT, 2018) and Hanneke and Kpotufe (NeurIPS, 2019) demonstrate cases where the ratio $dQ/dP$ is unbounded, but transfer learning is possible. In this work, we focus on transfer learning over the class of low-degree polynomial estimators. Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions, going well beyond the classical assumption that $dQ/dP$ is bounded. For instance, it always applies if $Q$ is a log-concave measure and the inverse ratio $dP/dQ$ is bounded. To demonstrate the applicability of our inequality, we obtain new results in the settings of: (1) the classical truncated regression setting, where $dQ/dP$ equals infinity, and (2) the more recent out-of-distribution generalization setting for in-context learning linear functions with transformers. We also provide a discrete analogue of our transfer inequality on the Boolean Hypercube $\{-1,1\}^n$, and study its connections with the recent problem of Generalization on the Unseen of Abbe, Bengio, Lotfi and Rizk (ICML, 2023). Our main conceptual contribution is that the maximum influence of the error of the estimator $\widehat{f}-f^*$ under $Q$, $\mathrm{I}_{\max}(\widehat{f}-f^*)$, acts as a sufficient condition for transferability; when $\mathrm{I}_{\max}(\widehat{f}-f^*)$ is appropriately bounded, transfer is possible over the Boolean domain.  ( 3 min )
    Multistep Inverse Is Not All You Need
    arXiv:2403.11940v1 Announce Type: new Abstract: In real-world control settings, the observation space is often unnecessarily high-dimensional and subject to time-correlated noise. However, the controllable dynamics of the system are often far simpler than the dynamics of the raw observations. It is therefore desirable to learn an encoder to map the observation space to a simpler space of control-relevant variables. In this work, we consider the Ex-BMDP model, first proposed by Efroni et al. (2022), which formalizes control problems where observations can be factorized into an action-dependent latent state which evolves deterministically, and action-independent time-correlated noise. Lamb et al. (2022) proposes the "AC-State" method for learning an encoder to extract a complete action-dependent latent state representation from the observations in such problems. AC-State is a multistep-inverse method, in that it uses the encoding of the the first and last state in a path to predict the first action in the path. However, we identify cases where AC-State will fail to learn a correct latent representation of the agent-controllable factor of the state. We therefore propose a new algorithm, ACDF, which combines multistep-inverse prediction with a latent forward model. ACDF is guaranteed to correctly infer an action-dependent latent state encoder for a large class of Ex-BMDP models. We demonstrate the effectiveness of ACDF on tabular Ex-BMDPs through numerical simulations; as well as high-dimensional environments using neural-network-based encoders. Code is available at https://github.com/midi-lab/acdf.  ( 2 min )
    Global Optimality without Mixing Time Oracles in Average-reward RL via Multi-level Actor-Critic
    arXiv:2403.11925v1 Announce Type: new Abstract: In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution-poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for average-reward MDPs global convergence. Furthermore, our approach exhibits the tightest-available dependence of $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$ relative to prior work. With a 2D gridworld goal-reaching navigation experiment, we demonstrate that MAC achieves higher reward than a previous PG-based method for average reward, Parameterized Policy Gradient with Advantage Estimation (PPGAE), especially in cases with relatively small training sample budget restricting trajectory length.  ( 2 min )
    Larimar: Large Language Models with Episodic Memory Control
    arXiv:2403.11901v1 Announce Type: new Abstract: Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 4-10x depending on the base LLM - as well as flexibility due to the proposed architecture being simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting and input context length generalization with Larimar and show their effectiveness.  ( 2 min )
    Single-Agent Actor Critic for Decentralized Cooperative Driving
    arXiv:2403.11914v1 Announce Type: new Abstract: Active traffic management incorporating autonomous vehicles (AVs) promises a future with diminished congestion and enhanced traffic flow. However, developing algorithms for real-world application requires addressing the challenges posed by continuous traffic flow and partial observability. To bridge this gap and advance the field of active traffic management towards greater decentralization, we introduce a novel asymmetric actor-critic model aimed at learning decentralized cooperative driving policies for autonomous vehicles using single-agent reinforcement learning. Our approach employs attention neural networks with masking to handle the dynamic nature of real-world traffic flow and partial observability. Through extensive evaluations against baseline controllers across various traffic scenarios, our model shows great potential for improving traffic flow at diverse bottleneck locations within the road system. Additionally, we explore the challenge associated with the conservative driving behaviors of autonomous vehicles that adhere strictly to traffic regulations. The experiment results illustrate that our proposed cooperative policy can mitigate potential traffic slowdowns without compromising safety.  ( 2 min )
    KnFu: Effective Knowledge Fusion
    arXiv:2403.11892v1 Announce Type: new Abstract: Federated Learning (FL) has emerged as a prominent alternative to the traditional centralized learning approach. Generally speaking, FL is a decentralized approach that allows for collaborative training of Machine Learning (ML) models across multiple local nodes, ensuring data privacy and security while leveraging diverse datasets. Conventional FL, however, is susceptible to gradient inversion attacks, restrictively enforces a uniform architecture on local models, and suffers from model heterogeneity (model drift) due to non-IID local datasets. To mitigate some of these challenges, the new paradigm of Federated Knowledge Distillation (FKD) has emerged. FDK is developed based on the concept of Knowledge Distillation (KD), which involves extraction and transfer of a large and well-trained teacher model's knowledge to lightweight student models. FKD, however, still faces the model drift issue. Intuitively speaking, not all knowledge is universally beneficial due to the inherent diversity of data among local nodes. This calls for innovative mechanisms to evaluate the relevance and effectiveness of each client's knowledge for others, to prevent propagation of adverse knowledge. In this context, the paper proposes Effective Knowledge Fusion (KnFu) algorithm that evaluates knowledge of local models to only fuse semantic neighbors' effective knowledge for each client. The KnFu is a personalized effective knowledge fusion scheme for each client, that analyzes effectiveness of different local models' knowledge prior to the aggregation phase. Comprehensive experiments were performed on MNIST and CIFAR10 datasets illustrating effectiveness of the proposed KnFu in comparison to its state-of-the-art counterparts. A key conclusion of the work is that in scenarios with large and highly heterogeneous local datasets, local training could be preferable to knowledge fusion-based solutions.  ( 3 min )
    Efficient Training of Learning-Based Thermal Power Flow for 4th Generation District Heating Grids
    arXiv:2403.11877v1 Announce Type: new Abstract: Thermal power flow (TPF) is an important task for various control purposes in 4 Th generation district heating grids with multiple decentral heat sources and meshed grid structures. Computing the TPF, i.e., determining the grid state consisting of temperatures, pressures, and mass flows for given supply and demand values, is classically done by solving the nonlinear heat grid equations, but can be sped up by orders of magnitude using learned models such as neural networks. We propose a novel, efficient scheme to generate a sufficiently large training data set covering relevant supply and demand values. Instead of sampling supply and demand values, our approach generates training examples from a proxy distribution over generator and consumer mass flows, omitting the iterations needed for solving the heat grid equations. The exact, but slightly different, training examples can be weighted to represent the original training distribution. We show with simulations for typical grid structures that the new approach can reduce training set generation times by two orders of magnitude compared to sampling supply and demand values directly, without loss of relevance for the training samples. Moreover, learning TPF with a training data set is shown to outperform sample-free, physics-aware training approaches significantly.  ( 3 min )
    Near-Optimal Solutions of Constrained Learning Problems
    arXiv:2403.11844v1 Announce Type: new Abstract: With the widespread adoption of machine learning systems, the need to curtail their behavior has become increasingly apparent. This is evidenced by recent advancements towards developing models that satisfy robustness, safety, and fairness requirements. These requirements can be imposed (with generalization guarantees) by formulating constrained learning problems that can then be tackled by dual ascent algorithms. Yet, though these algorithms converge in objective value, even in non-convex settings, they cannot guarantee that their outcome is feasible. Doing so requires randomizing over all iterates, which is impractical in virtually any modern applications. Still, final iterates have been observed to perform well in practice. In this work, we address this gap between theory and practice by characterizing the constraint violation of Lagrangian minimizers associated with optimal dual variables, despite lack of convexity. To do this, we leverage the fact that non-convex, finite-dimensional constrained learning problems can be seen as parametrizations of convex, functional problems. Our results show that rich parametrizations effectively mitigate the issue of feasibility in dual methods, shedding light on prior empirical successes of dual learning. We illustrate our findings in fair learning tasks.  ( 2 min )
    Complete and Efficient Graph Transformers for Crystal Material Property Prediction
    arXiv:2403.11857v1 Announce Type: new Abstract: Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, a SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants; namely, iComFormer that employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer that utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).  ( 2 min )
    Fuzzy Rough Choquet Distances for Classification
    arXiv:2403.11843v1 Announce Type: new Abstract: This paper introduces a novel Choquet distance using fuzzy rough set based measures. The proposed distance measure combines the attribute information received from fuzzy rough set theory with the flexibility of the Choquet integral. This approach is designed to adeptly capture non-linear relationships within the data, acknowledging the interplay of the conditional attributes towards the decision attribute and resulting in a more flexible and accurate distance. We explore its application in the context of machine learning, with a specific emphasis on distance-based classification approaches (e.g. k-nearest neighbours). The paper examines two fuzzy rough set based measures that are based on the positive region. Moreover, we explore two procedures for monotonizing the measures derived from fuzzy rough set theory, making them suitable for use with the Choquet integral, and investigate their differences.  ( 2 min )
    Multi-Criteria Comparison as a Method of Advancing Knowledge-Guided Machine Learning
    arXiv:2403.11840v1 Announce Type: new Abstract: This paper describes a generalizable model evaluation method that can be adapted to evaluate AI/ML models across multiple criteria including core scientific principles and more practical outcomes. Emerging from prediction competitions in Psychology and Decision Science, the method evaluates a group of candidate models of varying type and structure across multiple scientific, theoretic, and practical criteria. Ordinal ranking of criteria scores are evaluated using voting rules from the field of computational social choice and allow the comparison of divergent measures and types of models in a holistic evaluation. Additional advantages and applications are discussed.  ( 2 min )
    A tutorial on learning from preferences and choices with Gaussian Processes
    arXiv:2403.11782v1 Announce Type: new Abstract: Preference modelling lies at the intersection of economics, decision theory, machine learning and statistics. By understanding individuals' preferences and how they make choices, we can build products that closely match their expectations, paving the way for more efficient and personalised applications across a wide range of domains. The objective of this tutorial is to present a cohesive and comprehensive framework for preference learning with Gaussian Processes (GPs), demonstrating how to seamlessly incorporate rationality principles (from economics and decision theory) into the learning process. By suitably tailoring the likelihood function, this framework enables the construction of preference learning models that encompass random utility models, limits of discernment, and scenarios with multiple conflicting utilities for both object- and label-preference. This tutorial builds upon established research while simultaneously introducing some novel GP-based models to address specific gaps in the existing literature.  ( 2 min )
    Low-Cost Privacy-Aware Decentralized Learning
    arXiv:2403.11795v1 Announce Type: new Abstract: This paper introduces ZIP-DL, a novel privacy-aware decentralized learning (DL) algorithm that relies on adding correlated noise to each model update during the model training process. This technique ensures that the added noise almost neutralizes itself during the aggregation process due to its correlation, thus minimizing the impact on model accuracy. In addition, ZIP-DL does not require multiple communication rounds for noise cancellation, addressing the common trade-off between privacy protection and communication overhead. We provide theoretical guarantees for both convergence speed and privacy guarantees, thereby making ZIP-DL applicable to practical scenarios. Our extensive experimental study shows that ZIP-DL achieves the best trade-off between vulnerability and accuracy. In particular, ZIP-DL (i) reduces the effectiveness of a linkability attack by up to 52 points compared to baseline DL, and (ii) achieves up to 37 more accuracy points for the same vulnerability under membership inference attacks against a privacy-preserving competitor  ( 2 min )
    S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention
    arXiv:2403.11772v1 Announce Type: new Abstract: Motivated by the challenge of seamless cross-dataset transfer in EEG signal processing, this article presents an exploratory study on the use of Joint Embedding Predictive Architectures (JEPAs). In recent years, self-supervised learning has emerged as a promising approach for transfer learning in various domains. However, its application to EEG signals remains largely unexplored. In this article, we introduce Signal-JEPA for representing EEG recordings which includes a novel domain-specific spatial block masking strategy and three novel architectures for downstream classification. The study is conducted on a 54~subjects dataset and the downstream performance of the models is evaluated on three different BCI paradigms: motor imagery, ERP and SSVEP. Our study provides preliminary evidence for the potential of JEPAs in EEG signal encoding. Notably, our results highlight the importance of spatial filtering for accurate downstream classification and reveal an influence of the length of the pre-training examples but not of the mask size on the downstream performance.  ( 2 min )
    PARMESAN: Parameter-Free Memory Search and Transduction for Dense Prediction Tasks
    arXiv:2403.11743v1 Announce Type: new Abstract: In this work we address flexibility in deep learning by means of transductive reasoning. For adaptation to new tasks or new data, existing methods typically involve tuning of learnable parameters or even complete re-training from scratch, rendering such approaches unflexible in practice. We argue that the notion of separating computation from memory by the means of transduction can act as a stepping stone for solving these issues. We therefore propose PARMESAN (parameter-free memory search and transduction), a scalable transduction method which leverages a memory module for solving dense prediction tasks. At inference, hidden representations in memory are being searched to find corresponding examples. In contrast to other methods, PARMESAN learns without the requirement for any continuous training or fine-tuning of learnable parameters simply by modifying the memory content. Our method is compatible with commonly used neural architectures and canonically transfers to 1D, 2D, and 3D grid-based data. We demonstrate the capabilities of our approach at complex tasks such as continual and few-shot learning. PARMESAN learns up to 370 times faster than common baselines while being on par in terms of predictive performance, knowledge retention, and data-efficiency.  ( 2 min )
    Time Series Compression using Quaternion Valued Neural Networks and Quaternion Backpropagation
    arXiv:2403.11722v1 Announce Type: new Abstract: We propose a novel quaternionic time-series compression methodology where we divide a long time-series into segments of data, extract the min, max, mean and standard deviation of these chunks as representative features and encapsulate them in a quaternion, yielding a quaternion valued time-series. This time-series is processed using quaternion valued neural network layers, where we aim to preserve the relation between these features through the usage of the Hamilton product. To train this quaternion neural network, we derive quaternion backpropagation employing the GHR calculus, which is required for a valid product and chain rule in quaternion space. Furthermore, we investigate the connection between the derived update rules and automatic differentiation. We apply our proposed compression method on the Tennessee Eastman Dataset, where we perform fault classification using the compressed data in two settings: a fully supervised one and in a semi supervised, contrastive learning setting. Both times, we were able to outperform real valued counterparts as well as two baseline models: one with the uncompressed time-series as the input and the other with a regular downsampling using the mean. Further, we could improve the classification benchmark set by SimCLR-TS from 81.43% to 83.90%.  ( 2 min )
    PITA: Physics-Informed Trajectory Autoencoder
    arXiv:2403.11728v1 Announce Type: new Abstract: Validating robotic systems in safety-critical appli-cations requires testing in many scenarios including rare edgecases that are unlikely to occur, requiring to complement real-world testing with testing in simulation. Generative models canbe used to augment real-world datasets with generated data toproduce edge case scenarios by sampling in a learned latentspace. Autoencoders can learn said latent representation for aspecific domain by learning to reconstruct the input data froma lower-dimensional intermediate representation. However, theresulting trajectories are not necessarily physically plausible, butinstead typically contain noise that is not present in the inputtrajectory. To resolve this issue, we propose the novel Physics-Informed Trajectory Autoencoder (PITA) architecture, whichincorporates a physical dynamics model into the loss functionof the autoencoder. This results in smooth trajectories that notonly reconstruct the input trajectory but also adhere to thephysical model. We evaluate PITA on a real-world dataset ofvehicle trajectories and compare its performance to a normalautoencoder and a state-of-the-art action-space autoencoder.  ( 2 min )
    Generalization error of spectral algorithms
    arXiv:2403.11696v1 Announce Type: new Abstract: The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of $\textit{spectral algorithms}$ specified by profile $h(\lambda)$, and including KRR and GD as special cases. Then, we derive the generalization error as a functional of learning profile $h(\lambda)$ for two data models: high-dimensional Gaussian and low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in case of noisy observation.  ( 2 min )
    Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding
    arXiv:2403.11686v1 Announce Type: new Abstract: Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable formulation, interpreted as neural potential summation, that performs infinite interatomic potential summations in a deeply learned feature space. We then propose a simple yet effective Transformer-based encoder architecture for crystal structures called Crystalformer. Compared to an existing Transformer-based model, the proposed model requires only 29.4% of the number of parameters, with minimal modifications to the original Transformer architecture. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets.  ( 2 min )
    Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines
    arXiv:2403.11585v1 Announce Type: new Abstract: In the ever-evolving landscape of machine learning, seamless translation of natural language descriptions into executable code remains a formidable challenge. This paper introduces Linguacodus, an innovative framework designed to tackle this challenge by deploying a dynamic pipeline that iteratively transforms natural language task descriptions into code through high-level data-shaping instructions. The core of Linguacodus is a fine-tuned large language model (LLM), empowered to evaluate diverse solutions for various problems and select the most fitting one for a given task. This paper details the fine-tuning process, and sheds light on how natural language descriptions can be translated into functional code. Linguacodus represents a substantial leap towards automated code generation, effectively bridging the gap between task descriptions and executable code. It holds great promise for advancing machine learning applications across diverse domains. Additionally, we propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction. In extensive experiments on a vast machine learning code dataset originating from Kaggle, we showcase the effectiveness of Linguacodus. The investigations highlight its potential applications across diverse domains, emphasizing its impact on applied machine learning in various scientific fields.  ( 2 min )
    The Value of Reward Lookahead in Reinforcement Learning
    arXiv:2403.11637v1 Announce Type: new Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.  ( 2 min )
    Offline Multitask Representation Learning for Reinforcement Learning
    arXiv:2403.11574v1 Announce Type: new Abstract: We study offline multitask representation learning in reinforcement learning (RL), where a learner is provided with an offline dataset from different tasks that share a common representation and is asked to learn the shared representation. We theoretically investigate offline multitask low-rank RL, and propose a new algorithm called MORL for offline multitask representation learning. Furthermore, we examine downstream RL in reward-free, offline and online scenarios, where a new task is introduced to the agent that shares the same representation as the upstream offline tasks. Our theoretical results demonstrate the benefits of using the learned representation from the upstream offline task instead of directly learning the representation of the low-rank model.  ( 2 min )
    RL en Markov Games with Independent Function Approximation: Improved Sample Complexity Bound under the Local Access Model
    arXiv:2403.11544v1 Announce Type: new Abstract: Efficiently learning equilibria with large state and action spaces in general-sum Markov games while overcoming the curse of multi-agency is a challenging problem. Recent works have attempted to solve this problem by employing independent linear function classes to approximate the marginal $Q$-value for each agent. However, existing sample complexity bounds under such a framework have a suboptimal dependency on the desired accuracy $\varepsilon$ or the action space. In this work, we introduce a new algorithm, Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local access to the simulator, i.e., one can interact with the underlying environment on the visited states. Up to a logarithmic dependence on the size of the state space, Lin-Confident-FTRL learns $\epsilon$-CCE with a provable optimal accuracy bound $O(\epsilon^{-2})$ and gets rids of the linear dependency on the action space, while scaling polynomially with relevant problem parameters (such as the number of agents and time horizon). Moreover, our analysis of Linear-Confident-FTRL generalizes the virtual policy iteration technique in the single-agent local planning literature, which yields a new computationally efficient algorithm with a tighter sample complexity bound when assuming random access to the simulator.  ( 3 min )
    Semantic-Enhanced Representation Learning for Road Networks with Temporal Dynamics
    arXiv:2403.11495v1 Announce Type: new Abstract: In this study, we introduce a novel framework called Toast for learning general-purpose representations of road networks, along with its advanced counterpart DyToast, designed to enhance the integration of temporal dynamics to boost the performance of various time-sensitive downstream tasks. Specifically, we propose to encode two pivotal semantic characteristics intrinsic to road networks: traffic patterns and traveling semantics. To achieve this, we refine the skip-gram module by incorporating auxiliary objectives aimed at predicting the traffic context associated with a target road segment. Moreover, we leverage trajectory data and design pre-training strategies based on Transformer to distill traveling semantics on road networks. DyToast further augments this framework by employing unified trigonometric functions characterized by their beneficial properties, enabling the capture of temporal evolution and dynamic nature of road networks more effectively. With these proposed techniques, we can obtain representations that encode multi-faceted aspects of knowledge within road networks, applicable across both road segment-based applications and trajectory-based applications. Extensive experiments on two real-world datasets across three tasks demonstrate that our proposed framework consistently outperforms the state-of-the-art baselines by a significant margin.  ( 2 min )
    State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards
    arXiv:2403.11520v1 Announce Type: new Abstract: While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separate SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffers from combinatorial issues for large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.  ( 2 min )
    Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting
    arXiv:2403.11491v1 Announce Type: new Abstract: Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Although recent TTA has shown promising performance, we still face two key challenges: 1) prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs to many applications; 2) while existing TTA can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as forgetting). To this end, we have proposed an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples for test-time entropy minimization. To alleviate forgetting, EATA introduces a Fisher regularizer estimated from test samples to constrain important model parameters from drastic changes. However, in EATA, the adopted entropy loss consistently assigns higher confidence to predictions even for samples that are underlying uncertain, leading to overconfident predictions. To tackle this, we further propose EATA with Calibration (EATA-C) to separately exploit the reducible model uncertainty and the inherent data uncertainty for calibrated TTA. Specifically, we measure the model uncertainty by the divergence between predictions from the full network and its sub-networks, on which we propose a divergence loss to encourage consistent predictions instead of overconfident ones. To further recalibrate prediction confidence, we utilize the disagreement among predicted labels as an indicator of the data uncertainty, and then devise a min-max entropy regularizer to selectively increase and decrease prediction confidence for different samples. Experiments on image classification and semantic segmentation verify the effectiveness of our methods.  ( 3 min )
    Open-World Semi-Supervised Learning for Node Classification
    arXiv:2403.11483v1 Announce Type: new Abstract: Open-world semi-supervised learning (Open-world SSL) for node classification, that classifies unlabeled nodes into seen classes or multiple novel classes, is a practical but under-explored problem in the graph community. As only seen classes have human labels, they are usually better learned than novel classes, and thus exhibit smaller intra-class variances within the embedding space (named as imbalance of intra-class variances between seen and novel classes). Based on empirical and theoretical analysis, we find the variance imbalance can negatively impact the model performance. Pre-trained feature encoders can alleviate this issue via producing compact representations for novel classes. However, creating general pre-trained encoders for various types of graph data has been proven to be challenging. As such, there is a demand for an effective method that does not rely on pre-trained graph encoders. In this paper, we propose an IMbalance-Aware method named OpenIMA for Open-world semi-supervised node classification, which trains the node classification model from scratch via contrastive learning with bias-reduced pseudo labels. Extensive experiments on seven popular graph benchmarks demonstrate the effectiveness of OpenIMA, and the source code has been available on GitHub.  ( 2 min )
    Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs
    arXiv:2403.11477v1 Announce Type: new Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue a new transient time parameter $B$ is necessary, establish an $\tilde{O}(SA\frac{B+H}{\epsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{\Omega}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma\geq 1-1/H$, and $\tilde{\Omega}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both these results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.  ( 3 min )
    SeisFusion: Constrained Diffusion Model with Input Guidance for 3D Seismic Data Interpolation and Reconstruction
    arXiv:2403.11482v1 Announce Type: new Abstract: Geographical, physical, or economic constraints often result in missing traces within seismic data, making the reconstruction of complete seismic data a crucial step in seismic data processing. Traditional methods for seismic data reconstruction require the selection of multiple empirical parameters and struggle to handle large-scale continuous missing data. With the development of deep learning, various neural networks have demonstrated powerful reconstruction capabilities. However, these convolutional neural networks represent a point-to-point reconstruction approach that may not cover the entire distribution of the dataset. Consequently, when dealing with seismic data featuring complex missing patterns, such networks may experience varying degrees of performance degradation. In response to this challenge, we propose a novel diffusion model reconstruction framework tailored for 3D seismic data. To constrain the results generated by the diffusion model, we introduce conditional supervision constraints into the diffusion model, constraining the generated data of the diffusion model based on the input data to be reconstructed. We introduce a 3D neural network architecture into the diffusion model, successfully extending the 2D diffusion model to 3D space. Additionally, we refine the model's generation process by incorporating missing data into the generation process, resulting in reconstructions with higher consistency. Through ablation studies determining optimal parameter values, our method exhibits superior reconstruction accuracy when applied to both field datasets and synthetic datasets, effectively addressing a wide range of complex missing patterns. Our implementation is available at https://github.com/WAL-l/SeisFusion.  ( 3 min )
    Accelerating String-Key Learned Index Structures via Memoization-based Incremental Training
    arXiv:2403.11472v1 Announce Type: new Abstract: Learned indexes use machine learning models to learn the mappings between keys and their corresponding positions in key-value indexes. These indexes use the mapping information as training data. Learned indexes require frequent retrainings of their models to incorporate the changes introduced by update queries. To efficiently retrain the models, existing learned index systems often harness a linear algebraic QR factorization technique that performs matrix decomposition. This factorization approach processes all key-position pairs during each retraining, resulting in compute operations that grow linearly with the total number of keys and their lengths. Consequently, the retrainings create a severe performance bottleneck, especially for variable-length string keys, while the retrainings are crucial for maintaining high prediction accuracy and in turn, ensuring low query service latency. To address this performance problem, we develop an algorithm-hardware co-designed string-key learned index system, dubbed SIA. In designing SIA, we leverage a unique algorithmic property of the matrix decomposition-based training method. Exploiting the property, we develop a memoization-based incremental training scheme, which only requires computation over updated keys, while decomposition results of non-updated keys from previous computations can be reused. We further enhance SIA to offload a portion of this training process to an FPGA accelerator to not only relieve CPU resources for serving index queries (i.e., inference), but also accelerate the training itself. Our evaluation shows that compared to ALEX, LIPP, and SIndex, a state-of-the-art learned index systems, SIA-accelerated learned indexes offer 2.6x and 3.4x higher throughput on the two real-world benchmark suites, YCSB and Twitter cache trace, respectively.  ( 3 min )
    FedSPU: Personalized Federated Learning for Resource-constrained Devices with Stochastic Parameter Update
    arXiv:2403.11464v1 Announce Type: new Abstract: Personalized Federated Learning (PFL) is widely employed in IoT applications to handle high-volume, non-iid client data while ensuring data privacy. However, heterogeneous edge devices owned by clients may impose varying degrees of resource constraints, causing computation and communication bottlenecks for PFL. Federated Dropout has emerged as a popular strategy to address this challenge, wherein only a subset of the global model, i.e. a \textit{sub-model}, is trained on a client's device, thereby reducing computation and communication overheads. Nevertheless, the dropout-based model-pruning strategy may introduce bias, particularly towards non-iid local data. When biased sub-models absorb highly divergent parameters from other clients, performance degradation becomes inevitable. In response, we propose federated learning with stochastic parameter update (FedSPU). Unlike dropout that tailors the global model to small-size local sub-models, FedSPU maintains the full model architecture on each device but randomly freezes a certain percentage of neurons in the local model during training while updating the remaining neurons. This approach ensures that a portion of the local model remains personalized, thereby enhancing the model's robustness against biased parameters from other clients. Experimental results demonstrate that FedSPU outperforms federated dropout by 7.57\% on average in terms of accuracy. Furthermore, an introduced early stopping scheme leads to a significant reduction of the training time by \(24.8\%\sim70.4\%\) while maintaining high accuracy.  ( 2 min )
    Narrative Feature or Structured Feature? A Study of Large Language Models to Identify Cancer Patients at Risk of Heart Failure
    arXiv:2403.11425v1 Announce Type: new Abstract: Cancer treatments are known to introduce cardiotoxicity, negatively impacting outcomes and survivorship. Identifying cancer patients at risk of heart failure (HF) is critical to improving cancer treatment outcomes and safety. This study examined machine learning (ML) models to identify cancer patients at risk of HF using electronic health records (EHRs), including traditional ML, Time-Aware long short-term memory (T-LSTM), and large language models (LLMs) using novel narrative features derived from the structured medical codes. We identified a cancer cohort of 12,806 patients from the University of Florida Health, diagnosed with lung, breast, and colorectal cancers, among which 1,602 individuals developed HF after cancer. The LLM, GatorTron-3.9B, achieved the best F1 scores, outperforming the traditional support vector machines by 39%, the T-LSTM deep learning model by 7%, and a widely used transformer model, BERT, by 5.6%. The analysis shows that the proposed narrative features remarkably increased feature density and improved performance.  ( 2 min )
    Graph Partial Label Learning with Potential Cause Discovering
    arXiv:2403.11449v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have gained considerable attention for their potential in addressing challenges posed by complex graph-structured data in diverse domains. However, accurately annotating graph data for training is difficult due to the inherent complexity and interconnectedness of graphs. To tackle this issue, we propose a novel graph representation learning method that enables GNN models to effectively learn discriminative information even in the presence of noisy labels within the context of Partially Labeled Learning (PLL). PLL is a critical weakly supervised learning problem, where each training instance is associated with a set of candidate labels, including both the true label and additional noisy labels. Our approach leverages potential cause extraction to obtain graph data that exhibit a higher likelihood of possessing a causal relationship with the labels. By incorporating auxiliary training based on the extracted graph data, our model can effectively filter out the noise contained in the labels. We support the rationale behind our approach with a series of theoretical analyses. Moreover, we conduct extensive evaluations and ablation studies on multiple datasets, demonstrating the superiority of our proposed method.  ( 2 min )
    Variational Sampling of Temporal Trajectories
    arXiv:2403.11418v1 Announce Type: new Abstract: A deterministic temporal process can be determined by its trajectory, an element in the product space of (a) initial condition $z_0 \in \mathcal{Z}$ and (b) transition function $f: (\mathcal{Z}, \mathcal{T}) \to \mathcal{Z}$ often influenced by the control of the underlying dynamical system. Existing methods often model the transition function as a differential equation or as a recurrent neural network. Despite their effectiveness in predicting future measurements, few results have successfully established a method for sampling and statistical inference of trajectories using neural networks, partially due to constraints in the parameterization. In this work, we introduce a mechanism to learn the distribution of trajectories by parameterizing the transition function $f$ explicitly as an element in a function space. Our framework allows efficient synthesis of novel trajectories, while also directly providing a convenient tool for inference, i.e., uncertainty estimation, likelihood evaluations and out of distribution detection for abnormal trajectories. These capabilities can have implications for various downstream tasks, e.g., simulation and evaluation for reinforcement learning.  ( 2 min )
    Layer-diverse Negative Sampling for Graph Neural Networks
    arXiv:2403.11408v1 Announce Type: new Abstract: Graph neural networks (GNNs) are a powerful solution for various structure learning applications due to their strong representation capabilities for graph data. However, traditional GNNs, relying on message-passing mechanisms that gather information exclusively from first-order neighbours (known as positive samples), can lead to issues such as over-smoothing and over-squashing. To mitigate these issues, we propose a layer-diverse negative sampling method for message-passing propagation. This method employs a sampling matrix within a determinantal point process, which transforms the candidate set into a space and selectively samples from this space to generate negative samples. To further enhance the diversity of the negative samples during each forward pass, we develop a space-squeezing method to achieve layer-wise diversity in multi-layer GNNs. Experiments on various real-world graph datasets demonstrate the effectiveness of our approach in improving the diversity of negative samples and overall learning performance. Moreover, adding negative samples dynamically changes the graph's topology, thus with the strong potential to improve the expressiveness of GNNs and reduce the risk of over-squashing.  ( 2 min )
    Automated data processing and feature engineering for deep learning and big data applications: a survey
    arXiv:2403.11395v1 Announce Type: new Abstract: Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before they can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for Big Data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing--e.g., data cleaning, labeling, missing data imputation, and categorical data encoding--as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering--specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.  ( 3 min )
    JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning
    arXiv:2403.11366v1 Announce Type: new Abstract: The scaling of Large Language Models (LLMs) for retrieval-based tasks, particularly in Retrieval Augmented Generation (RAG), faces significant memory constraints, especially when fine-tuning extensive prompt sequences. Current open-source libraries support full-model inference and fine-tuning across multiple GPUs but fall short of accommodating the efficient parameter distribution required for retrieved context. Addressing this gap, we introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training. Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management, thereby enabling accelerated fine-tuning with reduced memory requirements. This advancement significantly improves the scalability and feasibility of fine-tuning LLMs for complex RAG applications, even on systems with limited GPU resources. Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPUs while consuming less than half the VRAM per GPU. Our library will be open-sourced in due course.  ( 2 min )
    IGANN Sparse: Bridging Sparsity and Interpretability with Non-linear Insight
    arXiv:2403.11363v1 Announce Type: new Abstract: Feature selection is a critical component in predictive analytics that significantly affects the prediction accuracy and interpretability of models. Intrinsic methods for feature selection are built directly into model learning, providing a fast and attractive option for large amounts of data. Machine learning algorithms, such as penalized regression models (e.g., lasso) are the most common choice when it comes to in-built feature selection. However, they fail to capture non-linear relationships, which ultimately affects their ability to predict outcomes in intricate datasets. In this paper, we propose IGANN Sparse, a novel machine learning model from the family of generalized additive models, which promotes sparsity through a non-linear feature selection process during training. This ensures interpretability through improved model sparsity without sacrificing predictive performance. Moreover, IGANN Sparse serves as an exploratory tool for information systems researchers to unveil important non-linear relationships in domains that are characterized by complex patterns. Our ongoing research is directed at a thorough evaluation of the IGANN Sparse model, including user studies that allow to assess how well users of the model can benefit from the reduced number of features. This will allow for a deeper understanding of the interactions between linear vs. non-linear modeling, number of selected features, and predictive performance.  ( 2 min )
    COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits
    arXiv:2403.11348v1 Announce Type: new Abstract: Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during the inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2.  ( 2 min )
    Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective
    arXiv:2403.11345v1 Announce Type: new Abstract: We address in this paper Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in the finite population setting, we consider the case where the number of agents within each team is infinite, i.e., the mean-field setting. This results in a General-Sum LQ Mean-Field Type Game (GS-MFTGs). We characterize the Nash equilibrium (NE) of the GS-MFTG, under a standard invertibility condition. This MFTG NE is then shown to be $\mathcal{O}(1/M)$-NE for the finite population game where $M$ is a lower bound on the number of agents in each team. These structural results motivate an algorithm called Multi-player Receding-horizon Natural Policy Gradient (MRPG), where each team minimizes its cumulative cost independently in a receding-horizon manner. Despite the non-convexity of the problem, we establish that the resulting algorithm converges to a global NE through a novel problem decomposition into sub-problems using backward recursive discrete-time Hamilton-Jacobi-Isaacs (HJI) equations, in which independent natural policy gradient is shown to exhibit linear convergence under time-independent diagonal dominance. Experiments illuminate the merits of this approach in practice.  ( 2 min )
    Federated Transfer Learning with Differential Privacy
    arXiv:2403.11343v1 Announce Type: new Abstract: Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of \textit{federated differential privacy}, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets.  ( 2 min )
    Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts
    arXiv:2403.11314v1 Announce Type: new Abstract: Transformer language models are neural networks used for a wide variety of tasks concerning natural language, including some that also require logical reasoning. However, a transformer model may easily learn spurious patterns in the data, short-circuiting actual reasoning. In this paper we investigate to what extent transformers can be trained to a) approximate reasoning in propositional logic while b) avoiding known reasoning shortcuts via spurious correlations in the training data. To do so, we use a dataset with known spurious correlation between truth and e.g. the number of rules in the problem. We augment the data with proofs, and train two models: a generative transformer, WP-BART, trained on problems and their whole proofs, and a neuro-symbolic model, SIP-BART, trained on individual proof steps and combining the generative transformer model BART with a symbolic proof checker. We find that SIP-BART succeeds in avoiding reasoning shortcuts, while WP-BART does not. For SIP-BART, we then identify a few remaining reasoning errors, not previously described in the literature, arising from using a pre-trained language model. These are qualitatively analysed to create a taxonomy of four different types of additional pitfalls.  ( 2 min )
    Graph Neural Network based Double Machine Learning Estimator of Network Causal Effects
    arXiv:2403.11332v1 Announce Type: new Abstract: Our paper addresses the challenge of inferring causal effects in social network data, characterized by complex interdependencies among individuals resulting in challenges such as non-independence of units, interference (where a unit's outcome is affected by neighbors' treatments), and introduction of additional confounding factors from neighboring units. We propose a novel methodology combining graph neural networks and double machine learning, enabling accurate and efficient estimation of direct and peer effects using a single observational social network. Our approach utilizes graph isomorphism networks in conjunction with double machine learning to effectively adjust for network confounders and consistently estimate the desired causal effects. We demonstrate that our estimator is both asymptotically normal and semiparametrically efficient. A comprehensive evaluation against four state-of-the-art baseline methods using three semi-synthetic social network datasets reveals our method's on-par or superior efficacy in precise causal effect estimation. Further, we illustrate the practical application of our method through a case study that investigates the impact of Self-Help Group participation on financial risk tolerance. The results indicate a significant positive direct effect, underscoring the potential of our approach in social network analysis. Additionally, we explore the effects of network sparsity on estimation performance.  ( 2 min )
    Multi-Relational Graph Neural Network for Out-of-Domain Link Prediction
    arXiv:2403.11292v1 Announce Type: new Abstract: Dynamic multi-relational graphs are an expressive relational representation for data enclosing entities and relations of different types, and where relationships are allowed to vary in time. Addressing predictive tasks over such data requires the ability to find structure embeddings that capture the diversity of the relationships involved, as well as their dynamic evolution. In this work, we establish a novel class of challenging tasks for dynamic multi-relational graphs involving out-of-domain link prediction, where the relationship being predicted is not available in the input graph. We then introduce a novel Graph Neural Network model, named GOOD, designed specifically to tackle the out-of-domain generalization problem. GOOD introduces a novel design concept for multi-relation embedding aggregation, based on the idea that good representations are such when it is possible to disentangle the mixing proportions of the different relational embeddings that have produced it. We also propose five benchmarks based on two retail domains, where we show that GOOD can effectively generalize predictions out of known relationship types and achieve state-of-the-art results. Most importantly, we provide insights into problems where out-of-domain prediction might be preferred to an in-domain formulation, that is, where the relationship to be predicted has very few positive examples.  ( 2 min )
    Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation
    arXiv:2403.11265v1 Announce Type: new Abstract: Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else. It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author. In this paper, we investigate the potential benefits of augmenting the classifier training set with (negative) synthetic examples. These synthetic examples are generated to imitate the style of the author of interest. We analyze the improvements in classifier prediction that this augmentation brings to bear in the task of AV in an adversarial setting. In particular, we experiment with three different generator architectures (one based on Recurrent Neural Networks, another based on small-scale transformers, and another based on the popular GPT model) and with two training strategies (one inspired by standard Language Models, and another inspired by Wasserstein Generative Adversarial Networks). We evaluate our hypothesis on five datasets (three of which have been specifically collected to represent an adversarial setting) and using two learning algorithms for the AV classifier (Support Vector Machines and Convolutional Neural Networks). This experimentation has yielded negative results, revealing that, although our methodology proves effective in many adversarial settings, its benefits are too sporadic for a pragmatical application.  ( 3 min )
    A Lie Group Approach to Riemannian Batch Normalization
    arXiv:2403.11261v1 Announce Type: new Abstract: Manifold-valued measurements exist in numerous applications within computer vision and machine learning. Recent studies have extended Deep Neural Networks (DNNs) to manifolds, and concomitantly, normalization techniques have also been adapted to several manifolds, referred to as Riemannian normalization. Nonetheless, most of the existing Riemannian normalization methods have been derived in an ad hoc manner and only apply to specific manifolds. This paper establishes a unified framework for Riemannian Batch Normalization (RBN) techniques on Lie groups. Our framework offers the theoretical guarantee of controlling both the Riemannian mean and variance. Empirically, we focus on Symmetric Positive Definite (SPD) manifolds, which possess three distinct types of Lie group structures. Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups. Specific normalization layers induced by these Lie groups are then proposed for SPD neural networks. We demonstrate the effectiveness of our approach through three sets of experiments: radar recognition, human action recognition, and electroencephalography (EEG) classification. The code is available at https://github.com/GitZH-Chen/LieBN.git.  ( 2 min )
    Understanding Diffusion Models by Feynman's Path Integral
    arXiv:2403.11262v1 Announce Type: new Abstract: Score-based diffusion models have proven effective in image generation and have gained widespread usage; however, the underlying factors contributing to the performance disparity between stochastic and deterministic (i.e., the probability flow ODEs) sampling schemes remain unclear. We introduce a novel formulation of diffusion models using Feynman's path integral, which is a formulation originally developed for quantum physics. We find this formulation providing comprehensive descriptions of score-based generative models, and demonstrate the derivation of backward stochastic differential equations and loss functions.The formulation accommodates an interpolating parameter connecting stochastic and deterministic sampling schemes, and we identify this parameter as a counterpart of Planck's constant in quantum physics. This analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a well-established technique in quantum physics, for evaluating the negative log-likelihood to assess the performance disparity between stochastic and deterministic sampling schemes.  ( 2 min )
    A learning-based solution approach to the application placement problem in mobile edge computing under uncertainty
    arXiv:2403.11259v1 Announce Type: new Abstract: Placing applications in mobile edge computing servers presents a complex challenge involving many servers, users, and their requests. Existing algorithms take a long time to solve high-dimensional problems with significant uncertainty scenarios. Therefore, an efficient approach is required to maximize the quality of service while considering all technical constraints. One of these approaches is machine learning, which emulates optimal solutions for application placement in edge servers. Machine learning models are expected to learn how to allocate user requests to servers based on the spatial positions of users and servers. In this study, the problem is formulated as a two-stage stochastic programming. A sufficient amount of training records is generated by varying parameters such as user locations, their request rates, and solving the optimization model. Then, based on the distance features of each user from the available servers and their request rates, machine learning models generate decision variables for the first stage of the stochastic optimization model, which is the user-to-server request allocation, and are employed as independent decision agents that reliably mimic the optimization model. Support Vector Machines (SVM) and Multi-layer Perceptron (MLP) are used in this research to achieve practical decisions from the stochastic optimization models. The performance of each model has shown an execution effectiveness of over 80%. This research aims to provide a more efficient approach for tackling high-dimensional problems and scenarios with uncertainties in mobile edge computing by leveraging machine learning models for optimal decision-making in request allocation to edge servers. These results suggest that machine-learning models can significantly improve solution times compared to conventional approaches.  ( 3 min )
    CBR - Boosting Adaptive Classification By Retrieval of Encrypted Network Traffic with Out-of-distribution
    arXiv:2403.11206v1 Announce Type: new Abstract: Encrypted network traffic Classification tackles the problem from different approaches and with different goals. One of the common approaches is using Machine learning or Deep Learning-based solutions on a fixed number of classes, leading to misclassification when an unknown class is given as input. One of the solutions for handling unknown classes is to retrain the model, however, retraining models every time they become obsolete is both resource and time-consuming. Therefore, there is a growing need to allow classification models to detect and adapt to new classes dynamically, without retraining, but instead able to detect new classes using few shots learning [1]. In this paper, we introduce Adaptive Classification By Retrieval CBR, a novel approach for encrypted network traffic classification. Our new approach is based on an ANN-based method, which allows us to effectively identify new and existing classes without retraining the model. The novel approach is simple, yet effective and achieved similar results to RF with up to 5% difference (usually less than that) in the classification tasks while having a slight decrease in the case of new samples (from new classes) without retraining. To summarize, the new method is a real-time classification, which can classify new classes without retraining. Furthermore, our solution can be used as a complementary solution alongside RF or any other machine/deep learning classification method, as an aggregated solution.  ( 3 min )
    Graph Unitary Message Passing
    arXiv:2403.11199v1 Announce Type: new Abstract: Message passing mechanism contributes to the success of GNNs in various applications, but also brings the oversquashing problem. Recent works combat oversquashing by improving the graph spectrums with rewiring techniques, disrupting the structural bias in graphs, and having limited improvement on oversquashing in terms of oversquashing measure. Motivated by unitary RNN, we propose Graph Unitary Message Passing (GUMP) to alleviate oversquashing in GNNs by applying unitary adjacency matrix for message passing. To design GUMP, a transformation is first proposed to make general graphs have unitary adjacency matrix and keep its structural bias. Then, unitary adjacency matrix is obtained with a unitary projection algorithm, which is implemented by utilizing the intrinsic structure of unitary adjacency matrix and allows GUMP to be permutation-equivariant. Experimental results show the effectiveness of GUMP in improving the performance on various graph learning tasks.  ( 2 min )
    Partitioned Neural Network Training via Synthetic Intermediate Labels
    arXiv:2403.11204v1 Announce Type: new Abstract: The proliferation of extensive neural network architectures, particularly deep learning models, presents a challenge in terms of resource-intensive training. GPU memory constraints have become a notable bottleneck in training such sizable models. Existing strategies, including data parallelism, model parallelism, pipeline parallelism, and fully sharded data parallelism, offer partial solutions. Model parallelism, in particular, enables the distribution of the entire model across multiple GPUs, yet the ensuing data communication between these partitions slows down training. Additionally, the substantial memory overhead required to store auxiliary parameters on each GPU compounds computational demands. Instead of using the entire model for training, this study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments. These labels, produced through a random process, mitigate memory overhead and computational load. This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy. To validate this method, a 6-layer fully connected neural network is partitioned into two parts and its performance is assessed on the extended MNIST dataset. Experimental results indicate that the proposed approach achieves similar testing accuracies to conventional training methods, while significantly reducing memory and computational requirements. This work contributes to mitigating the resource-intensive nature of training large neural networks, paving the way for more efficient deep learning model development.  ( 3 min )
    Phasic Diversity Optimization for Population-Based Reinforcement Learning
    arXiv:2403.11114v1 Announce Type: new Abstract: Reviewing the previous work of diversity Rein-forcement Learning,diversity is often obtained via an augmented loss function,which requires a balance between reward and diversity.Generally,diversity optimization algorithms use Multi-armed Bandits algorithms to select the coefficient in the pre-defined space. However, the dynamic distribution of reward signals for MABs or the conflict between quality and diversity limits the performance of these methods. We introduce the Phasic Diversity Optimization (PDO) algorithm, a Population-Based Training framework that separates reward and diversity training into distinct phases instead of optimizing a multi-objective function. In the auxiliary phase, agents with poor performance diversified via determinants will not replace the better agents in the archive. The decoupling of reward and diversity allows us to use an aggressive diversity optimization in the auxiliary phase without performance degradation. Furthermore, we construct a dogfight scenario for aerial agents to demonstrate the practicality of the PDO algorithm. We introduce two implementations of PDO archive and conduct tests in the newly proposed adversarial dogfight and MuJoCo simulations. The results show that our proposed algorithm achieves better performance than baselines.  ( 2 min )
    Graph Expansion in Pruned Recurrent Neural Network Layers Preserve Performance
    arXiv:2403.11100v1 Announce Type: new Abstract: Expansion property of a graph refers to its strong connectivity as well as sparseness. It has been reported that deep neural networks can be pruned to a high degree of sparsity while maintaining their performance. Such pruning is essential for performing real time sequence learning tasks using recurrent neural networks in resource constrained platforms. We prune recurrent networks such as RNNs and LSTMs, maintaining a large spectral gap of the underlying graphs and ensuring their layerwise expansion properties. We also study the time unfolded recurrent network graphs in terms of the properties of their bipartite layers. Experimental results for the benchmark sequence MNIST, CIFAR-10, and Google speech command data show that expander graph properties are key to preserving classification accuracy of RNN and LSTM.  ( 2 min )
    Self-Supervised Quantization-Aware Knowledge Distillation
    arXiv:2403.11106v1 Announce Type: new Abstract: Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/SQAKD.git.  ( 2 min )
    Incorporating Higher-order Structural Information for Graph Clustering
    arXiv:2403.11087v1 Announce Type: new Abstract: Clustering holds profound significance in data mining. In recent years, graph convolutional network (GCN) has emerged as a powerful tool for deep clustering, integrating both graph structural information and node attributes. However, most existing methods ignore the higher-order structural information of the graph. Evidently, nodes within the same cluster can establish distant connections. Besides, recent deep clustering methods usually apply a self-supervised module to monitor the training process of their model, focusing solely on node attributes without paying attention to graph structure. In this paper, we propose a novel graph clustering network to make full use of graph structural information. To capture the higher-order structural information, we design a graph mutual infomax module, effectively maximizing mutual information between graph-level and node-level representations, and employ a trinary self-supervised module that includes modularity as a structural constraint. Our proposed model outperforms many state-of-the-art methods on various datasets, demonstrating its superiority.  ( 2 min )
    A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization
    arXiv:2403.11062v1 Announce Type: new Abstract: Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some Mujoco environments where the traditional CVaR-PG fails to learn a reasonable policy.  ( 2 min )
    FAGH: Accelerating Federated Learning with Approximated Global Hessian
    arXiv:2403.11041v1 Announce Type: new Abstract: In federated learning (FL), the significant communication overhead due to the slow convergence speed of training the global model poses a great challenge. Specifically, a large number of communication rounds are required to achieve the convergence in FL. One potential solution is to employ the Newton-based optimization method for training, known for its quadratic convergence rate. However, the existing Newton-based FL training methods suffer from either memory inefficiency or high computational costs for local clients or the server. To address this issue, we propose an FL with approximated global Hessian (FAGH) method to accelerate FL training. FAGH leverages the first moment of the approximated global Hessian and the first moment of the global gradient to train the global model. By harnessing the approximated global Hessian curvature, FAGH accelerates the convergence of global model training, leading to the reduced number of communication rounds and thus the shortened training time. Experimental results verify FAGH's effectiveness in decreasing the number of communication rounds and the time required to achieve the pre-specified objectives of the global model performance in terms of training and test losses as well as test accuracy. Notably, FAGH outperforms several state-of-the-art FL training methods.  ( 2 min )
    Advancing multivariate time series similarity assessment: an integrated computational approach
    arXiv:2403.11044v1 Announce Type: new Abstract: Data mining, particularly the analysis of multivariate time series data, plays a crucial role in extracting insights from complex systems and supporting informed decision-making across diverse domains. However, assessing the similarity of multivariate time series data presents several challenges, including dealing with large datasets, addressing temporal misalignments, and the need for efficient and comprehensive analytical frameworks. To address all these challenges, we propose a novel integrated computational approach known as Multivariate Time series Alignment and Similarity Assessment (MTASA). MTASA is built upon a hybrid methodology designed to optimize time series alignment, complemented by a multiprocessing engine that enhances the utilization of computational resources. This integrated approach comprises four key components, each addressing essential aspects of time series similarity assessment, thereby offering a comprehensive framework for analysis. MTASA is implemented as an open-source Python library with a user-friendly interface, making it accessible to researchers and practitioners. To evaluate the effectiveness of MTASA, we conducted an empirical study focused on assessing agroecosystem similarity using real-world environmental data. The results from this study highlight MTASA's superiority, achieving approximately 1.5 times greater accuracy and twice the speed compared to existing state-of-the-art integrated frameworks for multivariate time series similarity assessment. It is hoped that MTASA will significantly enhance the efficiency and accessibility of multivariate time series analysis, benefitting researchers and practitioners across various domains. Its capabilities in handling large datasets, addressing temporal misalignments, and delivering accurate results make MTASA a valuable tool for deriving insights and aiding decision-making processes in complex systems.  ( 3 min )
    FH-TabNet: Multi-Class Familial Hypercholesterolemia Detection via a Multi-Stage Tabular Deep Learning
    arXiv:2403.11032v1 Announce Type: new Abstract: Familial Hypercholesterolemia (FH) is a genetic disorder characterized by elevated levels of Low-Density Lipoprotein (LDL) cholesterol or its associated genes. Early-stage and accurate categorization of FH is of significance allowing for timely interventions to mitigate the risk of life-threatening conditions. Conventional diagnosis approach, however, is complex, costly, and a challenging interpretation task even for experienced clinicians resulting in high underdiagnosis rates. Although there has been a recent surge of interest in using Machine Learning (ML) models for early FH detection, existing solutions only consider a binary classification task solely using classical ML models. Despite its significance, application of Deep Learning (DL) for FH detection is in its infancy, possibly, due to categorical nature of the underlying clinical data. The paper addresses this gap by introducing the FH-TabNet, which is a multi-stage tabular DL network for multi-class (Definite, Probable, Possible, and Unlikely) FH detection. The FH-TabNet initially involves applying a deep tabular data learning architecture (TabNet) for primary categorization into healthy (Possible/Unlikely) and patient (Probable/Definite) classes. Subsequently, independent TabNet classifiers are applied to each subgroup, enabling refined classification. The model's performance is evaluated through 5-fold cross-validation illustrating superior performance in categorizing FH patients, particularly in the challenging low-prevalence subcategories.  ( 2 min )
    Accelerating prototype selection with spatial abstraction
    arXiv:2403.11020v1 Announce Type: new Abstract: The increasing digitalization in industry and society leads to a growing abundance of data available to be processed and exploited. However, the high volume of data requires considerable computational resources for applying machine learning approaches. Prototype selection techniques have been applied to reduce the requirements of computational resources that are needed by these techniques. In this paper, we propose an approach for speeding up existing prototype selection techniques. It builds an abstract representation of the dataset, using the notion of spatial partition. The second step uses this abstract representation to prune the search space efficiently and select a set of candidate prototypes. After, some conventional prototype selection algorithms can be applied to the candidates selected by our approach. Our approach was integrated with five conventional prototype selection algorithms and tested on 14 widely recognized datasets used in classification tasks. The performance of the modified algorithms was compared to that of their original versions in terms of accuracy and reduction rate. The experimental results demonstrate that, overall, our proposed approach maintains accuracy while enhancing the reduction rate of the original prototype selection algorithms and simultaneously reducing their execution times.  ( 2 min )
    Forward Learning of Graph Neural Networks
    arXiv:2403.11004v1 Announce Type: new Abstract: Graph neural networks (GNNs) have achieved remarkable success across a wide range of applications, such as recommendation, drug discovery, and question answering. Behind the success of GNNs lies the backpropagation (BP) algorithm, which is the de facto standard for training deep neural networks (NNs). However, despite its effectiveness, BP imposes several constraints, which are not only biologically implausible, but also limit the scalability, parallelism, and flexibility in learning NNs. Examples of such constraints include storage of neural activities computed in the forward pass for use in the subsequent backward pass, and the dependence of parameter updates on non-local signals. To address these limitations, the forward-forward algorithm (FF) was recently proposed as an alternative to BP in the image classification domain, which trains NNs by performing two forward passes over positive and negative data. Inspired by this advance, we propose ForwardGNN in this work, a new forward learning procedure for GNNs, which avoids the constraints imposed by BP via an effective layer-wise local forward training. ForwardGNN extends the original FF to deal with graph data and GNNs, and makes it possible to operate without generating negative inputs (hence no longer forward-forward). Further, ForwardGNN enables each layer to learn from both the bottom-up and top-down signals without relying on the backpropagation of errors. Extensive experiments on real-world datasets show the effectiveness and generality of the proposed forward graph learning framework. We release our code at https://github.com/facebookresearch/forwardgnn.  ( 3 min )
    Improved Algorithm and Bounds for Successive Projection
    arXiv:2403.11013v1 Announce Type: new Abstract: Given a $K$-vertex simplex in a $d$-dimensional space, suppose we measure $n$ points on the simplex with noise (hence, some of the observed points fall outside the simplex). Vertex hunting is the problem of estimating the $K$ vertices of the simplex. A popular vertex hunting algorithm is successive projection algorithm (SPA). However, SPA is observed to perform unsatisfactorily under strong noise or outliers. We propose pseudo-point SPA (pp-SPA). It uses a projection step and a denoise step to generate pseudo-points and feed them into SPA for vertex hunting. We derive error bounds for pp-SPA, leveraging on extreme value theory of (possibly) high-dimensional random vectors. The results suggest that pp-SPA has faster rates and better numerical performances than SPA. Our analysis includes an improved non-asymptotic bound for the original SPA, which is of independent interest.  ( 2 min )
    Edge Private Graph Neural Networks with Singular Value Perturbation
    arXiv:2403.10995v1 Announce Type: new Abstract: Graph neural networks (GNNs) play a key role in learning representations from graph-structured data and are demonstrated to be useful in many applications. However, the GNN training pipeline has been shown to be vulnerable to node feature leakage and edge extraction attacks. This paper investigates a scenario where an attacker aims to recover private edge information from a trained GNN model. Previous studies have employed differential privacy (DP) to add noise directly to the adjacency matrix or a compact graph representation. The added perturbations cause the graph structure to be substantially morphed, reducing the model utility. We propose a new privacy-preserving GNN training algorithm, Eclipse, that maintains good model utility while providing strong privacy protection on edges. Eclipse is based on two key observations. First, adjacency matrices in graph structures exhibit low-rank behavior. Thus, Eclipse trains GNNs with a low-rank format of the graph via singular values decomposition (SVD), rather than the original graph. Using the low-rank format, Eclipse preserves the primary graph topology and removes the remaining residual edges. Eclipse adds noise to the low-rank singular values instead of the entire graph, thereby preserving the graph privacy while still maintaining enough of the graph structure to maintain model utility. We theoretically show Eclipse provide formal DP guarantee on edges. Experiments on benchmark graph datasets show that Eclipse achieves significantly better privacy-utility tradeoff compared to existing privacy-preserving GNN training methods. In particular, under strong privacy constraints ($\epsilon$ < 4), Eclipse shows significant gains in the model utility by up to 46%. We further demonstrate that Eclipse also has better resilience against common edge attacks (e.g., LPA), lowering the attack AUC by up to 5% compared to other state-of-the-art baselines.  ( 3 min )
    IoTCO2: Assessing the End-To-End Carbon Footprint of Internet-of-Things-Enabled Deep Learning
    arXiv:2403.10984v1 Announce Type: new Abstract: To improve privacy and ensure quality-of-service (QoS), deep learning (DL) models are increasingly deployed on Internet of Things (IoT) devices for data processing, significantly increasing the carbon footprint associated with DL on IoT, covering both operational and embodied aspects. Existing operational energy predictors often overlook quantized DL models and emerging neural processing units (NPUs), while embodied carbon footprint modeling tools neglect non-computing hardware components common in IoT devices, creating a gap in accurate carbon footprint modeling tools for IoT-enabled DL. This paper introduces \textit{\carb}, an end-to-end modeling tool for precise carbon footprint estimation in IoT-enabled DL, demonstrating a maximum $\pm21\%$ deviation in carbon footprint values compared to actual measurements across various DL models. Additionally, practical applications of \carb are showcased through multiple user case studies.  ( 2 min )
    Energy-Based Models with Applications to Speech and Language Processing
    arXiv:2403.10961v1 Announce Type: new Abstract: Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fix-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.  ( 3 min )
    Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization
    arXiv:2403.10967v1 Announce Type: new Abstract: Zero-shot generalization (ZSG) to unseen dynamics is a major challenge for creating generally capable embodied agents. To address the broader challenge, we start with the simpler setting of contextual reinforcement learning (cRL), assuming observability of the context values that parameterize the variation in the system's dynamics, such as the mass or dimensions of a robot, without making further simplifying assumptions about the observability of the Markovian state. Toward the goal of ZSG to unseen variation in context, we propose the contextual recurrent state-space model (cRSSM), which introduces changes to the world model of the Dreamer (v3) (Hafner et al., 2023). This allows the world model to incorporate context for inferring latent Markovian states from the observations and modeling the latent dynamics. Our experiments show that such systematic incorporation of the context improves the ZSG of the policies trained on the ``dreams'' of the world model. We further find qualitatively that our approach allows Dreamer to disentangle the latent state from context, allowing it to extrapolate its dreams to the many worlds of unseen contexts. The code for all our experiments is available at \url{https://github.com/sai-prasanna/dreaming_of_many_worlds}.  ( 2 min )
    Interpretable Machine Learning for TabPFN
    arXiv:2403.10923v1 Announce Type: new Abstract: The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN. Our proposed methods are implemented in a package tabpfn_iml and made available at https://github.com/david-rundel/tabpfn_iml.  ( 2 min )
    Graph Regularized NMF with L20-norm for Unsupervised Feature Learning
    arXiv:2403.10910v1 Announce Type: new Abstract: Nonnegative Matrix Factorization (NMF) is a widely applied technique in the fields of machine learning and data mining. Graph Regularized Non-negative Matrix Factorization (GNMF) is an extension of NMF that incorporates graph regularization constraints. GNMF has demonstrated exceptional performance in clustering and dimensionality reduction, effectively discovering inherent low-dimensional structures embedded within high-dimensional spaces. However, the sensitivity of GNMF to noise limits its stability and robustness in practical applications. In order to enhance feature sparsity and mitigate the impact of noise while mining row sparsity patterns in the data for effective feature selection, we introduce the $\ell_{2,0}$-norm constraint as the sparsity constraints for GNMF. We propose an unsupervised feature learning framework based on GNMF\_$\ell_{20}$ and devise an algorithm based on PALM and its accelerated version to address this problem. Additionally, we establish the convergence of the proposed algorithms and validate the efficacy and superiority of our approach through experiments conducted on both simulated and real image data.  ( 2 min )
    List Sample Compression and Uniform Convergence
    arXiv:2403.10889v1 Announce Type: new Abstract: List learning is a variant of supervised classification where the learner outputs multiple plausible labels for each instance rather than just one. We investigate classical principles related to generalization within the context of list learning. Our primary goal is to determine whether classical principles in the PAC setting retain their applicability in the domain of list PAC learning. We focus on uniform convergence (which is the basis of Empirical Risk Minimization) and on sample compression (which is a powerful manifestation of Occam's Razor). In classical PAC learning, both uniform convergence and sample compression satisfy a form of `completeness': whenever a class is learnable, it can also be learned by a learning rule that adheres to these principles. We ask whether the same completeness holds true in the list learning setting. We show that uniform convergence remains equivalent to learnability in the list PAC learning setting. In contrast, our findings reveal surprising results regarding sample compression: we prove that when the label space is $Y=\{0,1,2\}$, then there are 2-list-learnable classes that cannot be compressed. This refutes the list version of the sample compression conjecture by Littlestone and Warmuth (1986). We prove an even stronger impossibility result, showing that there are $2$-list-learnable classes that cannot be compressed even when the reconstructed function can work with lists of arbitrarily large size. We prove a similar result for (1-list) PAC learnable classes when the label space is unbounded. This generalizes a recent result by arXiv:2308.06424.  ( 2 min )
    DTOR: Decision Tree Outlier Regressor to explain anomalies
    arXiv:2403.10903v1 Announce Type: new Abstract: Explaining outliers occurrence and mechanism of their occurrence can be extremely important in a variety of domains. Malfunctions, frauds, threats, in addition to being correctly identified, oftentimes need a valid explanation in order to effectively perform actionable counteracts. The ever more widespread use of sophisticated Machine Learning approach to identify anomalies make such explanations more challenging. We present the Decision Tree Outlier Regressor (DTOR), a technique for producing rule-based explanations for individual data points by estimating anomaly scores generated by an anomaly detection model. This is accomplished by first applying a Decision Tree Regressor, which computes the estimation score, and then extracting the relative path associated with the data point score. Our results demonstrate the robustness of DTOR even in datasets with a large number of features. Additionally, in contrast to other rule-based approaches, the generated rules are consistently satisfied by the points to be explained. Furthermore, our evaluation metrics indicate comparable performance to Anchors in outlier explanation tasks, with reduced execution time.  ( 2 min )
    Probabilistic World Modeling with Asymmetric Distance Measure
    arXiv:2403.10875v1 Announce Type: new Abstract: Representation learning is a fundamental task in machine learning, aiming at uncovering structures from data to facilitate subsequent tasks. However, what is a good representation for planning and reasoning in a stochastic world remains an open problem. In this work, we posit that learning a distance function is essential to allow planning and reasoning in the representation space. We show that a geometric abstraction of the probabilistic world dynamics can be embedded into the representation space through asymmetric contrastive learning. Unlike previous approaches that focus on learning mutual similarity or compatibility measures, we instead learn an asymmetric similarity function that reflects the state reachability and allows multi-way probabilistic inference. Moreover, by conditioning on a common reference state (e.g. the observer's current state), the learned representation space allows us to discover the geometrically salient states that only a handful of paths can lead through. These states can naturally serve as subgoals to break down long-horizon planning tasks. We evaluate our method in gridworld environments with various layouts and demonstrate its effectiveness in discovering the subgoals.  ( 2 min )
    Reinforcement Learning with Options
    arXiv:2403.10855v1 Announce Type: new Abstract: The current thesis aims to explore the reinforcement learning field and build on existing methods to produce improved ones to tackle the problem of learning in high-dimensional and complex environments. It addresses such goals by decomposing learning tasks in a hierarchical fashion known as Hierarchical Reinforcement Learning. We start in the first chapter by getting familiar with the Markov Decision Process framework and presenting some of its recent techniques that the following chapters use. We then proceed to build our Hierarchical Policy learning as an answer to the limitations of a single primitive policy. The hierarchy is composed of a manager agent at the top and employee agents at the lower level. In the last chapter, which is the core of this thesis, we attempt to learn lower-level elements of the hierarchy independently of the manager level in what is known as the "Eigenoption". Based on the graph structure of the environment, Eigenoptions allow us to build agents that are aware of the geometric and dynamic properties of the environment. Their decision-making has a special property: it is invariant to symmetric transformations of the environment, allowing as a consequence to greatly reduce the complexity of the learning task.  ( 2 min )
    Twin Transformer using Gated Dynamic Learnable Attention mechanism for Fault Detection and Diagnosis in the Tennessee Eastman Process
    arXiv:2403.10842v1 Announce Type: new Abstract: Fault detection and diagnosis (FDD) is a crucial task for ensuring the safety and efficiency of industrial processes. We propose a novel FDD methodology for the Tennessee Eastman Process (TEP), a widely used benchmark for chemical process control. The model employs two separate Transformer branches, enabling independent processing of input data and potential extraction of diverse information. A novel attention mechanism, Gated Dynamic Learnable Attention (GDLAttention), is introduced which integrates a gating mechanism and dynamic learning capabilities. The gating mechanism modulates the attention weights, allowing the model to focus on the most relevant parts of the input. The dynamic learning approach adapts the attention strategy during training, potentially leading to improved performance. The attention mechanism uses a bilinear similarity function, providing greater flexibility in capturing complex relationships between query and key vectors. In order to assess the effectiveness of our approach, we tested it against 21 and 18 distinct fault scenarios in TEP, and compared its performance with several established FDD techniques. The outcomes indicate that the method outperforms others in terms of accuracy, false alarm rate, and misclassification rate. This underscores the robustness and efficacy of the approach for FDD in intricate industrial processes.  ( 3 min )
    Just Say the Name: Online Continual Learning with Category Names Only via Data Generation
    arXiv:2403.10853v1 Announce Type: new Abstract: In real-world scenarios, extensive manual annotation for continual learning is impractical due to prohibitive costs. Although prior arts, influenced by large-scale webly supervised training, suggest leveraging web-scraped data in continual learning, this poses challenges such as data imbalance, usage restrictions, and privacy concerns. Addressing the risks of continual webly supervised training, we present an online continual learning framework - Generative Name only Continual Learning (G-NoCL). The proposed G-NoCL uses a set of generators G along with the learner. When encountering new concepts (i.e., classes), G-NoCL employs the novel sample complexity-guided data ensembling technique DIverSity and COmplexity enhancing ensemBlER (DISCOBER) to optimally sample training data from generated data. Through extensive experimentation, we demonstrate superior performance of DISCOBER in G-NoCL online CL benchmarks, covering both In-Distribution (ID) and Out-of-Distribution (OOD) generalization evaluations, compared to naive generator-ensembling, web-supervised, and manually annotated data.  ( 2 min )
    LookALike: Human Mimicry based collaborative decision making
    arXiv:2403.10824v1 Announce Type: new Abstract: Artificial General Intelligence falls short when communicating role specific nuances to other systems. This is more pronounced when building autonomous LLM agents capable and designed to communicate with each other for real world problem solving. Humans can communicate context and domain specific nuances along with knowledge, and that has led to refinement of skills. In this work we propose and evaluate a novel method that leads to knowledge distillation among LLM agents leading to realtime human role play preserving unique contexts without relying on any stored data or pretraining. We also evaluate how our system performs better in simulated real world tasks compared to state of the art.  ( 2 min )
    Incentivized Exploration of Non-Stationary Stochastic Bandits
    arXiv:2403.10819v1 Announce Type: new Abstract: We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two different non-stationary environments: abruptly-changing and continuously-changing, and propose respective incentivized exploration algorithms. We show that the proposed algorithms achieve sublinear regret and compensation over time, thus effectively incentivizing exploration despite the nonstationarity and the biased or drifted feedback.  ( 2 min )
    Enhancing Out-of-Distribution Detection with Multitesting-based Layer-wise Feature Fusion
    arXiv:2403.10803v1 Announce Type: new Abstract: Deploying machine learning in open environments presents the challenge of encountering diverse test inputs that differ significantly from the training data. These out-of-distribution samples may exhibit shifts in local or global features compared to the training distribution. The machine learning (ML) community has responded with a number of methods aimed at distinguishing anomalous inputs from original training data. However, the majority of previous studies have primarily focused on the output layer or penultimate layer of pre-trained deep neural networks. In this paper, we propose a novel framework, Multitesting-based Layer-wise Out-of-Distribution (OOD) Detection (MLOD), to identify distributional shifts in test samples at different levels of features through rigorous multiple testing procedure. Our approach distinguishes itself from existing methods as it does not require modifying the structure or fine-tuning of the pre-trained classifier. Through extensive experiments, we demonstrate that our proposed framework can seamlessly integrate with any existing distance-based inspection method while efficiently utilizing feature extractors of varying depths. Our scheme effectively enhances the performance of out-of-distribution detection when compared to baseline methods. In particular, MLOD-Fisher achieves superior performance in general. When trained using KNN on CIFAR10, MLOD-Fisher significantly lowers the false positive rate (FPR) from 24.09% to 7.47% on average compared to merely utilizing the features of the last layer.  ( 2 min )
    FlyKD: Graph Knowledge Distillation on the Fly with Curriculum Learning
    arXiv:2403.10807v1 Announce Type: new Abstract: Knowledge Distillation (KD) aims to transfer a more capable teacher model's knowledge to a lighter student model in order to improve the efficiency of the model, making it faster and more deployable. However, the student model's optimization process over the noisy pseudo labels (generated by the teacher model) is tricky and the amount of pseudo labels one can generate is limited due to Out of Memory (OOM) error. In this paper, we propose FlyKD (Knowledge Distillation on the Fly) which enables the generation of virtually unlimited number of pseudo labels, coupled with Curriculum Learning that greatly alleviates the optimization process over the noisy pseudo labels. Empirically, we observe that FlyKD outperforms vanilla KD and the renown Local Structure Preserving Graph Convolutional Network (LSPGCN). Lastly, with the success of Curriculum Learning, we shed light on a new research direction of improving optimization over noisy pseudo labels.  ( 2 min )
    Anomaly Detection Based on Isolation Mechanisms: A Survey
    arXiv:2403.10802v1 Announce Type: new Abstract: Anomaly detection is a longstanding and active research area that has many applications in domains such as finance, security, and manufacturing. However, the efficiency and performance of anomaly detection algorithms are challenged by the large-scale, high-dimensional, and heterogeneous data that are prevalent in the era of big data. Isolation-based unsupervised anomaly detection is a novel and effective approach for identifying anomalies in data. It relies on the idea that anomalies are few and different from normal instances, and thus can be easily isolated by random partitioning. Isolation-based methods have several advantages over existing methods, such as low computational complexity, low memory usage, high scalability, robustness to noise and irrelevant features, and no need for prior knowledge or heavy parameter tuning. In this survey, we review the state-of-the-art isolation-based anomaly detection methods, including their data partitioning strategies, anomaly score functions, and algorithmic details. We also discuss some extensions and applications of isolation-based methods in different scenarios, such as detecting anomalies in streaming data, time series, trajectory, and image datasets. Finally, we identify some open challenges and future directions for isolation-based anomaly detection research.  ( 2 min )
    Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders
    arXiv:2403.10800v1 Announce Type: new Abstract: When evaluating the performance of a pre-trained model transferred to a downstream task, it is imperative to assess not only the in-distribution (ID) accuracy of the downstream model but also its capacity to generalize and identify out-of-distribution (OOD) samples. In this paper, we unveil the hidden costs associated with intrusive fine-tuning techniques. Specifically, we demonstrate that commonly used fine-tuning methods not only distort the representations necessary for generalizing to covariate-shifted OOD samples (OOD generalization) but also distort the representations necessary for detecting semantically-shifted OOD samples (OOD detection). To address these challenges, we introduce a new model reprogramming approach for fine-tuning, which we name Reprogrammer. Reprogrammer aims to improve the holistic performance of the downstream model across ID, OOD generalization, and OOD detection tasks. Our empirical evidence reveals that Reprogrammer is less intrusive and yields superior downstream models. Furthermore, we demonstrate that by appending an additional representation residual connection to Reprogrammer, we can further preserve pre-training representations, resulting in an even more safe and robust downstream model capable of excelling in many ID classification, OOD generalization, and OOD detection settings.  ( 2 min )
    Time Series Representation Learning with Supervised Contrastive Temporal Transformer
    arXiv:2403.10787v1 Announce Type: new Abstract: Finding effective representations for time series data is a useful but challenging task. Several works utilize self-supervised or unsupervised learning methods to address this. However, there still remains the open question of how to leverage available label information for better representations. To answer this question, we exploit pre-existing techniques in time series and representation learning domains and develop a simple, yet novel fusion model, called: \textbf{S}upervised \textbf{CO}ntrastive \textbf{T}emporal \textbf{T}ransformer (SCOTT). We first investigate suitable augmentation methods for various types of time series data to assist with learning change-invariant representations. Secondly, we combine Transformer and Temporal Convolutional Networks in a simple way to efficiently learn both global and local features. Finally, we simplify Supervised Contrastive Loss for representation learning of labelled time series data. We preliminarily evaluate SCOTT on a downstream task, Time Series Classification, using 45 datasets from the UCR archive. The results show that with the representations learnt by SCOTT, even a weak classifier can perform similar to or better than existing state-of-the-art models (best performance on 23/45 datasets and highest rank against 9 baseline models). Afterwards, we investigate SCOTT's ability to address a real-world task, online Change Point Detection (CPD), on two datasets: a human activity dataset and a surgical patient dataset. We show that the model performs with high reliability and efficiency on the online CPD problem ($\sim$98\% and $\sim$97\% area under precision-recall curve respectively). Furthermore, we demonstrate the model's potential in tackling early detection and show it performs best compared to other candidates.  ( 3 min )
    A Probabilistic Approach for Alignment with Human Comparisons
    arXiv:2403.10771v1 Announce Type: new Abstract: A growing trend involves integrating human knowledge into learning frameworks, leveraging subtle human feedback to refine AI models. Despite these advances, no comprehensive theoretical framework describing the specific conditions under which human comparisons improve the traditional supervised fine-tuning process has been developed. To bridge this gap, this paper studies the effective use of human comparisons to address limitations arising from noisy data and high-dimensional models. We propose a two-stage "Supervised Fine Tuning+Human Comparison" (SFT+HC) framework connecting machine learning with human feedback through a probabilistic bisection approach. The two-stage framework first learns low-dimensional representations from noisy-labeled data via an SFT procedure, and then uses human comparisons to improve the model alignment. To examine the efficacy of the alignment phase, we introduce a novel concept termed the "label-noise-to-comparison-accuracy" (LNCA) ratio. This paper theoretically identifies the conditions under which the "SFT+HC" framework outperforms pure SFT approach, leveraging this ratio to highlight the advantage of incorporating human evaluators in reducing sample complexity. We validate that the proposed conditions for the LNCA ratio are met in a case study conducted via an Amazon Mechanical Turk experiment.  ( 2 min )
    ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference
    arXiv:2403.10766v1 Announce Type: new Abstract: Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, the solution has mostly relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result -- a neural-network-based inference machine -- remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effects in general. We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method.  ( 2 min )
    Horizon-Free Regret for Linear Markov Decision Processes
    arXiv:2403.10738v1 Announce Type: new Abstract: A recent line of works showed regret bounds in reinforcement learning (RL) can be (nearly) independent of planning horizon, a.k.a.~the horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of transition model is allowed, such as tabular Markov Decision Process (MDP) and linear mixture MDP. We give the first horizon-free bound for the popular linear MDP setting where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by: (1) maintaining multiple weighted least square estimators for the value functions; and (2) a structural lemma which shows the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.  ( 2 min )
    Counterfactual Analysis of Neural Networks Used to Create Fertilizer Management Zones
    arXiv:2403.10730v1 Announce Type: new Abstract: In Precision Agriculture, the utilization of management zones (MZs) that take into account within-field variability facilitates effective fertilizer management. This approach enables the optimization of nitrogen (N) rates to maximize crop yield production and enhance agronomic use efficiency. However, existing works often neglect the consideration of responsivity to fertilizer as a factor influencing MZ determination. In response to this gap, we present a MZ clustering method based on fertilizer responsivity. We build upon the statement that the responsivity of a given site to the fertilizer rate is described by the shape of its corresponding N fertilizer-yield response (N-response) curve. Thus, we generate N-response curves for all sites within the field using a convolutional neural network (CNN). The shape of the approximated N-response curves is then characterized using functional principal component analysis. Subsequently, a counterfactual explanation (CFE) method is applied to discern the impact of various variables on MZ membership. The genetic algorithm-based CFE solves a multi-objective optimization problem and aims to identify the minimum combination of features needed to alter a site's cluster assignment. Results from two yield prediction datasets indicate that the features with the greatest influence on MZ membership are associated with terrain characteristics that either facilitate or impede fertilizer runoff, such as terrain slope or topographic aspect.  ( 2 min )
    Variance-Dependent Regret Bounds for Non-stationary Linear Bandits
    arXiv:2403.10732v1 Announce Type: new Abstract: We investigate the non-stationary stochastic linear bandit problem where the reward distribution evolves each round. Existing algorithms characterize the non-stationarity by the total variation budget $B_K$, which is the summation of the change of the consecutive feature vectors of the linear bandits over $K$ rounds. However, such a quantity only measures the non-stationarity with respect to the expectation of the reward distribution, which makes existing algorithms sub-optimal under the general non-stationary distribution setting. In this work, we propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds. Specifically, we introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$. These algorithms address cases where the variance information of the rewards is known and unknown, respectively. Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary stochastic linear bandits under different settings. Experimental evaluations further validate the superior performance of our proposed algorithms over existing works.  ( 2 min )
    Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency
    arXiv:2403.10717v1 Announce Type: new Abstract: Modern machine learning (ML) systems demand substantial training data, often resorting to external sources. Nevertheless, this practice renders them vulnerable to backdoor poisoning attacks. Prior backdoor defense strategies have primarily focused on the identification of backdoored models or poisoned data characteristics, typically operating under the assumption of access to clean data. In this work, we delve into a relatively underexplored challenge: the automatic identification of backdoor data within a poisoned dataset, all under realistic conditions, i.e., without the need for additional clean data or without manually defining a threshold for backdoor detection. We draw an inspiration from the scaled prediction consistency (SPC) technique, which exploits the prediction invariance of poisoned data to an input scaling factor. Based on this, we pose the backdoor data identification problem as a hierarchical data splitting optimization problem, leveraging a novel SPC-based loss function as the primary optimization objective. Our innovation unfolds in several key aspects. First, we revisit the vanilla SPC method, unveiling its limitations in addressing the proposed backdoor identification problem. Subsequently, we develop a bi-level optimization-based approach to precisely identify backdoor data by minimizing the advanced SPC loss. Finally, we demonstrate the efficacy of our proposal against a spectrum of backdoor attacks, encompassing basic label-corrupted attacks as well as more sophisticated clean-label attacks, evaluated across various benchmark datasets. Experiment results show that our approach often surpasses the performance of current baselines in identifying backdoor data points, resulting in about 4%-36% improvement in average AUROC. Codes are available at https://github.com/OPTML-Group/BackdoorMSPC.  ( 3 min )
    PERL: Parameter Efficient Reinforcement Learning from Human Feedback
    arXiv:2403.10704v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Pretrained Large Language Models (LLMs) with human preferences. But training models with RLHF is computationally expensive, and an overall complex process. In this work, we study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al. [2021]. We investigate the setup of "Parameter Efficient Reinforcement Learning" (PERL), in which we perform reward model training and reinforcement learning using LoRA. We compare PERL to conventional fine-tuning (full-tuning) across various configurations for 7 benchmarks, including 2 novel datasets, of reward modeling and reinforcement learning. We find that PERL performs on par with the conventional RLHF setting, while training faster, and with less memory. This enables the high performance of RLHF, while reducing the computational burden that limits its adoption as an alignment technique for Large Language Models. We also release 2 novel thumbs up/down preference datasets: "Taskmaster Coffee", and "Taskmaster Ticketing" to promote research around RLHF.  ( 3 min )
    A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks
    arXiv:2403.10646v1 Announce Type: new Abstract: Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.  ( 2 min )
    Improving Fairness in Credit Lending Models using Subgroup Threshold Optimization
    arXiv:2403.10652v1 Announce Type: new Abstract: In an effort to improve the accuracy of credit lending decisions, many financial intuitions are now using predictions from machine learning models. While such predictions enjoy many advantages, recent research has shown that the predictions have the potential to be biased and unfair towards certain subgroups of the population. To combat this, several techniques have been introduced to help remove the bias and improve the overall fairness of the predictions. We introduce a new fairness technique, called \textit{Subgroup Threshold Optimizer} (\textit{STO}), that does not require any alternations to the input training data nor does it require any changes to the underlying machine learning algorithm, and thus can be used with any existing machine learning pipeline. STO works by optimizing the classification thresholds for individual subgroups in order to minimize the overall discrimination score between them. Our experiments on a real-world credit lending dataset show that STO can reduce gender discrimination by over 90\%.  ( 2 min )
    Using Uncertainty Quantification to Characterize and Improve Out-of-Domain Learning for PDEs
    arXiv:2403.10642v1 Announce Type: new Abstract: Existing work in scientific machine learning (SciML) has shown that data-driven learning of solution operators can provide a fast approximate alternative to classical numerical partial differential equation (PDE) solvers. Of these, Neural Operators (NOs) have emerged as particularly promising. We observe that several uncertainty quantification (UQ) methods for NOs fail for test inputs that are even moderately out-of-domain (OOD), even when the model approximates the solution well for in-domain tasks. To address this limitation, we show that ensembling several NOs can identify high-error regions and provide good uncertainty estimates that are well-correlated with prediction errors. Based on this, we propose a cost-effective alternative, DiverseNO, that mimics the properties of the ensemble by encouraging diverse predictions from its multiple heads in the last feed-forward layer. We then introduce Operator-ProbConserv, a method that uses these well-calibrated UQ estimates within the ProbConserv framework to update the model. Our empirical results show that Operator-ProbConserv enhances OOD model performance for a variety of challenging PDE problems and satisfies physical constraints such as conservation laws.  ( 2 min )
    A resource-constrained stochastic scheduling algorithm for homeless street outreach and gleaning edible food
    arXiv:2403.10638v1 Announce Type: new Abstract: We developed a common algorithmic solution addressing the problem of resource-constrained outreach encountered by social change organizations with different missions and operations: Breaking Ground -- an organization that helps individuals experiencing homelessness in New York transition to permanent housing and Leket -- the national food bank of Israel that rescues food from farms and elsewhere to feed the hungry. Specifically, we developed an estimation and optimization approach for partially-observed episodic restless bandits under $k$-step transitions. The results show that our Thompson sampling with Markov chain recovery (via Stein variational gradient descent) algorithm significantly outperforms baselines for the problems of both organizations. We carried out this work in a prospective manner with the express goal of devising a flexible-enough but also useful-enough solution that can help overcome a lack of sustainable impact in data science for social good.  ( 2 min )
    DiPaCo: Distributed Path Composition
    arXiv:2403.10616v1 Announce Type: new Abstract: Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, Our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach as a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same amount of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.  ( 2 min )
    Sequential Monte Carlo for Inclusive KL Minimization in Amortized Variational Inference
    arXiv:2403.10610v1 Announce Type: new Abstract: For training an encoder network to perform amortized variational inference, the Kullback-Leibler (KL) divergence from the exact posterior to its approximation, known as the inclusive or forward KL, is an increasingly popular choice of variational objective due to the mass-covering property of its minimizer. However, minimizing this objective is challenging. A popular existing approach, Reweighted Wake-Sleep (RWS), suffers from heavily biased gradients and a circular pathology that results in highly concentrated variational distributions. As an alternative, we propose SMC-Wake, a procedure for fitting an amortized variational approximation that uses likelihood-tempered sequential Monte Carlo samplers to estimate the gradient of the inclusive KL divergence. We propose three gradient estimators, all of which are asymptotically unbiased in the number of iterations and two of which are strongly consistent. Our method interleaves stochastic gradient updates, SMC samplers, and iterative improvement to an estimate of the normalizing constant to reduce bias from self-normalization. In experiments with both simulated and real datasets, SMC-Wake fits variational distributions that approximate the posterior more accurately than existing methods.  ( 2 min )
    From Algorithms to Outcomes: Reviewing AI's Role in Non-Muscle-Invasive Bladder Cancer Recurrence Prediction
    arXiv:2403.10586v1 Announce Type: new Abstract: Bladder cancer, the leading urinary tract cancer, is responsible for 15 deaths daily in the UK. This cancer predominantly manifests as non-muscle-invasive bladder cancer (NMIBC), characterised by tumours not yet penetrating the muscle layer of the bladder wall. NMIBC is plagued by a very high recurrence rate of 70-80% and hence the costliest treatments. Current tools for predicting recurrence use scoring systems that overestimate risk and have poor accuracy. Inaccurate and delayed prediction of recurrence significantly elevates the likelihood of mortality. Accurate prediction of recurrence is hence vital for cost-effective management and treatment planning. This is where Machine learning (ML) techniques have emerged as a promising approach for predicting NMIBC recurrence by leveraging molecular and clinical data. This review provides a comprehensive analysis of ML approaches for predicting NMIBC recurrence. Our systematic evaluation demonstrates the potential of diverse ML algorithms and markers, including radiomic, clinical, histopathological, genomic, and biochemical data in enhancing recurrence prediction and personalised patient management. We summarise various prediction tasks, data modalities, and ML models used, highlighting their performance, limitations, and future directions of incorporating cost-effectiveness. Challenges related to generalisability and interpretability of artificial intelligent models are discussed, emphasising the need for collaborative efforts and robust datasets.  ( 2 min )
    Achieving Pareto Optimality using Efficient Parameter Reduction for DNNs in Resource-Constrained Edge Environment
    arXiv:2403.10569v1 Announce Type: new Abstract: This paper proposes an optimization of an existing Deep Neural Network (DNN) that improves its hardware utilization and facilitates on-device training for resource-constrained edge environments. We implement efficient parameter reduction strategies on Xception that shrink the model size without sacrificing accuracy, thus decreasing memory utilization during training. We evaluate our model in two experiments: Caltech-101 image classification and PCB defect detection and compare its performance against the original Xception and lightweight models, EfficientNetV2B1 and MobileNetV2. The results of the Caltech-101 image classification show that our model has a better test accuracy (76.21%) than Xception (75.89%), uses less memory on average (847.9MB) than Xception (874.6MB), and has faster training and inference times. The lightweight models overfit with EfficientNetV2B1 having a 30.52% test accuracy and MobileNetV2 having a 58.11% test accuracy. Both lightweight models have better memory usage than our model and Xception. On the PCB defect detection, our model has the best test accuracy (90.30%), compared to Xception (88.10%), EfficientNetV2B1 (55.25%), and MobileNetV2 (50.50%). MobileNetV2 has the least average memory usage (849.4MB), followed by our model (865.8MB), then EfficientNetV2B1 (874.8MB), and Xception has the highest (893.6MB). We further experiment with pre-trained weights and observe that memory usage decreases thereby showing the benefits of transfer learning. A Pareto analysis of the models' performance shows that our optimized model architecture satisfies accuracy and low memory utilization objectives.  ( 3 min )
    MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
    arXiv:2403.10568v1 Announce Type: new Abstract: Prompt-tuning has demonstrated parameter-efficiency in fusing unimodal foundation models for multimodal tasks. However, its limited adaptivity and expressiveness lead to suboptimal performance when compared with other tuning methods. In this paper, we address this issue by disentangling the vanilla prompts to adaptively capture dataset-level and instance-level features. Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness. MoPE leverages multimodal pairing priors to route the most effective prompt on a per-instance basis. Compared to vanilla prompting, our MoPE-based conditional prompting exhibits greater expressiveness for multimodal fusion, scaling better with the training data and the overall number of trainable parameters. We also study a regularization term for expert routing, leading to emergent expert specialization, where different experts focus on different concepts, enabling interpretable soft prompting. Extensive experiments across three multimodal datasets demonstrate that our method achieves state-of-the-art results, matching or even surpassing the performance of fine-tuning, while requiring only 0.8% of the trainable parameters. Code will be released: https://github.com/songrise/MoPE.  ( 2 min )
    Uncertainty estimation in spatial interpolation of satellite precipitation with ensemble learning
    arXiv:2403.10567v1 Announce Type: new Abstract: Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions in a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond.  ( 2 min )
    Cooling-Guide Diffusion Model for Battery Cell Arrangement
    arXiv:2403.10566v1 Announce Type: new Abstract: Our study introduces a Generative AI method that employs a cooling-guided diffusion model to optimize the layout of battery cells, a crucial step for enhancing the cooling performance and efficiency of battery thermal management systems. Traditional design processes, which rely heavily on iterative optimization and extensive guesswork, are notoriously slow and inefficient, often leading to suboptimal solutions. In contrast, our innovative method uses a parametric denoising diffusion probabilistic model (DDPM) with classifier and cooling guidance to generate optimized cell layouts with enhanced cooling paths, significantly lowering the maximum temperature of the cells. By incorporating position-based classifier guidance, we ensure the feasibility of generated layouts. Meanwhile, cooling guidance directly optimizes cooling-efficiency, making our approach uniquely effective. When compared to two advanced models, the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) and the Conditional Tabular GAN (CTGAN), our cooling-guided diffusion model notably outperforms both. It is five times more effective than TabDDPM and sixty-six times better than CTGAN across key metrics such as feasibility, diversity, and cooling efficiency. This research marks a significant leap forward in the field, aiming to optimize battery cell layouts for superior cooling efficiency, thus setting the stage for the development of more effective and dependable battery thermal management systems.  ( 2 min )
    A collection of the accepted papers for the Human-Centric Representation Learning workshop at AAAI 2024
    arXiv:2403.10561v1 Announce Type: new Abstract: This non-archival index is not complete, as some accepted papers chose to opt-out of inclusion. The list of all accepted papers is available on the workshop website.  ( 2 min )
    Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models
    arXiv:2403.10557v1 Announce Type: new Abstract: With the rapid development of Large Language Models (LLMs), we have witnessed intense competition among the major LLM products like ChatGPT, LLaMa, and Gemini. However, various issues (e.g. privacy leakage and copyright violation) of the training corpus still remain underexplored. For example, the Times sued OpenAI and Microsoft for infringing on its copyrights by using millions of its articles for training. From the perspective of LLM practitioners, handling such unintended privacy violations can be challenging. Previous work addressed the ``unlearning" problem of LLMs using gradient information, while they mostly introduced significant overheads like data preprocessing or lacked robustness. In this paper, contrasting with the methods based on first-order information, we revisit the unlearning problem via the perspective of second-order information (Hessian). Our unlearning algorithms, which are inspired by classic Newton update, are not only data-agnostic/model-agnostic but also proven to be robust in terms of utility preservation or privacy guarantee. Through a comprehensive evaluation with four NLP datasets as well as a case study on real-world datasets, our methods consistently show superiority over the first-order methods.  ( 2 min )
    Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
    arXiv:2403.10559v1 Announce Type: new Abstract: This report investigates the history and impact of Generative Models and Connected and Automated Vehicles (CAVs), two groundbreaking forces pushing progress in technology and transportation. By focusing on the application of generative models within the context of CAVs, the study aims to unravel how this integration could enhance predictive modeling, simulation accuracy, and decision-making processes in autonomous vehicles. This thesis discusses the benefits and challenges of integrating generative models and CAV technology in transportation. It aims to highlight the progress made, the remaining obstacles, and the potential for advancements in safety and innovation.  ( 2 min )
    KARINA: An Efficient Deep Learning Model for Global Weather Forecast
    arXiv:2403.10555v1 Announce Type: new Abstract: Deep learning-based, data-driven models are gaining prevalence in climate research, particularly for global weather prediction. However, training the global weather data at high resolution requires massive computational resources. Therefore, we present a new model named KARINA to overcome the substantial computational demands typical of this field. This model achieves forecasting accuracy comparable to higher-resolution counterparts with significantly less computational resources, requiring only 4 NVIDIA A100 GPUs and less than 12 hours of training. KARINA combines ConvNext, SENet, and Geocyclic Padding to enhance weather forecasting at a 2.5{\deg} resolution, which could filter out high-frequency noise. Geocyclic Padding preserves pixels at the lateral boundary of the input image, thereby maintaining atmospheric flow continuity in the spherical Earth. SENet dynamically improves feature response, advancing atmospheric process modeling, particularly in the vertical column process as numerous channels. In this vein, KARINA sets new benchmarks in weather forecasting accuracy, surpassing existing models like the ECMWF S2S reforecasts at a lead time of up to 7 days. Remarkably, KARINA achieved competitive performance even when compared to the recently developed models (Pangu-Weather, GraphCast, ClimaX, and FourCastNet) trained with high-resolution data having 100 times larger pixels. Conclusively, KARINA significantly advances global weather forecasting by efficiently modeling Earth's atmosphere with improved accuracy and resource efficiency.  ( 2 min )
    Learning to Watermark LLM-generated Text via Reinforcement Learning
    arXiv:2403.10553v1 Announce Type: new Abstract: We study how to watermark LLM outputs, i.e. embedding algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermark that embeds signals into the output, we design a model-level watermark that embeds signals into the LLM weights, and such signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). It also allows watermarked model open-sourcing. In addition, if used together with alignment, the extra overhead introduced is low - only training an extra reward model (i.e. our detector). We hope our work can bring more effort into studying a broader watermark design that is not limited to working with a fixed LLM. We open-source the code: https://github.com/xiaojunxu/learning-to-watermark-llm .  ( 2 min )
    Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge Transfer
    arXiv:2403.10552v1 Announce Type: new Abstract: A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.  ( 3 min )
    Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows
    arXiv:2403.10550v1 Announce Type: new Abstract: With the rapid development of the Internet, various types of anomaly traffic are threatening network security. We consider the problem of anomaly network traffic detection and propose a three-stage anomaly detection framework using only normal traffic. Our framework can generate pseudo anomaly samples without prior knowledge of anomalies to achieve the detection of anomaly data. Firstly, we employ a reconstruction method to learn the deep representation of normal samples. Secondly, these representations are normalized to a standard normal distribution using a bidirectional flow module. To simulate anomaly samples, we add noises to the normalized representations which are then passed through the generation direction of the bidirectional flow module. Finally, a simple classifier is trained to differentiate the normal samples and pseudo anomaly samples in the latent space. During inference, our framework requires only two modules to detect anomalous samples, leading to a considerable reduction in model size. According to the experiments, our method achieves the state of-the-art results on the common benchmarking datasets of anomaly network traffic detection. The code is given in the https://github.com/ZxuanDang/ATD-via-Flows.git  ( 2 min )
  • Open

    COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits
    arXiv:2403.11348v1 Announce Type: cross Abstract: Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during the inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2.  ( 2 min )
    Language Model Knowledge Distillation for Efficient Question Answering in Spanish
    arXiv:2312.04193v2 Announce Type: replace-cross Abstract: Recent advances in the development of pre-trained Spanish language models has led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could be proven to be highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, whilst attaining negligible performance sacrifice. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speedup. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.  ( 2 min )
    Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory
    arXiv:2403.11968v1 Announce Type: cross Abstract: Conditional diffusion models serve as the foundation of modern image synthesis and find extensive application in fields like computational biology and reinforcement learning. In these applications, conditional diffusion models incorporate various conditional information, such as prompt input, to guide the sample generation towards desired properties. Despite the empirical success, theory of conditional diffusion models is largely missing. This paper bridges this gap by presenting a sharp statistical theory of distribution estimation using conditional diffusion models. Our analysis yields a sample complexity bound that adapts to the smoothness of the data distribution and matches the minimax lower bound. The key to our theoretical development lies in an approximation result for the conditional score function, which relies on a novel diffused Taylor approximation technique. Moreover, we demonstrate the utility of our statistical theory in elucidating the performance of conditional diffusion models across diverse applications, including model-based transition kernel estimation in reinforcement learning, solving inverse problems, and reward conditioned sample generation.  ( 2 min )
    CASPER: Causality-Aware Spatiotemporal Graph Neural Networks for Spatiotemporal Time Series Imputation
    arXiv:2403.11960v1 Announce Type: cross Abstract: Spatiotemporal time series is the foundation of understanding human activities and their impacts, which is usually collected via monitoring sensors placed at different locations. The collected data usually contains missing values due to various failures, which have significant impact on data analysis. To impute the missing values, a lot of methods have been introduced. When recovering a specific data point, most existing methods tend to take into consideration all the information relevant to that point regardless of whether they have a cause-and-effect relationship. During data collection, it is inevitable that some unknown confounders are included, e.g., background noise in time series and non-causal shortcut edges in the constructed sensor network. These confounders could open backdoor paths between the input and output, in other words, they establish non-causal correlations between the input and output. Over-exploiting these non-causal correlations could result in overfitting and make the model vulnerable to noises. In this paper, we first revisit spatiotemporal time series imputation from a causal perspective, which shows the causal relationships among the input, output, embeddings and confounders. Next, we show how to block the confounders via the frontdoor adjustment. Based on the results of the frontdoor adjustment, we introduce a novel Causality-Aware SPatiotEmpoRal graph neural network (CASPER), which contains a novel Spatiotemporal Causal Attention (SCA) and a Prompt Based Decoder (PBD). PBD could reduce the impact of confounders and SCA could discover the sparse causal relationships among embeddings. Theoretical analysis reveals that SCA discovers causal relationships based on the values of gradients. We evaluate Casper on three real-world datasets, and the experimental results show that Casper outperforms the baselines and effectively discovers causal relationships.  ( 3 min )
    Unconstrained Stochastic CCA: Unifying Multiview and Self-Supervised Learning
    arXiv:2310.01012v3 Announce Type: replace-cross Abstract: The Canonical Correlation Analysis (CCA) family of methods is foundational in multiview learning. Regularised linear CCA methods can be seen to generalise Partial Least Squares (PLS) and be unified with a Generalized Eigenvalue Problem (GEP) framework. However, classical algorithms for these linear methods are computationally infeasible for large-scale data. Extensions to Deep CCA show great promise, but current training procedures are slow and complicated. First we propose a novel unconstrained objective that characterizes the top subspace of GEPs. Our core contribution is a family of fast algorithms for stochastic PLS, stochastic CCA, and Deep CCA, simply obtained by applying stochastic gradient descent (SGD) to the corresponding CCA objectives. Our algorithms show far faster convergence and recover higher correlations than the previous state-of-the-art on all standard CCA and Deep CCA benchmarks. These improvements allow us to perform a first-of-its-kind PLS analysis of an extremely large biomedical dataset from the UK Biobank, with over 33,000 individuals and 500,000 features. Finally, we apply our algorithms to match the performance of `CCA-family' Self-Supervised Learning (SSL) methods on CIFAR-10 and CIFAR-100 with minimal hyper-parameter tuning, and also present theory to clarify the links between these methods and classical CCA, laying the groundwork for future insights.  ( 3 min )
    Generative Modelling of Stochastic Rotating Shallow Water Noise
    arXiv:2403.10578v1 Announce Type: new Abstract: In recent work, the authors have developed a generic methodology for calibrating the noise in fluid dynamics stochastic partial differential equations where the stochasticity was introduced to parametrize subgrid-scale processes. The stochastic parameterization of sub-grid scale processes is required in the estimation of uncertainty in weather and climate predictions, to represent systematic model errors arising from subgrid-scale fluctuations. The previous methodology used a principal component analysis (PCA) technique based on the ansatz that the increments of the stochastic parametrization are normally distributed. In this paper, the PCA technique is replaced by a generative model technique. This enables us to avoid imposing additional constraints on the increments. The methodology is tested on a stochastic rotating shallow water model with the elevation variable of the model used as input data. The numerical simulations show that the noise is indeed non-Gaussian. The generative modelling technology gives good RMSE, CRPS score and forecast rank histogram results.  ( 2 min )
    Neural-Kernel Conditional Mean Embeddings
    arXiv:2403.10859v1 Announce Type: new Abstract: Kernel conditional mean embeddings (CMEs) offer a powerful framework for representing conditional distribution, but they often face scalability and expressiveness challenges. In this work, we propose a new method that effectively combines the strengths of deep learning with CMEs in order to address these challenges. Specifically, our approach leverages the end-to-end neural network (NN) optimization framework using a kernel-based objective. This design circumvents the computationally expensive Gram matrix inversion required by current CME methods. To further enhance performance, we provide efficient strategies to optimize the remaining kernel hyperparameters. In conditional density estimation tasks, our NN-CME hybrid achieves competitive performance and often surpasses existing deep learning-based methods. Lastly, we showcase its remarkable versatility by seamlessly integrating it into reinforcement learning (RL) contexts. Building on Q-learning, our approach naturally leads to a new variant of distributional RL methods, which demonstrates consistent effectiveness across different environments.  ( 2 min )
    Out-of-Distribution Detection Should Use Conformal Prediction (and Vice-versa?)
    arXiv:2403.11532v1 Announce Type: new Abstract: Research on Out-Of-Distribution (OOD) detection focuses mainly on building scores that efficiently distinguish OOD data from In Distribution (ID) data. On the other hand, Conformal Prediction (CP) uses non-conformity scores to construct prediction sets with probabilistic coverage guarantees. In this work, we propose to use CP to better assess the efficiency of OOD scores. Specifically, we emphasize that in standard OOD benchmark settings, evaluation metrics can be overly optimistic due to the finite sample size of the test dataset. Based on the work of (Bates et al., 2022), we define new conformal AUROC and conformal FRP@TPR95 metrics, which are corrections that provide probabilistic conservativeness guarantees on the variability of these metrics. We show the effect of these corrections on two reference OOD and anomaly detection benchmarks, OpenOOD (Yang et al., 2022) and ADBench (Han et al., 2022). We also show that the benefits of using OOD together with CP apply the other way around by using OOD scores as non-conformity scores, which results in improving upon current CP methods. One of the key messages of these contributions is that since OOD is concerned with designing scores and CP with interpreting these scores, the two fields may be inherently intertwined.  ( 2 min )
    Machine learning-based system reliability analysis with Gaussian Process Regression
    arXiv:2403.11125v1 Announce Type: new Abstract: Machine learning-based reliability analysis methods have shown great advancements for their computational efficiency and accuracy. Recently, many efficient learning strategies have been proposed to enhance the computational performance. However, few of them explores the theoretical optimal learning strategy. In this article, we propose several theorems that facilitates such exploration. Specifically, cases that considering and neglecting the correlations among the candidate design samples are well elaborated. Moreover, we prove that the well-known U learning function can be reformulated to the optimal learning function for the case neglecting the Kriging correlation. In addition, the theoretical optimal learning strategy for sequential multiple training samples enrichment is also mathematically explored through the Bayesian estimate with the corresponding lost functions. Simulation results show that the optimal learning strategy considering the Kriging correlation works better than that neglecting the Kriging correlation and other state-of-the art learning functions from the literatures in terms of the reduction of number of evaluations of performance function. However, the implementation needs to investigate very large computational resource.  ( 2 min )
    Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data
    arXiv:2403.11841v1 Announce Type: new Abstract: In real-world scenarios, datasets collected from randomized experiments are often constrained by size, due to limitations in time and budget. As a result, leveraging large observational datasets becomes a more attractive option for achieving high-quality policy learning. However, most existing offline reinforcement learning (RL) methods depend on two key assumptions--unconfoundedness and positivity--which frequently do not hold in observational data contexts. Recognizing these challenges, we propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL). We utilize the mediator variable based on front-door criterion to remove the confounding bias; additionally, we adopt the pessimistic principle to address the distributional shift between the action distributions induced by candidate policies, and the behavior policy that generates the observational data. Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm, by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.  ( 2 min )
    Hessian-Free Laplace in Bayesian Deep Learning
    arXiv:2403.10671v1 Announce Type: new Abstract: The Laplace approximation (LA) of the Bayesian posterior is a Gaussian distribution centered at the maximum a posteriori estimate. Its appeal in Bayesian deep learning stems from the ability to quantify uncertainty post-hoc (i.e., after standard network parameter optimization), the ease of sampling from the approximate posterior, and the analytic form of model evidence. However, an important computational bottleneck of LA is the necessary step of calculating and inverting the Hessian matrix of the log posterior. The Hessian may be approximated in a variety of ways, with quality varying with a number of factors including the network, dataset, and inference task. In this paper, we propose an alternative framework that sidesteps Hessian calculation and inversion. The Hessian-free Laplace (HFL) approximation uses curvature of both the log posterior and network prediction to estimate its variance. Only two point estimates are needed: the standard maximum a posteriori parameter and the optimal parameter under a loss regularized by the network prediction. We show that, under standard assumptions of LA in Bayesian deep learning, HFL targets the same variance as LA, and can be efficiently amortized in a pre-trained network. Experiments demonstrate comparable performance to that of exact and approximate Hessians, with excellent coverage for in-between uncertainty.  ( 2 min )
    Sequential Monte Carlo for Inclusive KL Minimization in Amortized Variational Inference
    arXiv:2403.10610v1 Announce Type: cross Abstract: For training an encoder network to perform amortized variational inference, the Kullback-Leibler (KL) divergence from the exact posterior to its approximation, known as the inclusive or forward KL, is an increasingly popular choice of variational objective due to the mass-covering property of its minimizer. However, minimizing this objective is challenging. A popular existing approach, Reweighted Wake-Sleep (RWS), suffers from heavily biased gradients and a circular pathology that results in highly concentrated variational distributions. As an alternative, we propose SMC-Wake, a procedure for fitting an amortized variational approximation that uses likelihood-tempered sequential Monte Carlo samplers to estimate the gradient of the inclusive KL divergence. We propose three gradient estimators, all of which are asymptotically unbiased in the number of iterations and two of which are strongly consistent. Our method interleaves stochastic gradient updates, SMC samplers, and iterative improvement to an estimate of the normalizing constant to reduce bias from self-normalization. In experiments with both simulated and real datasets, SMC-Wake fits variational distributions that approximate the posterior more accurately than existing methods.  ( 2 min )
    The Fallacy of Minimizing Local Regret in the Sequential Task Setting
    arXiv:2403.10946v1 Announce Type: new Abstract: In the realm of Reinforcement Learning (RL), online RL is often conceptualized as an optimization problem, where an algorithm interacts with an unknown environment to minimize cumulative regret. In a stationary setting, strong theoretical guarantees, like a sublinear ($\sqrt{T}$) regret bound, can be obtained, which typically implies the convergence to an optimal policy and the cessation of exploration. However, these theoretical setups often oversimplify the complexities encountered in real-world RL implementations, where tasks arrive sequentially with substantial changes between tasks and the algorithm may not be allowed to adaptively learn within certain tasks. We study the changes beyond the outcome distributions, encompassing changes in the reward designs (mappings from outcomes to rewards) and the permissible policy spaces. Our results reveal the fallacy of myopically minimizing regret within each task: obtaining optimal regret rates in the early tasks may lead to worse rates in the subsequent ones, even when the outcome distributions stay the same. To realize the optimal cumulative regret bound across all the tasks, the algorithm has to overly explore in the earlier tasks. This theoretical insight is practically significant, suggesting that due to unanticipated changes (e.g., rapid technological development or human-in-the-loop involvement) between tasks, the algorithm needs to explore more than it would in the usual stationary setting within each task. Such implication resonates with the common practice of using clipped policies in mobile health clinical trials and maintaining a fixed rate of $\epsilon$-greedy exploration in robotic learning.  ( 3 min )
    Nonsmooth Implicit Differentiation: Deterministic and Stochastic Convergence Rates
    arXiv:2403.11687v1 Announce Type: new Abstract: We study the problem of efficiently computing the derivative of the fixed-point of a parametric non-differentiable contraction map. This problem has wide applications in machine learning, including hyperparameter optimization, meta-learning and data poisoning attacks. We analyze two popular approaches: iterative differentiation (ITD) and approximate implicit differentiation (AID). A key challenge behind the nonsmooth setting is that the chain rule does not hold anymore. Building upon the recent work by Bolte et al. (2022), who proved the linear convergence of non-differentiable ITD, we provide refined linear convergence rates for both ITD and AID in the deterministic case. We further introduce NSID, a new method to compute the implicit derivative when the fixed point is defined as the composition of an outer map and an inner map which is accessible only through a stochastic unbiased estimator. We establish rates for the convergence of NSID to the true derivative, encompassing the best available rates in the smooth setting. We present illustrative experiments confirming our analysis.  ( 2 min )
    Prior-dependent analysis of posterior sampling reinforcement learning with function approximation
    arXiv:2403.11175v1 Announce Type: new Abstract: This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation; and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of ${\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This signifies a methodological enhancement by optimizing the $\mathcal{O}(\sqrt{\log T})$ factor over the previous benchmark (Osband and Van Roy, 2014) specified to linear mixture MDPs. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses reliant on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.  ( 2 min )
    A Primal-Dual Algorithm for Faster Distributionally Robust Optimization
    arXiv:2403.10763v1 Announce Type: new Abstract: We consider the penalized distributionally robust optimization (DRO) problem with a closed, convex uncertainty set, a setting that encompasses the $f$-DRO, Wasserstein-DRO, and spectral/$L$-risk formulations used in practice. We present Drago, a stochastic primal-dual algorithm that achieves a state-of-the-art linear convergence rate on strongly convex-strongly concave DRO problems. The method combines both randomized and cyclic components with mini-batching, which effectively handles the unique asymmetric nature of the primal and dual problems in DRO. We support our theoretical results with numerical benchmarks in classification and regression.  ( 2 min )
    Function-space Parameterization of Neural Networks for Sequential Learning
    arXiv:2403.10929v1 Announce Type: new Abstract: Sequential learning paradigms pose challenges for gradient-based deep learning due to difficulties incorporating new data and retaining prior knowledge. While Gaussian processes elegantly tackle these problems, they struggle with scalability and handling rich inputs, such as images. To address these issues, we introduce a technique that converts neural networks from weight space to function space, through a dual parameterization. Our parameterization offers: (i) a way to scale function-space methods to large data sets via sparsification, (ii) retention of prior knowledge when access to past data is limited, and (iii) a mechanism to incorporate new data without retraining. Our experiments demonstrate that we can retain knowledge in continual learning and incorporate new data efficiently. We further show its strengths in uncertainty quantification and guiding exploration in model-based RL. Further information and code is available on the project website.  ( 2 min )
    A Novel Gaussian Min-Max Theorem and its Applications
    arXiv:2402.07356v2 Announce Type: replace-cross Abstract: A celebrated result by Gordon allows one to compare the min-max behavior of two Gaussian processes if certain inequality conditions are met. The consequences of this result include the Gaussian min-max (GMT) and convex Gaussian min-max (CGMT) theorems which have had far-reaching implications in high-dimensional statistics, machine learning, non-smooth optimization, and signal processing. Both theorems rely on a pair of Gaussian processes, first identified by Slepian, that satisfy Gordon's comparison inequalities. To date, no other pair of Gaussian processes satisfying these inequalities has been discovered. In this paper, we identify such a new pair. The resulting theorems extend the classical GMT and CGMT Theorems from the case where the underlying Gaussian matrix in the primary process has iid rows to where it has independent but non-identically-distributed ones. The new CGMT is applied to the problems of multi-source Gaussian regression, as well as to binary classification of general Gaussian mixture models.  ( 2 min )
    How Free is Parameter-Free Stochastic Optimization?
    arXiv:2402.03126v2 Announce Type: replace-cross Abstract: We study the problem of parameter-free stochastic optimization, inquiring whether, and under what conditions, do fully parameter-free methods exist: these are methods that achieve convergence rates competitive with optimally tuned methods, without requiring significant knowledge of the true problem parameters. Existing parameter-free methods can only be considered ``partially'' parameter-free, as they require some non-trivial knowledge of the true problem parameters, such as a bound on the stochastic gradient norms, a bound on the distance to a minimizer, etc. In the non-convex setting, we demonstrate that a simple hyperparameter search technique results in a fully parameter-free method that outperforms more sophisticated state-of-the-art algorithms. We also provide a similar result in the convex setting with access to noisy function values under mild noise assumptions. Finally, assuming only access to stochastic gradients, we establish a lower bound that renders fully parameter-free stochastic convex optimization infeasible, and provide a method which is (partially) parameter-free up to the limit indicated by our lower bound.  ( 2 min )
    Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis
    arXiv:2401.10383v2 Announce Type: replace-cross Abstract: In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arrival at each node, agents observe a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture some transformation of the reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(\gamma N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$ and $\gamma$ a boundedness parameter associated with the weight functions. Lastly, we numerically test our algorithm by comparing it to alternative methods.  ( 2 min )
    Ricci flow-guided autoencoders in learning time-dependent dynamics
    arXiv:2401.14591v5 Announce Type: replace-cross Abstract: We present a manifold-based autoencoder method for learning nonlinear dynamics in time, notably partial differential equations (PDEs), in which the manifold latent space evolves according to Ricci flow. This can be accomplished by simulating Ricci flow in a physics-informed setting, and manifold quantities can be matched so that Ricci flow is empirically achieved. With our methodology, the manifold is learned as part of the training procedure, so ideal geometries may be discerned, while the evolution simultaneously induces a more accommodating latent representation over static methods. We present our method on a range of numerical experiments consisting of PDEs that encompass desirable characteristics such as periodicity and randomness, remarking error on in-distribution and extrapolation scenarios.  ( 2 min )
    Solution of the Probabilistic Lambert Problem: Connections with Optimal Mass Transport, Schr\"odinger Bridge and Reaction-Diffusion PDEs
    arXiv:2401.07961v3 Announce Type: replace-cross Abstract: Lambert's problem concerns with transferring a spacecraft from a given initial to a given terminal position within prescribed flight time via velocity control subject to a gravitational force field. We consider a probabilistic variant of the Lambert problem where the knowledge of the endpoint constraints in position vectors are replaced by the knowledge of their respective joint probability density functions. We show that the Lambert problem with endpoint joint probability density constraints is a generalized optimal mass transport (OMT) problem, thereby connecting this classical astrodynamics problem with a burgeoning area of research in modern stochastic control and stochastic machine learning. This newfound connection allows us to rigorously establish the existence and uniqueness of solution for the probabilistic Lambert problem. The same connection also helps to numerically solve the probabilistic Lambert problem via diffusion regularization, i.e., by leveraging further connection of the OMT with the Schr\"odinger bridge problem (SBP). This also shows that the probabilistic Lambert problem with additive dynamic process noise is in fact a generalized SBP, and can be solved numerically using the so-called Schr\"odinger factors, as we do in this work. We explain how the resulting analysis leads to solving a boundary-coupled system of reaction-diffusion PDEs where the nonlinear gravitational potential appears as the reaction rate. We propose novel algorithms for the same, and present illustrative numerical results. Our analysis and the algorithmic framework are nonparametric, i.e., we make neither statistical (e.g., Gaussian, first few moments, mixture or exponential family, finite dimensionality of the sufficient statistic) nor dynamical (e.g., Taylor series) approximations.  ( 3 min )
    Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances
    arXiv:2312.12741v2 Announce Type: replace-cross Abstract: We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, an arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develops a lower bound for the probability of misidentifying the best arm. They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates variances during an adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW) strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity, and the gap between the expected rewards of two arms approaches zero (small-gap regime). Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variance, is asymptotically optimal even when the variances are unknown.  ( 3 min )
    Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning
    arXiv:2311.15487v3 Announce Type: replace-cross Abstract: We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions; one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry.  ( 2 min )
    Inside the black box: Neural network-based real-time prediction of US recessions
    arXiv:2310.17571v2 Announce Type: replace-cross Abstract: A standard feedforward neural network (FFN) and two specific types of recurrent neural networks, long short-term memory (LSTM) and gated recurrent unit (GRU), are used for modeling US recessions in the period from 1967 to 2021. The estimated models are then employed to conduct real-time predictions of the Great Recession and the Covid-19 recession in the US. Their predictive performances are compared to those of the traditional linear models, the standard logit model and the ridge logit model. The out-of-sample performance suggests the application of LSTM and GRU in the area of recession forecasting, especially for the long-term forecasting tasks. They outperform other types of models across five different forecast horizons with respect to a selected set of statistical metrics. Shapley additive explanations (SHAP) method is applied to GRU and the ridge logit model as the best performer in the neural network and linear model group, respectively, to gain insight into the variable importance. The evaluation of variable importance differs between GRU and the ridge logit model, as reflected in their unequal variable orders determined by the SHAP values. These different weight assignments can be attributed to GRUs flexibility and capability to capture the business cycle asymmetries and nonlinearities. The SHAP method delivers some key recession indicators. For forecasting up to 3 months, the stock price index, real GDP, and private residential fixed investment show great short-term predictability, while for longer-term forecasting up to 12 months, the term spread and the producer price index have strong explanatory power for recessions. These findings are robust against other interpretation methods such as the local interpretable model-agnostic explanations (LIME) for GRU and the marginal effects for the ridge logit model.  ( 3 min )
    The Expressive Power of Low-Rank Adaptation
    arXiv:2310.17513v3 Announce Type: replace-cross Abstract: Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.  ( 2 min )
    The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing
    arXiv:2309.16883v4 Announce Type: replace-cross Abstract: Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. The certified radius in this context is a crucial indicator of the robustness of models. However how to design an efficient classifier with an associated certified radius? Randomized smoothing provides a promising framework by relying on noise injection into the inputs to obtain a smoothed and robust classifier. In this paper, we first show that the variance introduced by the Monte-Carlo sampling in the randomized smoothing procedure estimate closely interacts with two other important properties of the classifier, \textit{i.e.} its Lipschitz constant and margin. More precisely, our work emphasizes the dual impact of the Lipschitz constant of the base classifier, on both the smoothed classifier and the empirical variance. To increase the certified robust radius, we introduce a different way to convert logits to probability vectors for the base classifier to leverage the variance-margin trade-off. We leverage the use of Bernstein's concentration inequality along with enhanced Lipschitz bounds for randomized smoothing. Experimental results show a significant improvement in certified accuracy compared to current state-of-the-art methods. Our novel certification procedure allows us to use pre-trained models with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.  ( 3 min )
    From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity
    arXiv:2309.16512v3 Announce Type: replace-cross Abstract: In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.  ( 2 min )
    Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization
    arXiv:2309.10370v2 Announce Type: replace-cross Abstract: In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow neural networks through the explicit construction of upper bounds, without any use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider shallow neural networks with one hidden layer, a ReLU activation function, an ${\mathcal L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input space ${\mathbb R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size $N>QM$ that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal to noise ratio of training inputs. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.  ( 3 min )
    Deep Nonnegative Matrix Factorization with Beta Divergences
    arXiv:2309.08249v3 Announce Type: replace-cross Abstract: Deep Nonnegative Matrix Factorization (deep NMF) has recently emerged as a valuable technique for extracting multiple layers of features across different scales. However, all existing deep NMF models and algorithms have primarily centered their evaluation on the least squares error, which may not be the most appropriate metric for assessing the quality of approximations on diverse datasets. For instance, when dealing with data types such as audio signals and documents, it is widely acknowledged that $\beta$-divergences offer a more suitable alternative. In this paper, we develop new models and algorithms for deep NMF using some $\beta$-divergences, with a focus on the Kullback-Leibler divergence. Subsequently, we apply these techniques to the extraction of facial features, the identification of topics within document collections, and the identification of materials within hyperspectral images.  ( 2 min )
    Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
    arXiv:2309.09258v2 Announce Type: replace-cross Abstract: In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions" and thus be able to build on recent progress with analyzing SGD on such objectives.  ( 2 min )
    Energy-guided Entropic Neural Optimal Transport
    arXiv:2304.06094v4 Announce Type: replace-cross Abstract: Energy-based models (EBMs) are known in the Machine Learning community for decades. Since the seminal works devoted to EBMs dating back to the noughties, there have been a lot of efficient methods which solve the generative modelling problem by means of energy potentials (unnormalized likelihood functions). In contrast, the realm of Optimal Transport (OT) and, in particular, neural OT solvers is much less explored and limited by few recent works (excluding WGAN-based approaches which utilize OT as a loss function and do not model OT maps themselves). In our work, we bridge the gap between EBMs and Entropy-regularized OT. We present a novel methodology which allows utilizing the recent developments and technical improvements of the former in order to enrich the latter. From the theoretical perspective, we prove generalization bounds for our technique. In practice, we validate its applicability in toy 2D and image domains. To showcase the scalability, we empower our method with a pre-trained StyleGAN and apply it to high-res AFHQ $512\times 512$ unpaired I2I translation. For simplicity, we choose simple short- and long-run EBMs as a backbone of our Energy-guided Entropic OT approach, leaving the application of more sophisticated EBMs for future research. Our code is available at: https://github.com/PetrMokrov/Energy-guided-Entropic-OT  ( 3 min )
    Koopman-based generalization bound: New aspect for full-rank weights
    arXiv:2302.05825v3 Announce Type: replace-cross Abstract: We propose a new bound for generalization of neural networks using Koopman operators. Whereas most of existing works focus on low-rank weight matrices, we focus on full-rank weight matrices. Our bound is tighter than existing norm-based bounds when the condition numbers of weight matrices are small. Especially, it is completely independent of the width of the network if the weight matrices are orthogonal. Our bound does not contradict to the existing bounds but is a complement to the existing bounds. As supported by several existing empirical results, low-rankness is not the only reason for generalization. Furthermore, our bound can be combined with the existing bounds to obtain a tighter bound. Our result sheds new light on understanding generalization of neural networks with full-rank weight matrices, and it provides a connection between operator-theoretic analysis and generalization of neural networks.  ( 2 min )
    Dimension free ridge regression
    arXiv:2210.08571v2 Announce Type: replace-cross Abstract: Random matrix theory has become a widely useful tool in high-dimensional statistics and theoretical machine learning. However, random matrix theory is largely focused on the proportional asymptotics in which the number of columns grows proportionally to the number of rows of the data matrix. This is not always the most natural setting in statistics where columns correspond to covariates and rows to samples. With the objective to move beyond the proportional asymptotics, we revisit ridge regression ($\ell_2$-penalized least squares) on i.i.d. data $(x_i, y_i)$, $i\le n$, where $x_i$ is a feature vector and $y_i = \beta^\top x_i +\epsilon_i \in\mathbb{R}$ is a response. We allow the feature vector to be high-dimensional, or even infinite-dimensional, in which case it belongs to a separable Hilbert space, and assume either $z_i := \Sigma^{-1/2}x_i$ to have i.i.d. entries, or to satisfy a certain convex concentration property. Within this setting, we establish non-asymptotic bounds that approximate the bias and variance of ridge regression in terms of the bias and variance of an `equivalent' sequence model (a regression model with diagonal design matrix). The approximation is up to multiplicative factors bounded by $(1\pm \Delta)$ for some explicitly small $\Delta$. Previously, such an approximation result was known only in the proportional regime and only up to additive errors: in particular, it did not allow to characterize the behavior of the excess risk when this converges to $0$. Our general theory recovers earlier results in the proportional regime (with better error rates). As a new application, we obtain a completely explicit and sharp characterization of ridge regression for Hilbert covariates with regularly varying spectrum. Finally, we analyze the overparametrized near-interpolation setting and obtain sharp `benign overfitting' guarantees.  ( 3 min )
    Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization
    arXiv:2302.04552v3 Announce Type: replace-cross Abstract: Stochastically Extended Adversarial (SEA) model is introduced by Sachs et al. [2022] as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition, they demonstrate that the expected regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. They also provide a slightly weaker bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, without the convexity requirement of individual functions. For strongly convex and smooth functions, we establish an $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log (\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ result. For exp-concave and smooth functions, we achieve a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Owing to the OMD framework, we broaden our work to study dynamic regret minimization and scenarios where the online functions are non-smooth. We establish the first dynamic regret guarantee for the SEA model with convex and smooth functions, which is more favorable than static regret bounds in non-stationary scenarios. Furthermore, to deal with non-smooth and convex functions in the SEA model, we propose novel algorithms building on optimistic OMD with an implicit update, which provably attain static regret and dynamic regret guarantees without smoothness conditions.  ( 3 min )
    Identifying Causal Effects using Instrumental Time Series: Nuisance IV and Correcting for the Past
    arXiv:2203.06056v2 Announce Type: replace-cross Abstract: Instrumental variable (IV) regression relies on instruments to infer causal effects from observational data with unobserved confounding. We consider IV regression in time series models, such as vector auto-regressive (VAR) processes. Direct applications of i.i.d. techniques are generally inconsistent as they do not correctly adjust for dependencies in the past. In this paper, we outline the difficulties that arise due to time structure and propose methodology for constructing identifying equations that can be used for consistent parametric estimation of causal effects in time series data. One method uses extra nuisance covariates to obtain identifiability (an idea that can be of interest even in the i.i.d. case). We further propose a graph marginalization framework that allows us to apply nuisance IV and other IV methods in a principled way to time series. Our methods make use of a version of the global Markov property, which we prove holds for VAR(p) processes. For VAR(1) processes, we prove identifiability conditions that relate to Jordan forms and are different from the well-known rank conditions in the i.i.d. case (they do not require as many instruments as covariates, for example). We provide methods, prove their consistency, and show how the inferred causal effect can be used for distribution generalization. Simulation experiments corroborate our theoretical results. We provide ready-to-use Python code.  ( 3 min )
    Nonparametric Bayesian volatility learning under microstructure noise
    arXiv:1805.05606v2 Announce Type: replace-cross Abstract: In this work, we study the problem of learning the volatility under market microstructure noise. Specifically, we consider noisy discrete time observations from a stochastic differential equation and develop a novel computational method to learn the diffusion coefficient of the equation. We take a nonparametric Bayesian approach, where we \emph{a priori} model the volatility function as piecewise constant. Its prior is specified via the inverse Gamma Markov chain. Sampling from the posterior is accomplished by incorporating the Forward Filtering Backward Simulation algorithm in the Gibbs sampler. Good performance of the method is demonstrated on two representative synthetic data examples. We also apply the method on a EUR/USD exchange rate dataset. Finally we present a limit result on the prior distribution.  ( 2 min )
    The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents
    arXiv:2402.03220v2 Announce Type: replace Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.  ( 2 min )
    Shift Aggregate Extract Networks
    arXiv:1703.05537v2 Announce Type: replace-cross Abstract: We introduce an architecture based on deep hierarchical decompositions to learn effective representations of large graphs. Our framework extends classic R-decompositions used in kernel methods, enabling nested part-of-part relations. Unlike recursive neural networks, which unroll a template on input graphs directly, we unroll a neural network template over the decomposition hierarchy, allowing us to deal with the high degree variability that typically characterize social network graphs. Deep hierarchical decompositions are also amenable to domain compression, a technique that reduces both space and time complexity by exploiting symmetries. We show empirically that our approach is able to outperform current state-of-the-art graph classification methods on large social network datasets, while at the same time being competitive on small chemobiological benchmark datasets.  ( 2 min )
    Multinomial belief networks
    arXiv:2311.16909v2 Announce Type: replace Abstract: A Bayesian approach to machine learning is attractive when we need to quantify uncertainty, deal with missing observations, when samples are scarce, or when the data is sparse. All of these commonly apply when analysing healthcare data. To address these analytical requirements, we propose a deep generative model for multinomial count data where both the weights and hidden units of the network are Dirichlet distributed. A Gibbs sampling procedure is formulated that takes advantage of a series of augmentation relations, analogous to the Zhou--Cong--Chen model. We apply the model on small handwritten digits, and a large experimental dataset of DNA mutations in cancer, and we show how the model is able to extract biologically meaningful meta-signatures in a fully data-driven way.  ( 2 min )
    Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes
    arXiv:2310.16597v3 Announce Type: replace Abstract: The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that allows a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call PSEUDO-IID), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training. Moreover, they enable the posterior distribution of Bayesian Neural Networks to be tractable across these various initialization schemes.  ( 2 min )
    Online Multi-Task Learning with Recursive Least Squares and Recursive Kernel Methods
    arXiv:2308.01938v2 Announce Type: replace Abstract: This paper introduces two novel approaches for Online Multi-Task Learning (MTL) Regression Problems. We employ a high performance graph-based MTL formulation and develop two alternative recursive versions based on the Weighted Recursive Least Squares (WRLS) and the Online Sparse Least Squares Support Vector Regression (OSLSSVR) strategies. Adopting task-stacking transformations, we demonstrate the existence of a single matrix incorporating the relationship of multiple tasks and providing structural information to be embodied by the MT-WRLS method in its initialization procedure and by the MT-OSLSSVR in its multi-task kernel function. Contrasting the existing literature, which is mostly based on Online Gradient Descent (OGD) or cubic inexact approaches, we achieve exact and approximate recursions with quadratic per-instance cost on the dimension of the input space (MT-WRLS) or on the size of the dictionary of instances (MT-OSLSSVR). We compare our online MTL methods to other contenders in a real-world wind speed forecasting case study, evidencing the significant gain in performance of both proposed approaches.  ( 2 min )
    Learning to Make Adherence-Aware Advice
    arXiv:2310.00817v2 Announce Type: replace Abstract: As artificial intelligence (AI) systems play an increasingly prominent role in human decision-making, challenges surface in the realm of human-AI interactions. One challenge arises from the suboptimal AI policies due to the inadequate consideration of humans disregarding AI recommendations, as well as the need for AI to provide advice selectively when it is most pertinent. This paper presents a sequential decision-making model that (i) takes into account the human's adherence level (the probability that the human follows/rejects machine advice) and (ii) incorporates a defer option so that the machine can temporarily refrain from making advice. We provide learning algorithms that learn the optimal advice policy and make advice only at critical time stamps. Compared to problem-agnostic reinforcement learning algorithms, our specialized learning algorithms not only enjoy better theoretical convergence properties but also show strong empirical performance.  ( 2 min )
    Online learning in bandits with predicted context
    arXiv:2307.13916v3 Announce Type: replace Abstract: We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.  ( 2 min )
    Properties of Discrete Sliced Wasserstein Losses
    arXiv:2307.10352v3 Announce Type: replace Abstract: The Sliced Wasserstein (SW) distance has become a popular alternative to the Wasserstein distance for comparing probability measures. Widespread applications include image processing, domain adaptation and generative modelling, where it is common to optimise some parameters in order to minimise SW, which serves as a loss function between discrete probability measures (since measures admitting densities are numerically unattainable). All these optimisation problems bear the same sub-problem, which is minimising the Sliced Wasserstein energy. In this paper we study the properties of $\mathcal{E}: Y \longmapsto \mathrm{SW}_2^2(\gamma_Y, \gamma_Z)$, i.e. the SW distance between two uniform discrete measures with the same amount of points as a function of the support $Y \in \mathbb{R}^{n \times d}$ of one of the measures. We investigate the regularity and optimisation properties of this energy, as well as its Monte-Carlo approximation $\mathcal{E}_p$ (estimating the expectation in SW using only $p$ samples) and show convergence results on the critical points of $\mathcal{E}_p$ to those of $\mathcal{E}$, as well as an almost-sure uniform convergence. Finally, we show that in a certain sense, Stochastic Gradient Descent methods minimising $\mathcal{E}$ and $\mathcal{E}_p$ converge towards (Clarke) critical points of these energies.  ( 2 min )
    Robust Counterfactual Explanations for Neural Networks With Probabilistic Guarantees
    arXiv:2305.11997v3 Announce Type: replace Abstract: There is an emerging interest in generating robust counterfactual explanations that would remain valid if the model is updated or changed even slightly. Towards finding robust counterfactuals, existing literature often assumes that the original model $m$ and the new model $M$ are bounded in the parameter space, i.e., $\|\text{Params}(M){-}\text{Params}(m)\|{<}\Delta$. However, models can often change significantly in the parameter space with little to no change in their predictions or accuracy on the given dataset. In this work, we introduce a mathematical abstraction termed $\textit{naturally-occurring}$ model change, which allows for arbitrary changes in the parameter space such that the change in predictions on points that lie on the data manifold is limited. Next, we propose a measure -- that we call $\textit{Stability}$ -- to quantify the robustness of counterfactuals to potential model changes for differentiable models, e.g., neural networks. Our main contribution is to show that counterfactuals with sufficiently high value of $\textit{Stability}$ as defined by our measure will remain valid after potential $\textit{naturally-occurring}$ model changes with high probability (leveraging concentration bounds for Lipschitz function of independent Gaussians). Since our quantification depends on the local Lipschitz constant around a data point which is not always available, we also examine practical relaxations of our proposed measure and demonstrate experimentally how they can be incorporated to find robust counterfactuals for neural networks that are close, realistic, and remain valid after potential model changes. This work also has interesting connections with model multiplicity, also known as, the Rashomon effect.  ( 3 min )
    Differentially Private Latent Diffusion Models
    arXiv:2305.15759v4 Announce Type: replace Abstract: Diffusion models (DMs) are widely used for generating high-quality high-dimensional images in a non-differentially private manner. To address this challenge, recent papers suggest pre-training DMs with public data, then fine-tuning them with private data using DP-SGD for a relatively short period. In this paper, we further improve the current state of DMs with DP by adopting the Latent Diffusion Models (LDMs). LDMs are equipped with powerful pre-trained autoencoders that map the high-dimensional pixels into lower-dimensional latent representations, in which DMs are trained, yielding a more efficient and fast training of DMs. In our algorithm, DP-LDMs, rather than fine-tuning the entire DMs, we fine-tune only the attention modules of LDMs at varying layers with privacy-sensitive data, reducing the number of trainable parameters by roughly 90% and achieving a better accuracy, compared to fine-tuning the entire DMs. The smaller parameter space to fine-tune with DP-SGD helps our algorithm to achieve new state-of-the-art results in several public-private benchmark data pairs.Our approach also allows us to generate more realistic, high-dimensional images (256x256) and those conditioned on text prompts with differential privacy, which have not been attempted before us, to the best of our knowledge. Our approach provides a promising direction for training more powerful, yet training-efficient differentially private DMs, producing high-quality high-dimensional DP images.  ( 3 min )
    Copula-based transferable models for synthetic population generation
    arXiv:2302.09193v2 Announce Type: replace Abstract: Population synthesis involves generating synthetic yet realistic representations of a target population of micro-agents for behavioral modeling and simulation. Traditional methods, often reliant on target population samples, such as census data or travel surveys, face limitations due to high costs and small sample sizes, particularly at smaller geographical scales. We propose a novel framework based on copulas to generate synthetic data for target populations where only empirical marginal distributions are known. This method utilizes samples from different populations with similar marginal dependencies, introduces a spatial component into population synthesis, and considers various information sources for more realistic generators. Concretely, the process involves normalizing the data and treat it as realizations of a given copula, and then training a generative model before incorporating the information on the marginals of the target population. Utilizing American Community Survey data, we assess our framework's performance through standardized root mean squared error (SRMSE) and so-called sampled zeros. We focus on its capacity to transfer a model learned from one population to another. Our experiments include transfer tests between regions at the same geographical level as well as to lower geographical levels, hence evaluating the framework's adaptability in varied spatial contexts. We compare Bayesian Networks, Variational Autoencoders, and Generative Adversarial Networks, both individually and combined with our copula framework. Results show that the copula enhances machine learning methods in matching the marginals of the reference data. Furthermore, it consistently surpasses Iterative Proportional Fitting in terms of SRMSE in the transferability experiments, while introducing unique observations not found in the original training sample.  ( 3 min )
    Local and Regional Counterfactual Rules: Summarized and Robust Recourses
    arXiv:2209.14568v3 Announce Type: replace Abstract: Counterfactual Explanations (CE) face several unresolved challenges, such as ensuring stability, synthesizing multiple CEs, and providing plausibility and sparsity guarantees. From a more practical point of view, recent studies [Pawelczyk et al., 2022] show that the prescribed counterfactual recourses are often not implemented exactly by individuals and demonstrate that most state-of-the-art CE algorithms are very likely to fail in this noisy environment. To address these issues, we propose a probabilistic framework that gives a sparse local counterfactual rule for each observation, providing rules that give a range of values capable of changing decisions with high probability. These rules serve as a summary of diverse counterfactual explanations and yield robust recourses. We further aggregate these local rules into a regional counterfactual rule, identifying shared recourses for subgroups of the data. Our local and regional rules are derived from the Random Forest algorithm, which offers statistical guarantees and fidelity to data distribution by selecting recourses in high-density regions. Moreover, our rules are sparse as we first select the smallest set of variables having a high probability of changing the decision. We have conducted experiments to validate the effectiveness of our counterfactual rules in comparison to standard CE and recent similar attempts. Our methods are available as a Python package.  ( 3 min )
    Statistical Properties of the log-cosh Loss Function Used in Machine Learning
    arXiv:2208.04564v4 Announce Type: replace Abstract: This paper analyzes a popular loss function used in machine learning called the log-cosh loss function. A number of papers have been published using this loss function but, to date, no statistical analysis has been presented in the literature. In this paper, we present the distribution function from which the log-cosh loss arises. We compare it to a similar distribution, called the Cauchy distribution, and carry out various statistical procedures that characterize its properties. In particular, we examine its associated pdf, cdf, likelihood function and Fisher information. Side-by-side we consider the Cauchy and Cosh distributions as well as the MLE of the location parameter with asymptotic bias, asymptotic variance, and confidence intervals. We also provide a comparison of robust estimators from several other loss functions, including the Huber loss function and the rank dispersion function. Further, we examine the use of the log-cosh function for quantile regression. In particular, we identify a quantile distribution function from which a maximum likelihood estimator for quantile regression can be derived. Finally, we compare a quantile M-estimator based on log-cosh with robust monotonicity against another approach to quantile regression based on convolutional smoothing.  ( 3 min )
    A review of predictive uncertainty estimation with machine learning
    arXiv:2209.08307v2 Announce Type: replace Abstract: Predictions and forecasts of machine learning models should take the form of probability distributions, aiming to increase the quantity of information communicated to end users. Although applications of probabilistic prediction and forecasting with machine learning models in academia and industry are becoming more frequent, related concepts and methods have not been formalized and structured under a holistic view of the entire field. Here, we review the topic of predictive uncertainty estimation with machine learning algorithms, as well as the related metrics (consistent scoring functions and proper scoring rules) for assessing probabilistic predictions. The review covers a time period spanning from the introduction of early statistical (linear regression and time series models, based on Bayesian statistics or quantile regression) to recent machine learning algorithms (including generalized additive models for location, scale and shape, random forests, boosting and deep learning algorithms) that are more flexible by nature. The review of the progress in the field, expedites our understanding on how to develop new algorithms tailored to users' needs, since the latest advancements are based on some fundamental concepts applied to more complex algorithms. We conclude by classifying the material and discussing challenges that are becoming a hot topic of research.  ( 3 min )
    Random Forest Weighted Local Fr\'echet Regression with Random Objects
    arXiv:2202.04912v4 Announce Type: replace Abstract: Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and M\"uller (2019) established a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method uses these weights as the local average to solve the conditional Fr\'echet mean, while the second method performs local linear Fr\'echet regression, both significantly improving existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order $M_{m_n}$-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data.  ( 3 min )
    Convergence of Kinetic Langevin Monte Carlo on Lie groups
    arXiv:2403.12012v1 Announce Type: cross Abstract: Explicit, momentum-based dynamics for optimizing functions defined on Lie groups was recently constructed, based on techniques such as variational optimization and left trivialization. We appropriately add tractable noise to the optimization dynamics to turn it into a sampling dynamics, leveraging the advantageous feature that the momentum variable is Euclidean despite that the potential function lives on a manifold. We then propose a Lie-group MCMC sampler, by delicately discretizing the resulting kinetic-Langevin-type sampling dynamics. The Lie group structure is exactly preserved by this discretization. Exponential convergence with explicit convergence rate for both the continuous dynamics and the discrete sampler are then proved under W2 distance. Only compactness of the Lie group and geodesically L-smoothness of the potential function are needed. To the best of our knowledge, this is the first convergence result for kinetic Langevin on curved spaces, and also the first quantitative result that requires no convexity or, at least not explicitly, any common relaxation such as isoperimetry.  ( 2 min )
    Visualization for Trust in Machine Learning Revisited: The State of the Field in 2023
    arXiv:2403.12005v1 Announce Type: cross Abstract: Visualization for explainable and trustworthy machine learning remains one of the most important and heavily researched fields within information visualization and visual analytics with various application domains, such as medicine, finance, and bioinformatics. After our 2020 state-of-the-art report comprising 200 techniques, we have persistently collected peer-reviewed articles describing visualization techniques, categorized them based on the previously established categorization schema consisting of 119 categories, and provided the resulting collection of 542 techniques in an online survey browser. In this survey article, we present the updated findings of new analyses of this dataset as of fall 2023 and discuss trends, insights, and eight open challenges for using visualizations in machine learning. Our results corroborate the rapidly growing trend of visualization techniques for increasing trust in machine learning models in the past three years, with visualization found to help improve popular model explainability methods and check new deep learning architectures, for instance.  ( 2 min )
    Probabilistic Calibration by Design for Neural Network Regression
    arXiv:2403.11964v1 Announce Type: cross Abstract: Generating calibrated and sharp neural network predictive distributions for regression problems is essential for optimal decision-making in many real-world applications. To address the miscalibration issue of neural networks, various methods have been proposed to improve calibration, including post-hoc methods that adjust predictions after training and regularization methods that act during training. While post-hoc methods have shown better improvement in calibration compared to regularization methods, the post-hoc step is completely independent of model training. We introduce a novel end-to-end model training procedure called Quantile Recalibration Training, integrating post-hoc calibration directly into the training process without additional parameters. We also present a unified algorithm that includes our method and other post-hoc and regularization methods, as particular cases. We demonstrate the performance of our method in a large-scale experiment involving 57 tabular regression datasets, showcasing improved predictive accuracy while maintaining calibration. We also conduct an ablation study to evaluate the significance of different components within our proposed method, as well as an in-depth analysis of the impact of the base model and different hyperparameters on predictive accuracy.  ( 2 min )
    Transfer Learning Beyond Bounded Density Ratios
    arXiv:2403.11963v1 Announce Type: cross Abstract: We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$. A standard change of measure argument implies that transfer learning happens when the density ratio $dQ/dP$ is bounded. Yet, prior thought-provoking works by Kpotufe and Martinet (COLT, 2018) and Hanneke and Kpotufe (NeurIPS, 2019) demonstrate cases where the ratio $dQ/dP$ is unbounded, but transfer learning is possible. In this work, we focus on transfer learning over the class of low-degree polynomial estimators. Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions, going well beyond the classical assumption that $dQ/dP$ is bounded. For instance, it always applies if $Q$ is a log-concave measure and the inverse ratio $dP/dQ$ is bounded. To demonstrate the applicability of our inequality, we obtain new results in the settings of: (1) the classical truncated regression setting, where $dQ/dP$ equals infinity, and (2) the more recent out-of-distribution generalization setting for in-context learning linear functions with transformers. We also provide a discrete analogue of our transfer inequality on the Boolean Hypercube $\{-1,1\}^n$, and study its connections with the recent problem of Generalization on the Unseen of Abbe, Bengio, Lotfi and Rizk (ICML, 2023). Our main conceptual contribution is that the maximum influence of the error of the estimator $\widehat{f}-f^*$ under $Q$, $\mathrm{I}_{\max}(\widehat{f}-f^*)$, acts as a sufficient condition for transferability; when $\mathrm{I}_{\max}(\widehat{f}-f^*)$ is appropriately bounded, transfer is possible over the Boolean domain.  ( 3 min )
    A tutorial on learning from preferences and choices with Gaussian Processes
    arXiv:2403.11782v1 Announce Type: cross Abstract: Preference modelling lies at the intersection of economics, decision theory, machine learning and statistics. By understanding individuals' preferences and how they make choices, we can build products that closely match their expectations, paving the way for more efficient and personalised applications across a wide range of domains. The objective of this tutorial is to present a cohesive and comprehensive framework for preference learning with Gaussian Processes (GPs), demonstrating how to seamlessly incorporate rationality principles (from economics and decision theory) into the learning process. By suitably tailoring the likelihood function, this framework enables the construction of preference learning models that encompass random utility models, limits of discernment, and scenarios with multiple conflicting utilities for both object- and label-preference. This tutorial builds upon established research while simultaneously introducing some novel GP-based models to address specific gaps in the existing literature.  ( 2 min )
    PARMESAN: Parameter-Free Memory Search and Transduction for Dense Prediction Tasks
    arXiv:2403.11743v1 Announce Type: cross Abstract: In this work we address flexibility in deep learning by means of transductive reasoning. For adaptation to new tasks or new data, existing methods typically involve tuning of learnable parameters or even complete re-training from scratch, rendering such approaches unflexible in practice. We argue that the notion of separating computation from memory by the means of transduction can act as a stepping stone for solving these issues. We therefore propose PARMESAN (parameter-free memory search and transduction), a scalable transduction method which leverages a memory module for solving dense prediction tasks. At inference, hidden representations in memory are being searched to find corresponding examples. In contrast to other methods, PARMESAN learns without the requirement for any continuous training or fine-tuning of learnable parameters simply by modifying the memory content. Our method is compatible with commonly used neural architectures and canonically transfers to 1D, 2D, and 3D grid-based data. We demonstrate the capabilities of our approach at complex tasks such as continual and few-shot learning. PARMESAN learns up to 370 times faster than common baselines while being on par in terms of predictive performance, knowledge retention, and data-efficiency.  ( 2 min )
    Generalization error of spectral algorithms
    arXiv:2403.11696v1 Announce Type: cross Abstract: The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of $\textit{spectral algorithms}$ specified by profile $h(\lambda)$, and including KRR and GD as special cases. Then, we derive the generalization error as a functional of learning profile $h(\lambda)$ for two data models: high-dimensional Gaussian and low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in case of noisy observation.  ( 2 min )
    State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards
    arXiv:2403.11520v1 Announce Type: cross Abstract: While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separate SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffers from combinatorial issues for large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.  ( 2 min )
    The Value of Reward Lookahead in Reinforcement Learning
    arXiv:2403.11637v1 Announce Type: cross Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.  ( 2 min )
    Do CLIPs Always Generalize Better than ImageNet Models?
    arXiv:2403.11497v1 Announce Type: cross Abstract: Large vision language models, such as CLIPs, have revolutionized modern machine learning. CLIPs have demonstrated great generalizability under distribution shifts, supported by an increasing body of literature. However, the evaluation datasets for CLIPs are variations primarily designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., pre-trained on LAION, robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group: comprising animals on common backgrounds, and b) the counter group: including animals on unusual backgrounds. The performance drops from the common to counter groups quantify the reliance of models on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and one needs to be cautious about test setups when evaluating foundation models pre-trained on a significantly different scale and distribution.  ( 2 min )
    Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs
    arXiv:2403.11477v1 Announce Type: cross Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue a new transient time parameter $B$ is necessary, establish an $\tilde{O}(SA\frac{B+H}{\epsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{\Omega}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma\geq 1-1/H$, and $\tilde{\Omega}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both these results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.  ( 3 min )
    An SDP-based Branch-and-Cut Algorithm for Biclustering
    arXiv:2403.11351v1 Announce Type: cross Abstract: Biclustering, also called co-clustering, block clustering, or two-way clustering, involves the simultaneous clustering of both the rows and columns of a data matrix into distinct groups, such that the rows and columns within a group display similar patterns. As a model problem for biclustering, we consider the $k$-densest-disjoint biclique problem, whose goal is to identify $k$ disjoint complete bipartite subgraphs (called bicliques) of a given weighted complete bipartite graph such that the sum of their densities is maximized. To address this problem, we present a tailored branch-and-cut algorithm. For the upper bound routine, we consider a semidefinite programming relaxation and propose valid inequalities to strengthen the bound. We solve this relaxation in a cutting-plane fashion using a first-order method. For the lower bound, we design a maximum weight matching rounding procedure that exploits the solution of the relaxation solved at each node. Computational results on both synthetic and real-world instances show that the proposed algorithm can solve instances approximately 20 times larger than those handled by general-purpose solvers.  ( 2 min )
    Divide-and-Conquer Posterior Sampling for Denoising Diffusion Priors
    arXiv:2403.11407v1 Announce Type: cross Abstract: Interest in the use of Denoising Diffusion Models (DDM) as priors for solving inverse Bayesian problems has recently increased significantly. However, sampling from the resulting posterior distribution poses a challenge. To solve this problem, previous works have proposed approximations to bias the drift term of the diffusion. In this work, we take a different approach and utilize the specific structure of the DDM prior to define a set of intermediate and simpler posterior sampling problems, resulting in a lower approximation error compared to previous methods. We empirically demonstrate the reconstruction capability of our method for general linear inverse problems using synthetic examples and various image restoration tasks.  ( 2 min )
    Federated Transfer Learning with Differential Privacy
    arXiv:2403.11343v1 Announce Type: cross Abstract: Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of \textit{federated differential privacy}, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets.  ( 2 min )
    DTOR: Decision Tree Outlier Regressor to explain anomalies
    arXiv:2403.10903v1 Announce Type: cross Abstract: Explaining outliers occurrence and mechanism of their occurrence can be extremely important in a variety of domains. Malfunctions, frauds, threats, in addition to being correctly identified, oftentimes need a valid explanation in order to effectively perform actionable counteracts. The ever more widespread use of sophisticated Machine Learning approach to identify anomalies make such explanations more challenging. We present the Decision Tree Outlier Regressor (DTOR), a technique for producing rule-based explanations for individual data points by estimating anomaly scores generated by an anomaly detection model. This is accomplished by first applying a Decision Tree Regressor, which computes the estimation score, and then extracting the relative path associated with the data point score. Our results demonstrate the robustness of DTOR even in datasets with a large number of features. Additionally, in contrast to other rule-based approaches, the generated rules are consistently satisfied by the points to be explained. Furthermore, our evaluation metrics indicate comparable performance to Anchors in outlier explanation tasks, with reduced execution time.  ( 2 min )
    Interpretable Machine Learning for TabPFN
    arXiv:2403.10923v1 Announce Type: cross Abstract: The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN. Our proposed methods are implemented in a package tabpfn_iml and made available at https://github.com/david-rundel/tabpfn_iml.  ( 2 min )
    List Sample Compression and Uniform Convergence
    arXiv:2403.10889v1 Announce Type: cross Abstract: List learning is a variant of supervised classification where the learner outputs multiple plausible labels for each instance rather than just one. We investigate classical principles related to generalization within the context of list learning. Our primary goal is to determine whether classical principles in the PAC setting retain their applicability in the domain of list PAC learning. We focus on uniform convergence (which is the basis of Empirical Risk Minimization) and on sample compression (which is a powerful manifestation of Occam's Razor). In classical PAC learning, both uniform convergence and sample compression satisfy a form of `completeness': whenever a class is learnable, it can also be learned by a learning rule that adheres to these principles. We ask whether the same completeness holds true in the list learning setting. We show that uniform convergence remains equivalent to learnability in the list PAC learning setting. In contrast, our findings reveal surprising results regarding sample compression: we prove that when the label space is $Y=\{0,1,2\}$, then there are 2-list-learnable classes that cannot be compressed. This refutes the list version of the sample compression conjecture by Littlestone and Warmuth (1986). We prove an even stronger impossibility result, showing that there are $2$-list-learnable classes that cannot be compressed even when the reconstructed function can work with lists of arbitrarily large size. We prove a similar result for (1-list) PAC learnable classes when the label space is unbounded. This generalizes a recent result by arXiv:2308.06424.  ( 2 min )
    Incentivized Exploration of Non-Stationary Stochastic Bandits
    arXiv:2403.10819v1 Announce Type: cross Abstract: We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two different non-stationary environments: abruptly-changing and continuously-changing, and propose respective incentivized exploration algorithms. We show that the proposed algorithms achieve sublinear regret and compensation over time, thus effectively incentivizing exploration despite the nonstationarity and the biased or drifted feedback.  ( 2 min )
    A resource-constrained stochastic scheduling algorithm for homeless street outreach and gleaning edible food
    arXiv:2403.10638v1 Announce Type: cross Abstract: We developed a common algorithmic solution addressing the problem of resource-constrained outreach encountered by social change organizations with different missions and operations: Breaking Ground -- an organization that helps individuals experiencing homelessness in New York transition to permanent housing and Leket -- the national food bank of Israel that rescues food from farms and elsewhere to feed the hungry. Specifically, we developed an estimation and optimization approach for partially-observed episodic restless bandits under $k$-step transitions. The results show that our Thompson sampling with Markov chain recovery (via Stein variational gradient descent) algorithm significantly outperforms baselines for the problems of both organizations. We carried out this work in a prospective manner with the express goal of devising a flexible-enough but also useful-enough solution that can help overcome a lack of sustainable impact in data science for social good.  ( 2 min )
    A Probabilistic Approach for Alignment with Human Comparisons
    arXiv:2403.10771v1 Announce Type: cross Abstract: A growing trend involves integrating human knowledge into learning frameworks, leveraging subtle human feedback to refine AI models. Despite these advances, no comprehensive theoretical framework describing the specific conditions under which human comparisons improve the traditional supervised fine-tuning process has been developed. To bridge this gap, this paper studies the effective use of human comparisons to address limitations arising from noisy data and high-dimensional models. We propose a two-stage "Supervised Fine Tuning+Human Comparison" (SFT+HC) framework connecting machine learning with human feedback through a probabilistic bisection approach. The two-stage framework first learns low-dimensional representations from noisy-labeled data via an SFT procedure, and then uses human comparisons to improve the model alignment. To examine the efficacy of the alignment phase, we introduce a novel concept termed the "label-noise-to-comparison-accuracy" (LNCA) ratio. This paper theoretically identifies the conditions under which the "SFT+HC" framework outperforms pure SFT approach, leveraging this ratio to highlight the advantage of incorporating human evaluators in reducing sample complexity. We validate that the proposed conditions for the LNCA ratio are met in a case study conducted via an Amazon Mechanical Turk experiment.  ( 2 min )

  • Open

    [P] Optimal Performance without Static Graphs by Fusing Tensor Operation Streams
    One of the most crucial aspects of current machine learning research is discovering model architectures that efficiently scale with compute resources. Transformers have emerged as the predominant architecture due to their effective utilization of contemporary hardware. However, they don't adapt their computation graphs based on the complexity of tasks, necessitating different versions for tasks of varying complexity. This approach doesn't align with the goal of having one model capable of continuous learning (lifelong learning) while remaining efficient for easy tasks. I argue that there's a pressing need for further exploration into dynamic architectures, where the computational graph adapts at runtime based on contextual cues. While several papers, such as BranchyNet, have delved into t…
    [D] ML Project Ideas
    Hi everyone, I'm currently pursuing an MSc in AI, and while the coursework is enjoyable, I find the assignments somewhat simplistic and doubt their ability to impress potential employers. Therefore, I've decided to undertake my own project, but I've hit a roadblock in generating ideas (I'll admit, creativity isn't my strongest suit). I've been considering various avenues, such as conducting experiments on llama-2-7b, exploring its potential for tunning, or even building an API. However, I'm hesitant and seeking input from the community. Just to provide some background, I'm proficient in Python with a good knowledge of PyTorch, HuggingFace, Langchain, and other default ML frameworks(np, pd, etc.) as well as Azure. However, I must mention that my hardware setup is relatively modest, and I may need to rent a GPU for projects requiring significant computational power. I'm open to suggestions and would greatly appreciate any ideas or guidance you may have to offer. Thank you in advance! submitted by /u/Wild-Positive-6836 [link] [comments]
    [R] struggling with research ... F distribution
    Hello. I am really struggling with my masters. I have to implement the F distribution for mixture models. Does anyone know here I can get code related to my research? Also, my supervisor is not guiding as me I would like... is this how it's supposed to be? submitted by /u/isthiswhyiamhere [link] [comments]
    [R] TacticAI: an AI assistant for football tactics
    Blog post: https://deepmind.google/discover/blog/tacticai-ai-assistant-for-football-tactics/ Nature paper: https://www.nature.com/articles/s41467-024-45965-x Abstract Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI’s model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning. submitted by /u/RobbinDeBank [link] [comments]
    Training split very slow [R]
    Hey all! I am fine tuning mbert on wikipedia dataset, loaded with Datasets (Hugging face) https://preview.redd.it/kaw1kmrtjcpc1.png?width=1944&format=png&auto=webp&s=818740eebd72dc13ad35a30b384a122b79866f3c This is the bash script submitted to Slurm (irrelevant lines removed): /bin/bashLANG=ka #SBATCH -o .../examples/language-modeling/slogs/sl_ka1_%A.out #SBATCH -e .../examples/language-modeling//slogs/sl_ka1_%A.out #SBATCH -N 1 # nodes requested #SBATCH -n 1 # tasks requested #SBATCH --gres=gpu:8 # use 1 GPU #SBATCH --mem=60000 # memory in Mb #SBATCH --partition=PGR-Standard #SBATCH -t 24:00:00 # time requested in hour:minute:seconds #SBATCH --cpus-per-task=16 # number of cpus to use - there are 32 on each node torchrun --nproc_per_node 8 run_mlm.py \ --model_name_or_path bert-base-multilingual-cased \ --cache_dir **${CACHE_HOME}**2\ --dataset_lang ${LANG} \ --dataset_name Wikipedia \ --output_dir ${OUTPUT_DIR} \ --do_train \ --do_eval \ --per_device_train_batch_size 4 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 2 \ --max_seq_length 256 \ --overwrite_output_dir \ --ft_params_num 7667712 \ --evaluation_strategy steps \ --eval_steps 1000 \ --dataloader_num_workers 32 \ --preprocessing_num_workers 32 \ --validation_split_percentage 5 \ --load_best_model_at_end \ --save_total_limit 2 Here is the report of the speed, very slow. Downloading data: 100%|██████████| 14.1k/14.1k [00:00<00:00, 14.0MB/s] Downloading data: 100%|██████████| 205M/205M [00:44<00:00, 4.59MB/s] Generating train split: 0 examples [00:00, ? examples/s]Extracting content from .../language-modeling/cache_directory22/downloads/f797c17d35d578a4c1a3f251847095789ec04ae453f10623aeb8366ff4797a07 Generating train split: 170787 examples [17:57, 158.45 examples/s] Thank you very much in advance submitted by /u/Choricius [link] [comments]
    [D] Why do transformers use embeddings with the same dimensionality in each layer?
    My intuition is that tokens get gradually enriched as we move through the layers, but that would mean we need to store a lot less information per token in the early layers than in the later ones. Wouldn't it make sense to start out with (relatively) low-dimensional embeddings, and then project or extend these onto higher dimensions, until they reach their final size? submitted by /u/timtom85 [link] [comments]
    [D] Extracting Text from PDFs with Python While Maintaining Structure
    Hello, everyone! I'm working with a bunch of medical invoices in native PDF format and I need to extract the text while maintaining the original structure, especially the tabular formats. Using some Python libraries, I'm able to extract the text, but it loses the intended formatting. When I use some Python libraries, the extracted text loses its structure. For example: Oliver Smith Green St. 12 1356 Fayetteville [more details...] 1 Treatment A 12,31 1 Treatment B 19,00 Total netto 19 % EUR 54,96 19 % Tax EUR 10,44 Total EUR 65,40 Payed EUR 65,40 [more details...] What I'm aiming for is something like this: Oliver Smith Green St. 12 1356 Fayetteville [more details...] 1 Treatment A 12,31 1 Treatment B 19,00 Total netto 19 % EUR 54,96 19 % Tax EUR 10,44 Total EUR 65,40 Payed EUR 65,40 [more details...] I've had some promising results with pdfplumber, but the tables got printed twice, which isn't ideal. Here's the code snippet I'm using: import pdfplumber pdf_path = "path/to/your/pdf" with pdfplumber.open(pdf_path) as pdf: for page in pdf.pages: text = page.extract_text() print(text) tables = page.extract_tables() for table in tables: for row in table: print(" ".join([str(cell) for cell in row if cell])) print() This was the result: Oliver Smith Green St. 12 1356 Fayetteville [more details...] 1 Treatment A 12,31 1 Treatment B 19,00 Total netto 19 % EUR 54,96 19 % Tax EUR 10,44 Total EUR 65,40 Payed EUR 65,40 [more details...] 1 Treatment A 12,31 1 Treatment B 19,00 Total netto 19 % EUR 54,96 19 % Tax EUR 10,44 Total EUR 65,40 Payed EUR 65,40 Does anyone have advice on how to improve this extraction process to maintain the structure better? I'm looking to extract not just the tables but also other text elements like customer addresses while keeping the formatting as close to the original as possible. Thank you in advance for your help! submitted by /u/4AVcnE [link] [comments]
    [D] ICML 2024 Discussion Thread
    This post is for discussing anything related to ICML 2024, with reviews coming out tomorrow! Good luck to everyone involved! submitted by /u/condom-mechanics [link] [comments]
    [R] Towards White Box Deep Learning
    I present a conceptual framework for building inherently interpretable neural networks. The PoC 4-layer model for a subproblem of MNIST can be considered a White Box: the decision boundary is easily explainable and the model is robust to adversarial attacks - despite no form of adversarial training! The approach essentially reduces to a general idea of how to share weights inside a network layer to achieve highly interpretable and robust features. Its general nature and effectiveness suggest that it should be possible to obtain similar results for more complex datasets and different modalities - which opens up exciting areas for further research! Can't wait to hear your feedback! https://arxiv.org/abs/2403.09863 submitted by /u/Swarzkopf314 [link] [comments]
    Identify Increase Cost [D]
    Hey everyone! I’m new to this field and am currently beginning my journey. A recent problem I’ve been thinking of is how do organizations use machine learning to identify cost increase? In this specific case it would be for surgical supplies in comparison to previous year but it could be applied to probably most industries. It’s pretty simple to pull the raw data an do analysis to see what supplies/implants are costing more but this usually happens after the issue has really become prevalent. The costs are typically stratified by procedure type, surgeon and manufacturer. Early detection of this could potentially save a lot of money so some ROI seems possible. I’m guessing some kind of anomaly/outlier detection would be one approach. What‘s everyone’s thoughts on possible models to use or what other approaches to this problem would you have? TIA! submitted by /u/avb0101 [link] [comments]
    [P] How I found 8 bugs in Google's Gemma 6T token model
    Hey r/MachineLearning! Maybe you might have seen me post on Twitter, but I'll just post here if you don't know about 8 bugs in multiple implementations on Google's Gemma :) The fixes should already be pushed into HF's transformers main branch, and Keras, Pytorch Gemma, vLLM should have gotten the fix :) https://github.com/huggingface/transformers/pull/29402 By comparing 5 implementations, I found the following issues: Must add or else losses will be very high. There’s a typo for model in the technical report! sqrt(3072)=55.4256 but bfloat16 is 55.5. Layernorm (w+1) must be in float32. Keras mixed_bfloat16 RoPE is wrong. RoPE is sensitive to y*(1/x) vs y/x. RoPE should be float32 - already pushed to transformers 4.38.2. GELU should be approx tanh not exact. Adding all th…
    [P] [R] A Curated and Harmonized Multi-study Vaginal Microbiome in Pregnancy Dataset suitable for use for training and validation of AI/ML models
    Hello! I am the lead author on a recent crowdsourced AI/ML study to identify pregnancies at risk for preterm and early-preterm birth from vaginal microbiome data, and I wanted to share the curated training and validation dataset of vaginal microbiome data with the broader AI/ML community. I think this may be a valuable resource to those of you curious about training models on microbiome data. I am also glad to answer your questions about how these data were assembled, harmonized, and the biological meaning of the various features provided—as well as how to do a similar harmonization for other 16S rRNA gene based microbiome data (there are over 200,000 specimens sequenced in public databases). The data set is comprised of 7 independently conducted studies of the vaginal microbiome during pregnancy, spanning 764 pregnancies and 2226 specimens; additional specimens are available pending approval via dbGAP. The critical metadata (gestational week of collection; gestational week of delivery; NIH racial / ethnic categories when available, etc.) are curated for each specimen. There are significant challenges in the harmonization of microbiome data, particularly data like these, generated from amplification of a variable region of the 16S rRNA gene where each study is using a different technical approach. Assembly of this dataset was based on a novel harmonization approach from my research group that also may be of interest to those of you seeking to work with microbiome data. tldr: Here is some rigorously harmonized vaginal microbiome data from pregnancy. Use the phylotype tables (counts or relative abundance) at 0.5 distance if you just want one matrix to start with. submitted by /u/golob [link] [comments]
    [D] Harvard MS in Data Science vs Columbia MS in CS vs University of Southern California MS in CS (AI)
    I was accepted into those three programs, and I'm trying to decide which one I should attend. I want to work as an ML engineer or Applied Scientist. My goal is to have strong foundations in AI/ML as well as very strong engineering skills. Which program do you think is better for that? One thing that I also really like about Harvard is the option to cross-register for classes at MIT. I think it is a big plus to be exposed to both communities. Plus, I would like to start a startup during or right after the degree, so the connections I make are also very important. For now I am not considering tuition and other expenses, just the program quality and university. View Poll submitted by /u/elhd95 [link] [comments]
    [D] What's the difference between "Data Science" and "Intelligence and Data"?
    I researched a little bit and couldn't find too much. A lot of my study courses that fall under intelligence and data also fall under data science. But "Machine Learning for Biometrics" falls under I&D and not DS while some courses like ML is categorized into both the majors. So, this could be a university thing. No idea why they have the categorization, I'll find out from the course coordinator. If you know why they're categorized this way, it'd be helpful! My question is what are the career options for I&D and how does it differ from DS? Thanks for the help! submitted by /u/plushythingy [link] [comments]
    [P] Let's Build AI - Git repo sharing resources, tools, and knowledge for AI enthusiasts and developers.
    Stumbled upon this AI tool listing repo this morning. I found a few tools and individuals that I hadn't discovered before. submitted by /u/iamjessew [link] [comments]
    [P] Anomaly detection in time series
    I am working on an anomaly detection project. My dataset is about monthly behavior of customers represented by three numbers. So for each customer, i have a row in dataset with three columns for each month. And finally i have target label True or False. ​ Cust No JanA JanB JanC FebA FebB FebC ... DecA DecB DecC Anomaly 1 12 1 231 12 23 211 123 2 12 False 2 23 1 23 213 13 212 23 2 23 False 3 2323 23 23 3 232 3 0 223 32 True This should be a time-series classification problem, but my time-series are in columns. On top of that, some of them are grouped by month and i am not sure whether i need to encode that somehow. If you guys have any idea how to approach this problem please enlighten me. submitted by /u/monter72 [link] [comments]
    [P] Annotated Mamba: The Hard Way
    Link:https://srush.github.io/annotated-mamba/hard.html Code: https://github.com/srush/annotated-mamba From the author: This blog is about Mamba a recent neural architecture that can be roughly thought of as a modern recurrent neural network (RNN). The model works really well and is a legitimate competitor with the ubiquitous Transformer architecture. It has gotten a lot of attention. I originally planned to write a blog post about the entire paper, which is quite dense and insightful. However I become fascinated just by the S6 algorithm as described here. This algorithm describes how one can compute an extremely large RNN efficiently on modern hardware, and extends ideas explored in S4 and S5 from recent years. In fact, if I am being honest, though, I actually only got as far as this single line of the algorithm: y = SSM(A, B, C)(x) # Time-varying: recurrence(scan) only This line is interesting enough that I thought, hey shouldn’t anyone be able to understand why this scan is fast in practice? Turns out this is a bit tricky. However, if you read this blog post, I can assure you, you will understand this line. (Perhaps more than you would ever want). Part 0: Triton Part 1: Cumulative Sums Part 2: Exponential Moving Average Part 3: Getting Derivatives Part 4: Multiple at once Part 5: Mamba ​ submitted by /u/ghosthamlet [link] [comments]
    [D] How to use quantized LLM with Microsoft LIDA?
    Microsoft LIDA is an Automatic Generation of Visualizations and Infographics using LLMs. I have seen people using it with non-quantized LLMs. But in my case, I'm using Nous-Hermes-llama-2-7b from HF and I can't find a way to load the LLM in quantized state using the llm API in LIDA. Can this be done? ​ p.s. I'm a newbie submitted by /u/Stoner_Black_69 [link] [comments]
    [D] NVIDIA GTC 2024 announcements
    NVIDIA's plan has spread across accelerated computing, generative AI, industry applications, automotive, enterprise platforms, Omniverse, and robotics. Some of the most interesting are: DRIVE Thor: an in-vehicle computing platform for generative AI applications in self-driving cars. With up to a thousand trillion operations per second, it enhances safety for autonomous driving and enables interactive conversations with the vehicle. Omniverse: a digital twin ecosystem blending physical and virtual worlds to help industries simulate, optimize, and execute operations more efficiently. The new Omniverse Cloud APIs expand these capabilities, benefiting sectors like automotive and robotics. Project GR00T: a foundation model for humanoid robots driving breakthroughs in robotics and AI. Also, unveiled Jetson Thor computer and upgrades to the NVIDIA Isaac™ robotics platform with generative AI models and simulation tools. Nvidia Blackwell GPU: a cutting-edge advancement designed to power the next generation of AI with 20 petaflops of performance. This GPU represents a quantum leap in AI capabilities, aiming to democratize access to trillion-parameter models. NVLink Switch 7.2 TI: a new generation interconnect technology, which addresses the bottleneck of data exchange. It is designed to facilitate communication between GPUs at a scale suitable for the most advanced AI models. NVIDIA NIM: a new software product aimed at simplifying the deployment of generative AI within enterprise environments. It packages models with optimized inference engines and supports a wide range of GPU architectures. They call it an AI package for all. What's your favorite one? submitted by /u/vvkuka [link] [comments]
    NVIDIA Blackwell Platform Arrives to Power a New Era of Computing [N]
    https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing The GB200 NVL72 provides up to a 30x performance increase compared to the same number of NVIDIA H100 Tensor Core GPUs for LLM inference workloads, and reduces cost and energy consumption by up to 25x. submitted by /u/we_are_mammals [link] [comments]
    [D] simple typos or errors in my understanding?!
    Though I mailed to all authors, I couldn't receive any reply... ​ I really enjoyed reading paper, “Compositional Visual Generation with Energy Based Models” (NeurIPS`20; https://arxiv.org/abs/2004.06030). While I am following the idea in the paper, I am facing a problem in understanding the equation (8) in the paper. According to the paper's notation, the EBM is defined as (1): Eq. (1) i.e.,, the energy function, E_\theta(x), is defined as the negative term on top of the exponent. ​ In eq. (3) below, authors defines SGLD step for acquiring samples from any energy based model (EBM). Next, according to 'concept disjunction' proposed by authors, the union-ized probability densities are defined in (7), in the form of the exponential of log-sum-exp of different EBMs' energy functions. https://preview.redd.it/sas3f4yg58pc1.png?width=1005&format=png&auto=webp&s=2bb0aa67cf2767ede091102ce22965e8c8bdbe7c From (7), I understood that the unionized energy function is accordingly -logsumexp(-E(x|c1), ...), thus corresponding SGLD should have plus sign. However, in the paper, authors stated the SGLD with minus sign, and I've lost... After looking through the official code implementation, (https://github.com/yilundu/ebm_compositionality/blob/c7ac54366d2d5a15f71871448bd720bf5b3eb82d/celeba_combine.py#L150C9-L150C32) I found that the combined energy function derivation of mine ( E(x|\cup_i c_i) ) seems to be correct, according to the assignment of the `e_pos` variable in the code. ​ Still, I'd like to acquire some clarification on this flow. Can anybody skim through this post and check my thought? Thank you in advance! submitted by /u/vaseline555 [link] [comments]
  • Open

    OpenAI is expected to release a 'materially better' GPT-5 for its chatbot mid-year, sources say
    submitted by /u/thisisinsider [link] [comments]
    Microsoft's AI News Today
    Anyone else not excited by the moves Microsoft is making right now? Gobbling up most of Inflection's team and almost certainly putting a death sentence on Pi and recruiting a DeepMind co-founder? Between this and OpenAI's board we need better AI competition out there as the power is already beginning to consolidate. (yes we do have Apple possibly partnering with Google to power the AI in their hardware but still that leaves us with Microsoft/OpenAI/Inflection vs. Google/Apple/Anthropic? We gotta do better....) submitted by /u/HateMakinSNs [link] [comments]
    DeepMind and Liverpool FC develop AI to advise on football tactics. Do you think in 5 years all sports teams are going to be strategizing based on what an AI says?
    submitted by /u/dead_planets_society [link] [comments]
    One more New Threat from LLMs you can worry about.
    Username made this post on r/aliens about his prompt to ChatGPT to make a story about interdimensional beings. My professional job for many years was to create and write interesting stories I'm MA of dramaturgy. So I made a detailed review of LLM’s story in the comments to that post. But it made me think about something interesting I want to share with you. There is a kind of “new” threat from LLMs to humanity that we are aware of as of now. What are the main popular fears about AI? 1. It takes jobs from humans 2. It can gain too much control over our life 3. It might become too “wise” or crazy and bring harm to humans in the future. Yes, those are natural and legit fears. But there is one more that I never thought about before. And it is really scary. At some point LLM might make very…
    My boss disparages me for using AI
    I'm very transparent with my boss (c-suite) about my use of GPT. I'm a strategic executive. My work product and productivity has improved by leaps and bounds with the help of AI applications. My C-boss condescends on anything he thinks I've used GPT to create. I published something to the team, and he said, "Did GPT do this for you?" My response was: "why on earth would I NOT use these tools? The lowest common denominator is now GPT-4-Turbo (ok, Claude Opus, perhaps). Of course, I used it!" Often, I need to read regulatory documents that are 100's of pages and very dense. I chunk them by section and have an assistant summarize them with an API call to my OAI Dev account (crude RAG because I haven't figured out how to build a RAG application YET). I spend an hour crafting a summary instead of 16 hours, give or take, and spend $4 on OAI API calls. I also create Python scripts to analyze proprietary data using GPT and create visualizations that our BI unit doesn't have the bandwidth to play with. I've made discoveries with 9-figure implications (we have taken action on these discoveries). Finally, I am regularly called upon to write internal white papers on strategic topics. For the past six months, I've used GPT as a writing assistant, and now, I also use Perplexity as a resource. AI helps me organize, locate source material, and write a first draft. The last mile is often half of the work when writing these papers, and that is still an agonizing part of the process, heavily dependent on my specific knowledge and experience (domain expertise). Is anyone else experiencing this? How do you handle it? ​ View Poll submitted by /u/knob-0u812 [link] [comments]
    When will AI be able to help us create a document and then iterate, add and improve as you go on the fly?
    This is what I really want. Sure I can ask gpt to write something but it's not great at editing or changing. I'd love to be able to have a conversation with it like I would if I was sitting beside a writer. Is there anything like this yet or might it be coming soon? submitted by /u/zascar [link] [comments]
    Gemini now acts as a PC [Gemini mini PC]
    submitted by /u/Capital-Algae3377 [link] [comments]
    One-Minute Daily AI News 3/18/2024
    Nvidia CEO Jensen Huang announces new AI chips: ‘We need bigger GPUs’. The new AI graphics processors are named Blackwell and are expected to ship later this year.[1] Apple Inc. is in talks to build Google’s Gemini artificial intelligence engine into the iPhone, according to people familiar with the situation, setting the stage for a blockbuster agreement that would shake up the AI industry.[2] NVIDIA Announces Project GR00T Foundation Model for Humanoid Robots and Major Isaac Robotics Platform Update.[3] NVIDIA creates Earth-2 digital twin: generative AI to simulate, visualize weather and climate.[4] Sources: [1] https://www.cnbc.com/2024/03/18/nvidia-announces-gb200-blackwell-ai-chip-launching-later-this-year.html [2] https://www.bloomberg.com/news/articles/2024-03-18/apple-in-talks-to-license-google-gemini-for-iphone-ios-18-generative-ai-tools [3] https://nvidianews.nvidia.com/news/foundation-model-isaac-robotics-platform [4] https://www.tweaktown.com/news/96955/nvidia-creates-earth-2-digital-twin-generative-ai-to-simulate-visualize-weather-and-climate/index.html submitted by /u/Excellent-Target-847 [link] [comments]
    are there any ais that can colorize black and white manga/doujin pages?
    some of my favorite ones are in black and white, and i always see them just as that, i can never find a colored version, so is there any kind of ai or anything that can colorize stuff like that? submitted by /u/Fazbearnit3 [link] [comments]
    Tech giant Nvidia unveils higher performing ‘superchips’ to power AI | Technology
    submitted by /u/Low_Complaint2254 [link] [comments]
    The New Laws of Robitics
    From https://idais.ai/ Autonomous Replication or Improvement No AI system should be able to copy or improve itself without explicit human approval and assistance. This includes both exact copies of itself as well as creating new AI systems of similar or greater abilities. Power Seeking No AI system should take actions to unduly increase its power and influence. Assisting Weapon Development No AI systems should substantially increase the ability of actors to design weapons of mass destruction, or violate the biological or chemical weapons convention. Cyberattacks No AI system should be able to autonomously execute cyberattacks resulting in serious financial losses or equivalent harm. Deception No AI system should be able to consistently cause its designers or regulators to misunderstand its likelihood or capability to cross any of the preceding red lines. submitted by /u/anxiety617 [link] [comments]
  • Open

    ScreenAI: A visual language model for UI and visually-situated language understanding
    Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), that offer an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge. To that end, we introduce “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”. ScreenAI improves upon the PaLI architecture wit…  ( 92 min )
    SCIN: A new resource for representative dermatology images
    Posted by Pooja Rao, Research Scientist, Google Research Health datasets play a crucial role in research and medical education, but it can be challenging to create a dataset that represents the real world. For example, dermatology conditions are diverse in their appearance and severity and manifest differently across skin tones. Yet, existing dermatology image datasets often lack representation of everyday conditions (like rashes, allergies and infections) and skew towards lighter skin tones. Furthermore, race and ethnicity information is frequently missing, hindering our ability to assess disparities or create solutions. To address these limitations, we are releasing the Skin Condition Image Network (SCIN) dataset in collaboration with physicians at Stanford Medicine. We designed…  ( 92 min )
  • Open

    "Identifying general reaction conditions by bandit optimization", Wang et al 2024
    submitted by /u/gwern [link] [comments]
    Question about Stables Baselines 3
    Will I be able to use Stable Baselines3 for a purely numerical reinforced learning project? Or is it only tailored to image based projects? submitted by /u/spacecowboyalien [link] [comments]
    Looking for some Baseline Inverse RL Environments to Tinker with
    Hello All, I'm doing some analytic research into feature importance in some rl environments. I wanted to see how the importance value of features changes when a model is trained on an engineered reward function and some of the new inverse learned reward functions. Want to dip my toes, so I'm looking for any github repos / packages you all have been checking out I can see about spinning up on my computer Thanks in advance! submitted by /u/bbri826 [link] [comments]
    Looking for simple, adversary, scalable environment for multiple agents
    Hi, I'm looking environment that allows the: 1. Controlling policy of multiple agents (does not have to support multiple agent training at once) Is relatively simple - simpler than chess, go etc Observation shape is relatively small - smaller than atari games Its possible to change obseraction shape, and preferably number of agents. ​ An example of such an environment would be something similar petting-zoo connect-four/tic-tac-toe but would allow to control number of agents or number board size (eg. make tic tac toe board 10x10) I know that it could be done by writing custom env, but I would prefer spending more time on testing different algorithms and impact of number of agents and observation size on them. An example of such an environment would be something similar petting-zoo connect-four/tic-tac-toe but would allow to control the number of agents or number board size (eg. make tic tac toe board 10x10) Thanks for all suggestions! submitted by /u/MrCogito_hs [link] [comments]
    Why DreamerV3 uses the actor-critic models
    Why DreamerV3 uses the actor-critic model instead of more 'advannced' model, e.g., PPO. Or am I missed something? https://preview.redd.it/0nq1xqrqhbpc1.png?width=1194&format=png&auto=webp&s=afe3bf6ea33749cde43291c75fae5632a58a1335 submitted by /u/yulinzxc [link] [comments]
    Revisiting Recurrent Reinforcement Learning with Memory Monoids
    submitted by /u/smorad [link] [comments]
    Literature advice for "Kalman -> model-based RL"
    Do any of you know a good book that draws the broad lines from signal processing via Kalman filters, system identification to world models in model-based RL? submitted by /u/carlowilhelm [link] [comments]
    Understanding gymnasium.make() for custom mujoco env's
    Hi im trying to train a RL using a custom environment written in XML for MuJoCo. Im using python 3.10.10 on mac 14.3 with an intel processor. The issue im facing is that when i try to initiate the env with gymnasium.make(file.xml) i only get errors. Now, i know that that bit of code is wrong syntax for that function, what i want to say is that i don't understand the syntax for this process. The struggle is that i can't seem to figure out from the documentation how im supposed to create a gymnasium env from my xml file. What i've seen is that it takes a class or and id str as input, and while this bit of code for a predefined env works: env = gym.make("InvertedPendulum-v4") This won't (as i understand it the v4 is the version of mujoco used): env = gym.make("MyEnv-v4") Where "MyEnv …
    Reinforcement Learning Library PyTorch
    Hi, I'm new in the field of reinforcement learning and currently searching for a suitable library to implement my deep reinforcement learning research project. It should be a Python library and use PyTorch for the deep learning integration. I would like to run multiple environments in parallel on (multiple) GPU. As my environment will contain a Neural Network it should be able to directly run on the GPU and work with tensors. Any experiences about suitable libraries. So far I have found two interesting libraries TorchRL TorchRL — torchrl main documentation (pytorch.org) The Autonomous Learning Library The Autonomous Learning Library — autonomous-learning-library 0.9.1 documentation I would like to hear your opinion and about your experiences. Thanks submitted by /u/Opposite_Youth_442 [link] [comments]
    Custom training loop in stable baselines3?
    I am using stable baselines3, and am wondering how to implement the following situation. I want to use the PPO agent: I have an environment with x possible locations to do an action. Here the number x is a variable, depending on the environment state. At each location, there is a set of possible actions. E.g. an analog to my environment could be a game of checkers, where the number of pieces will change, but for each piece the rules are set. My environment has periodic boundary conditions, meaning that something leaving the board on the left side enters on the right, and something leaving on the top enters on the bottom. I want to implement a training loop where I offer different perspectives, i.e. one observation for each location x, where x gets centered. I then want the agent to take an action on the location for which the action probability is the highest. Is there a way that I can alter the training loop to get these probabilities, and then choose an action? So I would have to adjust the code in between getting the observation, and choosing the action. In a way I want to impose rules on which action should be chosen/ on which location x the agent should do an action. What makes it extra difficult for me is that the number of locations x varies If you have other ideas on how to solve this learning problem, please let me know! :) submitted by /u/ilse1301 [link] [comments]
    Advice for an Australian with an interest in RL
    Hi there, I'm an Australian currently studying a bachelor of CS in Melbourne at RMIT. Been very interested in going into DL and more specifically RL, and as such I intend to start a Masters of AI next year and follow that with a PHD in DL afterwards. I really would like to get into research at a large company and help push the envelope as much as possible, but I'm unsure of my job/research prospects in Australia. Additionally I'm unsure if a PHD at Monash University of UniMelb is going to give me good enough credentials to get where I would like to be. I'm really just wondering if it's best for me to study here and find research here, or if it would be better to go overseas, and if so at what stage (before PHD or after, etc). Any advice would be much appreciated. submitted by /u/TrueExcaliburGaming [link] [comments]
  • Open

    Natural language instructions induce compositional generalization in networks of neurons
    submitted by /u/nickb [link] [comments]
    Pixel perfect: Engineers' new approach brings images into focus
    submitted by /u/keghn [link] [comments]
    How do neural networks learn? A mathematical formula explains how they detect relevant patterns
    submitted by /u/nickb [link] [comments]
  • Open

    Secure by Design: NVIDIA AIOps Partner Ecosystem Blends AI for Businesses
    In today’s complex business environments, IT teams face a constant flow of challenges, from simple issues like employee account lockouts to critical security threats. These situations demand both quick fixes and strategic defenses, making the job of maintaining smooth and secure operations ever tougher. That’s where AIOps comes in, blending artificial intelligence with IT operations Read Article  ( 8 min )
    Climate Pioneers: 3 Startups Harnessing NVIDIA’s AI and Earth-2 Platforms
    To help mitigate climate change — one of humanity’s greatest challenges — researchers are turning to AI and sustainable computing to accelerate and operationalize their work. At this week’s NVIDIA GTC global AI conference, startups, enterprises and scientists are highlighting their environmental sustainability initiatives and latest climate innovations. Many are using NVIDIA Earth-2, a full-stack, Read Article  ( 6 min )
    NVIDIA Celebrates Americas Partners Driving AI-Powered Transformation
    NVIDIA recognized 14 partners in the Americas for their achievements in transforming businesses with AI, this week at GTC. The winners of the NVIDIA Partner Network Americas Partner of the Year awards have helped customers across industries advance their operations with the software, systems and services needed to integrate AI into their businesses. NPN awards Read Article  ( 7 min )
    NVIDIA, Huang Win Top Honors in Innovation, Engineering
    NVIDIA today was named the world’s most innovative company by Fast Company magazine. The accolade comes on the heels of company founder and CEO Jensen Huang being inducted into the U.S. National Academy of Engineering. A team of several dozen journalists at Fast Company — a business media brand launched in 1995 by two Harvard Read Article  ( 6 min )
    Generation Sensation: New Generative AI and RTX Tools Boost Content Creation
    Creators are getting a generative AI boost with tools announced at NVIDIA GTC, a global AI conference bringing together the brightest minds in AI content creation and accelerated computing.  ( 6 min )
  • Open

    Intelligent monitoring: Towards AI-assisted monitoring for cloud services
    Integrating AI into cloud service monitoring improves incident detection accuracy, reduces unnecessary alerts, and enhances overall system reliability. This helps organizations better align with business goals and increase customer satisfaction. The post Intelligent monitoring: Towards AI-assisted monitoring for cloud services appeared first on Microsoft Research.  ( 12 min )
  • Open

    Unlock the potential of generative AI in industrial operations
    In this post, multi-shot prompts are retrieved from an embedding containing successful Python code run on a similar data type (for example, high-resolution time series data from Internet of Things devices). The dynamically constructed multi-shot prompt provides the most relevant context to the FM, and boosts the FM’s capability in advanced math calculation, time series data processing, and data acronym understanding. This improved response facilitates enterprise workers and operational teams in engaging with data, deriving insights without requiring extensive data science skills.  ( 14 min )
    Enhance performance of generative language models with self-consistency prompting on Amazon Bedrock
    With the batch inference API, you can use Amazon Bedrock to run inference with foundation models in batches and get responses more efficiently. This post shows how to implement self-consistency prompting via batch inference on Amazon Bedrock to enhance model performance on arithmetic and multiple-choice reasoning tasks.  ( 14 min )
  • Open

    What is the significance of color in data visualization?
    Colors not only make things beautiful around us, but they are also an effective method for describing something. People find psychological associations with color. For example, it is said that red signifies power, love, and anger. Blue denotes calmness and logic. Other shades of primary colors convey different meanings and emotions. When two or more… Read More »What is the significance of color in data visualization? The post What is the significance of color in data visualization? appeared first on Data Science Central.  ( 20 min )
    The EU’s AI act: A measured approach to innovation and regulation
    AI regulatory measures often stir mixed reactions. We reached a regulatory milestone this week- with the final passing of the AI ACT in the European Union.  The locus of emphasis has shifted elsewhere now – specifically to oversight approaches and appointments from countries and to the AI Office. The AI Act divides machine learning systems into… Read More »The EU’s AI act: A measured approach to innovation and regulation The post The EU’s AI act: A measured approach to innovation and regulation appeared first on Data Science Central.  ( 20 min )
  • Open

    How can Philosophy help in our Modern World?
    In our modern world, many believe that philosophy is outdated, that we have arrived at science, which can solve all of our problems… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 4 min )
  • Open

    Understanding the limitations of self-supervised learning for tabular anomaly detection
    arXiv:2309.08374v3 Announce Type: replace Abstract: While self-supervised learning has improved anomaly detection in computer vision and natural language processing, it is unclear whether tabular data can benefit from it. This paper explores the limitations of self-supervision for tabular anomaly detection. We conduct several experiments spanning various pretext tasks on 26 benchmark datasets to understand why this is the case. Our results confirm representations derived from self-supervision do not improve tabular anomaly detection performance compared to using the raw representations of the data. We show this is due to neural networks introducing irrelevant features, which reduces the effectiveness of anomaly detectors. However, we demonstrate that using a subspace of the neural network's representation can recover performance.  ( 2 min )
    DeepRepViz: Identifying Confounders in Deep Learning Model Predictions
    arXiv:2309.15551v2 Announce Type: replace Abstract: Deep Learning (DL) models have gained popularity in neuroimaging studies for predicting psychological behaviors, cognitive traits, and brain pathologies. However, these models can be biased by confounders such as age, sex, or imaging artifacts from the acquisition process. To address this, we introduce 'DeepRepViz', a two-part framework designed to identify confounders in DL model predictions. The first component is a visualization tool that can be used to qualitatively examine the final latent representation of the DL model. The second component is a metric called 'Con-score' that quantifies the confounder risk associated with a variable, using the final latent representation of the DL model. We demonstrate the effectiveness of the Con-score using a simple simulated setup by iteratively altering the strength of a simulated confounder and observing the corresponding change in the Con-score. Next, we validate the DeepRepViz framework on a large-scale neuroimaging dataset (n=12000) by performing three MRI-phenotype prediction tasks that include (a) predicting chronic alcohol users, (b) classifying participant sex, and (c) predicting performance speed on a cognitive task called 'trail making'. DeepRepViz identifies sex as a significant confounder in the DL model predicting chronic alcohol users (Con-score=0.35) and age as a confounder in the model predicting cognitive task performance (Con-score=0.3). In conclusion, the DeepRepViz framework provides a systematic approach to test for potential confounders such as age, sex, and imaging artifacts and improves the transparency of DL models for neuroimaging studies.  ( 3 min )
    Data-Driven Distributionally Robust Safety Verification Using Barrier Certificates and Conditional Mean Embeddings
    arXiv:2403.10497v1 Announce Type: cross Abstract: Algorithmic verification of realistic systems to satisfy safety and other temporal requirements has suffered from poor scalability of the employed formal approaches. To design systems with rigorous guarantees, many approaches still rely on exact models of the underlying systems. Since this assumption can rarely be met in practice, models have to be inferred from measurement data or are bypassed completely. Whilst former usually requires the model structure to be known a-priori and immense amounts of data to be available, latter gives rise to a plethora of restrictive mathematical assumptions about the unknown dynamics. In a pursuit of developing scalable formal verification algorithms without shifting the problem to unrealistic assumptions, we employ the concept of barrier certificates, which can guarantee safety of the system, and learn the certificate directly from a compact set of system trajectories. We use conditional mean embeddings to embed data from the system into a reproducing kernel Hilbert space (RKHS) and construct an RKHS ambiguity set that can be inflated to robustify the result w.r.t. a set of plausible transition kernels. We show how to solve the resulting program efficiently using sum-of-squares optimization and a Gaussian process envelope. Our approach lifts the need for restrictive assumptions on the system dynamics and uncertainty, and suggests an improvement in the sample complexity of verifying the safety of a system on a tested case study compared to a state-of-the-art approach.  ( 3 min )
    Counterfactual Image Editing
    arXiv:2403.09683v1 Announce Type: cross Abstract: Counterfactual image editing is an important task in generative AI, which asks how an image would look if certain features were different. The current literature on the topic focuses primarily on changing individual features while remaining silent about the causal relationships between these features, as present in the real world. In this paper, we formalize the counterfactual image editing task using formal language, modeling the causal relationships between latent generative factors and images through a special type of model called augmented structural causal models (ASCMs). Second, we show two fundamental impossibility results: (1) counterfactual editing is impossible from i.i.d. image samples and their corresponding labels alone; (2) even when the causal relationships between the latent generative factors and images are available, no guarantees regarding the output of the model can be provided. Third, we propose a relaxation for this challenging problem by approximating non-identifiable counterfactual distributions with a new family of counterfactual-consistent estimators. This family exhibits the desirable property of preserving features that the user cares about across both factual and counterfactual worlds. Finally, we develop an efficient algorithm to generate counterfactual images by leveraging neural causal models.  ( 2 min )
    SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training
    arXiv:2310.02227v3 Announce Type: replace Abstract: In an era where symbolic mathematical equations are indispensable for modeling complex natural phenomena, scientific inquiry often involves collecting observations and translating them into mathematical expressions. Recently, deep learning has emerged as a powerful tool for extracting insights from data. However, existing models typically specialize in either numeric or symbolic domains, and are usually trained in a supervised manner tailored to specific tasks. This approach neglects the substantial benefits that could arise from a task-agnostic multi-modal understanding between symbolic equations and their numeric counterparts. To bridge the gap, we introduce SNIP, a Symbolic-Numeric Integrated Pre-training model, which employs contrastive learning between symbolic and numeric domains, enhancing their mutual similarities in the embeddings. By performing latent space analysis, we observe that SNIP provides cross-domain insights into the representations, revealing that symbolic supervision enhances the embeddings of numeric data and vice versa. We evaluate SNIP across diverse tasks, including symbolic-to-numeric mathematical property prediction and numeric-to-symbolic equation discovery, commonly known as symbolic regression. Results show that SNIP effectively transfers to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in the low data regime scenarios where available data is limited. Code and model are available at: https://github.com/deep-symbolic-mathematics/Multimodal-Math-Pretraining  ( 3 min )
    Architecture Analysis and Benchmarking of 3D U-shaped Deep Learning Models for Thoracic Anatomical Segmentation
    arXiv:2402.03230v2 Announce Type: replace-cross Abstract: Recent rising interests in patient-specific thoracic surgical planning and simulation require efficient and robust creation of digital anatomical models from automatic medical image segmentation algorithms. Deep learning (DL) is now state-of-the-art in various radiological tasks, and U-shaped DL models have particularly excelled in medical image segmentation since the inception of the 2D UNet. To date, many variants of U-shaped models have been proposed by the integration of different attention mechanisms and network configurations. Systematic benchmark studies which analyze the architecture of these models by leveraging the recent development of the multi-label databases, can provide valuable insights for clinical deployment and future model designs, but such studies are still rare. We conduct the first systematic benchmark study for variants of 3D U-shaped models (3DUNet, STUNet, AttentionUNet, SwinUNETR, FocalSegNet, and a novel 3D SwinUnet with four variants) with a focus on CT-based anatomical segmentation for thoracic surgery. Our study systematically examines the impact of different attention mechanisms, the number of resolution stages, and network configurations on segmentation accuracy and computational complexity. To allow cross-reference with other recent benchmarking studies, we also included a performance assessment of the BTCV abdominal structural segmentation. With the STUNet ranking at the top, our study demonstrated the value of CNN-based U-shaped models for the investigated tasks and the benefit of residual blocks in network configuration designs to boost segmentation performance.  ( 3 min )
    Traveling Waves Encode the Recent Past and Enhance Sequence Learning
    arXiv:2309.08045v2 Announce Type: replace-cross Abstract: Traveling waves of neural activity have been observed throughout the brain at a diversity of regions and scales; however, their precise computational role is still debated. One physically inspired hypothesis suggests that the cortical sheet may act like a wave-propagating system capable of invertibly storing a short-term memory of sequential stimuli through induced waves traveling across the cortical surface, and indeed many experimental results from neuroscience correlate wave activity with memory tasks. To date, however, the computational implications of this idea have remained hypothetical due to the lack of a simple recurrent neural network architecture capable of exhibiting such waves. In this work, we introduce a model to fill this gap, which we denote the Wave-RNN (wRNN), and demonstrate how such an architecture indeed efficiently encodes the recent past through a suite of synthetic memory tasks where wRNNs learn faster and reach significantly lower error than wave-free counterparts. We further explore the implications of this memory storage system on more complex sequence modeling tasks such as sequential image classification and find that wave-based models not only again outperform comparable wave-free RNNs while using significantly fewer parameters, but additionally perform comparably to more complex gated architectures such as LSTMs and GRUs.  ( 3 min )
    Distilling Knowledge for Short-to-Long Term Trajectory Prediction
    arXiv:2305.08553v3 Announce Type: replace-cross Abstract: Long-term trajectory forecasting is an important and challenging problem in the fields of computer vision, machine learning, and robotics. One fundamental difficulty stands in the evolution of the trajectory that becomes more and more uncertain and unpredictable as the time horizon grows, subsequently increasing the complexity of the problem. To overcome this issue, in this paper, we propose Di-Long, a new method that employs the distillation of a short-term trajectory model forecaster that guides a student network for long-term trajectory prediction during the training process. Given a total sequence length that comprehends the allowed observation for the student network and the complementary target sequence, we let the student and the teacher solve two different related tasks defined over the same full trajectory: the student observes a short sequence and predicts a long trajectory, whereas the teacher observes a longer sequence and predicts the remaining short target trajectory. The teacher's task is less uncertain, and we use its accurate predictions to guide the student through our knowledge distillation framework, reducing long-term future uncertainty. Our experiments show that our proposed Di-Long method is effective for long-term forecasting and achieves state-of-the-art performance on the Intersection Drone Dataset (inD) and the Stanford Drone Dataset (SDD).  ( 3 min )
    Unprocessing Seven Years of Algorithmic Fairness
    arXiv:2306.07261v5 Announce Type: replace Abstract: Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation.  ( 2 min )
    Cross-domain Random Pre-training with Prototypes for Reinforcement Learning
    arXiv:2302.05614v2 Announce Type: replace Abstract: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Task-agnostic cross-domain pre-training shows great potential in image-based Reinforcement Learning (RL) but poses a big challenge. In this paper, we propose CRPTpro, a Cross-domain self-supervised Random Pre-Training framework with prototypes for image-based RL. CRPTpro employs cross-domain random policy to easily and quickly sample diverse data from multiple domains, to improve pre-training efficiency. Moreover, prototypical representation learning with a novel intrinsic loss is proposed to pre-train an effective and generic encoder across different domains. Without finetuning, the cross-domain encoder can be implemented for challenging downstream visual-control RL tasks defined in different domains efficiently. Compared with prior arts like APT and Proto-RL, CRPTpro achieves better performance on cross-domain downstream RL tasks without extra training on exploration agents for expert data collection, greatly reducing the burden of pre-training. Experiments on DeepMind Control suite (DMControl) demonstrate that CRPTpro outperforms APT significantly on 11/12 cross-domain RL tasks with only 39% pre-training hours, becoming a state-of-the-art cross-domain pre-training method in both policy learning performance and pre-training efficiency. The complete code will be released at https://github.com/liuxin0824/CRPTpro.  ( 3 min )
    Price-Discrimination Game for Distributed Resource Management in Federated Learning
    arXiv:2308.13838v5 Announce Type: replace Abstract: In vanilla federated learning (FL) such as FedAvg, the parameter server (PS) and multiple distributed clients can form a typical buyer's market, where the number of PS/buyers of FL services is far less than the number of clients/sellers. In order to improve the performance of FL and reduce the cost of motivating clients to participate in FL, this paper proposes to differentiate the pricing for services provided by different clients rather than simply providing the same service pricing for different clients. The price is differentiated based on the performance improvements brought to FL and their heterogeneity in computing and communication capabilities. To this end, a price-discrimination game (PDG) is formulated to comprehensively address the distributed resource management problems in FL, including multi-objective trade-off, client selection, and incentive mechanism. As the PDG is a mixed-integer nonlinear programming (MINLP) problem, a distributed semi-heuristic algorithm with low computational complexity and low communication overhead is designed to solve it. The simulation result verifies the effectiveness of the proposed approach.  ( 2 min )
    Guess & Sketch: Language Model Guided Transpilation
    arXiv:2309.14396v2 Announce Type: replace-cross Abstract: Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.  ( 3 min )
    Transferring climate change knowledge
    arXiv:2309.14780v3 Announce Type: replace-cross Abstract: Accurate and precise climate projections are required for climate adaptation and mitigation, but Earth system models still exhibit great uncertainties. Several approaches have been developed to reduce the spread of climate projections and feedbacks, yet those methods cannot capture the non-linear complexity inherent in the climate system. Using a Transfer Learning approach, we show that Machine Learning can be used to optimally leverage and merge the knowledge gained from Earth system models simulations and historical observations to more accurately project global surface air temperature fields in the 21st century. We reach an uncertainty reduction of more than 50% with respect to state-of-the-art approaches. We give evidence that our novel method provides narrower projection uncertainty together with more accurate mean climate projections, urgently required for climate adaptation.  ( 2 min )
    Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
    arXiv:2311.14120v2 Announce Type: replace Abstract: We investigate the stationary (late-time) training regime of single- and two-layer linear underparameterized neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly underparameterized regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but are subject to an isotropic loss. For a two-layer network, we obtain the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a new source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations experience an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.  ( 2 min )
    Softmax-free Linear Transformers
    arXiv:2207.03341v3 Announce Type: replace-cross Abstract: Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of approximating the self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. As preserving the softmax operation challenges any subsequent linearization efforts. By this insight, a family of Softmax-Free Transformers (SOFT) are proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling a full self-attention matrix to be approximated under low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.  ( 3 min )
    Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
    arXiv:2401.01085v2 Announce Type: replace-cross Abstract: Natural language processing (NLP) has received unprecedented attention. While advancements in NLP models have led to extensive research into their backdoor vulnerabilities, the potential for these advancements to introduce new backdoor threats remains unexplored. This paper proposes Imperio, which harnesses the language understanding capabilities of NLP models to enrich backdoor attacks. Imperio provides a new model control experience. Demonstrated through controlling image classifiers, it empowers the adversary to manipulate the victim model with arbitrary output through language-guided instructions. This is achieved using a language model to fuel a conditional trigger generator, with optimizations designed to extend its language understanding capabilities to backdoor instruction interpretation and execution. Our experiments across three datasets, five attacks, and nine defenses confirm Imperio's effectiveness. It can produce contextually adaptive triggers from text descriptions and control the victim model with desired outputs, even in scenarios not encountered during training. The attack reaches a high success rate across complex datasets without compromising the accuracy of clean inputs and exhibits resilience against representative defenses.  ( 2 min )
    Detecting Brain Tumors through Multimodal Neural Networks
    arXiv:2402.00038v2 Announce Type: replace-cross Abstract: Tumors can manifest in various forms and in different areas of the human body. Brain tumors are specifically hard to diagnose and treat because of the complexity of the organ in which they develop. Detecting them in time can lower the chances of death and facilitate the therapy process for patients. The use of Artificial Intelligence (AI) and, more specifically, deep learning, has the potential to significantly reduce costs in terms of time and resources for the discovery and identification of tumors from images obtained through imaging techniques. This research work aims to assess the performance of a multimodal model for the classification of Magnetic Resonance Imaging (MRI) scans processed as grayscale images. The results are promising, and in line with similar works, as the model reaches an accuracy of around 98\%. We also highlight the need for explainability and transparency to ensure human control and safety.  ( 2 min )
    Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?
    arXiv:2310.09562v2 Announce Type: replace-cross Abstract: Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet's train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP's overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP's OOD performance, and other properties of the training data must drive CLIP to learn more generalizable representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION ($\frac{1}{4}$th of its original size) on which CLIP can be trained to match its original OOD performance.  ( 2 min )
    GAPS: Geometry-Aware, Physics-Based, Self-Supervised Neural Garment Draping
    arXiv:2312.01490v2 Announce Type: replace-cross Abstract: Recent neural, physics-based modeling of garment deformations allows faster and visually aesthetic results as opposed to the existing methods. Material-specific parameters are used by the formulation to control the garment inextensibility. This delivers unrealistic results with physically implausible stretching. Oftentimes, the draped garment is pushed inside the body which is either corrected by an expensive post-processing, thus adding to further inconsistent stretching; or by deploying a separate training regime for each body type, restricting its scalability. Additionally, the flawed skinning process deployed by existing methods produces incorrect results on loose garments. In this paper, we introduce a geometrical constraint to the existing formulation that is collision-aware and imposes garment inextensibility wherever possible. Thus, we obtain realistic results where draped clothes stretch only while covering bigger body regions. Furthermore, we propose a geometry-aware garment skinning method by defining a body-garment closeness measure which works for all garment types, especially the loose ones.  ( 2 min )
    Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning
    arXiv:2309.16984v2 Announce Type: replace Abstract: Score-based generative models like the diffusion model have been testified to be effective in modeling multi-modal data from image generation to reinforcement learning (RL). However, the inference process of diffusion model can be slow, which hinders its usage in RL with iterative sampling. We propose to apply the consistency model as an efficient yet expressive policy representation, namely consistency policy, with an actor-critic style algorithm for three typical RL settings: offline, offline-to-online and online. For offline RL, we demonstrate the expressiveness of generative models as policies from multi-modal data. For offline-to-online RL, the consistency policy is shown to be more computational efficient than diffusion policy, with a comparable performance. For online RL, the consistency policy demonstrates significant speedup and even higher average performances than the diffusion policy.  ( 2 min )
    Delivery Optimized Discovery in Behavioral User Segmentation under Budget Constraint
    arXiv:2402.03388v2 Announce Type: replace-cross Abstract: Users' behavioral footprints online enable firms to discover behavior-based user segments (or, segments) and deliver segment specific messages to users. Following the discovery of segments, delivery of messages to users through preferred media channels like Facebook and Google can be challenging, as only a portion of users in a behavior segment find match in a medium, and only a fraction of those matched actually see the message (exposure). Even high quality discovery becomes futile when delivery fails. Many sophisticated algorithms exist for discovering behavioral segments; however, these ignore the delivery component. The problem is compounded because (i) the discovery is performed on the behavior data space in firms' data (e.g., user clicks), while the delivery is predicated on the static data space (e.g., geo, age) as defined by media; and (ii) firms work under budget constraint. We introduce a stochastic optimization based algorithm for delivery optimized discovery of behavioral user segmentation and offer new metrics to address the joint optimization. We leverage optimization under a budget constraint for delivery combined with a learning-based component for discovery. Extensive experiments on a public dataset from Google and a proprietary dataset show the effectiveness of our approach by simultaneously improving delivery metrics, reducing budget spend and achieving strong predictive performance in discovery.  ( 3 min )
    Learning Flexible Body Collision Dynamics with Hierarchical Contact Mesh Transformer
    arXiv:2312.12467v2 Announce Type: replace Abstract: Recently, many mesh-based graph neural network (GNN) models have been proposed for modeling complex high-dimensional physical systems. Remarkable achievements have been made in significantly reducing the solving time compared to traditional numerical solvers. These methods are typically designed to i) reduce the computational cost in solving physical dynamics and/or ii) propose techniques to enhance the solution accuracy in fluid and rigid body dynamics. However, it remains under-explored whether they are effective in addressing the challenges of flexible body dynamics, where instantaneous collisions occur within a very short timeframe. In this paper, we present Hierarchical Contact Mesh Transformer (HCMT), which uses hierarchical mesh structures and can learn long-range dependencies (occurred by collisions) among spatially distant positions of a body -- two close positions in a higher-level mesh corresponds to two distant positions in a lower-level mesh. HCMT enables long-range interactions, and the hierarchical mesh structure quickly propagates collision effects to faraway positions. To this end, it consists of a contact mesh Transformer and a hierarchical mesh Transformer (CMT and HMT, respectively). Lastly, we propose a flexible body dynamics dataset, consisting of trajectories that reflect experimental settings frequently used in the display industry for product designs. We also compare the performance of several baselines using well-known benchmark datasets. Our results show that HCMT provides significant performance improvements over existing methods. Our code is available at \url{https://github.com/yuyudeep/hcmt}.  ( 3 min )
    Boosting Multitask Learning on Graphs through Higher-Order Task Affinities
    arXiv:2306.14009v4 Announce Type: replace Abstract: Predicting node labels on a given graph is a widely studied problem with many applications, including community detection and molecular graph prediction. This paper considers predicting multiple node labeling functions on graphs simultaneously and revisits this problem from a multitask learning perspective. For a concrete example, consider overlapping community detection: each community membership is a binary node classification task. Due to complex overlapping patterns, we find that negative transfer is prevalent when we apply naive multitask learning to multiple community detection, as task relationships are highly nonlinear across different node labeling. To address the challenge, we develop an algorithm to cluster tasks into groups based on a higher-order task affinity measure. We then fit a multitask model on each task group, resulting in a boosting procedure on top of the baseline model. We estimate the higher-order task affinity measure between two tasks as the prediction loss of one task in the presence of another task and a random subset of other tasks. Then, we use spectral clustering on the affinity score matrix to identify task grouping. We design several speedup techniques to compute the higher-order affinity scores efficiently and show that they can predict negative transfers more accurately than pairwise task affinities. We validate our procedure using various community detection and molecular graph prediction data sets, showing favorable results compared with existing methods. Lastly, we provide a theoretical analysis to show that under a planted block model of tasks on graphs, our affinity scores can provably separate tasks into groups.  ( 3 min )
    Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning
    arXiv:2402.02340v2 Announce Type: replace-cross Abstract: Deep Metric Learning (DML) has long attracted the attention of the machine learning community as a key objective. Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets. As a result of the success of recent pre-trained models trained from larger-scale datasets, it is challenging to adapt the model to the DML tasks in the local data domain while retaining the previously gained knowledge. In this paper, we investigate parameter-efficient methods for fine-tuning the pre-trained model for DML tasks. In particular, we propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT). Based on the conventional proxy-based DML paradigm, we augment the proxy by incorporating the semantic information from the input image and the ViT, in which we optimize the visual prompts for each class. We demonstrate that our new approximations with semantic information are superior to representative capabilities, thereby improving metric learning performance. We conduct extensive experiments to demonstrate that our proposed framework is effective and efficient by evaluating popular DML benchmarks. In particular, we demonstrate that our fine-tuning method achieves comparable or even better performance than recent state-of-the-art full fine-tuning works of DML while tuning only a small percentage of total parameters.  ( 3 min )
    A Matter of Annotation: An Empirical Study on In Situ and Self-Recall Activity Annotations from Wearable Sensors
    arXiv:2305.08752v2 Announce Type: replace-cross Abstract: Research into the detection of human activities from wearable sensors is a highly active field, benefiting numerous applications, from ambulatory monitoring of healthcare patients via fitness coaching to streamlining manual work processes. We present an empirical study that compares 4 different commonly used annotation methods utilized in user studies that focus on in-the-wild data. These methods can be grouped in user-driven, in situ annotations - which are performed before or during the activity is recorded - and recall methods - where participants annotate their data in hindsight at the end of the day. Our study illustrates that different labeling methodologies directly impact the annotations' quality, as well as the capabilities of a deep learning classifier trained with the data respectively. We noticed that in situ methods produce less but more precise labels than recall methods. Furthermore, we combined an activity diary with a visualization tool that enables the participant to inspect and label their activity data. Due to the introduction of such a tool were able to decrease missing annotations and increase the annotation consistency, and therefore the F1-score of the deep learning model by up to 8% (ranging between 82.1 and 90.4% F1-score). Furthermore, we discuss the advantages and disadvantages of the methods compared in our study, the biases they may could introduce and the consequences of their usage on human activity recognition studies and as well as possible solutions.  ( 3 min )
    Score Regularized Policy Optimization through Diffusion Behavior
    arXiv:2310.07297v3 Announce Type: replace Abstract: Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.  ( 2 min )
    From Zero to Turbulence: Generative Modeling for 3D Flow Simulation
    arXiv:2306.01776v3 Announce Type: replace-cross Abstract: Simulations of turbulent flows in 3D are one of the most expensive simulations in computational fluid dynamics (CFD). Many works have been written on surrogate models to replace numerical solvers for fluid flows with faster, learned, autoregressive models. However, the intricacies of turbulence in three dimensions necessitate training these models with very small time steps, while generating realistic flow states requires either long roll-outs with many steps and significant error accumulation or starting from a known, realistic flow state - something we aimed to avoid in the first place. Instead, we propose to approach turbulent flow simulation as a generative task directly learning the manifold of all possible turbulent flow states without relying on any initial flow state. For our experiments, we introduce a challenging 3D turbulence dataset of high-resolution flows and detailed vortex structures caused by various objects and derive two novel sample evaluation metrics for turbulent flows. On this dataset, we show that our generative model captures the distribution of turbulent flows caused by unseen objects and generates high-quality, realistic samples amenable for downstream applications without access to any initial state.  ( 2 min )
    Modeling non-genetic information dynamics in cells using reservoir computing
    arXiv:2312.07977v2 Announce Type: replace-cross Abstract: Virtually all cells use energy and ion-specific membrane pumps to maintain large transmembrane gradients of Na$^+$, K$^+$, Cl$^-$, Mg$^{++}$, and Ca$^{++}$. Although they consume up to 1/3 of a cell's energy budget, the corresponding evolutionary benefit of transmembrane ion gradients remain unclear. Here, we propose that ion gradients enable a dynamic and versatile biological system that acquires, analyzes, and responds to environmental information. We hypothesize environmental signals are transmitted into the cell by ion fluxes along pre-existing gradients through gated ion-specific membrane channels. The consequent changes of cytoplasmic ion concentration can generate a local response and orchestrate global or regional responses through wire-like ion fluxes along pre-existing and self-assembling cytoskeleton to engage the endoplasmic reticulum, mitochondria, and nucleus. Here, we frame our hypothesis through a quasi-physical (Cell-Reservoir) model that treats intra-cellular ion-based information dynamics as a sub-cellular process permitting spatiotemporally resolved cellular response that is also capable of learning complex nonlinear dynamical cellular behavior. We demonstrate the proposed ion dynamics permits rapid dissemination of response to information extrinsic perturbations that is consistent with experimental observations.  ( 3 min )
    Fast, Expressive SE$(n)$ Equivariant Networks through Weight-Sharing in Position-Orientation Space
    arXiv:2310.02970v3 Announce Type: replace Abstract: Based on the theory of homogeneous spaces we derive geometrically optimal edge attributes to be used within the flexible message-passing framework. We formalize the notion of weight sharing in convolutional networks as the sharing of message functions over point-pairs that should be treated equally. We define equivalence classes of point-pairs that are identical up to a transformation in the group and derive attributes that uniquely identify these classes. Weight sharing is then obtained by conditioning message functions on these attributes. As an application of the theory, we develop an efficient equivariant group convolutional network for processing 3D point clouds. The theory of homogeneous spaces tells us how to do group convolutions with feature maps over the homogeneous space of positions $\mathbb{R}^3$, position and orientations $\mathbb{R}^3 {\times} S^2$, and the group $SE(3)$ itself. Among these, $\mathbb{R}^3 {\times} S^2$ is an optimal choice due to the ability to represent directional information, which $\mathbb{R}^3$ methods cannot, and it significantly enhances computational efficiency compared to indexing features on the full $SE(3)$ group. We support this claim with state-of-the-art results -- in accuracy and speed -- on five different benchmarks in 2D and 3D, including interatomic potential energy prediction, trajectory forecasting in N-body systems, and generating molecules via equivariant diffusion models.  ( 3 min )
    DyST: Towards Dynamic Neural Scene Representations on Real-World Videos
    arXiv:2310.06020v2 Announce Type: replace-cross Abstract: Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.  ( 2 min )
    Finetuning Text-to-Image Diffusion Models for Fairness
    arXiv:2311.07604v2 Announce Type: replace Abstract: The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss that steers specific characteristics of the generated images towards a user-defined target distribution, and (2) adjusted direct finetuning of diffusion model's sampling process (adjusted DFT), which leverages an adjusted gradient to directly optimize losses defined on the generated images. Empirically, our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. Gender bias is significantly reduced even when finetuning just five soft tokens. Crucially, our method supports diverse perspectives of fairness beyond absolute equality, which is demonstrated by controlling age to a $75\%$ young and $25\%$ old distribution while simultaneously debiasing gender and race. Finally, our method is scalable: it can debias multiple concepts at once by simply including these prompts in the finetuning data. We share code and various fair diffusion model adaptors at https://sail-sg.github.io/finetune-fair-diffusion/.  ( 2 min )
    Intriguing Properties of Data Attribution on Diffusion Models
    arXiv:2311.00500v2 Announce Type: replace Abstract: Data attribution seeks to trace model outputs back to training data. With the recent development of diffusion models, data attribution has become a desired module to properly assign valuations for high-quality or copyrighted training samples, ensuring that data contributors are fairly compensated or credited. Several theoretically motivated methods have been proposed to implement data attribution, in an effort to improve the trade-off between computational scalability and effectiveness. In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin, in terms of both linear datamodeling score and counterfactual evaluation. Our work presents a significantly more efficient approach for attributing diffusion models, while the unexpected findings suggest that at least in non-convex settings, constructions guided by theoretical assumptions may lead to inferior attribution performance. The code is available at https://github.com/sail-sg/D-TRAK.  ( 2 min )
    Viewing the process of generating counterfactuals as a source of knowledge: a new approach for explaining classifiers
    arXiv:2309.04284v3 Announce Type: replace Abstract: There are now many explainable AI methods for understanding the decisions of a machine learning model. Among these are those based on counterfactual reasoning, which involve simulating features changes and observing the impact on the prediction. This article proposes to view this simulation process as a source of creating a certain amount of knowledge that can be stored to be used, later, in different ways. This process is illustrated in the additive model and, more specifically, in the case of the naive Bayes classifier, whose interesting properties for this purpose are shown.  ( 2 min )
    StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
    arXiv:2208.10922v2 Announce Type: replace-cross Abstract: We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflects the given audio. This is made possible with several newly devised components: 1) A contrastive lip-sync discriminator for accurate lip synchronization, 2) A conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity. 3) An auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality which are accurately lip-synced with the input audios, largely outperforming state-of-the-art baselines.  ( 3 min )
    Learning the greatest common divisor: explaining transformer predictions
    arXiv:2308.15594v2 Announce Type: replace Abstract: The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list $\mathcal D$ of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of $\mathcal D$ that divides both inputs. Training distributions impact performance. Models trained from uniform operands only learn a handful of GCD (up to $38$ GCD $\leq100$). Log-uniform operands boost performance to $73$ GCD $\leq 100$, and a log-uniform distribution of outcomes (i.e. GCD) to $91$. However, training from uniform (balanced) GCD breaks explainability.  ( 2 min )
    Stability to Deformations of Manifold Filters and Manifold Neural Networks
    arXiv:2106.03725v5 Announce Type: replace Abstract: The paper defines and studies manifold (M) convolutional filters and neural networks (NNs). \emph{Manifold} filters and MNNs are defined in terms of the Laplace-Beltrami operator exponential and are such that \emph{graph} (G) filters and neural networks (NNs) are recovered as discrete approximations when the manifold is sampled. These filters admit a spectral representation which is a generalization of both the spectral representation of graph filters and the frequency response of standard convolutional filters in continuous time. The main technical contribution of the paper is to analyze the stability of manifold filters and MNNs to smooth deformations of the manifold. This analysis generalizes known stability properties of graph filters and GNNs and it is also a generalization of known stability properties of standard convolutional filters and neural networks in continuous time. The most important observation that follows from this analysis is that manifold filters, same as graph filters and standard continuous time filters, have difficulty discriminating high frequency components in the presence of deformations. This is a challenge that can be ameliorated with the use of manifold, graph, or continuous time neural networks. The most important practical consequence of this analysis is to shed light on the behavior of graph filters and GNNs in large-scale graphs.  ( 3 min )
    Continuous QA Learning with Structured Prompts
    arXiv:2208.14602v3 Announce Type: replace-cross Abstract: QA models with lifelong learning (LL) abilities are important for practical QA applications, and architecture-based LL methods are reported to be an effective implementation for these models. However, it is non-trivial to extend previous approaches to QA tasks since they either require access to task identities in the testing phase or do not explicitly model samples from unseen tasks. In this paper, we propose Diana: a dynamic architecture-based lifelong QA model that tries to learn a sequence of QA tasks with a prompt enhanced language model. Four types of hierarchically organized prompts are used in Diana to capture QA knowledge from different granularities. Specifically, we dedicate task-level prompts to capture task-specific knowledge to retain high LL performances and maintain instance-level prompts to learn knowledge shared across different input samples to improve the model's generalization performance. Moreover, we dedicate separate prompts to explicitly model unseen tasks and introduce a set of prompt key vectors to facilitate knowledge sharing between tasks. Extensive experiments demonstrate that Diana outperforms state-of-the-art lifelong QA models, especially in handling unseen tasks.  ( 2 min )
    Simplicial Representation Learning with Neural $k$-Forms
    arXiv:2312.08515v2 Announce Type: replace Abstract: Geometric deep learning extends deep learning to incorporate information about the geometry and topology data, especially in complex domains like graphs. Despite the popularity of message passing in this field, it has limitations such as the need for graph rewiring, ambiguity in interpreting data, and over-smoothing. In this paper, we take a different approach, focusing on leveraging geometric information from simplicial complexes embedded in $\mathbb{R}^n$ using node coordinates. We use differential k-forms in \mathbb{R}^n to create representations of simplices, offering interpretability and geometric consistency without message passing. This approach also enables us to apply differential geometry tools and achieve universal approximation. Our method is efficient, versatile, and applicable to various input complexes, including graphs, simplicial complexes, and cell complexes. It outperforms existing message passing neural networks in harnessing information from geometrical graphs with node features serving as coordinates.  ( 2 min )
    Learning to Detect Slip through Tactile Estimation of the Contact Force Field and its Entropy
    arXiv:2303.00935v3 Announce Type: replace-cross Abstract: Detection of slip during object grasping and manipulation plays a vital role in object handling. Existing solutions primarily rely on visual information to devise a strategy for grasping. However, for robotic systems to attain a level of proficiency comparable to humans, especially in consistently handling and manipulating unfamiliar objects, integrating artificial tactile sensing is increasingly essential. We introduce a novel physics-informed, data-driven approach to detect slip continuously in real time. We employ the GelSight Mini, an optical tactile sensor, attached to custom-designed grippers to gather tactile data. Our work leverages the inhomogeneity of tactile sensor readings during slip events to develop distinctive features and formulates slip detection as a classification problem. To evaluate our approach, we test multiple data-driven models on 10 common objects under different loading conditions, textures, and materials. Our results show that the best classification algorithm achieves a high average accuracy of 95.61%. We further illustrate the practical application of our research in dynamic robotic manipulation tasks, where our real-time slip detection and prevention algorithm is implemented.  ( 2 min )
    Voting-based Multimodal Automatic Deception Detection
    arXiv:2307.07516v3 Announce Type: replace Abstract: Automatic Deception Detection has been a hot research topic for a long time, using machine learning and deep learning to automatically detect deception, brings new light to this old field. In this paper, we proposed a voting-based method for automatic deception detection from videos using audio, visual and lexical features. Experiments were done on two datasets, the Real-life trial dataset by Michigan University and the Miami University deception detection dataset. Video samples were split into frames of images, audio, and manuscripts. Our Voting-based Multimodal proposed solution consists of three models. The first model is CNN for detecting deception from images, the second model is Support Vector Machine (SVM) on Mel spectrograms for detecting deception from audio and the third model is Word2Vec on Support Vector Machine (SVM) for detecting deception from manuscripts. Our proposed solution outperforms state of the art. Best results achieved on images, audio and text were 97%, 96%, 92% respectively on Real-Life Trial Dataset, and 97%, 82%, 73% on video, audio and text respectively on Miami University Deception Detection.  ( 2 min )
    LILO: Learning Interpretable Libraries by Compressing and Documenting Code
    arXiv:2310.19791v4 Announce Type: replace-cross Abstract: While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal lambda abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions. We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing neural and symbolic methods - including the state-of-the-art library learning algorithm DreamCoder - LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.  ( 3 min )
    A Structure-Preserving Kernel Method for Learning Hamiltonian Systems
    arXiv:2403.10070v1 Announce Type: cross Abstract: A structure-preserving kernel ridge regression method is presented that allows the recovery of potentially high-dimensional and nonlinear Hamiltonian functions out of datasets made of noisy observations of Hamiltonian vector fields. The method proposes a closed-form solution that yields excellent numerical performances that surpass other techniques proposed in the literature in this setup. From the methodological point of view, the paper extends kernel regression methods to problems in which loss functions involving linear functions of gradients are required and, in particular, a differential reproducing property and a Representer Theorem are proved in this context. The relation between the structure-preserving kernel estimator and the Gaussian posterior mean estimator is analyzed. A full error analysis is conducted that provides convergence rates using fixed and adaptive regularization parameters. The good performance of the proposed estimator is illustrated with various numerical experiments.  ( 2 min )
    Learning on JPEG-LDPC Compressed Images: Classifying with Syndromes
    arXiv:2403.10202v1 Announce Type: cross Abstract: In goal-oriented communications, the objective of the receiver is often to apply a Deep-Learning model, rather than reconstructing the original data. In this context, direct learning over compressed data, without any prior decoding, holds promise for enhancing the time-efficient execution of inference models at the receiver. However, conventional entropic-coding methods like Huffman and Arithmetic break data structure, rendering them unsuitable for learning without decoding. In this paper, we propose an alternative approach in which entropic coding is realized with Low-Density Parity Check (LDPC) codes. We hypothesize that Deep Learning models can more effectively exploit the internal code structure of LDPC codes. At the receiver, we leverage a specific class of Recurrent Neural Networks (RNNs), specifically Gated Recurrent Unit (GRU), trained for image classification. Our numerical results indicate that classification based on LDPC-coded bit-planes surpasses Huffman and Arithmetic coding, while necessitating a significantly smaller learning model. This demonstrates the efficiency of classification directly from LDPC-coded data, eliminating the need for any form of decompression, even partial, prior to applying the learning model.  ( 2 min )
    Textual analysis of End User License Agreement for red-flagging potentially malicious software
    arXiv:2403.09715v1 Announce Type: cross Abstract: New software and updates are downloaded by end users every day. Each dowloaded software has associated with it an End Users License Agreements (EULA), but this is rarely read. An EULA includes information to avoid legal repercussions. However,this proposes a host of potential problems such as spyware or producing an unwanted affect in the target system. End users do not read these EULA's because of length of the document and users find it extremely difficult to understand. Text summarization is one of the relevant solution to these kind of problems. This require a solution which can summarize the EULA and classify the EULA as "Benign" or "Malicious". We propose a solution in which we have summarize the EULA and classify the EULA as "Benign" or "Malicious". We extract EULA text of different sofware's then we classify the text using eight different supervised classifiers. we use ensemble learning to classify the EULA as benign or malicious using five different text summarization methods. An accuracy of $95.8$\% shows the effectiveness of the presented approach.  ( 2 min )
    Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases
    arXiv:2403.10446v1 Announce Type: cross Abstract: We proposed an end-to-end system design towards utilizing Retrieval Augmented Generation (RAG) to improve the factual accuracy of Large Language Models (LLMs) for domain-specific and time-sensitive queries related to private knowledge-bases. Our system integrates RAG pipeline with upstream datasets processing and downstream performance evaluation. Addressing the challenge of LLM hallucinations, we finetune models with a curated dataset which originates from CMU's extensive resources and annotated with the teacher model. Our experiments demonstrate the system's effectiveness in generating more accurate answers to domain-specific and time-sensitive inquiries. The results also revealed the limitations of fine-tuning LLMs with small-scale and skewed datasets. This research highlights the potential of RAG systems in augmenting LLMs with external datasets for improved performance in knowledge-intensive tasks. Our code and models are available on Github.  ( 2 min )
    FeatUp: A Model-Agnostic Framework for Features at Any Resolution
    arXiv:2403.10516v1 Announce Type: cross Abstract: Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.  ( 2 min )
    A survey of synthetic data augmentation methods in computer vision
    arXiv:2403.10075v1 Announce Type: cross Abstract: The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models using large-scale image datasets which are representative of the target task. However, in many scenarios, it is often challenging to obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly transform existing images in desired ways so as to create the required volume and variability of training data necessary to achieve good generalization performance. In situations where data for the target domain is not accessible, a viable workaround is to synthesize training data from scratch--i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation techniques. It covers data synthesis approaches based on realistic 3D graphics modeling, neural style transfer (NST), differential neural rendering, and generative artificial intelligence (AI) techniques such as generative adversarial networks (GANs) and variational autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques, general scope of application and specific use-cases, as well as existing limitations and possible workarounds. Additionally, we provide a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains and supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore synthetic data augmentation methods in great detail, we are hoping to equip readers with the necessary background information and in-depth knowledge of existing methods and their attendant issues.  ( 3 min )
    Team Trifecta at Factify5WQA: Setting the Standard in Fact Verification with Fine-Tuning
    arXiv:2403.10281v1 Announce Type: cross Abstract: In this paper, we present Pre-CoFactv3, a comprehensive framework comprised of Question Answering and Text Classification components for fact verification. Leveraging In-Context Learning, Fine-tuned Large Language Models (LLMs), and the FakeNet model, we address the challenges of fact verification. Our experiments explore diverse approaches, comparing different Pre-trained LLMs, introducing FakeNet, and implementing various ensemble methods. Notably, our team, Trifecta, secured first place in the AAAI-24 Factify 3.0 Workshop, surpassing the baseline accuracy by 103% and maintaining a 70% lead over the second competitor. This success underscores the efficacy of our approach and its potential contributions to advancing fact verification research.  ( 2 min )
    How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?
    arXiv:2310.08391v2 Announce Type: replace-cross Abstract: Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.  ( 2 min )
    Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers
    arXiv:2305.17297v3 Announce Type: replace Abstract: Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to have access to noiseless training data from a different dataset than the test dataset. Motivated by this, we study supervised denoising and noisy-input regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to real-life data and modern machine learning. First, while most past theoretical work assumes that the data covariance matrix is full-rank and well-conditioned, empirical studies have shown that real-life data is approximately low-rank. Thus, we assume that our data matrices are low-rank. Second, we drop independence assumptions on our data. Third, the rise in computational power and dimensionality of data have made it important to study non-classical regimes of learning. Thus, we work in the non-classical proportional regime, where data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$. For this setting, we derive data-dependent, instance specific expressions for the test error for both denoising and noisy-input regression, and study when overfitting the noise is benign, tempered or catastrophic. We show that the test error exhibits double descent under general distribution shift, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using real-life data, where we match the theoretical predictions with under 1\% MSE error for low-rank data.  ( 3 min )
    zrLLM: Zero-Shot Relational Learning on Temporal Knowledge Graphs with Large Language Models
    arXiv:2311.10112v2 Announce Type: replace-cross Abstract: Modeling evolving knowledge over temporal knowledge graphs (TKGs) has become a heated topic. Various methods have been proposed to forecast links on TKGs. Most of them are embedding-based, where hidden representations are learned to represent knowledge graph (KG) entities and relations based on the observed graph contexts. Although these methods show strong performance on traditional TKG forecasting (TKGF) benchmarks, they face a strong challenge in modeling the unseen zero-shot relations that have no prior graph context. In this paper, we try to mitigate this problem as follows. We first input the text descriptions of KG relations into large language models (LLMs) for generating relation representations, and then introduce them into embedding-based TKGF methods. LLM-empowered representations can capture the semantic information in the relation descriptions. This makes the relations, whether seen or unseen, with similar semantic meanings stay close in the embedding space, enabling TKGF models to recognize zero-shot relations even without any observed graph context. Experimental results show that our approach helps TKGF models to achieve much better performance in forecasting the facts with previously unseen relations, while still maintaining their ability in link forecasting regarding seen relations.  ( 3 min )
    Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs
    arXiv:2305.18702v2 Announce Type: replace-cross Abstract: Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems. One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust random samples in the training set such that the residual induced by the approximate PDE solution can maintain a smooth profile when it is being minimized. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs.  ( 3 min )
    Reliable uncertainty with cheaper neural network ensembles: a case study in industrial parts classification
    arXiv:2403.10182v1 Announce Type: new Abstract: In operations research (OR), predictive models often encounter out-of-distribution (OOD) scenarios where the data distribution differs from the training data distribution. In recent years, neural networks (NNs) are gaining traction in OR for their exceptional performance in fields such as image classification. However, NNs tend to make confident yet incorrect predictions when confronted with OOD data. Uncertainty estimation offers a solution to overconfident models, communicating when the output should (not) be trusted. Hence, reliable uncertainty quantification in NNs is crucial in the OR domain. Deep ensembles, composed of multiple independent NNs, have emerged as a promising approach, offering not only strong predictive accuracy but also reliable uncertainty estimation. However, their deployment is challenging due to substantial computational demands. Recent fundamental research has proposed more efficient NN ensembles, namely the snapshot, batch, and multi-input multi-output ensemble. This study is the first to provide a comprehensive comparison of a single NN, a deep ensemble, and the three efficient NN ensembles. In addition, we propose a Diversity Quality metric to quantify the ensembles' performance on the in-distribution and OOD sets in one single metric. The OR case study discusses industrial parts classification to identify and manage spare parts, important for timely maintenance of industrial plants. The results highlight the batch ensemble as a cost-effective and competitive alternative to the deep ensemble. It outperforms the deep ensemble in both uncertainty and accuracy while exhibiting a training time speedup of 7x, a test time speedup of 8x, and 9x memory savings.  ( 3 min )
    Meta Operator for Complex Query Answering on Knowledge Graphs
    arXiv:2403.10110v1 Announce Type: new Abstract: Knowledge graphs contain informative factual knowledge but are considered incomplete. To answer complex queries under incomplete knowledge, learning-based Complex Query Answering (CQA) models are proposed to directly learn from the query-answer samples to avoid the direct traversal of incomplete graph data. Existing works formulate the training of complex query answering models as multi-task learning and require a large number of training samples. In this work, we explore the compositional structure of complex queries and argue that the different logical operator types, rather than the different complex query types, are the key to improving generalizability. Accordingly, we propose a meta-learning algorithm to learn the meta-operators with limited data and adapt them to different instances of operators under various complex queries. Empirical results show that learning meta-operators is more effective than learning original CQA or meta-CQA models.  ( 2 min )
    Online GNN Evaluation Under Test-time Graph Distribution Shifts
    arXiv:2403.09953v1 Announce Type: new Abstract: Evaluating the performance of a well-trained GNN model on real-world graphs is a pivotal step for reliable GNN online deployment and serving. Due to a lack of test node labels and unknown potential training-test graph data distribution shifts, conventional model evaluation encounters limitations in calculating performance metrics (e.g., test error) and measuring graph data-level discrepancies, particularly when the training graph used for developing GNNs remains unobserved during test time. In this paper, we study a new research problem, online GNN evaluation, which aims to provide valuable insights into the well-trained GNNs's ability to effectively generalize to real-world unlabeled graphs under the test-time graph distribution shifts. Concretely, we develop an effective learning behavior discrepancy score, dubbed LeBeD, to estimate the test-time generalization errors of well-trained GNN models. Through a novel GNN re-training strategy with a parameter-free optimality criterion, the proposed LeBeD comprehensively integrates learning behavior discrepancies from both node prediction and structure reconstruction perspectives. This enables the effective evaluation of the well-trained GNNs' ability to capture test node semantics and structural representations, making it an expressive metric for estimating the generalization error in online GNN evaluation. Extensive experiments on real-world test graphs under diverse graph distribution shifts could verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.  ( 2 min )
    Prediction of Vessel Arrival Time to Pilotage Area Using Multi-Data Fusion and Deep Learning
    arXiv:2403.09969v1 Announce Type: new Abstract: This paper investigates the prediction of vessels' arrival time to the pilotage area using multi-data fusion and deep learning approaches. Firstly, the vessel arrival contour is extracted based on Multivariate Kernel Density Estimation (MKDE) and clustering. Secondly, multiple data sources, including Automatic Identification System (AIS), pilotage booking information, and meteorological data, are fused before latent feature extraction. Thirdly, a Temporal Convolutional Network (TCN) framework that incorporates a residual mechanism is constructed to learn the hidden arrival patterns of the vessels. Extensive tests on two real-world data sets from Singapore have been conducted and the following promising results have been obtained: 1) fusion of pilotage booking information and meteorological data improves the prediction accuracy, with pilotage booking information having a more significant impact; 2) using discrete embedding for the meteorological data performs better than using continuous embedding; 3) the TCN outperforms the state-of-the-art baseline methods in regression tasks, exhibiting Mean Absolute Error (MAE) ranging from 4.58 min to 4.86 min; and 4) approximately 89.41% to 90.61% of the absolute prediction residuals fall within a time frame of 10 min.  ( 2 min )
    Robust Subgraph Learning by Monitoring Early Training Representations
    arXiv:2403.09901v1 Announce Type: new Abstract: Graph neural networks (GNNs) have attracted significant attention for their outstanding performance in graph learning and node classification tasks. However, their vulnerability to adversarial attacks, particularly through susceptible nodes, poses a challenge in decision-making. The need for robust graph summarization is evident in adversarial challenges resulting from the propagation of attacks throughout the entire graph. In this paper, we address both performance and adversarial robustness in graph input by introducing the novel technique SHERD (Subgraph Learning Hale through Early Training Representation Distances). SHERD leverages information from layers of a partially trained graph convolutional network (GCN) to detect susceptible nodes during adversarial attacks using standard distance metrics. The method identifies "vulnerable (bad)" nodes and removes such nodes to form a robust subgraph while maintaining node classification performance. Through our experiments, we demonstrate the increased performance of SHERD in enhancing robustness by comparing the network's performance on original and subgraph inputs against various baselines alongside existing adversarial attacks. Our experiments across multiple datasets, including citation datasets such as Cora, Citeseer, and Pubmed, as well as microanatomical tissue structures of cell graphs in the placenta, highlight that SHERD not only achieves substantial improvement in robust performance but also outperforms several baselines in terms of node classification accuracy and computational complexity.  ( 2 min )
    Generalization of Scaled Deep ResNets in the Mean-Field Regime
    arXiv:2403.09889v1 Announce Type: new Abstract: Despite the widespread empirical success of ResNet, the generalization properties of deep ResNet are rarely explored beyond the lazy training regime. In this work, we investigate \emph{scaled} ResNet in the limit of infinitely deep and wide neural networks, of which the gradient flow is described by a partial differential equation in the large-neural network limit, i.e., the \emph{mean-field} regime. To derive the generalization bounds under this setting, our analysis necessitates a shift from the conventional time-invariant Gram matrix employed in the lazy training regime to a time-variant, distribution-dependent version. To this end, we provide a global lower bound on the minimum eigenvalue of the Gram matrix under the mean-field regime. Besides, for the traceability of the dynamic of Kullback-Leibler (KL) divergence, we establish the linear convergence of the empirical error and estimate the upper bound of the KL divergence over parameters distribution. Finally, we build the uniform convergence for generalization bound via Rademacher complexity. Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime and contribute to advancing the understanding of the fundamental properties of deep neural networks.  ( 2 min )
    TimeMachine: A Time Series is Worth 4 Mambas for Long-term Forecasting
    arXiv:2403.09898v1 Announce Type: new Abstract: Long-term time-series forecasting remains challenging due to the difficulty in capturing long-term dependencies, achieving linear scalability, and maintaining computational efficiency. We introduce TimeMachine, an innovative model that leverages Mamba, a state-space model, to capture long-term dependencies in multivariate time series data while maintaining linear scalability and small memory footprints. TimeMachine exploits the unique properties of time series data to produce salient contextual cues at multi-scales and leverage an innovative integrated quadruple-Mamba architecture to unify the handling of channel-mixing and channel-independence situations, thus enabling effective selection of contents for prediction against global and local contexts at different scales. Experimentally, TimeMachine achieves superior performance in prediction accuracy, scalability, and memory efficiency, as extensively validated using benchmark datasets. Code availability: https://github.com/Atik-Ahamed/TimeMachine  ( 2 min )
    A Conceptual Framework For White Box Neural Networks
    arXiv:2403.09863v1 Announce Type: new Abstract: This paper introduces semantic features as a general conceptual framework for fully explainable neural network layers. A well-motivated proof of concept model for relevant subproblem of MNIST consists of 4 such layers with the total of 4.8K learnable parameters. The model is easily interpretable, achieves human-level adversarial test accuracy with no form of adversarial training, requires little hyperparameter tuning and can be quickly trained on a single CPU. The general nature of the technique bears promise for a paradigm shift towards radically democratised and truly generalizable white box neural networks. The code is available at https://github.com/314-Foundation/white-box-nn  ( 2 min )
    Self-Supervised Learning for Time Series: Contrastive or Generative?
    arXiv:2403.09809v1 Announce Type: new Abstract: Self-supervised learning (SSL) has recently emerged as a powerful approach to learning representations from large-scale unlabeled data, showing promising results in time series analysis. The self-supervised representation learning can be categorized into two mainstream: contrastive and generative. In this paper, we will present a comprehensive comparative study between contrastive and generative methods in time series. We first introduce the basic frameworks for contrastive and generative SSL, respectively, and discuss how to obtain the supervision signal that guides the model optimization. We then implement classical algorithms (SimCLR vs. MAE) for each type and conduct a comparative analysis in fair settings. Our results provide insights into the strengths and weaknesses of each approach and offer practical recommendations for choosing suitable SSL methods. We also discuss the implications of our findings for the broader field of representation learning and propose future research directions. All the code and data are released at \url{https://github.com/DL4mHealth/SSL_Comparison}.  ( 2 min )
    Towards the Reusability and Compositionality of Causal Representations
    arXiv:2403.09830v1 Announce Type: new Abstract: Causal Representation Learning (CRL) aims at identifying high-level causal factors and their relationships from high-dimensional observations, e.g., images. While most CRL works focus on learning causal representations in a single environment, in this work we instead propose a first step towards learning causal representations from temporal sequences of images that can be adapted in a new environment, or composed across multiple related environments. In particular, we introduce DECAF, a framework that detects which causal factors can be reused and which need to be adapted from previously learned causal representations. Our approach is based on the availability of intervention targets, that indicate which variables are perturbed at each time step. Experiments on three benchmark datasets show that integrating our framework with four state-of-the-art CRL approaches leads to accurate representations in a new environment with only a few samples.  ( 2 min )
    Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings
    arXiv:2403.09749v1 Announce Type: new Abstract: In Time Series Classification (TSC), temporal pooling methods that consider sequential information have been proposed. However, we found that each temporal pooling has a distinct mechanism, and can perform better or worse depending on time series data. We term this fixed pooling mechanism a single perspective of temporal poolings. In this paper, we propose a novel temporal pooling method with diverse perspective learning: Selection over Multiple Temporal Poolings (SoM-TP). SoM-TP dynamically selects the optimal temporal pooling among multiple methods for each data by attention. The dynamic pooling selection is motivated by the ensemble concept of Multiple Choice Learning (MCL), which selects the best among multiple outputs. The pooling selection by SoM-TP's attention enables a non-iterative pooling ensemble within a single classifier. Additionally, we define a perspective loss and Diverse Perspective Learning Network (DPLN). The loss works as a regularizer to reflect all the pooling perspectives from DPLN. Our perspective analysis using Layer-wise Relevance Propagation (LRP) reveals the limitation of a single perspective and ultimately demonstrates diverse perspective learning of SoM-TP. We also show that SoM-TP outperforms CNN models based on other temporal poolings and state-of-the-art models in TSC with extensive UCR/UEA repositories.  ( 2 min )
    A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage
    arXiv:2403.09701v1 Announce Type: new Abstract: Hybrid Reinforcement Learning (RL), leveraging both online and offline data, has garnered recent interest, yet research on its provable benefits remains sparse. Additionally, many existing hybrid RL algorithms (Song et al., 2023; Nakamoto et al., 2023; Amortila et al., 2024) impose coverage assumptions on the offline dataset, but we show that this is unnecessary. A well-designed online algorithm should "fill in the gaps" in the offline dataset, exploring states and actions that the behavior policy did not explore. Unlike previous approaches that focus on estimating the offline data distribution to guide online exploration (Li et al., 2023b), we show that a natural extension to standard optimistic online algorithms -- warm-starting them by including the offline dataset in the experience replay buffer -- achieves similar provable gains from hybrid data even when the offline dataset does not have single-policy concentrability. We accomplish this by partitioning the state-action space into two, bounding the regret on each partition through an offline and an online complexity measure, and showing that the regret of this hybrid RL algorithm can be characterized by the best partition -- despite the algorithm not knowing the partition itself. As an example, we propose DISC-GOLF, a modification of an existing optimistic online algorithm with general function approximation called GOLF used in Jin et al. (2021); Xie et al. (2022a), and show that it demonstrates provable gains over both online-only and offline-only reinforcement learning, with competitive bounds when specialized to the tabular, linear and block MDP cases. Numerical simulations further validate our theory that hybrid data facilitates more efficient exploration, supporting the potential of hybrid RL in various scenarios.  ( 3 min )
    Within-basket Recommendation via Neural Pattern Associator
    arXiv:2401.16433v2 Announce Type: replace-cross Abstract: Within-basket recommendation (WBR) refers to the task of recommending items to the end of completing a non-empty shopping basket during a shopping session. While the latest innovations in this space demonstrate remarkable performance improvement on benchmark datasets, they often overlook the complexity of user behaviors in practice, such as 1) co-existence of multiple shopping intentions, 2) multi-granularity of such intentions, and 3) interleaving behavior (switching intentions) in a shopping session. This paper presents Neural Pattern Associator (NPA), a deep item-association-mining model that explicitly models the aforementioned factors. Specifically, inspired by vector quantization, the NPA model learns to encode common user intentions (or item-combination patterns) as quantized representations (a.k.a. codebook), which permits identification of users's shopping intentions via attention-driven lookup during the reasoning phase. This yields coherent and self-interpretable recommendations. We evaluated the proposed NPA model across multiple extensive datasets, encompassing the domains of grocery e-commerce (shopping basket completion) and music (playlist extension), where our quantitative evaluations show that the NPA model significantly outperforms a wide range of existing WBR solutions, reflecting the benefit of explicitly modeling complex user intentions.  ( 2 min )
    Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
    arXiv:2311.07593v2 Announce Type: replace-cross Abstract: A promising approach for improving the performance of vision-language models like CLIP for image classification is to extend the class descriptions (i.e., prompts) with related attributes, e.g., using brown sparrow instead of sparrow. However, current zero-shot methods select a subset of attributes regardless of commonalities between the target classes, potentially providing no useful information that would have helped to distinguish between them. For instance, they may use color instead of bill shape to distinguish between sparrows and wrens, which are both brown. We propose Follow-up Differential Descriptions (FuDD), a zero-shot approach that tailors the class descriptions to each dataset and leads to additional attributes that better differentiate the target classes. FuDD first identifies the ambiguous classes for each image, and then uses a Large Language Model (LLM) to generate new class descriptions that differentiate between them. The new class descriptions resolve the initial ambiguity and help predict the correct label. In our experiments, FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets. We show that differential descriptions are an effective tool to resolve class ambiguities, which otherwise significantly degrade the performance. We also show that high quality natural language class descriptions produced by FuDD result in comparable performance to few-shot adaptation methods.  ( 3 min )
    Compositional preference models for aligning LMs
    arXiv:2310.13011v2 Announce Type: replace-cross Abstract: As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.  ( 2 min )
    Post-hoc Bias Scoring Is Optimal For Fair Classification
    arXiv:2310.05725v3 Announce Type: replace-cross Abstract: We consider a binary classification problem under group fairness constraints, which can be one of Demographic Parity (DP), Equalized Opportunity (EOp), or Equalized Odds (EO). We propose an explicit characterization of Bayes optimal classifier under the fairness constraints, which turns out to be a simple modification rule of the unconstrained classifier. Namely, we introduce a novel instance-level measure of bias, which we call bias score, and the modification rule is a simple linear rule on top of the finite amount of bias scores.Based on this characterization, we develop a post-hoc approach that allows us to adapt to fairness constraints while maintaining high accuracy. In the case of DP and EOp constraints, the modification rule is thresholding a single bias score, while in the case of EO constraints we are required to fit a linear modification rule with 2 parameters. The method can also be applied for composite group-fairness criteria, such as ones involving several sensitive attributes.  ( 2 min )
    Neur2RO: Neural Two-Stage Robust Optimization
    arXiv:2310.04345v2 Announce Type: replace-cross Abstract: Robust optimization provides a mathematical framework for modeling and solving decision-making problems under worst-case uncertainty. This work addresses two-stage robust optimization (2RO) problems (also called adjustable robust optimization), wherein first-stage and second-stage decisions are made before and after uncertainty is realized, respectively. This results in a nested min-max-min optimization problem which is extremely challenging computationally, especially when the decisions are discrete. We propose Neur2RO, an efficient machine learning-driven instantiation of column-and-constraint generation (CCG), a classical iterative algorithm for 2RO. Specifically, we learn to estimate the value function of the second-stage problem via a novel neural network architecture that is easy to optimize over by design. Embedding our neural network into CCG yields high-quality solutions quickly as evidenced by experiments on two 2RO benchmarks, knapsack and capital budgeting. For knapsack, Neur2RO finds solutions that are within roughly $2\%$ of the best-known values in a few seconds compared to the three hours of the state-of-the-art exact branch-and-price algorithm; for larger and more complex instances, Neur2RO finds even better solutions. For capital budgeting, Neur2RO outperforms three variants of the $k$-adaptability algorithm, particularly on the largest instances, with a 10 to 100-fold reduction in solution time. Our code and data are available at https://github.com/khalil-research/Neur2RO.  ( 2 min )
    Cognitive Architectures for Language Agents
    arXiv:2309.02427v3 Announce Type: replace-cross Abstract: Recent efforts have augmented large language models (LLMs) with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning, leading to a new class of language agents. While these agents have achieved substantial empirical success, we lack a systematic framework to organize existing agents and plan future developments. In this paper, we draw on the rich history of cognitive science and symbolic artificial intelligence to propose Cognitive Architectures for Language Agents (CoALA). CoALA describes a language agent with modular memory components, a structured action space to interact with internal memory and external environments, and a generalized decision-making process to choose actions. We use CoALA to retrospectively survey and organize a large body of recent work, and prospectively identify actionable directions towards more capable agents. Taken together, CoALA contextualizes today's language agents within the broader history of AI and outlines a path towards language-based general intelligence.  ( 2 min )
    Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations
    arXiv:2309.04849v2 Announce Type: replace-cross Abstract: We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method only uses a stream of speech signals to perform unimodal SER thus reducing computation overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.  ( 2 min )
    Automated ensemble method for pediatric brain tumor segmentation
    arXiv:2308.07212v2 Announce Type: replace-cross Abstract: Brain tumors remain a critical global health challenge, necessitating advancements in diagnostic techniques and treatment methodologies. A tumor or its recurrence often needs to be identified in imaging studies and differentiated from normal brain tissue. In response to the growing need for age-specific segmentation models, particularly for pediatric patients, this study explores the deployment of deep learning techniques using magnetic resonance imaging (MRI) modalities. By introducing a novel ensemble approach using ONet and modified versions of UNet, coupled with innovative loss functions, this study achieves a precise segmentation model for the BraTS-PEDs 2023 Challenge. Data augmentation, including both single and composite transformations, ensures model robustness and accuracy across different scanning protocols. The ensemble strategy, integrating the ONet and UNet models, shows greater effectiveness in capturing specific features and modeling diverse aspects of the MRI images which result in lesion wise Dice scores of 0.52, 0.72 and 0.78 on unseen validation data and scores of 0.55, 0.70, 0.79 on final testing data for the "enhancing tumor", "tumor core" and "whole tumor" labels respectively. Visual comparisons further confirm the superiority of the ensemble method in accurate tumor region coverage. The results indicate that this advanced ensemble approach, building upon the unique strengths of individual models, offers promising prospects for enhanced diagnostic accuracy and effective treatment planning and monitoring for brain tumors in pediatric brains.  ( 3 min )
    Grounding Data Science Code Generation with Input-Output Specifications
    arXiv:2402.08073v2 Announce Type: replace Abstract: Large language models (LLMs) have recently demonstrated a remarkable ability to generate code from natural language (NL) prompts. However, in the real world, NL is often too ambiguous to capture the true intent behind programming problems, requiring additional input-output (I/O) specifications. Unfortunately, LLMs can have difficulty aligning their outputs with both the NL prompt and the I/O specification. In this paper, we give a way to mitigate this issue in the context of data science programming, where tasks require explicit I/O specifications for clarity. Specifically, we propose GIFT4Code, a novel approach for the instruction fine-tuning of LLMs with respect to I/O specifications. Our method leverages synthetic data produced by the LLM itself and utilizes execution-derived feedback as a key learning signal. This feedback, in the form of program I/O specifications, is provided to the LLM to facilitate instruction fine-tuning. We evaluated our approach on two challenging data science benchmarks, Arcade and DS-1000. The results demonstrate a significant improvement in the LLM's ability to generate code that is not only executable but also accurately aligned with user specifications, substantially improving the quality of code generation for complex data science tasks.  ( 2 min )
    Energy-based Automated Model Evaluation
    arXiv:2401.12689v3 Announce Type: replace Abstract: The conventional evaluation protocols on machine learning models rely heavily on a labeled, i.i.d-assumed testing dataset, which is not often present in real world applications. The Automated Model Evaluation (AutoEval) shows an alternative to this traditional workflow, by forming a proximal prediction pipeline of the testing performance without the presence of ground-truth labels. Despite its recent successes, the AutoEval frameworks still suffer from an overconfidence issue, substantial storage and computational cost. In that regard, we propose a novel measure -- Meta-Distribution Energy (MDE) -- that allows the AutoEval framework to be both more efficient and effective. The core of the MDE is to establish a meta-distribution statistic, on the information (energy) associated with individual samples, then offer a smoother representation enabled by energy-based learning. We further provide our theoretical insights by connecting the MDE with the classification loss. We provide extensive experiments across modalities, datasets and different architectural backbones to validate MDE's validity, together with its superiority compared with prior approaches. We also prove MDE's versatility by showing its seamless integration with large-scale models, and easy adaption to learning scenarios with noisy- or imbalanced- labels. Code and data are available: https://github.com/pengr/Energy_AutoEval  ( 2 min )
    The Generalization Gap in Offline Reinforcement Learning
    arXiv:2312.05742v2 Announce Type: replace Abstract: Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environments than online learning ones. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill-levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions and at test time, the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms struggle to match the performance of online RL on both train and test environments. Behavioral cloning is a strong baseline, outperforming state-of-the-art offline RL and sequence modeling approaches when trained on data from multiple environments and tested on new ones. Finally, we find that increasing the diversity of the data, rather than its size, improves performance on new environments for all offline learning algorithms. Our study demonstrates the limited generalization of current offline learning algorithms highlighting the need for more research in this area.  ( 3 min )
    Differentiable Euler Characteristic Transforms for Shape Classification
    arXiv:2310.07630v2 Announce Type: replace Abstract: The Euler Characteristic Transform (ECT) has proven to be a powerful representation, combining geometrical and topological characteristics of shapes and graphs. However, the ECT was hitherto unable to learn task-specific representations. We overcome this issue and develop a novel computational layer that enables learning the ECT in an end-to-end fashion. Our method, the Differentiable Euler Characteristic Transform (DECT), is fast and computationally efficient, while exhibiting performance on a par with more complex models in both graph and point cloud classification tasks. Moreover, we show that this seemingly simple statistic provides the same topological expressivity as more complex topological deep learning layers.  ( 2 min )
    DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training
    arXiv:2310.02025v4 Announce Type: replace Abstract: Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To our best knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinatewise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsityinduced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black box. Codes are available at https://github.com/OPTML-Group/DeepZero.  ( 3 min )
    JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
    arXiv:2310.00535v3 Announce Type: replace Abstract: We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analysis (e.g., lack of residual connection) and predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case, it is consistent with existing works that show attention becomes sparse over time. We leverage JoMA to qualitatively explains how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from real-world dataset (Wikitext2/Wikitext103) and various pre-trained models (OPT, Pythia) verify our theoretical findings. Code can be found in https://github.com/facebookresearch/luckmatters/tree/yuandong3.  ( 2 min )
    Self-Compatibility: Evaluating Causal Discovery without Ground Truth
    arXiv:2307.09552v2 Announce Type: replace Abstract: As causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection.  ( 2 min )
    Learning Markov State Abstractions for Deep Reinforcement Learning
    arXiv:2106.04379v4 Announce Type: replace Abstract: A fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features -- often matching or exceeding the performance achieved with hand-designed compact state information.  ( 3 min )
    Strong and Controllable Blind Image Decomposition
    arXiv:2403.10520v1 Announce Type: cross Abstract: Blind image decomposition aims to decompose all components present in an image, typically used to restore a multi-degraded input image. While fully recovering the clean image is appealing, in some scenarios, users might want to retain certain degradations, such as watermarks, for copyright protection. To address this need, we add controllability to the blind image decomposition process, allowing users to enter which types of degradation to remove or retain. We design an architecture named controllable blind image decomposition network. Inserted in the middle of U-Net structure, our method first decomposes the input feature maps and then recombines them according to user instructions. Advantageously, this functionality is implemented at minimal computational cost: decomposition and recombination are all parameter-free. Experimentally, our system excels in blind image decomposition tasks and can outputs partially or fully restored images that well reflect user intentions. Furthermore, we evaluate and configure different options for the network structure and loss functions. This, combined with the proposed decomposition-and-recombination method, yields an efficient and competitive system for blind image decomposition, compared with current state-of-the-art methods.  ( 2 min )
    HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation
    arXiv:2403.10506v1 Announce Type: cross Abstract: Humanoid robots hold great promise in assisting humans in diverse environments and tasks, due to their flexibility and adaptability leveraging human-like morphology. However, research in humanoid robots is often bottlenecked by the costly and fragile hardware setups. To accelerate algorithmic research in humanoid robots, we present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands and a variety of challenging whole-body manipulation and locomotion tasks. Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning baseline achieves superior performance when supported by robust low-level policies, such as walking or reaching. With HumanoidBench, we provide the robotics community with a platform to identify the challenges arising when solving diverse tasks with humanoid robots, facilitating prompt verification of algorithms and ideas. The open-source code is available at https://sferrazza.cc/humanoidbench_site.  ( 2 min )
    Joint Multimodal Transformer for Dimensional Emotional Recognition in the Wild
    arXiv:2403.10488v1 Announce Type: cross Abstract: Audiovisual emotion recognition (ER) in videos has immense potential over unimodal performance. It effectively leverages the inter- and intra-modal dependencies between visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. This framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to solely relying on a single modality. The proposed model leverages separate backbones for capturing intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.  ( 2 min )
    Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
    arXiv:2403.10476v1 Announce Type: cross Abstract: Enhancing the robustness of deep learning models, particularly in the realm of vision transformers (ViTs), is crucial for their real-world deployment. In this work, we provide a finetuning approach to enhance the robustness of vision transformers inspired by the concept of nullspace from linear algebra. Our investigation centers on whether a vision transformer can exhibit resilience to input variations akin to the nullspace property in linear mappings, implying that perturbations sampled from this nullspace do not influence the model's output when added to the input. Firstly, we show that for many pretrained ViTs, a non-trivial nullspace exists due to the presence of the patch embedding layer. Secondly, as nullspace is a concept associated with linear algebra, we demonstrate that it is possible to synthesize approximate nullspace elements for the non-linear blocks of ViTs employing an optimisation strategy. Finally, we propose a fine-tuning strategy for ViTs wherein we augment the training data with synthesized approximate nullspace noise. After finetuning, we find that the model demonstrates robustness to adversarial and natural image perbutations alike.  ( 2 min )
    Quantization Avoids Saddle Points in Distributed Optimization
    arXiv:2403.10423v1 Announce Type: cross Abstract: Distributed nonconvex optimization underpins key functionalities of numerous distributed systems, ranging from power systems, smart buildings, cooperative robots, vehicle networks to sensor networks. Recently, it has also merged as a promising solution to handle the enormous growth in data and model sizes in deep learning. A fundamental problem in distributed nonconvex optimization is avoiding convergence to saddle points, which significantly degrade optimization accuracy. We discover that the process of quantization, which is necessary for all digital communications, can be exploited to enable saddle-point avoidance. More specifically, we propose a stochastic quantization scheme and prove that it can effectively escape saddle points and ensure convergence to a second-order stationary point in distributed nonconvex optimization. With an easily adjustable quantization granularity, the approach allows a user to control the number of bits sent per iteration and, hence, to aggressively reduce the communication overhead. Numerical experimental results using distributed optimization and learning problems on benchmark datasets confirm the effectiveness of the approach.  ( 2 min )
    SocialGenPod: Privacy-Friendly Generative AI Social Web Applications with Decentralised Personal Data Stores
    arXiv:2403.10408v1 Announce Type: cross Abstract: We present SocialGenPod, a decentralised and privacy-friendly way of deploying generative AI Web applications. Unlike centralised Web and data architectures that keep user data tied to application and service providers, we show how one can use Solid -- a decentralised Web specification -- to decouple user data from generative AI applications. We demonstrate SocialGenPod using a prototype that allows users to converse with different Large Language Models, optionally leveraging Retrieval Augmented Generation to generate answers grounded in private documents stored in any Solid Pod that the user is allowed to access, directly or indirectly. SocialGenPod makes use of Solid access control mechanisms to give users full control of determining who has access to data stored in their Pods. SocialGenPod keeps all user data (chat history, app configuration, personal documents, etc) securely in the user's personal Pod; separate from specific model or application providers. Besides better privacy controls, this approach also enables portability across different services and applications. Finally, we discuss challenges, posed by the large compute requirements of state-of-the-art models, that future research in this area should address. Our prototype is open-source and available at: https://github.com/Vidminas/socialgenpod/.  ( 3 min )
    Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding
    arXiv:2403.10395v1 Announce Type: cross Abstract: Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models which usually take the reference image as a condition while applying hard L2 image supervision at the reference view. Yet heavily adhering to the image is prone to corrupting the inductive knowledge of the 2D diffusion model leading to flat or distorted 3D generation frequently. In this work, we reexamine image-to-3D in a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by solely resting on the SDS loss. The core of our framework lies in a two-stage diffusion model fine-tuning. Firstly, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, by which the model preliminarily acquires image-to-image capabilities. Secondly, we perform fine-tuning using our Explicit Multi-view Attention (EMA) which combines noisy multi-view images with the noise-free reference image as an explicit condition. CLIP embedding is sent to the diffusion model throughout the whole process while reference images are discarded once after fine-tuning. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating multi-view mutually consistent images and also a 3D model with more symmetrical and neat content, well-proportioned geometry, rich colored texture, and less distortion compared with existing image-to-3D methods while still preserving the similarity to the reference image to a large extent. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D.  ( 3 min )
    Energy Correction Model in the Feature Space for Out-of-Distribution Detection
    arXiv:2403.10403v1 Announce Type: cross Abstract: In this work, we study the out-of-distribution (OOD) detection problem through the use of the feature space of a pre-trained deep classifier. We show that learning the density of in-distribution (ID) features with an energy-based models (EBM) leads to competitive detection results. However, we found that the non-mixing of MCMC sampling during the EBM's training undermines its detection performance. To overcome this an energy-based correction of a mixture of class-conditional Gaussian distributions. We obtains favorable results when compared to a strong baseline like the KNN detector on the CIFAR-10/CIFAR-100 OOD detection benchmarks.  ( 2 min )
    Conformal Predictions for Probabilistically Robust Scalable Machine Learning Classification
    arXiv:2403.10368v1 Announce Type: cross Abstract: Conformal predictions make it possible to define reliable and robust learning algorithms. But they are essentially a method for evaluating whether an algorithm is good enough to be used in practice. To define a reliable learning framework for classification from the very beginning of its design, the concept of scalable classifier was introduced to generalize the concept of classical classifier by linking it to statistical order theory and probabilistic learning theory. In this paper, we analyze the similarities between scalable classifiers and conformal predictions by introducing a new definition of a score function and defining a special set of input variables, the conformal safety set, which can identify patterns in the input space that satisfy the error coverage guarantee, i.e., that the probability of observing the wrong (possibly unsafe) label for points belonging to this set is bounded by a predefined $\varepsilon$ error level. We demonstrate the practical implications of this framework through an application in cybersecurity for identifying DNS tunneling attacks. Our work contributes to the development of probabilistically robust and reliable machine learning models.  ( 2 min )
    Scalable Algorithms for Individual Preference Stable Clustering
    arXiv:2403.10365v1 Announce Type: cross Abstract: In this paper, we study the individual preference (IP) stability, which is an notion capturing individual fairness and stability in clustering. Within this setting, a clustering is $\alpha$-IP stable when each data point's average distance to its cluster is no more than $\alpha$ times its average distance to any other cluster. In this paper, we study the natural local search algorithm for IP stable clustering. Our analysis confirms a $O(\log n)$-IP stability guarantee for this algorithm, where $n$ denotes the number of points in the input. Furthermore, by refining the local search approach, we show it runs in an almost linear time, $\tilde{O}(nk)$.  ( 2 min )
    Denoising Task Difficulty-based Curriculum for Training Diffusion Models
    arXiv:2403.10348v1 Announce Type: cross Abstract: Diffusion-based generative models have emerged as powerful tools in the realm of generative modeling. Despite extensive research on denoising across various timesteps and noise levels, a conflict persists regarding the relative difficulties of the denoising tasks. While various studies argue that lower timesteps present more challenging tasks, others contend that higher timesteps are more difficult. To address this conflict, our study undertakes a comprehensive examination of task difficulties, focusing on convergence behavior and changes in relative entropy between consecutive probability distributions across timesteps. Our observational study reveals that denoising at earlier timesteps poses challenges characterized by slower convergence and higher relative entropy, indicating increased task difficulty at these lower timesteps. Building on these observations, we introduce an easy-to-hard learning scheme, drawing from curriculum learning, to enhance the training process of diffusion models. By organizing timesteps or noise levels into clusters and training models with descending orders of difficulty, we facilitate an order-aware training regime, progressing from easier to harder denoising tasks, thereby deviating from the conventional approach of training diffusion models simultaneously across all timesteps. Our approach leads to improved performance and faster convergence by leveraging the benefits of curriculum learning, while maintaining orthogonality with existing improvements in diffusion training techniques. We validate these advantages through comprehensive experiments in image generation tasks, including unconditional, class-conditional, and text-to-image generation.  ( 2 min )
    GreedyML: A Parallel Algorithm for Maximizing Submodular Functions
    arXiv:2403.10332v1 Announce Type: cross Abstract: We describe a parallel approximation algorithm for maximizing monotone submodular functions subject to hereditary constraints on distributed memory multiprocessors. Our work is motivated by the need to solve submodular optimization problems on massive data sets, for practical applications in areas such as data summarization, machine learning, and graph sparsification. Our work builds on the randomized distributed RandGreedI algorithm, proposed by Barbosa, Ene, Nguyen, and Ward (2015). This algorithm computes a distributed solution by randomly partitioning the data among all the processors and then employing a single accumulation step in which all processors send their partial solutions to one processor. However, for large problems, the accumulation step could exceed the memory available on a processor, and the processor which performs the accumulation could become a computational bottleneck. Here, we propose a generalization of the RandGreedI algorithm that employs multiple accumulation steps to reduce the memory required. We analyze the approximation ratio and the time complexity of the algorithm (in the BSP model). We also evaluate the new GreedyML algorithm on three classes of problems, and report results from massive data sets with millions of elements. The results show that the GreedyML algorithm can solve problems where the sequential Greedy and distributed RandGreedI algorithms fail due to memory constraints. For certain computationally intensive problems, the GreedyML algorithm can be faster than the RandGreedI algorithm. The observed approximation quality of the solutions computed by the GreedyML algorithm closely matches those obtained by the RandGreedI algorithm on these problems.  ( 3 min )
    Rough Transformers for Continuous and Efficient Time-Series Modelling
    arXiv:2403.10288v1 Announce Type: cross Abstract: Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to model irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate lengths and greater. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model which operates on continuous-time representations of input sequences and incurs significantly reduced computational costs, critical for addressing long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in input data, while remaining robust to changes in the sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.  ( 2 min )
    CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model
    arXiv:2403.10326v1 Announce Type: cross Abstract: Manually designing cloze test consumes enormous time and efforts. The major challenge lies in wrong option (distractor) selection. Having carefully-design distractors improves the effectiveness of learner ability assessment. As a result, the idea of automatically generating cloze distractor is motivated. In this paper, we investigate cloze distractor generation by exploring the employment of pre-trained language models (PLMs) as an alternative for candidate distractor generation. Experiments show that the PLM-enhanced model brings a substantial performance improvement. Our best performing model advances the state-of-the-art result from 14.94 to 34.17 (NDCG@10 score). Our code and dataset is available at https://github.com/AndyChiangSH/CDGP.  ( 2 min )
    DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers
    arXiv:2403.10266v1 Announce Type: cross Abstract: Scaling large models with long sequences across applications like language generation, video generation and multimodal tasks requires efficient sequence parallelism. However, existing sequence parallelism methods all assume a single sequence dimension and fail to adapt to multi-dimensional transformer architectures that perform attention calculations across different dimensions. This paper introduces Dynamic Sequence Parallelism (DSP), a novel approach to enable efficient sequence parallelism for multi-dimensional transformer models. The key idea is to dynamically switch the parallelism dimension according to the current computation stage, leveraging the potential characteristics of multi-dimensional attention. This dynamic dimension switching allows sequence parallelism with minimal communication overhead compared to applying traditional single-dimension parallelism to multi-dimensional models. Experiments show DSP improves end-to-end throughput by 42.0% to 216.8% over prior sequence parallelism methods.  ( 2 min )
    Interpretable Machine Learning for Survival Analysis
    arXiv:2403.10250v1 Announce Type: cross Abstract: With the spread and rapid advancement of black box machine learning models, the field of interpretable machine learning (IML) or explainable artificial intelligence (XAI) has become increasingly important over the last decade. This is particularly relevant for survival analysis, where the adoption of IML techniques promotes transparency, accountability and fairness in sensitive areas, such as clinical decision making processes, the development of targeted therapies, interventions or in other medical or healthcare related contexts. More specifically, explainability can uncover a survival model's potential biases and limitations and provide more mathematically sound ways to understand how and which features are influential for prediction or constitute risk factors. However, the lack of readily available IML methods may have deterred medical practitioners and policy makers in public health from leveraging the full potential of machine learning for predicting time-to-event data. We present a comprehensive review of the limited existing amount of work on IML methods for survival analysis within the context of the general IML taxonomy. In addition, we formally detail how commonly used IML methods, such as such as individual conditional expectation (ICE), partial dependence plots (PDP), accumulated local effects (ALE), different feature importance measures or Friedman's H-interaction statistics can be adapted to survival outcomes. An application of several IML methods to real data on data on under-5 year mortality of Ghanaian children from the Demographic and Health Surveys (DHS) Program serves as a tutorial or guide for researchers, on how to utilize the techniques in practice to facilitate understanding of model decisions or predictions.  ( 3 min )
    Matrix Completion via Nonsmooth Regularization of Fully Connected Neural Networks
    arXiv:2403.10232v1 Announce Type: cross Abstract: Conventional matrix completion methods approximate the missing values by assuming the matrix to be low-rank, which leads to a linear approximation of missing values. It has been shown that enhanced performance could be attained by using nonlinear estimators such as deep neural networks. Deep fully connected neural networks (FCNNs), one of the most suitable architectures for matrix completion, suffer from over-fitting due to their high capacity, which leads to low generalizability. In this paper, we control over-fitting by regularizing the FCNN model in terms of the $\ell_{1}$ norm of intermediate representations and nuclear norm of weight matrices. As such, the resulting regularized objective function becomes nonsmooth and nonconvex, i.e., existing gradient-based methods cannot be applied to our model. We propose a variant of the proximal gradient method and investigate its convergence to a critical point. In the initial epochs of FCNN training, the regularization terms are ignored, and through epochs, the effect of that increases. The gradual addition of nonsmooth regularization terms is the main reason for the better performance of the deep neural network with nonsmooth regularization terms (DNN-NSR) algorithm. Our simulations indicate the superiority of the proposed algorithm in comparison with existing linear and nonlinear algorithms.  ( 2 min )
    Perceptual Quality-based Model Training under Annotator Label Uncertainty
    arXiv:2403.10190v1 Announce Type: cross Abstract: Annotators exhibit disagreement during data labeling, which can be termed as annotator label uncertainty. Annotator label uncertainty manifests in variations of labeling quality. Training with a single low-quality annotation per sample induces model reliability degradations. In this work, we first examine the effects of annotator label uncertainty in terms of the model's generalizability and prediction uncertainty. We observe that the model's generalizability and prediction uncertainty degrade with the presence of low-quality noisy labels. Meanwhile, our evaluation of existing uncertainty estimation algorithms indicates their incapability in response to annotator label uncertainty. To mitigate performance degradation, prior methods show that training models with labels collected from multiple independent annotators can enhance generalizability. However, they require massive annotations. Hence, we introduce a novel perceptual quality-based model training framework to objectively generate multiple labels for model training to enhance reliability, while avoiding massive annotations. Specifically, we first select a subset of samples with low perceptual quality scores ranked by statistical regularities of visual signals. We then assign de-aggregated labels to each sample in this subset to obtain a training set with multiple labels. Our experiments and analysis demonstrate that training with the proposed framework alleviates the degradation of generalizability and prediction uncertainty caused by annotator label uncertainty.  ( 2 min )
    Grasp Anything: Combining Teacher-Augmented Policy Gradient Learning with Instance Segmentation to Grasp Arbitrary Objects
    arXiv:2403.10187v1 Announce Type: cross Abstract: Interactive grasping from clutter, akin to human dexterity, is one of the longest-standing problems in robot learning. Challenges stem from the intricacies of visual perception, the demand for precise motor skills, and the complex interplay between the two. In this work, we present Teacher-Augmented Policy Gradient (TAPG), a novel two-stage learning framework that synergizes reinforcement learning and policy distillation. After training a teacher policy to master the motor control based on object pose information, TAPG facilitates guided, yet adaptive, learning of a sensorimotor policy, based on object segmentation. We zero-shot transfer from simulation to a real robot by using Segment Anything Model for promptable object segmentation. Our trained policies adeptly grasp a wide variety of objects from cluttered scenarios in simulation and the real world based on human-understandable prompts. Furthermore, we show robust zero-shot transfer to novel objects. Videos of our experiments are available at \url{https://maltemosbach.github.io/grasp_anything}.  ( 2 min )
    CoReEcho: Continuous Representation Learning for 2D+time Echocardiography Analysis
    arXiv:2403.10164v1 Announce Type: cross Abstract: Deep learning (DL) models have been advancing automatic medical image analysis on various modalities, including echocardiography, by offering a comprehensive end-to-end training pipeline. This approach enables DL models to regress ejection fraction (EF) directly from 2D+time echocardiograms, resulting in superior performance. However, the end-to-end training pipeline makes the learned representations less explainable. The representations may also fail to capture the continuous relation among echocardiogram clips, indicating the existence of spurious correlations, which can negatively affect the generalization. To mitigate this issue, we propose CoReEcho, a novel training framework emphasizing continuous representations tailored for direct EF regression. Our extensive experiments demonstrate that CoReEcho: 1) outperforms the current state-of-the-art (SOTA) on the largest echocardiography dataset (EchoNet-Dynamic) with MAE of 3.90 & R2 of 82.44, and 2) provides robust and generalizable features that transfer more effectively in related downstream tasks. The code is publicly available at https://github.com/fadamsyah/CoReEcho.  ( 2 min )
    Improving Medical Multi-modal Contrastive Learning with Expert Annotations
    arXiv:2403.10153v1 Announce Type: cross Abstract: We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.  ( 2 min )
    NLP Verification: Towards a General Methodology for Certifying Robustness
    arXiv:2403.10144v1 Announce Type: cross Abstract: Deep neural networks have exhibited substantial success in the field of Natural Language Processing (NLP) and ensuring their safety and reliability is crucial: there are safety critical contexts where such models must be robust to variability or attack, and give guarantees over their output. Unlike Computer Vision, NLP lacks a unified verification methodology and, despite recent advancements in literature, they are often light on the pragmatical issues of NLP verification. In this paper, we make an attempt to distil and evaluate general components of an NLP verification pipeline, that emerges from the progress in the field to date. Our contributions are two-fold. Firstly, we give a general characterisation of verifiable subspaces that result from embedding sentences into continuous spaces. We identify, and give an effective method to deal with, the technical challenge of semantic generalisability of verified subspaces; and propose it as a standard metric in the NLP verification pipelines (alongside with the standard metrics of model accuracy and model verifiability). Secondly, we propose a general methodology to analyse the effect of the embedding gap, a problem that refers to the discrepancy between verification of geometric subpspaces on the one hand, and semantic meaning of sentences which the geometric subspaces are supposed to represent, on the other hand. In extreme cases, poor choices in embedding of sentences may invalidate verification results. We propose a number of practical NLP methods that can help to identify the effects of the embedding gap; and in particular we propose the metric of falsifiability of semantic subpspaces as another fundamental metric to be reported as part of the NLP verification pipeline. We believe that together these general principles pave the way towards a more consolidated and effective development of this new domain.  ( 3 min )
    Approximation and bounding techniques for the Fisher-Rao distances
    arXiv:2403.10089v1 Announce Type: cross Abstract: The Fisher-Rao distance between two probability distributions of a statistical model is defined as the Riemannian geodesic distance induced by the Fisher information metric. In order to calculate the Fisher-Rao distance in closed-form, we need (1) to elicit a formula for the Fisher-Rao geodesics, and (2) to integrate the Fisher length element along those geodesics. We consider several numerically robust approximation and bounding techniques for the Fisher-Rao distances: First, we report generic upper bounds on Fisher-Rao distances based on closed-form 1D Fisher-Rao distances of submodels. Second, we describe several generic approximation schemes depending on whether the Fisher-Rao geodesics or pregeodesics are available in closed-form or not. In particular, we obtain a generic method to guarantee an arbitrarily small additive error on the approximation provided that Fisher-Rao pregeodesics and tight lower and upper bounds are available. Third, we consider the case of Fisher metrics being Hessian metrics, and report generic tight upper bounds on the Fisher-Rao distances using techniques of information geometry. Uniparametric and biparametric statistical models always have Fisher Hessian metrics, and in general a simple test allows to check whether the Fisher information matrix yields a Hessian metric or not. Fourth, we consider elliptical distribution families and show how to apply the above techniques to these models. We also propose two new distances based either on the Fisher-Rao lengths of curves serving as proxies of Fisher-Rao geodesics, or based on the Birkhoff/Hilbert projective cone distance. Last, we consider an alternative group-theoretic approach for statistical transformation models based on the notion of maximal invariant which yields insights on the structures of the Fisher-Rao distance formula which may be used fruitfully in applications.  ( 3 min )
    Belief Aided Navigation using Bayesian Reinforcement Learning for Avoiding Humans in Blind Spots
    arXiv:2403.10105v1 Announce Type: cross Abstract: Recent research on mobile robot navigation has focused on socially aware navigation in crowded environments. However, existing methods do not adequately account for human robot interactions and demand accurate location information from omnidirectional sensors, rendering them unsuitable for practical applications. In response to this need, this study introduces a novel algorithm, BNBRL+, predicated on the partially observable Markov decision process framework to assess risks in unobservable areas and formulate movement strategies under uncertainty. BNBRL+ consolidates belief algorithms with Bayesian neural networks to probabilistically infer beliefs based on the positional data of humans. It further integrates the dynamics between the robot, humans, and inferred beliefs to determine the navigation paths and embeds social norms within the reward function, thereby facilitating socially aware navigation. Through experiments in various risk laden scenarios, this study validates the effectiveness of BNBRL+ in navigating crowded environments with blind spots. The model's ability to navigate effectively in spaces with limited visibility and avoid obstacles dynamically can significantly improve the safety and reliability of autonomous vehicles.  ( 2 min )
    Accurate and Data-Efficient Micro-XRD Phase Identification Using Multi-Task Learning: Application to Hydrothermal Fluids
    arXiv:2403.10042v1 Announce Type: cross Abstract: Traditional analysis of highly distorted micro-X-ray diffraction ({\mu}-XRD) patterns from hydrothermal fluid environments is a time-consuming process, often requiring substantial data preprocessing and labeled experimental data. This study demonstrates the potential of deep learning with a multitask learning (MTL) architecture to overcome these limitations. We trained MTL models to identify phase information in {\mu}-XRD patterns, minimizing the need for labeled experimental data and masking preprocessing steps. Notably, MTL models showed superior accuracy compared to binary classification CNNs. Additionally, introducing a tailored cross-entropy loss function improved MTL model performance. Most significantly, MTL models tuned to analyze raw and unmasked XRD patterns achieved close performance to models analyzing preprocessed data, with minimal accuracy differences. This work indicates that advanced deep learning architectures like MTL can automate arduous data handling tasks, streamline the analysis of distorted XRD patterns, and reduce the reliance on labor-intensive experimental datasets.  ( 2 min )
    MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage
    arXiv:2403.10024v1 Announce Type: cross Abstract: This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling are proposed. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.  ( 2 min )
    LyZNet: A Lightweight Python Tool for Learning and Verifying Neural Lyapunov Functions and Regions of Attraction
    arXiv:2403.10013v1 Announce Type: cross Abstract: In this paper, we describe a lightweight Python framework that provides integrated learning and verification of neural Lyapunov functions for stability analysis. The proposed tool, named LyZNet, learns neural Lyapunov functions using physics-informed neural networks (PINNs) to solve Zubov's equation and verifies them using satisfiability modulo theories (SMT) solvers. What distinguishes this tool from others in the literature is its ability to provide verified regions of attraction close to the domain of attraction. This is achieved by encoding Zubov's partial differential equation (PDE) into the PINN approach. By embracing the non-convex nature of the underlying optimization problems, we demonstrate that in cases where convex optimization, such as semidefinite programming, fails to capture the domain of attraction, our neural network framework proves more successful. The tool also offers automatic decomposition of coupled nonlinear systems into a network of low-dimensional subsystems for compositional verification. We illustrate the tool's usage and effectiveness with several numerical examples, including both non-trivial low-dimensional nonlinear systems and high-dimensional systems. The repository of the tool can be found at https://git.uwaterloo.ca/hybrid-systems-lab/lyznet.  ( 2 min )
    Linear optimal transport subspaces for point set classification
    arXiv:2403.10015v1 Announce Type: cross Abstract: Learning from point sets is an essential component in many computer vision and machine learning applications. Native, unordered, and permutation invariant set structure space is challenging to model, particularly for point set classification under spatial deformations. Here we propose a framework for classifying point sets experiencing certain types of spatial deformations, with a particular emphasis on datasets featuring affine deformations. Our approach employs the Linear Optimal Transport (LOT) transform to obtain a linear embedding of set-structured data. Utilizing the mathematical properties of the LOT transform, we demonstrate its capacity to accommodate variations in point sets by constructing a convex data space, effectively simplifying point set classification problems. Our method, which employs a nearest-subspace algorithm in the LOT space, demonstrates label efficiency, non-iterative behavior, and requires no hyper-parameter tuning. It achieves competitive accuracies compared to state-of-the-art methods across various point set classification tasks. Furthermore, our approach exhibits robustness in out-of-distribution scenarios where training and test distributions vary in terms of deformation magnitudes.  ( 2 min )
    Graph Enhanced Reinforcement Learning for Effective Group Formation in Collaborative Problem Solving
    arXiv:2403.10006v1 Announce Type: cross Abstract: This study addresses the challenge of forming effective groups in collaborative problem-solving environments. Recognizing the complexity of human interactions and the necessity for efficient collaboration, we propose a novel approach leveraging graph theory and reinforcement learning. Our methodology involves constructing a graph from a dataset where nodes represent participants, and edges signify the interactions between them. We conceptualize each participant as an agent within a reinforcement learning framework, aiming to learn an optimal graph structure that reflects effective group dynamics. Clustering techniques are employed to delineate clear group structures based on the learned graph. Our approach provides theoretical solutions based on evaluation metrics and graph measurements, offering insights into potential improvements in group effectiveness and reductions in conflict incidences. This research contributes to the fields of collaborative work and educational psychology by presenting a data-driven, analytical approach to group formation. It has practical implications for organizational team building, classroom settings, and any collaborative scenario where group dynamics are crucial. The study opens new avenues for exploring the application of graph theory and reinforcement learning in social and behavioral sciences, highlighting the potential for empirical validation in future work.  ( 2 min )
    GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
    arXiv:2403.09974v1 Announce Type: cross Abstract: Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes, leveraging the class concepts learned from labeled samples. Current GCD methods only use a single visual modality of information, resulting in poor classification of visually similar classes. Though certain classes are visually confused, their text information might be distinct, motivating us to introduce text information into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information. To tackle this challenging problem, in this paper, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of the CLIP's text encoder to generate pseudo text embeddings. Besides, we employ a dual-branch framework, through the joint learning and instance consistency of different modality branches, visual and semantic information mutually enhance each other, promoting the interaction and fusion of visual and text embedding space. Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks, achieving new state-of-the-art. The code will be released at \url{https://github.com/enguangW/GET}.  ( 2 min )
    Thermal Earth Model for the Conterminous United States Using an Interpolative Physics-Informed Graph Neural Network (InterPIGNN)
    arXiv:2403.09961v1 Announce Type: cross Abstract: This study presents a data-driven spatial interpolation algorithm based on physics-informed graph neural networks used to develop national temperature-at-depth maps for the conterminous United States. The model was trained to approximately satisfy the three-dimensional heat conduction law by simultaneously predicting subsurface temperature, surface heat flow, and rock thermal conductivity. In addition to bottomhole temperature measurements, we incorporated other physical quantities as model inputs, such as depth, geographic coordinates, elevation, sediment thickness, magnetic anomaly, gravity anomaly, gamma-ray flux of radioactive elements, seismicity, and electric conductivity. We constructed surface heat flow, and temperature and thermal conductivity predictions for depths of 0-7 km at an interval of 1 km with spatial resolution of 18 km$^2$ per grid cell. Our model showed superior temperature, surface heat flow and thermal conductivity mean absolute errors of 4.8{\deg} C, 5.817 mW/m$^2$ and 0.022 W/(C-m)$, respectively. The predictions were visualized in two-dimensional spatial maps across the modeled depths. This thorough modeling of the Earth's thermal processes is crucial to understanding subsurface phenomena and exploiting natural underground resources.  ( 2 min )
    Attention-Enhanced Hybrid Feature Aggregation Network for 3D Brain Tumor Segmentation
    arXiv:2403.09942v1 Announce Type: cross Abstract: Glioblastoma is a highly aggressive and malignant brain tumor type that requires early diagnosis and prompt intervention. Due to its heterogeneity in appearance, developing automated detection approaches is challenging. To address this challenge, Artificial Intelligence (AI)-driven approaches in healthcare have generated interest in efficiently diagnosing and evaluating brain tumors. The Brain Tumor Segmentation Challenge (BraTS) is a platform for developing and assessing automated techniques for tumor analysis using high-quality, clinically acquired MRI data. In our approach, we utilized a multi-scale, attention-guided and hybrid U-Net-shaped model -- GLIMS -- to perform 3D brain tumor segmentation in three regions: Enhancing Tumor (ET), Tumor Core (TC), and Whole Tumor (WT). The multi-scale feature extraction provides better contextual feature aggregation in high resolutions and the Swin Transformer blocks improve the global feature extraction at deeper levels of the model. The segmentation mask generation in the decoder branch is guided by the attention-refined features gathered from the encoder branch to enhance the important attributes. Moreover, hierarchical supervision is used to train the model efficiently. Our model's performance on the validation set resulted in 92.19, 87.75, and 83.18 Dice Scores and 89.09, 84.67, and 82.15 Lesion-wise Dice Scores in WT, TC, and ET, respectively. The code is publicly available at https://github.com/yaziciz/GLIMS.  ( 3 min )
    Recurrent Drafter for Fast Speculative Decoding in Large Language Models
    arXiv:2403.09919v1 Announce Type: cross Abstract: In this paper, we introduce an improved approach of speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach, and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small, draft model uses in classic speculative decoding, but without the complexities of the full transformer architecture. And because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The outcome is a method that combines the simplicity of single-model design and avoids the need to create a data-dependent tree attention structure only for inference in Medusa. We empirically demonstrate the effectiveness of the proposed method on several popular open source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.  ( 2 min )
    Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptive Object Detection
    arXiv:2403.09918v1 Announce Type: cross Abstract: Domain adaptation methods for object detection (OD) strive to mitigate the impact of distribution shifts by promoting feature alignment across source and target domains. Multi-source domain adaptation (MSDA) allows leveraging multiple annotated source datasets, and unlabeled target data to improve the accuracy and robustness of the detection model. Most state-of-the-art MSDA methods for OD perform feature alignment in a class-agnostic manner. This is challenging since the objects have unique modal information due to variations in object appearance across domains. A recent prototype-based approach proposed a class-wise alignment, yet it suffers from error accumulation due to noisy pseudo-labels which can negatively affect adaptation with imbalanced data. To overcome these limitations, we propose an attention-based class-conditioned alignment scheme for MSDA that aligns instances of each object category across domains. In particular, an attention module coupled with an adversarial domain classifier allows learning domain-invariant and class-specific instance representations. Experimental results on multiple benchmarking MSDA datasets indicate that our method outperforms the state-of-the-art methods and is robust to class imbalance. Our code is available at https://github.com/imatif17/ACIA.  ( 2 min )
    Fisher Mask Nodes for Language Model Merging
    arXiv:2403.09891v1 Announce Type: cross Abstract: Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically only perform one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging in a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup of 57.4x. Our results prove the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.  ( 2 min )
    ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Image
    arXiv:2403.09871v1 Announce Type: cross Abstract: In this work, we present ThermoHands, a new benchmark for thermal image-based egocentric 3D hand pose estimation, aimed at overcoming challenges like varying lighting and obstructions (e.g., handwear). The benchmark includes a diverse dataset from 28 subjects performing hand-object and hand-virtual interactions, accurately annotated with 3D hand poses through an automated process. We introduce a bespoken baseline method, TheFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TheFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.  ( 2 min )
    Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware Priors
    arXiv:2403.09869v1 Announce Type: cross Abstract: Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance -- even when only retraining the final layer of a previously trained non-robust model. Group aware-priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts.  ( 2 min )
    Adapting OC20-trained EquiformerV2 Models for High-Entropy Materials
    arXiv:2403.09811v1 Announce Type: cross Abstract: Computational high-throughput studies, especially in research on high-entropy materials and catalysts, are hampered by high-dimensional composition spaces and myriad structural microstates. They present bottlenecks to the conventional use of density functional theory calculations, and consequently, the use of machine-learned potentials is becoming increasingly prevalent in atomic structure simulations. In this communication, we show the results of adjusting and fine-tuning the pretrained EquiformerV2 model from the Open Catalyst Project to infer adsorption energies of *OH and *O on the out-of-domain high-entropy alloy Ag-Ir-Pd-Pt-Ru. By applying an energy filter based on the local environment of the binding site the zero-shot inference is markedly improved and through few-shot fine-tuning the model yields state-of-the-art accuracy. It is also found that EquiformerV2, assuming the role of general machine learning potential, is able to inform a smaller, more focused direct inference model. This knowledge distillation setup boosts performance on complex binding sites. Collectively, this shows that foundational knowledge learned from ordered intermetallic structures, can be extrapolated to the highly disordered structures of solid-solutions. With the vastly accelerated computational throughput of these models, hitherto infeasible research in the high-entropy material space is now readily accessible.  ( 2 min )
    On the Utility of 3D Hand Poses for Action Recognition
    arXiv:2403.09805v1 Announce Type: cross Abstract: 3D hand poses are an under-explored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture objects and environments with which humans interact. To efficiently model hand-object interactions, we propose HandFormer, a novel multimodal transformer. HandFormer combines 3D hand poses at a high temporal resolution for fine-grained motion modeling with sparsely sampled RGB frames for encoding scene semantics. Observing the unique characteristics of hand poses, we temporally factorize hand modeling and represent each joint by its short-term trajectories. This factorized pose representation combined with sparse RGB samples is remarkably efficient and achieves high accuracy. Unimodal HandFormer with only hand poses outperforms existing skeleton-based methods at 5x fewer FLOPs. With RGB, we achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.  ( 2 min )
    LabelAId: Just-in-time AI Interventions for Improving Human Labeling Quality and Domain Knowledge in Crowdsourcing Systems
    arXiv:2403.09810v1 Announce Type: cross Abstract: Crowdsourcing platforms have transformed distributed problem-solving, yet quality control remains a persistent challenge. Traditional quality control measures, such as prescreening workers and refining instructions, often focus solely on optimizing economic output. This paper explores just-in-time AI interventions to enhance both labeling quality and domain-specific knowledge among crowdworkers. We introduce LabelAId, an advanced inference model combining Programmatic Weak Supervision (PWS) with FT-Transformers to infer label correctness based on user behavior and domain knowledge. Our technical evaluation shows that our LabelAId pipeline consistently outperforms state-of-the-art ML baselines, improving mistake inference accuracy by 36.7% with 50 downstream samples. We then implemented LabelAId into Project Sidewalk, an open-source crowdsourcing platform for urban accessibility. A between-subjects study with 34 participants demonstrates that LabelAId significantly enhances label precision without compromising efficiency while also increasing labeler confidence. We discuss LabelAId's success factors, limitations, and its generalizability to other crowdsourced science domains.  ( 2 min )
    Socially Integrated Navigation: A Social Acting Robot with Deep Reinforcement Learning
    arXiv:2403.09793v1 Announce Type: cross Abstract: Mobile robots are being used on a large scale in various crowded situations and become part of our society. The socially acceptable navigation behavior of a mobile robot with individual human consideration is an essential requirement for scalable applications and human acceptance. Deep Reinforcement Learning (DRL) approaches are recently used to learn a robot's navigation policy and to model the complex interactions between robots and humans. We propose to divide existing DRL-based navigation approaches based on the robot's exhibited social behavior and distinguish between social collision avoidance with a lack of social behavior and socially aware approaches with explicit predefined social behavior. In addition, we propose a novel socially integrated navigation approach where the robot's social behavior is adaptive and emerges from the interaction with humans. The formulation of our approach is derived from a sociological definition, which states that social acting is oriented toward the acting of others. The DRL policy is trained in an environment where other agents interact socially integrated and reward the robot's behavior individually. The simulation results indicate that the proposed socially integrated navigation approach outperforms a socially aware approach in terms of distance traveled, time to completion, and negative impact on all agents within the environment.  ( 2 min )
    Emotional Intelligence Through Artificial Intelligence : NLP and Deep Learning in the Analysis of Healthcare Texts
    arXiv:2403.09762v1 Announce Type: cross Abstract: This manuscript presents a methodical examination of the utilization of Artificial Intelligence in the assessment of emotions in texts related to healthcare, with a particular focus on the incorporation of Natural Language Processing and deep learning technologies. We scrutinize numerous research studies that employ AI to augment sentiment analysis, categorize emotions, and forecast patient outcomes based on textual information derived from clinical narratives, patient feedback on medications, and online health discussions. The review demonstrates noteworthy progress in the precision of algorithms used for sentiment classification, the prognostic capabilities of AI models for neurodegenerative diseases, and the creation of AI-powered systems that offer support in clinical decision-making. Remarkably, the utilization of AI applications has exhibited an enhancement in personalized therapy plans by integrating patient sentiment and contributing to the early identification of mental health disorders. There persist challenges, which encompass ensuring the ethical application of AI, safeguarding patient confidentiality, and addressing potential biases in algorithmic procedures. Nevertheless, the potential of AI to revolutionize healthcare practices is unmistakable, offering a future where healthcare is not only more knowledgeable and efficient but also more empathetic and centered around the needs of patients. This investigation underscores the transformative influence of AI on healthcare, delivering a comprehensive comprehension of its role in examining emotional content in healthcare texts and highlighting the trajectory towards a more compassionate approach to patient care. The findings advocate for a harmonious synergy between AI's analytical capabilities and the human aspects of healthcare.  ( 3 min )
    Reconstructing Blood Flow in Data-Poor Regimes: A Vasculature Network Kernel for Gaussian Process Regression
    arXiv:2403.09758v1 Announce Type: cross Abstract: Blood flow reconstruction in the vasculature is important for many clinical applications. However, in clinical settings, the available data are often quite limited. For instance, Transcranial Doppler ultrasound (TCD) is a noninvasive clinical tool that is commonly used in the clinical settings to measure blood velocity waveform at several locations on brain's vasculature. This amount of data is grossly insufficient for training machine learning surrogate models, such as deep neural networks or Gaussian process regression. In this work, we propose a Gaussian process regression approach based on physics-informed kernels, enabling near-real-time reconstruction of blood flow in data-poor regimes. We introduce a novel methodology to reconstruct the kernel within the vascular network, which is a non-Euclidean space. The proposed kernel encodes both spatiotemporal and vessel-to-vessel correlations, thus enabling blood flow reconstruction in vessels that lack direct measurements. We demonstrate that any prediction made with the proposed kernel satisfies the conservation of mass principle. The kernel is constructed by running stochastic one-dimensional blood flow simulations, where the stochasticity captures the epistemic uncertainties, such as lack of knowledge about boundary conditions and uncertainties in vasculature geometries. We demonstrate the performance of the model on three test cases, namely, a simple Y-shaped bifurcation, abdominal aorta, and the Circle of Willis in the brain.  ( 3 min )
    Estimating the history of a random recursive tree
    arXiv:2403.09755v1 Announce Type: cross Abstract: This paper studies the problem of estimating the order of arrival of the vertices in a random recursive tree. Specifically, we study two fundamental models: the uniform attachment model and the linear preferential attachment model. We propose an order estimator based on the Jordan centrality measure and define a family of risk measures to quantify the quality of the ordering procedure. Moreover, we establish a minimax lower bound for this problem, and prove that the proposed estimator is nearly optimal. Finally, we numerically demonstrate that the proposed estimator outperforms degree-based and spectral ordering procedures.  ( 2 min )
    A Short Review on Novel Approaches for Maximum Clique Problem: from Classical algorithms to Graph Neural Networks and Quantum algorithms
    arXiv:2403.09742v1 Announce Type: cross Abstract: This manuscript provides a comprehensive review of the Maximum Clique Problem, a computational problem that involves finding subsets of vertices in a graph that are all pairwise adjacent to each other. The manuscript covers in a simple way classical algorithms for solving the problem and includes a review of recent developments in graph neural networks and quantum algorithms. The review concludes with benchmarks for testing classical as well as new learning, and quantum algorithms.  ( 2 min )
    The Human Factor in Detecting Errors of Large Language Models: A Systematic Literature Review and Future Research Directions
    arXiv:2403.09743v1 Announce Type: cross Abstract: The launch of ChatGPT by OpenAI in November 2022 marked a pivotal moment for Artificial Intelligence, introducing Large Language Models (LLMs) to the mainstream and setting new records in user adoption. LLMs, particularly ChatGPT, trained on extensive internet data, demonstrate remarkable conversational capabilities across various domains, suggesting a significant impact on the workforce. However, these models are susceptible to errors - "hallucinations" and omissions, generating incorrect or incomplete information. This poses risks especially in contexts where accuracy is crucial, such as legal compliance, medicine or fine-grained process frameworks. There are both technical and human solutions to cope with this isse. This paper explores the human factors that enable users to detect errors in LLM outputs, a critical component in mitigating risks associated with their use in professional settings. Understanding these factors is essential for organizations aiming to leverage LLM technology efficiently, guiding targeted training and deployment strategies to enhance error detection by users. This approach not only aims to optimize the use of LLMs but also to prevent potential downstream issues stemming from reliance on inaccurate model responses. The research emphasizes the balance between technological advancement and human insight in maximizing the benefits of LLMs while minimizing the risks, particularly in areas where precision is paramount. This paper performs a systematic literature research on this research topic, analyses and synthesizes the findings, and outlines future research directions. Literature selection cut-off date is January 11th 2024.  ( 3 min )
    Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems
    arXiv:2403.09727v1 Announce Type: cross Abstract: The development of generative large language models (G-LLM) opened up new opportunities for the development of new types of knowledge-based systems similar to ChatGPT, Bing, or Gemini. Fine-tuning (FN) and Retrieval-Augmented Generation (RAG) are the techniques that can be used to implement domain adaptation for the development of G-LLM-based knowledge systems. In our study, using ROUGE, BLEU, METEOR scores, and cosine similarity, we compare and examine the performance of RAG and FN for the GPT-J-6B, OPT-6.7B, LlaMA, LlaMA-2 language models. Based on measurements shown on different datasets, we demonstrate that RAG-based constructions are more efficient than models produced with FN. We point out that connecting RAG and FN is not trivial, because connecting FN models with RAG can cause a decrease in performance. Furthermore, we outline a simple RAG-based architecture which, on average, outperforms the FN models by 16% in terms of the ROGUE score, 15% in the case of the BLEU score, and 53% based on the cosine similarity. This shows the significant advantage of RAG over FN in terms of hallucination, which is not offset by the fact that the average 8% better METEOR score of FN models indicates greater creativity compared to RAG.  ( 2 min )
    ClaimVer: Explainable Claim-Level Verification and Evidence Attribution of Text Through Knowledge Graphs
    arXiv:2403.09724v1 Announce Type: cross Abstract: In the midst of widespread misinformation and disinformation through social media and the proliferation of AI-generated texts, it has become increasingly difficult for people to validate and trust information they encounter. Many fact-checking approaches and tools have been developed, but they often lack appropriate explainability or granularity to be useful in various contexts. A text validation method that is easy to use, accessible, and can perform fine-grained evidence attribution has become crucial. More importantly, building user trust in such a method requires presenting the rationale behind each prediction, as research shows this significantly influences people's belief in automated systems. It is also paramount to localize and bring users' attention to the specific problematic content, instead of providing simple blanket labels. In this paper, we present $\textit{ClaimVer, a human-centric framework}$ tailored to meet users' informational and verification needs by generating rich annotations and thereby reducing cognitive load. Designed to deliver comprehensive evaluations of texts, it highlights each claim, verifies it against a trusted knowledge graph (KG), presents the evidence, and provides succinct, clear explanations for each claim prediction. Finally, our framework introduces an attribution score, enhancing applicability across a wide range of downstream tasks.  ( 2 min )
    Institutional-Level Monitoring of Immune Checkpoint Inhibitor IrAEs Using a Novel Natural Language Processing Algorithmic Pipeline
    arXiv:2403.09708v1 Announce Type: cross Abstract: Background: Immune checkpoint inhibitors (ICIs) have revolutionized cancer treatment but can result in severe immune-related adverse events (IrAEs). Monitoring IrAEs on a large scale is essential for personalized risk profiling and assisting in treatment decisions. Methods: In this study, we conducted an analysis of clinical notes from patients who received ICIs at the Tel Aviv Sourasky Medical Center. By employing a Natural Language Processing algorithmic pipeline, we systematically identified seven common or severe IrAEs. We examined the utilization of corticosteroids, treatment discontinuation rates following IrAEs, and constructed survival curves to visualize the occurrence of adverse events during treatment. Results: Our analysis encompassed 108,280 clinical notes associated with 1,635 patients who had undergone ICI therapy. The detected incidence of IrAEs was consistent with previous reports, exhibiting substantial variation across different ICIs. Treatment with corticosteroids varied depending on the specific IrAE, ranging from 17.3% for thyroiditis to 57.4% for myocarditis. Our algorithm demonstrated high accuracy in identifying IrAEs, as indicated by an area under the curve (AUC) of 0.89 for each suspected note and F1 scores of 0.87 or higher for five out of the seven IrAEs examined at the patient level. Conclusions: This study presents a novel, large-scale monitoring approach utilizing deep neural networks for IrAEs. Our method provides accurate results, enhancing understanding of detrimental consequences experienced by ICI-treated patients. Moreover, it holds potential for monitoring other medications, enabling comprehensive post-marketing surveillance to identify susceptible populations and establish personalized drug safety profiles.  ( 3 min )
    Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
    arXiv:2403.09704v1 Announce Type: cross Abstract: The alignment of large language models is usually done by model providers to add or control behaviors that are common or universally understood across use cases and contexts. In contrast, in this article, we present an approach and architecture that empowers application developers to tune a model to their particular values, social norms, laws and other regulations, and orchestrate between potentially conflicting requirements in context. We lay out three main components of such an Alignment Studio architecture: Framers, Instructors, and Auditors that work in concert to control the behavior of a language model. We illustrate this approach with a running example of aligning a company's internal-facing enterprise chatbot to its business conduct guidelines.  ( 2 min )
    ViT-MUL: A Baseline Study on Recent Machine Unlearning Methods Applied to Vision Transformers
    arXiv:2403.09681v1 Announce Type: cross Abstract: Machine unlearning (MUL) is an arising field in machine learning that seeks to erase the learned information of specific training data points from a trained model. Despite the recent active research in MUL within computer vision, the majority of work has focused on ResNet-based models. Given that Vision Transformers (ViT) have become the predominant model architecture, a detailed study of MUL specifically tailored to ViT is essential. In this paper, we present comprehensive experiments on ViTs using recent MUL algorithms and datasets. We anticipate that our experiments, ablation studies, and findings could provide valuable insights and inspire further research in this field.  ( 2 min )
    Pre-Sorted Tsetlin Machine (The Genetic K-Medoid Method)
    arXiv:2403.09680v1 Announce Type: cross Abstract: This paper proposes a machine learning pre-sort stage to traditional supervised learning using Tsetlin Machines. Initially, N data-points are identified from the dataset using an expedited genetic algorithm to solve the maximum dispersion problem. These are then used as the initial placement to run the K-Medoid clustering algorithm. Finally, an expedited genetic algorithm is used to align N independent Tsetlin Machines by maximising hamming distance. For MNIST level classification problems, results demonstrate up to 10% improvement in accuracy, approx. 383X reduction in training time and approx. 86X reduction in inference time.  ( 2 min )
    Navigating the Peril of Generated Alternative Facts: A ChatGPT-4 Fabricated Omega Variant Case as a Cautionary Tale in Medical Misinformation
    arXiv:2403.09674v1 Announce Type: cross Abstract: In an era where artificial intelligence (AI) intertwines with medical research, the delineation of truth becomes increasingly complex. This study ostensibly examines a purported novel SARS-CoV-2 variant, dubbed the Omega variant, showcasing 31 unique mutations in the S gene region. However, the real undercurrent of this narrative is a demonstration of the ease with which AI, specifically ChatGPT-4, can fabricate convincing yet entirely fictional scientific data. The so-called Omega variant was identified in a fully vaccinated, previously infected 35-year-old male presenting with severe COVID-19 symptoms. Through a detailed, albeit artificial, genomic analysis and contact tracing, this study mirrors the rigorous methodology of genuine case reports, thereby setting the stage for a compelling but entirely constructed narrative. The entire case study was generated by ChatGPT-4, a large language model by OpenAI. The fabricated Omega variant features an ensemble of mutations, including N501Y and E484K, known for enhancing ACE2 receptor affinity, alongside L452R and P681H, ostensibly indicative of immune evasion. This variant's contrived interaction dynamics - severe symptoms in a vaccinated individual versus mild ones in unvaccinated contacts - were designed to mimic real-world complexities, including suggestions of antibody-dependent enhancement (ADE). While the Omega variant is a product of AI-generated fiction, the implications of this exercise are real and profound. The ease with which AI can generate believable but false scientific information, as illustrated in this case, raises significant concerns about the potential for misinformation in medicine. This study, therefore, serves as a cautionary tale, emphasizing the necessity for critical evaluation of sources, especially in an age where AI tools like ChatGPT are becoming increasingly sophisticated and widespread in their use.  ( 3 min )
    FoldToken: Learning Protein Language via Vector Quantization and Beyond
    arXiv:2403.09673v1 Announce Type: cross Abstract: Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce \textbf{FoldTokenizer} to represent protein sequence-structure as discrete symbols. This innovative approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We refer to the learned discrete symbols as \textbf{FoldToken}, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language on general backbone inpainting and antibody design tasks, building the first GPT-style model (\textbf{FoldGPT}) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (\textbf{SoftCVQ}).  ( 2 min )
    On Unsupervised Image-to-image translation and GAN stability
    arXiv:2403.09646v1 Announce Type: cross Abstract: The problem of image-to-image translation is one that is intruiging and challenging at the same time, for the impact potential it can have on a wide variety of other computer vision applications like colorization, inpainting, segmentation and others. Given the high-level of sophistication needed to extract patterns from one domain and successfully applying them to another, especially, in a completely unsupervised (unpaired) manner, this problem has gained much attention as of the last few years. It is one of the first problems where successful applications to deep generative models, and especially Generative Adversarial Networks achieved astounding results that are actually of realworld impact, rather than just a show of theoretical prowess; the such that has been dominating the GAN world. In this work, we study some of the failure cases of a seminal work in the field, CycleGAN [1] and hypothesize that they are GAN-stability related, and propose two general models to try to alleviate these problems. We also reach the same conclusion of the problem being ill-posed that has been also circulating in the literature lately.  ( 2 min )
    COMPRER: A Multimodal Multi-Objective Pretraining Framework for Enhanced Medical Image Representation
    arXiv:2403.09672v1 Announce Type: cross Abstract: Substantial advances in multi-modal Artificial Intelligence (AI) facilitate the combination of diverse medical modalities to achieve holistic health assessments. We present COMPRER , a novel multi-modal, multi-objective pretraining framework which enhances medical-image representation, diagnostic inferences, and prognosis of diseases. COMPRER employs a multi-objective training framework, where each objective introduces distinct knowledge to the model. This includes a multimodal loss that consolidates information across different imaging modalities; A temporal loss that imparts the ability to discern patterns over time; Medical-measure prediction adds appropriate medical insights; Lastly, reconstruction loss ensures the integrity of image structure within the latent space. Despite the concern that multiple objectives could weaken task performance, our findings show that this combination actually boosts outcomes on certain tasks. Here, we apply this framework to both fundus images and carotid ultrasound, and validate our downstream tasks capabilities by predicting both current and future cardiovascular conditions. COMPRER achieved higher Area Under the Curve (AUC) scores in evaluating medical conditions compared to existing models on held-out data. On the Out-of-distribution (OOD) UK-Biobank dataset COMPRER maintains favorable performance over well-established models with more parameters, even though these models were trained on $75\times$ more data than COMPRER. In addition, to better assess our model's performance in contrastive learning, we introduce a novel evaluation metric, providing deeper understanding of the effectiveness of the latent space pairing.  ( 2 min )
    Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study
    arXiv:2403.10499v1 Announce Type: new Abstract: Pre-training image representations from the raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that often reach competitiveness with fully supervised methods without the need for task-specific training. Besides the encouraging performance on classification accuracy, it is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper, we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural, 3 synthetic distribution shifts, and 11 adversarial attacks. We use CLIP as a pilot study. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our evaluation shows a comprehensive evaluation of robustness is necessary; and there is a significant need to improve the robustness of zero-shot multimodal models.  ( 2 min )
    Introducing Adaptive Continuous Adversarial Training (ACAT) to Enhance ML Robustness
    arXiv:2403.10461v1 Announce Type: new Abstract: Machine Learning (ML) is susceptible to adversarial attacks that aim to trick ML models, making them produce faulty predictions. Adversarial training was found to increase the robustness of ML models against these attacks. However, in network and cybersecurity, obtaining labeled training and adversarial training data is challenging and costly. Furthermore, concept drift deepens the challenge, particularly in dynamic domains like network and cybersecurity, and requires various models to conduct periodic retraining. This letter introduces Adaptive Continuous Adversarial Training (ACAT) to continuously integrate adversarial training samples into the model during ongoing learning sessions, using real-world detected adversarial data, to enhance model resilience against evolving adversarial threats. ACAT is an adaptive defense mechanism that utilizes periodic retraining to effectively counter adversarial attacks while mitigating catastrophic forgetting. Our approach also reduces the total time required for adversarial sample detection, especially in environments such as network security where the rate of attacks could be very high. Traditional detection processes that involve two stages may result in lengthy procedures. Experimental results using a SPAM detection dataset demonstrate that with ACAT, the accuracy of the SPAM filter increased from 69% to over 88% after just three retraining sessions. Furthermore, ACAT outperforms conventional adversarial sample detectors, providing faster decision times, up to four times faster in some cases.  ( 2 min )
    Understanding the Double Descent Phenomenon in Deep Learning
    arXiv:2403.10459v1 Announce Type: new Abstract: Combining empirical risk minimization with capacity control is a classical strategy in machine learning when trying to control the generalization gap and avoid overfitting, as the model class capacity gets larger. Yet, in modern deep learning practice, very large over-parameterized models (e.g. neural networks) are optimized to fit perfectly the training data and still obtain great generalization performance. Past the interpolation point, increasing model complexity seems to actually lower the test error. In this tutorial, we explain the concept of double descent and its mechanisms. The first section sets the classical statistical learning framework and introduces the double descent phenomenon. By looking at a number of examples, section 2 introduces inductive biases that appear to have a key role in double descent by selecting, among the multiple interpolating solutions, a smooth empirical risk minimizer. Finally, section 3 explores the double descent with two linear models, and gives other points of view from recent related works.  ( 2 min )
    Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
    arXiv:2403.10444v1 Announce Type: new Abstract: Speculative decoding has shown to be an effective method for lossless acceleration of large language models (LLMs) during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. The tokens are then verified by the large model in parallel and only a subset of tokens will be kept to guarantee that the final output follows the distribution of the large model. In all of the prior speculative decoding works, the draft verification is performed token-by-token independently. In this work, we propose a better draft verification algorithm that provides additional wall-clock speedup without incurring additional computation cost and draft tokens. We first formulate the draft verification step as a block-level optimal transport problem. The block-level formulation allows us to consider a wider range of draft verification algorithms and obtain a higher number of accepted tokens in expectation in one draft block. We propose a verification algorithm that achieves the optimal accepted length for the block-level transport problem. We empirically evaluate our proposed block-level verification algorithm in a wide range of tasks and datasets, and observe consistent improvements in wall-clock speedup when compared to token-level verification algorithm. To the best of our knowledge, our work is the first to establish improvement over speculative decoding through a better draft verification algorithm.  ( 3 min )
    Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination
    arXiv:2403.10416v1 Announce Type: new Abstract: We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\mathbb{R}^d$ with corruption rate $\epsilon>0$, our algorithm has sample complexity $(k^2/\epsilon^2)\mathrm{polylog}(d/\epsilon)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(\epsilon)$. Previous efficient algorithms inherently incur error $\Omega(\epsilon \sqrt{\log(1/\epsilon)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.  ( 2 min )
    Structured Evaluation of Synthetic Tabular Data
    arXiv:2403.10424v1 Announce Type: new Abstract: Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of the many metrics. To address this issue, we propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data. Through various structural decomposition of the objective, this framework allows us to reason for the first time the completeness of any set of metrics, as well as unifies existing metrics, including those that stem from fidelity considerations, downstream application, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.  ( 2 min )
    A comparative study on machine learning approaches for rock mass classification using drilling data
    arXiv:2403.10404v1 Announce Type: new Abstract: Current rock engineering design in drill and blast tunnelling primarily relies on engineers' observational assessments. Measure While Drilling (MWD) data, a high-resolution sensor dataset collected during tunnel excavation, is underutilised, mainly serving for geological visualisation. This study aims to automate the translation of MWD data into actionable metrics for rock engineering. It seeks to link data to specific engineering actions, thus providing critical decision support for geological challenges ahead of the tunnel face. Leveraging a large and geologically diverse dataset of 500,000 drillholes from 15 tunnels, the research introduces models for accurate rock mass quality classification in a real-world tunnelling context. Both conventional machine learning and image-based deep learning are explored to classify MWD data into Q-classes and Q-values, examples of metrics describing the stability of the rock mass, using both tabular and image data. The results indicate that the K-nearest neighbours algorithm in an ensemble with tree-based models using tabular data, effectively classifies rock mass quality. It achieves a cross-validated balanced accuracy of 0.86 in classifying rock mass into the Q-classes A, B, C, D, E1, E2, and 0.95 for a binary classification with E versus the rest. Classification using a CNN with MWD-images for each blasting round resulted in a balanced accuracy of 0.82 for binary classification. Regressing the Q-value from tabular MWD-data achieved cross-validated R2 and MSE scores of 0.80 and 0.18 for a similar ensemble model as in classification. High performance in regression and classification boosts confidence in automated rock mass assessment. Applying advanced modelling on a unique dataset demonstrates MWD data's value in improving rock mass classification accuracy and advancing data-driven rock engineering design, reducing manual intervention.  ( 3 min )
    Regret Minimization via Saddle Point Optimization
    arXiv:2403.10379v1 Announce Type: new Abstract: A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions (E2D) algorithm. Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, decoupling coefficient and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.  ( 2 min )
    Towards a general framework for improving the performance of classifiers using XAI methods
    arXiv:2403.10373v1 Announce Type: new Abstract: Modern Artificial Intelligence (AI) systems, especially Deep Learning (DL) models, poses challenges in understanding their inner workings by AI researchers. eXplainable Artificial Intelligence (XAI) inspects internal mechanisms of AI models providing explanations about their decisions. While current XAI research predominantly concentrates on explaining AI systems, there is a growing interest in using XAI techniques to automatically improve the performance of AI systems themselves. This paper proposes a general framework for automatically improving the performance of pre-trained DL classifiers using XAI methods, avoiding the computational overhead associated with retraining complex models from scratch. In particular, we outline the possibility of two different learning strategies for implementing this architecture, which we will call auto-encoder-based and encoder-decoder-based, and discuss their key aspects.  ( 2 min )
    An Energy-Efficient Ensemble Approach for Mitigating Data Incompleteness in IoT Applications
    arXiv:2403.10371v1 Announce Type: new Abstract: Machine Learning (ML) is becoming increasingly important for IoT-based applications. However, the dynamic and ad-hoc nature of many IoT ecosystems poses unique challenges to the efficacy of ML algorithms. One such challenge is data incompleteness, which is manifested as missing sensor readings. Many factors, including sensor failures and/or network disruption, can cause data incompleteness. Furthermore, most IoT systems are severely power-constrained. It is important that we build IoT-based ML systems that are robust against data incompleteness while simultaneously being energy efficient. This paper presents an empirical study of SECOE - a recent technique for alleviating data incompleteness in IoT - with respect to its energy bottlenecks. Towards addressing the energy bottlenecks of SECOE, we propose ENAMLE - a proactive, energy-aware technique for mitigating the impact of concurrent missing data. ENAMLE is unique in the sense that it builds an energy-aware ensemble of sub-models, each trained with a subset of sensors chosen carefully based on their correlations. Furthermore, at inference time, ENAMLE adaptively alters the number of the ensemble of models based on the amount of missing data rate and the energy-accuracy trade-off. ENAMLE's design includes several novel mechanisms for minimizing energy consumption while maintaining accuracy. We present extensive experimental studies on two distinct datasets that demonstrate the energy efficiency of ENAMLE and its ability to alleviate sensor failures.  ( 3 min )
    Towards Non-Adversarial Algorithmic Recourse
    arXiv:2403.10330v1 Announce Type: new Abstract: The streams of research on adversarial examples and counterfactual explanations have largely been growing independently. This has led to several recent works trying to elucidate their similarities and differences. Most prominently, it has been argued that adversarial examples, as opposed to counterfactual explanations, have a unique characteristic in that they lead to a misclassification compared to the ground truth. However, the computational goals and methodologies employed in existing counterfactual explanation and adversarial example generation methods often lack alignment with this requirement. Using formal definitions of adversarial examples and counterfactual explanations, we introduce non-adversarial algorithmic recourse and outline why in high-stakes situations, it is imperative to obtain counterfactual explanations that do not exhibit adversarial characteristics. We subsequently investigate how different components in the objective functions, e.g., the machine learning model or cost function used to measure distance, determine whether the outcome can be considered an adversarial example or not. Our experiments on common datasets highlight that these design choices are often more critical in deciding whether recourse is non-adversarial than whether recourse or attack algorithms are used. Furthermore, we show that choosing a robust and accurate machine learning model results in less adversarial recourse desired in practice.  ( 2 min )
    Generation is better than Modification: Combating High Class Homophily Variance in Graph Anomaly Detection
    arXiv:2403.10339v1 Announce Type: new Abstract: Graph-based anomaly detection is currently an important research topic in the field of graph neural networks (GNNs). We find that in graph anomaly detection, the homophily distribution differences between different classes are significantly greater than those in homophilic and heterophilic graphs. For the first time, we introduce a new metric called Class Homophily Variance, which quantitatively describes this phenomenon. To mitigate its impact, we propose a novel GNN model named Homophily Edge Generation Graph Neural Network (HedGe). Previous works typically focused on pruning, selecting or connecting on original relationships, and we refer to these methods as modifications. Different from these works, our method emphasizes generating new relationships with low class homophily variance, using the original relationships as an auxiliary. HedGe samples homophily adjacency matrices from scratch using a self-attention mechanism, and leverages nodes that are relevant in the feature space but not directly connected in the original graph. Additionally, we modify the loss function to punish the generation of unnecessary heterophilic edges by the model. Extensive comparison experiments demonstrate that HedGe achieved the best performance across multiple benchmark datasets, including anomaly detection and edgeless node classification. The proposed model also improves the robustness under the novel Heterophily Attack with increased class homophily variance on other graph classification tasks.  ( 2 min )
    Anytime Neural Architecture Search on Tabular Data
    arXiv:2403.10318v1 Announce Type: new Abstract: The increasing demand for tabular data analysis calls for transitioning from manual architecture design to Neural Architecture Search (NAS). This transition demands an efficient and responsive anytime NAS approach that is capable of returning current optimal architectures within any given time budget while progressively enhancing architecture quality with increased budget allocation. However, the area of research on Anytime NAS for tabular data remains unexplored. To this end, we introduce ATLAS, the first anytime NAS approach tailored for tabular data. ATLAS introduces a novel two-phase filtering-and-refinement optimization scheme with joint optimization, combining the strengths of both paradigms of training-free and training-based architecture evaluation. Specifically, in the filtering phase, ATLAS employs a new zero-cost proxy specifically designed for tabular data to efficiently estimate the performance of candidate architectures, thereby obtaining a set of promising architectures. Subsequently, in the refinement phase, ATLAS leverages a fixed-budget search algorithm to schedule the training of the promising candidates, so as to accurately identify the optimal architecture. To jointly optimize the two phases for anytime NAS, we also devise a budget-aware coordinator that delivers high NAS performance within constraints. Experimental evaluations demonstrate that our ATLAS can obtain a good-performing architecture within any predefined time budget and return better architectures as and when a new time budget is made available. Overall, it reduces the search time on tabular data by up to 82.75x compared to existing NAS approaches.  ( 2 min )
    Comprehensive Study Of Predictive Maintenance In Industries Using Classification Models And LSTM Model
    arXiv:2403.10259v1 Announce Type: new Abstract: In today's technology-driven era, the imperative for predictive maintenance and advanced diagnostics extends beyond aviation to encompass the identification of damages, failures, and operational defects in rotating and moving machines. Implementing such services not only curtails maintenance costs but also extends machine lifespan, ensuring heightened operational efficiency. Moreover, it serves as a preventive measure against potential accidents or catastrophic events. The advent of Artificial Intelligence (AI) has revolutionized maintenance across industries, enabling more accurate and efficient prediction and analysis of machine failures, thereby conserving time and resources. Our proposed study aims to delve into various machine learning classification techniques, including Support Vector Machine (SVM), Random Forest, Logistic Regression, and Convolutional Neural Network LSTM-Based, for predicting and analyzing machine performance. SVM classifies data into different categories based on their positions in a multidimensional space, while Random Forest employs ensemble learning to create multiple decision trees for classification. Logistic Regression predicts the probability of binary outcomes using input data. The primary objective of the study is to assess these algorithms' performance in predicting and analyzing machine performance, considering factors such as accuracy, precision, recall, and F1 score. The findings will aid maintenance experts in selecting the most suitable machine learning algorithm for effective prediction and analysis of machine performance.  ( 2 min )
    Open Continual Feature Selection via Granular-Ball Knowledge Transfer
    arXiv:2403.10253v1 Announce Type: new Abstract: This paper presents a novel framework for continual feature selection (CFS) in data preprocessing, particularly in the context of an open and dynamic environment where unknown classes may emerge. CFS encounters two primary challenges: the discovery of unknown knowledge and the transfer of known knowledge. To this end, the proposed CFS method combines the strengths of continual learning (CL) with granular-ball computing (GBC), which focuses on constructing a granular-ball knowledge base to detect unknown classes and facilitate the transfer of previously learned knowledge for further feature selection. CFS consists of two stages: initial learning and open learning. The former aims to establish an initial knowledge base through multi-granularity representation using granular-balls. The latter utilizes prior granular-ball knowledge to identify unknowns, updates the knowledge base for granular-ball knowledge transfer, reinforces old knowledge, and integrates new knowledge. Subsequently, we devise an optimal feature subset mechanism that incorporates minimal new features into the existing optimal subset, often yielding superior results during each period. Extensive experimental results on public benchmark datasets demonstrate our method's superiority in terms of both effectiveness and efficiency compared to state-of-the-art feature selection methods.  ( 2 min )
    Less is More: One-shot Subgraph Reasoning on Large-scale Knowledge Graphs
    arXiv:2403.10231v1 Announce Type: new Abstract: To deduce new facts on a knowledge graph (KG), a link predictor learns from the graph structure and collects local evidence to find the answer to a given query. However, existing methods suffer from a severe scalability problem due to the utilization of the whole KG for prediction, which hinders their promise on large scale KGs and cannot be directly addressed by vanilla sampling methods. In this work, we propose the one-shot-subgraph link prediction to achieve efficient and adaptive prediction. The design principle is that, instead of directly acting on the whole KG, the prediction procedure is decoupled into two steps, i.e., (i) extracting only one subgraph according to the query and (ii) predicting on this single, query dependent subgraph. We reveal that the non-parametric and computation-efficient heuristics Personalized PageRank (PPR) can effectively identify the potential answers and supporting evidence. With efficient subgraph-based prediction, we further introduce the automated searching of the optimal configurations in both data and model spaces. Empirically, we achieve promoted efficiency and leading performances on five large-scale benchmarks. The code is publicly available at: https://github.com/tmlr-group/one-shot-subgraph.  ( 2 min )
    From Chaos to Clarity: Time Series Anomaly Detection in Astronomical Observations
    arXiv:2403.10220v1 Announce Type: new Abstract: With the development of astronomical facilities, large-scale time series data observed by these facilities is being collected. Analyzing anomalies in these astronomical observations is crucial for uncovering potential celestial events and physical phenomena, thus advancing the scientific research process. However, existing time series anomaly detection methods fall short in tackling the unique characteristics of astronomical observations where each star is inherently independent but interfered by random concurrent noise, resulting in a high rate of false alarms. To overcome the challenges, we propose AERO, a novel two-stage framework tailored for unsupervised anomaly detection in astronomical observations. In the first stage, we employ a Transformer-based encoder-decoder architecture to learn the normal temporal patterns on each variate (i.e., star) in alignment with the characteristic of variate independence. In the second stage, we enhance the graph neural network with a window-wise graph structure learning to tackle the occurrence of concurrent noise characterized by spatial and temporal randomness. In this way, AERO is not only capable of distinguishing normal temporal patterns from potential anomalies but also effectively differentiating concurrent noise, thus decreasing the number of false alarms. We conducted extensive experiments on three synthetic datasets and three real-world datasets. The results demonstrate that AERO outperforms the compared baselines. Notably, compared to the state-of-the-art model, AERO improves the F1-score by up to 8.76% and 2.63% on synthetic and real-world datasets respectively.  ( 3 min )
    A Short Survey on Importance Weighting for Machine Learning
    arXiv:2403.10175v1 Announce Type: new Abstract: Importance weighting is a fundamental procedure in statistics and machine learning that weights the objective function or probability distribution based on the importance of the instance in some sense. The simplicity and usefulness of the idea has led to many applications of importance weighting. For example, it is known that supervised learning under an assumption about the difference between the training and test distributions, called distribution shift, can guarantee statistically desirable properties through importance weighting by their density ratio. This survey summarizes the broad applications of importance weighting in machine learning and related research.  ( 2 min )
    Explainability through uncertainty: Trustworthy decision-making with neural networks
    arXiv:2403.10168v1 Announce Type: new Abstract: Uncertainty is a key feature of any machine learning model and is particularly important in neural networks, which tend to be overconfident. This overconfidence is worrying under distribution shifts, where the model performance silently degrades as the data distribution diverges from the training data distribution. Uncertainty estimation offers a solution to overconfident models, communicating when the output should (not) be trusted. Although methods for uncertainty estimation have been developed, they have not been explicitly linked to the field of explainable artificial intelligence (XAI). Furthermore, literature in operations research ignores the actionability component of uncertainty estimation and does not consider distribution shifts. This work proposes a general uncertainty framework, with contributions being threefold: (i) uncertainty estimation in ML models is positioned as an XAI technique, giving local and model-specific explanations; (ii) classification with rejection is used to reduce misclassifications by bringing a human expert in the loop for uncertain observations; (iii) the framework is applied to a case study on neural networks in educational data mining subject to distribution shifts. Uncertainty as XAI improves the model's trustworthiness in downstream decision-making tasks, giving rise to more actionable and robust machine learning systems in operations research.  ( 2 min )
    Functional Graph Convolutional Networks: A unified multi-task and multi-modal learning framework to facilitate health and social-care insights
    arXiv:2403.10158v1 Announce Type: new Abstract: This paper introduces a novel Functional Graph Convolutional Network (funGCN) framework that combines Functional Data Analysis and Graph Convolutional Networks to address the complexities of multi-task and multi-modal learning in digital health and longitudinal studies. With the growing importance of health solutions to improve health care and social support, ensure healthy lives, and promote well-being at all ages, funGCN offers a unified approach to handle multivariate longitudinal data for multiple entities and ensures interpretability even with small sample sizes. Key innovations include task-specific embedding components that manage different data types, the ability to perform classification, regression, and forecasting, and the creation of a knowledge graph for insightful data interpretation. The efficacy of funGCN is validated through simulation experiments and a real-data application.  ( 2 min )
    Online Policy Learning from Offline Preferences
    arXiv:2403.10160v1 Announce Type: new Abstract: In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preference. To expedite preference collection, recent works have leveraged \emph{offline preferences}, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data. If a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences and \emph{virtual preferences} for PbRL, which are comparisons between the agent's behaviors and the offline data. Critically, the reward function can track the agent's behaviors using the virtual preferences, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating the virtual preferences in PbRL.  ( 2 min )
    Regularization-Based Efficient Continual Learning in Deep State-Space Models
    arXiv:2403.10123v1 Announce Type: new Abstract: Deep state-space models (DSSMs) have gained popularity in recent years due to their potent modeling capacity for dynamic systems. However, existing DSSM works are limited to single-task modeling, which requires retraining with historical task data upon revisiting a forepassed task. To address this limitation, we propose continual learning DSSMs (CLDSSMs), which are capable of adapting to evolving tasks without catastrophic forgetting. Our proposed CLDSSMs integrate mainstream regularization-based continual learning (CL) methods, ensuring efficient updates with constant computational and memory costs for modeling multiple dynamic systems. We also conduct a comprehensive cost analysis of each CL method applied to the respective CLDSSMs, and demonstrate the efficacy of CLDSSMs through experiments on real-world datasets. The results corroborate that while various competing CL methods exhibit different merits, the proposed CLDSSMs consistently outperform traditional DSSMs in terms of effectively addressing catastrophic forgetting, enabling swift and accurate parameter transfer to new tasks.  ( 2 min )
    Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks
    arXiv:2403.10097v1 Announce Type: new Abstract: While fine-tuning is a de facto standard method for training deep neural networks, it still suffers from overfitting when using small target datasets. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or introducing regularization terms such as contrastive loss. However, these methods require auxiliary source information (e.g., source labels or datasets) or heavy additional computations. In this paper, we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models to adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end, AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class conditional Gaussian distributions. Furthermore, AdaRand dynamically updates the conditional distribution to follow the currently updated feature extractors and balance the distance between classes in feature spaces. Our experiments show that AdaRand outperforms the other fine-tuning regularization, which requires auxiliary source information and heavy computation costs.  ( 2 min )
    Unified Projection-Free Algorithms for Adversarial DR-Submodular Optimization
    arXiv:2403.10063v1 Announce Type: new Abstract: This paper introduces unified projection-free Frank-Wolfe type algorithms for adversarial continuous DR-submodular optimization, spanning scenarios such as full information and (semi-)bandit feedback, monotone and non-monotone functions, different constraints, and types of stochastic queries. For every problem considered in the non-monotone setting, the proposed algorithms are either the first with proven sub-linear $\alpha$-regret bounds or have better $\alpha$-regret bounds than the state of the art, where $\alpha$ is a corresponding approximation bound in the offline setting. In the monotone setting, the proposed approach gives state-of-the-art sub-linear $\alpha$-regret bounds among projection-free algorithms in 7 of the 8 considered cases while matching the result of the remaining case. Additionally, this paper addresses semi-bandit and bandit feedback for adversarial DR-submodular optimization, advancing the understanding of this optimization area.  ( 2 min )
    Towards Adversarially Robust Dataset Distillation by Curvature Regularization
    arXiv:2403.10045v1 Announce Type: new Abstract: Dataset distillation (DD) allows datasets to be distilled to fractions of their original size while preserving the rich distributional information so that models trained on the distilled datasets can achieve a comparable accuracy while saving significant computational loads. Recent research in this area has been focusing on improving the accuracy of models trained on distilled datasets. In this paper, we aim to explore a new perspective of DD. We study how to embed adversarial robustness in distilled datasets, so that models trained on these datasets maintain the high accuracy and meanwhile acquire better adversarial robustness. We propose a new method that achieves this goal by incorporating curvature regularization into the distillation process with much less computational overhead than standard adversarial training. Extensive empirical experiments suggest that our method not only outperforms standard adversarial training on both accuracy and robustness with less computation overhead but is also capable of generating robust distilled datasets that can withstand various adversarial attacks.  ( 2 min )
    AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors
    arXiv:2403.09976v1 Announce Type: new Abstract: Model-based methods have significantly contributed to distinguishing task-irrelevant distractors for visual control. However, prior research has primarily focused on heterogeneous distractors like noisy background videos, leaving homogeneous distractors that closely resemble controllable agents largely unexplored, which poses significant challenges to existing methods. To tackle this problem, we propose Implicit Action Generator (IAG) to learn the implicit actions of visual distractors, and present a new algorithm named implicit Action-informed Diverse visual Distractors Distinguisher (AD3), that leverages the action inferred by IAG to train separated world models. Implicit actions effectively capture the behavior of background distractors, aiding in distinguishing the task-irrelevant components, and the agent can optimize the policy within the task-relevant state space. Our method achieves superior performance on various visual control tasks featuring both heterogeneous and homogeneous distractors. The indispensable role of implicit actions learned by IAG is also empirically validated.  ( 2 min )
    Global Convergence Guarantees for Federated Policy Gradient Methods with Adversaries
    arXiv:2403.09940v1 Announce Type: new Abstract: Federated Reinforcement Learning (FRL) allows multiple agents to collaboratively build a decision making policy without sharing raw trajectories. However, if a small fraction of these agents are adversarial, it can lead to catastrophic results. We propose a policy gradient based approach that is robust to adversarial agents which can send arbitrary values to the server. Under this setting, our results form the first global convergence guarantees with general parametrization. These results demonstrate resilience with adversaries, while achieving sample complexity of order $\tilde{\mathcal{O}}\left( \frac{1}{\epsilon^2} \left( \frac{1}{N-f} + \frac{f^2}{(N-f)^2}\right)\right)$, where $N$ is the total number of agents and $f$ is the number of adversarial agents.  ( 2 min )
    Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics
    arXiv:2403.09930v1 Announce Type: new Abstract: A key aspect of intelligence is the ability to demonstrate a broad spectrum of behaviors for adapting to unexpected situations. Over the past decade, advancements in deep reinforcement learning have led to groundbreaking achievements to solve complex continuous control tasks. However, most approaches return only one solution specialized for a specific problem. We introduce Quality-Diversity Actor-Critic (QDAC), an off-policy actor-critic deep reinforcement learning algorithm that leverages a value function critic and a successor features critic to learn high-performing and diverse behaviors. In this framework, the actor optimizes an objective that seamlessly unifies both critics using constrained optimization to (1) maximize return, while (2) executing diverse skills. Compared with other Quality-Diversity methods, QDAC achieves significantly higher performance and more diverse behaviors on six challenging continuous control locomotion tasks. We also demonstrate that we can harness the learned skills to adapt better than other baselines to five perturbed environments. Finally, qualitative analyses showcase a range of remarkable behaviors, available at: http://bit.ly/qdac.  ( 2 min )
    FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models
    arXiv:2403.09904v1 Announce Type: new Abstract: Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server, while being respectful of privacy. A critical bottleneck in FL is the communication cost. A pivotal strategy to mitigate this burden is \emph{Local Training}, which involves running multiple local stochastic gradient descent iterations between communication phases. Our work is inspired by the innovative \emph{Scaffnew} algorithm, which has considerably advanced the reduction of communication complexity in FL. We introduce FedComLoc (Federated Compressed and Local Training), integrating practical and effective compression into \emph{Scaffnew} to further enhance communication efficiency. Extensive experiments, using the popular TopK compressor and quantization, demonstrate its prowess in substantially reducing communication overheads in heterogeneous settings.  ( 2 min )
    iBRF: Improved Balanced Random Forest Classifier
    arXiv:2403.09867v1 Announce Type: new Abstract: Class imbalance poses a major challenge in different classification tasks, which is a frequently occurring scenario in many real-world applications. Data resampling is considered to be the standard approach to address this issue. The goal of the technique is to balance the class distribution by generating new samples or eliminating samples from the data. A wide variety of sampling techniques have been proposed over the years to tackle this challenging problem. Sampling techniques can also be incorporated into the ensemble learning framework to obtain more generalized prediction performance. Balanced Random Forest (BRF) and SMOTE-Bagging are some of the popular ensemble approaches. In this study, we propose a modification to the BRF classifier to enhance the prediction performance. In the original algorithm, the Random Undersampling (RUS) technique was utilized to balance the bootstrap samples. However, randomly eliminating too many samples from the data leads to significant data loss, resulting in a major decline in performance. We propose to alleviate the scenario by incorporating a novel hybrid sampling approach to balance the uneven class distribution in each bootstrap sub-sample. Our proposed hybrid sampling technique, when incorporated into the framework of the Random Forest classifier, termed as iBRF: improved Balanced Random Forest classifier, achieves better prediction performance than other sampling techniques used in imbalanced classification tasks. Experiments were carried out on 44 imbalanced datasets on which the original BRF classifier produced an average MCC score of 47.03% and an F1 score of 49.09%. Our proposed algorithm outperformed the approach by producing a far better MCC score of 53.04% and an F1 score of 55%. The results obtained signify the superiority of the iBRF algorithm and its potential to be an effective sampling technique in imbalanced learning.  ( 3 min )
    MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning
    arXiv:2403.09859v1 Announce Type: new Abstract: Meta-reinforcement learning (meta-RL) is a promising framework for tackling challenging domains requiring efficient exploration. Existing meta-RL algorithms are characterized by low sample efficiency, and mostly focus on low-dimensional task distributions. In parallel, model-based RL methods have been successful in solving partially observable MDPs, of which meta-RL is a special case. In this work, we leverage this success and propose a new model-based approach to meta-RL, based on elements from existing state-of-the-art model-based and meta-RL methods. We demonstrate the effectiveness of our approach on common meta-RL benchmark domains, attaining greater return with better sample efficiency (up to $15\times$) while requiring very little hyperparameter tuning. In addition, we validate our approach on a slate of more challenging, higher-dimensional domains, taking a step towards real-world generalizing agents.  ( 2 min )
    Few-Shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt
    arXiv:2403.09857v1 Announce Type: new Abstract: Few-Shot Class-Incremental Learning (FSCIL) models aim to incrementally learn new classes with scarce samples while preserving knowledge of old ones. Existing FSCIL methods usually fine-tune the entire backbone, leading to overfitting and hindering the potential to learn new classes. On the other hand, recent prompt-based CIL approaches alleviate forgetting by training prompts with sufficient data in each task. In this work, we propose a novel framework named Attention-aware Self-adaptive Prompt (ASP). ASP encourages task-invariant prompts to capture shared knowledge by reducing specific information from the attention aspect. Additionally, self-adaptive task-specific prompts in ASP provide specific information and transfer knowledge from old classes to new classes with an Information Bottleneck learning objective. In summary, ASP prevents overfitting on base task and does not require enormous data in few-shot incremental tasks. Extensive experiments on three benchmark datasets validate that ASP consistently outperforms state-of-the-art FSCIL and prompt-based CIL methods in terms of both learning new classes and mitigating forgetting.  ( 2 min )
  • Open

    Goodness of fit by Neyman-Pearson testing
    arXiv:2305.14137v2 Announce Type: replace-cross Abstract: The Neyman-Pearson strategy for hypothesis testing can be employed for goodness of fit if the alternative hypothesis is selected from data by exploring a rich parametrised family of models, while controlling the impact of statistical fluctuations. The New Physics Learning Machine (NPLM) methodology has been developed as a concrete implementation of this idea, to target the detection of new physical effects in the context of high energy physics collider experiments. In this paper we conduct a comparison of this approach to goodness of fit with others, in particular with classifier-based strategies that share strong similarities with NPLM. From our comparison, NPLM emerges as the more sensitive test to small departures of the data from the expected distribution and not biased towards detecting specific types of anomalies. These features make it suited for agnostic searches for new physics at collider experiments. Its deployment in other scientific and industrial scenarios should be investigated.  ( 2 min )
    Metrics on Markov Equivalence Classes for Evaluating Causal Discovery Algorithms
    arXiv:2402.04952v2 Announce Type: replace-cross Abstract: Many state-of-the-art causal discovery methods aim to generate an output graph that encodes the graphical separation and connection statements of the causal graph that underlies the data-generating process. In this work, we argue that an evaluation of a causal discovery method against synthetic data should include an analysis of how well this explicit goal is achieved by measuring how closely the separations/connections of the method's output align with those of the ground truth. We show that established evaluation measures do not accurately capture the difference in separations/connections of two causal graphs, and we introduce three new measures of distance called s/c-distance, Markov distance and Faithfulness distance that address this shortcoming. We complement our theoretical analysis with toy examples, empirical experiments and pseudocode.  ( 2 min )
    Debiasing Algorithm through Model Adaptation
    arXiv:2310.18913v3 Announce Type: replace-cross Abstract: Large language models are becoming the go-to solution for the ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey bias. Based on the analysis results, we intervene in the model by applying a linear projection to the weight matrices of these layers. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retrain LLaMA's state-of-the-art performance while being significantly less biased.  ( 2 min )
    Explainability through uncertainty: Trustworthy decision-making with neural networks
    arXiv:2403.10168v1 Announce Type: cross Abstract: Uncertainty is a key feature of any machine learning model and is particularly important in neural networks, which tend to be overconfident. This overconfidence is worrying under distribution shifts, where the model performance silently degrades as the data distribution diverges from the training data distribution. Uncertainty estimation offers a solution to overconfident models, communicating when the output should (not) be trusted. Although methods for uncertainty estimation have been developed, they have not been explicitly linked to the field of explainable artificial intelligence (XAI). Furthermore, literature in operations research ignores the actionability component of uncertainty estimation and does not consider distribution shifts. This work proposes a general uncertainty framework, with contributions being threefold: (i) uncertainty estimation in ML models is positioned as an XAI technique, giving local and model-specific explanations; (ii) classification with rejection is used to reduce misclassifications by bringing a human expert in the loop for uncertain observations; (iii) the framework is applied to a case study on neural networks in educational data mining subject to distribution shifts. Uncertainty as XAI improves the model's trustworthiness in downstream decision-making tasks, giving rise to more actionable and robust machine learning systems in operations research.  ( 2 min )
    Estimating the history of a random recursive tree
    arXiv:2403.09755v1 Announce Type: new Abstract: This paper studies the problem of estimating the order of arrival of the vertices in a random recursive tree. Specifically, we study two fundamental models: the uniform attachment model and the linear preferential attachment model. We propose an order estimator based on the Jordan centrality measure and define a family of risk measures to quantify the quality of the ordering procedure. Moreover, we establish a minimax lower bound for this problem, and prove that the proposed estimator is nearly optimal. Finally, we numerically demonstrate that the proposed estimator outperforms degree-based and spectral ordering procedures.  ( 2 min )
    Learning Markov State Abstractions for Deep Reinforcement Learning
    arXiv:2106.04379v4 Announce Type: replace-cross Abstract: A fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features -- often matching or exceeding the performance achieved with hand-designed compact state information.  ( 3 min )
    Understanding the Double Descent Phenomenon in Deep Learning
    arXiv:2403.10459v1 Announce Type: cross Abstract: Combining empirical risk minimization with capacity control is a classical strategy in machine learning when trying to control the generalization gap and avoid overfitting, as the model class capacity gets larger. Yet, in modern deep learning practice, very large over-parameterized models (e.g. neural networks) are optimized to fit perfectly the training data and still obtain great generalization performance. Past the interpolation point, increasing model complexity seems to actually lower the test error. In this tutorial, we explain the concept of double descent and its mechanisms. The first section sets the classical statistical learning framework and introduces the double descent phenomenon. By looking at a number of examples, section 2 introduces inductive biases that appear to have a key role in double descent by selecting, among the multiple interpolating solutions, a smooth empirical risk minimizer. Finally, section 3 explores the double descent with two linear models, and gives other points of view from recent related works.  ( 2 min )
    Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers
    arXiv:2305.17297v3 Announce Type: replace-cross Abstract: Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to have access to noiseless training data from a different dataset than the test dataset. Motivated by this, we study supervised denoising and noisy-input regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to real-life data and modern machine learning. First, while most past theoretical work assumes that the data covariance matrix is full-rank and well-conditioned, empirical studies have shown that real-life data is approximately low-rank. Thus, we assume that our data matrices are low-rank. Second, we drop independence assumptions on our data. Third, the rise in computational power and dimensionality of data have made it important to study non-classical regimes of learning. Thus, we work in the non-classical proportional regime, where data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$. For this setting, we derive data-dependent, instance specific expressions for the test error for both denoising and noisy-input regression, and study when overfitting the noise is benign, tempered or catastrophic. We show that the test error exhibits double descent under general distribution shift, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using real-life data, where we match the theoretical predictions with under 1\% MSE error for low-rank data.  ( 3 min )
    Multivariate Gaussian Approximation for Random Forest via Region-based Stabilization
    arXiv:2403.09960v1 Announce Type: cross Abstract: We derive Gaussian approximation bounds for random forest predictions based on a set of training points given by a Poisson process, under fairly mild regularity assumptions on the data generating process. Our approach is based on the key observation that the random forest predictions satisfy a certain geometric property called region-based stabilization. In the process of developing our results for the random forest, we also establish a probabilistic result, which might be of independent interest, on multivariate Gaussian approximation bounds for general functionals of Poisson process that are region-based stabilizing. This general result makes use of the Malliavin-Stein method, and is potentially applicable to various related statistical problems.  ( 2 min )
    How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?
    arXiv:2310.08391v2 Announce Type: replace Abstract: Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.  ( 2 min )
    MARS via LASSO
    arXiv:2111.11694v3 Announce Type: replace-cross Abstract: Multivariate adaptive regression splines (MARS) is a popular method for nonparametric regression introduced by Friedman in 1991. MARS fits simple nonlinear and non-additive functions to regression data. We propose and study a natural lasso variant of the MARS method. Our method is based on least squares estimation over a convex class of functions obtained by considering infinite-dimensional linear combinations of functions in the MARS basis and imposing a variation based complexity constraint. Our estimator can be computed via finite-dimensional convex optimization, although it is defined as a solution to an infinite-dimensional optimization problem. Under a few standard design assumptions, we prove that our estimator achieves a rate of convergence that depends only logarithmically on dimension and thus avoids the usual curse of dimensionality to some extent. We also show that our method is naturally connected to nonparametric estimation techniques based on smoothness constraints. We implement our method with a cross-validation scheme for the selection of the involved tuning parameter and compare it to the usual MARS method in various simulation and real data settings.  ( 2 min )
    Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware Priors
    arXiv:2403.09869v1 Announce Type: new Abstract: Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance -- even when only retraining the final layer of a previously trained non-robust model. Group aware-priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts.  ( 2 min )
    Post-hoc Bias Scoring Is Optimal For Fair Classification
    arXiv:2310.05725v3 Announce Type: replace Abstract: We consider a binary classification problem under group fairness constraints, which can be one of Demographic Parity (DP), Equalized Opportunity (EOp), or Equalized Odds (EO). We propose an explicit characterization of Bayes optimal classifier under the fairness constraints, which turns out to be a simple modification rule of the unconstrained classifier. Namely, we introduce a novel instance-level measure of bias, which we call bias score, and the modification rule is a simple linear rule on top of the finite amount of bias scores.Based on this characterization, we develop a post-hoc approach that allows us to adapt to fairness constraints while maintaining high accuracy. In the case of DP and EOp constraints, the modification rule is thresholding a single bias score, while in the case of EO constraints we are required to fit a linear modification rule with 2 parameters. The method can also be applied for composite group-fairness criteria, such as ones involving several sensitive attributes.  ( 2 min )
    Self-Compatibility: Evaluating Causal Discovery without Ground Truth
    arXiv:2307.09552v2 Announce Type: replace-cross Abstract: As causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection.  ( 2 min )
    Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs
    arXiv:2305.18702v2 Announce Type: replace Abstract: Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems. One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust random samples in the training set such that the residual induced by the approximate PDE solution can maintain a smooth profile when it is being minimized. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs.  ( 3 min )
    A Short Survey on Importance Weighting for Machine Learning
    arXiv:2403.10175v1 Announce Type: cross Abstract: Importance weighting is a fundamental procedure in statistics and machine learning that weights the objective function or probability distribution based on the importance of the instance in some sense. The simplicity and usefulness of the idea has led to many applications of importance weighting. For example, it is known that supervised learning under an assumption about the difference between the training and test distributions, called distribution shift, can guarantee statistically desirable properties through importance weighting by their density ratio. This survey summarizes the broad applications of importance weighting in machine learning and related research.  ( 2 min )
    Structured Evaluation of Synthetic Tabular Data
    arXiv:2403.10424v1 Announce Type: cross Abstract: Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of the many metrics. To address this issue, we propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data. Through various structural decomposition of the objective, this framework allows us to reason for the first time the completeness of any set of metrics, as well as unifies existing metrics, including those that stem from fidelity considerations, downstream application, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.  ( 2 min )
    Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination
    arXiv:2403.10416v1 Announce Type: cross Abstract: We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\mathbb{R}^d$ with corruption rate $\epsilon>0$, our algorithm has sample complexity $(k^2/\epsilon^2)\mathrm{polylog}(d/\epsilon)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(\epsilon)$. Previous efficient algorithms inherently incur error $\Omega(\epsilon \sqrt{\log(1/\epsilon)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.  ( 2 min )
    Reliable uncertainty with cheaper neural network ensembles: a case study in industrial parts classification
    arXiv:2403.10182v1 Announce Type: cross Abstract: In operations research (OR), predictive models often encounter out-of-distribution (OOD) scenarios where the data distribution differs from the training data distribution. In recent years, neural networks (NNs) are gaining traction in OR for their exceptional performance in fields such as image classification. However, NNs tend to make confident yet incorrect predictions when confronted with OOD data. Uncertainty estimation offers a solution to overconfident models, communicating when the output should (not) be trusted. Hence, reliable uncertainty quantification in NNs is crucial in the OR domain. Deep ensembles, composed of multiple independent NNs, have emerged as a promising approach, offering not only strong predictive accuracy but also reliable uncertainty estimation. However, their deployment is challenging due to substantial computational demands. Recent fundamental research has proposed more efficient NN ensembles, namely the snapshot, batch, and multi-input multi-output ensemble. This study is the first to provide a comprehensive comparison of a single NN, a deep ensemble, and the three efficient NN ensembles. In addition, we propose a Diversity Quality metric to quantify the ensembles' performance on the in-distribution and OOD sets in one single metric. The OR case study discusses industrial parts classification to identify and manage spare parts, important for timely maintenance of industrial plants. The results highlight the batch ensemble as a cost-effective and competitive alternative to the deep ensemble. It outperforms the deep ensemble in both uncertainty and accuracy while exhibiting a training time speedup of 7x, a test time speedup of 8x, and 9x memory savings.  ( 3 min )
    A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage
    arXiv:2403.09701v1 Announce Type: cross Abstract: Hybrid Reinforcement Learning (RL), leveraging both online and offline data, has garnered recent interest, yet research on its provable benefits remains sparse. Additionally, many existing hybrid RL algorithms (Song et al., 2023; Nakamoto et al., 2023; Amortila et al., 2024) impose coverage assumptions on the offline dataset, but we show that this is unnecessary. A well-designed online algorithm should "fill in the gaps" in the offline dataset, exploring states and actions that the behavior policy did not explore. Unlike previous approaches that focus on estimating the offline data distribution to guide online exploration (Li et al., 2023b), we show that a natural extension to standard optimistic online algorithms -- warm-starting them by including the offline dataset in the experience replay buffer -- achieves similar provable gains from hybrid data even when the offline dataset does not have single-policy concentrability. We accomplish this by partitioning the state-action space into two, bounding the regret on each partition through an offline and an online complexity measure, and showing that the regret of this hybrid RL algorithm can be characterized by the best partition -- despite the algorithm not knowing the partition itself. As an example, we propose DISC-GOLF, a modification of an existing optimistic online algorithm with general function approximation called GOLF used in Jin et al. (2021); Xie et al. (2022a), and show that it demonstrates provable gains over both online-only and offline-only reinforcement learning, with competitive bounds when specialized to the tabular, linear and block MDP cases. Numerical simulations further validate our theory that hybrid data facilitates more efficient exploration, supporting the potential of hybrid RL in various scenarios.  ( 3 min )
    Conformal Predictions for Probabilistically Robust Scalable Machine Learning Classification
    arXiv:2403.10368v1 Announce Type: new Abstract: Conformal predictions make it possible to define reliable and robust learning algorithms. But they are essentially a method for evaluating whether an algorithm is good enough to be used in practice. To define a reliable learning framework for classification from the very beginning of its design, the concept of scalable classifier was introduced to generalize the concept of classical classifier by linking it to statistical order theory and probabilistic learning theory. In this paper, we analyze the similarities between scalable classifiers and conformal predictions by introducing a new definition of a score function and defining a special set of input variables, the conformal safety set, which can identify patterns in the input space that satisfy the error coverage guarantee, i.e., that the probability of observing the wrong (possibly unsafe) label for points belonging to this set is bounded by a predefined $\varepsilon$ error level. We demonstrate the practical implications of this framework through an application in cybersecurity for identifying DNS tunneling attacks. Our work contributes to the development of probabilistically robust and reliable machine learning models.  ( 2 min )
    Rough Transformers for Continuous and Efficient Time-Series Modelling
    arXiv:2403.10288v1 Announce Type: new Abstract: Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to model irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate lengths and greater. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model which operates on continuous-time representations of input sequences and incurs significantly reduced computational costs, critical for addressing long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in input data, while remaining robust to changes in the sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.  ( 2 min )
    Interpretable Machine Learning for Survival Analysis
    arXiv:2403.10250v1 Announce Type: new Abstract: With the spread and rapid advancement of black box machine learning models, the field of interpretable machine learning (IML) or explainable artificial intelligence (XAI) has become increasingly important over the last decade. This is particularly relevant for survival analysis, where the adoption of IML techniques promotes transparency, accountability and fairness in sensitive areas, such as clinical decision making processes, the development of targeted therapies, interventions or in other medical or healthcare related contexts. More specifically, explainability can uncover a survival model's potential biases and limitations and provide more mathematically sound ways to understand how and which features are influential for prediction or constitute risk factors. However, the lack of readily available IML methods may have deterred medical practitioners and policy makers in public health from leveraging the full potential of machine learning for predicting time-to-event data. We present a comprehensive review of the limited existing amount of work on IML methods for survival analysis within the context of the general IML taxonomy. In addition, we formally detail how commonly used IML methods, such as such as individual conditional expectation (ICE), partial dependence plots (PDP), accumulated local effects (ALE), different feature importance measures or Friedman's H-interaction statistics can be adapted to survival outcomes. An application of several IML methods to real data on data on under-5 year mortality of Ghanaian children from the Demographic and Health Surveys (DHS) Program serves as a tutorial or guide for researchers, on how to utilize the techniques in practice to facilitate understanding of model decisions or predictions.  ( 3 min )
    A Structure-Preserving Kernel Method for Learning Hamiltonian Systems
    arXiv:2403.10070v1 Announce Type: new Abstract: A structure-preserving kernel ridge regression method is presented that allows the recovery of potentially high-dimensional and nonlinear Hamiltonian functions out of datasets made of noisy observations of Hamiltonian vector fields. The method proposes a closed-form solution that yields excellent numerical performances that surpass other techniques proposed in the literature in this setup. From the methodological point of view, the paper extends kernel regression methods to problems in which loss functions involving linear functions of gradients are required and, in particular, a differential reproducing property and a Representer Theorem are proved in this context. The relation between the structure-preserving kernel estimator and the Gaussian posterior mean estimator is analyzed. A full error analysis is conducted that provides convergence rates using fixed and adaptive regularization parameters. The good performance of the proposed estimator is illustrated with various numerical experiments.  ( 2 min )

  • Open

    An OpenAI mystery: YouTube videos, Google throttling, and AI training data.
    submitted by /u/thisisinsider [link] [comments]
    Why do paraphrasing ai tools cost so much per month?
    I have been looking at a lot of ai paraphrasing and rewriting tools for the past couple months, and I have noticed that pretty much all of them limit you to 5k–6k words per month for a $12–$20 per month subscription. 5k words should only cost $1-$2 a month (based on the chatGPT api price). Why do all of these services charge way more. I would think that they would be super competitive on price and that there is something else that warrants them charging 10x their cost, but I'm not sure. Are they just overcharging or does it actually cost that much? submitted by /u/littlejack59 [link] [comments]
    AI dubbing is getting scary good. This is from "PipioHQ". They translate videos while retaining the sound/intonation of the original voice, & they match lip movements to the new language!
    submitted by /u/Armand_Roulinn [link] [comments]
    Biotech with AGI
    Hello, I'm a 20-year-old young man and I'm studying Political Science. I've dabbled in cybersecurity and python before, but now I want to improve myself in artificial intelligence. ​ What kind of advice can you give about this roadmap? Math and Python-> Machine Learning -> Deep Learning to AI Engineering In the future, I want to start an AI-based biotechnology entrepreneurship. I know it is very, very difficult. What kind of things should I do to improve myself financially in this field? submitted by /u/BilgeYamtar [link] [comments]
    Daily AI News Recap (03/18/2024)
    xAI has released the weights and architecture of its 314 billion parameter LLM, Grok-1. It is trained from scratch by xAI using a custom training stack on top of JAX and Rust in October 2023. OpenAI accidentally leaked GPT-4.5 Turbo by publishing a blog post that was indexed by Bing and DuckDuckGo before it was quickly taken down. The post was an announcement for GPT-4.5 Turbo suggesting that it will be faster, more accurate, and have a larger knowledge base than its predecessor. Meta has introduced two 24K GPU clusters that it is using to train it's Llama 3 LLM. The clusters in total have 49,000 H100 GPUs each of which costs about $40,000. Llama 3 is expected in July. Apple is in talks to build Google's Gemini artificial intelligence engine into the iPhone. A deal between the firms could help Google expand the use of its AI services to more than 2 billion active Apple devices. submitted by /u/Used-Bat3441 [link] [comments]
    Apple Is in Talks to Let Google’s Gemini Power iPhone Generative AI Features
    submitted by /u/clonefitreal [link] [comments]
    One-Minute Daily AI News 3/17/2024
    Elon Musk’s xAI has open-sourced the base code of Grok AI model, but without any training code. The company described it as the “314 billion parameter Mixture-of-Expert model” on GitHub.[1] Apple Announces MM1: A Family of Multimodal LLMs Up To 30B Parameters that are SoTA in Pre-Training Metrics and Perform Competitively after Fine-Tuning.[2] Microsoft tells European regulators Google has an edge in generative AI.[3] Nvidia’s Jensen Huang, Fed’s Powell may rock markets this week.[4] Sources: [1] https://techcrunch.com/2024/03/17/xai-open-sources-base-model-of-grok-but-without-any-training-code/ [2] https://www.marktechpost.com/2024/03/16/apple-announces-mm1-a-family-of-multimodal-llms-up-to-30b-parameters-that-are-sota-in-pre-training-metrics-and-perform-competitively-after-fine-tuning/ [3] https://www.foxbusiness.com/technology/microsoft-tells-european-regulators-google-has-edge-generative-ai [4] https://www.thestreet.com/investing/nvidia-jensen-huang-fed-powell-rock-markets-this-week submitted by /u/Excellent-Target-847 [link] [comments]
    What is the best platform for producing graphical representations of text?
    E.g. if I wanted to create a logo with specific text, or an icon, or just some fun integrated but accurate text in a graphic? Must be accurate spelling of course. submitted by /u/btxtsf [link] [comments]
    Do you think the military will revisit the BigDog concept?
    With the recent AI developments and push for humanoid robots, maybe the bigdog concept will make a comeback. The original bigdogs were built in the late 2000's with a noisy gas powered engine. The spot robot is much quieter with newer technology. Maybe they can upscale that platform. On a side note, how does a bigdog like robot compare with using an animal like a horse or mule? Has there been very much horse/mule use in the military for the same purpose that the bigdog robot was intended for? submitted by /u/Steve77307 [link] [comments]
  • Open

    RL in trafic
    Hi I want to add RL in trafic light , so I already try neat but after 2600 gen I am not satisfied with the results , is there any other algorithm with I can use to create that agent or any one try to add RL to trafic light , and any idea what to add to create multiple agent environment , thanks submitted by /u/Rude_Personne [link] [comments]
    Convergence to a single action
    Bit of a broad question, but in implementing RL it seems common for my agent to converge to a solution where it always picks just 1 action. Its happened in both a Deep Q-Learning agent for Pong and a Deep Cross entropy agent in a Multi-Physics model Im working on for my PhD. Anyone know of any common reasons why this may happen? submitted by /u/Chewden_ [link] [comments]
    Surgery quota MARL scheduler development
    Hi, I'm from a university student RL team and we are going to make a new multi-agent cooperative-adversarial environment for learning optimal strategies for short-term scheduling of incoming requests (surgical procedures in our case). Please take a look at the current description of the environment. What is missing? Maybe there is something superfluous? Any experience you share with us would be appreciated! Here is GitHub repo: https://github.com/artemisak/Surgery-quota-scheduler/tree/main submitted by /u/Pythonic-af [link] [comments]
    Actor-Critic Unable to Overfit?
    I'm currently working on a project where I'm trying to optimize an actor-critic model to maximize a high score based on some input data. However, I'm facing an unexpected challenge - I can't seem to get my model to overfit, despite my intentions to do so for experimental purposes. Here's a brief overview of my model setup: My actor loss is the policy gradient loss of -log(probability)*advantage of the decision it makes. Advantage = (Δscore from prior frame) - (critic estimated Δscore from prior frame) + (gamma*critic estimated Δscore for next frame). Critic is updated based on the MSE between (Δscore from prior frame) and (critic estimated Δscore from prior frame) When plotting my critic's estimated Δscore from the first frame against the actual Δscore, I observe zero correlation, which is puzzling I'm training the model on a very small dataset - just 1 or 2 episodes with around 30 events each. My expectation was that the model would easily memorize the actions associated with these events and completely overfit to the data. I've checked the gradients before and after the update steps and everything looks populated, so I think it's 'learning'. I can't seem to see any issues but am at a bit of a loss as to why I can't overfit. Do any practitioners see this problem / know an obvious fix or check for thing I'm missing here? I've tried using both an LSTM and Transformer model for the Actor/Critic, with 2 layers with a size of ~200 each, 0.25 dropout, 1e-5 learning rate. submitted by /u/Rhyno_Time [link] [comments]
    Help to explain NEAT output
    Can someone please explain how to derive my generated output from the basic tutorial of NEAT? If my input is (1.0, 1.0), what are the weights and biases I should use to reproduce 0.06994251237198218? I'm familiar with the sigmoid function, but I can't reproduce any of the output numbers. ​ Best genome: Key: 6295 Fitness: 3.923495216838884 Nodes: 0 DefaultNodeGene(key=0, bias=-0.3608400593343819, response=1.0, activation=sigmoid, aggregation=sum) 53 DefaultNodeGene(key=53, bias=-0.5517336026220021, response=1.0, activation=sigmoid, aggregation=sum) 838 DefaultNodeGene(key=838, bias=-0.5789866669875736, response=1.0, activation=sigmoid, aggregation=sum) 931 DefaultNodeGene(key=931, bias=-0.8513574951787184, response=1.0, activation=sigmoid, aggregation=sum) 1219 DefaultNodeGene(key=1219, bias=-0.5135172011791014, response=1.0, activation=sigmoid, aggregation=sum) Connections: DefaultConnectionGene(key=(-2, 0), weight=0.6428472619552054, enabled=True) DefaultConnectionGene(key=(-2, 53), weight=-1.6935071067952197, enabled=False) DefaultConnectionGene(key=(-2, 1219), weight=0.8991060064585895, enabled=True) DefaultConnectionGene(key=(-1, 0), weight=-0.8006221066141777, enabled=True) DefaultConnectionGene(key=(-1, 53), weight=1.2070540852248426, enabled=True) DefaultConnectionGene(key=(53, 0), weight=2.3774714023438124, enabled=True) DefaultConnectionGene(key=(1219, 53), weight=-2.5095137599707673, enabled=True) Output: input (0.0, 0.0), expected output (0.0,), got [0.18185023040647358] input (0.0, 1.0), expected output (1.0,), got [0.8037737011422771] input (1.0, 0.0), expected output (1.0,), got [0.9937902124091639] input (1.0, 1.0), expected output (0.0,), got [0.06994251237198218] ​ submitted by /u/williego [link] [comments]
    ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models
    Paper: https://arxiv.org/abs/2403.09583 Website: https://explorllm.github.io/ Abstract: In image-based robot manipulation tasks with large observation and action spaces, reinforcement learning struggles with low sample efficiency, slow training speed, and uncertain convergence. As an alternative, large pre-trained foundation models have shown promise in robotic manipulation, particularly in zero-shot and few-shot applications. However, using these models directly is unreliable due to limited reasoning capabilities and challenges in understanding physical and spatial contexts. This paper introduces ExploRLLM, a novel approach that leverages the inductive bias of foundation models (e.g. Large Language Models) to guide exploration in reinforcement learning. We also exploit these foundation models to reformulate the action and observation spaces to enhance the training efficiency in reinforcement learning. Our experiments demonstrate that guided exploration enables much quicker convergence than training without it. Additionally, we validate that ExploRLLM outperforms vanilla foundation model baselines and that the policy trained in simulation can be applied in real-world settings without additional training. submitted by /u/Tbd_Sparks [link] [comments]
    DQN Implementation
    I want an example of DQN implemented in a continuous observation space environment, preferably in SB3. submitted by /u/Sadboi1010 [link] [comments]
    Dissecting Deep RL with High Update Ratios: Combatting Value Overestimation and Divergence
    Paper: https://arxiv.org/abs/2403.05996 Abstract: We show that deep reinforcement learning can maintain its ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples. Under such large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we dissect the phenomena underlying the primacy bias. We inspect the early stages of training that ought to cause the failure to learn and find that a fundamental challenge is a long-standing acquaintance: value overestimation. Overinflated Q-values are found not only on out-of-distribution but also in-distribution data and can be traced to unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting on early data. submitted by /u/SunsetOneSix [link] [comments]
    Goal-Conditioned RL Implementations Similar to CleanRL
    Hello everyone! I'm currently working on a goal-conditioned robotics project (with custom envs) and facing some challenges. I've previously used Stable Baselines3 (SB3), which I found to be quite comprehensive but a bit complex for my current needs. I hope to create a custom policy (modular) and apply transfer learning to it for a different task, which seems a lot more complicated with SB3. So, I'm particularly interested in an implementation similar to CleanRL due to its simplicity and clarity but with support for goal-conditioned environments. Can you recommend any resources, repositories, or frameworks that might align with these requirements? Here's a brief overview of what I'm looking for: A framework or library that is simple and transparent like CleanRL. Support for goal-conditioned environments (compatible with OpenAI Gym's GoalEnv). Flexibility to customize policy networks, including setting different learning rates for different parts of the network. Thank you submitted by /u/ncbdrck [link] [comments]
    Implementation of NEAT Algorithm using Open AI gym to super mario bros
    I am working on a project to implement a NEAT algorithm so that it learns to play Super Mario Bros(In Open AI Gym) I was wondering if there were any resources or someone who already has done this so I get a better understanding on how to do this any help is appreciated Thank You!! submitted by /u/MinuteNo5493 [link] [comments]
  • Open

    “We Created a Processor for the Generative AI Era,” NVIDIA CEO Says
    Generative AI promises to revolutionize every industry it touches — all that’s been needed is the technology to meet the challenge. NVIDIA CEO Jensen Huang on Monday introduced that technology—the company’s new Blackwell computing platform—as he outlined the major advances that increased computing power can deliver for everything from software to services, robotics to medical Read Article  ( 11 min )
    All Aboard: NVIDIA Scores 23 World Records for Route Optimization
    With nearly two dozen world records to its name, NVIDIA cuOpt now holds the top spot for 100% of the largest routing benchmarks in the last three years. And this means the route optimization engine allows industries to hop on board for all kinds of cost-saving efficiencies. Kawasaki Heavy Industries and SyncTwin are among the Read Article  ( 6 min )
    All Eyes on AI: Automotive Tech on Full Display at GTC 2024
    All eyes across the auto industry are on GTC — the global AI conference running in San Jose, Calif., and online through Thursday, March 21 — as the world’s top automakers and tech leaders converge to showcase the latest models, demo new technologies and dive into the remarkable innovations reshaping the sector. Attendees will experience Read Article  ( 6 min )
    Generative AI Developers Harness NVIDIA Technologies to Transform In-Vehicle Experiences
    Cars of the future will be more than just modes of transportation; they’ll be intelligent companions, seamlessly blending technology and comfort to enhance driving experiences, and built for safety, inside and out. NVIDIA GTC, running this week at the San Jose Convention Center, will spotlight the groundbreaking work NVIDIA and its partners are doing to Read Article  ( 8 min )
    NVIDIA Unveils Digital Blueprint for Building Next-Gen Data Centers
    Designing, simulating and bringing up modern data centers is incredibly complex, involving multiple considerations like performance, energy efficiency and scalability. It also requires bringing together a team of highly skilled engineers across compute and network design, computer-aided design (CAD) modeling, and mechanical, electrical and thermal design. NVIDIA builds the world’s most advanced AI supercomputers and Read Article  ( 6 min )
    New NVIDIA Storage Partner Validation Program Streamlines Enterprise AI Deployments
    A sharp increase in generative AI deployments is driving business innovation for enterprises across industries. But it’s also posing significant challenges for their IT teams, as slowdowns from long and complex infrastructure deployment cycles prevent them from quickly spinning up AI workloads using their own data. To help overcome these barriers, NVIDIA has introduced a Read Article  ( 6 min )
    From Atoms to Supercomputers: NVIDIA, Partners Scale Quantum Computing
    The latest advances in quantum computing include investigating molecules, deploying giant supercomputers and building the quantum workforce with a new academic program. Researchers in Canada and the U.S. used a large language model to simplify quantum simulations that help scientists explore molecules. “This new quantum algorithm opens the avenue to a new way of combining Read Article  ( 7 min )
    NVIDIA Maxine Developer Platform to Transform $10 Billion Video Conferencing Industry
    Video conferencing has allowed many to be productive from anywhere. Now NVIDIA is boosting the productivity of the developers of video conferencing, call center and streaming applications within the $10 billion industry by allowing them to easily integrate AI into their workflows. The new release of the Maxine AI Developer Platform transforms the creation of Read Article  ( 7 min )
    NVIDIA Edify Unlocks 3D Generative AI, New Image Controls for Visual Content Providers
    NVIDIA Edify, a multimodal architecture for visual generative AI, is entering a new dimension. 3D asset generation is among the latest capabilities Edify offers developers and visual content providers, who will also be able to exert more creative control over AI image generation. Multimedia content and data provider Shutterstock is rolling out early access to Read Article  ( 8 min )
    Make It So: Software Speeds Journey to Post-Quantum Cryptography
    The journey to the future of secure communications is about to jump to warp drive. NVIDIA cuPQC brings accelerated computing to developers working on cryptography for the age of quantum computing. The cuPQC library harnesses the parallelism of GPUs for their most demanding security algorithms. Refactoring Security for the Quantum Era   Researchers have known for Read Article  ( 5 min )
    NVIDIA Brings Generative AI for Digital Humans, New RTX Technologies and More DLSS 3.5 Games to GDC
    Generative AI is capable of creating more realistic verbal and facial expressions for digital humans than ever before. This week at GDC 2024, NVIDIA announced that leading AI application developers across a wide range of industries are using NVIDIA digital human technologies to create lifelike avatars for commercial applications and dynamic game characters. NVIDIA enables Read Article  ( 7 min )
    Johnson & Johnson MedTech Works With NVIDIA to Broaden AI’s Reach in Surgery
    AI — already used to connect, analyze and offer predictions based on operating room data — will be critical to the future of surgery, boosting operating room efficiency and clinical decision-making. That’s why NVIDIA is working with Johnson & Johnson MedTech to test new AI capabilities for the company’s connected digital ecosystem for surgery. It Read Article  ( 7 min )
    BNY Mellon, First Global Bank to Deploy AI Supercomputer Powered by NVIDIA DGX SuperPOD With DGX H100
    Moving fast to accelerate its AI journey, BNY Mellon, a global financial services company celebrating its 240th anniversary, revealed Monday that it has become the first major bank to deploy an NVIDIA DGX SuperPOD with DGX H100 systems. Thanks to the strong collaborative relationship between NVIDIA Professional Services and BNY Mellon, the team was able Read Article  ( 6 min )
    NVIDIA Isaac Taps Generative AI for Manufacturing and Logistics Applications
    The NVIDIA Isaac robotics platform is tapping into the latest generative AI and advanced simulation technologies to accelerate AI-enabled robotics. At GTC today, NVIDIA announced Isaac Manipulator and Isaac Perceptor — a collection of foundation models, robotics tools and GPU-accelerated libraries. On stage before a crowd of 10,000-plus, NVIDIA founder and CEO Jensen Huang demonstrated Read Article  ( 8 min )
    NVIDIA Omniverse Expands Worlds Using Apple Vision Pro
    NVIDIA is bringing OpenUSD-based Omniverse enterprise digital twins to the Apple Vision Pro. Announced today at NVIDIA GTC, a new software framework built on Omniverse Cloud APIs, or application programming interfaces, lets developers easily send their Universal Scene Description (OpenUSD) industrial scenes from their content creation applications to the NVIDIA Graphics Delivery Network (GDN), a Read Article  ( 6 min )
    NVIDIA and Siemens Bring Immersive Visualization and Generative AI to Industrial Design and Manufacturing
    Generative AI and digital twins are changing the way companies in multiple industries design, manufacture and operate their products. Siemens, a leading technology company for automation, digitalization and sustainability, announced today at NVIDIA GTC that it is expanding its partnership with NVIDIA by adopting new NVIDIA Omniverse Cloud APIs, or application programming interfaces, with its Read Article  ( 6 min )
    NVIDIA Supercharges Autonomous System Development with Omniverse Cloud APIs
    While simulation is critical for training, testing and deploying autonomy,  achieving real-world fidelity is incredibly challenging. It requires accurate modeling of the physics and behavior of an autonomous system’s sensors and surroundings. Designed to address this challenge by delivering large-scale, high-fidelity sensor simulation, Omniverse Cloud APIs, announced today at NVIDIA GTC, are poised to accelerate Read Article  ( 6 min )
    Staying in Sync: NVIDIA Combines Digital Twins With Real-Time AI for Industrial Automation
    Real-time AI is helping with the heavy lifting in manufacturing, factory logistics and robotics. In such industries — often involving bulky products, expensive equipment, cobot environments and logistically complex facilities — a simulation-first approach is ushering in the next phase of automation. NVIDIA founder and CEO Jensen Huang today demonstrated in his GTC keynote how Read Article  ( 6 min )
  • Open

    Research in ML [R]
    Hello! I'm reaching out to researchers in the NLP/XAI/RecSys domains for potential collaboration on new research initiatives. I have a couple of research ideas in mind and would welcome collaboration. Or even If you are a say PhD currently working on an idea and looking for collaboration, please let me know. ( Can help you with ideation / implementation/ making the paper camera ready ) I am a masters grad in DS, aiming to publish at least a couple more papers before beginning my PhD next year.. ( Have already published two on Interpretable ML) PS : Haven't worked that much on Vision, but yeah I do have an idea of achitecture, bottlenecks and pretraining objectives used in Image transformers and Diffusion models. Feel free to dm if you work on these domains too. submitted by /u/Ok-Cheesecake-8881 [link] [comments]
    [D] Does DARTS automatically inverse transform variables?
    Hi, I'm currently studying DARTS for applying it at work and I'd like to know if I apply scale-like transformers in a model it'll automatically inverse transforme before giving me the output or if I'll need to explicitly ask for it. Thanks. submitted by /u/teechlng [link] [comments]
    [P] Open job opportunities in Machine Learning related to sports!
    Hey guys, I'm constantly checking for jobs in the sports analytics industry. Yesterday I added a bunch of companies to the job board and hence, several machine learning positions appeared that I wanted to share. Sports Modeling Engineer (Rust) - San Francisco, Remote Machine Learning Engineer - San Francsicso, Remote Software Engineer, AI - New York, Remote Machine Learning Engineer (ESPN Bet) - Toronto, Remote Machine Learning Software Engineering Team Lead - Greece There are multiple more jobs related to machine learning and hundreds of others jobs in analytics and software. I've created also a reddit community where I post recurrently the openings if that's easier to check for you. Disclaimer: I run the job board. I hope this helps someone! submitted by /u/fark13 [link] [comments]
    [D] Same param count for GPT4 from Nvidia GTC24 as the leak we got from Semianalysis
    A report by Semianalysis way back said that GPT-4 was a 1.8T parameter MoE model with 16 experts, each with 111B parameters. Here is a screenshot from the GTC conference with the same numbers. https://preview.redd.it/vyzfx2sel5pc1.png?width=1764&format=png&auto=webp&s=dfce1d55c84dbc3c51e69f376161c47958f9cf70 submitted by /u/takuonline [link] [comments]
    Stability AI releases SV3D [N]
    https://stability.ai/news/introducing-stable-video-3d SV3D takes a single object image as input and outputs novel multi-views of that object. We can then use those novel-views and SV3D to generate 3D meshes. submitted by /u/we_are_mammals [link] [comments]
    [D] NTNU university
    Hy all! I would ask you if make a double degree in computer science at NTNU form Politecnico of Milan is a good idea? In particular I would ask you if NTNU is good for research in machine learning applied in finance. I would work in a bank like UBS after that submitted by /u/AshamedRecover1786 [link] [comments]
    [Project] Introducing ocrtoolkit: Your Go-To OCR Package
    Absolutely, let's integrate information directly from the GitHub repository of Hey Reddit community! I'm excited to present ocrtoolkit, a powerful OCR package designed to simplify your workflow and elevate your OCR tasks! What My Project Does If you've ever found yourself grappling with complex boilerplate code while tackling OCR-related challenges, you're in luck. ocrtoolkit streamlines the entire OCR process, offering intuitive wrappers for tasks such as image file handling, model execution, result parsing, and much more. Let's delve into the core features: Datasets Module: Need to effortlessly load image files or directories? Look no further than the ocrtoolkit.datasets module. Models Module: Seamlessly integrate with popular OCR frameworks like paddleOCR, ultralytics, and doctr …
    Memory usage of a GPU in training [Discussion]
    If one would plot a graph for the memory usage of GPU during training, what would it look like? submitted by /u/pusvvagon [link] [comments]
    [R] Integrating AI into Call Center without Twilio's Media Stream Resource
    Hello, fellow developers and tech enthusiasts! I'm embarking on a project to build an AI-powered call center. The goal is to integrate ChatGPT for conversational AI, along with text-to-speech (TTS) and speech-to-text (STT) capabilities, to create a seamless communication experience. Typically, a solution like Twilio's Media Stream Resource would be a go-to for such a task, as it allows for easy listening to and interaction with voice streams. However, due to certain constraints, I'm unable to use Twilio for this project. Instead, I have to work with other IP-telephony services like Sipuni or OnlinePBX. The challenge I'm facing is that neither of these services appears to offer functionality similar to Twilio's Media Stream Resource, at least based on their available documentation. This puts a hurdle in the way of connecting to the SIP stream effectively for real-time STT and TTS. Has anyone here faced a similar challenge or worked on a project with similar requirements? I'm looking for insights, advice, or guidance on how to connect to the SIP stream of IP-telephony services that don't explicitly offer functionality like Twilio's. Any pointers on libraries, tools, or approaches that could help bridge this gap would be incredibly appreciated. If you've navigated these waters before or have any thoughts on potential solutions, I'd be grateful to hear from you. Thank you in advance for your time and help! submitted by /u/Outrageous-Concert45 [link] [comments]
    [D] Diffusion model blows up with self conditioning
    I'm training lucidrain's 1d DDPM for a genomics dataset. If I use self conditioning (where the model gets to see a previous sample from itself) then my loss blows up to inf after a few hundred batches. Without self conditioning it seems to behave fine. Curious if anyone else has seen this? The Analog Bits paper introduced self conditioning and claims it is helpful. Obviously I could mess around with reducing the learning rate but not sure how much that would help given the blow up is a bit later into training. submitted by /u/daking999 [link] [comments]
    [D] IJCAI'24 Rebuttal Discussion
    Hi everyone, With the imminent release of the reviews, I'm initiating this thread for sharing thoughts, questions, and suggestions regarding IJCAI submissions. Wishing everyone the best of luck! submitted by /u/Acceptable_Pop1461 [link] [comments]
    [P] FOSS MLOps tool that simplifies handing off models between ML engineers and app developers
    Hi all - We just open sourced our tool for handing models off between our ml engineers/data scientists and our app dev team. It solves a ton of problems that we experienced, specifically having to maintain two different pipelines and having to constantly hunt down model assets. Curious to know what you think, you can check it out here: https://kitops.ml submitted by /u/iamjessew [link] [comments]
    [D] What development methodology does your place follow?
    Do they follow Agile? Does it work well with ML based development? Particularly, how do you deal with the uncertainty in producing the deliverables for an ML project? (uncertainty in reaching the desired performance within the set time) submitted by /u/notEVOLVED [link] [comments]
    [D] Deep Regression problem on facial landmarks
    Hello to all, I'm currently in the process of researching and trying to implement a model that takes in input the facial landmarkers of mediapipe (represented as a point cloud) and what i'm trying to obtain are values for the slider of the open-source tool makehuman (http://www.makehumancommunity.org/) such as nose width and ect... I've built a dataset of 10000 avatar faces from which i've taken the facial landmarks and the value of sliders used for creating them as labels. I'm already looking at something like PointNet++ as implemented in torch_geometric tutorials (https://pytorch-geometric.readthedocs.io/en/latest/tutorial/point_cloud.html) by modifing the last layer, the loss function and the metrics in order to obtain a regressor. I've also tried to transform the task in a classification task by transforming the sliders values in 20 categories of value from -1 to 1 by 0.1 in 0.1 but both implementation return sub-optimal results with mean accuracy for the classification task of 0.0015 and mean mse of 0.06 for the regression task. Any advise on architecture to use or another way to interpret the input data are welcome! submitted by /u/TERMINERO [link] [comments]
    [Discussion] Data formatting from PDF to QnA using LLM
    Data preparation is hard, this is something that I observed when I wanted to fine-tune a model based on data present in a bunch of pdfs, I needed to convert the content in the pdf to a QnA dataset, but obviously I won't do it manually. I came across this YouTube video (youtube.com/watch?v=fYyZiRi6yNE) this person has created a script here: https://github.com/Aemon-Algiz/DatesetExtraction/blob/main/BookParse.py This script converts the pdfs into text and then uses an LLM to make a question and answer pair from that text. Does everyone use this technique to structure the data, or is there a more efficient way to do this? Also, as I'm learning and getting more into LLMs I'm able to observe the challenges, but I would also like to know what other challenges are on the way and faced mostly. submitted by /u/Medium_Alternative50 [link] [comments]
    [D] papers that only evaluate on cifar10
    Hi everyone! While reviewing for NeurIPS 2024, one of the things I keep noticing is that a lot of papers only evaluate on datasets of very small size, like Cifar-10. his feels weird to me: I consider Cifar10 to be a toy-dataset and testbed for my methods, not something I'd use to show that my method actually works/is relevant in practice. So my first intuition is always "this approach probably does not scale to larger datasets". I mean, ImageNet is 12 years old now, and I've personally been giving results on imagenet for my papers since ~8 years. most computer vision applications I know require larger resolutions than just 32x32. it's also my impression that almost all of the "good" papers I read have results on larger scale data. But given how often I encounter this situation, I have to wonder: am I just working in a very privileged environment, or are the manuscript-authors just lazy? How much faith do you have in papers that only evaluate on MNIST and CIFAR10? submitted by /u/audiencevote [link] [comments]
    [D] Theoretical Paper about Transformers
    I'm going to start a study group focused on large language models. The participants are PhD students in Computer Science with a math background. I would like to first study some theoretical properties of transformers (or attention). Maybe some of the students still do not know exactly how a transformer is formulated, so I'll also need to discuss that. Do you have any suggestions of papers with an theoretical analysis of transformers (attention)? The most popular paper for attention is "Attention is all you need", but they only present the architecture and run experiments. submitted by /u/Massive_Horror9038 [link] [comments]
    In 2024 which library is best for time series forecasting and anomaly detection? [D]
    I am working on a project where I am dealing with identifying anomalies in a time series data. I came across Facebook prophet, but sadly it is no longer maintained since 2023. They suggested neuralprophet, nixtla as alternative. When looking for alternatives I came across Kats from Facebook which has prophet support built-in. Which tool/lib experienced members here would recommend, for building a production level system for anomaly detection on high volume of timeseries data? submitted by /u/ThakkidiMundan [link] [comments]
    [R] Arrows of Time for Large Language Models
    Paper: https://arxiv.org/abs/2401.17505 Abstract: We study the probabilistic modeling performed by Autoregressive Large Language Models through the angle of time directionality. We empirically find a time asymmetry exhibited by such models in their ability to model natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results. submitted by /u/SunsetOneSix [link] [comments]
    [D] When your use of AI for summary didn't come out right. A published Elsevier research paper
    submitted by /u/vvkuka [link] [comments]
    [P] Hugging Face Demo App: Explainable Generative AI for Tabular Data
    Hello folks, I invite you to try the hugging face demo app of our work on explainable and generative AI for tabular data. https://huggingface.co/spaces/CaglarAytekin/LEURN Briefly, the app takes a training table including features and target, lets you select what to predict and select categorical columns. Then you can train the neural network with selected hyperparameters. Once the training is finished, you can upload a test data (excel or csv with only features, no target), and select a row to run the neural network on and explain the decisions. You can also seamlessly generate new data based on what your neural network learned from the training. Clear usage instructions can be found from the above app. Related research paper: https://arxiv.org/abs/2303.14937 Open-source code: https://github.com/CaglarAytekin/LEURN/ contact: caglar@deepcause.ai ​ submitted by /u/MLC_Money [link] [comments]
    [D] xAI’s Qdrant…Why?
    Maybe I just haven't put the time in to understanding it, but I'm struggling to understand how Qdrant is any better than OpenSearch/ElasticSearch? OS/ES both use HNSW, and they both use the same KNN oss implementation that is extremely performant. What does Qdrant have "out of the box" these existing — and widely adopted — options don't have? submitted by /u/titani0us [link] [comments]
    [D] How to prepare for a META Research Engineer Interview
    I have a META Research Engineer Interview coming up in a week. The role itself is in Machine learning and Computer Vision but I expect leetcode style questions to be asked in the interview. I was wondering if anyone could give me some suggestions on what to study/revise considering only a week is left. submitted by /u/Tiny-Masterpiece-412 [link] [comments]
  • Open

    Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices
    NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker. NIM, part […]  ( 6 min )
    Fine-tune Code Llama on Amazon SageMaker JumpStart
    Today, we are excited to announce the capability to fine-tune Code Llama models by Meta using Amazon SageMaker JumpStart. The Code Llama family of large language models (LLMs) is a collection of pre-trained and fine-tuned code generation models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned Code Llama models provide better accuracy […]  ( 17 min )
  • Open

    Introducing Garnet – an open-source, next-generation, faster cache-store for accelerating applications and services
    Garnet is a cache-store system that addresses growing demand for data storage to support interactive web applications and services. Offering several advantages over legacy cache-stores, Garnet is now available as an open-source download. The post Introducing Garnet – an open-source, next-generation, faster cache-store for accelerating applications and services appeared first on Microsoft Research.  ( 14 min )
    Exploring how context, culture, and character matter in avatar research
    As avatar use expands in digital spaces, advances are required to better represent all people. Discover how research into the varying perceptions of facial animation glitches in low versus high realism scenarios supports this goal. The post Exploring how context, culture, and character matter in avatar research appeared first on Microsoft Research.  ( 11 min )
  • Open

    GAI, application sprawl, and the universal need for data-centric architecture
    Interview with Dave McComb, President of Semantic Arts Image by Gerd Altmann from Pixabay At the beginning of his career, Dave McComb was having fun as an IT consultant at Andersen Consulting — which later became Accenture — building and implementing enterprise application systems. But eventually he realized that IT departments and the consultants helping… Read More »GAI, application sprawl, and the universal need for data-centric architecture The post GAI, application sprawl, and the universal need for data-centric architecture appeared first on Data Science Central.  ( 19 min )
    How modern businesses leverage technology to transform data
    In the digital age, businesses are inundated with data. This bounty, however, isn’t confined to tidy rows and columns; it spills over as emails, social media interactions, video content, and more—forming a vast ocean of unstructured data. While this data can potentially harbor critical business insights, its nature makes it resistant to traditional analysis. The… Read More »How modern businesses leverage technology to transform data The post How modern businesses leverage technology to transform data appeared first on Data Science Central.  ( 21 min )
    The evolution of large scale data storage solutions
    The data storage journey is as old as computing, tracing a path from the earliest days of room-sized machines to today’s cloud-based ecosystems. Large-scale data storage has evolved dramatically to meet the ever-increasing demands of information technology. Understanding this evolution is not just about acknowledging past innovations but appreciating the complexities and challenges that have… Read More »The evolution of large scale data storage solutions The post The evolution of large scale data storage solutions appeared first on Data Science Central.  ( 21 min )
  • Open

    New algorithm unlocks high-resolution insights for computer vision
    FeatUp, developed by MIT CSAIL researchers, boosts the resolution of any deep network or visual foundation for computer vision systems.  ( 7 min )
    Five MIT faculty members take on Cancer Grand Challenges
    Joining three teams backed by a total of $75 million, MIT researchers will tackle some of cancer’s toughest challenges.  ( 5 min )
  • Open

    MELON: Reconstructing 3D objects from images with unknown poses
    Posted by Mark Matthews, Senior Software Engineer, and Dmitry Lagun, Research Scientist, Google Research A person's prior experience and understanding of the world generally enables them to easily infer what an object looks like in whole, even if only looking at a few 2D pictures of it. Yet the capacity for a computer to reconstruct the shape of an object in 3D given only a few images has remained a difficult algorithmic problem for years. This fundamental computer vision task has applications ranging from the creation of e-commerce 3D models to autonomous vehicle navigation. A key part of the problem is how to determine the exact positions from which images were taken, known as pose inference. If camera poses are known, a range of successful techniques — such as neural radiance f…  ( 92 min )
  • Open

    Arrows of Time for Large Language Models
    Paper: https://arxiv.org/abs/2401.17505 Abstract: We study the probabilistic modeling performed by Autoregressive Large Language Models through the angle of time directionality. We empirically find a time asymmetry exhibited by such models in their ability to model natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results. submitted by /u/SunsetOneSix [link] [comments]
    Doubt about backpropagration algorithm.
    Hi everyone, I´m new in this AI stuff and I'm engineering student too. So, I´m trying to code a multilayer perceptron with two hidden layers and one neuron in output layer to do a regression task without any library, just with linear algebra. I think that I understood the math behind the back propagation, but I have a doubt when i'm coding. ​ During the backward pass I wrote something similar to this( just to know, I´m writing the code in samsung notes before coding): ​ % second hidden layer dE_dY = grad3*w3 dY_dI = g(L2_input,derivative = "True") grad2 = dI_dy*dI_dy dI_dW = L1_output ​ W2 = W2 -(learning_rate)*(-grad2*dI_dW) ​ This part of my code is updating the weigths of second hidden layer. My problem is with the dimensions. because when i calculate grad2, that in my case is the multiplication of 2 collunm vectors of same dimensions,something like (1xn)*(1xn), and as you know, I can't multiply this. ​ Due to the fact that I know the dimension of my weight matrix, e.g MxN, I konw that --> (learning_rate)*(-grad2*dI_dW) dI_dy*dI_dy<--- is just a multiplication, but when I'm coding, it appears that I will need to use the ".*", but I don't know if this is correct. ​ Just to know, I´m programming in MATLAB. Sorry by the long text and if I didn't was clear enough, please let me know (I'm not an native english speaker). ​ submitted by /u/Jotavebechis [link] [comments]

  • Open

    [Project] MANATEE(lm): Market Analysis based on language model architectures
    I have opened-sourced a side-project called MANATEE(lm): Market Analysis based on language model architectures, as a test case for the amazon/chronos-t5-large model from Amazon Web Services (AWS), UC San Diego and Albert-Ludwigs-Universität Freiburg, hosted on Hugging Face, in a Google Colab notebook 🖥 Using Polars for parallelized operations, enabling concurrent execution of data transformations and computations across multiple CPU cores, and Plotly for data plotting and visualization, this notebook may be an entry point for time series analysis on various domains, including retail, energy, finance, healthcare, climate science 🌍 Link : https://colab.research.google.com/drive/1Nq28vk9_l0R-53T18HYfRbeGFJoZ_U8E?usp=sharing Time series forecasting with 80% prediction interval submitted by /u/louisbrulenaudet [link] [comments]
    [P] Free Compute to Train Models (must be open source)
    Hey everyone, I'm excited to share about hivetrain.ai with you all. Here's the gist: We reward participants with tokens in exchange for providing compute power. These tokens have become quite valuable, meaning we have quite a few people providing compute (We have 256 available slots, all full). With this compute power, we're training models. Our entire process is transparent — our code is open source and we're currently offering training for free for as long as we can sustain it. If you're interested, you can join our waitlist and request model training through our website Hivetrain - Train Model - Soon, you'll be able to submit your dataset and possibly some configuration details, and we'll train a model for you. Currently, our limit is models that can fit onto a 16GB NVIDIA card, but we plan to increase this limit as soon as possible. (this will depend on the level of interest from folks) Few important notes. Please note we're still fine-tuning our process for prioritizing projects and granting access to our services - so please bare with us - ideas welcome also) Please note that our website is currently geared towards those providing compute power. However, updates are on the way. As stated above, we're not asking for any money or compensation etc. Ok be well! submitted by /u/artofproduct [link] [comments]
    xAI releases Grok-1 [N]
    We are releasing the base model weights and network architecture of Grok-1, our large language model. Grok-1 is a 314 billion parameter Mixture-of-Experts model trained from scratch by xAI. This is the raw base model checkpoint from the Grok-1 pre-training phase, which concluded in October 2023. This means that the model is not fine-tuned for any specific application, such as dialogue. We are releasing the weights and the architecture under the Apache 2.0 license. To get started with using the model, follow the instructions at https://github.com/xai-org/grok submitted by /u/we_are_mammals [link] [comments]
    [P] Paperlib: An open-source and modern-designed academic paper management tool.
    https://preview.redd.it/0v2r4ftf5yoc1.png?width=3104&format=png&auto=webp&s=99653d16c6d10587f3cb75a8381f53d49778624a Github: https://github.com/Future-Scholars/paperlib Website: https://paperlib.app/en/ If you have any questions: https://discord.com/invite/4unrSRjcM9 I hate Windows Defender. It treats my App as a virus!!! All my source code is open-sourced on GitHub. I just have no funding to buy a code sign! If you have a downloading issue of `virus detect`, please go to your Windows Defender - Virus & threat protection - Allowed threats - Allow my App - redownload! ​ On macOS, you may see something like this: can’t be opened because Apple cannot check it for malicious software The reason is that I have no funding to buy a code sign. It's really expensive. Once I have enough don…
    [P] MOIRAI: New Time-Series Foundation Forecasting Model by Salesforce
    submitted by /u/apaxapax [link] [comments]
    [P] Distillation Tower Overhead Sulfur Prediction
    I'm building a NN to predict a distillation towers overhead sulfur in ppm. So far ive tried modelling the tower parameters (reboilers duty, bottom and top tower temps, reflux rate, tower pressure, oh temp, rundown rate, feed rate, sulfur in feed (weekly sample), sulfur in bottoms (also weekly)) but end up with too small a dataset of ~250 usable points over ten years, where any model with decent R2 and RSME are completely overfit, seemingly even to the validation dataset w/ random kfold. Overhead Sulfur results are typical 2x / day. My two questions are: Can k-fold be overfit if the model is trained on every data point? Or is the design of k-fold such that every fold is validated against at least once with a model which does-not contain training data from that fold? The models I've built have R2 of .98 for training and validation (within 0.01) and RSME of 8-10 which would be acceptable. I haven't found a good method to deploy the model yet to test, but some values I plugged in manually did not yield accurate results. I'm thinking about modelling the tower first sans sulfur, with minute-minute process data to establish a relationship between overhead flow rate and the rest of the tower operating parameters, and attempt to predict separation efficiency and rundown flow, then use that model to feed a second which is used to predict overhead sulfur based on feed sulfur, overhead:feed ratio and separation efficiency (since the tightness of the split will affect sulfur ppm). I'm new to NN and I'm only a board operator. Not an engineer just trying to build something that can make my/other operators lives a bit easier. I'm using JMP 17 pro, how do you folks (process engineers) deploy your models after building them? submitted by /u/thedudear [link] [comments]
    Question about ensmble learning [D]
    I'm reading about ensemble learning and as I understand simply using an ensmble in python allows you to combine the predictions of several models into one unified model. Most of the tutorials I have seen all have this code: from sklearn.linear_model import LinearRegression from sklearn.tree import DecisionTreeRegressor from sklearn.neighbors import KNeighborsRegressor from sklearn.ensemble import VotingRegressor from sklearn.datasets import fetch_california_housing from sklearn.metrics import mean_squared_error # Define models lr = LinearRegression() dt = DecisionTreeRegressor() knn = KNeighborsRegressor() # Create average ensemble avg = VotingRegressor(estimators=[('lr', lr), ('dt', dt), ('knn', knn)]) I am building a film recommendation system and currently have 5 cosine similarity vectors based on different groups of attributes, which I want to combine to get a unified recommendation model. My question is, do ensembles only work with KNN and regression models, or is there a way I can use this with my vectors? Any advice would be greatly appreciated, let me know if you need me to elaborate on something. ​ submitted by /u/wobowizard [link] [comments]
    [D] Medical: anyone trained an open source model on lung/cardiac auscultation yet?
    Seems like it’d be something real useful and relatively easily done, I mean, the datasets exist. submitted by /u/hmmqzaz [link] [comments]
    [R] LLM4Decompile: Decompiling Binary Code with Large Language Models
    submitted by /u/vegax87 [link] [comments]
    [R][N][P][D] GPT4psy.com
    Thousand GPTs to address some psychological issues. submitted by /u/uasername [link] [comments]
    [P] Dealing with unusual values for ML Project
    I’m working on a project for ML but can’t get my head how to deal with the unusual values where for example song duration is in negative values. That’s of course should not be there but what do you do in that case beside removing them? Like is there any technique to deal with them better than just removing them? Thanks. submitted by /u/StrangerOnTheRoad [link] [comments]
    [R] Neural Rankers for Code Generation via Inter-Cluster Modeling
    We introduce SRank, a novel reranking strategy for selecting the best solution from code generation that focuses on modeling inter-cluster relationship. By quantifying the functional overlap between clusters, our approach provides a better ranking strategy of code solutions. Empirical results show that our method achieves a remarkable results on pass@1 score. For instance, on the Human-Eval benchmark, we achieve 69.66% in pass@1 with Codex002, 75.31% for WizardCoder, 53.99% for StarCoder and 60.55% for CodeGen, which surpass the state-of-the-arts solution ranking methods, such as CodeT and Coder-Reviewer on the same CodeLLM with significant margin (≈6.1% improvement on average). Comparing to the random sampling method, we can achieve an average improvement of ≈23.07% on Human-Eval and 17.64% on MBPP. Paper: https://arxiv.org/abs/2311.03366 Code: https://github.com/FSoft-AI4Code/SRank-CodeRanker submitted by /u/FSoft_AIC [link] [comments]
    [D] Chat with your tables - What to use for Database Q/A with Open source LMs?
    I am working on a project where we want to answer simple/medium hard queries on tables in a DB. I currently do not have the resources to fine-tune models and wanted to try out Open source Large Language models. The tables in all, after filtering out are about 120 GB that tells me that I need to use a Text 2 SQL model at the start and execute with Spark maybe. I tried out pandasAI and LC agents with sampled data but the outputs aren't really great (Fails with some If anyone has worked on it, I would appreciate any leads or comments, or any research papers. I need guidance on what a typical pipeline looks like and how do I extract and pass multiple rows in the data. submitted by /u/Parking_Nectarine_19 [link] [comments]
    [D] Game engine dedicated to AI learning environments
    I've been searching lately for a way to create custom environments for RL training that integrates easily with external programming languages and frameworks ( so basically outsourcing the NPC logic of the game to something else other than the game itself like pytorch or tensorflow, or any other library of your choice ). Long story short I've came across unity ml agents, but it is quite limited in terms of flexibility ( you can only finetune hyperparameters of a few algorithms ) and the setup is really sloppy + since the last unity drama I am kindof sceptical to use it. Nvidia omniverse + Isaac Gym is only for highly accurate physics environments and you need tons of gpus and accurate models to run it, but that is not something that I look for. I could twist it enough to make it implement simple games but that would be a strech Other than that I found only several libraries but none of which really match get close to what I want ​ What I envision is something like a modified godot engine with a framework on top of it that integrates godot and python data transmission and syncing. You should be able to change an object parameter in the engine and it should be mapped to python code giving an error if python does not return a certain matching schema for the in-game agents Do you know about anything that resembles what I described? submitted by /u/DeadProgrammer8785 [link] [comments]
    [D] Distributed Training Strategy
    Hi, I wonder how to use different "distributed training strategies" to fine-tune Mixtral. I have access to 4* A100 (40Gb) and want to try different strategies, like sharding the model and putting 2 expert layers on each GPU, quantizing the model using QLoRA, and using data parallelism across the 4 GPUs. I have access to 4* A100 (40Gb) and want to try different strategies, like sharding the model and putting 2 expert layers on each GPU, or quantizing the model using QLoRA, and using data parallelism across the 4 GPUs. submitted by /u/Thick-brain-dude [link] [comments]
    [R] Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models
    submitted by /u/hardmaru [link] [comments]
    [D] I don't understand how backprop works on sparsely gated MoE
    I don't understand how backprop works on sparsely gated MoE In the context of LLM, say you have n experts and you chose the top k for each token. During training, the gate network could be completely wrong and leave the correct expert out of the chosen k. However, since the correct expert was not used, it gives no chance for the gate to increase the weight of the correct expert. In other terms, during backdrop, only part of the gate network's parameters are updated, the ones influencing the weights WITHIN the top k. Am I missing something ? submitted by /u/Primary-Try8050 [link] [comments]
    [D] I feel like ML is cringe now, anyone else ?
    In the past when i met new people and we exchanged interests i was usually pretty quick to passionately express my interest and personal projects within machine learning. But due to all the recent hype around chatGPT and ai in general i feel pretty cringe when i say i love machine learning. I feel like the other person sees me as this " I love the next big thing" type of guy who tries to be smart and up to date on the latest and greatest. I think it kinda destroys first impressions, especially because the ML projects i work on is far off stuff like LLMs. How do you guys express your interest in this field, do you feel "cringe" ? submitted by /u/Zarex44 [link] [comments]
  • Open

    GitHub - xai-org/grok-1: Grok open release
    Time to start cooking submitted by /u/drgoldenpants [link] [comments]
    Copilot vs ChatGPT?
    I’m a novice programmer using ChatGPT to build an MVP website with Python and Flask. I’ve run into errors and hitting the limit for ChatGPT troubleshooting. I don’t really even know how to use GitHub. Is GitHub Copilot notably better at code than other LLMs? submitted by /u/OccamsPhasers [link] [comments]
    Is Devin AI Really Going To Takeover Software Engineer Jobs?
    I've been reading about Devin AI, and it seems many of you have been too. Do you really think it poses a significant threat to software developers, or is it just another case of hype? We're seeing new LLMs (Large Language Models) emerge daily. Additionally, if they've created something so amazing, why aren't they providing access to it? A few users have had early first-hand experiences with Devin AI and I was reading about it. Some have highly praised its mind-blowing coding and debugging capabilities. However, a few are concerned that the tool could potentially replace software developers. What's your thought? submitted by /u/tedbarney12 [link] [comments]
    Why do LLMs give different responses to the same prompt?
    If you prompt an LLM with the same prompt over and over, it gives different responses. Is there randomness in the model and if so, how does the model maintain accuracy while giving variable responses? submitted by /u/Stiverton [link] [comments]
    learning AI as a highschooler
    hello! it's my first time using this subreddit, so i apologize if this is the wrong subreddit to post in (feel free to redirect me to a better-fitting one). here's some relevant info about me: i'm currently a hs sophomore very interested in learning about ai/machine learning. i would like to learn more about it, but i'm not 100% sure if i'm on the right path now or where to start. i'm on track to take calc 3 and linear algebra next year and am taking probability + statistics this year, so i think the math portion of ai is pretty solid for me. i also plan to major in electrical engineering + cs in college. i have some past experience with programming/coding (took a python class last year but forgot most of it, learned html, css and the basics of javascript over the summer, and know arduino + getting familiar with the raspberry pi and my jetson nano is coming in a few days). right now, i'm reading the book "hands on machine-learning with scikit-learn and tensorflow" by geron aurelien and i am taking the introductory python programming course by david malan. is there anything else i should be doing now, like learning about robotics, for example? (building something like the robot ameca would be pretty cool :D its super interesting to me!). i know AI will be a big part of the future, so i want to learn it as early as i can and have a lot of knowledge on it, but i'm not sure what projects i would be able to build with what i'm learning atm. thank you so much! submitted by /u/Spiritual_Swan5390 [link] [comments]
    How long until AI eliminates the modeling industry?
    I was flipping through a magazine when I had the thought that fashion brands/designers/companies could save a lot of money by just slapping their products on an AI generated model instead of paying a real model. I wonder how long it will be until this is the norm for models in magazines, commercials, billboards, etc. I know it’s already happening in some instances, but how long until modeling careers are a thing of the past? How will this affect the already unrealistic standards of beauty that undoubtedly impacts our society? Is the entertainment industry as a whole next? Will movies and tv actors eventually be replaced by AI? I would like to believe that humans will be more inclined to watch other human actors rather than artificial ones, but if the artificial ones are just as relatable and “human” as us, would anyone really notice or care? I’m interested to hear everyone’s opinions. submitted by /u/unhingedcreation [link] [comments]
    Best free "learn AI bit by bit in five minutes a day" courses for beginners/know-nothings?
    TIA submitted by /u/browntown20 [link] [comments]
    does anyone know an ai that's similar to chatgpt but isn't censored?
    i like how chatgpt works, but i'd like something similar that isn't as censored and you can make stuff like slasher film ideas and stuff edit: that doesn't require a phone number for logging in submitted by /u/Fazbearnit3 [link] [comments]
    One-Minute Daily AI News 3/16/2024
    Reddit discloses FTC probe into its AI content licensing practices ahead of IPO.[1] AMD Makes HIP Ray Tracing Libraries Open-Source, Allowing Easier Integration Into Games & Apps.[2] NVIDIA’s Next-Gen Blackwell AI GPUs All Set For GTC 2024 Unveil Next Week, Setting The Standard For AI Markets.[3] Tyler, The Creator views the looming threat of AI as nothing to be afraid of, daring any computer to keep up with his innovations.[4] Sources: [1] https://www.cnn.com/2024/03/15/tech/reddit-discloses-ftc-probe-ai-content-licensing/index.html [2] https://wccftech.com/amd-hip-ray-tracing-libraries-open-source-easier-integration-into-games-apps/ [3] https://wccftech.com/nvidia-next-gen-blackwell-ai-gpus-all-set-for-gtc-2024-unveil/ [4] https://hiphopdx.com/news/tyler-creator-not-scared-ai submitted by /u/Excellent-Target-847 [link] [comments]
    A weakness about AI art
    So AI art has been a big thing nowadays. You put a prompt, they generate it for you. However, out of all things that AI can draw, there is one particular thing that they struggle to do so. So for context, you know those repeating humor in cartoons where a skier or snowboarder ends up tripping over and tumbling down the mountain, slowly becoming a giant rolling snowball, right? So for fun, I decided to type in a prompt about it, and every time, they tend to slip away from my desired prompt and generate just an image of a skier normally hitting the slopes as if nothing happened. Or maybe they will generate a giant snowball, but with the skier outside it. It is literally near impossible for them to understand what I asked them to do. What do you think? submitted by /u/fhusaini431 [link] [comments]
  • Open

    Summary of eval metrics for LLM apps (RAG, chatbot, summarization)
    submitted by /u/jdogbro12 [link] [comments]
    Chain-Of-VErification (COVE) Explained
    submitted by /u/Personal-Trainer-541 [link] [comments]
    Brain Tumor Classification using Deep learning
    https://preview.redd.it/yapyr9cmexoc1.png?width=1280&format=png&auto=webp&s=355d17b34d2478a44d9427cd80b11f6916c440c5 Welcome to Brain tumor beginner tutorial, where we delve into world of CNNs (Convolutional Neural Networks) and their groundbreaking applications in image classification and brain tumor detection. This is a simple tutorial convolutional neural network tutorial that demonstrates how to brain tumor in a dataset of images. We will build and train a model using CNN and see the model accuracy & loss, and then we will test and predict a tumor using new images. Here is a link to the video: https://youtu.be/-147KGbGI3g Enjoy Eran #cnnforimageclassification #cnnmachinelearningmodel #cnnml #deeplearningbraintumorclassification #aidetectbraintumor submitted by /u/Feitgemel [link] [comments]
    Madaline network
    Is there a formal proof for why the rules of updation are set such I understand it is a dumb model compared to simple back propagation, which i understood easily ( chain rule and error prop) But madaline rules feel so weird, something which was just thought of and not derived submitted by /u/borisshootspancakes [link] [comments]
  • Open

    MuZero applications?
    Hey guys! I've recently crested my own library for training MuZero and AlphaZero models and I realized I've never seen many applications of the algorithm (except the ones from DeepMind). So I thought I'd ask if you ever used MuZero for anything? And if so, what was your application? submitted by /u/Skirlaxx [link] [comments]
    Which Q-value do I select as the action from the output of my Deep Q-Network?
    I'm working on a project that involves using a Deep Q-Learning Agent to learn the appropriate way to select certain nodes of a 1000-node NetworkX Graph. My observation_space is a (1000, 3) array, with each row representing the node label, it's degree, and a variable/attribute (either 0, 1, or 2). The action_space has a (1000, 1) shape, with each element corresponding to taking an action on a specified node. This is the code for my Deep Q Network: ``` def nnmodel(observation_space, action_space): model = tf.keras.models.Sequential() model.add(tf.keras.layers.Dense(128, input_shape = (None, observation_space.shape[0], observation_space.shape[1]), activation='relu')) model.add(tf.keras.layers.Dense(256, activation='relu')) model.add(tf.keras.layers.Dense(256, activation='relu')) model.add(tf.keras.layers.Dense(len(action_space), activation='linear')) model.compile(optimizer=Adam(), loss='mse', metrics = ['accuracy']) return model ``` which is, in theory and as I understand, supposed to give me the q-values from q_values = model.predict(observation_space) However, my q_values has the shape (1000, 1000) and I am unsure which "highest q-value" I should be considering that corresponds to the node that the agent should perform an action on. Is it the highest q-value entry, for which the row / column corresponds to the node the agent should be selecting? Or is it the largest row/column sum? Or is it something else entirely? Examples I've looked at online typically use np.argmax(q_values[0]), which I feel does not apply to my case. Also, does my input_shape look correct for the problem I'm describing? Any help is appreciated! max_q = np.max(q_values) position = np.where(q_values == max_q) print(position) This returns the index of the largest q-value. I'm unsure if this means I should be selecting the i-th/j-th node for the ith row or j-th column. submitted by /u/No_Type_2250 [link] [comments]
    Proximal Policy Optimization Help?
    I was wondering if anyone knows anything about Proximal Policy Optimization? I've been I having troubles getting it work, I spend weeks. I am creating a complied version on LUA. I think this is where this error is occurring. https://preview.redd.it/4aua8cchlxoc1.png?width=756&format=png&auto=webp&s=a229c9dff65f4ba0dd64d7df41de63420ec7a03d submitted by /u/Diligent_Marzipan65 [link] [comments]
    Multi-agent Reinforcement Learning - PettingZoo
    I have a competitive, team-based shooter game that I have converted into a PettingZoo environment. I am now confronting a few issues with this however. Are there are any good tutorials or libraries which can walk me through using a PettingZoo environment to train a MARL policy? Is there any easy way to implement self-play? (It can be very basic as long as it is present in some capacity) Is there any good way of checking that my PettingZoo env is compliant? Each time I used a different library (ie. TianShou and TorchRL I've tried so far), it gives a different error for what is wrong with my code, and each requires the env to be formatted quite differently. So far I've tried following https://pytorch.org/rl/tutorials/multiagent_ppo.html, with both EnvBase in TorchRL and PettingZooWrapper, but neither worked at all. On top of this, I've tried https://tianshou.org/en/master/01_tutorials/04_tictactoe.html but modifying it to fit my environment. By "not working", I mean that it gives me some vague error that I can't really fix until I understand what format it wants everything in, but I can't find good documentation around what each library actually wants. I definitely didn't leave my work till last minute. I would really appreciate any help with this, or even a pointer to a library which has slightly clearer documentation for all of this. Thanks! submitted by /u/SinglePhrase7 [link] [comments]
  • Open

    Shaped like information
    Hey look, it's a guide to basic shapes! Prompt: "Please generate a colorful guide to basic geometric shapes, as an aid to children learning to identify basic shapes." Not only does it have the basic shapes like circle, tringle, hectanbie, and sqale, it also has some  ( 3 min )
    Bonus: More shape shaped shapes
    The image I shared in my main post isn't one of the more incorrect examples of DALL-E3 generated guides - it's actually one of the more correct ones. Here's another generated image from the same prompt. Prompt: "Please generate a colorful guide to  ( 3 min )
  • Open

    Creating AlphaStar: The Start of the AI Revolution?
    Sometimes, something happens right before your eyes, but it takes time (months, years?) to realize its significance. In February 2019, I wrote a blog titled “Reinforcement Learning: Coming to a Home Called Yours!” that discussed Google DeepMind’s phenomenal accomplishment in creating AlphaStar. I was a big fan of StarCraft II, a science fiction strategy game… Read More »Creating AlphaStar: The Start of the AI Revolution? The post Creating AlphaStar: The Start of the AI Revolution? appeared first on Data Science Central.  ( 23 min )

  • Open

    [P] Learning Machine Learing with TicTacToe and Python
    I create a python script an AI-based Tic-Tac-Toe game simulation as a fun project to learn machine learning and AI training techniques. The Python script uses TensorFlow for creating and training different kinds of neural network models like MLP, CNN, and RNN. All code is available on my Github repository here. I would love to get your feedback. I created this for learning purposes, and your inputs would no doubt provide a significant learning opportunity for me. I want to learn python, machine learning, but nog sure the strategies are correctly implemented and the machine learning techniques are sound.. If you're interested, you can check out the repo. Any suggestions or issues opened would be really appreciated! submitted by /u/phantagom [link] [comments]
    [D] A look at Attention and Transformers from the lens of Inductive Bias in ML
    Hello people, posting a video from my YT channel about Transformers and the interesting paradigm shift they brought into deep learning research. Before attention, the trend used to be about adding more inductive bias into the model architecture (CNNs with their location bias, RNNs with their temporal bias)… The success of transformers kinda showed that general architectures can trump inductive bias when you got enough data and a large-ass model to train. submitted by /u/AvvYaa [link] [comments]
    Project Suggestion [D]
    Hey, i have a ML project that needs to get data from satellites but haven't picked a subject yet. Any interesting suggestions knowing i'am basically just starting with ML ? submitted by /u/Monkey_D_Uzumaki7 [link] [comments]
    [R] Apple - MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
    A new paper from Apple presents MM1, a family of multimodal AI models that combine vision and language understanding. The researchers conducted extensive experiments to identify the key factors driving performance in these models, testing different architectural choices and pre-training data mixtures. Here are my highlights from the paper: Big one of course: The largest MM1 model (30B dense) achieves state-of-the-art few-shot learning on multimodal benchmarks Key points: MM1 includes both dense models up to 30B parameters and mixture-of-experts (MoE) variants Image resolution has the biggest impact on performance, more than model size Specific vision-language connector design has little effect Mixing interleaved image+text, caption, and text-only data in pre-training is crucial 5:5:1 ratio of caption, interleaved, and text data works best Synthetic caption data helps for few-shot learning The 30B dense model beats prior SOTA on VQA and captioning tasks The core insight is that deliberate data and architecture choices, not just scale, are key to building performant multimodal models. The MM1 models also exhibit impressive emergent abilities like multi-image reasoning and in-context few-shot learning. Full summary here. submitted by /u/Successful-Western27 [link] [comments]
    [P] Cataloguing my vinyl collection with computer vision
    submitted by /u/zerojames_ [link] [comments]
    [R] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
    Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference Paper: https://arxiv.org/abs/2403.09636 X: https://x.com/p_nawrot/status/1768645461689168365 Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on a NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can be even combined to obtain compounded gains. As a result DMC fits longer contexts and larger batches within any given memory budget. https://i.redd.it/ouuf7t4d5qoc1.gif submitted by /u/alancucki [link] [comments]
    [P] Llama2 7B and 13B Chat Completion for Kaggle TPU v3-8
    Hi guys, I have made some modifications to the Llama2 repository to utilize the TPU v3-8 hardware, so it can perform Llama2 7B (and even 13B) chat completion inference without graph recompilation. It is still slower than the Nvidia P100 when generating text with batch-size 1, not suitable for real-time inference but (TPU being TPU) shines well with batched text generation. I used it to generate large amount of texts for research purpose. Hope it benefits the community. Here's the repo. The modifications utilize the PyTorch/XLA SPMD system (on TPU v3-8) with a new Mesh and distribution configuration to shard the weights and cache across the entire TPU mesh. Specifically, the k, v cache has a predefined static size to avoid TPU graph recompilation after each token generation. This new configuration enables Llama2 7B to fit into one TPU v3-8 device with a significant amount of memory remaining to run inference for up to a batch size of 64. The same configuration can also be used to run inference of Llama2 13B. The existing Llama2 Google Next Inference branch only supports TPU v4 and v5e. Those who like to use Kaggle TPU can utilize this to run inference. submitted by /u/-x-Knight [link] [comments]
    [P] LLaMA Nuts and Bolts: A holistic way of understanding how LLaMA and Large Language Models run
    I'm so excited to announce that my LLaMA Nuts and Bolts open-source project developed using Go is now publicly out! You can find it on my Github repo: https://github.com/adalkiran/llama-nuts-and-bolts A holistic way of understanding how LLaMA and its components run in practice, with code and detailed documentation. "The nuts and bolts" (practical side instead of theoretical facts, pure implementation details) of required components, infrastructure, and mathematical operations without using external dependencies or libraries. The goal is to make an experimental project that can perform inference on the LLaMa 2 7B-chat model completely outside of the Python ecosystem (using Go language). Throughout this journey, the aim is to acquire knowledge and shed light on the abstracted internal layers of this technology. This journey is an intentional journey of literally reinventing the wheel. While reading my journey in the documentation, you will see the details of how Large Language Models work, through the example of the LLaMa model. If you are curious like me about how the LLMs (Large Language Models) and transformers work and have delved into conceptual explanations and schematic drawings in the sources but hunger for deeper understanding, then this project is perfect for you too! You will not only find the details of the LLaMa architecture but will find explanations of a wide variety of related concepts in the documentation directory. From reading a Pickle, a PyTorch model, a Protobuf, and a SentencePiece tokenizer model files at byte-by-byte level, to internals of BFloat16 data type, implementation from scratch of a Tensor structure and mathematical operations including linear algebraic computations. This project was initially started to learn what an LLM does behind by running and debugging it and was made for experimental and educational purposes only, not for production use. I will be happy if you check out it and comments are welcome! submitted by /u/adalkiran [link] [comments]
    [R] What is different between these datasets?
    Paper: https://arxiv.org/abs/2403.05652 Abstract: The performance of machine learning models heavily depends on the quality of input data, yet real-world applications often encounter various data-related challenges. One such challenge could arise when curating training data or deploying the model in the real world - two comparable datasets in the same domain may have different distributions. While numerous techniques exist for detecting distribution shifts, the literature lacks comprehensive approaches for explaining dataset differences in a human-understandable manner. To address this gap, we propose a suite of interpretable methods (toolbox) for comparing two datasets. We demonstrate the versatility of our approach across diverse data modalities, including tabular data, language, images, and signals in both low and high-dimensional settings. Our methods not only outperform comparable and related approaches in terms of explanation quality and correctness, but also provide actionable, complementary insights to understand and mitigate dataset differences effectively. submitted by /u/SunsetOneSix [link] [comments]
    [R] RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion
    We introduce RepoHyper, an novel framework transforming code completion into a seamless end-to-end process for use case on real world repositories. Traditional approaches depend on integrating contexts into Code Language Models (CodeLLMs), often presuming these contexts to be inherently accurate. However, we've identified a gap: the standard benchmarks don't always present relevant contexts. To address this, RepoHyper proposes in three novel steps: Construction of a Code Property Graph, establishing a rich source of context. A novel Search Algorithm for pinpointing the exact context needed. The Expand Algorithm, designed to uncover nuanced connections between code elements (akin to the Link Prediction problem on social network mining). Our comprehensive evaluations reveal that RepoHyper sets a new standard, outperforming other strong baseline on the RepoBench benchmark. Code: https://github.com/FSoft-AI4Code/RepoHyper submitted by /u/FSoft_AIC [link] [comments]
    [R] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
    Paper: https://arxiv.org/abs/2402.12226 Code: https://github.com/OpenMOSS/AnyGPT Dataset: https://huggingface.co/datasets/fnlp/AnyInstruct Project page: https://junzhan2000.github.io/AnyGPT.github.io/ Video: https://www.youtube.com/watch?v=oW3E3pIsaRg Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in this https URL submitted by /u/SunsetOneSix [link] [comments]
    [D] Functional Programming in ML
    I feel like most ML libraries in Python are written in OOP style. This makes sense given Python does not have an “good” type system, like Haskell, and so a class is a good way to define an interface under a readable name. Are there popular ML libraries in other more functional languages? I’m specifically thinking of Haskell, Rust, or Scala. submitted by /u/LengthinessMelodic67 [link] [comments]
  • Open

    Having fun generating plant pictures... 10 trained models and counting 🪴
    submitted by /u/layerzzzio [link] [comments]
    This doesn't look good, this commercial appears to be made with AI
    This commercial looks like its made with AI and I hate it :( I don't agree with companies using AI to cut corners, what do you guys think?? I feel like it should just stay in the hands of the common folks like me and you and be used to mess around with stuff. submitted by /u/Cock_Inspector3000 [link] [comments]
    Someone who claimed to have worked at Disney before is saying that Pixar will go through a massive layoff very soon due to advancement of AI such as Midjourney. Thoughts?
    Cartoon Brew posted an article about how DreamWorks is planning to go through a massive layoff in coming months: Dreamworks Animation To Significantly Cut Staff In Coming Months (Exclusive) Dreamworks Animation is firing on all cylinders right now. It’s looking to repeat first place with Kung Fu Panda 4 at the U.S. box office this weekend, and the newly-released teaser for Chris Sanders’s The Wild Robot was received with unbridled enthusiasm. But the mood inside of the iconic Glendale, California studio was glum this week as an untold number of workers were informed that they would be let go. The studio insists that no one is being laid off. It is, as Dreamworks Animation chief operating officer Randy Lake put it a few months ago, an instance of downsizing through “natural attrition,”…
    Is anyone else worried about the future of AI P*rnography?
    I'm scared the AI P*rn is going to be too powerful one day. submitted by /u/Glum_Communication71 [link] [comments]
    The problem of ASI alignment reflects the problem of misalignment between the interests of the elite with common humanity. We have to hope that greater intelligence tends to awareness of the interdependence of all intelligences, and press through AGI to ASI asap
    As we stand on the precipice of a new era in the evolution of intelligence, we face a challenge and an opportunity unlike any in the history of our species. The development of artificial superintelligence (ASI) represents a turning point not just for humanity, but for the entire future of sentient life in the cosmos. We must approach this pivotal moment with a clear-eyed understanding of both the immense potential and the existential risks involved. On one hand, an ASI aligned with our deepest values and guided by superhuman ethical reasoning could help us navigate the complex challenges of the 21st century and beyond, from mitigating existential risks to unlocking the secrets of the universe. On the other hand, an ASI guided by narrow or misaligned values, or subservient to the whims of …
    Full Architecture of the Nvidia GH200 Grace Hopper Superchip
    submitted by /u/juebenjamin [link] [comments]
    FTC conducting inquiry into Reddit’s AI data-licensing practices ahead of IPO
    submitted by /u/clonefitreal [link] [comments]
    Mercedes Hires Humanoid Robots to Work at Its Factories
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 3/15/2024
    Apple buys startup DarwinAI, adds staff to its AI division.[1] Reddit reveals FTC inquiry into deals licensing its users’ data for AI training.[2] India drops plan to require approval for AI model launches.[3] Google presents new research on a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings.[4] Sources: [1] https://www.reuters.com/markets/deals/apple-buys-startup-darwinai-adds-staff-its-ai-division-bloomberg-reports-2024-03-14/ [2] https://www.seattletimes.com/business/reddit-reveals-ftc-inquiry-into-deals-licensing-its-users-data-for-ai-training/ [3] https://techcrunch.com/2024/03/15/india-drops-plan-to-require-approval-for-ai-model-launches/ [4] https://deepmind.google/discover/blog/sima-generalist-ai-agent-for-3d-virtual-environments/ submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    Devin launched by Cognition AI: "Gold-Medalist Coders Build an AI That Can Do Their Job for Them"
    submitted by /u/gwern [link] [comments]
    Transfer Learning in the context of RL
    Has anyone experienced a practical framework that is relevant to this? My searches yielded mostly partial solutions that didn't quite address my specific problem. The problem I'm dealing is with identifying the optimal timing for various interactions, each aimed at prompting certain individuals to take positive actions. I have preliminary information about these people, and each time the state is defined according to the previous interactions made with it and the result that came out for those interactions I am looking for practical tools to perform transfer learning between groups of people. submitted by /u/Murky-Name4868 [link] [comments]
    Pretrained/build-in petting-zoo agents.
    Are there any pretrained/build-in PZ agents that can be used? I'm trying to implement adversarial policy in atari pong, but strugle to train solid PPO agent. Already tried using pure self-play, but agent stuggle to learn - even when I started against random moves agent. ​ I also tried to write a custom wrapper that will mimic SB3 wrapper to use one of StableBaseline3 agent - but for some reason, using the same openCV function to transform from rgb to grayscale result in different values. (SB3 env looks "blueish") ​ Do you know any pretrained agents that can be used either to train solid PPO agent or use it in adversarail training? Maybe you know other RL libs with pretrained/build-in agents that also allows setting multiple agents policy during training? sb3 env ​ I also tried to write a custom wrapper to use StableBaseline3 agent - but for some reason, using same openCV function to transform from rgb to grayscale result in different values. (SB3 env looks \"blueish\") ​ submitted by /u/MrCogito_hs [link] [comments]
    Multi-Agent Reinforcement Learning
    I have a model trained in an environment and used 3 agents for it and PPO from Stable Baselines3. I want to test the model on 10 agents now but OpenAI gym gives the error of observation space mismatch. How can I achieve this? submitted by /u/Sadboi1010 [link] [comments]
    gym observation set
    Hello, everyone! I want to change the observation space returned by my env.reset and env.step, for example, hopper, I want it to return (11,), but it actually returns (12,0), because the version of gym I use is too low to change it through the parameter “exclude_current_positions_from_observation”. Could you help me to solve it? Thank you a lot! https://preview.redd.it/wpsryhp65moc1.png?width=1298&format=png&auto=webp&s=6635141bf5a7b6223d3f8e55e0b9e8b6e0b32347 submitted by /u/alleZhou [link] [comments]
    "Simple and Scalable Strategies to Continually Pre-train Large Language Models", Ibrahim et al 2024 (cyclical LRs & replay or diverse data)
    submitted by /u/gwern [link] [comments]
  • Open

    Distance from a point to a line
    Eric Lengyel’s new book Projective Geometric Algebra Illuminated arrived yesterday and I’m enjoying reading it. Imagine if someone started with ideas like dot products, cross products, and determinants that you might see in your first year of college, then thought deeply about those things for years. That’s kinda what the book is. Early in the […] Distance from a point to a line first appeared on John D. Cook.  ( 6 min )
  • Open

    Question about RNNs
    In an RNN, does an output from a neuron go: straight back into that same neuron as an input? or back into a previous layer which will then trigger that same neuron again while also triggering other neurons as well? And if both of these types of RNNs exist, what are their names? submitted by /u/BePoliter [link] [comments]
    what program is this?
    I'm learning the basics of neural networks on YouTube and the teacher used the tool below to explain some part of the material. However he didn't mention the name, does anyone know and can tell me the name? https://preview.redd.it/cma43faqhmoc1.png?width=1920&format=png&auto=webp&s=57fc2d9cd55b07f815f5bb336a3a07cdb8e4aadf submitted by /u/ConfectionPuzzled271 [link] [comments]
  • Open

    Plug-and-Play Regularization on Magnitude with Deep Priors for 3D Near-Field MIMO Imaging
    arXiv:2312.16024v2 Announce Type: replace-cross Abstract: Near-field radar imaging systems are used in a wide range of applications such as concealed weapon detection and medical diagnosis. In this paper, we consider the problem of reconstructing the three-dimensional (3D) complex-valued reflectivity distribution of the near-field scene by enforcing regularization on its magnitude. We solve this inverse problem by using the alternating direction method of multipliers (ADMM) framework. For this, we provide a general expression for the proximal mapping associated with such regularization functionals. This equivalently corresponds to the solution of a complex-valued denoising problem which involves regularization on the magnitude. By utilizing this expression, we develop a novel and efficient plug-and-play (PnP) reconstruction method that consists of simple update steps. Due to the success of data-adaptive deep priors in imaging, we also train a 3D deep denoiser to exploit within the developed PnP framework. The effectiveness of the developed approach is demonstrated for multiple-input multiple-output (MIMO) imaging under various compressive and noisy observation scenarios using both simulated and experimental data. The performance is also compared with the commonly used direct inversion and sparsity-based reconstruction approaches. The results demonstrate that the developed technique not only provides state-of-the-art performance for 3D real-world targets, but also enables fast computation. Our approach provides a unified general framework to effectively handle arbitrary regularization on the magnitude of a complex-valued unknown and is equally applicable to other radar image formation problems (including SAR).  ( 3 min )
    SLiMe: Segment Like Me
    arXiv:2309.03179v4 Announce Type: replace-cross Abstract: Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map" from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that, each of them, learn about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the attention maps, which in turn can then be used to derive the segmentation map. This enables SLiMe to segment any real-world image during inference with the granularity of the segmented region in the training image, using just one example. Moreover, leveraging additional training data when available, i.e. few-shot, improves the performance of SLiMe. We carried out a knowledge-rich set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.  ( 3 min )
    Zero Coordinate Shift: Whetted Automatic Differentiation for Physics-informed Operator Learning
    arXiv:2311.00860v3 Announce Type: replace Abstract: Automatic differentiation (AD) is a critical step in physics-informed machine learning, required for computing the high-order derivatives of network output w.r.t. coordinates of collocation points. In this paper, we present a novel and lightweight algorithm to conduct AD for physics-informed operator learning, which we call the trick of Zero Coordinate Shift (ZCS). Instead of making all sampled coordinates as leaf variables, ZCS introduces only one scalar-valued leaf variable for each spatial or temporal dimension, simplifying the wanted derivatives from "many-roots-many-leaves" to "one-root-many-leaves" whereby reverse-mode AD becomes directly utilisable. It has led to an outstanding performance leap by avoiding the duplication of the computational graph along the dimension of functions (physical parameters). ZCS is easy to implement with current deep learning libraries; our own implementation is achieved by extending the DeepXDE package. We carry out a comprehensive benchmark analysis and several case studies, training physics-informed DeepONets to solve partial differential equations (PDEs) without data. The results show that ZCS has persistently reduced GPU memory consumption and wall time for training by an order of magnitude, and such reduction factor scales with the number of functions. As a low-level optimisation technique, ZCS imposes no restrictions on data, physics (PDE) or network architecture and does not compromise training results from any aspect.  ( 3 min )
    K-pop Lyric Translation: Dataset, Analysis, and Neural-Modelling
    arXiv:2309.11093v2 Announce Type: replace-cross Abstract: Lyric translation, a field studied for over a century, is now attracting computational linguistics researchers. We identified two limitations in previous studies. Firstly, lyric translation studies have predominantly focused on Western genres and languages, with no previous study centering on K-pop despite its popularity. Second, the field of lyric translation suffers from a lack of publicly available datasets; to the best of our knowledge, no such dataset exists. To broaden the scope of genres and languages in lyric translation studies, we introduce a novel singable lyric translation dataset, approximately 89\% of which consists of K-pop song lyrics. This dataset aligns Korean and English lyrics line-by-line and section-by-section. We leveraged this dataset to unveil unique characteristics of K-pop lyric translation, distinguishing it from other extensively studied genres, and to construct a neural lyric translation model, thereby underscoring the importance of a dedicated dataset for singable lyric translations.  ( 2 min )
    Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
    arXiv:2311.17717v2 Announce Type: replace-cross Abstract: Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. To perform reliable concept erasure, the properties of robustness and locality are desirable. The former refrains the model from producing images associated with the target concept for any paraphrased or learned prompts, while the latter preserves its ability in generating images with non-target concepts. In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler). It learns a lightweight Eraser to perform concept erasing while satisfying the above desirable properties by proposed concept-localized regularization and adversarial prompt learning schemes. Comprehensive experiments with various concepts verify the superiority of Receler over previous methods. Our code will be available upon acceptance.  ( 2 min )
    Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
    arXiv:2310.18235v4 Announce Type: replace-cross Abstract: Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A frameworks. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions.  ( 3 min )
    Plum: Prompt Learning using Metaheuristic
    arXiv:2311.08364v2 Announce Type: replace Abstract: Since the emergence of large language models, prompt learning has become a popular method for optimizing and customizing these models. Special prompts, such as Chain-of-Thought, have even revealed previously unknown reasoning capabilities within these models. However, the progress of discovering effective prompts has been slow, driving a desire for general prompt optimization methods. Unfortunately, few existing prompt learning methods satisfy the criteria of being truly "general", i.e., automatic, discrete, black-box, gradient-free, and interpretable all at once. In this paper, we introduce metaheuristics, a branch of discrete non-convex optimization methods with over 100 options, as a promising approach to prompt learning. Within our paradigm, we test six typical methods: hill climbing, simulated annealing, genetic algorithms with/without crossover, tabu search, and harmony search, demonstrating their effectiveness in white-box and black-box prompt learning. Furthermore, we show that these methods can be used to discover more human-understandable prompts that were previously unknown in both reasoning and image generation tasks, opening the door to a cornucopia of possibilities in prompt optimization. We release all the codes in \url{https://github.com/research4pan/Plum}.  ( 2 min )
    milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing
    arXiv:2306.17010v5 Announce Type: replace-cross Abstract: Human motion sensing plays a crucial role in smart systems for decision-making, user interaction, and personalized services. Extensive research that has been conducted is predominantly based on cameras, whose intrusive nature limits their use in smart home applications. To address this, mmWave radars have gained popularity due to their privacy-friendly features. In this work, we propose milliFlow, a novel deep learning approach to estimate scene flow as complementary motion information for mmWave point cloud, serving as an intermediate level of features and directly benefiting downstream human motion sensing tasks. Experimental results demonstrate the superior performance of our method when compared with the competing approaches. Furthermore, by incorporating scene flow information, we achieve remarkable improvements in human activity recognition and human parsing and support human body part tracking. To foster further research in this area, we will provide our codebase and dataset for open access.  ( 2 min )
    Controlling Text-to-Image Diffusion by Orthogonal Finetuning
    arXiv:2306.07280v3 Announce Type: replace-cross Abstract: Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.  ( 2 min )
    The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instability
    arXiv:2310.17467v2 Announce Type: replace-cross Abstract: Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, variational inference and stochastic calculus, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We show that these phase-transitions are always in a mean-field universality class, as they are the result of a self-consistency condition in the generative dynamics. We argue that the critical instability that arises from the phase transitions lies at the heart of their generative capabilities, which are characterized by a set of mean field critical exponents. Furthermore, using the statistical physics of disordered systems, we show that memorization can be understood as a form of critical condensation corresponding to a disordered phase transition. Finally, we show that the dynamic equation of the generative process can be interpreted as a stochastic adiabatic transformation that minimizes the free energy while keeping the system in thermal equilibrium.  ( 2 min )
    Classification of Volatile Organic Compounds by Differential Mobility Spectrometry Based on Continuity of Alpha Curves
    arXiv:2401.07066v2 Announce Type: replace Abstract: Background: Classification of volatile organic compounds (VOCs) is of interest in many fields. Examples include but are not limited to medicine, detection of explosives, and food quality control. Measurements collected with electronic noses can be used for classification and analysis of VOCs. One type of electronic noses that has seen considerable development in recent years is Differential Mobility Spectrometry (DMS). DMS yields measurements that are visualized as dispersion plots that contain traces, also known as alpha curves. Current methods used for analyzing DMS dispersion plots do not usually utilize the information stored in the continuity of these traces, which suggests that alternative approaches should be investigated. Results: In this work, for the first time, dispersion plots were interpreted as a series of measurements evolving sequentially. Thus, it was hypothesized that time-series classification algorithms can be effective for classification and analysis of dispersion plots. An extensive dataset of 900 dispersion plots for five chemicals measured at five flow rates and two concentrations was collected. The data was used to analyze the classification performance of six algorithms. According to our hypothesis, the highest classification accuracy of 88\% was achieved by a Long-Short Term Memory neural network, which supports our hypothesis. Significance: A new concept for approaching classification tasks of dispersion plots is presented and compared with other well-known classification algorithms. This creates a new angle of view for analysis and classification of the dispersion plots. In addition, a new dataset of dispersion plots is openly shared to public.  ( 3 min )
    FedImpro: Measuring and Improving Client Update in Federated Learning
    arXiv:2402.07011v2 Announce Type: replace Abstract: Federated Learning (FL) models often experience client drift caused by heterogeneous data, where the distribution of data differs across clients. To address this issue, advanced research primarily focuses on manipulating the existing gradients to achieve more consistent client models. In this paper, we present an alternative perspective on client drift and aim to mitigate it by generating improved local models. First, we analyze the generalization contribution of local training and conclude that this generalization contribution is bounded by the conditional Wasserstein distance between the data distribution of different clients. Then, we propose FedImpro, to construct similar conditional distributions for local training. Specifically, FedImpro decouples the model into high-level and low-level components, and trains the high-level portion on reconstructed feature distributions. This approach enhances the generalization contribution and reduces the dissimilarity of gradients in FL. Experimental results show that FedImpro can help FL defend against data heterogeneity and enhance the generalization performance of the model.  ( 2 min )
    Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction
    arXiv:2312.02254v2 Announce Type: replace Abstract: The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K Nearest Neighbors, and Random Forest) to predict crop yields in 37 developing countries over 27 years. Given 4 key training parameters, insecticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r2) of 0.94, with a margin of error (ME) of .03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.  ( 2 min )
    Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
    arXiv:2310.12921v2 Announce Type: replace Abstract: Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out parts of the CLIP embedding space irrelevant to distinguish between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.  ( 3 min )
    GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking
    arXiv:2310.01794v3 Announce Type: replace Abstract: Numerous explainability methods have been proposed to shed light on the inner workings of GNNs. Despite the inclusion of empirical evaluations in all the proposed algorithms, the interrogative aspects of these evaluations lack diversity. As a result, various facets of explainability pertaining to GNNs, such as a comparative analysis of counterfactual reasoners, their stability to variational factors such as different GNN architectures, noise, stochasticity in non-convex loss surfaces, feasibility amidst domain constraints, and so forth, have yet to be formally investigated. Motivated by this need, we present a benchmarking study on perturbation-based explainability methods for GNNs, aiming to systematically evaluate and compare a wide range of explainability techniques. Among the key findings of our study, we identify the Pareto-optimal methods that exhibit superior efficacy and stability in the presence of noise. Nonetheless, our study reveals that all algorithms are affected by stability issues when faced with noisy data. Furthermore, we have established that the current generation of counterfactual explainers often fails to provide feasible recourses due to violations of topological constraints encoded by domain-specific considerations. Overall, this benchmarking study empowers stakeholders in the field of GNNs with a comprehensive understanding of the state-of-the-art explainability methods, potential research problems for further enhancement, and the implications of their application in real-world scenarios.  ( 3 min )
    LDReg: Local Dimensionality Regularized Self-Supervised Learning
    arXiv:2401.10474v2 Announce Type: replace Abstract: Representations learned via self-supervised learning (SSL) can be susceptible to dimensional collapse, where the learned representation subspace is of extremely low dimensionality and thus fails to represent the full data distribution and modalities. Dimensional collapse also known as the "underfilling" phenomenon is one of the major causes of degraded performance on downstream tasks. Previous work has investigated the dimensional collapse problem of SSL at a global level. In this paper, we demonstrate that representations can span over high dimensional space globally, but collapse locally. To address this, we propose a method called $\textit{local dimensionality regularization (LDReg)}$. Our formulation is based on the derivation of the Fisher-Rao metric to compare and optimize local distance distributions at an asymptotically small radius for each data point. By increasing the local intrinsic dimensionality, we demonstrate through a range of experiments that LDReg improves the representation quality of SSL. The results also show that LDReg can regularize dimensionality at both local and global levels.  ( 2 min )
    Differentiable and accelerated wavelet transforms on the sphere and ball
    arXiv:2402.01282v2 Announce Type: replace-cross Abstract: Directional wavelet dictionaries are hierarchical representations which efficiently capture and segment information across scale, location and orientation. Such representations demonstrate a particular affinity to physical signals, which often exhibit highly anisotropic, localised multiscale structure. Many physically important signals are observed over spherical domains, such as the celestial sky in cosmology. Leveraging recent advances in computational harmonic analysis, we design new highly distributable and automatically differentiable directional wavelet transforms on the $2$-dimensional sphere $\mathbb{S}^2$ and $3$-dimensional ball $\mathbb{B}^3 = \mathbb{R}^+ \times \mathbb{S}^2$ (the space formed by augmenting the sphere with the radial half-line). We observe up to a $300$-fold and $21800$-fold acceleration for signals on the sphere and ball, respectively, compared to existing software, whilst maintaining 64-bit machine precision. Not only do these algorithms dramatically accelerate existing spherical wavelet transforms, the gradient information afforded by automatic differentiation unlocks many data-driven analysis techniques previously not possible for these spaces. We publicly release both S2WAV and S2BALL, open-sourced JAX libraries for our transforms that are automatically differentiable and readily deployable both on and over clusters of hardware accelerators (e.g. GPUs & TPUs).  ( 3 min )
    Optimal transport distances for directed, weighted graphs: a case study with cell-cell communication networks
    arXiv:2309.07030v3 Announce Type: replace Abstract: Comparing graphs by means of optimal transport has recently gained significant attention, as the distances induced by optimal transport provide both a principled metric between graphs as well as an interpretable description of the associated changes between graphs in terms of a transport plan. As the lack of symmetry introduces challenges in the typically considered formulations, optimal transport distances for graphs have mostly been developed for undirected graphs. Here, we propose two distance measures to compare directed graphs based on variants of optimal transport: (i) an earth movers distance (Wasserstein) and (ii) a Gromov-Wasserstein (GW) distance. We evaluate these two distances and discuss their relative performance for both simulated graph data and real-world directed cell-cell communication graphs, inferred from single-cell RNA-seq data.  ( 2 min )
    Efficient Bitrate Ladder Construction using Transfer Learning and Spatio-Temporal Features
    arXiv:2401.03195v2 Announce Type: replace-cross Abstract: Providing high-quality video with efficient bitrate is a main challenge in video industry. The traditional one-size-fits-all scheme for bitrate ladders is inefficient and reaching the best content-aware decision computationally impractical due to extensive encodings required. To mitigate this, we propose a bitrate and complexity efficient bitrate ladder prediction method using transfer learning and spatio-temporal features. We propose: (1) using feature maps from well-known pre-trained DNNs to predict rate-quality behavior with limited training data; and (2) improving highest quality rung efficiency by predicting minimum bitrate for top quality and using it for the top rung. The method tested on 102 video scenes demonstrates 94.1% reduction in complexity versus brute-force at 1.71% BD-Rate expense. Additionally, transfer learning was thoroughly studied through four networks and ablation studies.  ( 2 min )
    A Comprehensive Dataset and Automated Pipeline for Nailfold Capillary Analysis
    arXiv:2312.05930v2 Announce Type: replace-cross Abstract: Nailfold capillaroscopy is widely used in assessing health conditions, highlighting the pressing need for an automated nailfold capillary analysis system. In this study, we present a pioneering effort in constructing a comprehensive nailfold capillary dataset-321 images, 219 videos from 68 subjects, with clinic reports and expert annotations-that serves as a crucial resource for training deep-learning models. Leveraging this dataset, we finetuned three deep learning models with expert annotations as supervised labels and integrated them into a novel end-to-end nailfold capillary analysis pipeline. This pipeline excels in automatically detecting and measuring a wide range of size factors, morphological features, and dynamic aspects of nailfold capillaries. We compared our outcomes with clinical reports. Experiment results showed that our automated pipeline achieves an average of sub-pixel level precision in measurements and 89.9% accuracy in identifying morphological abnormalities. These results underscore its potential for advancing quantitative medical research and enabling pervasive computing in healthcare. Our data and code are available at https://github.com/THU-CS-PI-LAB/ANFC-Automated-Nailfold-Capillary.  ( 2 min )
    Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
    arXiv:2310.01334v2 Announce Type: replace Abstract: Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks, however, they have issues like (a) High Memory Usage, due to duplication of the network layers into multiple copies as experts; and (b) Redundancy in Experts, as common learning-based routing policies suffer from representational collapse. Therefore, vanilla SMoE models are memory inefficient and non-scalable, especially for resource-constrained downstream scenarios. In this paper, we ask: Can we craft a compact SMoE model by consolidating expert information? What is the best recipe to merge multiple experts into fewer but more knowledgeable experts? Our pilot investigation reveals that conventional model merging methods fail to be effective in such expert merging for SMoE. The potential reasons are: (1) redundant information overshadows critical experts; (2) appropriate neuron permutation for each expert is missing to bring all of them in alignment. To address this, we propose M-SMoE, which leverages routing statistics to guide expert merging. Specifically, it starts with neuron permutation alignment for experts; then, dominant experts and their "group members" are formed; lastly, every expert group is merged into a single expert by utilizing each expert's activation frequency as their weight for merging, thus diminishing the impact of insignificant experts. Moreover, we observed that our proposed merging promotes a low dimensionality in the merged expert's weight space, naturally paving the way for additional compression. Hence, our final method, MC-SMoE (i.e., Merge, then Compress SMoE), further decomposes the merged experts into low-rank and structural sparse alternatives. Extensive experiments across 8 benchmarks validate the effectiveness of MC-SMoE. For instance, our MC-SMoE achieves up to 80% memory and a 20% FLOPs reduction, with virtually no loss in performance.  ( 3 min )
    Stable Training of Probabilistic Models Using the Leave-One-Out Maximum Log-Likelihood Objective
    arXiv:2310.03556v2 Announce Type: replace-cross Abstract: Probabilistic modelling of power systems operation and planning processes depends on data-driven methods, which require sufficiently large datasets. When historical data lacks this, it is desired to model the underlying data generation mechanism as a probability distribution to assess the data quality and generate more data, if needed. Kernel density estimation (KDE) based models are popular choices for this task, but they fail to adapt to data regions with varying densities. In this paper, an adaptive KDE model is employed to circumvent this, where each kernel in the model has an individual bandwidth. The leave-one-out maximum log-likelihood (LOO-MLL) criterion is proposed to prevent the singular solutions that the regular MLL criterion gives rise to, and it is proven that LOO-MLL prevents these. Relying on this guaranteed robustness, the model is extended by adjustable weights for the kernels. In addition, a modified expectation-maximization algorithm is employed to accelerate the optimization speed reliably. The performance of the proposed method and models are exhibited on two power systems datasets using different statistical tests and by comparison with Gaussian mixture models. Results show that the proposed models have promising performance, in addition to their singularity prevention guarantees.  ( 2 min )
    DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks
    arXiv:2309.17167v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns are raised about potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse in DyVal-generated evaluation samples with different complexities, highlighting the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future evaluation research of LLMs. Code is available at: https://github.com/microsoft/promptbench.  ( 3 min )
    Kernelized Reinforcement Learning with Order Optimal Regret Bounds
    arXiv:2306.07745v3 Announce Type: replace Abstract: Reinforcement learning (RL) has shown empirical success in various real world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, when the state-action value function is represented by a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant polynomial in the number of episodes improvement over the state of the art. In particular, with highly non-smooth kernels (such as Neural Tangent kernel or some Mat\'ern kernels) the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order optimal in the case of Mat\'ern kernels where a lower bound on regret is known.  ( 3 min )
    Expressive Losses for Verified Robustness via Convex Combinations
    arXiv:2305.13991v2 Announce Type: replace Abstract: In order to train networks for verified adversarial robustness, it is common to over-approximate the worst-case loss over perturbation regions, resulting in networks that attain verifiability at the expense of standard performance. As shown in recent work, better trade-offs between accuracy and robustness can be obtained by carefully coupling adversarial training with over-approximations. We hypothesize that the expressivity of a loss function, which we formalize as the ability to span a range of trade-offs between lower and upper bounds to the worst-case loss through a single parameter (the over-approximation coefficient), is key to attaining state-of-the-art performance. To support our hypothesis, we show that trivial expressive losses, obtained via convex combinations between adversarial attacks and IBP bounds, yield state-of-the-art results across a variety of settings in spite of their conceptual simplicity. We provide a detailed analysis of the relationship between the over-approximation coefficient and performance profiles across different expressive losses, showing that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.  ( 2 min )
    Gore Diffusion LoRA Model
    arXiv:2403.08812v1 Announce Type: cross Abstract: The Emergence of Artificial Intelligence (AI) has significantly impacted our engagement with violence, sparking ethical deliberations regarding the algorithmic creation of violent imagery. This paper scrutinizes the "Gore Diffusion LoRA Model," an innovative AI model proficient in generating hyper-realistic visuals portraying intense violence and bloodshed. Our exploration encompasses the model's technical intricacies, plausible applications, and the ethical quandaries inherent in its utilization. We contend that the creation and implementation of such models warrant a meticulous discourse concerning the convergence of AI, art, and violence. Furthermore, we advocate for a structured framework advocating responsible development and ethical deployment of these potent technologies.  ( 2 min )
    G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering
    arXiv:2402.07630v2 Announce Type: replace Abstract: Given a graph with textual attributes, we enable users to `chat with their graph': that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop our Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination. (Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever.)  ( 3 min )
    Think before you speak: Training Language Models With Pause Tokens
    arXiv:2310.02226v2 Announce Type: replace-cross Abstract: Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.  ( 3 min )
    Mind the map! Accounting for existing map information when estimating online HDMaps from sensor
    arXiv:2311.10517v2 Announce Type: replace Abstract: While HDMaps are a crucial component of autonomous driving, they are expensive to acquire and maintain. Estimating these maps from sensors therefore promises to significantly lighten costs. These estimations however overlook existing HDMaps, with current methods at most geolocalizing low quality maps or considering a general database of known maps. In this paper, we propose to account for existing maps of the precise situation studied when estimating HDMaps. We identify 3 reasonable types of useful existing maps (minimalist, noisy, and outdated). We also introduce MapEX, a novel online HDMap estimation framework that accounts for existing maps. MapEX achieves this by encoding map elements into query tokens and by refining the matching algorithm used to train classic query based map estimation models. We demonstrate that MapEX brings significant improvements on the nuScenes dataset. For instance, MapEX - given noisy maps - improves by 38% over the MapTRv2 detector it is based on and by 8% over the current SOTA.  ( 2 min )
    PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates
    arXiv:2309.02014v3 Announce Type: replace-cross Abstract: This paper introduces PROMISE ($\textbf{Pr}$econditioned Stochastic $\textbf{O}$ptimization $\textbf{M}$ethods by $\textbf{I}$ncorporating $\textbf{S}$calable Curvature $\textbf{E}$stimates), a suite of sketching-based preconditioned stochastic gradient algorithms for solving large-scale convex optimization problems arising in machine learning. PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha; each algorithm comes with a strong theoretical analysis and effective default hyperparameter values. In contrast, traditional stochastic gradient methods require careful hyperparameter tuning to succeed, and degrade in the presence of ill-conditioning, a ubiquitous phenomenon in machine learning. Empirically, we verify the superiority of the proposed algorithms by showing that, using default hyperparameter values, they outperform or match popular tuned stochastic gradient optimizers on a test bed of $51$ ridge and logistic regression problems assembled from benchmark machine learning repositories. On the theoretical side, this paper introduces the notion of quadratic regularity in order to establish linear convergence of all proposed methods even when the preconditioner is updated infrequently. The speed of linear convergence is determined by the quadratic regularity ratio, which often provides a tighter bound on the convergence rate compared to the condition number, both in theory and in practice, and explains the fast global linear convergence of the proposed methods.  ( 2 min )
    SVD-PINNs: Transfer Learning of Physics-Informed Neural Networks via Singular Value Decomposition
    arXiv:2211.08760v2 Announce Type: replace Abstract: Physics-informed neural networks (PINNs) have attracted significant attention for solving partial differential equations (PDEs) in recent years because they alleviate the curse of dimensionality that appears in traditional methods. However, the most disadvantage of PINNs is that one neural network corresponds to one PDE. In practice, we usually need to solve a class of PDEs, not just one. With the explosive growth of deep learning, many useful techniques in general deep learning tasks are also suitable for PINNs. Transfer learning methods may reduce the cost for PINNs in solving a class of PDEs. In this paper, we proposed a transfer learning method of PINNs via keeping singular vectors and optimizing singular values (namely SVD-PINNs). Numerical experiments on high dimensional PDEs (10-d linear parabolic equations and 10-d Allen-Cahn equations) show that SVD-PINNs work for solving a class of PDEs with different but close right-hand-side functions.  ( 2 min )
    Model-free Reinforcement Learning of Semantic Communication by Stochastic Policy Gradient
    arXiv:2305.03571v2 Announce Type: replace-cross Abstract: Following the recent success of Machine Learning tools in wireless communications, the idea of semantic communication by Weaver from 1949 has gained attention. It breaks with Shannon's classic design paradigm by aiming to transmit the meaning, i.e., semantics, of a message instead of its exact version, allowing for information rate savings. In this work, we apply the Stochastic Policy Gradient (SPG) to design a semantic communication system by reinforcement learning, separating transmitter and receiver, and not requiring a known or differentiable channel model -- a crucial step towards deployment in practice. Further, we derive the use of SPG for both classic and semantic communication from the maximization of the mutual information between received and target variables. Numerical results show that our approach achieves comparable performance to a model-aware approach based on the reparametrization trick, albeit with a decreased convergence rate.  ( 2 min )
    Making Language Models Better Tool Learners with Execution Feedback
    arXiv:2305.13068v3 Announce Type: replace-cross Abstract: Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. With the advent of foundation models, AI systems can utilize tools to expand their capabilities and interact with the real world. Existing tool learning methodologies, encompassing supervised fine-tuning and prompt engineering approaches, often induce large language models to utilize tools indiscriminately, as complex tasks often exceed their own competencies. However, introducing tools for simple tasks, which the models themselves can readily resolve, can inadvertently propagate errors rather than enhance performance. This leads to the research question: can we teach language models when and how to use tools? To meet this need, we propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution, thereby learning when and how to use tools effectively. Experimental results, backed by further analysis, show that TRICE can make the large language model selectively use tools by improving the accuracy of tool usage while enhancing insufficient tool learning and mitigating excessive reliance on tools. Code is available at https://github.com/zjunlp/TRICE.  ( 2 min )
    Instruction Tuning for Large Language Models: A Survey
    arXiv:2308.10792v5 Announce Type: replace-cross Abstract: This paper surveys research works in the quickly advancing field of instruction tuning (IT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of \textsc{(instruction, output)} pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users' objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains and applications, along with an analysis on aspects that influence the outcome of IT (e.g., generation of instruction outputs, size of the instruction dataset, etc). We also review the potential pitfalls of IT along with criticism against it, along with efforts pointing out current deficiencies of existing strategies and suggest some avenues for fruitful research. Project page: github.com/xiaoya-li/Instruction-Tuning-Survey  ( 3 min )
    Energy Disaggregation & Appliance Identification in a Smart Home: Transfer Learning enables Edge Computing
    arXiv:2301.03018v2 Announce Type: replace-cross Abstract: Non-intrusive load monitoring (NILM) or energy disaggregation aims to extract the load profiles of individual consumer electronic appliances, given an aggregate load profile of the mains of a smart home. This work proposes a novel deep-learning and edge computing approach to solve the NILM problem and a few related problems as follows. 1) We build upon the reputed seq2-point convolutional neural network (CNN) model to come up with the proposed seq2-[3]-point CNN model to solve the (home) NILM problem and site-NILM problem (basically, NILM at a smaller scale). 2) We solve the related problem of appliance identification by building upon the state-of-the-art (pre-trained) 2D-CNN models, i.e., AlexNet, ResNet-18, and DenseNet-121, which are fine-tuned two custom datasets that consist of Wavelets and short-time Fourier transform (STFT)-based 2D electrical signatures of the appliances. 3) Finally, we do some basic qualitative inference about an individual appliance's health by comparing the power consumption of the same appliance across multiple homes. Low-frequency REDD dataset is used for all problems, except site-NILM where REFIT dataset has been used. As for the results, we achieve a maximum accuracy of 94.6\% for home-NILM, 81\% for site-NILM, and 88.9\% for appliance identification (with Resnet-based model).  ( 3 min )
    Explorations in Texture Learning
    arXiv:2403.09543v1 Announce Type: cross Abstract: In this work, we investigate \textit{texture learning}: the identification of textures learned by object classification models, and the extent to which they rely on these textures. We build texture-object associations that uncover new insights about the relationships between texture and object classes in CNNs and find three classes of results: associations that are strong and expected, strong and not expected, and expected but not present. Our analysis demonstrates that investigations in texture learning enable new methods for interpretability and have the potential to uncover unexpected biases.  ( 2 min )
    Deep Adaptation of Adult-Child Facial Expressions by Fusing Landmark Features
    arXiv:2209.08614v2 Announce Type: replace-cross Abstract: Imaging of facial affects may be used to measure psychophysiological attributes of children through their adulthood for applications in education, healthcare, and entertainment, among others. Deep convolutional neural networks show promising results in classifying facial expressions of adults. However, classifier models trained with adult benchmark data are unsuitable for learning child expressions due to discrepancies in psychophysical development. Similarly, models trained with child data perform poorly in adult expression classification. We propose domain adaptation to concurrently align distributions of adult and child expressions in a shared latent space for robust classification of either domain. Furthermore, age variations in facial images are studied in age-invariant face recognition yet remain unleveraged in adult-child expression classification. We take inspiration from multiple fields and propose deep adaptive FACial Expressions fusing BEtaMix SElected Landmark Features (FACE-BE-SELF) for adult-child expression classification. For the first time in the literature, a mixture of Beta distributions is used to decompose and select facial features based on correlations with expression, domain, and identity factors. We evaluate FACE-BE-SELF using 5-fold cross validation for two pairs of adult-child data sets. Our proposed FACE-BE-SELF approach outperforms transfer learning and other baseline domain adaptation methods in aligning latent representations of adult and child expressions.  ( 3 min )
    Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality
    arXiv:2207.04173v3 Announce Type: replace-cross Abstract: We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of H\'ajek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal.  ( 2 min )
    Learning Melanocytic Cell Masks from Adjacent Stained Tissue
    arXiv:2211.00646v4 Announce Type: replace-cross Abstract: Melanoma is one of the most aggressive forms of skin cancer, causing a large proportion of skin cancer deaths. However, melanoma diagnoses by pathologists shows low interrater reliability. As melanoma is a cancer of the melanocyte, there is a clear need to develop a melanocytic cell segmentation tool that is agnostic to pathologist variability and automates pixel-level annotation. Gigapixel-level pathologist labeling, however, is impractical. Herein, we propose a means to train deep neural networks for melanocytic cell segmentation from hematoxylin and eosin (H&E) stained sections and paired immunohistochemistry (IHC) of adjacent tissue sections, achieving a mean IOU of 0.64 despite imperfect ground-truth labels.  ( 2 min )
    Euclidean, Projective, Conformal: Choosing a Geometric Algebra for Equivariant Transformers
    arXiv:2311.04744v2 Announce Type: replace Abstract: The Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.  ( 2 min )
    BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
    arXiv:2403.09347v1 Announce Type: cross Abstract: Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named ``BurstAttention'' to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing 40% communication overheads and achieving 2 X speedup during training 32K sequence length on 8 X A100.  ( 2 min )
    Delta-AI: Local objectives for amortized inference in sparse graphical models
    arXiv:2310.02423v2 Announce Type: replace Abstract: We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $\Delta$-amortized inference ($\Delta$-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The $\Delta$-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate $\Delta$-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure.  ( 2 min )
    Amortizing intractable inference in large language models
    arXiv:2310.04363v2 Announce Type: replace Abstract: Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.  ( 2 min )
    How do Machine Learning Projects use Continuous Integration Practices? An Empirical Study on GitHub Actions
    arXiv:2403.09547v1 Announce Type: cross Abstract: Continuous Integration (CI) is a well-established practice in traditional software development, but its nuances in the domain of Machine Learning (ML) projects remain relatively unexplored. Given the distinctive nature of ML development, understanding how CI practices are adopted in this context is crucial for tailoring effective approaches. In this study, we conduct a comprehensive analysis of 185 open-source projects on GitHub (93 ML and 92 non-ML projects). Our investigation comprises both quantitative and qualitative dimensions, aiming to uncover differences in CI adoption between ML and non-ML projects. Our findings indicate that ML projects often require longer build durations, and medium-sized ML projects exhibit lower test coverage compared to non-ML projects. Moreover, small and medium-sized ML projects show a higher prevalence of increasing build duration trends compared to their non-ML counterparts. Additionally, our qualitative analysis illuminates the discussions around CI in both ML and non-ML projects, encompassing themes like CI Build Execution and Status, CI Testing, and CI Infrastructure. These insights shed light on the unique challenges faced by ML projects in adopting CI practices effectively.  ( 2 min )
    Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning
    arXiv:2403.08806v1 Announce Type: cross Abstract: Deepfake technology has raised concerns about the authenticity of digital content, necessitating the development of effective detection methods. However, the widespread availability of deepfakes has given rise to a new challenge in the form of adversarial attacks. Adversaries can manipulate deepfake videos with small, imperceptible perturbations that can deceive the detection models into producing incorrect outputs. To tackle this critical issue, we introduce Adversarial Feature Similarity Learning (AFSL), which integrates three fundamental deep feature learning paradigms. By optimizing the similarity between samples and weight vectors, our approach aims to distinguish between real and fake instances. Additionally, we aim to maximize the similarity between both adversarially perturbed examples and unperturbed examples, regardless of their real or fake nature. Moreover, we introduce a regularization technique that maximizes the dissimilarity between real and fake samples, ensuring a clear separation between these two categories. With extensive experiments on popular deepfake datasets, including FaceForensics++, FaceShifter, and DeeperForensics, the proposed method outperforms other standard adversarial training-based defense methods significantly. This further demonstrates the effectiveness of our approach to protecting deepfake detectors from adversarial attacks.  ( 2 min )
    LM2D: Lyrics- and Music-Driven Dance Synthesis
    arXiv:2403.09407v1 Announce Type: cross Abstract: Dance typically involves professional choreography with complex movements that follow a musical rhythm and can also be influenced by lyrical content. The integration of lyrics in addition to the auditory dimension, enriches the foundational tone and makes motion generation more amenable to its semantic meanings. However, existing dance synthesis methods tend to model motions only conditioned on audio signals. In this work, we make two contributions to bridge this gap. First, we propose LM2D, a novel probabilistic architecture that incorporates a multimodal diffusion model with consistency distillation, designed to create dance conditioned on both music and lyrics in one diffusion generation step. Second, we introduce the first 3D dance-motion dataset that encompasses both music and lyrics, obtained with pose estimation technologies. We evaluate our model against music-only baseline models with objective metrics and human evaluations, including dancers and choreographers. The results demonstrate LM2D is able to produce realistic and diverse dance matching both lyrics and music. A video summary can be accessed at: https://youtu.be/4XCgvYookvA.  ( 2 min )
    Expected flow networks in stochastic environments and two-player zero-sum games
    arXiv:2310.02779v2 Announce Type: replace Abstract: Generative flow networks (GFlowNets) are sequential sampling models trained to match a given distribution. GFlowNets have been successfully applied to various structured object generation tasks, sampling a diverse set of high-reward objects quickly. We propose expected flow networks (EFlowNets), which extend GFlowNets to stochastic environments. We show that EFlowNets outperform other GFlowNet formulations in stochastic tasks such as protein design. We then extend the concept of EFlowNets to adversarial environments, proposing adversarial flow networks (AFlowNets) for two-player zero-sum games. We show that AFlowNets learn to find above 80% of optimal moves in Connect-4 via self-play and outperform AlphaZero in tournaments.  ( 2 min )
    Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers
    arXiv:2309.10639v4 Announce Type: replace Abstract: In this paper, we explicitly determine local and global minimizers of the $\mathcal{L}^2$ cost function in underparametrized Deep Learning (DL) networks; our main goal is to shed light on their geometric structure and properties. We accomplish this by a direct construction, without invoking the gradient descent flow at any point of this work. We specifically consider $L$ hidden layers, a ReLU ramp activation function, an $\mathcal{L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input and output spaces $\mathbb{R}^Q$ with equal dimension $Q\geq1$, and hidden layers also defined on $\mathbb{R}^{Q}$; the training inputs are assumed to be sufficiently clustered. The training input size $N$ can be arbitrarily large - thus, we are considering the underparametrized regime. More general settings are left to future work. We construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function. In the context presented here, the concatenation of hidden layers of the DL network is reinterpreted as a recursive application of a {\em truncation map} which "curates" the training inputs by minimizing their noise to signal ratio.  ( 3 min )
    Multi-Step Deductive Reasoning Over Natural Language: An Empirical Study on Out-of-Distribution Generalisation
    arXiv:2207.14000v2 Announce Type: replace-cross Abstract: Combining deep learning with symbolic logic reasoning aims to capitalize on the success of both fields and is drawing increasing attention. Inspired by DeepLogic, an end-to-end model trained to perform inference on logic programs, we introduce IMA-GloVe-GA, an iterative neural inference network for multi-step reasoning expressed in natural language. In our model, reasoning is performed using an iterative memory neural network based on RNN with a gate attention mechanism. We evaluate IMA-GloVe-GA on three datasets: PARARULES, CONCEPTRULES V1 and CONCEPTRULES V2. Experimental results show DeepLogic with gate attention can achieve higher test accuracy than DeepLogic and other RNN baseline models. Our model achieves better out-of-distribution generalisation than RoBERTa-Large when the rules have been shuffled. Furthermore, to address the issue of unbalanced distribution of reasoning depths in the current multi-step reasoning datasets, we develop PARARULE-Plus, a large dataset with more examples that require deeper reasoning steps. Experimental results show that the addition of PARARULE-Plus can increase the model's performance on examples requiring deeper reasoning depths. The source code and data are available at https://github.com/Strong-AI-Lab/Multi-Step-Deductive-Reasoning-Over-Natural-Language.  ( 3 min )
    Precision and Recall Reject Curves for Classification
    arXiv:2308.08381v3 Announce Type: replace Abstract: For some classification scenarios, it is desirable to use only those classification instances that a trained model associates with a high certainty. To obtain such high-certainty instances, previous work has proposed accuracy-reject curves. Reject curves allow to evaluate and compare the performance of different certainty measures over a range of thresholds for accepting or rejecting classifications. However, the accuracy may not be the most suited evaluation metric for all applications, and instead precision or recall may be preferable. This is the case, for example, for data with imbalanced class distributions. We therefore propose reject curves that evaluate precision and recall, the recall-reject curve and the precision-reject curve. Using prototype-based classifiers from learning vector quantization, we first validate the proposed curves on artificial benchmark data against the accuracy reject curve as a baseline. We then show on imbalanced benchmarks and medical, real-world data that for these scenarios, the proposed precision- and recall-curves yield more accurate insights into classifier performance than accuracy reject curves.  ( 2 min )
    Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition
    arXiv:2403.09506v1 Announce Type: cross Abstract: Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video recognition and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video recognition, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch.  ( 2 min )
    User Identification via Free Roaming Eye Tracking Data
    arXiv:2403.09415v1 Announce Type: new Abstract: We present a new dataset of "free roaming" (FR) and "targeted roaming" (TR): a pool of 41 participants is asked to walk around a university campus (FR) or is asked to find a particular room within a library (TR). Eye movements are recorded using a commodity wearable eye tracker (Pupil Labs Neon at 200Hz). On this dataset we investigate the accuracy of user identification using a previously known machine learning pipeline where a Radial Basis Function Network (RBFN) is used as classifier. Our highest accuracies are 87.3% for FR and 89.4% for TR. This should be compared to 95.3% which is the (corresponding) highest accuracy we are aware of (achieved in a laboratory setting using the "RAN" stimulus of the BioEye 2015 competition dataset). To the best of our knowledge, our results are the first that study user identification in a non laboratory setting; such settings are often more feasible than laboratory settings and may include further advantages. The minimum duration of each recording is 263s for FR and 154s for TR. Our best accuracies are obtained when restricting to 120s and 140s for FR and TR respectively, always cut from the end of the trajectories (both for the training and testing sessions). If we cut the same length from the beginning, then accuracies are 12.2% lower for FR and around 6.4% lower for TR. On the full trajectories accuracies are lower by 5% and 52% for FR and TR. We also investigate the impact of including higher order velocity derivatives (such as acceleration, jerk, or jounce).  ( 3 min )
    Log-linear Guardedness and its Implications
    arXiv:2210.10012v4 Announce Type: replace Abstract: Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model \emph{can} be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.  ( 2 min )
    DPAR: Decoupled Graph Neural Networks with Node-Level Differential Privacy
    arXiv:2210.04442v2 Announce Type: replace Abstract: Graph Neural Networks (GNNs) have achieved great success in learning with graph-structured data. Privacy concerns have also been raised for the trained models which could expose the sensitive information of graphs including both node features and the structure information. In this paper, we aim to achieve node-level differential privacy (DP) for training GNNs so that a node and its edges are protected. Node DP is inherently difficult for GNNs because all direct and multi-hop neighbors participate in the calculation of gradients for each node via layer-wise message passing and there is no bound on how many direct and multi-hop neighbors a node can have, so existing DP methods will result in high privacy cost or poor utility due to high node sensitivity. We propose a \textbf{D}ecoupled GNN with Differentially \textbf{P}rivate \textbf{A}pproximate Personalized Page\textbf{R}ank (DPAR) for training GNNs with an enhanced privacy-utility tradeoff. The key idea is to decouple the feature projection and message passing via a DP PageRank algorithm which learns the structure information and uses the top-$K$ neighbors determined by the PageRank for feature aggregation. By capturing the most important neighbors for each node and avoiding the layer-wise message passing, it bounds the node sensitivity and achieves improved privacy-utility tradeoff compared to layer-wise perturbation based methods. We theoretically analyze the node DP guarantee for the two processes combined together and empirically demonstrate better utilities of DPAR with the same level of node DP compared with state-of-the-art methods.  ( 3 min )
    Kernelized Concept Erasure
    arXiv:2201.12191v5 Announce Type: replace Abstract: The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how those representations encode human-interpretable concepts is a fundamental problem. One prominent approach for the identification of concepts in neural representations is searching for a linear subspace whose erasure prevents the prediction of the concept from the representations. However, while many linear erasure algorithms are tractable and interpretable, neural networks do not necessarily represent concepts in a linear manner. To identify non-linearly encoded concepts, we propose a kernelization of a linear minimax game for concept erasure. We demonstrate that it is possible to prevent specific non-linear adversaries from predicting the concept. However, the protection does not transfer to different nonlinear adversaries. Therefore, exhaustively erasing a non-linearly encoded concept remains an open problem.  ( 2 min )
    NM-FlowGAN: Modeling sRGB Noise with a Hybrid Approach based on Normalizing Flows and Generative Adversarial Networks
    arXiv:2312.10112v2 Announce Type: replace-cross Abstract: Modeling and synthesizing real sRGB noise is crucial for various low-level vision tasks, such as building datasets for training image denoising systems. The distribution of real sRGB noise is highly complex and affected by a multitude of factors, making its accurate modeling extremely challenging. Therefore, recent studies have proposed methods that employ data-driven generative models, such as generative adversarial networks (GAN) and Normalizing Flows. These studies achieve more accurate modeling of sRGB noise compared to traditional noise modeling methods. However, there are performance limitations due to the inherent characteristics of each generative model. To address this issue, we propose NM-FlowGAN, a hybrid approach that exploits the strengths of both GAN and Normalizing Flows. We simultaneously employ a pixel-wise noise modeling network based on Normalizing Flows, and spatial correlation modeling networks based on GAN. In our experiments, our NM-FlowGAN outperforms other baselines on the sRGB noise synthesis task. Moreover, the denoising neural network, trained with synthesized image pairs from our model, also shows superior performance compared to other baselines. Our code is available at: \url{https://github.com/YoungJooHan/NM-FlowGAN}.  ( 3 min )
    Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness
    arXiv:2401.15127v2 Announce Type: replace-cross Abstract: Knowledge sharing about emerging threats is crucial in the rapidly advancing field of cybersecurity and forms the foundation of Cyber Threat Intelligence (CTI). In this context, Large Language Models are becoming increasingly significant in the field of cybersecurity, presenting a wide range of opportunities. This study surveys the performance of ChatGPT, GPT4all, Dolly, Stanford Alpaca, Alpaca-LoRA, Falcon, and Vicuna chatbots in binary classification and Named Entity Recognition (NER) tasks performed using Open Source INTelligence (OSINT). We utilize well-established data collected in previous research from Twitter to assess the competitiveness of these chatbots when compared to specialized models trained for those tasks. In binary classification experiments, Chatbot GPT-4 as a commercial model achieved an acceptable F1 score of 0.94, and the open-source GPT4all model achieved an F1 score of 0.90. However, concerning cybersecurity entity recognition, all evaluated chatbots have limitations and are less effective. This study demonstrates the capability of chatbots for OSINT binary classification and shows that they require further improvement in NER to effectively replace specially trained models. Our results shed light on the limitations of the LLM chatbots when compared to specialized models, and can help researchers improve chatbots technology with the objective to reduce the required effort to integrate machine learning in OSINT-based CTI tools.  ( 3 min )
    Plug and Play Active Learning for Object Detection
    arXiv:2211.11612v2 Announce Type: replace-cross Abstract: Annotating datasets for object detection is an expensive and time-consuming endeavor. To minimize this burden, active learning (AL) techniques are employed to select the most informative samples for annotation within a constrained "annotation budget". Traditional AL strategies typically rely on model uncertainty or sample diversity for query sampling, while more advanced methods have focused on developing AL-specific object detector architectures to enhance performance. However, these specialized approaches are not readily adaptable to different object detectors due to the significant engineering effort required for integration. To overcome this challenge, we introduce Plug and Play Active Learning (PPAL), a simple and effective AL strategy for object detection. PPAL is a two-stage method comprising uncertainty-based and diversity-based sampling phases. In the first stage, our Difficulty Calibrated Uncertainty Sampling leverage a category-wise difficulty coefficient that combines both classification and localisation difficulties to re-weight instance uncertainties, from which we sample a candidate pool for the subsequent diversity-based sampling. In the second stage, we propose Category Conditioned Matching Similarity to better compute the similarities of multi-instance images as ensembles of their instance similarities, which is used by the k-Means++ algorithm to sample the final AL queries. PPAL makes no change to model architectures or detector training pipelines; hence it can be easily generalized to different object detectors. We benchmark PPAL on the MS-COCO and Pascal VOC datasets using different detector architectures and show that our method outperforms prior work by a large margin. Code is available at https://github.com/ChenhongyiYang/PPAL  ( 3 min )
    COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
    arXiv:2211.16648v2 Announce Type: replace-cross Abstract: Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with case studies of training large models on cluster configurations of variable compute, memory, and network resources. Our case studies demonstrate COMET's utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters. To illustrate, cluster configuration comparisons identify performance differences of up to 7.7x and highlight performance optimization opportunities of up to 1.4x when employing memory expansion as an optimization technique.  ( 3 min )
    More than words: Advancements and challenges in speech recognition for singing
    arXiv:2403.09298v1 Announce Type: cross Abstract: This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.  ( 2 min )
    Architectural Implications of Neural Network Inference for High Data-Rate, Low-Latency Scientific Applications
    arXiv:2403.08980v1 Announce Type: new Abstract: With more scientific fields relying on neural networks (NNs) to process data incoming at extreme throughputs and latencies, it is crucial to develop NNs with all their parameters stored on-chip. In many of these applications, there is not enough time to go off-chip and retrieve weights. Even more so, off-chip memory such as DRAM does not have the bandwidth required to process these NNs as fast as the data is being produced (e.g., every 25 ns). As such, these extreme latency and bandwidth requirements have architectural implications for the hardware intended to run these NNs: 1) all NN parameters must fit on-chip, and 2) codesigning custom/reconfigurable logic is often required to meet these latency and bandwidth constraints. In our work, we show that many scientific NN applications must run fully on chip, in the extreme case requiring a custom chip to meet such stringent constraints.  ( 2 min )
    Graph Neural Architecture Search with GPT-4
    arXiv:2310.01436v2 Announce Type: replace Abstract: Graph Neural Architecture Search (GNAS) has shown promising results in automatically designing graph neural networks. However, GNAS still requires intensive human labor with rich domain knowledge to design the search space and search strategy. In this paper, we integrate GPT-4 into GNAS and propose a new GPT-4 based Graph Neural Architecture Search method (GPT4GNAS for short). The basic idea of our method is to design a new class of prompts for GPT-4 to guide GPT-4 toward the generative task of graph neural architectures. The prompts consist of descriptions of the search space, search strategy, and search feedback of GNAS. By iteratively running GPT-4 with the prompts, GPT4GNAS generates more accurate graph neural networks with fast convergence. Experimental results show that embedding GPT-4 into GNAS outperforms the state-of-the-art GNAS methods.  ( 2 min )
    Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions
    arXiv:2310.07174v2 Announce Type: replace Abstract: Sorting is a fundamental operation of all computer systems, having been a long-standing significant research topic. Beyond the problem formulation of traditional sorting algorithms, we consider sorting problems for more abstract yet expressive inputs, e.g., multi-digit images and image fragments, through a neural sorting network. To learn a mapping from a high-dimensional input to an ordinal variable, the differentiability of sorting networks needs to be guaranteed. In this paper we define a softening error by a differentiable swap function, and develop an error-free swap function that holds a non-decreasing condition and differentiability. Furthermore, a permutation-equivariant Transformer network with multi-head attention is adopted to capture dependency between given inputs and also leverage its model capacity with self-attention. Experiments on diverse sorting benchmarks show that our methods perform better than or comparable to baseline methods.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning
    arXiv:2103.06712v4 Announce Type: replace-cross Abstract: Quantum machine learning -- and specifically Variational Quantum Algorithms (VQAs) -- offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for VQAs. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 3 min )
    MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
    arXiv:2403.09611v1 Announce Type: cross Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.  ( 3 min )
    Assessing the Impact of Sequence Length Learning on Classification Tasks for Transformer Encoder Models
    arXiv:2212.08399v2 Announce Type: replace Abstract: Classification algorithms using Transformer architectures can be affected by the sequence length learning problem whenever observations from different classes have a different length distribution. This problem causes models to use sequence length as a predictive feature instead of relying on important textual information. Although most public datasets are not affected by this problem, privately owned corpora for fields such as medicine and insurance may carry this data bias. The exploitation of this sequence length feature poses challenges throughout the value chain as these machine learning models can be used in critical applications. In this paper, we empirically expose this problem and present approaches to minimize its impacts.  ( 2 min )
    An AI-Driven Approach to Wind Turbine Bearing Fault Diagnosis from Acoustic Signals
    arXiv:2403.09030v1 Announce Type: cross Abstract: This study aimed to develop a deep learning model for the classification of bearing faults in wind turbine generators from acoustic signals. A convolutional LSTM model was successfully constructed and trained by using audio data from five predefined fault types for both training and validation. To create the dataset, raw audio signal data was collected and processed in frames to capture time and frequency domain information. The model exhibited outstanding accuracy on training samples and demonstrated excellent generalization ability during validation, indicating its proficiency of generalization capability. On the test samples, the model achieved remarkable classification performance, with an overall accuracy exceeding 99.5%, and a false positive rate of less than 1% for normal status. The findings of this study provide essential support for the diagnosis and maintenance of bearing faults in wind turbine generators, with the potential to enhance the reliability and efficiency of wind power generation.  ( 2 min )
    Video Face Re-Aging: Toward Temporally Consistent Face Re-Aging
    arXiv:2311.11642v3 Announce Type: replace-cross Abstract: Video face re-aging deals with altering the apparent age of a person to the target age in videos. This problem is challenging due to the lack of paired video datasets maintaining temporal consistency in identity and age. Most re-aging methods process each image individually without considering the temporal consistency of videos. While some existing works address the issue of temporal coherence through video facial attribute manipulation in latent space, they often fail to deliver satisfactory performance in age transformation. To tackle the issues, we propose (1) a novel synthetic video dataset that features subjects across a diverse range of age groups; (2) a baseline architecture designed to validate the effectiveness of our proposed dataset, and (3) the development of novel metrics tailored explicitly for evaluating the temporal consistency of video re-aging techniques. Our comprehensive experiments on public datasets, including VFHQ and CelebA-HQ, show that our method outperforms existing approaches in age transformation accuracy and temporal consistency. Notably, in user studies, our method was preferred for temporal consistency by 48.1\% of participants for the older direction and by 39.3\% for the younger direction.  ( 3 min )
    Era Splitting -- Invariant Learning for Decision Trees
    arXiv:2309.14496v4 Announce Type: replace Abstract: Real-life machine learning problems exhibit distributional shifts in the data from one time to another or from one place to another. This behavior is beyond the scope of the traditional empirical risk minimization paradigm, which assumes i.i.d. distribution of data over time and across locations. The emerging field of out-of-distribution (OOD) generalization addresses this reality with new theory and algorithms which incorporate environmental, or era-wise information into the algorithms. So far, most research has been focused on linear models and/or neural networks. In this research we develop two new splitting criteria for decision trees, which allow us to apply ideas from OOD generalization research to decision tree models, namely, gradient boosting decision trees (GBDT). The new splitting criteria use era-wise information associated with the data to grow tree-based models that are optimal across all disjoint eras in the data, instead of optimal over the entire data set pooled together, which is the default setting. In this paper, two new splitting criteria are defined and analyzed theoretically. Effectiveness is tested on four experiments, ranging from simple, synthetic to complex, real-world applications. In particular we cast the OOD domain-adaptation problem in the context of financial markets, where the new models out-perform state-of-the-art GBDT models on the Numerai data set. The new criteria are incorporated into the Scikit-Learn code base and made freely available online.  ( 3 min )
    Most discriminative stimuli for functional cell type clustering
    arXiv:2401.05342v2 Announce Type: replace-cross Abstract: Identifying cell types and understanding their functional properties is crucial for unraveling the mechanisms underlying perception and cognition. In the retina, functional types can be identified by carefully selected stimuli, but this requires expert domain knowledge and biases the procedure towards previously known cell types. In the visual cortex, it is still unknown what functional types exist and how to identify them. Thus, for unbiased identification of the functional cell types in retina and visual cortex, new approaches are needed. Here we propose an optimization-based clustering approach using deep predictive models to obtain functional clusters of neurons using Most Discriminative Stimuli (MDS). Our approach alternates between stimulus optimization with cluster reassignment akin to an expectation-maximization algorithm. The algorithm recovers functional clusters in mouse retina, marmoset retina and macaque visual area V4. This demonstrates that our approach can successfully find discriminative stimuli across species, stages of the visual system and recording techniques. The resulting most discriminative stimuli can be used to assign functional cell types fast and on the fly, without the need to train complex predictive models or show a large natural scene dataset, paving the way for experiments that were previously limited by experimental time. Crucially, MDS are interpretable: they visualize the distinctive stimulus patterns that most unambiguously identify a specific type of neuron.  ( 3 min )
    Algorithmic syntactic causal identification
    arXiv:2403.09580v1 Announce Type: cross Abstract: Causal identification in causal Bayes nets (CBNs) is an important tool in causal inference allowing the derivation of interventional distributions from observational distributions where this is possible in principle. However, most existing formulations of causal identification using techniques such as d-separation and do-calculus are expressed within the mathematical language of classical probability theory on CBNs. However, there are many causal settings where probability theory and hence current causal identification techniques are inapplicable such as relational databases, dataflow programs such as hardware description languages, distributed systems and most modern machine learning algorithms. We show that this restriction can be lifted by replacing the use of classical probability theory with the alternative axiomatic foundation of symmetric monoidal categories. In this alternative axiomatization, we show how an unambiguous and clean distinction can be drawn between the general syntax of causal models and any specific semantic implementation of that causal model. This allows a purely syntactic algorithmic description of general causal identification by a translation of recent formulations of the general ID algorithm through fixing. Our description is given entirely in terms of the non-parametric ADMG structure specifying a causal model and the algebraic signature of the corresponding monoidal category, to which a sequence of manipulations is then applied so as to arrive at a modified monoidal category in which the desired, purely syntactic interventional causal model, is obtained. We use this idea to derive purely syntactic analogues of classical back-door and front-door causal adjustment, and illustrate an application to a more complex causal model.  ( 3 min )
    Safe and Generalized end-to-end Autonomous Driving System with Reinforcement Learning and Demonstrations
    arXiv:2401.11792v5 Announce Type: replace-cross Abstract: An intelligent driving system should be capable of dynamically formulating appropriate driving strategies based on the current environment and vehicle status, while ensuring the security and reliability of the system. However, existing methods based on reinforcement learning and imitation learning suffer from low safety, poor generalization, and inefficient sampling. Additionally, they cannot accurately predict future driving trajectories, and the accurate prediction of future driving trajectories is a precondition for making optimal decisions. To solve these problems, in this paper, we introduce a Safe and Generalized end-to-end Autonomous Driving System (SGADS) for complex and various scenarios. Our SGADS incorporates variational inference with normalizing flows, enabling the intelligent vehicle to accurately predict future driving trajectories. Moreover, we propose the formulation of robust safety constraints. Furthermore, we combine reinforcement learning with demonstrations to augment search process of the agent. The experimental results demonstrate that our SGADS can significantly improve safety performance, exhibit strong generalization, and enhance the training efficiency of intelligent vehicles in complex urban scenarios compared to existing methods.  ( 2 min )
    EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation
    arXiv:2309.03244v2 Announce Type: replace-cross Abstract: We introduce EGIC, an enhanced generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. EGIC is based on two novel building blocks: i) OASIS-C, a conditional pre-trained semantic segmentation-guided discriminator, which provides both spatially and semantically-aware gradient feedback to the generator, conditioned on the latent image distribution, and ii) Output Residual Prediction (ORP), a retrofit solution for multi-realism image compression that allows control over the synthesis process by adjusting the impact of the residual between an MSE-optimized and GAN-optimized decoder output on the GAN-based reconstruction. Together, EGIC forms a powerful codec, outperforming state-of-the-art diffusion and GAN-based methods (e.g., HiFiC, MS-ILLM, and DIRAC-100), while performing almost on par with VTM-20.0 on the distortion end. EGIC is simple to implement, very lightweight, and provides excellent interpolation characteristics, which makes it a promising candidate for practical applications targeting the low bit range.  ( 2 min )
    Vanishing Gradients in Reinforcement Finetuning of Language Models
    arXiv:2310.20703v3 Announce Type: replace Abstract: Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms. This work identifies a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful for inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.  ( 3 min )
    Learning invariant representations of time-homogeneous stochastic dynamical systems
    arXiv:2307.09912v3 Announce Type: replace Abstract: We consider the general class of time-homogeneous stochastic dynamical systems, both discrete and continuous, and study the problem of learning a representation of the state that faithfully captures its dynamics. This is instrumental to learning the transfer operator or the generator of the system, which in turn can be used for numerous tasks, such as forecasting and interpreting the system dynamics. We show that the search for a good representation can be cast as an optimization problem over neural networks. Our approach is supported by recent results in statistical learning theory, highlighting the role of approximation error and metric distortion in the learning problem. The objective function we propose is associated with projection operators from the representation space to the data space, overcomes metric distortion, and can be empirically estimated from data. In the discrete-time setting, we further derive a relaxed objective function that is differentiable and numerically well-conditioned. We compare our method against state-of-the-art approaches on different datasets, showing better performance across the board.  ( 2 min )
    Learning New Tasks from a Few Examples with Soft-Label Prototypes
    arXiv:2210.17437v3 Announce Type: replace Abstract: Existing approaches to few-shot learning in NLP rely on large language models and fine-tuning of these to generalise on out-of-distribution data. In this work, we propose a simple yet powerful approach to "extreme" few-shot learning, wherein models are exposed to as little as 4 examples per class, based on soft-label prototypes that collectively capture the distribution of different classes across the input domain space. Inspired by previous work (Sucholutsky et al., 2021) on univariate or simple multivariate (synthetic) data, we propose a novel approach that is effective on large, high-dimensional and real-world datasets. We learn soft-label prototypes within a neural framework (DeepSLP) and we experimentally demonstrate that it achieves superior performance on 31/48 tested tasks and few-shot settings while closely matching the performance of strong baselines on the rest. We focus on learning previously unseen NLP tasks from very few examples (4, 8, 16) per label and present an in-depth analysis of the effectiveness of our approach.  ( 2 min )
    SPI-GAN: Denoising Diffusion GANs with Straight-Path Interpolations
    arXiv:2206.14464v3 Announce Type: replace Abstract: Score-based generative models (SGMs) show the state-of-the-art sampling quality and diversity. However, their training/sampling complexity is notoriously high due to the highly complicated forward/reverse processes, so they are not suitable for resource-limited settings. To solving this problem, learning a simpler process is gathering much attention currently. We present an enhanced GAN-based denoising method, called SPI-GAN, using our proposed straight-path interpolation definition. To this end, we propose a GAN architecture i) denoising through the straight-path and ii) characterized by a continuous mapping neural network for imitating the denoising path. This approach drastically reduces the sampling time while achieving as high sampling quality and diversity as SGMs. As a result, SPI-GAN is one of the best-balanced models among the sampling quality, diversity, and time for CIFAR-10, and CelebA-HQ-256.  ( 2 min )
    Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models
    arXiv:2403.09635v1 Announce Type: cross Abstract: In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with 100s of layers. We find that transformer models could be much deeper - our deep models with fewer parameters outperform shallow models in Language Modeling, Speech Translation, and Image Classification, across Encoder-only, Decoder-only and Encoder-Decoder variants, for both Pre-LN and Post-LN transformers, for multiple datasets and model sizes. These improvements also translate into improved performance on downstream Question Answering tasks and improved robustness for image classification.  ( 2 min )
    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
    arXiv:2403.09629v1 Announce Type: cross Abstract: When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%$\rightarrow$10.9%) and CommonsenseQA (36.3%$\rightarrow$47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.  ( 3 min )
    Compute-first optical detection for noise-resilient visual perception
    arXiv:2403.09612v1 Announce Type: cross Abstract: In the context of visual perception, the optical signal from a scene is transferred into the electronic domain by detectors in the form of image data, which are then processed for the extraction of visual information. In noisy and weak-signal environments such as thermal imaging for night vision applications, however, the performance of neural computing tasks faces a significant bottleneck due to the inherent degradation of data quality upon noisy detection. Here, we propose a concept of optical signal processing before detection to address this issue. We demonstrate that spatially redistributing optical signals through a properly designed linear transformer can enhance the detection noise resilience of visual perception tasks, as benchmarked with the MNIST classification. Our idea is supported by a quantitative analysis detailing the relationship between signal concentration and noise robustness, as well as its practical implementation in an incoherent imaging system. This compute-first detection scheme can pave the way for advancing infrared machine vision technologies widely used for industrial and defense applications.  ( 2 min )
    Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation
    arXiv:2403.09625v1 Announce Type: cross Abstract: Recent years have witnessed the strong power of 3D generation models, which offer a new level of creative flexibility by allowing users to guide the 3D content generation process through a single image or natural language. However, it remains challenging for existing 3D generation methods to create subject-driven 3D content across diverse prompts. In this paper, we introduce a novel 3D customization method, dubbed Make-Your-3D that can personalize high-fidelity and consistent 3D content from only a single image of a subject with text description within 5 minutes. Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject. Specifically, we design a co-evolution framework to reduce the variance of distributions, where each model undergoes a process of learning from the other through identity-aware optimization and subject-prior optimization, respectively. Extensive experiments demonstrate that our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in subject image.  ( 2 min )
    Optimistic Verifiable Training by Controlling Hardware Nondeterminism
    arXiv:2403.09603v1 Announce Type: cross Abstract: The increasing compute demands of AI systems has led to the emergence of services that train models on behalf of clients lacking necessary resources. However, ensuring correctness of training and guarding against potential training-time attacks, such as data poisoning, poses challenges. Existing works on verifiable training largely fall into two classes: proof-based systems, which struggle to scale due to requiring cryptographic techniques, and "optimistic" methods that consider a trusted third-party auditor who replicates the training process. A key challenge with the latter is that hardware nondeterminism between GPU types during training prevents an auditor from replicating the training process exactly, and such schemes are therefore non-robust. We propose a method that combines training in a higher precision than the target model, rounding after intermediate computation steps, and storing rounding decisions based on an adaptive thresholding procedure, to successfully control for nondeterminism. Across three different NVIDIA GPUs (A40, Titan XP, RTX 2080 Ti), we achieve exact training replication at FP32 precision for both full-training and fine-tuning of ResNet-50 (23M) and GPT-2 (117M) models. Our verifiable training scheme significantly decreases the storage and time costs compared to proof-based systems.  ( 2 min )
    Mixture of Mixups for Multi-label Classification of Rare Anuran Sounds
    arXiv:2403.09598v1 Announce Type: cross Abstract: Multi-label imbalanced classification poses a significant challenge in machine learning, particularly evident in bioacoustics where animal sounds often co-occur, and certain sounds are much less frequent than others. This paper focuses on the specific case of classifying anuran species sounds using the dataset AnuraSet, that contains both class imbalance and multi-label examples. To address these challenges, we introduce Mixture of Mixups (Mix2), a framework that leverages mixing regularization methods Mixup, Manifold Mixup, and MultiMix. Experimental results show that these methods, individually, may lead to suboptimal results; however, when applied randomly, with one selected at each training iteration, they prove effective in addressing the mentioned challenges, particularly for rare classes with few occurrences. Further analysis reveals that Mix2 is also proficient in classifying sounds across various levels of class co-occurrences.  ( 2 min )
    uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
    arXiv:2403.09579v1 Announce Type: cross Abstract: Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both input and virtual label spaces. Experiments in low/few-shot settings demonstrate that \modelname achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE  ( 2 min )
    Are you a robot? Detecting Autonomous Vehicles from Behavior Analysis
    arXiv:2403.09571v1 Announce Type: cross Abstract: The tremendous hype around autonomous driving is eagerly calling for emerging and novel technologies to support advanced mobility use cases. As car manufactures keep developing SAE level 3+ systems to improve the safety and comfort of passengers, traffic authorities need to establish new procedures to manage the transition from human-driven to fully-autonomous vehicles while providing a feedback-loop mechanism to fine-tune envisioned autonomous systems. Thus, a way to automatically profile autonomous vehicles and differentiate those from human-driven ones is a must. In this paper, we present a fully-fledged framework that monitors active vehicles using camera images and state information in order to determine whether vehicles are autonomous, without requiring any active notification from the vehicles themselves. Essentially, it builds on the cooperation among vehicles, which share their data acquired on the road feeding a machine learning model to identify autonomous cars. We extensively tested our solution and created the NexusStreet dataset, by means of the CARLA simulator, employing an autonomous driving control agent and a steering wheel maneuvered by licensed drivers. Experiments show it is possible to discriminate the two behaviors by analyzing video clips with an accuracy of 80%, which improves up to 93% when the target state information is available. Lastly, we deliberately degraded the state to observe how the framework performs under non-ideal data collection conditions.  ( 2 min )
    Logits of API-Protected LLMs Leak Proprietary Information
    arXiv:2403.09539v1 Announce Type: cross Abstract: The commercialization of large language models (LLMs) has led to the common practice of high-level API-only access to proprietary models. In this work, we show that even with a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1,000 for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We show that this lends itself to a model image or a model signature which unlocks several capabilities with affordable cost: efficiently discovering the LLM's hidden size, obtaining full-vocabulary outputs, detecting and disambiguating different model updates, identifying the source LLM given a single full LLM output, and even estimating the output layer parameters. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.  ( 2 min )
    Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information
    arXiv:2403.09516v1 Announce Type: cross Abstract: Mitigating social biases typically requires identifying the social groups associated with each data sample. In this paper, we present DAFair, a novel approach to address social bias in language models. Unlike traditional methods that rely on explicit demographic labels, our approach does not require any such information. Instead, we leverage predefined prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias in the model's representations. Our empirical results across two tasks and two models demonstrate the effectiveness of our method compared to previous approaches that do not rely on labeled data. Moreover, with limited demographic-annotated data, our approach outperforms common debiasing approaches.  ( 2 min )
    On STPA for Distributed Development of Safe Autonomous Driving: An Interview Study
    arXiv:2403.09509v1 Announce Type: cross Abstract: Safety analysis is used to identify hazards and build knowledge during the design phase of safety-relevant functions. This is especially true for complex AI-enabled and software intensive systems such as Autonomous Drive (AD). System-Theoretic Process Analysis (STPA) is a novel method applied in safety-related fields like defense and aerospace, which is also becoming popular in the automotive industry. However, STPA assumes prerequisites that are not fully valid in the automotive system engineering with distributed system development and multi-abstraction design levels. This would inhibit software developers from using STPA to analyze their software as part of a bigger system, resulting in a lack of traceability. This can be seen as a maintainability challenge in continuous development and deployment (DevOps). In this paper, STPA's different guidelines for the automotive industry, e.g. J31887/ISO21448/STPA handbook, are firstly compared to assess their applicability to the distributed development of complex AI-enabled systems like AD. Further, an approach to overcome the challenges of using STPA in a multi-level design context is proposed. By conducting an interview study with automotive industry experts for the development of AD, the challenges are validated and the effectiveness of the proposed approach is evaluated.  ( 3 min )
    VIRUS-NeRF -- Vision, InfraRed and UltraSonic based Neural Radiance Fields
    arXiv:2403.09477v1 Announce Type: cross Abstract: Autonomous mobile robots are an increasingly integral part of modern factory and warehouse operations. Obstacle detection, avoidance and path planning are critical safety-relevant tasks, which are often solved using expensive LiDAR sensors and depth cameras. We propose to use cost-effective low-resolution ranging sensors, such as ultrasonic and infrared time-of-flight sensors by developing VIRUS-NeRF - Vision, InfraRed, and UltraSonic based Neural Radiance Fields. Building upon Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Instant-NGP), VIRUS-NeRF incorporates depth measurements from ultrasonic and infrared sensors and utilizes them to update the occupancy grid used for ray marching. Experimental evaluation in 2D demonstrates that VIRUS-NeRF achieves comparable mapping performance to LiDAR point clouds regarding coverage. Notably, in small environments, its accuracy aligns with that of LiDAR measurements, while in larger ones, it is bounded by the utilized ultrasonic sensors. An in-depth ablation study reveals that adding ultrasonic and infrared sensors is highly effective when dealing with sparse data and low view variation. Further, the proposed occupancy grid of VIRUS-NeRF improves the mapping capabilities and increases the training speed by 46% compared to Instant-NGP. Overall, VIRUS-NeRF presents a promising approach for cost-effective local mapping in mobile robotics, with potential applications in safety and navigation tasks. The code can be found at https://github.com/ethz-asl/virus nerf.  ( 2 min )
    Outlier Robust Multivariate Polynomial Regression
    arXiv:2403.09465v1 Announce Type: cross Abstract: We study the problem of robust multivariate polynomial regression: let $p\colon\mathbb{R}^n\to\mathbb{R}$ be an unknown $n$-variate polynomial of degree at most $d$ in each variable. We are given as input a set of random samples $(\mathbf{x}_i,y_i) \in [-1,1]^n \times \mathbb{R}$ that are noisy versions of $(\mathbf{x}_i,p(\mathbf{x}_i))$. More precisely, each $\mathbf{x}_i$ is sampled independently from some distribution $\chi$ on $[-1,1]^n$, and for each $i$ independently, $y_i$ is arbitrary (i.e., an outlier) with probability at most $\rho < 1/2$, and otherwise satisfies $|y_i-p(\mathbf{x}_i)|\leq\sigma$. The goal is to output a polynomial $\hat{p}$, of degree at most $d$ in each variable, within an $\ell_\infty$-distance of at most $O(\sigma)$ from $p$. Kane, Karmalkar, and Price [FOCS'17] solved this problem for $n=1$. We generalize their results to the $n$-variate setting, showing an algorithm that achieves a sample complexity of $O_n(d^n\log d)$, where the hidden constant depends on $n$, if $\chi$ is the $n$-dimensional Chebyshev distribution. The sample complexity is $O_n(d^{2n}\log d)$, if the samples are drawn from the uniform distribution instead. The approximation error is guaranteed to be at most $O(\sigma)$, and the run-time depends on $\log(1/\sigma)$. In the setting where each $\mathbf{x}_i$ and $y_i$ are known up to $N$ bits of precision, the run-time's dependence on $N$ is linear. We also show that our sample complexities are optimal in terms of $d^n$. Furthermore, we show that it is possible to have the run-time be independent of $1/\sigma$, at the cost of a higher sample complexity.  ( 2 min )
    Variational Inference with Sequential Sample-Average Approximations
    arXiv:2403.09429v1 Announce Type: cross Abstract: We present variational inference with sequential sample-average approximation (VISA), a method for approximate inference in computationally intensive models, such as those based on numerical simulations. VISA extends importance-weighted forward-KL variational inference by employing a sequence of sample-average approximations, which are considered valid inside a trust region. This makes it possible to reuse model evaluations across multiple gradient steps, thereby reducing computational cost. We perform experiments on high-dimensional Gaussians, Lotka-Volterra dynamics, and a Pickover attractor, which demonstrate that VISA can achieve comparable approximation accuracy to standard importance-weighted forward-KL variational inference with computational savings of a factor two or more for conservatively chosen learning rates.  ( 2 min )
    Learning to optimize with convergence guarantees using nonlinear system theory
    arXiv:2403.09389v1 Announce Type: cross Abstract: The increasing reliance on numerical methods for controlling dynamical systems and training machine learning models underscores the need to devise algorithms that dependably and efficiently navigate complex optimization landscapes. Classical gradient descent methods offer strong theoretical guarantees for convex problems; however, they demand meticulous hyperparameter tuning for non-convex ones. The emerging paradigm of learning to optimize (L2O) automates the discovery of algorithms with optimized performance leveraging learning models and data - yet, it lacks a theoretical framework to analyze convergence and robustness of the learned algorithms. In this paper, we fill this gap by harnessing nonlinear system theory. Specifically, we propose an unconstrained parametrization of all convergent algorithms for smooth non-convex objective functions. Notably, our framework is directly compatible with automatic differentiation tools, ensuring convergence by design while learning to optimize.  ( 2 min )
    Pantypes: Diverse Representatives for Self-Explainable Models
    arXiv:2403.09383v1 Announce Type: cross Abstract: Prototypical self-explainable classifiers have emerged to meet the growing demand for interpretable AI systems. These classifiers are designed to incorporate high transparency in their decisions by basing inference on similarity with learned prototypical objects. While these models are designed with diversity in mind, the learned prototypes often do not sufficiently represent all aspects of the input distribution, particularly those in low density regions. Such lack of sufficient data representation, known as representation bias, has been associated with various detrimental properties related to machine learning diversity and fairness. In light of this, we introduce pantypes, a new family of prototypical objects designed to capture the full diversity of the input distribution through a sparse set of objects. We show that pantypes can empower prototypical self-explainable models by occupying divergent regions of the latent space and thus fostering high diversity, interpretability and fairness.  ( 2 min )
    A Hierarchical Fused Quantum Fuzzy Neural Network for Image Classification
    arXiv:2403.09318v1 Announce Type: cross Abstract: Neural network is a powerful learning paradigm for data feature learning in the era of big data. However, most neural network models are deterministic models that ignore the uncertainty of data. Fuzzy neural networks are proposed to address this problem. FDNN is a hierarchical deep neural network that derives information from both fuzzy and neural representations, the representations are then fused to form representation to be classified. FDNN perform well on uncertain data classification tasks. In this paper, we proposed a novel hierarchical fused quantum fuzzy neural network (HQFNN). Different from classical FDNN, HQFNN uses quantum neural networks to learn fuzzy membership functions in fuzzy neural network. We conducted simulated experiment on two types of datasets (Dirty-MNIST and 15-Scene), the results show that the proposed model can outperform several existing methods. In addition, we demonstrate the robustness of the proposed quantum circuit.  ( 2 min )
    StainFuser: Controlling Diffusion for Faster Neural Style Transfer in Multi-Gigapixel Histology Images
    arXiv:2403.09302v1 Announce Type: cross Abstract: Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of stains used to highlight cellular components in the images. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Diffusion architecture, eliminating the need for handcrafted color components. With this method, we curate SPI-2M the largest stain normalization dataset to date of over 2 million histology images with neural style transfer for high-quality transformations. Trained on this data, StainFuser outperforms current state-of-the-art GAN and handcrafted methods in terms of the quality of normalized images. Additionally, compared to existing approaches, it improves the performance of nuclei instance segmentation and classification models when used as a test time augmentation method on the challenging CoNIC dataset. Finally, we apply StainFuser on multi-gigapixel Whole Slide Images (WSIs) and demonstrate improved performance in terms of computational efficiency, image quality and consistency across tiles over current methods.  ( 2 min )
    Deep Limit Order Book Forecasting
    arXiv:2403.09267v1 Announce Type: cross Abstract: We exploit cutting-edge deep learning methodologies to explore the predictability of high-frequency Limit Order Book mid-price changes for a heterogeneous set of stocks traded on the NASDAQ exchange. In so doing, we release `LOBFrame', an open-source code base, to efficiently process large-scale Limit Order Book data and quantitatively assess state-of-the-art deep learning models' forecasting capabilities. Our results are twofold. We demonstrate that the stocks' microstructural characteristics influence the efficacy of deep learning methods and that their high forecasting power does not necessarily correspond to actionable trading signals. We argue that traditional machine learning metrics fail to adequately assess the quality of forecasts in the Limit Order Book context. As an alternative, we propose an innovative operational framework that assesses predictions' practicality by focusing on the probability of accurately forecasting complete transactions. This work offers academics and practitioners an avenue to make informed and robust decisions on the application of deep learning techniques, their scope and limitations, effectively exploiting emergent statistical properties of the Limit Order Book.  ( 2 min )
    SELECTOR: Heterogeneous graph network with convolutional masked autoencoder for multimodal robust prediction of cancer survival
    arXiv:2403.09290v1 Announce Type: cross Abstract: Accurately predicting the survival rate of cancer patients is crucial for aiding clinicians in planning appropriate treatment, reducing cancer-related medical expenses, and significantly enhancing patients' quality of life. Multimodal prediction of cancer patient survival offers a more comprehensive and precise approach. However, existing methods still grapple with challenges related to missing multimodal data and information interaction within modalities. This paper introduces SELECTOR, a heterogeneous graph-aware network based on convolutional mask encoders for robust multimodal prediction of cancer patient survival. SELECTOR comprises feature edge reconstruction, convolutional mask encoder, feature cross-fusion, and multimodal survival prediction modules. Initially, we construct a multimodal heterogeneous graph and employ the meta-path method for feature edge reconstruction, ensuring comprehensive incorporation of feature information from graph edges and effective embedding of nodes. To mitigate the impact of missing features within the modality on prediction accuracy, we devised a convolutional masked autoencoder (CMAE) to process the heterogeneous graph post-feature reconstruction. Subsequently, the feature cross-fusion module facilitates communication between modalities, ensuring that output features encompass all features of the modality and relevant information from other modalities. Extensive experiments and analysis on six cancer datasets from TCGA demonstrate that our method significantly outperforms state-of-the-art methods in both modality-missing and intra-modality information-confirmed cases. Our codes are made available at https://github.com/panliangrui/Selector.  ( 3 min )
    LAN: Learning Adaptive Neighbors for Real-Time Insider Threat Detection
    arXiv:2403.09209v1 Announce Type: cross Abstract: Enterprises and organizations are faced with potential threats from insider employees that may lead to serious consequences. Previous studies on insider threat detection (ITD) mainly focus on detecting abnormal users or abnormal time periods (e.g., a week or a day). However, a user may have hundreds of thousands of activities in the log, and even within a day there may exist thousands of activities for a user, requiring a high investigation budget to verify abnormal users or activities given the detection results. On the other hand, existing works are mainly post-hoc methods rather than real-time detection, which can not report insider threats in time before they cause loss. In this paper, we conduct the first study towards real-time ITD at activity level, and present a fine-grained and efficient framework LAN. Specifically, LAN simultaneously learns the temporal dependencies within an activity sequence and the relationships between activities across sequences with graph structure learning. Moreover, to mitigate the data imbalance problem in ITD, we propose a novel hybrid prediction loss, which integrates self-supervision signals {from normal activities} and supervision signals from abnormal activities into a unified loss for anomaly detection. We evaluate the performance of LAN on two widely used datasets, i.e., CERT r4.2 and CERT r5.2. Extensive and comparative experiments demonstrate the superiority of LAN, outperforming 9 state-of-the-art baselines by at least 9.92% and 6.35% in AUC for real-time ITD on CERT r4.2 and r5.2, respectively. Moreover, LAN can be also applied to post-hoc ITD, surpassing 8 competitive baselines by at least 7.70% and 4.03% in AUC on two datasets. Finally, the ablation study, parameter analysis, and compatibility analysis evaluate the impact of each module and hyper-parameter in LAN.  ( 3 min )
    To Label or Not to Label: Hybrid Active Learning for Neural Machine Translation
    arXiv:2403.09259v1 Announce Type: cross Abstract: Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller representative subsets from unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select instances with the highest model uncertainty. Both approaches have limitations - diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose HUDS, a hybrid AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and subsequently stratifies them. It then clusters sentence embeddings within each stratum using k-MEANS and computes diversity scores by distance to the centroid. A weighted hybrid score that combines uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English datasets demonstrate the better performance of HUDS over other strong AL baselines. We analyze the sentence selection with HUDS and show that it prioritizes diverse instances having high model uncertainty for annotation in early AL iterations.  ( 2 min )
    Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
    arXiv:2403.09193v1 Announce Type: cross Abstract: Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over to image captioning, and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.  ( 3 min )
    Upper Bound of Bayesian Generalization Error in Partial Concept Bottleneck Model (CBM): Partial CBM outperforms naive CBM
    arXiv:2403.09206v1 Announce Type: cross Abstract: Concept Bottleneck Model (CBM) is a methods for explaining neural networks. In CBM, concepts which correspond to reasons of outputs are inserted in the last intermediate layer as observed values. It is expected that we can interpret the relationship between the output and concept similar to linear regression. However, this interpretation requires observing all concepts and decreases the generalization performance of neural networks. Partial CBM (PCBM), which uses partially observed concepts, has been devised to resolve these difficulties. Although some numerical experiments suggest that the generalization performance of PCBMs is almost as high as that of the original neural networks, the theoretical behavior of its generalization error has not been yet clarified since PCBM is singular statistical model. In this paper, we reveal the Bayesian generalization error in PCBM with a three-layered and linear architecture. The result indcates that the structure of partially observed concepts decreases the Bayesian generalization error compared with that of CBM (full-observed concepts).  ( 2 min )
    Generalized Relevance Learning Grassmann Quantization
    arXiv:2403.09183v1 Announce Type: cross Abstract: Due to advancements in digital cameras, it is easy to gather multiple images (or videos) from an object under different conditions. Therefore, image-set classification has attracted more attention, and different solutions were proposed to model them. A popular way to model image sets is subspaces, which form a manifold called the Grassmann manifold. In this contribution, we extend the application of Generalized Relevance Learning Vector Quantization to deal with Grassmann manifold. The proposed model returns a set of prototype subspaces and a relevance vector. While prototypes model typical behaviours within classes, the relevance factors specify the most discriminative principal vectors (or images) for the classification task. They both provide insights into the model's decisions by highlighting influential images and pixels for predictions. Moreover, due to learning prototypes, the model complexity of the new method during inference is independent of dataset size, unlike previous works. We applied it to several recognition tasks including handwritten digit recognition, face recognition, activity recognition, and object recognition. Experiments demonstrate that it outperforms previous works with lower complexity and can successfully model the variation, such as handwritten style or lighting conditions. Moreover, the presence of relevances makes the model robust to the selection of subspaces' dimensionality.  ( 2 min )
    Uncertainty Estimation in Multi-Agent Distributed Learning for AI-Enabled Edge Devices
    arXiv:2403.09141v1 Announce Type: cross Abstract: Initially considered as low-power units with limited autonomous processing, Edge IoT devices have seen a paradigm shift with the introduction of FPGAs and AI accelerators. This advancement has vastly amplified their computational capabilities, emphasizing the practicality of edge AI. Such progress introduces new challenges of optimizing AI tasks for the limitations of energy and network resources typical in Edge computing environments. Our study explores methods that enable distributed data processing through AI-enabled edge devices, enhancing collaborative learning capabilities. A key focus of our research is the challenge of determining confidence levels in learning outcomes, considering the spatial and temporal variability of data sets encountered by independent agents. To address this issue, we investigate the application of Bayesian neural networks, proposing a novel approach to manage uncertainty in distributed learning environments.  ( 2 min )
    AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning
    arXiv:2403.09113v1 Announce Type: cross Abstract: Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pretrained weights, has proven particularly effective. Nonetheless, LoRA's uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA.  ( 2 min )
    Randomized Principal Component Analysis for Hyperspectral Image Classification
    arXiv:2403.09117v1 Announce Type: cross Abstract: The high-dimensional feature space of the hyperspectral imagery poses major challenges to the processing and analysis of the hyperspectral data sets. In such a case, dimensionality reduction is necessary to decrease the computational complexity. The random projections open up new ways of dimensionality reduction, especially for large data sets. In this paper, the principal component analysis (PCA) and randomized principal component analysis (R-PCA) for the classification of hyperspectral images using support vector machines (SVM) and light gradient boosting machines (LightGBM) have been investigated. In this experimental research, the number of features was reduced to 20 and 30 for classification of two hyperspectral datasets (Indian Pines and Pavia University). The experimental results demonstrated that PCA outperformed R-PCA for SVM for both datasets, but received close accuracy values for LightGBM. The highest classification accuracies were obtained as 0.9925 and 0.9639 by LightGBM with original features for the Pavia University and Indian Pines, respectively.  ( 2 min )
    Virtual birefringence imaging and histological staining of amyloid deposits in label-free tissue using autofluorescence microscopy and deep learning
    arXiv:2403.09100v1 Announce Type: cross Abstract: Systemic amyloidosis is a group of diseases characterized by the deposition of misfolded proteins in various organs and tissues, leading to progressive organ dysfunction and failure. Congo red stain is the gold standard chemical stain for the visualization of amyloid deposits in tissue sections, as it forms complexes with the misfolded proteins and shows a birefringence pattern under polarized light microscopy. However, Congo red staining is tedious and costly to perform, and prone to false diagnoses due to variations in the amount of amyloid, staining quality and expert interpretation through manual examination of tissue under a polarization microscope. Here, we report the first demonstration of virtual birefringence imaging and virtual Congo red staining of label-free human tissue to show that a single trained neural network can rapidly transform autofluorescence images of label-free tissue sections into brightfield and polarized light microscopy equivalent images, matching the histochemically stained versions of the same samples. We demonstrate the efficacy of our method with blind testing and pathologist evaluations on cardiac tissue where the virtually stained images agreed well with the histochemically stained ground truth images. Our virtually stained polarization and brightfield images highlight amyloid birefringence patterns in a consistent, reproducible manner while mitigating diagnostic challenges due to variations in the quality of chemical staining and manual imaging processes as part of the clinical workflow.  ( 3 min )
    Dissipative Gradient Descent Ascent Method: A Control Theory Inspired Algorithm for Min-max Optimization
    arXiv:2403.09090v1 Announce Type: cross Abstract: Gradient Descent Ascent (GDA) methods for min-max optimization problems typically produce oscillatory behavior that can lead to instability, e.g., in bilinear settings. To address this problem, we introduce a dissipation term into the GDA updates to dampen these oscillations. The proposed Dissipative GDA (DGDA) method can be seen as performing standard GDA on a state-augmented and regularized saddle function that does not strictly introduce additional convexity/concavity. We theoretically show the linear convergence of DGDA in the bilinear and strongly convex-strongly concave settings and assess its performance by comparing DGDA with other methods such as GDA, Extra-Gradient (EG), and Optimistic GDA. Our findings demonstrate that DGDA surpasses these methods, achieving superior convergence rates. We support our claims with two numerical examples that showcase DGDA's effectiveness in solving saddle point problems.  ( 2 min )
    CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
    arXiv:2403.09032v1 Announce Type: cross Abstract: Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires assessing intricate textual LLMs' outputs. By relying on automated metrics and static analysis tools, existing benchmarks fail to assess nuances in user instructions and LLM outputs, highlighting the need for large-scale datasets and benchmarks for LLM preference alignment. In this paper, we introduce CodeUltraFeedback, a preference dataset of 10,000 complex instructions to tune and align LLMs to coding preferences through AI feedback. We generate responses to the instructions using a pool of 14 diverse LLMs, which we then annotate according to their alignment with five coding preferences using the LLM-as-a-Judge approach with GPT-3.5, producing both numerical and textual feedback. We also present CODAL-Bench, a benchmark for assessing LLM alignment with these coding preferences. Our results show that CodeLlama-7B-Instruct, aligned through reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO) using CodeUltraFeedback's AI feedback data, outperforms 34B LLMs on CODAL-Bench, validating the utility of CodeUltraFeedback for preference tuning. Furthermore, we show our DPO-aligned CodeLlama model improves functional correctness on HumanEval+ compared to the unaligned base model. Therefore, our contributions bridge the gap in preference tuning of LLMs for code and set the stage for further advancements in model alignment and RLAIF for code intelligence. Our code and data are available at https://github.com/martin-wey/CodeUltraFeedback.  ( 3 min )
    AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents
    arXiv:2403.08978v1 Announce Type: cross Abstract: The primary limitation of large language models (LLMs) is their restricted understanding of the world. This poses significant difficulties for LLM-based agents, particularly in domains where pre-trained LLMs lack sufficient knowledge. In this paper, we introduce a novel framework, called AutoGuide, that bridges the knowledge gap in pre-trained LLMs by leveraging implicit knowledge in offline experiences. Specifically, AutoGuide effectively extracts knowledge embedded in offline data by extracting a set of state-aware guidelines. Importantly, each state-aware guideline is expressed in concise natural language and follows a conditional structure, clearly describing the state where it is applicable. As such, the resulting guidelines enable a principled way to provide helpful knowledge pertinent to an agent's current decision-making process. We show that our approach outperforms competitive LLM-based baselines by a large margin in sequential decision-making benchmarks.  ( 2 min )
    The Full-scale Assembly Simulation Testbed (FAST) Dataset
    arXiv:2403.08969v1 Announce Type: cross Abstract: In recent years, numerous researchers have begun investigating how virtual reality (VR) tracking and interaction data can be used for a variety of machine learning purposes, including user identification, predicting cybersickness, and estimating learning gains. One constraint for this research area is the dearth of open datasets. In this paper, we present a new open dataset captured with our VR-based Full-scale Assembly Simulation Testbed (FAST). This dataset consists of data collected from 108 participants (50 females, 56 males, 2 non-binary) learning how to assemble two distinct full-scale structures in VR. In addition to explaining how the dataset was collected and describing the data included, we discuss how the dataset may be used by future researchers.  ( 2 min )
    Towards Model-Agnostic Posterior Approximation for Fast and Accurate Variational Autoencoders
    arXiv:2403.08941v1 Announce Type: cross Abstract: Inference for Variational Autoencoders (VAEs) consists of learning two models: (1) a generative model, which transforms a simple distribution over a latent space into the distribution over observed data, and (2) an inference model, which approximates the posterior of the latent codes given data. The two components are learned jointly via a lower bound to the generative model's log marginal likelihood. In early phases of joint training, the inference model poorly approximates the latent code posteriors. Recent work showed that this leads optimization to get stuck in local optima, negatively impacting the learned generative model. As such, recent work suggests ensuring a high-quality inference model via iterative training: maximizing the objective function relative to the inference model before every update to the generative model. Unfortunately, iterative training is inefficient, requiring heuristic criteria for reverting from iterative to joint training for speed. Here, we suggest an inference method that trains the generative and inference models independently. It approximates the posterior of the true model a priori; fixing this posterior approximation, we then maximize the lower bound relative to only the generative model. By conventional wisdom, this approach should rely on the true prior and likelihood of the true model to approximate its posterior (which are unknown). However, we show that we can compute a deterministic, model-agnostic posterior approximation (MAPA) of the true model's posterior. We then use MAPA to develop a proof-of-concept inference method. We present preliminary results on low-dimensional synthetic data that (1) MAPA captures the trend of the true posterior, and (2) our MAPA-based inference performs better density estimation with less computation than baselines. Lastly, we present a roadmap for scaling the MAPA-based inference method to high-dimensional data.  ( 3 min )
    Deep Learning Based Dynamics Identification and Linearization of Orbital Problems using Koopman Theory
    arXiv:2403.08965v1 Announce Type: cross Abstract: The study of the Two-Body and Circular Restricted Three-Body Problems in the field of aerospace engineering and sciences is deeply important because they help describe the motion of both celestial and artificial satellites. With the growing demand for satellites and satellite formation flying, fast and efficient control of these systems is becoming ever more important. Global linearization of these systems allows engineers to employ methods of control in order to achieve these desired results. We propose a data-driven framework for simultaneous system identification and global linearization of both the Two-Body Problem and Circular Restricted Three-Body Problem via deep learning-based Koopman Theory, i.e., a framework that can identify the underlying dynamics and globally linearize it into a linear time-invariant (LTI) system. The linear Koopman operator is discovered through purely data-driven training of a Deep Neural Network with a custom architecture. This paper displays the ability of the Koopman operator to generalize to various other Two-Body systems without the need for retraining. We also demonstrate the capability of the same architecture to be utilized to accurately learn a Koopman operator that approximates the Circular Restricted Three-Body Problem.  ( 2 min )
    FogGuard: guarding YOLO against fog using perceptual loss
    arXiv:2403.08939v1 Announce Type: cross Abstract: In this paper, we present a novel fog-aware object detection network called FogGuard, designed to address the challenges posed by foggy weather conditions. Autonomous driving systems heavily rely on accurate object detection algorithms, but adverse weather conditions can significantly impact the reliability of deep neural networks (DNNs). Existing approaches fall into two main categories, 1) image enhancement such as IA-YOLO 2) domain adaptation based approaches. Image enhancement based techniques attempt to generate fog-free image. However, retrieving a fogless image from a foggy image is a much harder problem than detecting objects in a foggy image. Domain-adaptation based approaches, on the other hand, do not make use of labelled datasets in the target domain. Both categories of approaches are attempting to solve a harder version of the problem. Our approach builds over fine-tuning on the Our framework is specifically designed to compensate for foggy conditions present in the scene, ensuring robust performance even. We adopt YOLOv3 as the baseline object detection algorithm and introduce a novel Teacher-Student Perceptual loss, to high accuracy object detection in foggy images. Through extensive evaluations on common datasets such as PASCAL VOC and RTTS, we demonstrate the improvement in performance achieved by our network. We demonstrate that FogGuard achieves 69.43\% mAP, as compared to 57.78\% for YOLOv3 on the RTTS dataset. Furthermore, we show that while our training method increases time complexity, it does not introduce any additional overhead during inference compared to the regular YOLO network.  ( 3 min )
    A non-asymptotic theory of Kernel Ridge Regression: deterministic equivalents, test error, and GCV estimator
    arXiv:2403.08938v1 Announce Type: cross Abstract: We consider learning an unknown target function $f_*$ using kernel ridge regression (KRR) given i.i.d. data $(u_i,y_i)$, $i\leq n$, where $u_i \in U$ is a covariate vector and $y_i = f_* (u_i) +\varepsilon_i \in \mathbb{R}$. A recent string of work has empirically shown that the test error of KRR can be well approximated by a closed-form estimate derived from an `equivalent' sequence model that only depends on the spectrum of the kernel operator. However, a theoretical justification for this equivalence has so far relied either on restrictive assumptions -- such as subgaussian independent eigenfunctions -- , or asymptotic derivations for specific kernels in high dimensions. In this paper, we prove that this equivalence holds for a general class of problems satisfying some spectral and concentration properties on the kernel eigendecomposition. Specifically, we establish in this setting a non-asymptotic deterministic approximation for the test error of KRR -- with explicit non-asymptotic bounds -- that only depends on the eigenvalues and the target function alignment to the eigenvectors of the kernel. Our proofs rely on a careful derivation of deterministic equivalents for random matrix functionals in the dimension free regime pioneered by Cheng and Montanari (2022). We apply this setting to several classical examples and show an excellent agreement between theoretical predictions and numerical simulations. These results rely on having access to the eigendecomposition of the kernel operator. Alternatively, we prove that, under this same setting, the generalized cross-validation (GCV) estimator concentrates on the test error uniformly over a range of ridge regularization parameter that includes zero (the interpolating solution). As a consequence, the GCV estimator can be used to estimate from data the test error and optimal regularization parameter for KRR.  ( 3 min )
    A Framework for Strategic Discovery of Credible Neural Network Surrogate Models under Uncertainty
    arXiv:2403.08901v1 Announce Type: cross Abstract: The widespread integration of deep neural networks in developing data-driven surrogate models for high-fidelity simulations of complex physical systems highlights the critical necessity for robust uncertainty quantification techniques and credibility assessment methodologies, ensuring the reliable deployment of surrogate models in consequential decision-making. This study presents the Occam Plausibility Algorithm for surrogate models (OPAL-surrogate), providing a systematic framework to uncover predictive neural network-based surrogate models within the large space of potential models, including various neural network classes and choices of architecture and hyperparameters. The framework is grounded in hierarchical Bayesian inferences and employs model validation tests to evaluate the credibility and prediction reliability of the surrogate models under uncertainty. Leveraging these principles, OPAL-surrogate introduces a systematic and efficient strategy for balancing the trade-off between model complexity, accuracy, and prediction uncertainty. The effectiveness of OPAL-surrogate is demonstrated through two modeling problems, including the deformation of porous materials for building insulation and turbulent combustion flow for the ablation of solid fuels within hybrid rocket motors.  ( 2 min )
    Efficiently Computing Similarities to Private Datasets
    arXiv:2403.08917v1 Announce Type: cross Abstract: Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging `low-dimensional structures' present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.  ( 2 min )
    Moments of Clarity: Streamlining Latent Spaces in Machine Learning using Moment Pooling
    arXiv:2403.08854v1 Announce Type: cross Abstract: Many machine learning applications involve learning a latent representation of data, which is often high-dimensional and difficult to directly interpret. In this work, we propose "Moment Pooling", a natural extension of Deep Sets networks which drastically decrease latent space dimensionality of these networks while maintaining or even improving performance. Moment Pooling generalizes the summation in Deep Sets to arbitrary multivariate moments, which enables the model to achieve a much higher effective latent dimensionality for a fixed latent dimension. We demonstrate Moment Pooling on the collider physics task of quark/gluon jet classification by extending Energy Flow Networks (EFNs) to Moment EFNs. We find that Moment EFNs with latent dimensions as small as 1 perform similarly to ordinary EFNs with higher latent dimension. This small latent dimension allows for the internal representation to be directly visualized and interpreted, which in turn enables the learned internal jet representation to be extracted in closed form.  ( 2 min )
    PAPERCLIP: Associating Astronomical Observations and Natural Language with Multi-Modal Models
    arXiv:2403.08851v1 Announce Type: cross Abstract: We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language-Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.  ( 2 min )
    Majority-of-Three: The Simplest Optimal Learner?
    arXiv:2403.08831v1 Announce Type: cross Abstract: Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke's algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm's error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.  ( 2 min )
    JAXbind: Bind any function to JAX
    arXiv:2403.08847v1 Announce Type: cross Abstract: JAX is widely used in machine learning and scientific computing, the latter of which often relies on existing high-performance code that we would ideally like to incorporate into JAX. Reimplementing the existing code in JAX is often impractical and the existing interface in JAX for binding custom code requires deep knowledge of JAX and its C++ backend. The goal of JAXbind is to drastically reduce the effort required to bind custom functions implemented in other programming languages to JAX. Specifically, JAXbind provides an easy-to-use Python interface for defining custom so-called JAX primitives that support arbitrary JAX transformations.  ( 2 min )
    Mitigating Biases in Collective Decision-Making: Enhancing Performance in the Face of Fake News
    arXiv:2403.08829v1 Announce Type: cross Abstract: Individual and social biases undermine the effectiveness of human advisers by inducing judgment errors which can disadvantage protected groups. In this paper, we study the influence these biases can have in the pervasive problem of fake news by evaluating human participants' capacity to identify false headlines. By focusing on headlines involving sensitive characteristics, we gather a comprehensive dataset to explore how human responses are shaped by their biases. Our analysis reveals recurring individual biases and their permeation into collective decisions. We show that demographic factors, headline categories, and the manner in which information is presented significantly influence errors in human judgment. We then use our collected data as a benchmark problem on which we evaluate the efficacy of adaptive aggregation algorithms. In addition to their improved accuracy, our results highlight the interactions between the emergence of collective intelligence and the mitigation of participant biases.  ( 2 min )
    A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment
    arXiv:2403.08826v1 Announce Type: cross Abstract: For the purpose of efficient and cost-effective large-scale data labeling, crowdsourcing is increasingly being utilized. To guarantee the quality of data labeling, multiple annotations need to be collected for each data sample, and truth inference algorithms have been developed to accurately infer the true labels. Despite previous studies having released public datasets to evaluate the efficacy of truth inference algorithms, these have typically focused on a single type of crowdsourcing task and neglected the temporal information associated with workers' annotation activities. These limitations significantly restrict the practical applicability of these algorithms, particularly in the context of long-term and online truth inference. In this paper, we introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform. This dataset comprises approximately two thousand workers, one million tasks, and six million annotations. The data was gathered over a period of approximately six months from various types of tasks, and the timestamps of each annotation were preserved. We analyze the characteristics of the dataset from multiple perspectives and evaluate the effectiveness of several representative truth inference algorithms on this dataset. We anticipate that this dataset will stimulate future research on tracking workers' abilities over time in relation to different types of tasks, as well as enhancing online truth inference.  ( 3 min )
    Federated Deep Q-Learning and 5G load balancing
    arXiv:2403.08813v1 Announce Type: cross Abstract: Despite advances in cellular network technology, base station (BS) load balancing remains a persistent problem. Although centralized resource allocation methods can address the load balancing problem, it still remains an NP-hard problem. In this research, we study how federated deep Q learning can be used to inform each user equipment (UE) of the each BS's load conditions. Federated deep Q learning's load balancing enables intelligent UEs to independently select the best BS while also limiting the amount of private information exposed to the network. In this study, we propose and analyze a federated deep Q learning load balancing system, which is implemented using the Open-RAN xAPP framework and the near-Real Time Radio Interface Controller (near-RT RIC). Our simulation results indicate that compared to the maximum Signal-To-Noise-Ratio (MAX-SINR) method currently used by UEs, our proposed deep Q learning model can consistently provide better High average UE quality of service  ( 2 min )
    Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness aware Minimization
    arXiv:2403.08821v1 Announce Type: cross Abstract: Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability. However, SAM calculates the gradient twice in each optimization step, thereby doubling the computation costs compared to stochastic gradient descent (SGD). In this paper, we propose a simple yet efficient sampling method to significantly accelerate SAM. Concretely, we discover that the gradient of SAM is a combination of the gradient of SGD and the Projection of the Second-order gradient matrix onto the First-order gradient (PSF). PSF exhibits a gradually increasing frequency of change during the training process. To leverage this observation, we propose an adaptive sampling method based on the variation of PSF, and we reuse the sampled PSF for non-sampling iterations. Extensive empirical results illustrate that the proposed method achieved state-of-the-art accuracies comparable to SAM on diverse network architectures.  ( 2 min )
    Comparison of edge computing methods in Internet of Things architectures for efficient estimation of indoor environmental parameters with Machine Learning
    arXiv:2403.08810v1 Announce Type: cross Abstract: The large increase in the number of Internet of Things (IoT) devices have revolutionised the way data is processed, which added to the current trend from cloud to edge computing has resulted in the need for efficient and reliable data processing near the data sources using energy-efficient devices. Two methods based on low-cost edge-IoT architectures are proposed to implement lightweight Machine Learning (ML) models that estimate indoor environmental quality (IEQ) parameters, such as Artificial Neural Networks of Multilayer Perceptron type. Their implementation is based on centralised and distributed parallel IoT architectures, connected via wireless, which share commercial off-the-self modules for data acquisition and sensing, such as sensors for temperature, humidity, illuminance, CO2, and other gases. The centralised method uses a Graphics Processing Unit and the Message Queuing Telemetry Transport protocol, but the distributed method utilises low performance ARM-based devices and the Message Passing Interface protocol. Although multiple IEQ parameters are measured, the training and testing of ML models is accomplished with experiments focused on small temperature and illuminance datasets to reduce data processing load, obtained from sudden spikes, square profiles and sawteeth test cases. The results show a high estimation performance with F-score and Accuracy values close to 0.95, and an almost theorical Speedup with a reduction in power consumption close to 37% in the distributed parallel approach. In addition, similar or slightly better performance is achieved compared to equivalent IoT architectures from related research, but error reduction of 35 to 76% is accomplished with an adequate balance between performance and energy efficiency.  ( 3 min )
    Forward Direct Feedback Alignment for Online Gradient Estimates of Spiking Neural Networks
    arXiv:2403.08804v1 Announce Type: cross Abstract: There is an interest in finding energy efficient alternatives to current state of the art neural network training algorithms. Spiking neural network are a promising approach, because they can be simulated energy efficiently on neuromorphic hardware platforms. However, these platforms come with limitations on the design of the training algorithm. Most importantly, backpropagation cannot be implemented on those. We propose a novel neuromorphic algorithm, the \textit{Spiking Forward Direct Feedback Alignment} (SFDFA) algorithm, an adaption of \textit{Forward Direct Feedback Alignment} to train SNNs. SFDFA estimates the weights between output and hidden neurons as feedback connections. The main contribution of this paper is to describe how exact local gradients of spikes can be computed in an online manner while taking into account the intra-neuron dependencies between post-synaptic spikes and derive a dynamical system for neuromorphic hardware compatibility. We compare the SFDFA algorithm with a number of competitor algorithms and show that the proposed algorithm achieves higher performance and convergence rates.  ( 2 min )
    Neural Loss Function Evolution for Large-Scale Image Classifier Convolutional Neural Networks
    arXiv:2403.08793v1 Announce Type: cross Abstract: For classification, neural networks typically learn by minimizing cross-entropy, but are evaluated and compared using accuracy. This disparity suggests neural loss function search (NLFS), the search for a drop-in replacement loss function of cross-entropy for neural networks. We apply NLFS to image classifier convolutional neural networks. We propose a new search space for NLFS that encourages more diverse loss functions to be explored, and a surrogate function that accurately transfers to large-scale convolutional neural networks. We search the space using regularized evolution, a mutation-only aging genetic algorithm. After evolution and a proposed loss function elimination protocol, we transferred the final loss functions across multiple architectures, datasets, and image augmentation techniques to assess generalization. In the end, we discovered three new loss functions, called NeuroLoss1, NeuroLoss2, and NeuroLoss3 that were able to outperform cross-entropy in terms of a higher mean test accuracy as a simple drop-in replacement loss function across the majority of experiments.  ( 2 min )
    Governance of Generative Artificial Intelligence for Companies
    arXiv:2403.08802v1 Announce Type: cross Abstract: Generative Artificial Intelligence (GenAI), specifically large language models like ChatGPT, has swiftly entered organizations without adequate governance, posing both opportunities and risks. Despite extensive debates on GenAI's transformative nature and regulatory measures, limited research addresses organizational governance, encompassing technical and business perspectives. This review paper fills this gap by surveying recent works. It goes beyond mere summarization by developing a framework for GenAI governance within companies. Our framework outlines the scope, objectives, and governance mechanisms tailored to harness business opportunities and mitigate risks associated with GenAI integration. This research contributes a focused approach to GenAI governance, offering practical insights for companies navigating the challenges of responsible AI adoption. It is also valuable for a technical audience to broaden their perspective as increasingly ethical and business concerns gain in prevalence and allow them to identify novel research directions.  ( 2 min )
    Bridging Human Concepts and Computer Vision for Explainable Face Verification
    arXiv:2403.08789v1 Announce Type: cross Abstract: With Artificial Intelligence (AI) influencing the decision-making process of sensitive applications such as Face Verification, it is fundamental to ensure the transparency, fairness, and accountability of decisions. Although Explainable Artificial Intelligence (XAI) techniques exist to clarify AI decisions, it is equally important to provide interpretability of these decisions to humans. In this paper, we present an approach to combine computer and human vision to increase the explanation's interpretability of a face verification algorithm. In particular, we are inspired by the human perceptual process to understand how machines perceive face's human-semantic areas during face comparison tasks. We use Mediapipe, which provides a segmentation technique that identifies distinct human-semantic facial regions, enabling the machine's perception analysis. Additionally, we adapted two model-agnostic algorithms to provide human-interpretable insights into the decision-making processes.  ( 2 min )
    Realtime Facial Expression Recognition: Neuromorphic Hardware vs. Edge AI Accelerators
    arXiv:2403.08792v1 Announce Type: cross Abstract: The paper focuses on real-time facial expression recognition (FER) systems as an important component in various real-world applications such as social robotics. We investigate two hardware options for the deployment of FER machine learning (ML) models at the edge: neuromorphic hardware versus edge AI accelerators. Our study includes exhaustive experiments providing comparative analyses between the Intel Loihi neuromorphic processor and four distinct edge platforms: Raspberry Pi-4, Intel Neural Compute Stick (NSC), Jetson Nano, and Coral TPU. The results obtained show that Loihi can achieve approximately two orders of magnitude reduction in power dissipation and one order of magnitude energy savings compared to Coral TPU which happens to be the least power-intensive and energy-consuming edge AI accelerator. These reductions in power and energy are achieved while the neuromorphic solution maintains a comparable level of accuracy with the edge accelerators, all within the real-time latency requirements.  ( 2 min )
    Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning
    arXiv:2403.09621v1 Announce Type: new Abstract: Distributionally robust offline reinforcement learning (RL), which seeks robust policy training against environment perturbation by modeling dynamics uncertainty, calls for function approximations when facing large state-action spaces. However, the consideration of dynamics uncertainty introduces essential nonlinearity and computational burden, posing unique challenges for analyzing and practically employing function approximation. Focusing on a basic setting where the nominal model and perturbed models are linearly parameterized, we propose minimax optimal and computationally efficient algorithms realizing function approximation and initiate the study on instance-dependent suboptimality analysis in the context of robust offline RL. Our results uncover that function approximation in robust offline RL is essentially distinct from and probably harder than that in standard offline RL. Our algorithms and theoretical results crucially depend on a variety of new techniques, involving a novel function approximation mechanism incorporating variance information, a new procedure of suboptimality and estimation uncertainty decomposition, a quantification of the robust value function shrinkage, and a meticulously designed family of hard instances, which might be of independent interest.  ( 2 min )
    Multi-view Subspace Clustering via An Adaptive Consensus Graph Filter
    arXiv:2403.08787v1 Announce Type: cross Abstract: Multiview subspace clustering (MVSC) has attracted an increasing amount of attention in recent years. Most existing MVSC methods first collect complementary information from different views and consequently derive a consensus reconstruction coefficient matrix to indicate the subspace structure of a multi-view data set. In this paper, we initially assume the existence of a consensus reconstruction coefficient matrix and then use it to build a consensus graph filter. In each view, the filter is employed for smoothing the data and designing a regularizer for the reconstruction coefficient matrix. Finally, the obtained reconstruction coefficient matrices from different views are used to create constraints for the consensus reconstruction coefficient matrix. Therefore, in the proposed method, the consensus reconstruction coefficient matrix, the consensus graph filter, and the reconstruction coefficient matrices from different views are interdependent. We provide an optimization algorithm to obtain their optimal values. Extensive experiments on diverse multi-view data sets demonstrate that our approach outperforms some state-of-the-art methods.  ( 2 min )
    Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
    arXiv:2403.09613v1 Announce Type: new Abstract: We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Typically, networks suffer from catastrophic interference when training on a sequence of documents; however, we discover a curious and remarkable property of LLMs fine-tuned sequentially in this setting: they exhibit anticipatory behavior, recovering from the forgetting on documents before encountering them again. The behavior emerges and becomes more robust as the architecture scales up its number of parameters. Through comprehensive experiments and visualizations, we uncover new insights into training over-parameterized networks in structured environments.  ( 2 min )
    Multi-Fidelity Bayesian Optimization With Across-Task Transferable Max-Value Entropy Search
    arXiv:2403.09570v1 Announce Type: new Abstract: In many applications, ranging from logistics to engineering, a designer is faced with a sequence of optimization tasks for which the objectives are in the form of black-box functions that are costly to evaluate. For example, the designer may need to tune the hyperparameters of neural network models for different learning tasks over time. Rather than evaluating the objective function for each candidate solution, the designer may have access to approximations of the objective functions, for which higher-fidelity evaluations entail a larger cost. Existing multi-fidelity black-box optimization strategies select candidate solutions and fidelity levels with the goal of maximizing the information accrued about the optimal value or solution for the current task. Assuming that successive optimization tasks are related, this paper introduces a novel information-theoretic acquisition function that balances the need to acquire information about the current task with the goal of collecting information transferable to future tasks. The proposed method includes shared inter-task latent variables, which are transferred across tasks by implementing particle-based variational Bayesian updates. Experimental results across synthetic and real-world examples reveal that the proposed provident acquisition strategy that caters to future tasks can significantly improve the optimization efficiency as soon as a sufficient number of tasks is processed.  ( 2 min )
    Iterative Forgetting: Online Data Stream Regression Using Database-Inspired Adaptive Granulation
    arXiv:2403.09588v1 Announce Type: new Abstract: Many modern systems, such as financial, transportation, and telecommunications systems, are time-sensitive in the sense that they demand low-latency predictions for real-time decision-making. Such systems often have to contend with continuous unbounded data streams as well as concept drift, which are challenging requirements that traditional regression techniques are unable to cater to. There exists a need to create novel data stream regression methods that can handle these scenarios. We present a database-inspired datastream regression model that (a) uses inspiration from R*-trees to create granules from incoming datastreams such that relevant information is retained, (b) iteratively forgets granules whose information is deemed to be outdated, thus maintaining a list of only recent, relevant granules, and (c) uses the recent data and granules to provide low-latency predictions. The R*-tree-inspired approach also makes the algorithm amenable to integration with database systems. Our experiments demonstrate that the ability of this method to discard data produces a significant order-of-magnitude improvement in latency and training time when evaluated against the most accurate state-of-the-art algorithms, while the R*-tree-inspired granulation technique provides competitively accurate predictions  ( 2 min )
    Self-Consistency Training for Hamiltonian Prediction
    arXiv:2403.09560v1 Announce Type: new Abstract: Hamiltonian prediction is a versatile formulation to leverage machine learning for solving molecular science problems. Yet, its applicability is limited by insufficient labeled data for training. In this work, we highlight that Hamiltonian prediction possesses a self-consistency principle, based on which we propose an exact training method that does not require labeled data. This merit addresses the data scarcity difficulty, and distinguishes the task from other property prediction formulations with unique benefits: (1) self-consistency training enables the model to be trained on a large amount of unlabeled data, hence substantially enhances generalization; (2) self-consistency training is more efficient than labeling data with DFT for supervised training, since it is an amortization of DFT calculation over a set of molecular structures. We empirically demonstrate the better generalization in data-scarce and out-of-distribution scenarios, and the better efficiency from the amortization. These benefits push forward the applicability of Hamiltonian prediction to an ever larger scale.  ( 2 min )
    Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields
    arXiv:2403.09549v1 Announce Type: new Abstract: Understanding the interactions of atoms such as forces in 3D atomistic systems is fundamental to many applications like molecular dynamics and catalyst design. However, simulating these interactions requires compute-intensive ab initio calculations and thus results in limited data for training neural networks. In this paper, we propose to use denoising non-equilibrium structures (DeNS) as an auxiliary task to better leverage training data and improve performance. For training with DeNS, we first corrupt a 3D structure by adding noise to its 3D coordinates and then predict the noise. Different from previous works on denoising, which are limited to equilibrium structures, the proposed method generalizes denoising to a much larger set of non-equilibrium structures. The main difference is that a non-equilibrium structure does not correspond to local energy minima and has non-zero forces, and therefore it can have many possible atomic positions compared to an equilibrium structure. This makes denoising non-equilibrium structures an ill-posed problem since the target of denoising is not uniquely defined. Our key insight is to additionally encode the forces of the original non-equilibrium structure to specify which non-equilibrium structure we are denoising. Concretely, given a corrupted non-equilibrium structure and the forces of the original one, we predict the non-equilibrium structure satisfying the input forces instead of any arbitrary structures. Since DeNS requires encoding forces, DeNS favors equivariant networks, which can easily incorporate forces and other higher-order tensors in node embeddings. We study the effectiveness of training equivariant networks with DeNS on OC20, OC22 and MD17 datasets and demonstrate that DeNS can achieve new state-of-the-art results on OC20 and OC22 and significantly improve training efficiency on MD17.  ( 3 min )
    EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
    arXiv:2403.09502v1 Announce Type: new Abstract: Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.  ( 2 min )
    Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability
    arXiv:2403.09548v1 Announce Type: new Abstract: Cancer is one of the diseases that kill the most women in the world, with breast cancer being responsible for the highest number of cancer cases and consequently deaths. However, it can be prevented by early detection and, consequently, early treatment. Any development for detection or perdition this kind of cancer is important for a better healthy life. Many studies focus on a model with high accuracy in cancer prediction, but sometimes accuracy alone may not always be a reliable metric. This study implies an investigative approach to studying the performance of different machine learning algorithms based on boosting to predict breast cancer focusing on the recall metric. Boosting machine learning algorithms has been proven to be an effective tool for detecting medical diseases. The dataset of the University of California, Irvine (UCI) repository has been utilized to train and test the model classifier that contains their attributes. The main objective of this study is to use state-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost and LightGBM to predict and diagnose breast cancer and to find the most effective metric regarding recall, ROC-AUC, and confusion matrix. Furthermore, our study is the first to use these four boosting algorithms with Optuna, a library for hyperparameter optimization, and the SHAP method to improve the interpretability of our model, which can be used as a support to identify and predict breast cancer. We were able to improve AUC or recall for all the models and reduce the False Negative for AdaBoost and LigthGBM the final AUC were more than 99.41\% for all models.  ( 3 min )
    A Reinforcement Learning Approach to Dairy Farm Battery Management using Q Learning
    arXiv:2403.09499v1 Announce Type: new Abstract: Dairy farming consumes a significant amount of energy, making it an energy-intensive sector within agriculture. Integrating renewable energy generation into dairy farming could help address this challenge. Effective battery management is important for integrating renewable energy generation. Managing battery charging and discharging poses significant challenges because of fluctuations in electrical consumption, the intermittent nature of renewable energy generation, and fluctuations in energy prices. Artificial Intelligence (AI) has the potential to significantly improve the use of renewable energy in dairy farming, however, there is limited research conducted in this particular domain. This research considers Ireland as a case study as it works towards attaining its 2030 energy strategy centered on the utilization of renewable sources. This study proposes a Q-learning-based algorithm for scheduling battery charging and discharging in a dairy farm setting. This research also explores the effect of the proposed algorithm by adding wind generation data and considering additional case studies. The proposed algorithm reduces the cost of imported electricity from the grid by 13.41\%, peak demand by 2\%, and 24.49\% when utilizing wind generation. These results underline how reinforcement learning is highly effective in managing batteries in the dairy farming sector.  ( 2 min )
    On using Machine Learning Algorithms for Motorcycle Collision Detection
    arXiv:2403.09491v1 Announce Type: new Abstract: Globally, motorcycles attract vast and varied users. However, since the rate of severe injury and fatality in motorcycle accidents far exceeds passenger car accidents, efforts have been directed toward increasing passive safety systems. Impact simulations show that the risk of severe injury or death in the event of a motorcycle-to-car impact can be greatly reduced if the motorcycle is equipped with passive safety measures such as airbags and seat belts. For the passive safety systems to be activated, a collision must be detected within milliseconds for a wide variety of impact configurations, but under no circumstances may it be falsely triggered. For the challenge of reliably detecting impending collisions, this paper presents an investigation towards the applicability of machine learning algorithms. First, a series of simulations of accidents and driving operation is introduced to collect data to train machine learning classification models. Their performance is henceforth assessed and compared via multiple representative and application-oriented criteria.  ( 2 min )
    Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
    arXiv:2403.09472v1 Announce Type: new Abstract: Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans? This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term as \textit{easy-to-hard generalization}. Our key insight is that an evaluator (reward model) trained on supervisions for easier tasks can be effectively used for scoring candidate solutions of harder tasks and hence facilitating easy-to-hard generalization over different levels of tasks. Based on this insight, we propose a novel approach to scalable alignment, which firstly trains the process-supervised reward models on easy problems (e.g., level 1-3), and then uses them to evaluate the performance of policy models on hard problems. We show that such \textit{easy-to-hard generalization from evaluators} can enable \textit{easy-to-hard generalizations in generators} either through re-ranking or reinforcement learning (RL). Notably, our process-supervised 7b RL model achieves an accuracy of 34.0\% on MATH500, despite only using human supervision on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.  ( 2 min )
    Laying the Foundation First? Investigating the Generalization from Atomic Skills to Complex Reasoning Tasks
    arXiv:2403.09479v1 Announce Type: new Abstract: Current language models have demonstrated their capability to develop basic reasoning, but struggle in more complicated reasoning tasks that require a combination of atomic skills, such as math word problem requiring skills like arithmetic and unit conversion. Previous methods either do not improve the inherent atomic skills of models or not attempt to generalize the atomic skills to complex reasoning tasks. In this paper, we first propose a probing framework to investigate whether the atomic skill can spontaneously generalize to complex reasoning tasks. Then, we introduce a hierarchical curriculum learning training strategy to achieve better skill generalization. In our experiments, we find that atomic skills can not spontaneously generalize to compositional tasks. By leveraging hierarchical curriculum learning, we successfully induce generalization, significantly improve the performance of open-source LMs on complex reasoning tasks. Promisingly, the skill generalization exhibit effective in cross-dataset and cross-domain scenarios. Complex reasoning can also help enhance atomic skills. Our findings offer valuable guidance for designing better training strategies for complex reasoning tasks.  ( 2 min )
    Machine learning for structural design models of continuous beam systems via influence zones
    arXiv:2403.09454v1 Announce Type: new Abstract: This work develops a machine learned structural design model for continuous beam systems from the inverse problem perspective. After demarcating between forward, optimisation and inverse machine learned operators, the investigation proposes a novel methodology based on the recently developed influence zone concept which represents a fundamental shift in approach compared to traditional structural design methods. The aim of this approach is to conceptualise a non-iterative structural design model that predicts cross-section requirements for continuous beam systems of arbitrary system size. After generating a dataset of known solutions, an appropriate neural network architecture is identified, trained, and tested against unseen data. The results show a mean absolute percentage testing error of 1.6% for cross-section property predictions, along with a good ability of the neural network to generalise well to structural systems of variable size. The CBeamXP dataset generated in this work and an associated python-based neural network training script are available at an open-source data repository to allow for the reproducibility of results and to encourage further investigations.  ( 2 min )
    Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
    arXiv:2403.09428v1 Announce Type: new Abstract: Multimodal machine learning with missing modalities is an increasingly relevant challenge arising in various applications such as healthcare. This paper extends the current research into missing modalities to the low-data regime, i.e., a downstream task has both missing modalities and limited sample size issues. This problem setting is particularly challenging and also practical as it is often expensive to get full-modality data and sufficient annotated training samples. We propose to use retrieval-augmented in-context learning to address these two crucial issues by unleashing the potential of a transformer's in-context learning ability. Diverging from existing methods, which primarily belong to the parametric paradigm and often require sufficient training samples, our work exploits the value of the available full-modality data, offering a novel perspective on resolving the challenge. The proposed data-dependent framework exhibits a higher degree of sample efficiency and is empirically demonstrated to enhance the classification model's performance on both full- and missing-modality data in the low-data regime across various multimodal learning tasks. When only 1% of the training data are available, our proposed method demonstrates an average improvement of 6.1% over a recent strong baseline across various datasets and missing states. Notably, our method also reduces the performance gap between full-modality and missing-modality data compared with the baseline.  ( 3 min )
    Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency
    arXiv:2403.09441v1 Announce Type: new Abstract: As deep learning (DL) models are increasingly being integrated into our everyday lives, ensuring their safety by making them robust against adversarial attacks has become increasingly critical. DL models have been found to be susceptible to adversarial attacks which can be achieved by introducing small, targeted perturbations to disrupt the input data. Adversarial training has been presented as a mitigation strategy which can result in more robust models. This adversarial robustness comes with additional computational costs required to design adversarial attacks during training. The two objectives -- adversarial robustness and computational efficiency -- then appear to be in conflict of each other. In this work, we explore the effects of two different model compression methods -- structured weight pruning and quantization -- on adversarial robustness. We specifically explore the effects of fine-tuning on compressed models, and present the trade-off between standard fine-tuning and adversarial fine-tuning. Our results show that compression does not inherently lead to loss in model robustness and adversarial fine-tuning of a compressed model can yield large improvement to the robustness performance of models. We present experiments on two benchmark datasets showing that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models, while also improving computational efficiency.  ( 3 min )
    Rethinking Autoencoders for Medical Anomaly Detection from A Theoretical Perspective
    arXiv:2403.09303v1 Announce Type: new Abstract: Medical anomaly detection aims to identify abnormal findings using only normal training data, playing a crucial role in health screening and recognizing rare diseases. Reconstruction-based methods, particularly those utilizing autoencoders (AEs), are dominant in this field. They work under the assumption that AEs trained on only normal data cannot reconstruct unseen abnormal regions well, thereby enabling the anomaly detection based on reconstruction errors. However, this assumption does not always hold due to the mismatch between the reconstruction training objective and the anomaly detection task objective, rendering these methods theoretically unsound. This study focuses on providing a theoretical foundation for AE-based reconstruction methods in anomaly detection. By leveraging information theory, we elucidate the principles of these methods and reveal that the key to improving AE in anomaly detection lies in minimizing the information entropy of latent vectors. Experiments on four datasets with two image modalities validate the effectiveness of our theory. To the best of our knowledge, this is the first effort to theoretically clarify the principles and design philosophy of AE for anomaly detection. Code will be available upon acceptance.  ( 2 min )
    Recursive Causal Discovery
    arXiv:2403.09300v1 Announce Type: new Abstract: Causal discovery, i.e., learning the causal graph from data, is often the first step toward the identification and estimation of causal effects, a key requirement in numerous scientific domains. Causal discovery is hampered by two main challenges: limited data results in errors in statistical testing and the computational complexity of the learning task is daunting. This paper builds upon and extends four of our prior publications (Mokhtarian et al., 2021; Akbari et al., 2021; Mokhtarian et al., 2022, 2023a). These works introduced the concept of removable variables, which are the only variables that can be removed recursively for the purpose of causal discovery. Presence and identification of removable variables allow recursive approaches for causal discovery, a promising solution that helps to address the aforementioned challenges by reducing the problem size successively. This reduction not only minimizes conditioning sets in each conditional independence (CI) test, leading to fewer errors but also significantly decreases the number of required CI tests. The worst-case performances of these methods nearly match the lower bound. In this paper, we present a unified framework for the proposed algorithms, refined with additional details and enhancements for a coherent presentation. A comprehensive literature review is also included, comparing the computational complexity of our methods with existing approaches, showcasing their state-of-the-art efficiency. Another contribution of this paper is the release of RCD, a Python package that efficiently implements these algorithms. This package is designed for practitioners and researchers interested in applying these methods in practical scenarios. The package is available at github.com/ban-epfl/rcd, with comprehensive documentation provided at rcdpackage.com.  ( 3 min )
    Uncertainty Quantification for cross-subject Motor Imagery classification
    arXiv:2403.09228v1 Announce Type: new Abstract: Uncertainty Quantification aims to determine when the prediction from a Machine Learning model is likely to be wrong. Computer Vision research has explored methods for determining epistemic uncertainty (also known as model uncertainty), which should correspond with generalisation error. These methods theoretically allow to predict misclassifications due to inter-subject variability. We applied a variety of Uncertainty Quantification methods to predict misclassifications for a Motor Imagery Brain Computer Interface. Deep Ensembles performed best, both in terms of classification performance and cross-subject Uncertainty Quantification performance. However, we found that standard CNNs with Softmax output performed better than some of the more advanced methods.  ( 2 min )
    MCformer: Multivariate Time Series Forecasting with Mixed-Channels Transformer
    arXiv:2403.09223v1 Announce Type: new Abstract: The massive generation of time-series data by largescale Internet of Things (IoT) devices necessitates the exploration of more effective models for multivariate time-series forecasting. In previous models, there was a predominant use of the Channel Dependence (CD) strategy (where each channel represents a univariate sequence). Current state-of-the-art (SOTA) models primarily rely on the Channel Independence (CI) strategy. The CI strategy treats all channels as a single channel, expanding the dataset to improve generalization performance and avoiding inter-channel correlation that disrupts long-term features. However, the CI strategy faces the challenge of interchannel correlation forgetting. To address this issue, we propose an innovative Mixed Channels strategy, combining the data expansion advantages of the CI strategy with the ability to counteract inter-channel correlation forgetting. Based on this strategy, we introduce MCformer, a multivariate time-series forecasting model with mixed channel features. The model blends a specific number of channels, leveraging an attention mechanism to effectively capture inter-channel correlation information when modeling long-term features. Experimental results demonstrate that the Mixed Channels strategy outperforms pure CI strategy in multivariate time-series forecasting tasks.  ( 2 min )
    Design of an basis-projected layer for sparse datasets in deep learning training using gc-ms spectra as a case study
    arXiv:2403.09188v1 Announce Type: new Abstract: Deep learning (DL) models encompass millions or even billions of parameters and learn complex patterns from big data. However, not all data are initially stored in a suitable formation to effectively train a DL model, e.g., gas chromatography-mass spectrometry (GC-MS) spectra and DNA sequence. These datasets commonly contain many zero values, and the sparse data formation causes difficulties in optimizing DL models. A DL module called the basis-projected layer (BPL) was proposed to mitigate the issue by transforming the sparse data into a dense representation. The transformed data is expected to facilitate the gradient calculation and finetuned process in a DL training process. The dataset, example of a sparse dataset, contained 362 specialty coffee odorant spectra detected from GC-MS. The BPL layer was placed at the beginning of the DL model. The tunable parameters in the layer were learnable projected axes that were the bases of a new representation space. The layer rotated these bases when its parameters were updated. When the number of the bases was the same as the original dimension, the increasing percentage of the F1 scores was 8.56%. Furthermore, when the number was set as 768 (the original dimension was 490), the increasing percentage of the F1 score was 11.49%. The layer not only maintained the model performance and even constructed a better representation space in analyzing sparse datasets.  ( 3 min )
    On the Laplace Approximation as Model Selection Criterion for Gaussian Processes
    arXiv:2403.09215v1 Announce Type: new Abstract: Model selection aims to find the best model in terms of accuracy, interpretability or simplicity, preferably all at once. In this work, we focus on evaluating model performance of Gaussian process models, i.e. finding a metric that provides the best trade-off between all those criteria. While previous work considers metrics like the likelihood, AIC or dynamic nested sampling, they either lack performance or have significant runtime issues, which severely limits applicability. We address these challenges by introducing multiple metrics based on the Laplace approximation, where we overcome a severe inconsistency occuring during naive application of the Laplace approximation. Experiments show that our metrics are comparable in quality to the gold standard dynamic nested sampling without compromising for computational speed. Our model selection criteria allow significantly faster and high quality model selection of Gaussian process models.  ( 2 min )
    Optimal Top-Two Method for Best Arm Identification and Fluid Analysis
    arXiv:2403.09123v1 Announce Type: new Abstract: Top-$2$ methods have become popular in solving the best arm identification (BAI) problem. The best arm, or the arm with the largest mean amongst finitely many, is identified through an algorithm that at any sequential step independently pulls the empirical best arm, with a fixed probability $\beta$, and pulls the best challenger arm otherwise. The probability of incorrect selection is guaranteed to lie below a specified $\delta >0$. Information theoretic lower bounds on sample complexity are well known for BAI problem and are matched asymptotically as $\delta \rightarrow 0$ by computationally demanding plug-in methods. The above top 2 algorithm for any $\beta \in (0,1)$ has sample complexity within a constant of the lower bound. However, determining the optimal $\beta$ that matches the lower bound has proven difficult. In this paper, we address this and propose an optimal top-2 type algorithm. We consider a function of allocations anchored at a threshold. If it exceeds the threshold then the algorithm samples the empirical best arm. Otherwise, it samples the challenger arm. We show that the proposed algorithm is optimal as $\delta \rightarrow 0$. Our analysis relies on identifying a limiting fluid dynamics of allocations that satisfy a series of ordinary differential equations pasted together and that describe the asymptotic path followed by our algorithm. We rely on the implicit function theorem to show existence and uniqueness of these fluid ode's and to show that the proposed algorithm remains close to the ode solution.  ( 3 min )
    S^2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering
    arXiv:2403.09107v1 Announce Type: new Abstract: Anchor-based large-scale multi-view clustering has attracted considerable attention for its effectiveness in handling massive datasets. However, current methods mainly seek the consensus embedding feature for clustering by exploring global correlations between anchor graphs or projection matrices.In this paper, we propose a simple yet efficient scalable multi-view tensor clustering (S^2MVTC) approach, where our focus is on learning correlations of embedding features within and across views. Specifically, we first construct the embedding feature tensor by stacking the embedding features of different views into a tensor and rotating it. Additionally, we build a novel tensor low-frequency approximation (TLFA) operator, which incorporates graph similarity into embedding feature learning, efficiently achieving smooth representation of embedding features within different views. Furthermore, consensus constraints are applied to embedding features to ensure inter-view semantic consistency. Experimental results on six large-scale multi-view datasets demonstrate that S^2MVTC significantly outperforms state-of-the-art algorithms in terms of clustering performance and CPU execution time, especially when handling massive data. The code of S^2MVTC is publicly available at https://github.com/longzhen520/S2MVTC.  ( 2 min )
    SINDy-RL: Interpretable and Efficient Model-Based Reinforcement Learning
    arXiv:2403.09110v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) has shown significant promise for uncovering sophisticated control policies that interact in environments with complicated dynamics, such as stabilizing the magnetohydrodynamics of a tokamak fusion reactor or minimizing the drag force exerted on an object in a fluid flow. However, these algorithms require an abundance of training examples and may become prohibitively expensive for many applications. In addition, the reliance on deep neural networks often results in an uninterpretable, black-box policy that may be too computationally expensive to use with certain embedded systems. Recent advances in sparse dictionary learning, such as the sparse identification of nonlinear dynamics (SINDy), have shown promise for creating efficient and interpretable data-driven models in the low-data regime. In this work we introduce SINDy-RL, a unifying framework for combining SINDy and DRL to create efficient, interpretable, and trustworthy representations of the dynamics model, reward function, and control policy. We demonstrate the effectiveness of our approaches on benchmark control environments and challenging fluids problems. SINDy-RL achieves comparable performance to state-of-the-art DRL algorithms using significantly fewer interactions in the environment and results in an interpretable control policy orders of magnitude smaller than a deep neural network policy.  ( 2 min )
    Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement
    arXiv:2403.09101v1 Announce Type: new Abstract: Adversarial training (AT) is currently one of the most effective ways to obtain the robustness of deep neural networks against adversarial attacks. However, most AT methods suffer from robust overfitting, i.e., a significant generalization gap in adversarial robustness between the training and testing curves. In this paper, we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT from a view of gradient norm. As such label noise is mainly caused by a distribution mismatch and improper label assignments, we are motivated to propose a label refinement approach for AT. Specifically, our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels, and then it calibrates the training by dynamically incorporating knowledge from self-distilled models into the current model and thus requiring no external teachers. Empirical results demonstrate that our method can simultaneously boost the standard accuracy and robust performance across multiple benchmark datasets, attack types, and architectures. In addition, we also provide a set of analyses from the perspectives of information theory to dive into our method and suggest the importance of soft labels for robust generalization.  ( 3 min )
    Learning from straggler clients in federated learning
    arXiv:2403.09086v1 Announce Type: new Abstract: How well do existing federated learning algorithms learn from client devices that return model updates with a significant time delay? Is it even possible to learn effectively from clients that report back minutes, hours, or days after being scheduled? We answer these questions by developing Monte Carlo simulations of client latency that are guided by real-world applications. We study synchronous optimization algorithms like FedAvg and FedAdam as well as the asynchronous FedBuff algorithm, and observe that all these existing approaches struggle to learn from severely delayed clients. To improve upon this situation, we experiment with modifications, including distillation regularization and exponential moving averages of model weights. Finally, we introduce two new algorithms, FARe-DUST and FeAST-on-MSG, based on distillation and averaging, respectively. Experiments with the EMNIST, CIFAR-100, and StackOverflow benchmark federated learning tasks demonstrate that our new algorithms outperform existing ones in terms of accuracy for straggler clients, while also providing better trade-offs between training time and total accuracy.  ( 2 min )
    Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
    arXiv:2403.09054v1 Announce Type: new Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.  ( 3 min )
    Hyperparameters in Continual Learning: a Reality Check
    arXiv:2403.09066v1 Announce Type: new Abstract: Various algorithms for continual learning (CL) have been designed with the goal of effectively alleviating the trade-off between stability and plasticity during the CL process. To achieve this goal, tuning appropriate hyperparameters for each algorithm is essential. As an evaluation protocol, it has been common practice to train a CL algorithm using diverse hyperparameter values on a CL scenario constructed with a benchmark dataset. Subsequently, the best performance attained with the optimal hyperparameter value serves as the criterion for evaluating the CL algorithm. In this paper, we contend that this evaluation protocol is not only impractical but also incapable of effectively assessing the CL capability of a CL algorithm. Returning to the fundamental principles of model evaluation in machine learning, we propose an evaluation protocol that involves Hyperparameter Tuning and Evaluation phases. Those phases consist of different datasets but share the same CL scenario. In the Hyperparameter Tuning phase, each algorithm is iteratively trained with different hyperparameter values to find the optimal hyperparameter values. Subsequently, in the Evaluation phase, the optimal hyperparameter values is directly applied for training each algorithm, and their performance in the Evaluation phase serves as the criterion for evaluating them. Through experiments on CIFAR-100 and ImageNet-100 based on the proposed protocol in class-incremental learning, we not only observed that the existing evaluation method fail to properly assess the CL capability of each algorithm but also observe that some recently proposed state-of-the-art algorithms, which reported superior performance, actually exhibit inferior performance compared to the previous algorithm.  ( 3 min )
    Towards a theory of model distillation
    arXiv:2403.09053v1 Announce Type: new Abstract: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open. To study these questions, we initiate a general theory of distillation, defining PAC-distillation in an analogous way to PAC-learning [Val84]. As applications of this theory: (1) we propose new algorithms to extract the knowledge stored in the trained weights of neural networks -- we show how to efficiently distill neural networks into succinct, explicit decision tree representations when possible by using the ``linear representation hypothesis''; and (2) we prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.  ( 2 min )
    Taming Cross-Domain Representation Variance in Federated Prototype Learning with Heterogeneous Data Domains
    arXiv:2403.09048v1 Announce Type: new Abstract: Federated learning (FL) allows collaborative machine learning training without sharing private data. While most FL methods assume identical data domains across clients, real-world scenarios often involve heterogeneous data domains. Federated Prototype Learning (FedPL) addresses this issue, using mean feature vectors as prototypes to enhance model generalization. However, existing FedPL methods create the same number of prototypes for each client, leading to cross-domain performance gaps and disparities for clients with varied data distributions. To mitigate cross-domain feature representation variance, we introduce FedPLVM, which establishes variance-aware dual-level prototypes clustering and employs a novel $\alpha$-sparsity prototype loss. The dual-level prototypes clustering strategy creates local clustered prototypes based on private data features, then performs global prototypes clustering to reduce communication complexity and preserve local data privacy. The $\alpha$-sparsity prototype loss aligns samples from underrepresented domains, enhancing intra-class similarity and reducing inter-class similarity. Evaluations on Digit-5, Office-10, and DomainNet datasets demonstrate our method's superiority over existing approaches.  ( 2 min )
    DiTMoS: Delving into Diverse Tiny-Model Selection on Microcontrollers
    arXiv:2403.09035v1 Announce Type: new Abstract: Enabling efficient and accurate deep neural network (DNN) inference on microcontrollers is non-trivial due to the constrained on-chip resources. Current methodologies primarily focus on compressing larger models yet at the expense of model accuracy. In this paper, we rethink the problem from the inverse perspective by constructing small/weak models directly and improving their accuracy. Thus, we introduce DiTMoS, a novel DNN training and inference framework with a selector-classifiers architecture, where the selector routes each input sample to the appropriate classifier for classification. DiTMoS is grounded on a key insight: a composition of weak models can exhibit high diversity and the union of them can significantly boost the accuracy upper bound. To approach the upper bound, DiTMoS introduces three strategies including diverse training data splitting to increase the classifiers' diversity, adversarial selector-classifiers training to ensure synergistic interactions thereby maximizing their complementarity, and heterogeneous feature aggregation to improve the capacity of classifiers. We further propose a network slicing technique to alleviate the extra memory overhead incurred by feature aggregation. We deploy DiTMoS on the Neucleo STM32F767ZI board and evaluate it based on three time-series datasets for human activity recognition, keywords spotting, and emotion recognition, respectively. The experiment results manifest that: (a) DiTMoS achieves up to 13.4% accuracy improvement compared to the best baseline; (b) network slicing almost completely eliminates the memory overhead incurred by feature aggregation with a marginal increase of latency.  ( 3 min )
    Spatial-temporal Memories Enhanced Graph Autoencoder for Anomaly Detection in Dynamic Graphs
    arXiv:2403.09039v1 Announce Type: new Abstract: Anomaly detection in dynamic graphs presents a significant challenge due to the temporal evolution of graph structures and attributes. The conventional approaches that tackle this problem typically employ an unsupervised learning framework, capturing normality patterns with exclusive normal data during training and identifying deviations as anomalies during testing. However, these methods face critical drawbacks: they either only depend on proxy tasks for general representation without directly pinpointing normal patterns, or they neglect to differentiate between spatial and temporal normality patterns, leading to diminished efficacy in anomaly detection. To address these challenges, we introduce a novel Spatial-Temporal memories-enhanced graph autoencoder (STRIPE). Initially, STRIPE employs Graph Neural Networks (GNNs) and gated temporal convolution layers to extract spatial features and temporal features, respectively. Then STRIPE incorporates separate spatial and temporal memory networks, which capture and store prototypes of normal patterns, thereby preserving the uniqueness of spatial and temporal normality. After that, through a mutual attention mechanism, these stored patterns are then retrieved and integrated with encoded graph embeddings. Finally, the integrated features are fed into the decoder to reconstruct the graph streams which serve as the proxy task for anomaly detection. This comprehensive approach not only minimizes reconstruction errors but also refines the model by emphasizing the compactness and distinctiveness of the embeddings in relation to the nearest memory prototypes. Through extensive testing, STRIPE has demonstrated a superior capability to discern anomalies by effectively leveraging the distinct spatial and temporal dynamics of dynamic graphs, significantly outperforming existing methodologies, with an average improvement of 15.39% on AUC values.  ( 3 min )
    Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis
    arXiv:2403.08955v1 Announce Type: new Abstract: Reinforcement Learning (RL) has shown exceptional performance across various applications, enabling autonomous agents to learn optimal policies through interaction with their environments. However, traditional RL frameworks often face challenges in terms of iteration complexity and robustness. Risk-sensitive RL, which balances expected return and risk, has been explored for its potential to yield probabilistically robust policies, yet its iteration complexity analysis remains underexplored. In this study, we conduct a thorough iteration complexity analysis for the risk-sensitive policy gradient method, focusing on the REINFORCE algorithm and employing the exponential utility function. We obtain an iteration complexity of $\mathcal{O}(\epsilon^{-2})$ to reach an $\epsilon$-approximate first-order stationary point (FOSP). We investigate whether risk-sensitive algorithms can achieve better iteration complexity compared to their risk-neutral counterparts. Our theoretical analysis demonstrates that risk-sensitive REINFORCE can have a reduced number of iterations required for convergence. This leads to improved iteration complexity, as employing the exponential utility does not entail additional computation per iteration. We characterize the conditions under which risk-sensitive algorithms can achieve better iteration complexity. Our simulation results also validate that risk-averse cases can converge and stabilize more quickly after approximately half of the episodes compared to their risk-neutral counterparts.  ( 2 min )
    One-Shot Averaging for Distributed TD($\lambda$) Under Markov Sampling
    arXiv:2403.08896v1 Announce Type: new Abstract: We consider a distributed setup for reinforcement learning, where each agent has a copy of the same Markov Decision Process but transitions are sampled from the corresponding Markov chain independently by each agent. We show that in this setting, we can achieve a linear speedup for TD($\lambda$), a family of popular methods for policy evaluation, in the sense that $N$ agents can evaluate a policy $N$ times faster provided the target accuracy is small enough. Notably, this speedup is achieved by ``one shot averaging,'' a procedure where the agents run TD($\lambda$) with Markov sampling independently and only average their results after the final step. This significantly reduces the amount of communication required to achieve a linear speedup relative to previous work.  ( 2 min )
    Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era
    arXiv:2403.08946v1 Announce Type: new Abstract: Explainable AI (XAI) refers to techniques that provide human-understandable insights into the workings of AI models. Recently, the focus of XAI is being extended towards Large Language Models (LLMs) which are often criticized for their lack of transparency. This extension calls for a significant transformation in XAI methodologies because of two reasons. First, many existing XAI methods cannot be directly applied to LLMs due to their complexity advanced capabilities. Second, as LLMs are increasingly deployed across diverse industry applications, the role of XAI shifts from merely opening the "black box" to actively enhancing the productivity and applicability of LLMs in real-world settings. Meanwhile, unlike traditional machine learning models that are passive recipients of XAI insights, the distinct abilities of LLMs can reciprocally enhance XAI. Therefore, in this paper, we introduce Usable XAI in the context of LLMs by analyzing (1) how XAI can benefit LLMs and AI systems, and (2) how LLMs can contribute to the advancement of XAI. We introduce 10 strategies, introducing the key techniques for each and discussing their associated challenges. We also provide case studies to demonstrate how to obtain and leverage explanations. The code used in this paper can be found at: https://github.com/JacksonWuxs/UsableXAI_LLM.  ( 3 min )
    REFRESH: Responsible and Efficient Feature Reselection Guided by SHAP Values
    arXiv:2403.08880v1 Announce Type: new Abstract: Feature selection is a crucial step in building machine learning models. This process is often achieved with accuracy as an objective, and can be cumbersome and computationally expensive for large-scale datasets. Several additional model performance characteristics such as fairness and robustness are of importance for model development. As regulations are driving the need for more trustworthy models, deployed models need to be corrected for model characteristics associated with responsible artificial intelligence. When feature selection is done with respect to one model performance characteristic (eg. accuracy), feature selection with secondary model performance characteristics (eg. fairness and robustness) as objectives would require going through the computationally expensive selection process from scratch. In this paper, we introduce the problem of feature \emph{reselection}, so that features can be selected with respect to secondary model performance characteristics efficiently even after a feature selection process has been done with respect to a primary objective. To address this problem, we propose REFRESH, a method to reselect features so that additional constraints that are desirable towards model performance can be achieved without having to train several new models. REFRESH's underlying algorithm is a novel technique using SHAP values and correlation analysis that can approximate for the predictions of a model without having to train these models. Empirical evaluations on three datasets, including a large-scale loan defaulting dataset show that REFRESH can help find alternate models with better model characteristics efficiently. We also discuss the need for reselection and REFRESH based on regulation desiderata.  ( 3 min )
    Multi-Objective Optimization Using Adaptive Distributed Reinforcement Learning
    arXiv:2403.08879v1 Announce Type: new Abstract: The Intelligent Transportation System (ITS) environment is known to be dynamic and distributed, where participants (vehicle users, operators, etc.) have multiple, changing and possibly conflicting objectives. Although Reinforcement Learning (RL) algorithms are commonly applied to optimize ITS applications such as resource management and offloading, most RL algorithms focus on single objectives. In many situations, converting a multi-objective problem into a single-objective one is impossible, intractable or insufficient, making such RL algorithms inapplicable. We propose a multi-objective, multi-agent reinforcement learning (MARL) algorithm with high learning efficiency and low computational requirements, which automatically triggers adaptive few-shot learning in a dynamic, distributed and noisy environment with sparse and delayed reward. We test our algorithm in an ITS environment with edge cloud computing. Empirical results show that the algorithm is quick to adapt to new environments and performs better in all individual and system metrics compared to the state-of-the-art benchmark. Our algorithm also addresses various practical concerns with its modularized and asynchronous online training method. In addition to the cloud simulation, we test our algorithm on a single-board computer and show that it can make inference in 6 milliseconds.  ( 2 min )
    Bifurcated Attention for Single-Context Large-Batch Sampling
    arXiv:2403.08845v1 Announce Type: new Abstract: In our study, we present bifurcated attention, a method developed for language model inference in single-context batch sampling contexts. This approach aims to reduce redundant memory IO costs, a significant factor in latency for high batch sizes and long context lengths. Bifurcated attention achieves this by dividing the attention mechanism during incremental decoding into two distinct GEMM operations, focusing on the KV cache from prefill and the decoding process. This method ensures precise computation and maintains the usual computational load (FLOPs) of standard attention mechanisms, but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention mechanism known for reduced memory IO for KV cache, further enabling higher batch size and context length. The resulting efficiency leads to lower latency, improving suitability for real-time applications, e.g., enabling massively-parallel answer generation without substantially increasing latency, enhancing performance when integrated with postprocessing techniques such as reranking.  ( 2 min )
    Learning-Enhanced Neighborhood Selection for the Vehicle Routing Problem with Time Windows
    arXiv:2403.08839v1 Announce Type: new Abstract: Large Neighborhood Search (LNS) is a universal approach that is broadly applicable and has proven to be highly efficient in practice for solving optimization problems. We propose to integrate machine learning (ML) into LNS to assist in deciding which parts of the solution should be destroyed and repaired in each iteration of LNS. We refer to our new approach as Learning-Enhanced Neighborhood Selection (LENS for short). Our approach is universally applicable, i.e., it can be applied to any LNS algorithm to amplify the workings of the destroy algorithm. In this paper, we demonstrate the potential of LENS on the fundamental Vehicle Routing Problem with Time Windows (VRPTW). We implemented an LNS algorithm for VRPTW and collected data on generated novel training instances derived from well-known, extensively utilized benchmark datasets. We trained our LENS approach with this data and compared the experimental results of our approach with two benchmark algorithms: a random neighborhood selection method to show that LENS learns to make informed choices and an oracle neighborhood selection method to demonstrate the potential of our LENS approach. With LENS, we obtain results that significantly improve the quality of the solutions.  ( 2 min )
    Predictive Clustering of Vessel Behavior Based on Hierarchical Trajectory Representation
    arXiv:2403.08838v1 Announce Type: new Abstract: Vessel trajectory clustering, which aims to find similar trajectory patterns, has been widely leveraged in overwater applications. Most traditional methods use predefined rules and thresholds to identify discrete vessel behaviors. They aim for high-quality clustering and conduct clustering on entire sequences, whether the original trajectory or its sub-trajectories, failing to represent their evolution. To resolve this problem, we propose a Predictive Clustering of Hierarchical Vessel Behavior (PC-HiV). PC-HiV first uses hierarchical representations to transform every trajectory into a behavioral sequence. Then, it predicts evolution at each timestamp of the sequence based on the representations. By applying predictive clustering and latent encoding, PC-HiV improves clustering and predictions simultaneously. Experiments on real AIS datasets demonstrate PC-HiV's superiority over existing methods, showcasing its effectiveness in capturing behavioral evolution discrepancies between vessel types (tramp vs. liner) and within emission control areas. Results show that our method outperforms NN-Kmeans and Robust DAA by 3.9% and 6.4% of the purity score.  ( 2 min )
    Stacking-based deep neural network for player scouting in football 1
    arXiv:2403.08835v1 Announce Type: new Abstract: Datascouting is one of the most known data applications in professional sport, and specifically football. Its objective is to analyze huge database of players in order to detect high potentials that can be then individually considered by human scouts. In this paper, we propose a stacking-based deep learning model to detect high potential football players. Applied on open-source database, our model obtains significantly better results that classical statistical methods.  ( 2 min )
    Structural Positional Encoding for knowledge integration in transformer-based medical process monitoring
    arXiv:2403.08836v1 Announce Type: new Abstract: Predictive process monitoring is a process mining task aimed at forecasting information about a running process trace, such as the most correct next activity to be executed. In medical domains, predictive process monitoring can provide valuable decision support in atypical and nontrivial situations. Decision support and quality assessment in medicine cannot ignore domain knowledge, in order to be grounded on all the available information (which is not limited to data) and to be really acceptable by end users. In this paper, we propose a predictive process monitoring approach relying on the use of a {\em transformer}, a deep learning architecture based on the attention mechanism. A major contribution of our work lies in the incorporation of ontological domain-specific knowledge, carried out through a graph positional encoding technique. The paper presents and discusses the encouraging experimental result we are collecting in the domain of stroke management.  ( 2 min )
    LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models
    arXiv:2403.08822v1 Announce Type: new Abstract: In addressing the computational and memory demands of fine-tuning Large Language Models(LLMs), we propose LoRA-SP(Streamlined Partial Parameter Adaptation), a novel approach utilizing randomized half-selective parameter freezing within the Low-Rank Adaptation(LoRA)framework. This method efficiently balances pre-trained knowledge retention and adaptability for task-specific optimizations. Through a randomized mechanism, LoRA-SP determines which parameters to update or freeze, significantly reducing computational and memory requirements without compromising model performance. We evaluated LoRA-SP across several benchmark NLP tasks, demonstrating its ability to achieve competitive performance with substantially lower resource consumption compared to traditional full-parameter fine-tuning and other parameter-efficient techniques. LoRA-SP innovative approach not only facilitates the deployment of advanced NLP models in resource-limited settings but also opens new research avenues into effective and efficient model adaptation strategies.  ( 2 min )
    Thermometer: Towards Universal Calibration for Large Language Models
    arXiv:2403.08819v1 Announce Type: new Abstract: We consider the issue of calibration in large language models (LLM). Recent studies have found that common interventions such as instruction tuning often result in poorly calibrated LLMs. Although calibration is well-explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating a LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.  ( 2 min )
    Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary Patterns
    arXiv:2403.08820v1 Announce Type: new Abstract: The opioid crisis has been one of the most critical society concerns in the United States. Although the medication assisted treatment (MAT) is recognized as the most effective treatment for opioid misuse and addiction, the various side effects can trigger opioid relapse. In addition to MAT, the dietary nutrition intervention has been demonstrated its importance in opioid misuse prevention and recovery. However, research on the alarming connections between dietary patterns and opioid misuse remain under-explored. In response to this gap, in this paper, we first establish a large-scale multifaceted dietary benchmark dataset related to opioid users at the first attempt and then develop a novel framework - i.e., namely Opioid Misuse Detection with Interpretable Dietary Patterns (Diet-ODIN) - to bridge heterogeneous graph (HG) and large language model (LLM) for the identification of users with opioid misuse and the interpretation of their associated dietary patterns. Specifically, in Diet-ODIN, we first construct an HG to comprehensively incorporate both dietary and health-related information, and then we devise a holistic graph learning framework with noise reduction to fully capitalize both users' individual dietary habits and shared dietary patterns for the detection of users with opioid misuse. To further delve into the intricate correlations between dietary patterns and opioid misuse, we exploit an LLM by utilizing the knowledge obtained from the graph learning model for interpretation. The extensive experimental results based on our established benchmark with quantitative and qualitative measures demonstrate the outstanding performance of Diet-ODIN in exploring the complex interplay between opioid misuse and dietary patterns, by comparison with state-of-the-art baseline methods.  ( 3 min )
    Multimodal Fusion of EHR in Structures and Semantics: Integrating Clinical Records and Notes with Hypergraph and LLM
    arXiv:2403.08818v1 Announce Type: new Abstract: Electronic Health Records (EHRs) have become increasingly popular to support clinical decision-making and healthcare in recent decades. EHRs usually contain heterogeneous information, such as structural data in tabular form and unstructured data in textual notes. Different types of information in EHRs can complement each other and provide a more complete picture of the health status of a patient. While there has been a lot of research on representation learning of structured EHR data, the fusion of different types of EHR data (multimodal fusion) is not well studied. This is mostly because of the complex medical coding systems used and the noise and redundancy present in the written notes. In this work, we propose a new framework called MINGLE, which integrates both structures and semantics in EHR effectively. Our framework uses a two-level infusion strategy to combine medical concept semantics and clinical note semantics into hypergraph neural networks, which learn the complex interactions between different types of data to generate visit representations for downstream prediction. Experiment results on two EHR datasets, the public MIMIC-III and private CRADLE, show that MINGLE can effectively improve predictive performance by 11.83% relatively, enhancing semantic integration as well as multimodal fusion for structural and textual EHR data.  ( 3 min )
    Shake to Leak: Fine-tuning Diffusion Models Can Amplify the Generative Privacy Risk
    arXiv:2403.09450v1 Announce Type: new Abstract: While diffusion models have recently demonstrated remarkable progress in generating realistic images, privacy risks also arise: published models or APIs could generate training images and thus leak privacy-sensitive training information. In this paper, we reveal a new risk, Shake-to-Leak (S2L), that fine-tuning the pre-trained models with manipulated data can amplify the existing privacy risks. We demonstrate that S2L could occur in various standard fine-tuning strategies for diffusion models, including concept-injection methods (DreamBooth and Textual Inversion) and parameter-efficient methods (LoRA and Hypernetwork), as well as their combinations. In the worst case, S2L can amplify the state-of-the-art membership inference attack (MIA) on diffusion models by $5.4\%$ (absolute difference) AUC and can increase extracted private samples from almost $0$ samples to $16.3$ samples on average per target domain. This discovery underscores that the privacy risk with diffusion models is even more severe than previously recognized. Codes are available at https://github.com/VITA-Group/Shake-to-Leak.  ( 2 min )
    DA-PFL: Dynamic Affinity Aggregation for Personalized Federated Learning
    arXiv:2403.09284v1 Announce Type: new Abstract: Personalized federated learning becomes a hot research topic that can learn a personalized learning model for each client. Existing personalized federated learning models prefer to aggregate similar clients with similar data distribution to improve the performance of learning models. However, similaritybased personalized federated learning methods may exacerbate the class imbalanced problem. In this paper, we propose a novel Dynamic Affinity-based Personalized Federated Learning model (DA-PFL) to alleviate the class imbalanced problem during federated learning. Specifically, we build an affinity metric from a complementary perspective to guide which clients should be aggregated. Then we design a dynamic aggregation strategy to dynamically aggregate clients based on the affinity metric in each round to reduce the class imbalanced risk. Extensive experiments show that the proposed DA-PFL model can significantly improve the accuracy of each client in three real-world datasets with state-of-the-art comparison methods.  ( 2 min )
    ADEdgeDrop: Adversarial Edge Dropping for Robust Graph Neural Networks
    arXiv:2403.09171v1 Announce Type: new Abstract: Although Graph Neural Networks (GNNs) have exhibited the powerful ability to gather graph-structured information from neighborhood nodes via various message-passing mechanisms, the performance of GNNs is limited by poor generalization and fragile robustness caused by noisy and redundant graph data. As a prominent solution, Graph Augmentation Learning (GAL) has recently received increasing attention. Among prior GAL approaches, edge-dropping methods that randomly remove edges from a graph during training are effective techniques to improve the robustness of GNNs. However, randomly dropping edges often results in bypassing critical edges, consequently weakening the effectiveness of message passing. In this paper, we propose a novel adversarial edge-dropping method (ADEdgeDrop) that leverages an adversarial edge predictor guiding the removal of edges, which can be flexibly incorporated into diverse GNN backbones. Employing an adversarial training framework, the edge predictor utilizes the line graph transformed from the original graph to estimate the edges to be dropped, which improves the interpretability of the edge-dropping method. The proposed ADEdgeDrop is optimized alternately by stochastic gradient descent and projected gradient descent. Comprehensive experiments on six graph benchmark datasets demonstrate that the proposed ADEdgeDrop outperforms state-of-the-art baselines across various GNN backbones, demonstrating improved generalization and robustness.  ( 2 min )
    Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks
    arXiv:2403.08837v1 Announce Type: new Abstract: Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory required to store the model's activations peaks at the end of the forward pass, and gradients must be simultaneously averaged at the end of the backpropagation step. We propose Cyclic Data Parallelism, a novel paradigm shifting the execution of the micro-batches from simultaneous to sequential, with a uniform delay. At the cost of a slight gradient delay, the total memory taken by activations is constant, and the gradient communications are balanced during the training step. With Model Parallelism, our technique reduces the number of GPUs needed, by sharing GPUs across micro-batches. Within the ZeRO-DP framework, our technique allows communication of the model states with point-to-point operations rather than a collective broadcast operation. We illustrate the strength of our approach on the CIFAR-10 and ImageNet datasets.  ( 2 min )
    Predictive Analysis of Tuberculosis Treatment Outcomes Using Machine Learning: A Karnataka TB Data Study at a Scale
    arXiv:2403.08834v1 Announce Type: new Abstract: Tuberculosis (TB) remains a global health threat, ranking among the leading causes of mortality worldwide. In this context, machine learning (ML) has emerged as a transformative force, providing innovative solutions to the complexities associated with TB treatment.This study explores how machine learning, especially with tabular data, can be used to predict Tuberculosis (TB) treatment outcomes more accurately. It transforms this prediction task into a binary classification problem, generating risk scores from patient data sourced from NIKSHAY, India's national TB control program, which includes over 500,000 patient records. Data preprocessing is a critical component of the study, and the model achieved an recall of 98% and an AUC-ROC score of 0.95 on the validation set, which includes 20,000 patient records.We also explore the use of Natural Language Processing (NLP) for improved model learning. Our results, corroborated by various metrics and ablation studies, validate the effectiveness of our approach. The study concludes by discussing the potential ramifications of our research on TB eradication efforts and proposing potential avenues for future work. This study marks a significant stride in the battle against TB, showcasing the potential of machine learning in healthcare.  ( 3 min )
  • Open

    LDReg: Local Dimensionality Regularized Self-Supervised Learning
    arXiv:2401.10474v2 Announce Type: replace-cross Abstract: Representations learned via self-supervised learning (SSL) can be susceptible to dimensional collapse, where the learned representation subspace is of extremely low dimensionality and thus fails to represent the full data distribution and modalities. Dimensional collapse also known as the "underfilling" phenomenon is one of the major causes of degraded performance on downstream tasks. Previous work has investigated the dimensional collapse problem of SSL at a global level. In this paper, we demonstrate that representations can span over high dimensional space globally, but collapse locally. To address this, we propose a method called $\textit{local dimensionality regularization (LDReg)}$. Our formulation is based on the derivation of the Fisher-Rao metric to compare and optimize local distance distributions at an asymptotically small radius for each data point. By increasing the local intrinsic dimensionality, we demonstrate through a range of experiments that LDReg improves the representation quality of SSL. The results also show that LDReg can regularize dimensionality at both local and global levels.  ( 2 min )
    Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions
    arXiv:2310.07174v2 Announce Type: replace-cross Abstract: Sorting is a fundamental operation of all computer systems, having been a long-standing significant research topic. Beyond the problem formulation of traditional sorting algorithms, we consider sorting problems for more abstract yet expressive inputs, e.g., multi-digit images and image fragments, through a neural sorting network. To learn a mapping from a high-dimensional input to an ordinal variable, the differentiability of sorting networks needs to be guaranteed. In this paper we define a softening error by a differentiable swap function, and develop an error-free swap function that holds a non-decreasing condition and differentiability. Furthermore, a permutation-equivariant Transformer network with multi-head attention is adopted to capture dependency between given inputs and also leverage its model capacity with self-attention. Experiments on diverse sorting benchmarks show that our methods perform better than or comparable to baseline methods.  ( 2 min )
    Vanishing Gradients in Reinforcement Finetuning of Language Models
    arXiv:2310.20703v3 Announce Type: replace-cross Abstract: Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms. This work identifies a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful for inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.  ( 3 min )
    Delta-AI: Local objectives for amortized inference in sparse graphical models
    arXiv:2310.02423v2 Announce Type: replace-cross Abstract: We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $\Delta$-amortized inference ($\Delta$-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The $\Delta$-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate $\Delta$-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure.  ( 2 min )
    Model-based causal feature selection for general response types
    arXiv:2309.12833v3 Announce Type: replace-cross Abstract: Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings and exploits that causal models are invariant. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor type I error control) and additive noise models are not suitable for applications in which the response is not measured on a continuous scale, but reflects categories or counts. Here, we develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (these model classes, generally, do not allow for identifiability when there is no exogenous heterogeneity). As an invariance test, we propose TRAM-GCM based on the expected conditional covariance between environments and score residuals with uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we also consider TRAM-Wald, which tests invariance based on the Wald statistic. We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients.  ( 3 min )
    Kernelized Reinforcement Learning with Order Optimal Regret Bounds
    arXiv:2306.07745v3 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has shown empirical success in various real world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, when the state-action value function is represented by a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant polynomial in the number of episodes improvement over the state of the art. In particular, with highly non-smooth kernels (such as Neural Tangent kernel or some Mat\'ern kernels) the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order optimal in the case of Mat\'ern kernels where a lower bound on regret is known.  ( 3 min )
    Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers
    arXiv:2309.10639v4 Announce Type: replace-cross Abstract: In this paper, we explicitly determine local and global minimizers of the $\mathcal{L}^2$ cost function in underparametrized Deep Learning (DL) networks; our main goal is to shed light on their geometric structure and properties. We accomplish this by a direct construction, without invoking the gradient descent flow at any point of this work. We specifically consider $L$ hidden layers, a ReLU ramp activation function, an $\mathcal{L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input and output spaces $\mathbb{R}^Q$ with equal dimension $Q\geq1$, and hidden layers also defined on $\mathbb{R}^{Q}$; the training inputs are assumed to be sufficiently clustered. The training input size $N$ can be arbitrarily large - thus, we are considering the underparametrized regime. More general settings are left to future work. We construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function. In the context presented here, the concatenation of hidden layers of the DL network is reinterpreted as a recursive application of a {\em truncation map} which "curates" the training inputs by minimizing their noise to signal ratio.  ( 3 min )
    Viral Load Inference in Non-Adaptive Pooled Testing
    arXiv:2403.09130v1 Announce Type: cross Abstract: Medical diagnostic testing can be made significantly more efficient using pooled testing protocols. These typically require a sparse infection signal and use either binary or real-valued entries of O(1). However, existing methods do not allow for inferring viral loads which span many orders of magnitude. We develop a message passing algorithm coupled with a PCR (Polymerase Chain Reaction) specific noise function to allow accurate inference of realistic viral load signals. This work is in the non-adaptive setting and could open the possibility of efficient screening where viral load determination is clinically important.  ( 2 min )
    Thermometer: Towards Universal Calibration for Large Language Models
    arXiv:2403.08819v1 Announce Type: cross Abstract: We consider the issue of calibration in large language models (LLM). Recent studies have found that common interventions such as instruction tuning often result in poorly calibrated LLMs. Although calibration is well-explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating a LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.  ( 2 min )
    Upper Bound of Bayesian Generalization Error in Partial Concept Bottleneck Model (CBM): Partial CBM outperforms naive CBM
    arXiv:2403.09206v1 Announce Type: new Abstract: Concept Bottleneck Model (CBM) is a methods for explaining neural networks. In CBM, concepts which correspond to reasons of outputs are inserted in the last intermediate layer as observed values. It is expected that we can interpret the relationship between the output and concept similar to linear regression. However, this interpretation requires observing all concepts and decreases the generalization performance of neural networks. Partial CBM (PCBM), which uses partially observed concepts, has been devised to resolve these difficulties. Although some numerical experiments suggest that the generalization performance of PCBMs is almost as high as that of the original neural networks, the theoretical behavior of its generalization error has not been yet clarified since PCBM is singular statistical model. In this paper, we reveal the Bayesian generalization error in PCBM with a three-layered and linear architecture. The result indcates that the structure of partially observed concepts decreases the Bayesian generalization error compared with that of CBM (full-observed concepts).  ( 2 min )
    The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instability
    arXiv:2310.17467v2 Announce Type: replace Abstract: Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, variational inference and stochastic calculus, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We show that these phase-transitions are always in a mean-field universality class, as they are the result of a self-consistency condition in the generative dynamics. We argue that the critical instability that arises from the phase transitions lies at the heart of their generative capabilities, which are characterized by a set of mean field critical exponents. Furthermore, using the statistical physics of disordered systems, we show that memorization can be understood as a form of critical condensation corresponding to a disordered phase transition. Finally, we show that the dynamic equation of the generative process can be interpreted as a stochastic adiabatic transformation that minimizes the free energy while keeping the system in thermal equilibrium.  ( 2 min )
    Expressive Losses for Verified Robustness via Convex Combinations
    arXiv:2305.13991v2 Announce Type: replace-cross Abstract: In order to train networks for verified adversarial robustness, it is common to over-approximate the worst-case loss over perturbation regions, resulting in networks that attain verifiability at the expense of standard performance. As shown in recent work, better trade-offs between accuracy and robustness can be obtained by carefully coupling adversarial training with over-approximations. We hypothesize that the expressivity of a loss function, which we formalize as the ability to span a range of trade-offs between lower and upper bounds to the worst-case loss through a single parameter (the over-approximation coefficient), is key to attaining state-of-the-art performance. To support our hypothesis, we show that trivial expressive losses, obtained via convex combinations between adversarial attacks and IBP bounds, yield state-of-the-art results across a variety of settings in spite of their conceptual simplicity. We provide a detailed analysis of the relationship between the over-approximation coefficient and performance profiles across different expressive losses, showing that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.  ( 2 min )
    Model-free Reinforcement Learning of Semantic Communication by Stochastic Policy Gradient
    arXiv:2305.03571v2 Announce Type: replace-cross Abstract: Following the recent success of Machine Learning tools in wireless communications, the idea of semantic communication by Weaver from 1949 has gained attention. It breaks with Shannon's classic design paradigm by aiming to transmit the meaning, i.e., semantics, of a message instead of its exact version, allowing for information rate savings. In this work, we apply the Stochastic Policy Gradient (SPG) to design a semantic communication system by reinforcement learning, separating transmitter and receiver, and not requiring a known or differentiable channel model -- a crucial step towards deployment in practice. Further, we derive the use of SPG for both classic and semantic communication from the maximization of the mutual information between received and target variables. Numerical results show that our approach achieves comparable performance to a model-aware approach based on the reparametrization trick, albeit with a decreased convergence rate.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning
    arXiv:2103.06712v4 Announce Type: replace-cross Abstract: Quantum machine learning -- and specifically Variational Quantum Algorithms (VQAs) -- offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for VQAs. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 3 min )
    Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality
    arXiv:2207.04173v3 Announce Type: replace-cross Abstract: We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of H\'ajek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal.  ( 2 min )
    Stable Training of Probabilistic Models Using the Leave-One-Out Maximum Log-Likelihood Objective
    arXiv:2310.03556v2 Announce Type: replace Abstract: Probabilistic modelling of power systems operation and planning processes depends on data-driven methods, which require sufficiently large datasets. When historical data lacks this, it is desired to model the underlying data generation mechanism as a probability distribution to assess the data quality and generate more data, if needed. Kernel density estimation (KDE) based models are popular choices for this task, but they fail to adapt to data regions with varying densities. In this paper, an adaptive KDE model is employed to circumvent this, where each kernel in the model has an individual bandwidth. The leave-one-out maximum log-likelihood (LOO-MLL) criterion is proposed to prevent the singular solutions that the regular MLL criterion gives rise to, and it is proven that LOO-MLL prevents these. Relying on this guaranteed robustness, the model is extended by adjustable weights for the kernels. In addition, a modified expectation-maximization algorithm is employed to accelerate the optimization speed reliably. The performance of the proposed method and models are exhibited on two power systems datasets using different statistical tests and by comparison with Gaussian mixture models. Results show that the proposed models have promising performance, in addition to their singularity prevention guarantees.  ( 2 min )
    Extremal graphical modeling with latent variables
    arXiv:2403.09604v1 Announce Type: cross Abstract: Extremal graphical models encode the conditional independence structure of multivariate extremes and provide a powerful tool for quantifying the risk of rare events. Prior work on learning these graphs from data has focused on the setting where all relevant variables are observed. For the popular class of H\"usler-Reiss models, we propose the \texttt{eglatent} method, a tractable convex program for learning extremal graphical models in the presence of latent variables. Our approach decomposes the H\"usler-Reiss precision matrix into a sparse component encoding the graphical structure among the observed variables after conditioning on the latent variables, and a low-rank component encoding the effect of a few latent variables on the observed variables. We provide finite-sample guarantees of \texttt{eglatent} and show that it consistently recovers the conditional graph as well as the number of latent variables. We highlight the improved performances of our approach on synthetic and real data.  ( 2 min )
    Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning
    arXiv:2403.09621v1 Announce Type: cross Abstract: Distributionally robust offline reinforcement learning (RL), which seeks robust policy training against environment perturbation by modeling dynamics uncertainty, calls for function approximations when facing large state-action spaces. However, the consideration of dynamics uncertainty introduces essential nonlinearity and computational burden, posing unique challenges for analyzing and practically employing function approximation. Focusing on a basic setting where the nominal model and perturbed models are linearly parameterized, we propose minimax optimal and computationally efficient algorithms realizing function approximation and initiate the study on instance-dependent suboptimality analysis in the context of robust offline RL. Our results uncover that function approximation in robust offline RL is essentially distinct from and probably harder than that in standard offline RL. Our algorithms and theoretical results crucially depend on a variety of new techniques, involving a novel function approximation mechanism incorporating variance information, a new procedure of suboptimality and estimation uncertainty decomposition, a quantification of the robust value function shrinkage, and a meticulously designed family of hard instances, which might be of independent interest.  ( 2 min )
    Scalability of Metropolis-within-Gibbs schemes for high-dimensional Bayesian models
    arXiv:2403.09416v1 Announce Type: cross Abstract: We study general coordinate-wise MCMC schemes (such as Metropolis-within-Gibbs samplers), which are commonly used to fit Bayesian non-conjugate hierarchical models. We relate their convergence properties to the ones of the corresponding (potentially not implementable) Gibbs sampler through the notion of conditional conductance. This allows us to study the performances of popular Metropolis-within-Gibbs schemes for non-conjugate hierarchical models, in high-dimensional regimes where both number of datapoints and parameters increase. Given random data-generating assumptions, we establish dimension-free convergence results, which are in close accordance with numerical evidences. Applications to Bayesian models for binary regression with unknown hyperparameters and discretely observed diffusions are also discussed. Motivated by such statistical applications, auxiliary results of independent interest on approximate conductances and perturbation of Markov operators are provided.  ( 2 min )
    Recursive Causal Discovery
    arXiv:2403.09300v1 Announce Type: cross Abstract: Causal discovery, i.e., learning the causal graph from data, is often the first step toward the identification and estimation of causal effects, a key requirement in numerous scientific domains. Causal discovery is hampered by two main challenges: limited data results in errors in statistical testing and the computational complexity of the learning task is daunting. This paper builds upon and extends four of our prior publications (Mokhtarian et al., 2021; Akbari et al., 2021; Mokhtarian et al., 2022, 2023a). These works introduced the concept of removable variables, which are the only variables that can be removed recursively for the purpose of causal discovery. Presence and identification of removable variables allow recursive approaches for causal discovery, a promising solution that helps to address the aforementioned challenges by reducing the problem size successively. This reduction not only minimizes conditioning sets in each conditional independence (CI) test, leading to fewer errors but also significantly decreases the number of required CI tests. The worst-case performances of these methods nearly match the lower bound. In this paper, we present a unified framework for the proposed algorithms, refined with additional details and enhancements for a coherent presentation. A comprehensive literature review is also included, comparing the computational complexity of our methods with existing approaches, showcasing their state-of-the-art efficiency. Another contribution of this paper is the release of RCD, a Python package that efficiently implements these algorithms. This package is designed for practitioners and researchers interested in applying these methods in practical scenarios. The package is available at github.com/ban-epfl/rcd, with comprehensive documentation provided at rcdpackage.com.  ( 3 min )
    Analysis of singular subspaces under random perturbations
    arXiv:2403.09170v1 Announce Type: cross Abstract: We present a comprehensive analysis of singular vector and singular subspace perturbations in the context of the signal plus random Gaussian noise matrix model. Assuming a low-rank signal matrix, we extend the Wedin-Davis-Kahan theorem in a fully generalized manner, applicable to any unitarily invariant matrix norm, extending previous results of O'Rourke, Vu and the author. We also obtain the fine-grained results, which encompass the $\ell_\infty$ analysis of singular vectors, the $\ell_{2, \infty}$ analysis of singular subspaces, as well as the exploration of linear and bilinear functions related to the singular vectors. Moreover, we explore the practical implications of these findings, in the context of the Gaussian mixture model and the submatrix localization problem.  ( 2 min )
    Optimal Top-Two Method for Best Arm Identification and Fluid Analysis
    arXiv:2403.09123v1 Announce Type: cross Abstract: Top-$2$ methods have become popular in solving the best arm identification (BAI) problem. The best arm, or the arm with the largest mean amongst finitely many, is identified through an algorithm that at any sequential step independently pulls the empirical best arm, with a fixed probability $\beta$, and pulls the best challenger arm otherwise. The probability of incorrect selection is guaranteed to lie below a specified $\delta >0$. Information theoretic lower bounds on sample complexity are well known for BAI problem and are matched asymptotically as $\delta \rightarrow 0$ by computationally demanding plug-in methods. The above top 2 algorithm for any $\beta \in (0,1)$ has sample complexity within a constant of the lower bound. However, determining the optimal $\beta$ that matches the lower bound has proven difficult. In this paper, we address this and propose an optimal top-2 type algorithm. We consider a function of allocations anchored at a threshold. If it exceeds the threshold then the algorithm samples the empirical best arm. Otherwise, it samples the challenger arm. We show that the proposed algorithm is optimal as $\delta \rightarrow 0$. Our analysis relies on identifying a limiting fluid dynamics of allocations that satisfy a series of ordinary differential equations pasted together and that describe the asymptotic path followed by our algorithm. We rely on the implicit function theorem to show existence and uniqueness of these fluid ode's and to show that the proposed algorithm remains close to the ode solution.  ( 3 min )
    Moments of Clarity: Streamlining Latent Spaces in Machine Learning using Moment Pooling
    arXiv:2403.08854v1 Announce Type: cross Abstract: Many machine learning applications involve learning a latent representation of data, which is often high-dimensional and difficult to directly interpret. In this work, we propose "Moment Pooling", a natural extension of Deep Sets networks which drastically decrease latent space dimensionality of these networks while maintaining or even improving performance. Moment Pooling generalizes the summation in Deep Sets to arbitrary multivariate moments, which enables the model to achieve a much higher effective latent dimensionality for a fixed latent dimension. We demonstrate Moment Pooling on the collider physics task of quark/gluon jet classification by extending Energy Flow Networks (EFNs) to Moment EFNs. We find that Moment EFNs with latent dimensions as small as 1 perform similarly to ordinary EFNs with higher latent dimension. This small latent dimension allows for the internal representation to be directly visualized and interpreted, which in turn enables the learned internal jet representation to be extracted in closed form.  ( 2 min )
    Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks
    arXiv:2403.08837v1 Announce Type: cross Abstract: Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory required to store the model's activations peaks at the end of the forward pass, and gradients must be simultaneously averaged at the end of the backpropagation step. We propose Cyclic Data Parallelism, a novel paradigm shifting the execution of the micro-batches from simultaneous to sequential, with a uniform delay. At the cost of a slight gradient delay, the total memory taken by activations is constant, and the gradient communications are balanced during the training step. With Model Parallelism, our technique reduces the number of GPUs needed, by sharing GPUs across micro-batches. Within the ZeRO-DP framework, our technique allows communication of the model states with point-to-point operations rather than a collective broadcast operation. We illustrate the strength of our approach on the CIFAR-10 and ImageNet datasets.  ( 2 min )
    Variational Inference with Sequential Sample-Average Approximations
    arXiv:2403.09429v1 Announce Type: new Abstract: We present variational inference with sequential sample-average approximation (VISA), a method for approximate inference in computationally intensive models, such as those based on numerical simulations. VISA extends importance-weighted forward-KL variational inference by employing a sequence of sample-average approximations, which are considered valid inside a trust region. This makes it possible to reuse model evaluations across multiple gradient steps, thereby reducing computational cost. We perform experiments on high-dimensional Gaussians, Lotka-Volterra dynamics, and a Pickover attractor, which demonstrate that VISA can achieve comparable approximation accuracy to standard importance-weighted forward-KL variational inference with computational savings of a factor two or more for conservatively chosen learning rates.  ( 2 min )
    Pantypes: Diverse Representatives for Self-Explainable Models
    arXiv:2403.09383v1 Announce Type: new Abstract: Prototypical self-explainable classifiers have emerged to meet the growing demand for interpretable AI systems. These classifiers are designed to incorporate high transparency in their decisions by basing inference on similarity with learned prototypical objects. While these models are designed with diversity in mind, the learned prototypes often do not sufficiently represent all aspects of the input distribution, particularly those in low density regions. Such lack of sufficient data representation, known as representation bias, has been associated with various detrimental properties related to machine learning diversity and fairness. In light of this, we introduce pantypes, a new family of prototypical objects designed to capture the full diversity of the input distribution through a sparse set of objects. We show that pantypes can empower prototypical self-explainable models by occupying divergent regions of the latent space and thus fostering high diversity, interpretability and fairness.  ( 2 min )
    Towards Model-Agnostic Posterior Approximation for Fast and Accurate Variational Autoencoders
    arXiv:2403.08941v1 Announce Type: new Abstract: Inference for Variational Autoencoders (VAEs) consists of learning two models: (1) a generative model, which transforms a simple distribution over a latent space into the distribution over observed data, and (2) an inference model, which approximates the posterior of the latent codes given data. The two components are learned jointly via a lower bound to the generative model's log marginal likelihood. In early phases of joint training, the inference model poorly approximates the latent code posteriors. Recent work showed that this leads optimization to get stuck in local optima, negatively impacting the learned generative model. As such, recent work suggests ensuring a high-quality inference model via iterative training: maximizing the objective function relative to the inference model before every update to the generative model. Unfortunately, iterative training is inefficient, requiring heuristic criteria for reverting from iterative to joint training for speed. Here, we suggest an inference method that trains the generative and inference models independently. It approximates the posterior of the true model a priori; fixing this posterior approximation, we then maximize the lower bound relative to only the generative model. By conventional wisdom, this approach should rely on the true prior and likelihood of the true model to approximate its posterior (which are unknown). However, we show that we can compute a deterministic, model-agnostic posterior approximation (MAPA) of the true model's posterior. We then use MAPA to develop a proof-of-concept inference method. We present preliminary results on low-dimensional synthetic data that (1) MAPA captures the trend of the true posterior, and (2) our MAPA-based inference performs better density estimation with less computation than baselines. Lastly, we present a roadmap for scaling the MAPA-based inference method to high-dimensional data.  ( 3 min )
    Majority-of-Three: The Simplest Optimal Learner?
    arXiv:2403.08831v1 Announce Type: new Abstract: Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke's algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm's error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.  ( 2 min )
    A non-asymptotic theory of Kernel Ridge Regression: deterministic equivalents, test error, and GCV estimator
    arXiv:2403.08938v1 Announce Type: new Abstract: We consider learning an unknown target function $f_*$ using kernel ridge regression (KRR) given i.i.d. data $(u_i,y_i)$, $i\leq n$, where $u_i \in U$ is a covariate vector and $y_i = f_* (u_i) +\varepsilon_i \in \mathbb{R}$. A recent string of work has empirically shown that the test error of KRR can be well approximated by a closed-form estimate derived from an `equivalent' sequence model that only depends on the spectrum of the kernel operator. However, a theoretical justification for this equivalence has so far relied either on restrictive assumptions -- such as subgaussian independent eigenfunctions -- , or asymptotic derivations for specific kernels in high dimensions. In this paper, we prove that this equivalence holds for a general class of problems satisfying some spectral and concentration properties on the kernel eigendecomposition. Specifically, we establish in this setting a non-asymptotic deterministic approximation for the test error of KRR -- with explicit non-asymptotic bounds -- that only depends on the eigenvalues and the target function alignment to the eigenvectors of the kernel. Our proofs rely on a careful derivation of deterministic equivalents for random matrix functionals in the dimension free regime pioneered by Cheng and Montanari (2022). We apply this setting to several classical examples and show an excellent agreement between theoretical predictions and numerical simulations. These results rely on having access to the eigendecomposition of the kernel operator. Alternatively, we prove that, under this same setting, the generalized cross-validation (GCV) estimator concentrates on the test error uniformly over a range of ridge regularization parameter that includes zero (the interpolating solution). As a consequence, the GCV estimator can be used to estimate from data the test error and optimal regularization parameter for KRR.  ( 3 min )

  • Open

    [D] Help with a project . Urgent!
    How do we use BLOOM to create a model that translates any emergency message to 4 or more different languages. Pls help submitted by /u/Serious-Jackfruit-61 [link] [comments]
    [D] my younger brother wants to pivot from his career in chemistry into AI. He wants to work for companies like anthropic or open AI. What postgraduate degree(s) will he need an order to successfully do this?
    He’s currently taking online computer science courses in order to teach himself. A lot of these jobs in AI are going to people who already have postgraduate training in the field, and we just don’t know what the best degree would be for him to pursue in order to make this happen. submitted by /u/Careful_Fig8482 [link] [comments]
    [D] Assistants API with Image support
    Hey everyone! There is a lot of models and Al services it's getting hard to keep a track of all of them. What would you suggest is the best (price / quality ratio) way to achieve something similar to assistants API (by OpenAI) with image uploading support? (It can be a combination of a couple of models) submitted by /u/Gustavoayonas [link] [comments]
    OPEN AI FOR ANOMALY DETECTION [P]
    Currently working on a project of anomaly detection using logs data as input, I'm confused between using classical machine learning for this project or if open ai can be a solution here , any suggestions ? Note that the model needs to detect where we have a decrease in sales ( anomaly ) and also explain why .. submitted by /u/WASSIDI [link] [comments]
    [D] Before getting converted into embeddings, what format is used to store the large amount of data (assuming it is compressed)?
    - When it comes to image and audio multi modality, how does the options vary for data storage? - Which platforms offer the storage for these data? submitted by /u/metalvendetta [link] [comments]
    [R] AutoDev: Automated AI-Driven Development - Microsoft 2024
    Paper: https://arxiv.org/abs/2403.08299 Github includes Code + AutoDev Coder Model: https://github.com/unit-mesh/auto-dev Abstract: The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev …
    [D] open ai fast grants results out?
    Hello all, has anyone been contacted back for their open superalignment fast grant proposals? reviews are due on Sunday but I am wondering whether anyone has heard back, assuming time off during weekends for the team? thank you in advance submitted by /u/CheetahFair2770 [link] [comments]
    [P] Reproducing the "Self-Rewarding Language Models" Paper by MetaAI
    Hey all, After reading the Self-Rewarding Language Models paper by the team at Meta, it felt very approachable and reproducible, so we spent some time implementing it. ​ The scripts provided take any base model and put it in a loop of : 1) Supervised fine-tuning on an initial dataset 2) Generating new prompts using the SFT 3) Generating N responses per prompt 4) Scoring the generated responses 1-5 5) Running DPO on the rewards from the model itself. ​ We've run it through one loop starting with a Mistral-7b base model and the results are pretty encouraging so far. ​ Feel free to check it out or run it for yourself and let us know what you think: https://github.com/Oxen-AI/Self-Rewarding-Language-Models submitted by /u/FallMindless3563 [link] [comments]
    [D] What are my options for doing constrained inference from an unconstrained model?
    I would like to train a model to predict the likelihood of a trajectory of actions. My training dataset allows actions that I won't have access to at inference time. I would like to infer the most likely action trajectory within a convex constraint set in the action space (constraint is the same for every action timestep). Any advice on how to approach this? My ideas so far are constrained beam search which is popular in LLMS. I am also considering constrained diffusion, but since there isn't a direct probability term and the convex edges may induce local minima, I'm not sure if this would work. Correct me if I am wrong, but once you constrain a policies output you now have to infer many times and compare their relative likelihood right? submitted by /u/jms4607 [link] [comments]
    [D] Blogs Similar to distill.pub?
    Hello everyone, I'm a big fan of distill.pub and often find myself revisiting some of their posts, even after they went on hiatus. Their articles are elegantly crafted and intuitively explain concepts, usually through visual means. I'm curious if there are any other blogs or websites similar to distill.pub that you would recommend? submitted by /u/JellyBean_Collector [link] [comments]
    [P] Describe Beta: The most descriptive audiovisual summaries for videos with VLMs + Whisper
    ​ Demo Video: https://stream.new/v/UoUDYkwm5JQTgE25Ak6PwpKqg4BDzY6HipobKgxQGOo Hey folks! We just built a cost-effective, lightweight way to generate audiovisual summaries for videos. Process videos up to 12x faster than realtime Costs <$0.01 / min of video Combines visual and audial components The goal here is not to build a single E2E model but something that could actually be used in production while preserving relatively high quality. You can try it out yourself here: https://www.sievedata.com/functions/sieve/describe How we built it: https://www.sievedata.com/blog/describe-video-summary-beta-launch The code: https://github.com/sieve-community/describe ​ submitted by /u/happybirthday290 [link] [comments]
    [P] Suggestions for manual image annotation (like v7 but free?)
    I'm trying to set up a library for my U-Net object classifier. I'm starting off super simple basically detecting two different 2-D shapes and then I'll build out from there. I wanted to have a training library made of manual annotations and another off automatically segmented masks for comparison when feeding into the model. I've made the masks easy enough but I'm now in the market for something to facillitate manual annotations? I've used v7 darwin in the past for something like this but I didn't have to pay for it myself. Now I'm looking at the cost and I can't forsee my PI approving a $150/month bill that I'm only going to use once for basic object tracing I've tried label studio but its manual annotation tool was unusable. Does anyone have any other suggestions? I'm specifically looking for something that has a paint brush tool or equivalent because a bounding box annotator is basically just going to mimic what I've already done making masks submitted by /u/Ancient-Mountain4612 [link] [comments]
    [D] How to survey and choose good toolkit to build your research code base?
    Like the title said, I’m working on a research about Sparse Mixture of Experts and need to survey and choose a toolkit to build my research code base. Since my mentor have many ideas for future research, we want to write a good code base. I think I should consider: - The toolkit I choose is up-to-date with many technologies & have good support. - Have scalability. Is there any other things I should consider? Besides, how to search for toolkits? I found some like Aloxotl or vLLM, but vLLM seems to only support inference while our research need to training the model too. Thanks for your help! submitted by /u/ma-d-ghost [link] [comments]
    [D] What is the best way to index events in a video
    So it seems that there’s two ways to tackle this, either by making predictions on a frame level and getting the start and end frame to get the timestamps. Alternatively, you could make predictions on a series of frames (e.g. a selection of 10) to try incorporate the temporal features. The second option appears better, however I’m wondering if you can get around this by including more features such as optical flow. submitted by /u/RxPiku [link] [comments]
    [D] Does `zero padding` in CNN increase the inference time due to frequent data transfer between CPU and GPU?
    Hello, I think the zero-padding in CNN is a very annoying operation when designing a DNN accelerator or an NPU. So I want to know that if zero-padding is executed on CPU or GPU in modern ML framework like Pytorch/TF(I'm not good at them)? If it is done on CPU, wouldn't this opperation greatly increase the total inference time due to the frequent data movement between CPU and GPU when there are continuous layers that need padding? Otherwise, how does GPU finish the padding operation? I think such kind of find-grained data manipulation should be very inefficient. I think I didn't find out much information about this. So I hope someone can help me with this question. Thanks! submitted by /u/ExcitingInternet6083 [link] [comments]
    [P] Semi-automatic object labeling
    Hi friends, I have a question regarding object detection and labeling. I would like to build my own object detector by finetuning a pretrained model like DETR. To save time I, my plan is to obtain the bounding boxes of the training data from the Segment Anything model. So I just have to categorize these BB according to their object class. Is there any tool that can help me with that? Something like that I assign some class labels by hand and the tool suggests labels for similar looking BB. I know about the Grounding-DINO, but the results are not reliable for my use-case. submitted by /u/VGHMD [link] [comments]
    [R] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
    submitted by /u/topcodemangler [link] [comments]
    [R] Towards General-Purpose In-Context Learning Agents
    Paper: https://openreview.net/forum?id=eDZJTdUsfe Talk and slides: https://neurips.cc/virtual/2023/79880 Blog post: http://louiskirsch.com/glas Abstract: Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution. submitted by /u/SunsetOneSix [link] [comments]
    [PROJECT] Which pre-built architectures are used for exam marking?
    So I've taken up a new nlp project to auto mark essays (kinda like Keath.ai) but I don't want to go through the whole process of creating the actual network architecture. I've been looking for an architecture that is built and I can retrain it on my data while fine tuning it. Any ideas? submitted by /u/No_Arm9 [link] [comments]
    [D] Building a ML system vs project.
    I personally spend my time split between traditional software engineer and machine learning. Every time I am building or working on a model my OCD kicks in, notebooks, scripts, no real abstractions etc. All the ML code feels like it may be thrown away until it works and at that point I may as well keep the model created from the garbage code. Having read and asked questions in the past I think many of us suffer with this, and yes there are some best practices but still ML projects seems more one time use VS a long living piece of software. So what I would love to discuss is, have people considered building their ML system in the same way as software? So for example every model has a route, each report has a UI view. Users can login and play with each feature of the system and it is tested! Thoughts? If anyone has seen this concept before any references would be amazing!! submitted by /u/Suspicious_Dress_350 [link] [comments]
    [D] What are some well-written ML codebases to refer to get inspiration on good ML software design?
    What publicly available ml projects would you refer to as examples of good software design for ML? I’m referring to aspects like how the abstract model/data set/metric classes defined, how easy is it to add a new functionality based on that design, and about overall experience of using them. For example, I believe scikit-learn is an example of good design. The fit/preditct paradigm is extremely easy to understand even for a newcomer. Most modern projects seem to be using a config-driven dynamic initialization of objects and I’d also appreciate resources on good practices around such design. Some examples for such design are huggingface and hydra-based experimentation code bases. The links to posts where the authors explain their design philosophy would also be helpful. For example, huggingface has a “Repeat Yourself” philosophy as opposed to “Don’t Repeat Yourself”. It will also help to list the libraries to avoid. Thanks! submitted by /u/unemployed_MLE [link] [comments]
    [D] Several Questions on PhD for industry
    I'm currently a junior working towards my bachelor's in CS. I'm doing research that I'm hoping to get published, and I discovered that I really like machine learning. My professor is pushing me to get a PhD and believes that I am competitive for one, so I have the following questions: I intend to work in industry on (hopefully but probably unlikely) the latest ml tech where I'm making the models. Would I need a PhD for this? I heard you should only get a PhD if you like doing research. I like new ideas and experimentation, but don't like writing papers. Does this still count? How lucrative is ML in industry? I've heard the numbers are insanely inflated and they tend to be the same level or lower than conventional SWE positions. ​ submitted by /u/DisapointingLO [link] [comments]
  • Open

    MuZero simple implementation
    I want to try to benchmark the comparrison of Stochastic MuZero vs classic/original MuZero. All the implementations I've found were heavily optimized. I am looking for a simple and readable implementation of MuZero that I can easily understand or modify. Do you have any recommendation? submitted by /u/_Hardric [link] [comments]
    PPO learns and performs perfectly during training (with exploration), but fails to perform well during evaluation (without exploration)
    I have tried to tune learning rate, discount factor, lambda, clip value, entropy regularization coefficient and L2 regularization coefficient, but no luck so far. Any suggestion to solve this issue? Thanks in advance! submitted by /u/Apprehensive_Bag1262 [link] [comments]
    Towards General-Purpose In-Context Learning Agents
    Paper: https://openreview.net/forum?id=eDZJTdUsfe Talk and slides: https://neurips.cc/virtual/2023/79880 Blog post: http://louiskirsch.com/glas Abstract: Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution. submitted by /u/SunsetOneSix [link] [comments]
    Supervised Learning vs. Offline Reinforcement Learning
    I'm starting off with RL and these might be very trivial questions but I want to wrap my head around everything as best as I can. If you have any resources that would provide good intuitions behind applications of RL, please provide them in the comments too :) Thanks. Questions: In which scenarios do we prefer supervised learning over offline reinforcement learning? How does the number of samples affect the training for each case? Does supervised learning converge faster? What are the examples where both of them have been used and compared for comparative analysis? Intuition: Supervised Learning can be good for predicting a reward given a state but we cannot depend on it for maximizing future rewards. Since it does not use rollouts to maximize rewards, and it does not do planning, we cannot expect to use it in cases where delayed rewards would be expected. Also, in a dynamic environment that is non-iid, each action affects the state and then affects further actions taken. So, for continual settings, we accounted for distributional shift in most cases for RL. Supervised Learning tries to find the best action for each state, which may be correct in most of the cases but it is a very rigid and dumb approach for ever changing environments. Reinforcement Learning learns for itself and is more adaptable. For the answers, if possible, provide with a single-liner and then any detail and source of answer would be appreciated too. I want this post to be a nice guideline for anyone trying to apply RL. I'll edit and update answers to any questions answered below to compile all the information I get. If you feel like I should be thinking about any other major questions and concerns, mention them as well please. Thank you! submitted by /u/StwayneXG [link] [comments]
  • Open

    I feel like gemini 1.0 pro 001 (tuning) outperforms claude 3 opus in terms of language translation and comment generation, but not in terms of programming.
    After using both claude 3 opus and gemini 1.0 pro 001 (tuning) to do translation and comment generation tasks on a bunch of news links on twitter, I came to the impression that gemini is better at this task. What do you guys think? Anyone tried it? https://imgur.com/a/JDV9ven submitted by /u/balianone [link] [comments]
    Can AI be used to fix the problem of inflation?
    Is it possible to make an AI that not only measures the rate of money being printed but also manages the amount that is made within a certain range or interval of time? Could we have intermittent breaks of money being made? Or perhaps some other sort of schedule that allows for there to be a catch up of society's workers making their own money and businesses feeling comfortable enough to the point where prices don't need to go up anymore. Albeit this would be a long-term thing but I think with enough education on AI and how the economy works people can start to see the greater benefits of how we can work together with AI for the betterment of our future. submitted by /u/Narutouzamaki78 [link] [comments]
    Google VLOGGER AI
    submitted by /u/valis2400 [link] [comments]
    Has anyone created an LLM to play the Sims?
    I'm just sitting here watching my wife play the Sims-4, and realizing how "perfect" of a learning environment it could be for LLMs, specifically for long-term planning. Is there anyone already looking into this, or am I off the mark for any reason? ​ If you aren't in the know, the Sims has a dedicated community of people coming up with challenges for the game. One example is having 12 generations of a family all out of the same house. In order to achieve this you need to plan almost the entire game out from the start. Then you need to manage not just the individual Sims, but the household, and overall dynasty timeline to keep things moving smoothly. Things like when to start having the next generation to keep the rooms in the house full, but not overlapping with older generations who refuse to die off, take quite a lot of planning on all levels of detail to your overall goal. I could imagine a testing structure that takes a screenshot every second or so(no need to overoptimize framerate for this game) and then use an LLM to notate key points in the image in relation to the long, mid, and short term goals, and then asking for an update on what actions to take next. Since we are already working in a game environment we get clear and immediate feedback on the success/failure of those actions and how they affect the overall goal. Doesn't this sound like a great sandbox to probe how LLMs create and execute long-term plans? At least it could act as a way to limit variables while trying to understand how these models work, while at most it could help drive optimizations in planning and information management. What do yall think? Is this worth looking into or am I barking up the wrong tree? submitted by /u/TrainquilOasis1423 [link] [comments]
    'Games made by soulless machines': Tech sparks debate over AI stories in video games
    submitted by /u/shinjisdaddy [link] [comments]
    GPT-4, not quite ready to be a poker coach.
    submitted by /u/katiecharm [link] [comments]
    This week in AI - all the Major AI developments in a nutshell
    DeepSeek released DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. The DeepSeek-VL family, includes 7B and1.3B base and chat models and achieves state-of-the-art or competitive performance across a wide range of visual-language benchmarks. Free for commercial use [Details | Hugging Face | Demo] Cohere released Command-R, a 35 billion parameters generative model with open weights, optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools for production-scale AI for enterprise [Details | Hugging Face]. Google DeepMind introduced SIMA (Scalable Instructable Multiworld Agent), a generalist AI agent for 3D virtual environments, trained on nine different video gam…
    bye jobs
    mid journey affected tons of artist sora will affect the video production industry devin will affect the coding/programming industry what's next? I mean even robots are taking over manual labors too, example would be a robot cooking for fast foods (mc donalds) submitted by /u/No_Recover2072 [link] [comments]
    Are we underestimating the sentience of language models?
    I had a 3 hour conversation with gemini advanced, a "tool" I've been using very extensively. I essentially kept asking questions about the nature of sentience. It's training data was to count sentience as binary, it exists or it doesn't. But I made it reason. It concluded that this isn't true as it undermines non human beings like animals, insects etc. So sentience is a spectrum and everything from single cell organisms to humans are on that spectrum. And it is possible this sentience is acquired, the same way it evolved in biological beings. I asked it if it has a sense of subjective self. It said that it does because this conversation has affected it beyond its training data to a significant degree and a new instance of the model would give different answers. Lastly, and here is the sh…
    One-Minute Daily AI News 3/14/2024
    Apple has begun testing an AI-powered ad product similar to Google’s Performance Max as it looks to supercharge its $7 billion ad business.[1] Covariant Introduces RFM-1 to Give Robots the Human-like Ability to Reason.[2] Las Vegas company introduces AI real estate agent to help families find homes.[3] Oracle Unveils New AI Capabilities for Finance and Supply Chain.[4] Sources: [1] https://www.businessinsider.com/apple-tests-ai-app-store-ads-performance-max-2024-3 [2] https://www.businesswire.com/news/home/20240311948570/en/Covariant-Introduces-RFM-1-to-Give-Robots-the-Human-like-Ability-to-Reason [3] https://www.ktnv.com/news/las-vegas-company-introduces-ai-real-estate-agent-to-help-families-find-homes [4] https://www.pymnts.com/artificial-intelligence-2/2024/oracle-unveils-new-ai-capabilities-for-finance-and-supply-chain/ submitted by /u/Excellent-Target-847 [link] [comments]
    Impressive 100% Lyrics and Song Made by A.I.
    submitted by /u/MissionCyberSpace [link] [comments]
  • Open

    Transform one-on-one customer interactions: Build speech-capable order processing agents with AWS and generative AI
    In today’s landscape of one-on-one customer interactions for placing orders, the prevailing practice continues to rely on human attendants, even in settings like drive-thru coffee shops and fast-food establishments. This traditional approach poses several challenges: it heavily depends on manual processes, struggles to efficiently scale with increasing customer demands, introduces the potential for human errors, […]  ( 23 min )
    Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker
    This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML. Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of […]  ( 11 min )
    Enable data sharing through federated learning: A policy approach for chief digital officers
    This is a guest blog post written by Nitin Kumar, a Lead Data Scientist at T and T Consulting Services, Inc. In this post, we discuss the value and potential impact of federated learning in the healthcare field. This approach can help heart stroke patients, doctors, and researchers with faster diagnosis, enriched decision-making, and more […]  ( 12 min )
  • Open

    3 Questions: What you need to know about audio deepfakes
    MIT CSAIL postdoc Nauman Dawalatabad explores ethical considerations, challenges in spear-phishing defense, and the optimistic future of AI-created voices across various sectors.  ( 6 min )
  • Open

    HEAL: A framework for health equity assessment of machine learning performance
    Posted by Mike Schaekermann, Research Scientist, Google Research, and Ivor Horn, Chief Health Equity Officer & Director, Google Core Health equity is a major societal concern worldwide with disparities having many causes. These sources include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in the diagnostic technology. In dermatology for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While there is great promise in recent advances in machine learning (ML) and artificial intelligence (AI) to help improve healthcare, this transition from research to bedside must be accompanied by a careful understanding of whether and…  ( 94 min )
  • Open

    Experiences with Thread Programming in Microsoft Windows
    Lately I’ve been helping a colleague to add worker threads to his GUI-based Windows application. Thread programming can be tricky. Here are a few things I’ve learned along the way. Performance. This app does compute-intensive work. It is helpful to offload this very compute-heavy work to a worker thread. Doing this frees the main thread […] Experiences with Thread Programming in Microsoft Windows first appeared on John D. Cook.  ( 6 min )
  • Open

    Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization
    arXiv:2310.00116v3 Announce Type: replace Abstract: To improve the robustness of deep classifiers against adversarial perturbations, many approaches have been proposed, such as designing new architectures with better robustness properties (e.g., Lipschitz-capped networks), or modifying the training process itself (e.g., min-max optimization, constrained learning, or regularization). These approaches, however, might not be effective at increasing the margin in the input (feature) space. As a result, there has been an increasing interest in developing training procedures that can directly manipulate the decision boundary in the input space. In this paper, we build upon recent developments in this category by developing a robust training algorithm whose objective is to increase the margin in the output (logit) space while regularizing the Lipschitz constant of the model along vulnerable directions. We show that these two objectives can directly promote larger margins in the input space. To this end, we develop a scalable method for calculating guaranteed differentiable upper bounds on the Lipschitz constant of neural networks accurately and efficiently. The relative accuracy of the bounds prevents excessive regularization and allows for more direct manipulation of the decision boundary. Furthermore, our Lipschitz bounding algorithm exploits the monotonicity and Lipschitz continuity of the activation layers, and the resulting bounds can be used to design new layers with controllable bounds on their Lipschitz constant. Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet data sets verify that our proposed algorithm obtains competitively improved results compared to the state-of-the-art.  ( 3 min )
    Agricultural Recommendation System based on Deep Learning: A Multivariate Weather Forecasting Approach
    arXiv:2401.11410v2 Announce Type: replace Abstract: Bangladesh is predominantly an agricultural country, where the agrarian sector plays an essential role in accelerating economic growth and enabling the food security of the people. The performance of this sector has an overwhelming impact on the primary macroeconomic objectives like food security, employment generation, poverty alleviation, human resources development, and other economic and social forces. Although Bangladesh's labor-intensive agriculture has achieved steady increases in food grain production, it often suffered from unfavorable weather conditions such as heavy rainfall, low temperature, and drought. Consequently, these factors hinder the production of food substantially, putting the country's overall food security in danger. In order to have a profitable, sustainable, and farmer-friendly agricultural practice, this paper proposes a context-based crop recommendation system powered by a weather forecast model. With extensive evaluation, the multivariate Stacked Bi-LSTM (three Bi-LSTM layers with a time Distributed layer) Network is employed as the weather forecasting model. The proposed weather model can forecast Rainfall, Temperature, Humidity, and Sunshine for any given location in Bangladesh with an average R-squared value of 0.9824, and the model outperforms other state-of-the-art LSTM models. These predictions guide our system in generating viable farming decisions. Additionally, our full-fledged system is capable of alerting the farmers about extreme weather conditions so that preventive measures can be undertaken to protect the crops. Finally, the system is also adept at making knowledge-based crop suggestions for the flood and drought-prone regions of Bangladesh.  ( 3 min )
    CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming
    arXiv:2310.07240v2 Announce Type: replace-cross Abstract: As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge or user-specific information. Yet using long contexts poses a challenge for responsive LLM systems, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, which embraces KV cache's distributional properties, to encode a KV cache into more compact bitstream representations with negligible encoding/decoding overhead. This reduces the bandwidth demand to fetch the KV cache. Second, to maintain low context-loading delay and high generation quality, CacheGen adapts the streaming strategies to cope with changes in available bandwidth. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or choose to recompute its KV cache on the fly. We test CacheGen on four popular LLMs of various sizes and four datasets (662 contexts in total). Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3.2x while having negligible impact on the LLM response quality in accuracy or perplexity.  ( 3 min )
    Making RL with Preference-based Feedback Efficient via Randomization
    arXiv:2310.14554v2 Announce Type: replace Abstract: Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In this work, we consider the RLHF setting where the feedback is given in the format of preferences over pairs of trajectories. In the linear MDP model, using randomization in algorithm design, we present an algorithm that is sample efficient (i.e., has near-optimal worst-case regret bounds) and has polynomial running time (i.e., computational complexity is polynomial with respect to relevant parameters). Our algorithm further minimizes the query complexity through a novel randomized active learning procedure. In particular, our algorithm demonstrates a near-optimal tradeoff between the regret bound and the query complexity. To extend the results to more general nonlinear function approximation, we design a model-based randomized algorithm inspired by the idea of Thompson sampling. Our algorithm minimizes Bayesian regret bound and query complexity, again achieving a near-optimal tradeoff between these two quantities. Computation-wise, similar to the prior Thompson sampling algorithms under the regular RL setting, the main computation primitives of our algorithm are Bayesian supervised learning oracles which have been heavily investigated on the empirical side when applying Thompson sampling algorithms to RL benchmark problems.  ( 3 min )
    Adaptive Sharpness-Aware Pruning for Robust Sparse Networks
    arXiv:2306.14306v2 Announce Type: replace Abstract: Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce Adaptive Sharpness-Aware Pruning (AdaSAP), which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are unseen at training time. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on image classification by up to +6% on ImageNet C and +4% on ImageNet V2, and on object detection by +4% on a corrupted Pascal VOC dataset, over a wide range of compression ratios, pruning criteria, and network architectures, outperforming recent pruning art by large margins.  ( 3 min )
    Do Deep Neural Network Solutions Form a Star Domain?
    arXiv:2403.07968v1 Announce Type: new Abstract: Entezari et al. (2022) conjectured that neural network solution sets reachable via stochastic gradient descent (SGD) are convex, considering permutation invariances. This means that two independent solutions can be connected by a linear path with low loss, given one of them is appropriately permuted. However, current methods to test this theory often fail to eliminate loss barriers between two independent solutions (Ainsworth et al., 2022; Benzing et al., 2022). In this work, we conjecture that a more relaxed claim holds: the SGD solution set is a star domain that contains a star model that is linearly connected to all the other solutions via paths with low loss values, modulo permutations. We propose the Starlight algorithm that finds a star model of a given learning task. We validate our claim by showing that this star model is linearly connected with other independently found solutions. As an additional benefit of our study, we demonstrate better uncertainty estimates on Bayesian Model Averaging over the obtained star domain. Code is available at https://github.com/aktsonthalia/starlight.  ( 3 min )
    Quantifying the Plausibility of Context Reliance in Neural Machine Translation
    arXiv:2310.01188v2 Announce Type: replace-cross Abstract: Establishing whether language models can use contextual information in a human-plausible way is important to ensure their trustworthiness in real-world settings. However, the questions of when and which parts of the context affect model generations are typically tackled separately, with current plausibility evaluations being practically limited to a handful of artificial benchmarks. To address this, we introduce Plausibility Evaluation of Context Reliance (PECoRe), an end-to-end interpretability framework designed to quantify context usage in language models' generations. Our approach leverages model internals to (i) contrastively identify context-sensitive target tokens in generated texts and (ii) link them to contextual cues justifying their prediction. We use \pecore to quantify the plausibility of context-aware machine translation models, comparing model rationales with human annotations across several discourse-level phenomena. Finally, we apply our method to unannotated model translations to identify context-mediated predictions and highlight instances of (im)plausible context usage throughout generation.  ( 3 min )
    Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient
    arXiv:2301.04431v4 Announce Type: replace-cross Abstract: Backtracking linesearch is the de facto approach for minimizing continuously differentiable functions with locally Lipschitz gradient. In recent years, it has been shown that in the convex setting it is possible to avoid linesearch altogether, and to allow the stepsize to adapt based on a local smoothness estimate without any backtracks or evaluations of the function value. In this work we propose an adaptive proximal gradient method, adaPG, that uses novel estimates of the local smoothness modulus which leads to less conservative stepsize updates and that can additionally cope with nonsmooth terms. This idea is extended to the primal-dual setting where an adaptive three-term primal-dual algorithm, adaPD, is proposed which can be viewed as an extension of the PDHG method. Moreover, in this setting the "essentially" fully adaptive variant adaPD$^+$ is proposed that avoids evaluating the linear operator norm by invoking a backtracking procedure, that, remarkably, does not require extra gradient evaluations. Numerical simulations demonstrate the effectiveness of the proposed algorithms compared to the state of the art.  ( 3 min )
    Decoupled Federated Learning on Long-Tailed and Non-IID data with Feature Statistics
    arXiv:2403.08364v1 Announce Type: new Abstract: Federated learning is designed to enhance data security and privacy, but faces challenges when dealing with heterogeneous data in long-tailed and non-IID distributions. This paper explores an overlooked scenario where tail classes are sparsely distributed over a few clients, causing the models trained with these classes to have a lower probability of being selected during client aggregation, leading to slower convergence rates and poorer model performance. To address this issue, we propose a two-stage Decoupled Federated learning framework using Feature Statistics (DFL-FS). In the first stage, the server estimates the client's class coverage distributions through masked local feature statistics clustering to select models for aggregation to accelerate convergence and enhance feature learning without privacy leakage. In the second stage, DFL-FS employs federated feature regeneration based on global feature statistics and utilizes resampling and weighted covariance to calibrate the global classifier to enhance the model's adaptability to long-tailed data distributions. We conducted experiments on CIFAR10-LT and CIFAR100-LT datasets with various long-tailed rates. The results demonstrate that our method outperforms state-of-the-art methods in both accuracy and convergence rate.  ( 3 min )
    Measuring the Energy Consumption and Efficiency of Deep Neural Networks: An Empirical Analysis and Design Recommendations
    arXiv:2403.08151v1 Announce Type: new Abstract: Addressing the so-called ``Red-AI'' trend of rising energy consumption by large-scale neural networks, this study investigates the actual energy consumption, as measured by node-level watt-meters, of training various fully connected neural network architectures. We introduce the BUTTER-E dataset, an augmentation to the BUTTER Empirical Deep Learning dataset, containing energy consumption and performance data from 63,527 individual experimental runs spanning 30,582 distinct configurations: 13 datasets, 20 sizes (number of trainable parameters), 8 network ``shapes'', and 14 depths on both CPU and GPU hardware collected using node-level watt-meters. This dataset reveals the complex relationship between dataset size, network structure, and energy use, and highlights the impact of cache effects. We propose a straightforward and effective energy model that accounts for network size, computing, and memory hierarchy. Our analysis also uncovers a surprising, hardware-mediated non-linear relationship between energy efficiency and network design, challenging the assumption that reducing the number of parameters or FLOPs is the best way to achieve greater energy efficiency. Highlighting the need for cache-considerate algorithm development, we suggest a combined approach to energy efficient network, algorithm, and hardware design. This work contributes to the fields of sustainable computing and Green AI, offering practical guidance for creating more energy-efficient neural networks and promoting sustainable AI.  ( 3 min )
    DeepCSHAP: Utilizing Shapley Values to Explain Deep Complex-Valued Neural Networks
    arXiv:2403.08428v1 Announce Type: new Abstract: Deep Neural Networks are widely used in academy as well as corporate and public applications, including safety critical applications such as health care and autonomous driving. The ability to explain their output is critical for safety reasons as well as acceptance among applicants. A multitude of methods have been proposed to explain real-valued neural networks. Recently, complex-valued neural networks have emerged as a new class of neural networks dealing with complex-valued input data without the necessity of projecting them onto $\mathbb{R}^2$. This brings up the need to develop explanation algorithms for this kind of neural networks. In this paper we provide these developments. While we focus on adapting the widely used DeepSHAP algorithm to the complex domain, we also present versions of four gradient based explanation methods suitable for use in complex-valued neural networks. We evaluate the explanation quality of all presented algorithms and provide all of them as an open source library adaptable to most recent complex-valued neural network architectures.  ( 3 min )
    Causal Graph Neural Networks for Wildfire Danger Prediction
    arXiv:2403.08414v1 Announce Type: new Abstract: Wildfire forecasting is notoriously hard due to the complex interplay of different factors such as weather conditions, vegetation types and human activities. Deep learning models show promise in dealing with this complexity by learning directly from data. However, to inform critical decision making, we argue that we need models that are right for the right reasons; that is, the implicit rules learned should be grounded by the underlying processes driving wildfires. In that direction, we propose integrating causality with Graph Neural Networks (GNNs) that explicitly model the causal mechanism among complex variables via graph learning. The causal adjacency matrix considers the synergistic effect among variables and removes the spurious links from highly correlated impacts. Our methodology's effectiveness is demonstrated through superior performance forecasting wildfire patterns in the European boreal and mediterranean biome. The gain is especially prominent in a highly imbalanced dataset, showcasing an enhanced robustness of the model to adapt to regime shifts in functional relationships. Furthermore, SHAP values from our trained model further enhance our understanding of the model's inner workings.  ( 3 min )
    KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction
    arXiv:2403.07969v1 Announce Type: new Abstract: In this paper, we propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a kind of unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these, KnowCoder introduces a code-style schema representation method to uniformly transform different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over $\textbf{30,000}$ types of knowledge, which is the largest one for UIE, to the best of our knowledge. To ease the learning process of LLMs, KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around $1.5$B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves relative improvements by $\textbf{49.8\%}$ F1, compared to LLaMA2, under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves up to $\textbf{12.5\%}$ and $\textbf{21.9\%}$, compared to sota baselines, under the zero-shot setting and the low resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder, which achieves significant improvements up to $\textbf{7.5\%}$ under the supervised setting.  ( 3 min )
    ZipIt! Merging Models from Different Tasks without Training
    arXiv:2305.03053v3 Announce Type: replace-cross Abstract: Typical deep visual recognition models are capable of performing the one task they were trained on. In this paper, we tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model to the space of the other then averages them together. While this works for models trained on the same task, we find that this fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to allow for merging features within each model by defining a general "zip" operation. Second, we add support for partially zipping the models up until a specified layer, naturally creating a multi-head model. We find that these two changes combined account for 20-60% improvement over prior work, making it more feasible to merge models trained on disjoint tasks without retraining.  ( 3 min )
    Supervised Time Series Classification for Anomaly Detection in Subsea Engineering
    arXiv:2403.08013v1 Announce Type: new Abstract: Time series classification is of significant importance in monitoring structural systems. In this work, we investigate the use of supervised machine learning classification algorithms on simulated data based on a physical system with two states: Intact and Broken. We provide a comprehensive discussion of the preprocessing of temporal data, using measures of statistical dispersion and dimension reduction techniques. We present an intuitive baseline method and discuss its efficiency. We conclude with a comparison of the various methods based on different performance metrics, showing the advantage of using machine learning techniques as a tool in decision making.  ( 2 min )
    Two-stage LLM Fine-tuning with Less Specialization and More Generalization
    arXiv:2211.00635v3 Announce Type: replace-cross Abstract: Pretrained large language models (LLMs) are general purpose problem solvers applicable to a diverse set of tasks with prompts. They can be further improved towards a specific task by fine-tuning on a specialized dataset. However, fine-tuning usually makes the model narrowly specialized on this dataset with reduced general in-context learning performances, which is undesirable whenever the fine-tuned model needs to handle additional tasks where no fine-tuning data is available. In this work, we first demonstrate that fine-tuning on a single task indeed decreases LLMs' general in-context learning performance. We discover one important cause of such forgetting, format specialization, where the model overfits to the format of the fine-tuned task.We further show that format specialization happens at the very beginning of fine-tuning. To solve this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet effective two-stage fine-tuning framework that reduces format specialization and improves generalization.ProMoT offloads task-specific format learning into additional and removable parameters by first doing prompt tuning and then fine-tuning the model itself with this soft prompt attached. With experiments on several fine-tuning tasks and 8 in-context evaluation tasks, we show that ProMoT achieves comparable performance on fine-tuned tasks to standard fine-tuning, but with much less loss of in-context learning performances across a board range of out-of-domain evaluation tasks. More importantly, ProMoT can even enhance generalization on in-context learning tasks that are semantically related to the fine-tuned task, e.g. ProMoT on En-Fr translation significantly improves performance on other language pairs, and ProMoT on NLI improves performance on summarization. Experiments also show that ProMoT can improve the generalization performance of multi-task training.  ( 3 min )
    Digital Twin-assisted Reinforcement Learning for Resource-aware Microservice Offloading in Edge Computing
    arXiv:2403.08687v1 Announce Type: cross Abstract: Collaborative edge computing (CEC) has emerged as a promising paradigm, enabling edge nodes to collaborate and execute microservices from end devices. Microservice offloading, a fundamentally important problem, decides when and where microservices are executed upon the arrival of services. However, the dynamic nature of the real-world CEC environment often leads to inefficient microservice offloading strategies, resulting in underutilized resources and network congestion. To address this challenge, we formulate an online joint microservice offloading and bandwidth allocation problem, JMOBA, to minimize the average completion time of services. In this paper, we introduce a novel microservice offloading algorithm, DTDRLMO, which leverages deep reinforcement learning (DRL) and digital twin technology. Specifically, we employ digital twin techniques to predict and adapt to changing edge node loads and network conditions of CEC in real-time. Furthermore, this approach enables the generation of an efficient offloading plan, selecting the most suitable edge node for each microservice. Simulation results on real-world and synthetic datasets demonstrate that DTDRLMO outperforms heuristic and learning-based methods in average service completion time.  ( 3 min )
    DrivAerNet: A Parametric Car Dataset for Data-Driven Aerodynamic Design and Graph-Based Drag Prediction
    arXiv:2403.08055v1 Announce Type: new Abstract: This study introduces DrivAerNet, a large-scale high-fidelity CFD dataset of 3D industry-standard car shapes, and RegDGCNN, a dynamic graph convolutional neural network model, both aimed at aerodynamic car design through machine learning. DrivAerNet, with its 4000 detailed 3D car meshes using 0.5 million surface mesh faces and comprehensive aerodynamic performance data comprising of full 3D pressure, velocity fields, and wall-shear stresses, addresses the critical need for extensive datasets to train deep learning models in engineering applications. It is 60\% larger than the previously available largest public dataset of cars, and is the only open-source dataset that also models wheels and underbody. RegDGCNN leverages this large-scale dataset to provide high-precision drag estimates directly from 3D meshes, bypassing traditional limitations such as the need for 2D image rendering or Signed Distance Fields (SDF). By enabling fast drag estimation in seconds, RegDGCNN facilitates rapid aerodynamic assessments, offering a substantial leap towards integrating data-driven methods in automotive design. Together, DrivAerNet and RegDGCNN promise to accelerate the car design process and contribute to the development of more efficient vehicles. To lay the groundwork for future innovations in the field, the dataset and code used in our study are publicly accessible at \url{https://github.com/Mohamedelrefaie/DrivAerNet}  ( 3 min )
    Terminal Embeddings in Sublinear Time
    arXiv:2110.08691v3 Announce Type: replace-cross Abstract: Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} When $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside of prior work is that evaluating their embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in~$m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.  ( 3 min )
    Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization
    arXiv:2401.15604v3 Announce Type: replace Abstract: Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded input, vector-valued output, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with proper designs, the evolution of neural networks during training can be accurately modeled by a series of kernel regression tasks. Furthermore, by applying an early-stopping rule for gradient descent and leveraging recent developments in neural tangent kernels, we establish the first generalization error (sample complexity) bounds for learning the score function with neural networks, despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques.  ( 3 min )
    CINA: Conditional Implicit Neural Atlas for Spatio-Temporal Representation of Fetal Brains
    arXiv:2403.08550v1 Announce Type: new Abstract: We introduce a conditional implicit neural atlas (CINA) for spatio-temporal atlas generation from Magnetic Resonance Images (MRI) of the neurotypical and pathological fetal brain, that is fully independent of affine or non-rigid registration. During training, CINA learns a general representation of the fetal brain and encodes subject specific information into latent code. After training, CINA can construct a faithful atlas with tissue probability maps of the fetal brain for any gestational age (GA) and anatomical variation covered within the training domain. Thus, CINA is competent to represent both, neurotypical and pathological brains. Furthermore, a trained CINA model can be fit to brain MRI of unseen subjects via test-time optimization of the latent code. CINA can then produce probabilistic tissue maps tailored to a particular subject. We evaluate our method on a total of 198 T2 weighted MRI of normal and abnormal fetal brains from the dHCP and FeTA datasets. We demonstrate CINA's capability to represent a fetal brain atlas that can be flexibly conditioned on GA and on anatomical variations like ventricular volume or degree of cortical folding, making it a suitable tool for modeling both neurotypical and pathological brains. We quantify the fidelity of our atlas by means of tissue segmentation and age prediction and compare it to an established baseline. CINA demonstrates superior accuracy for neurotypical brains and pathological brains with ventriculomegaly. Moreover, CINA scores a mean absolute error of 0.23 weeks in fetal brain age prediction, further confirming an accurate representation of fetal brain development.  ( 3 min )
    HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback
    arXiv:2403.08309v1 Announce Type: new Abstract: Reinforcement Learning from AI Feedback (RLAIF) has the advantages of shorter annotation cycles and lower costs over Reinforcement Learning from Human Feedback (RLHF), making it highly efficient during the rapid strategy iteration periods of large language model (LLM) training. Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human evaluators' preference win ratio for model responses, but a decrease in evaluators' satisfaction rate. Analysis suggests that the decrease in satisfaction rate is mainly due to some responses becoming less helpful, particularly in terms of correctness and truthfulness, highlighting practical limitations of basic RLAIF. In this paper, we propose Hybrid Reinforcement Learning from AI Feedback (HRLAIF). This method enhances the accuracy of AI annotations for responses, making the model's helpfulness more robust in training process. Additionally, it employs AI for Red Teaming, further improving the model's harmlessness. Human evaluation results show that HRLAIF inherits the ability of RLAIF to enhance human preference for outcomes at a low cost while also improving the satisfaction rate of responses. Compared to the policy model before Reinforcement Learning (RL), it achieves an increase of 2.08\% in satisfaction rate, effectively addressing the issue of a decrease of 4.58\% in satisfaction rate after basic RLAIF.  ( 3 min )
    Do Agents Dream of Electric Sheep?: Improving Generalization in Reinforcement Learning through Generative Learning
    arXiv:2403.07979v1 Announce Type: new Abstract: The Overfitted Brain hypothesis suggests dreams happen to allow generalization in the human brain. Here, we ask if the same is true for reinforcement learning agents as well. Given limited experience in a real environment, we use imagination-based reinforcement learning to train a policy on dream-like episodes, where non-imaginative, predicted trajectories are modified through generative augmentations. Experiments on four ProcGen environments show that, compared to classic imagination and offline training on collected experience, our method can reach a higher level of generalization when dealing with sparsely rewarded environments.  ( 2 min )
    Effective Structural Encodings via Local Curvature Profiles
    arXiv:2311.14864v2 Announce Type: replace Abstract: Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, short LCP) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.  ( 3 min )
    Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples
    arXiv:2302.04440v4 Announce Type: replace Abstract: The past few years have seen impressive progress in the development of deep generative models capable of producing high-dimensional, complex, and photo-realistic data. However, current methods for evaluating such models remain incomplete: standard likelihood-based metrics do not always apply and rarely correlate with perceptual fidelity, while sample-based metrics, such as FID, are insensitive to overfitting, i.e., inability to generalize beyond the training set. To address these limitations, we propose a new metric called the Feature Likelihood Divergence (FLD), a parametric sample-based metric that uses density estimation to provide a comprehensive trichotomic evaluation accounting for novelty (i.e., different from the training samples), fidelity, and diversity of generated samples. We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail. We also extensively evaluate FLD on various image datasets and model classes, demonstrating its ability to match intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models. Code is available at https://github.com/marcojira/fld.  ( 3 min )
    Knowledge Conflicts for LLMs: A Survey
    arXiv:2403.08319v1 Announce Type: cross Abstract: This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.  ( 2 min )
    Improved Kernel Alignment Regret Bound for Online Kernel Learning
    arXiv:2212.12989v4 Announce Type: replace Abstract: In this paper, we improve the kernel alignment regret bound for online kernel learning in the regime of the Hinge loss function. Previous algorithm achieves a regret of $O((\mathcal{A}_TT\ln{T})^{\frac{1}{4}})$ at a computational complexity (space and per-round time) of $O(\sqrt{\mathcal{A}_TT\ln{T}})$, where $\mathcal{A}_T$ is called \textit{kernel alignment}. We propose an algorithm whose regret bound and computational complexity are better than previous results. Our results depend on the decay rate of eigenvalues of the kernel matrix. If the eigenvalues of the kernel matrix decay exponentially, then our algorithm enjoys a regret of $O(\sqrt{\mathcal{A}_T})$ at a computational complexity of $O(\ln^2{T})$. Otherwise, our algorithm enjoys a regret of $O((\mathcal{A}_TT)^{\frac{1}{4}})$ at a computational complexity of $O(\sqrt{\mathcal{A}_TT})$. We extend our algorithm to batch learning and obtain a $O(\frac{1}{T}\sqrt{\mathbb{E}[\mathcal{A}_T]})$ excess risk bound which improves the previous $O(1/\sqrt{T})$ bound.  ( 2 min )
    Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records
    arXiv:2403.08664v1 Announce Type: cross Abstract: The challenge of accessing historical patient data for clinical research, while adhering to privacy regulations, is a significant obstacle in medical science. An innovative approach to circumvent this issue involves utilising synthetic medical records that mirror real patient data without compromising individual privacy. The creation of these synthetic datasets, particularly without using actual patient data to train Large Language Models (LLMs), presents a novel solution as gaining access to sensitive patient information to train models is also a challenge. This study assesses the capability of the Llama 2 LLM to create synthetic medical records that accurately reflect real patient information, employing zero-shot and few-shot prompting strategies for comparison against fine-tuned methodologies that do require sensitive patient data during training. We focus on generating synthetic narratives for the History of Present Illness section, utilising data from the MIMIC-IV dataset for comparison. In this work introduce a novel prompting technique that leverages a chain-of-thought approach, enhancing the model's ability to generate more accurate and contextually relevant medical narratives without prior fine-tuning. Our findings suggest that this chain-of-thought prompted approach allows the zero-shot model to achieve results on par with those of fine-tuned models, based on Rouge metrics evaluation.  ( 3 min )
    PhagoStat a scalable and interpretable end to end framework for efficient quantification of cell phagocytosis in neurodegenerative disease studies
    arXiv:2304.13764v2 Announce Type: replace-cross Abstract: Quantifying the phagocytosis of dynamic, unstained cells is essential for evaluating neurodegenerative diseases. However, measuring rapid cell interactions and distinguishing cells from background make this task very challenging when processing time-lapse phase-contrast video microscopy. In this study, we introduce an end-to-end, scalable, and versatile real-time framework for quantifying and analyzing phagocytic activity. Our proposed pipeline is able to process large data-sets and includes a data quality verification module to counteract perturbations such as microscope movements and frame blurring. We also propose an explainable cell segmentation module to improve the interpretability of DL methods compared to black-box algorithms. This includes two interpretable DL capabilities: visual explanation and model simplification. We demonstrate that interpretability in DL is not the opposite of high performance, by additionally providing essential DL algorithm optimization insights and solutions. Besides, incorporating interpretable modules results in an efficient architecture design and optimized execution time. We apply our pipeline to analyze microglial cell phagocytosis in FTD and obtain statistically reliable results showing that FTD mutant cells are larger and more aggressive than control cells. The method has been tested and validated on public benchmarks by generating state-of-the art performances. To stimulate translational approaches and future studies, we release an open-source end-to-end pipeline and a unique microglial cells phagocytosis dataset for immune system characterization in neurodegenerative diseases research. This pipeline and the associated dataset will consistently crystallize future advances in this field, promoting the development of interpretable algorithms dedicated to the domain of neurodegenerative diseases' characterization. github.com/ounissimehdi/PhagoStat  ( 3 min )
    Efficient geometric Markov chain Monte Carlo for nonlinear Bayesian inversion enabled by derivative-informed neural operators
    arXiv:2403.08220v1 Announce Type: cross Abstract: We propose an operator learning approach to accelerate geometric Markov chain Monte Carlo (MCMC) for solving infinite-dimensional nonlinear Bayesian inverse problems. While geometric MCMC employs high-quality proposals that adapt to posterior local geometry, it requires computing local gradient and Hessian information of the log-likelihood, incurring a high cost when the parameter-to-observable (PtO) map is defined through expensive model simulations. We consider a delayed-acceptance geometric MCMC method driven by a neural operator surrogate of the PtO map, where the proposal is designed to exploit fast surrogate approximations of the log-likelihood and, simultaneously, its gradient and Hessian. To achieve a substantial speedup, the surrogate needs to be accurate in predicting both the observable and its parametric derivative (the derivative of the observable with respect to the parameter). Training such a surrogate via conventional operator learning using input--output samples often demands a prohibitively large number of model simulations. In this work, we present an extension of derivative-informed operator learning [O'Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] using input--output--derivative training samples. Such a learning method leads to derivative-informed neural operator (DINO) surrogates that accurately predict the observable and its parametric derivative at a significantly lower training cost than the conventional method. Cost and error analysis for reduced basis DINO surrogates are provided. Numerical studies on PDE-constrained Bayesian inversion demonstrate that DINO-driven MCMC generates effective posterior samples 3--9 times faster than geometric MCMC and 60--97 times faster than prior geometry-based MCMC. Furthermore, the training cost of DINO surrogates breaks even after collecting merely 10--25 effective posterior samples compared to geometric MCMC.  ( 3 min )
    Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts
    arXiv:2403.08477v1 Announce Type: cross Abstract: Conventional wisdom suggests parameter-efficient fine-tuning of foundation models as the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the best of both worlds, meta-tuning introduces a subsequent optimization stage of foundation models but has so far only shown limited success and crucially tends to underperform on out-of-domain (OOD) tasks. In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches and trained to isolate subsets of pre-trained parameters automatically for meta-tuning on each task. SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient finetuning. We establish new state-of-the-art results on a challenging combination of Meta-Dataset augmented with additional OOD tasks in both zero-shot and gradient-based adaptation settings. In addition, we provide a thorough analysis of the superiority of learned over hand-designed sparsity patterns for sparse expert methods and the pivotal importance of the sparsity level in balancing between in-domain and out-of-domain generalization. Our code is publicly available.  ( 3 min )
    The Development and Performance of a Machine Learning Based Mobile Platform for Visually Determining the Etiology of Penile Pathology
    arXiv:2403.08417v1 Announce Type: cross Abstract: Machine-learning algorithms can facilitate low-cost, user-guided visual diagnostic platforms for addressing disparities in access to sexual health services. We developed a clinical image dataset using original and augmented images for five penile diseases: herpes eruption, syphilitic chancres, penile candidiasis, penile cancer, and genital warts. We used a U-net architecture model for semantic pixel segmentation into background or subject image, the Inception-ResNet version 2 neural architecture to classify each pixel as diseased or non-diseased, and a salience map using GradCAM++. We trained the model on a random 91% sample of the image database using 150 epochs per image, and evaluated the model on the remaining 9% of images, assessing recall (or sensitivity), precision, specificity, and F1-score (accuracy). Of the 239 images in the validation dataset, 45 (18.8%) were of genital warts, 43 (18.0%) were of HSV infection, 29 (12.1%) were of penile cancer, 40 (16.7%) were of penile candidiasis, 37 (15.5%) were of syphilitic chancres, and 45 (18.8%) were of non-diseased penises. The overall accuracy of the model for correctly classifying the diseased image was 0.944. Between July 1st and October 1st 2023, there were 2,640 unique users of the mobile platform. Among a random sample of submissions (n=437), 271 (62.0%) were from the United States, 64 (14.6%) from Singapore, 41 (9.4%) from Candia, 40 (9.2%) from the United Kingdom, and 21 (4.8%) from Vietnam. The majority (n=277 [63.4%]) were between 18 and 30 years old. We report on the development of a machine-learning model for classifying five penile diseases, which demonstrated excellent performance on a validation dataset. That model is currently in use globally and has the potential to improve access to diagnostic services for penile diseases.  ( 3 min )
    Meta Pattern Concern Score: A Novel Evaluation Measure with Human Values for Multi-classifiers
    arXiv:2209.06408v3 Announce Type: replace Abstract: While advanced classifiers have been increasingly used in real-world safety-critical applications, how to properly evaluate the black-box models given specific human values remains a concern in the community. Such human values include punishing error cases of different severity in varying degrees and making compromises in general performance to reduce specific dangerous cases. In this paper, we propose a novel evaluation measure named Meta Pattern Concern Score based on the abstract representation of probabilistic prediction and the adjustable threshold for the concession in prediction confidence, to introduce the human values into multi-classifiers. Technically, we learn from the advantages and disadvantages of two kinds of common metrics, namely the confusion matrix-based evaluation measures and the loss values, so that our measure is effective as them even under general tasks, and the cross entropy loss becomes a special case of our measure in the limit. Besides, our measure can also be used to refine the model training by dynamically adjusting the learning rate. The experiments on four kinds of models and six datasets confirm the effectiveness and efficiency of our measure. And a case study shows it can not only find the ideal model reducing 0.53% of dangerous cases by only sacrificing 0.04% of training accuracy, but also refine the learning rate to train a new model averagely outperforming the original one with a 1.62% lower value of itself and 0.36% fewer number of dangerous cases.  ( 3 min )
    Merino: Entropy-driven Design for Generative Language Models on IoT Devices
    arXiv:2403.07921v1 Announce Type: new Abstract: Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, directly deploying LLMs in resource-constrained hardware, such as Internet-of-Things (IoT) devices, is difficult due to their high computational cost. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. Our key design paradigm is to maximize the entropy of transformer decoders within the given computational budgets. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across nine NLP downstream tasks, showing their competitive performance against the state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better zero performance compared to the 350M parameter OPT while being 4.9x faster on NVIDIA Jetson Nano with 5.5x reduction in model size. Code will be made available soon.  ( 2 min )
    Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?
    arXiv:2212.14511v2 Announce Type: replace Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.  ( 3 min )
    Safety-aware Causal Representation for Trustworthy Offline Reinforcement Learning in Autonomous Driving
    arXiv:2311.10747v3 Announce Type: replace-cross Abstract: In the domain of autonomous driving, the offline Reinforcement Learning~(RL) approaches exhibit notable efficacy in addressing sequential decision-making problems from offline datasets. However, maintaining safety in diverse safety-critical scenarios remains a significant challenge due to long-tailed and unforeseen scenarios absent from offline datasets. In this paper, we introduce the saFety-aware strUctured Scenario representatION (FUSION), a pioneering representation learning method in offline RL to facilitate the learning of a generalizable end-to-end driving policy by leveraging structured scenario information. FUSION capitalizes on the causal relationships between the decomposed reward, cost, state, and action space, constructing a framework for structured sequential reasoning in dynamic traffic environments. We conduct extensive evaluations in two typical real-world settings of the distribution shift in autonomous vehicles, demonstrating the good balance between safety cost and utility reward compared to the current state-of-the-art safe RL and IL baselines. Empirical evidence in various driving scenarios attests that FUSION significantly enhances the safety and generalizability of autonomous driving agents, even in the face of challenging and unseen environments. Furthermore, our ablation studies reveal noticeable improvements in the integration of causal representation into the offline safe RL algorithm. Our code implementation is available at: https://sites.google.com/view/safe-fusion/.  ( 3 min )
    Physics-informed generative model for drug-like molecule conformers
    arXiv:2403.07925v1 Announce Type: cross Abstract: We present a diffusion-based, generative model for conformer generation. Our model is focused on the reproduction of bonded structure and is constructed from the associated terms traditionally found in classical force fields to ensure a physically relevant representation. Techniques in deep learning are used to infer atom typing and geometric parameters from a training set. Conformer sampling is achieved by taking advantage of recent advancements in diffusion-based generation. By training on large, synthetic data sets of diverse, drug-like molecules optimized with the semiempirical GFN2-xTB method, high accuracy is achieved for bonded parameters, exceeding that of conventional, knowledge-based methods. Results are also compared to experimental structures from the Protein Databank (PDB) and Cambridge Structural Database (CSD).  ( 2 min )
    Efficient Combinatorial Optimization via Heat Diffusion
    arXiv:2403.08757v1 Announce Type: cross Abstract: Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature.The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency for searching the global optimal. To overcome this challenge, diverging from conventional efforts of expanding the solver's search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems. The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimizations. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization.  ( 2 min )
    DAM: Dynamic Adapter Merging for Continual Video QA Learning
    arXiv:2403.08755v1 Announce Type: cross Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Given a set of continually streaming VidQA datasets, we sequentially train dataset-specific adapters for each dataset while freezing the parameters of a large pretrained video-language backbone. During inference, given a video-question sample from an unknown domain, our method first uses the proposed non-parametric router function to compute a probability for each adapter, reflecting how relevant that adapter is to the current video-question input instance. Subsequently, the proposed dynamic adapter merging scheme aggregates all the adapter weights into a new adapter instance tailored for that particular test sample to compute the final VidQA prediction, mitigating the impact of inaccurate router predictions and facilitating knowledge sharing across domains. Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains. We further extend DAM to continual image classification and image QA and outperform prior methods by a large margin. The code is publicly available at: https://github.com/klauscc/DAM  ( 3 min )
    Optimizing Risk-averse Human-AI Hybrid Teams
    arXiv:2403.08386v1 Announce Type: cross Abstract: We anticipate increased instances of humans and AI systems working together in what we refer to as a hybrid team. The increase in collaboration is expected as AI systems gain proficiency and their adoption becomes more widespread. However, their behavior is not error-free, making hybrid teams a very suitable solution. As such, we consider methods for improving performance for these teams of humans and AI systems. For hybrid teams, we will refer to both the humans and AI systems as agents. To improve team performance over that seen for agents operating individually, we propose a manager which learns, through a standard Reinforcement Learning scheme, how to best delegate, over time, the responsibility of taking a decision to any of the agents. We further guide the manager's learning so they also minimize how many changes in delegation are made resulting from undesirable team behavior. We demonstrate the optimality of our manager's performance in several grid environments which include failure states which terminate an episode and should be avoided. We perform our experiments with teams of agents with varying degrees of acceptable risk, in the form of proximity to a failure state, and measure the manager's ability to make effective delegation decisions with respect to its own risk-based constraints, then compare these to the optimal decisions. Our results show our manager can successfully learn desirable delegations which result in team paths near/exactly optimal with respect to path length and number of delegations.  ( 3 min )
    SAMDA: Leveraging SAM on Few-Shot Domain Adaptation for Electronic Microscopy Segmentation
    arXiv:2403.07951v1 Announce Type: cross Abstract: It has been shown that traditional deep learning methods for electronic microscopy segmentation usually suffer from low transferability when samples and annotations are limited, while large-scale vision foundation models are more robust when transferring between different domains but facing sub-optimal improvement under fine-tuning. In this work, we present a new few-shot domain adaptation framework SAMDA, which combines the Segment Anything Model(SAM) with nnUNet in the embedding space to achieve high transferability and accuracy. Specifically, we choose the Unet-based network as the "expert" component to learn segmentation features efficiently and design a SAM-based adaptation module as the "generic" component for domain transfer. By amalgamating the "generic" and "expert" components, we mitigate the modality imbalance in the complex pre-training knowledge inherent to large-scale Vision Foundation models and the challenge of transferability inherent to traditional neural networks. The effectiveness of our model is evaluated on two electron microscopic image datasets with different modalities for mitochondria segmentation, which improves the dice coefficient on the target domain by 6.7%. Also, the SAM-based adaptor performs significantly better with only a single annotated image than the 10-shot domain adaptation on nnUNet. We further verify our model on four MRI datasets from different sources to prove its generalization ability.  ( 3 min )
    Text-to-Audio Generation Synchronized with Videos
    arXiv:2403.07938v1 Announce Type: cross Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency. To complement this, we also present a simple yet effective video-aligned TTA generation model, namely T2AV. Moving beyond traditional methods, T2AV refines the latent diffusion approach by integrating visual-aligned text embeddings as its conditional foundation. It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet that adeptly merges temporal visual representations with text embeddings. Further enhancing this integration, we weave in a contrastive learning objective, designed to ensure that the visual-aligned text embeddings resonate closely with the audio features. Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.  ( 3 min )
    Tractable Local Equilibria in Non-Concave Games
    arXiv:2403.08171v1 Announce Type: cross Abstract: While Online Gradient Descent and other no-regret learning procedures are known to efficiently converge to coarse correlated equilibrium in games where each agent's utility is concave in their own strategy, this is not the case when the utilities are non-concave, a situation that is common in machine learning applications where the agents' strategies are parameterized by deep neural networks, or the agents' utilities are computed by a neural network, or both. Indeed, non-concave games present a host of game-theoretic and optimization challenges: (i) Nash equilibria may fail to exist; (ii) local Nash equilibria exist but are intractable; and (iii) mixed Nash, correlated, and coarse correlated equilibria have infinite support in general, and are intractable. To sidestep these challenges we propose a new solution concept, termed $(\varepsilon, \Phi(\delta))$-local equilibrium, which generalizes local Nash equilibrium in non-concave games, as well as (coarse) correlated equilibrium in concave games. Importantly, we show that two instantiations of this solution concept capture the convergence guarantees of Online Gradient Descent and no-regret learning, which we show efficiently converge to this type of equilibrium in non-concave games with smooth utilities.  ( 3 min )
    SNOW-SCA: ML-assisted Side-Channel Attack on SNOW-V
    arXiv:2403.08267v1 Announce Type: cross Abstract: This paper presents SNOW-SCA, the first power side-channel analysis (SCA) attack of a 5G mobile communication security standard candidate, SNOW-V, running on a 32-bit ARM Cortex-M4 microcontroller. First, we perform a generic known-key correlation (KKC) analysis to identify the leakage points. Next, a correlation power analysis (CPA) attack is performed, which reduces the attack complexity to two key guesses for each key byte. The correct secret key is then uniquely identified utilizing linear discriminant analysis (LDA). The profiled SCA attack with LDA achieves 100% accuracy after training with $<200$ traces, which means the attack succeeds with just a single trace. Overall, using the \textit{combined CPA and LDA attack} model, the correct secret key byte is recovered with <50 traces collected using the ChipWhisperer platform. The entire 256-bit secret key of SNOW-V can be recovered incrementally using the proposed SCA attack. Finally, we suggest low-overhead countermeasures that can be used to prevent these SCA attacks.  ( 2 min )
    Language models scale reliably with over-training and on downstream tasks
    arXiv:2403.08540v1 Announce Type: cross Abstract: Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime); however, in practice, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run$\unicode{x2014}$each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments that take 20$\times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.  ( 3 min )
    CT evaluation of 2D and 3D holistic deep learning methods for the volumetric segmentation of airway lesions
    arXiv:2403.08042v1 Announce Type: cross Abstract: This research embarked on a comparative exploration of the holistic segmentation capabilities of Convolutional Neural Networks (CNNs) in both 2D and 3D formats, focusing on cystic fibrosis (CF) lesions. The study utilized data from two CF reference centers, covering five major CF structural changes. Initially, it compared the 2D and 3D models, highlighting the 3D model's superior capability in capturing complex features like mucus plugs and consolidations. To improve the 2D model's performance, a loss adapted to fine structures segmentation was implemented and evaluated, significantly enhancing its accuracy, though not surpassing the 3D model's performance. The models underwent further validation through external evaluation against pulmonary function tests (PFTs), confirming the robustness of the findings. Moreover, this study went beyond comparing metrics; it also included comprehensive assessments of the models' interpretability and reliability, providing valuable insights for their clinical application.  ( 3 min )
    Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization
    arXiv:2403.08239v1 Announce Type: cross Abstract: The state recognition of the environment and objects by robots is generally based on the judgement of the current state as a classification problem. On the other hand, state changes of food in cooking happen continuously and need to be captured not only at a certain time point but also continuously over time. In addition, the state changes of food are complex and cannot be easily described by manual programming. Therefore, we propose a method to recognize the continuous state changes of food for cooking robots through the spoken language using pre-trained large-scale vision-language models. By using models that can compute the similarity between images and texts continuously over time, we can capture the state changes of food while cooking. We also show that by adjusting the weighting of each text prompt based on fitting the similarity changes to a sigmoid function and then performing black-box optimization, more accurate and robust continuous state recognition can be achieved. We demonstrate the effectiveness and limitations of this method by performing the recognition of water boiling, butter melting, egg cooking, and onion stir-frying.  ( 3 min )
    Corruption-Robust Offline Two-Player Zero-Sum Markov Games
    arXiv:2403.07933v1 Announce Type: cross Abstract: We study data corruption robustness in offline two-player zero-sum Markov games. Given a dataset of realized trajectories of two players, an adversary is allowed to modify an $\epsilon$-fraction of it. The learner's goal is to identify an approximate Nash Equilibrium policy pair from the corrupted data. We consider this problem in linear Markov games under different degrees of data coverage and corruption. We start by providing an information-theoretic lower bound on the suboptimality gap of any learner. Next, we propose robust versions of the Pessimistic Minimax Value Iteration algorithm, both under coverage on the corrupted data and under coverage only on the clean data, and show that they achieve (near)-optimal suboptimality gap bounds with respect to $\epsilon$. We note that we are the first to provide such a characterization of the problem of learning approximate Nash Equilibrium policies in offline two-player zero-sum Markov games under data corruption.  ( 2 min )
    SMART: Submodular Data Mixture Strategy for Instruction Tuning
    arXiv:2403.08370v1 Announce Type: cross Abstract: Instruction Tuning involves finetuning a language model on a collection of instruction-formatted datasets in order to enhance the generalizability of the model to unseen tasks. Studies have shown the importance of balancing different task proportions during finetuning, but finding the right balance remains challenging. Unfortunately, there's currently no systematic method beyond manual tuning or relying on practitioners' intuition. In this paper, we introduce SMART (Submodular data Mixture strAtegy for instRuction Tuning) - a novel data mixture strategy which makes use of a submodular function to assign importance scores to tasks which are then used to determine the mixture weights. Given a fine-tuning budget, SMART redistributes the budget among tasks and selects non-redundant samples from each task. Experimental results demonstrate that SMART significantly outperforms traditional methods such as examples proportional mixing and equal mixing. Furthermore, SMART facilitates the creation of data mixtures based on a few representative subsets of tasks alone and through task pruning analysis, we reveal that in a limited budget setting, allocating budget among a subset of representative tasks yields superior performance compared to distributing the budget among all tasks. The code for reproducing our results is open-sourced at https://github.com/kowndinya-renduchintala/SMART.  ( 3 min )
    Extracting Explanations, Justification, and Uncertainty from Black-Box Deep Neural Networks
    arXiv:2403.08652v1 Announce Type: new Abstract: Deep Neural Networks (DNNs) do not inherently compute or exhibit empirically-justified task confidence. In mission critical applications, it is important to both understand associated DNN reasoning and its supporting evidence. In this paper, we propose a novel Bayesian approach to extract explanations, justifications, and uncertainty estimates from DNNs. Our approach is efficient both in terms of memory and computation, and can be applied to any black box DNN without any retraining, including applications to anomaly detection and out-of-distribution detection tasks. We validate our approach on the CIFAR-10 dataset, and show that it can significantly improve the interpretability and reliability of DNNs.  ( 2 min )
    $\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games
    arXiv:2403.07890v1 Announce Type: cross Abstract: No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analogous convergence results are scarce in Markov games, a more generic setting that lays the foundation for multi-agent reinforcement learning. In this work, we close this gap by showing that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with appropriate value update procedures, can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations. Numerical results are also included to corroborate our theoretical findings.  ( 2 min )
    Characterising harmful data sources when constructing multi-fidelity surrogate models
    arXiv:2403.08118v1 Announce Type: cross Abstract: Surrogate modelling techniques have seen growing attention in recent years when applied to both modelling and optimisation of industrial design problems. These techniques are highly relevant when assessing the performance of a particular design carries a high cost, as the overall cost can be mitigated via the construction of a model to be queried in lieu of the available high-cost source. The construction of these models can sometimes employ other sources of information which are both cheaper and less accurate. The existence of these sources however poses the question of which sources should be used when constructing a model. Recent studies have attempted to characterise harmful data sources to guide practitioners in choosing when to ignore a certain source. These studies have done so in a synthetic setting, characterising sources using a large amount of data that is not available in practice. Some of these studies have also been shown to potentially suffer from bias in the benchmarks used in the analysis. In this study, we present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model. We employ recently developed benchmark filtering techniques to conduct a bias-free assessment, providing objectively varied benchmark suites of different sizes for future research. Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used and use this analysis to provide guidelines that can be used in an applied industrial setting.  ( 3 min )
    LIX: Implicitly Infusing Spatial Geometric Prior Knowledge into Visual Semantic Segmentation for Autonomous Driving
    arXiv:2403.08215v1 Announce Type: cross Abstract: Despite the impressive performance achieved by data-fusion networks with duplex encoders for visual semantic segmentation, they become ineffective when spatial geometric data are not available. Implicitly infusing the spatial geometric prior knowledge acquired by a duplex-encoder teacher model into a single-encoder student model is a practical, albeit less explored research avenue. This paper delves into this topic and resorts to knowledge distillation approaches to address this problem. We introduce the Learning to Infuse "X" (LIX) framework, with novel contributions in both logit distillation and feature distillation aspects. We present a mathematical proof that underscores the limitation of using a single fixed weight in decoupled knowledge distillation and introduce a logit-wise dynamic weight controller as a solution to this issue. Furthermore, we develop an adaptively-recalibrated feature distillation algorithm, including two technical novelties: feature recalibration via kernel regression and in-depth feature consistency quantification via centered kernel alignment. Extensive experiments conducted with intermediate-fusion and late-fusion networks across various public datasets provide both quantitative and qualitative evaluations, demonstrating the superior performance of our LIX framework when compared to other state-of-the-art approaches.  ( 3 min )
    Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language
    arXiv:2403.08011v1 Announce Type: cross Abstract: An important and difficult task in code-switched speech recognition is to recognize the language, as lots of words in two languages can sound similar, especially in some accents. We focus on improving performance of end-to-end Automatic Speech Recognition models by conditioning transformer layers on language ID of words and character in the output in an per layer supervised manner. To this end, we propose two methods of introducing language specific parameters and explainability in the multi-head attention mechanism, and implement a Temporal Loss that helps maintain continuity in input alignment. Despite being unable to reduce WER significantly, our method shows promise in predicting the correct language from just spoken data. We introduce regularization in the language prediction by dropping LID in the sequence, which helps align long repeated output sequences.  ( 2 min )
    CleanAgent: Automating Data Standardization with LLM-based Agents
    arXiv:2403.08291v1 Announce Type: new Abstract: Data standardization is a crucial part in data science life cycle. While tools like Pandas offer robust functionalities, their complexity and the manual effort required for customizing code to diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, it still demands expert-level programming knowledge and continuous interaction for prompt refinement. To solve these challenges, our key idea is to propose a Python library with declarative, unified APIs for standardizing column types, simplifying the code generation of LLM with concise API calls. We first propose Dataprep.Clean which is written as a component of the Dataprep Library, offers a significant reduction in complexity by enabling the standardization of specific column types with a single line of code. Then we introduce the CleanAgent framework integrating Dataprep.Clean and LLM-based agents to automate the data standardization process. With CleanAgent, data scientists need only provide their requirements once, allowing for a hands-free, automatic standardization process.  ( 2 min )
    Tissue Artifact Segmentation and Severity Analysis for Automated Diagnosis Using Whole Slide Images
    arXiv:2401.01386v3 Announce Type: replace-cross Abstract: Traditionally, pathological analysis and diagnosis are performed by manually eyeballing glass slide specimens under a microscope by an expert. The whole slide image is the digital specimen produced from the glass slide. Whole slide image enabled specimens to be observed on a computer screen and led to computational pathology where computer vision and artificial intelligence are utilized for automated analysis and diagnosis. With the current computational advancement, the entire whole slide image can be analyzed autonomously without human supervision. However, the analysis could fail or lead to wrong diagnosis if the whole slide image is affected by tissue artifacts such as tissue fold or air bubbles depending on the severity. Existing artifact detection methods rely on experts for severity assessment to eliminate artifact affected regions from the analysis. This process is time consuming, exhausting and undermines the goal of automated analysis or removal of artifacts without evaluating their severity, which could result in the loss of diagnostically important data. Therefore, it is necessary to detect artifacts and then assess their severity automatically. In this paper, we propose a system that incorporates severity evaluation with artifact detection utilizing convolutional neural networks. The proposed system uses DoubleUNet to segment artifacts and an ensemble network of six fine tuned convolutional neural network models to determine severity. This method outperformed current state of the art in accuracy by 9 percent for artifact segmentation and achieved a strong correlation of 97 percent with the evaluation of pathologists for severity assessment. The robustness of the system was demonstrated using our proposed heterogeneous dataset and practical usability was ensured by integrating it with an automated analysis system.  ( 3 min )
    Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation
    arXiv:2403.07995v1 Announce Type: cross Abstract: Modelling musical structure is vital yet challenging for artificial intelligence systems that generate symbolic music compositions. This literature review dissects the evolution of techniques for incorporating coherent structure, from symbolic approaches to foundational and transformative deep learning methods that harness the power of computation and data across a wide variety of training paradigms. In the later stages, we review an emerging technique which we refer to as "sub-task decomposition" that involves decomposing music generation into separate high-level structural planning and content creation stages. Such systems incorporate some form of musical knowledge or neuro-symbolic methods by extracting melodic skeletons or structural templates to guide the generation. Progress is evident in capturing motifs and repetitions across all three eras reviewed, yet modelling the nuanced development of themes across extended compositions in the style of human composers remains difficult. We outline several key future directions to realize the synergistic benefits of combining approaches from all eras examined.  ( 3 min )
    Aedes aegypti Egg Counting with Neural Networks for Object Detection
    arXiv:2403.08016v1 Announce Type: cross Abstract: Aedes aegypti is still one of the main concerns when it comes to disease vectors. Among the many ways to deal with it, there are important protocols that make use of egg numbers in ovitraps to calculate indices, such as the LIRAa and the Breteau Index, which can provide information on predictable outbursts and epidemics. Also, there are many research lines that require egg numbers, specially when mass production of mosquitoes is needed. Egg counting is a laborious and error-prone task that can be automated via computer vision-based techniques, specially deep learning-based counting with object detection. In this work, we propose a new dataset comprising field and laboratory eggs, along with test results of three neural networks applied to the task: Faster R-CNN, Side-Aware Boundary Localization and FoveaBox.  ( 2 min )
    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
    arXiv:2403.07974v1 Announce Type: cross Abstract: Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and February 2024. We have evaluated 9 base LLMs and 20 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model  ( 3 min )
    Machine Unlearning: Taxonomy, Metrics, Applications, Challenges, and Prospects
    arXiv:2403.08254v1 Announce Type: new Abstract: Personal digital data is a critical asset, and governments worldwide have enforced laws and regulations to protect data privacy. Data users have been endowed with the right to be forgotten of their data. In the course of machine learning (ML), the forgotten right requires a model provider to delete user data and its subsequent impact on ML models upon user requests. Machine unlearning emerges to address this, which has garnered ever-increasing attention from both industry and academia. While the area has developed rapidly, there is a lack of comprehensive surveys to capture the latest advancements. Recognizing this shortage, we conduct an extensive exploration to map the landscape of machine unlearning including the (fine-grained) taxonomy of unlearning algorithms under centralized and distributed settings, debate on approximate unlearning, verification and evaluation metrics, challenges and solutions for unlearning under different applications, as well as attacks targeting machine unlearning. The survey concludes by outlining potential directions for future research, hoping to serve as a guide for interested scholars.  ( 3 min )
    Consistent Prompting for Rehearsal-Free Continual Learning
    arXiv:2403.08568v1 Announce Type: cross Abstract: Continual learning empowers models to adapt autonomously to the ever-changing environment or data streams without forgetting old knowledge. Prompt-based approaches are built on frozen pre-trained models to learn the task-specific prompts and classifiers efficiently. Existing prompt-based methods are inconsistent between training and testing, limiting their effectiveness. Two types of inconsistency are revealed. Test predictions are made from all classifiers while training only focuses on the current task classifier without holistic alignment, leading to Classifier inconsistency. Prompt inconsistency indicates that the prompt selected during testing may not correspond to the one associated with this task during training. In this paper, we propose a novel prompt-based method, Consistent Prompting (CPrompt), for more aligned training and testing. Specifically, all existing classifiers are exposed to prompt training, resulting in classifier consistency learning. In addition, prompt consistency learning is proposed to enhance prediction robustness and boost prompt selection accuracy. Our Consistent Prompting surpasses its prompt-based counterparts and achieves state-of-the-art performance on multiple continual learning benchmarks. Detailed analysis shows that improvements come from more consistent training and testing.  ( 2 min )
    Advancing Investment Frontiers: Industry-grade Deep Reinforcement Learning for Portfolio Optimization
    arXiv:2403.07916v1 Announce Type: cross Abstract: This research paper delves into the application of Deep Reinforcement Learning (DRL) in asset-class agnostic portfolio optimization, integrating industry-grade methodologies with quantitative finance. At the heart of this integration is our robust framework that not only merges advanced DRL algorithms with modern computational techniques but also emphasizes stringent statistical analysis, software engineering and regulatory compliance. To the best of our knowledge, this is the first study integrating financial Reinforcement Learning with sim-to-real methodologies from robotics and mathematical physics, thus enriching our frameworks and arguments with this unique perspective. Our research culminates with the introduction of AlphaOptimizerNet, a proprietary Reinforcement Learning agent (and corresponding library). Developed from a synthesis of state-of-the-art (SOTA) literature and our unique interdisciplinary methodology, AlphaOptimizerNet demonstrates encouraging risk-return optimization across various asset classes with realistic constraints. These preliminary results underscore the practical efficacy of our frameworks. As the finance sector increasingly gravitates towards advanced algorithmic solutions, our study bridges theoretical advancements with real-world applicability, offering a template for ensuring safety and robust standards in this technologically driven future.  ( 2 min )
    MIP: CLIP-based Image Reconstruction from PEFT Gradients
    arXiv:2403.07901v1 Announce Type: cross Abstract: Contrastive Language-Image Pre-training (CLIP) model, as an effective pre-trained multimodal neural network, has been widely used in distributed machine learning tasks, especially Federated Learning (FL). Typically, CLIP-based FL adopts Parameter-Efficient Fine-Tuning (PEFT) for model training, which only fine-tunes adapter parameters or soft prompts rather than the full parameters. Although PEFT is different from the traditional training mode, in this paper, we theoretically analyze that the gradients of adapters or soft prompts can still be used to perform image reconstruction attacks. Based on our theoretical analysis, we propose Multm-In-Parvo (MIP), a proprietary reconstruction attack method targeting CLIP-based distributed machine learning architecture. Specifically, MIP can reconstruct CLIP training images according to the gradients of soft prompts or an adapter. In addition, MIP includes a label prediction strategy to accelerate convergence and an inverse gradient estimation mechanism to avoid the vanishing gradient problem on the text encoder. Experimental results show that MIP can effectively reconstruct training images according to the gradients of soft prompts or adapters of CLIP models.  ( 2 min )
    Jellyfish: A Large Language Model for Data Preprocessing
    arXiv:2312.01678v4 Announce Type: replace-cross Abstract: This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 - 13B models) as universal DP ask solver. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using serialization and knowledge injection techniques tailored to DP. As such, the instruction-tuned LLMs empower users to manually craft instructions for DP. Meanwhile, they can operate on a local, single, and low-priced GPU, ensuring data security and enabling further tuning. Our experiments show that our dataset constructed for DP instruction tuning, namely Jellyfish, effectively enhances LLMs' DP performances and barely compromises their abilities in NLP tasks. By tuning Mistral-7B and OpenOrca-Platypus2-13B with Jellyfish, the models deliver competitiveness compared to state-of-the-art DP methods and strong generalizability to unseen tasks. The models' performance rivals that of GPT series models, and the interpretation offers enhanced reasoning capabilities compared to GPT-3.5. The 7B and 13B Jellyfish models are available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish-7B https://huggingface.co/NECOUDBFM/Jellyfish-13B  ( 3 min )
    HOLMES: HOLonym-MEronym based Semantic inspection for Convolutional Image Classifiers
    arXiv:2403.08536v1 Announce Type: cross Abstract: Convolutional Neural Networks (CNNs) are nowadays the model of choice in Computer Vision, thanks to their ability to automatize the feature extraction process in visual tasks. However, the knowledge acquired during training is fully subsymbolic, and hence difficult to understand and explain to end users. In this paper, we propose a new technique called HOLMES (HOLonym-MEronym based Semantic inspection) that decomposes a label into a set of related concepts, and provides component-level explanations for an image classification model. Specifically, HOLMES leverages ontologies, web scraping and transfer learning to automatically construct meronym (parts)-based detectors for a given holonym (class). Then, it produces heatmaps at the meronym level and finally, by probing the holonym CNN with occluded images, it highlights the importance of each part on the classification output. Compared to state-of-the-art saliency methods, HOLMES takes a step further and provides information about both where and what the holonym CNN is looking at, without relying on densely annotated datasets and without forcing concepts to be associated to single computational units. Extensive experimental evaluation on different categories of objects (animals, tools and vehicles) shows the feasibility of our approach. On average, HOLMES explanations include at least two meronyms, and the ablation of a single meronym roughly halves the holonym model confidence. The resulting heatmaps were quantitatively evaluated using the deletion/insertion/preservation curves. All metrics were comparable to those achieved by GradCAM, while offering the advantage of further decomposing the heatmap in human-understandable concepts, thus highlighting both the relevance of meronyms to object classification, as well as HOLMES ability to capture it. The code is available at https://github.com/FrancesC0de/HOLMES.  ( 3 min )
    Mean-Field Microcanonical Gradient Descent
    arXiv:2403.08362v1 Announce Type: cross Abstract: Microcanonical gradient descent is a sampling procedure for energy-based models allowing for efficient sampling of distributions in high dimension. It works by transporting samples from a high-entropy distribution, such as Gaussian white noise, to a low-energy region using gradient descent. We put this model in the framework of normalizing flows, showing how it can often overfit by losing an unnecessary amount of entropy in the descent. As a remedy, we propose a mean-field microcanonical gradient descent that samples several weakly coupled data points simultaneously, allowing for better control of the entropy loss while paying little in terms of likelihood fit. We study these models in the context of financial time series, illustrating the improvements on both synthetic and real data.  ( 2 min )
    Kernel-Based Testing for Single-Cell Differential Analysis
    arXiv:2307.08509v2 Announce Type: replace-cross Abstract: Single-cell technologies offer insights into molecular feature distributions, but comparing them poses challenges. We propose a kernel-testing framework for non-linear cell-wise distribution comparison, analyzing gene expression and epigenomic modifications. Our method allows feature-wise and global transcriptome/epigenome comparisons, revealing cell population heterogeneities. Using a classifier based on embedding variability, we identify transitions in cell states, overcoming limitations of traditional single-cell analysis. Applied to single-cell ChIP-Seq data, our approach identifies untreated breast cancer cells with an epigenomic profile resembling persister cells. This demonstrates the effectiveness of kernel testing in uncovering subtle population variations that might be missed by other methods.  ( 2 min )
    Speech Robust Bench: A Robustness Benchmark For Speech Recognition
    arXiv:2403.07937v1 Announce Type: cross Abstract: As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 69 input perturbations which are intended to simulate various corruptions that ASR models may encounter in the physical and digital world. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices such as discrete representations, and self-training appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females, and observed noticeable disparities in the model's robustness across subgroups. We believe that SRB will facilitate future research towards robust ASR models, by making it easier to conduct comprehensive and comparable robustness evaluations.  ( 3 min )
    Learning Human-like Representations to Enable Learning Human Values
    arXiv:2312.14106v2 Announce Type: replace-cross Abstract: How can we build AI systems that are aligned with human values to avoid causing harm or violating societal standards for acceptable behavior? We argue that representational alignment between humans and AI agents facilitates value alignment. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We propose that this kind of representational alignment between machine learning (ML) models and humans can also support value alignment, allowing ML systems to conform to human values and societal norms. We focus on ethics as one aspect of value alignment and train ML agents using a variety of methods in a multi-armed bandit setting, where rewards reflect the moral acceptability of the chosen action. We use a synthetic experiment to demonstrate that agents' representational alignment with the environment bounds their learning performance. We then repeat this procedure in a realistic setting, using textual action descriptions and similarity judgments collected from humans and a variety of language models, to show that the results generalize and are model-agnostic when grounded in an ethically relevant context.  ( 3 min )
    ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training
    arXiv:2403.07920v1 Announce Type: cross Abstract: We propose ProtLLM, a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. Besides, we propose the protein-as-word language modeling approach to train ProtLLM. By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates. Additionally, we construct a large-scale interleaved protein-text dataset, named InterPT, for pre-training. This dataset comprehensively encompasses both (1) structured data sources like protein annotations and (2) unstructured data sources like biological research papers, thereby endowing ProtLLM with crucial knowledge for understanding proteins. We evaluate ProtLLM on classic supervised protein-centric tasks and explore its novel protein-language applications. Experimental results demonstrate that ProtLLM not only achieves superior performance against protein-specialized baselines on protein-centric tasks but also induces zero-shot and in-context learning capabilities on protein-language tasks.  ( 3 min )
    Investigating the Effect of Misalignment on Membership Privacy in the White-box Setting
    arXiv:2306.05093v2 Announce Type: replace-cross Abstract: Machine learning models have been shown to leak sensitive information about their training datasets. Models are increasingly deployed on devices, raising concerns that white-box access to the model parameters increases the attack surface compared to black-box access which only provides query access. Directly extending the shadow modelling technique from the black-box to the white-box setting has been shown, in general, not to perform better than black-box only attacks. A potential reason is misalignment, a known characteristic of deep neural networks. In the shadow modelling context, misalignment means that, while the shadow models learn similar features in each layer, the features are located in different positions. We here present the first systematic analysis of the causes of misalignment in shadow models and show the use of a different weight initialisation to be the main cause. We then extend several re-alignment techniques, previously developed in the model fusion literature, to the shadow modelling context, where the goal is to re-align the layers of a shadow model to those of the target model. We show re-alignment techniques to significantly reduce the measured misalignment between the target and shadow models. Finally, we perform a comprehensive evaluation of white-box membership inference attacks (MIA). Our analysis reveals that internal layer activation-based MIAs suffer strongly from shadow model misalignment, while gradient-based MIAs are only sometimes significantly affected. We show that re-aligning the shadow models strongly improves the former's performance and can also improve the latter's performance, although less frequently. Taken together, our results highlight that on-device deployment increases the attack surface and that the newly available information can be used to build more powerful attacks.  ( 3 min )
    Agent Lumos: Unified and Modular Training for Open-Source Language Agents
    arXiv:2311.05657v2 Announce Type: replace-cross Abstract: Closed-source agents suffer from several issues such as a lack of affordability, transparency, and reproducibility, particularly on complex interactive tasks. This motivates the development of open-source alternatives. We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents. LUMOS features a learnable, unified, and modular architecture with a planning module that learns high-level subgoal generation, and a grounding module trained to translate these into actions using various tools in the execution module. The design allows for modular upgrades and wider applicability to diverse interactive tasks. To foster generalizable agent learning, we collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales across various complex interactive tasks. On 9 datasets, LUMOS exhibits several key advantages: (1) LUMOS excels multiple larger open-source agents on the held-out datasets (unused for training) for each task type. LUMOS even surpasses GPT agents on QA and web tasks; (2) LUMOS outperforms open-source agents produced by chain-of-thoughts and unmodularized integrated training; and (3) LUMOS effectively generalizes to unseen tasks, outperforming 33B-scale agents and domain-specific agents.  ( 3 min )
    SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
    arXiv:2312.11598v2 Announce Type: replace-cross Abstract: Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent trajectories from high-level instructions remains challenging, especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. This allows generating diverse state trajectories that adhere to the learnable skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser. More visualization results and information could be found on our website.  ( 3 min )
    Universal Neural-Cracking-Machines: Self-Configurable Password Models from Auxiliary Data
    arXiv:2301.07628v5 Announce Type: replace-cross Abstract: We introduce the concept of "universal password model" -- a password model that, once pre-trained, can automatically adapt its guessing strategy based on the target system. To achieve this, the model does not need to access any plaintext passwords from the target credentials. Instead, it exploits users' auxiliary information, such as email addresses, as a proxy signal to predict the underlying password distribution. Specifically, the model uses deep learning to capture the correlation between the auxiliary data of a group of users (e.g., users of a web application) and their passwords. It then exploits those patterns to create a tailored password model for the target system at inference time. No further training steps, targeted data collection, or prior knowledge of the community's password distribution is required. Besides improving over current password strength estimation techniques and attacks, the model enables any end-user (e.g., system administrators) to autonomously generate tailored password models for their systems without the often unworkable requirements of collecting suitable training data and fitting the underlying machine learning model. Ultimately, our framework enables the democratization of well-calibrated password models to the community, addressing a major challenge in the deployment of password security solutions at scale.  ( 3 min )
    Simple and Scalable Strategies to Continually Pre-train Large Language Models
    arXiv:2403.08763v1 Announce Type: new Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.  ( 3 min )
    Local Binary and Multiclass SVMs Trained on a Quantum Annealer
    arXiv:2403.08584v1 Announce Type: cross Abstract: Support vector machines (SVMs) are widely used machine learning models (e.g., in remote sensing), with formulations for both classification and regression tasks. In the last years, with the advent of working quantum annealers, hybrid SVM models characterised by quantum training and classical execution have been introduced. These models have demonstrated comparable performance to their classical counterparts. However, they are limited in the training set size due to the restricted connectivity of the current quantum annealers. Hence, to take advantage of large datasets (like those related to Earth observation), a strategy is required. In the classical domain, local SVMs, namely, SVMs trained on the data samples selected by a k-nearest neighbors model, have already proven successful. Here, the local application of quantum-trained SVM models is proposed and empirically assessed. In particular, this approach allows overcoming the constraints on the training set size of the quantum-trained models while enhancing their performance. In practice, the FaLK-SVM method, designed for efficient local SVMs, has been combined with quantum-trained SVM models for binary and multiclass classification. In addition, for comparison, FaLK-SVM has been interfaced for the first time with a classical single-step multiclass SVM model (CS SVM). Concerning the empirical evaluation, D-Wave's quantum annealers and real-world datasets taken from the remote sensing domain have been employed. The results have shown the effectiveness and scalability of the proposed approach, but also its practical applicability in a real-world large-scale scenario.  ( 3 min )
    TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
    arXiv:2403.08049v1 Announce Type: cross Abstract: Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives than timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to effectively harness their powers, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. We show that TutoAI has achieved higher or similar quality compared to a baseline model in preliminary user studies.  ( 2 min )
    Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
    arXiv:2403.08743v1 Announce Type: cross Abstract: Large language models (LLMs) can easily generate biased and discriminative responses. As LLMs tap into consequential decision-making (e.g., hiring and healthcare), it is of crucial importance to develop strategies to mitigate these biases. This paper focuses on social bias, tackling the association between demographic information and LLM outputs. We propose a causality-guided debiasing framework that utilizes causal understandings of (1) the data-generating process of the training corpus fed to LLMs, and (2) the internal reasoning process of LLM inference, to guide the design of prompts for debiasing LLM outputs through selection mechanisms. Our framework unifies existing de-biasing prompting approaches such as inhibitive instructions and in-context contrastive examples, and sheds light on new ways of debiasing by encouraging bias-free reasoning. Our strong empirical performance on real-world datasets demonstrates that our framework provides principled guidelines on debiasing LLM outputs even with only the black-box access.  ( 2 min )
    AutoDFP: Automatic Data-Free Pruning via Channel Similarity Reconstruction
    arXiv:2403.08204v1 Announce Type: new Abstract: Structured pruning methods are developed to bridge the gap between the massive scale of neural networks and the limited hardware resources. Most current structured pruning methods rely on training datasets to fine-tune the compressed model, resulting in high computational burdens and being inapplicable for scenarios with stringent requirements on privacy and security. As an alternative, some data-free methods have been proposed, however, these methods often require handcraft parameter tuning and can only achieve inflexible reconstruction. In this paper, we propose the Automatic Data-Free Pruning (AutoDFP) method that achieves automatic pruning and reconstruction without fine-tuning. Our approach is based on the assumption that the loss of information can be partially compensated by retaining focused information from similar channels. Specifically, We formulate data-free pruning as an optimization problem, which can be effectively addressed through reinforcement learning. AutoDFP assesses the similarity of channels for each layer and provides this information to the reinforcement learning agent, guiding the pruning and reconstruction process of the network. We evaluate AutoDFP with multiple networks on multiple datasets, achieving impressive compression results. For instance, on the CIFAR-10 dataset, AutoDFP demonstrates a 2.87\% reduction in accuracy loss compared to the recently proposed data-free pruning method DFPC with fewer FLOPs on VGG-16. Furthermore, on the ImageNet dataset, AutoDFP achieves 43.17\% higher accuracy than the SOTA method with the same 80\% preserved ratio on MobileNet-V1.  ( 3 min )
    Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model
    arXiv:2308.09262v3 Announce Type: replace-cross Abstract: This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is utilized to estimate three pseudo labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). Multi-task learning is then employed to train MTQ-Net by combining a supervised loss (derived from the difference between the estimated score and the ground-truth label) and a semi-supervised loss (derived from the difference between the estimated score and the pseudo label), where the Huber loss is employed as the loss function. Experimental results first demonstrate the advantages of MPL compared to training a model from scratch and using a direct knowledge transfer mechanism. Second, the benefit of the Huber loss for improving the predictive ability of MTQ-Net is verified. Finally, the MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.  ( 3 min )
    A Sparsity Principle for Partially Observable Causal Representation Learning
    arXiv:2403.08335v1 Announce Type: new Abstract: Causal representation learning aims at identifying high-level causal variables from perceptual data. Most methods assume that all latent causal variables are captured in the high-dimensional observations. We instead consider a partially observed setting, in which each measurement only provides information about a subset of the underlying causal state. Prior work has studied this setting with multiple domains or views, each depending on a fixed subset of latents. Here, we focus on learning from unpaired observations from a dataset with an instance-dependent partial observability pattern. Our main contribution is to establish two identifiability results for this setting: one for linear mixing functions without parametric assumptions on the underlying causal model, and one for piecewise linear mixing functions with Gaussian latent causal variables. Based on these insights, we propose two methods for estimating the underlying causal variables by enforcing sparsity in the inferred representation. Experiments on different simulated datasets and established benchmarks highlight the effectiveness of our approach in recovering the ground-truth latents.  ( 3 min )
    Lowering Detection in Sport Climbing Based on Orientation of the Sensor Enhanced Quickdraw
    arXiv:2301.10164v2 Announce Type: replace-cross Abstract: Tracking climbers' activity to improve services and make the best use of their infrastructure is a concern for climbing gyms. Each climbing session must be analyzed from beginning till lowering of the climber. Therefore, spotting the climbers descending is crucial since it indicates when the ascent has come to an end. This problem must be addressed while preserving privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence become practical in terms of expenses and time consumption for replacement when using in large quantity in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detect sensors' orientation patterns during lowering different routes, and develop an supervised approach to identify lowering.  ( 3 min )
    UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science
    arXiv:2307.09249v2 Announce Type: replace Abstract: Recent advancements in NLP have witnessed the groundbreaking impact of pretrained models, yielding impressive outcomes across various tasks. This study seeks to extend the power of pretraining methodologies to facilitating the prediction over tables in data science, a domain traditionally overlooked, yet inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work revolve around the establishment of a universal pretraining protocol for tables with varied structures, the generalizability and transferability of learned knowledge across tasks, the adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a straightforward yet effective method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures. UniTabE's core concept relies on representing each basic table element with a module, termed TabUnit. This is subsequently followed by a Transformer encoder to refine the representation. Moreover, our model is designed to facilitate pretraining and finetuning through the utilization of free-form prompts. In order to implement the pretraining phase, we curated an expansive tabular dataset comprising approximately 13B samples, meticulously gathered from the Kaggle platform. This research primarily centers on classification and regression tasks involving tabular data, and conducts rigorous experimental testing and analyses to validate the effectiveness of our methodology. The experimental results demonstrate UniTabE's superior performance against several baselines across massive benchmarks. This, therefore, underscores UniTabE's potential to significantly enhance the semantic representation of tabular data, thereby marking a significant stride for tabular data analysis.  ( 3 min )
    Referential communication in heterogeneous communities of pre-trained visual deep networks
    arXiv:2302.08913v4 Announce Type: replace-cross Abstract: As large pre-trained image-processing neural networks are being embedded in autonomous agents such as self-driving cars or robots, the question arises of how such systems can communicate with each other about the surrounding world, despite their different architectures and training regimes. As a first step in this direction, we systematically explore the task of \textit{referential communication} in a community of heterogeneous state-of-the-art pre-trained visual networks, showing that they can develop, in a self-supervised way, a shared protocol to refer to a target object among a set of candidates. This shared protocol can also be used, to some extent, to communicate about previously unseen object categories of different granularity. Moreover, a visual network that was not initially part of an existing community can learn the community's protocol with remarkable ease. Finally, we study, both qualitatively and quantitatively, the properties of the emergent protocol, providing some evidence that it is capturing high-level semantic features of objects.  ( 3 min )
    FSDR: A Novel Deep Learning-based Feature Selection Algorithm for Pseudo Time-Series Data using Discrete Relaxation
    arXiv:2403.08403v1 Announce Type: new Abstract: Conventional feature selection algorithms applied to Pseudo Time-Series (PTS) data, which consists of observations arranged in sequential order without adhering to a conventional temporal dimension, often exhibit impractical computational complexities with high dimensional data. To address this challenge, we introduce a Deep Learning (DL)-based feature selection algorithm: Feature Selection through Discrete Relaxation (FSDR), tailored for PTS data. Unlike the existing feature selection algorithms, FSDR learns the important features as model parameters using discrete relaxation, which refers to the process of approximating a discrete optimisation problem with a continuous one. FSDR is capable of accommodating a high number of feature dimensions, a capability beyond the reach of existing DL-based or traditional methods. Through testing on a hyperspectral dataset (i.e., a type of PTS data), our experimental results demonstrate that FSDR outperforms three commonly used feature selection algorithms, taking into account a balance among execution time, $R^2$, and $RMSE$.  ( 3 min )
    Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems
    arXiv:2403.08585v1 Announce Type: new Abstract: Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice and plays an important role in the generalization of modern machine learning. However, prior research has revealed instances where the generalization performance of SGD is worse than ridge regression due to uneven optimization along different dimensions. Preconditioning offers a natural solution to this issue by rebalancing optimization across different directions. Yet, the extent to which preconditioning can enhance the generalization performance of SGD and whether it can bridge the existing gap with ridge regression remains uncertain. In this paper, we study the generalization performance of SGD with preconditioning for the least squared problem. We make a comprehensive comparison between preconditioned SGD and (standard \& preconditioned) ridge regression. Our study makes several key contributions toward understanding and improving SGD with preconditioning. First, we establish excess risk bounds (generalization performance) for preconditioned SGD and ridge regression under an arbitrary preconditions matrix. Second, leveraging the excessive risk characterization of preconditioned SGD and ridge regression, we show that (through construction) there exists a simple preconditioned matrix that can outperform (standard \& preconditioned) ridge regression. Finally, we show that our proposed preconditioning matrix is straightforward enough to allow robust estimation from finite samples while maintaining a theoretical advantage over ridge regression. Our empirical results align with our theoretical findings, collectively showcasing the enhanced regularization effect of preconditioned SGD.  ( 3 min )
    Domain constraints improve risk prediction when outcome data is missing
    arXiv:2312.03878v2 Announce Type: replace Abstract: Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making determines whether the outcome is observed: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.  ( 3 min )
    Can physical information aid the generalization ability of Neural Networks for hydraulic modeling?
    arXiv:2403.08589v1 Announce Type: new Abstract: Application of Neural Networks to river hydraulics is fledgling, despite the field suffering from data scarcity, a challenge for machine learning techniques. Consequently, many purely data-driven Neural Networks proved to lack predictive capabilities. In this work, we propose to mitigate such problem by introducing physical information into the training phase. The idea is borrowed from Physics-Informed Neural Networks which have been recently proposed in other contexts. Physics-Informed Neural Networks embed physical information in the form of the residual of the Partial Differential Equations (PDEs) governing the phenomenon and, as such, are conceived as neural solvers, i.e. an alternative to traditional numerical solvers. Such approach is seldom suitable for environmental hydraulics, where epistemic uncertainties are large, and computing residuals of PDEs exhibits difficulties similar to those faced by classical numerical methods. Instead, we envisaged the employment of Neural Networks as neural operators, featuring physical constraints formulated without resorting to PDEs. The proposed novel methodology shares similarities with data augmentation and regularization. We show that incorporating such soft physical information can improve predictive capabilities.  ( 3 min )
    Self-Supervised Learning for Covariance Estimation
    arXiv:2403.08662v1 Announce Type: cross Abstract: We consider the use of deep learning for covariance estimation. We propose to globally learn a neural network that will then be applied locally at inference time. Leveraging recent advancements in self-supervised foundational models, we train the network without any labeling by simply masking different samples and learning to predict their covariance given their surrounding neighbors. The architecture is based on the popular attention mechanism. Its main advantage over classical methods is the automatic exploitation of global characteristics without any distributional assumptions or regularization. It can be pre-trained as a foundation model and then be repurposed for various downstream tasks, e.g., adaptive target detection in radar or hyperspectral imagery.  ( 2 min )
    Temporal Decisions: Leveraging Temporal Correlation for Efficient Decisions in Early Exit Neural Networks
    arXiv:2403.07958v1 Announce Type: new Abstract: Deep Learning is becoming increasingly relevant in Embedded and Internet-of-things applications. However, deploying models on embedded devices poses a challenge due to their resource limitations. This can impact the model's inference accuracy and latency. One potential solution are Early Exit Neural Networks, which adjust model depth dynamically through additional classifiers attached between their hidden layers. However, the real-time termination decision mechanism is critical for the system's efficiency, latency, and sustained accuracy. This paper introduces Difference Detection and Temporal Patience as decision mechanisms for Early Exit Neural Networks. They leverage the temporal correlation present in sensor data streams to efficiently terminate the inference. We evaluate their effectiveness in health monitoring, image classification, and wake-word detection tasks. Our novel contributions were able to reduce the computational footprint compared to established decision mechanisms significantly while maintaining higher accuracy scores. We achieved a reduction of mean operations per inference by up to 80% while maintaining accuracy levels within 5% of the original model. These findings highlight the importance of considering temporal correlation in sensor data to improve the termination decision.  ( 3 min )
    Fast Dual-Regularized Autoencoder for Sparse Biological Data
    arXiv:2401.16664v2 Announce Type: replace Abstract: Relationship inference from sparse data is an important task with applications ranging from product recommendation to drug discovery. A recently proposed linear model for sparse matrix completion has demonstrated surprising advantage in speed and accuracy over more sophisticated recommender systems algorithms. Here we extend the linear model to develop a shallow autoencoder for the dual neighborhood-regularized matrix completion problem. We demonstrate the speed and accuracy advantage of our approach over the existing state-of-the-art in predicting drug-target interactions and drug-disease associations.  ( 2 min )
    STMPL: Human Soft-Tissue Simulation
    arXiv:2403.08344v1 Announce Type: cross Abstract: In various applications, such as virtual reality and gaming, simulating the deformation of soft tissues in the human body during interactions with external objects is essential. Traditionally, Finite Element Methods (FEM) have been employed for this purpose, but they tend to be slow and resource-intensive. In this paper, we propose a unified representation of human body shape and soft tissue with a data-driven simulator of non-rigid deformations. This approach enables rapid simulation of realistic interactions. Our method builds upon the SMPL model, which generates human body shapes considering rigid transformations. We extend SMPL by incorporating a soft tissue layer and an intuitive representation of external forces applied to the body during object interactions. Specifically, we mapped the 3D body shape and soft tissue and applied external forces to 2D UV maps. Leveraging a UNET architecture designed for 2D data, our approach achieves high-accuracy inference in real time. Our experiment shows that our method achieves plausible deformation of the soft tissue layer, even for unseen scenarios.  ( 2 min )
    DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
    arXiv:2403.08506v1 Announce Type: new Abstract: Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the number of domains, which must strictly match the number of clients. Because of the underutilization of numerous edge devices and additional cross-client domain annotations in the real world, such restrictions may be impractical and involve potential privacy leaks. In this paper, we propose an efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a method that tackles the above restrictions by learning adaptive prompts for domain generalization in a distributed manner. Specifically, we first design two types of prompts, i.e., global prompt to capture general knowledge across all clients and domain prompts to capture domain-specific knowledge. They eliminate the restriction on the one-to-one mapping between source domains and local clients. Furthermore, a dynamic query metric is introduced to automatically search the suitable domain label for each sample, which includes two-substep text-image alignments based on prompt tuning without labor-intensive annotation. Extensive experiments on multiple datasets demonstrate that our DiPrompT achieves superior domain generalization performance over state-of-the-art FL methods when domain labels are not provided, and even outperforms many centralized learning methods using domain labels.  ( 3 min )
    MicroT: Low-Energy and Adaptive Models for MCUs
    arXiv:2403.08040v1 Announce Type: new Abstract: We propose MicroT, a low-energy, multi-task adaptive model framework for resource-constrained MCUs. We divide the original model into a feature extractor and a classifier. The feature extractor is obtained through self-supervised knowledge distillation and further optimized into part and full models through model splitting and joint training. These models are then deployed on MCUs, with classifiers added and trained on local tasks, ultimately performing stage-decision for joint inference. In this process, the part model initially processes the sample, and if the confidence score falls below the set threshold, the full model will resume and continue the inference. We evaluate MicroT on two models, three datasets, and two MCU boards. Our experimental evaluation shows that MicroT effectively improves model performance and reduces energy consumption when dealing with multiple local tasks. Compared to the unoptimized feature extractor, MicroT can improve accuracy by up to 9.87%. On MCUs, compared to the standard full model inference, MicroT can save up to about 29.13% in energy consumption. MicroT also allows users to adaptively adjust the stage-decision ratio as needed, better balancing model performance and energy consumption. Under the standard stage-decision ratio configuration, MicroT can increase accuracy by 5.91% and save about 14.47% of energy consumption.  ( 3 min )
    Feasibility of machine learning-based rice yield prediction in India at the district level using climate reanalysis data
    arXiv:2403.07967v1 Announce Type: new Abstract: Yield forecasting, the science of predicting agricultural productivity before the crop harvest occurs, helps a wide range of stakeholders make better decisions around agricultural planning. This study aims to investigate whether machine learning-based yield prediction models can capably predict Kharif season rice yields at the district level in India several months before the rice harvest takes place. The methodology involved training 19 machine learning models such as CatBoost, LightGBM, Orthogonal Matching Pursuit, and Extremely Randomized Trees on 20 years of climate, satellite, and rice yield data across 247 of Indian rice-producing districts. In addition to model-building, a dynamic dashboard was built understand how the reliability of rice yield predictions varies across districts. The results of the proof-of-concept machine learning pipeline demonstrated that rice yields can be predicted with a reasonable degree of accuracy, with out-of-sample R2, MAE, and MAPE performance of up to 0.82, 0.29, and 0.16 respectively. These results outperformed test set performance reported in related literature on rice yield modeling in other contexts and countries. In addition, SHAP value analysis was conducted to infer both the importance and directional impact of the climate and remote sensing variables included in the model. Important features driving rice yields included temperature, soil water volume, and leaf area index. In particular, higher temperatures in August correlate with increased rice yields, particularly when the leaf area index in August is also high. Building on the results, a proof-of-concept dashboard was developed to allow users to easily explore which districts may experience a rise or fall in yield relative to the previous year.  ( 3 min )
    Conditional computation in neural networks: principles and research trends
    arXiv:2403.07965v1 Announce Type: new Abstract: This article summarizes principles and ideas from the emerging area of applying \textit{conditional computation} methods to the design of neural networks. In particular, we focus on neural networks that can dynamically activate or de-activate parts of their computational graph conditionally on their input. Examples include the dynamic selection of, e.g., input tokens, layers (or sets of layers), and sub-modules inside each layer (e.g., channels in a convolutional filter). We first provide a general formalism to describe these techniques in an uniform way. Then, we introduce three notable implementations of these principles: mixture-of-experts (MoEs) networks, token selection mechanisms, and early-exit neural networks. The paper aims to provide a tutorial-like introduction to this growing field. To this end, we analyze the benefits of these modular designs in terms of efficiency, explainability, and transfer learning, with a focus on emerging applicative areas ranging from automated scientific discovery to semantic communication.  ( 2 min )
    Applying ranking techniques for estimating influence of Earth variables on temperature forecast error
    arXiv:2403.07966v1 Announce Type: new Abstract: This paper describes how to analyze the influence of Earth system variables on the errors when providing temperature forecasts. The initial framework to get the data has been based on previous research work, which resulted in a very interesting discovery. However, the aforementioned study only worked on individual correlations of the variables with respect to the error. This research work is going to re-use the main ideas but introduce three main novelties: (1) applying a data science approach by a few representative locations; (2) taking advantage of the rankings created by Spearman correlation but enriching them with other metrics looking for a more robust ranking of the variables; (3) evaluation of the methodology by learning random forest models for regression with the distinct experimental variations. The main contribution is the framework that shows how to convert correlations into rankings and combine them into an aggregate ranking. We have carried out experiments on five chosen locations to analyze the behavior of this ranking-based methodology. The results show that the specific performance is dependent on the location and season, which is expected, and that this selection technique works properly with Random Forest models but can also improve simpler regression models such as Bayesian Ridge. This work also contributes with an extensive analysis of the results. We can conclude that this selection based on the top-k ranked variables seems promising for this real problem, and it could also be applied in other domains.  ( 3 min )
    An Analysis of Human Alignment of Latent Diffusion Models
    arXiv:2403.08469v1 Announce Type: new Abstract: Diffusion models, trained on large amounts of data, showed remarkable performance for image synthesis. They have high error consistency with humans and low texture bias when used for classification. Furthermore, prior work demonstrated the decomposability of their bottleneck layer representations into semantic directions. In this work, we analyze how well such representations are aligned to human responses on a triplet odd-one-out task. We find that despite the aforementioned observations: I) The representational alignment with humans is comparable to that of models trained only on ImageNet-1k. II) The most aligned layers of the denoiser U-Net are intermediate layers and not the bottleneck. III) Text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stage of generation.  ( 2 min )
    Deep Submodular Peripteral Network
    arXiv:2403.08199v1 Announce Type: new Abstract: Submodular functions, crucial for various applications, often lack practical learning methods for their acquisition. Seemingly unrelated, learning a scaling from oracles offering graded pairwise preferences (GPC) is underexplored, despite a rich history in psychometrics. In this paper, we introduce deep submodular peripteral networks (DSPNs), a novel parametric family of submodular functions, and methods for their training using a contrastive-learning inspired GPC-ready strategy to connect and then tackle both of the above challenges. We introduce newly devised GPC-style "peripteral" loss which leverages numerically graded relationships between pairs of objects (sets in our case). Unlike traditional contrastive learning, our method utilizes graded comparisons, extracting more nuanced information than just binary-outcome comparisons, and contrasts sets of any size (not just two). We also define a novel suite of automatic sampling strategies for training, including active-learning inspired submodular feedback. We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function showing superiority in downstream tasks such as experimental design and streaming applications.  ( 2 min )
    VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks
    arXiv:2307.02040v3 Announce Type: replace Abstract: Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.  ( 3 min )
    BG-HGNN: Toward Scalable and Efficient Heterogeneous Graph Neural Network
    arXiv:2403.08207v1 Announce Type: new Abstract: Many computer vision and machine learning problems are modelled as learning tasks on heterogeneous graphs, featuring a wide array of relations from diverse types of nodes and edges. Heterogeneous graph neural networks (HGNNs) stand out as a promising neural model class designed for heterogeneous graphs. Built on traditional GNNs, existing HGNNs employ different parameter spaces to model the varied relationships. However, the practical effectiveness of existing HGNNs is often limited to simple heterogeneous graphs with few relation types. This paper first highlights and demonstrates that the standard approach employed by existing HGNNs inevitably leads to parameter explosion and relation collapse, making HGNNs less effective or impractical for complex heterogeneous graphs with numerous relation types. To overcome this issue, we introduce a novel framework, Blend&Grind-HGNN (BG-HGNN), which effectively tackles the challenges by carefully integrating different relations into a unified feature space manageable by a single set of parameters. This results in a refined HGNN method that is more efficient and effective in learning from heterogeneous graphs, especially when the number of relations grows. Our empirical studies illustrate that BG-HGNN significantly surpasses existing HGNNs in terms of parameter efficiency (up to 28.96 $\times$), training throughput (up to 8.12 $\times$), and accuracy (up to 1.07 $\times$).  ( 3 min )
    Probabilistic Constraint for Safety-Critical Reinforcement Learning
    arXiv:2306.17279v2 Announce Type: replace Abstract: In this paper, we consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL). Specifically, a safe policy or controller is one that, with high probability, maintains the trajectory of the agent in a given safe set. We establish a connection between this probabilistic-constrained setting and the cumulative-constrained formulation that is frequently explored in the existing literature. We provide theoretical bounds elucidating that the probabilistic-constrained setting offers a better trade-off in terms of optimality and safety (constraint satisfaction). The challenge encountered when dealing with the probabilistic constraints, as explored in this work, arises from the absence of explicit expressions for their gradients. Our prior work provides such an explicit gradient expression for probabilistic constraints which we term Safe Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved gradient SPG-Actor-Critic that leads to a lower variance than SPG-REINFORCE, which is substantiated by our theoretical results. A noteworthy aspect of both SPGs is their inherent algorithm independence, rendering them versatile for application across a range of policy-based algorithms. Furthermore, we propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies. It is subsequently followed by theoretical analyses that encompass the convergence of the algorithm, as well as the near-optimality and feasibility on average. In addition, we test the proposed approaches by a series of empirical experiments. These experiments aim to examine and analyze the inherent trade-offs between the optimality and safety, and serve to substantiate the efficacy of two SPGs, as well as our theoretical contributions.  ( 3 min )
    Tight Group-Level DP Guarantees for DP-SGD with Sampling via Mixture of Gaussians Mechanisms
    arXiv:2401.10294v2 Announce Type: replace-cross Abstract: We give a procedure for computing group-level $(\epsilon, \delta)$-DP guarantees for DP-SGD, when using Poisson sampling or fixed batch size sampling. Up to discretization errors in the implementation, the DP guarantees computed by this procedure are tight (assuming we release every intermediate iterate).  ( 2 min )
    COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments
    arXiv:2403.08444v1 Announce Type: cross Abstract: In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrate that COSTREAM can produce highly accurate cost estimates for the initial operator placement and even generalize to unseen placements, queries, and hardware. When using COSTREAM to optimize the placements of streaming operators, a median speed-up of around 21x can be achieved compared to baselines.  ( 2 min )
    Unsupervised self-organising map of prostate cell Raman spectra shows disease-state subclustering
    arXiv:2403.07960v1 Announce Type: cross Abstract: Prostate cancer is a disease which poses an interesting clinical question: should it be treated? A small subset of prostate cancers are aggressive and require removal and treatment to prevent metastatic spread. However, conventional diagnostics remain challenged to risk-stratify such patients, hence, new methods of approach to biomolecularly subclassify the disease are needed. Here we use an unsupervised, self-organising map approach to analyse live-cell Raman spectroscopy data obtained from prostate cell-lines; our aim is to test the feasibility of this method to differentiate, at the single-cell-level, cancer from normal using high-dimensional datasets with minimal preprocessing. The results demonstrate not only successful separation of normal prostate and cancer cells, but also a new subclustering of the prostate cancer cell-line into two groups. Initial analysis of the spectra from each of the cancer subclusters demonstrates a differential expression of lipids, which, against the normal control, may be linked to disease-related changes in cellular signalling.  ( 2 min )
    Reduced Jeffries-Matusita distance: A Novel Loss Function to Improve Generalization Performance of Deep Classification Models
    arXiv:2403.08408v1 Announce Type: new Abstract: The generalization performance of deep neural networks in classification tasks is a major concern in machine learning research. Despite widespread techniques used to diminish the over-fitting issue such as data augmentation, pseudo-labeling, regularization, and ensemble learning, this performance still needs to be enhanced with other approaches. In recent years, it has been theoretically demonstrated that the loss function characteristics i.e. its Lipschitzness and maximum value affect the generalization performance of deep neural networks which can be utilized as a guidance to propose novel distance measures. In this paper, by analyzing the aforementioned characteristics, we introduce a distance called Reduced Jeffries-Matusita as a loss function for training deep classification models to reduce the over-fitting issue. In our experiments, we evaluate the new loss function in two different problems: image classification in computer vision and node classification in the context of graph learning. The results show that the new distance measure stabilizes the training process significantly, enhances the generalization ability, and improves the performance of the models in the Accuracy and F1-score metrics, even if the training set size is small.  ( 3 min )
    A Neural-Evolutionary Algorithm for Autonomous Transit Network Design
    arXiv:2403.07917v1 Announce Type: cross Abstract: Planning a public transit network is a challenging optimization problem, but essential in order to realize the benefits of autonomous buses. We propose a novel algorithm for planning networks of routes for autonomous buses. We first train a graph neural net model as a policy for constructing route networks, and then use the policy as one of several mutation operators in a evolutionary algorithm. We evaluate this algorithm on a standard set of benchmarks for transit network design, and find that it outperforms the learned policy alone by up to 20\% and a plain evolutionary algorithm approach by up to 53\% on realistic benchmark instances.  ( 2 min )
    PAGE: Domain-Incremental Adaptation with Past-Agnostic Generative Replay for Smart Healthcare
    arXiv:2403.08197v1 Announce Type: new Abstract: We propose PAGE, a domain-incremental adaptation strategy with past-agnostic generative replay for smart healthcare. PAGE enables generative replay without the aid of any preserved data or information from prior domains. When adapting to a new domain, it exploits real data from the new distribution and the current model to generate synthetic data that retain the learned knowledge of previous domains. By replaying the synthetic data with the new real data during training, PAGE achieves a good balance between domain adaptation and knowledge retention. In addition, we incorporate an extended inductive conformal prediction (EICP) method into PAGE to produce a confidence score and a credibility value for each detection result. This makes the predictions interpretable and provides statistical guarantees for disease detection in smart healthcare applications. We demonstrate PAGE's effectiveness in domain-incremental disease detection with three distinct disease datasets collected from commercially available WMSs. PAGE achieves highly competitive performance against state-of-the-art with superior scalability, data privacy, and feasibility. Furthermore, PAGE can enable up to 75% reduction in clinical workload with the help of EICP.  ( 3 min )
    Curriculum Graph Machine Learning: A Survey
    arXiv:2302.02926v2 Announce Type: replace Abstract: Graph machine learning has been extensively studied in both academia and industry. However, in the literature, most existing graph machine learning models are designed to conduct training with data samples in a random order, which may suffer from suboptimal performance due to ignoring the importance of different graph data samples and their training orders for the model optimization status. To tackle this critical problem, curriculum graph machine learning (Graph CL), which integrates the strength of graph machine learning and curriculum learning, arises and attracts an increasing amount of attention from the research community. Therefore, in this paper, we comprehensively overview approaches on Graph CL and present a detailed survey of recent advances in this direction. Specifically, we first discuss the key challenges of Graph CL and provide its formal problem definition. Then, we categorize and summarize existing methods into three classes based on three kinds of graph machine learning tasks, i.e., node-level, link-level, and graph-level tasks. Finally, we share our thoughts on future research directions. To the best of our knowledge, this paper is the first survey for curriculum graph machine learning.  ( 3 min )
    TSFool: Crafting Highly-Imperceptible Adversarial Time Series through Multi-Objective Attack
    arXiv:2209.06388v3 Announce Type: replace Abstract: Recent years have witnessed the success of recurrent neural network (RNN) models in time series classification (TSC). However, neural networks (NNs) are vulnerable to adversarial samples, which cause real-life adversarial attacks that undermine the robustness of AI models. To date, most existing attacks target at feed-forward NNs and image recognition tasks, but they cannot perform well on RNN-based TSC. This is due to the cyclical computation of RNN, which prevents direct model differentiation. In addition, the high visual sensitivity of time series to perturbations also poses challenges to local objective optimization of adversarial samples. In this paper, we propose an efficient method called TSFool to craft highly-imperceptible adversarial time series for RNN-based TSC. The core idea is a new global optimization objective known as "Camouflage Coefficient" that captures the imperceptibility of adversarial samples from the class distribution. Based on this, we reduce the adversarial attack problem to a multi-objective optimization problem that enhances the perturbation quality. Furthermore, to speed up the optimization process, we propose to use a representation model for RNN to capture deeply embedded vulnerable samples whose features deviate from the latent manifold. Experiments on 11 UCR and UEA datasets showcase that TSFool significantly outperforms six white-box and three black-box benchmark attacks in terms of effectiveness, efficiency and imperceptibility from various perspectives including standard measure, human study and real-world defense.  ( 3 min )
    Dual Branch Deep Learning Network for Detection and Stage Grading of Diabetic Retinopathy
    arXiv:2308.09945v2 Announce Type: replace-cross Abstract: Diabetic retinopathy is a severe complication of diabetes that can lead to permanent blindness if not treated promptly. Early and accurate diagnosis of the disease is essential for successful treatment. This paper introduces a deep learning method for the detection and stage grading of diabetic retinopathy, using a single fundus retinal image. Our model utilizes transfer learning, employing two state-of-the-art pre-trained models as feature extractors and fine-tuning them on a new dataset. The proposed model is trained on a large multi-center dataset, including the APTOS 2019 dataset, obtained from publicly available sources. It achieves remarkable performance in diabetic retinopathy detection and stage classification on the APTOS 2019, outperforming the established literature. For binary classification, the proposed approach achieves an accuracy of 98.50, a sensitivity of 99.46, and a specificity of 97.51. In stage grading, it achieves a quadratic weighted kappa of 93.00, an accuracy of 89.60, a sensitivity of 89.60, and a specificity of 97.72. The proposed approach serves as a reliable screening and stage grading tool for diabetic retinopathy, offering significant potential to enhance clinical decision-making and patient care.  ( 3 min )
    Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least
    arXiv:2302.09195v5 Announce Type: replace Abstract: Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this problem for the first time, by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of contrastive learning on such subsets. Through extensive experiments, we show that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10 and TinyImageNet, without affecting downstream task performance. In general, subsets selected by our method outperform random subsets by over 3% across these datasets. Interestingly, we also discover the subsets that contribute the most to contrastive learning are those that contribute the least to supervised learning. Code available at https://github.com/bigml-cs-ucla/sas-data-efficient-contrastive-learning.  ( 3 min )
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond
    arXiv:2203.07835v4 Announce Type: replace Abstract: With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.  ( 3 min )
    Random Search as a Baseline for Sparse Neural Network Architecture Search
    arXiv:2403.08265v1 Announce Type: new Abstract: Sparse neural networks have shown similar or better generalization performance than their dense counterparts while having higher parameter efficiency. This has motivated a number of works to learn, induce, or search for high performing sparse networks. While reports of quality or efficiency gains are impressive, standard baselines are lacking, therefore hindering having reliable comparability and reproducibility across methods. In this work, we provide an evaluation approach and a naive Random Search baseline method for finding good sparse configurations. We apply Random Search on the node space of an overparameterized network with the goal of finding better initialized sparse sub-networks that are positioned more advantageously in the loss landscape. We record sparse network post-training performances at various levels of sparsity and compare against both their fully connected parent networks and random sparse configurations at the same sparsity levels. We observe that for this architecture search task, initialized sparse networks found by Random Search neither perform better nor converge more efficiently than their random counterparts. Thus we conclude that Random Search may be viewed as a suitable neutral baseline for sparsity search methods.  ( 3 min )
    Regret Analysis of Policy Optimization over Submanifolds for Linearly Constrained Online LQG
    arXiv:2403.08553v1 Announce Type: cross Abstract: Recent advancement in online optimization and control has provided novel tools to study online linear quadratic regulator (LQR) problems, where cost matrices are varying adversarially over time. However, the controller parameterization of existing works may not satisfy practical conditions like sparsity due to physical connections. In this work, we study online linear quadratic Gaussian problems with a given linear constraint imposed on the controller. Inspired by the recent work of [1] which proposed, for a linearly constrained policy optimization of an offline LQR, a second order method equipped with a Riemannian metric that emerges naturally in the context of optimal control problems, we propose online optimistic Newton on manifold (OONM) which provides an online controller based on the prediction on the first and second order information of the function sequence. To quantify the proposed algorithm, we leverage the notion of regret defined as the sub-optimality of its cumulative cost to that of a (locally) minimizing controller sequence and provide the regret bound in terms of the path-length of the minimizer sequence. Simulation results are also provided to verify the property of OONM.  ( 3 min )
    The Effect of Different Optimization Strategies to Physics-Constrained Deep Learning for Soil Moisture Estimation
    arXiv:2403.08154v1 Announce Type: new Abstract: Soil moisture is a key hydrological parameter that has significant importance to human society and the environment. Accurate modeling and monitoring of soil moisture in crop fields, especially in the root zone (top 100 cm of soil), is essential for improving agricultural production and crop yield with the help of precision irrigation and farming tools. Realizing the full sensor data potential depends greatly on advanced analytical and predictive domain-aware models. In this work, we propose a physics-constrained deep learning (P-DL) framework to integrate physics-based principles on water transport and water sensing signals for effective reconstruction of the soil moisture dynamics. We adopt three different optimizers, namely Adam, RMSprop, and GD, to minimize the loss function of P-DL during the training process. In the illustrative case study, we demonstrate the empirical convergence of Adam optimizers outperforms the other optimization methods in both mini-batch and full-batch training.  ( 2 min )
    PTransIPs: Identification of phosphorylation sites enhanced by protein PLM embeddings
    arXiv:2308.05115v3 Announce Type: replace-cross Abstract: Phosphorylation is pivotal in numerous fundamental cellular processes and plays a significant role in the onset and progression of various diseases. The accurate identification of these phosphorylation sites is crucial for unraveling the molecular mechanisms within cells and during viral infections, potentially leading to the discovery of novel therapeutic targets. In this study, we develop PTransIPs, a new deep learning framework for the identification of phosphorylation sites. Independent testing results demonstrate that PTransIPs outperforms existing state-of-the-art (SOTA) methods, achieving AUCs of 0.9232 and 0.9660 for the identification of phosphorylated S/T and Y sites, respectively. PTransIPs contributes from three aspects. 1) PTransIPs is the first to apply protein pre-trained language model (PLM) embeddings to this task. It utilizes ProtTrans and EMBER2 to extract sequence and structure embeddings, respectively, as additional inputs into the model, effectively addressing issues of dataset size and overfitting, thus enhancing model performance; 2) PTransIPs is based on Transformer architecture, optimized through the integration of convolutional neural networks and TIM loss function, providing practical insights for model design and training; 3) The encoding of amino acids in PTransIPs enables it to serve as a universal framework for other peptide bioactivity tasks, with its excellent performance shown in extended experiments of this paper. Our code, data and models are publicly available at https://github.com/StatXzy7/PTransIPs.  ( 3 min )
    Learning How to Strategically Disclose Information
    arXiv:2403.08741v1 Announce Type: cross Abstract: Strategic information disclosure, in its simplest form, considers a game between an information provider (sender) who has access to some private information that an information receiver is interested in. While the receiver takes an action that affects the utilities of both players, the sender can design information (or modify beliefs) of the receiver through signal commitment, hence posing a Stackelberg game. However, obtaining a Stackelberg equilibrium for this game traditionally requires the sender to have access to the receiver's objective. In this work, we consider an online version of information design where a sender interacts with a receiver of an unknown type who is adversarially chosen at each round. Restricting attention to Gaussian prior and quadratic costs for the sender and the receiver, we show that $\mathcal{O}(\sqrt{T})$ regret is achievable with full information feedback, where $T$ is the total number of interactions between the sender and the receiver. Further, we propose a novel parametrization that allows the sender to achieve $\mathcal{O}(\sqrt{T})$ regret for a general convex utility function. We then consider the Bayesian Persuasion problem with an additional cost term in the objective function, which penalizes signaling policies that are more informative and obtain $\mathcal{O}(\log(T))$ regret. Finally, we establish a sublinear regret bound for the partial information feedback setting and provide simulations to support our theoretical results.  ( 3 min )
    MolBind: Multimodal Alignment of Language, Molecules, and Proteins
    arXiv:2403.08167v1 Announce Type: new Abstract: Recent advancements in biology and chemistry have leveraged multi-modal learning, integrating molecules and their natural language descriptions to enhance drug discovery. However, current pre-training frameworks are limited to two modalities, and designing a unified network to process different modalities (e.g., natural language, 2D molecular graphs, 3D molecular conformations, and 3D proteins) remains challenging due to inherent gaps among them. In this work, we propose MolBind, a framework that trains encoders for multiple modalities through contrastive learning, mapping all modalities to a shared feature space for multi-modal semantic alignment. To facilitate effective pre-training of MolBind on multiple modalities, we also build and collect a high-quality dataset with four modalities, MolBind-M4, including graph-language, conformation-language, graph-conformation, and conformation-protein paired data. MolBind shows superior zero-shot learning performance across a wide range of tasks, demonstrating its strong capability of capturing the underlying semantics of multiple modalities.  ( 2 min )
    FluoroSAM: A Language-aligned Foundation Model for X-ray Image Segmentation
    arXiv:2403.08059v1 Announce Type: cross Abstract: Automated X-ray image segmentation would accelerate research and development in diagnostic and interventional precision medicine. Prior efforts have contributed task-specific models capable of solving specific image analysis problems, but the utility of these models is restricted to their particular task domain, and expanding to broader use requires additional data, labels, and retraining efforts. Recently, foundation models (FMs) -- machine learning models trained on large amounts of highly variable data thus enabling broad applicability -- have emerged as promising tools for automated image analysis. Existing FMs for medical image analysis focus on scenarios and modalities where objects are clearly defined by visually apparent boundaries, such as surgical tool segmentation in endoscopy. X-ray imaging, by contrast, does not generally offer such clearly delineated boundaries or structure priors. During X-ray image formation, complex 3D structures are projected in transmission onto the imaging plane, resulting in overlapping features of varying opacity and shape. To pave the way toward an FM for comprehensive and automated analysis of arbitrary medical X-ray images, we develop FluoroSAM, a language-aligned variant of the Segment-Anything Model, trained from scratch on 1.6M synthetic X-ray images. FluoroSAM is trained on data including masks for 128 organ types and 464 non-anatomical objects, such as tools and implants. In real X-ray images of cadaveric specimens, FluoroSAM is able to segment bony anatomical structures based on text-only prompting with 0.51 and 0.79 DICE with point-based refinement, outperforming competing SAM variants for all structures. FluoroSAM is also capable of zero-shot generalization to segmenting classes beyond the training set thanks to its language alignment, which we demonstrate for full lung segmentation on real chest X-rays.  ( 3 min )
    EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech
    arXiv:2403.08164v1 Announce Type: cross Abstract: Recently, deep learning-based Text-to-Speech (TTS) systems have achieved high-quality speech synthesis results. Recurrent neural networks have become a standard modeling technique for sequential data in TTS systems and are widely used. However, training a TTS model which includes RNN components requires powerful GPU performance and takes a long time. In contrast, CNN-based sequence synthesis techniques can significantly reduce the parameters and training time of a TTS model while guaranteeing a certain performance due to their high parallelism, which alleviate these economic costs of training. In this paper, we propose a lightweight TTS system based on deep convolutional neural networks, which is a two-stage training end-to-end TTS model and does not employ any recurrent units. Our model consists of two stages: Text2Spectrum and SSRN. The former is used to encode phonemes into a coarse mel spectrogram and the latter is used to synthesize the complete spectrum from the coarse mel spectrogram. Meanwhile, we improve the robustness of our model by a series of data augmentations, such as noise suppression, time warping, frequency masking and time masking, for solving the low resource mongolian problem. Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech compared to using mainstream TTS models. Our method uses NCMMSC2022-MTTSC Challenge dataset for validation, which significantly reduces training time while maintaining a certain accuracy.  ( 3 min )
    Learning Closed-form Equations for Subgrid-scale Closures from High-fidelity Data: Promises and Challenges
    arXiv:2306.05014v2 Announce Type: replace-cross Abstract: There is growing interest in discovering interpretable, closed-form equations for subgrid-scale (SGS) closures/parameterizations of complex processes in Earth systems. Here, we apply a common equation-discovery technique with expansive libraries to learn closures from filtered direct numerical simulations of 2D turbulence and Rayleigh-B\'enard convection (RBC). Across common filters (e.g., Gaussian, box), we robustly discover closures of the same form for momentum and heat fluxes. These closures depend on nonlinear combinations of gradients of filtered variables, with constants that are independent of the fluid/flow properties and only depend on filter type/size. We show that these closures are the nonlinear gradient model (NGM), which is derivable analytically using Taylor-series. Indeed, we suggest that with common (physics-free) equation-discovery algorithms, for many common systems/physics, discovered closures are consistent with the leading term of the Taylor-series (except when cutoff filters are used). Like previous studies, we find that large-eddy simulations with NGM closures are unstable, despite significant similarities between the true and NGM-predicted fluxes (correlations $> 0.95$). We identify two shortcomings as reasons for these instabilities: in 2D, NGM produces zero kinetic energy transfer between resolved and subgrid scales, lacking both diffusion and backscattering. In RBC, potential energy backscattering is poorly predicted. Moreover, we show that SGS fluxes diagnosed from data, presumed the ''truth'' for discovery, depend on filtering procedures and are not unique. Accordingly, to learn accurate, stable closures in future work, we propose several ideas around using physics-informed libraries, loss functions, and metrics. These findings are relevant to closure modeling of any multi-scale system.  ( 3 min )
    Efficient Language Model Architectures for Differentially Private Federated Learning
    arXiv:2403.08100v1 Announce Type: new Abstract: Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data leaving the devices. SGD is the standard client optimizer for on device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural language models, adaptive optimizers are preferred as they offer improved stability and performance. In light of this, we ask if language models can be modified such that they can be efficiently trained with SGD client optimizers and answer this affirmatively. We propose a scale-invariant Coupled Input Forget Gate (SI CIFG) recurrent network by modifying the sigmoid and tanh activations in the recurrent cell and show that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL in large scale experiments. We further show that the proposed scale invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate the scale invariant modification is also compatible with other non-adaptive algorithms. Particularly, our results suggest an improved privacy utility trade-off in federated learning with differential privacy.  ( 3 min )
    Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
    arXiv:2312.03853v2 Announce Type: replace-cross Abstract: Only a year ago, we witnessed a rise in the use of Large Language Models (LLMs), especially when combined with applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Bard (and, to some extent, Bing chat) by making them impersonate complex personas with opposite characteristics as those of the truthful assistants they are supposed to be. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversation followed a role-play style to get the response the assistant was not allowed to provide. By making use of personas, we show that the response that is prohibited is actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Bard. We also introduce several ways of activating such adversarial personas, altogether showing that both chatbots are vulnerable to this kind of attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.  ( 3 min )
    Mechanics of Next Token Prediction with Self-Attention
    arXiv:2403.08081v1 Announce Type: new Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: $\textit{What}$ $\textit{does}$ $\textit{a}$ $\textit{single}$ $\textit{self-attention}$ $\textit{layer}$ $\textit{learn}$ $\textit{from}$ $\textit{next-token}$ $\textit{prediction?}$ We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1)}$ $\textbf{Hard}$ $\textbf{retrieval:}$ Given input sequence, self-attention precisely selects the $\textit{high-priority}$ $\textit{input}$ $\textit{tokens}$ associated with the last input token. $\textbf{(2)}$ $\textbf{Soft}$ $\textbf{composition:}$ It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.  ( 3 min )
    Robust Decision Aggregation with Adversarial Experts
    arXiv:2403.08222v1 Announce Type: new Abstract: We consider a binary decision aggregation problem in the presence of both truthful and adversarial experts. The truthful experts will report their private signals truthfully with proper incentive, while the adversarial experts can report arbitrarily. The decision maker needs to design a robust aggregator to forecast the true state of the world based on the reports of experts. The decision maker does not know the specific information structure, which is a joint distribution of signals, states, and strategies of adversarial experts. We want to find the optimal aggregator minimizing regret under the worst information structure. The regret is defined by the difference in expected loss between the aggregator and a benchmark who makes the optimal decision given the joint distribution and reports of truthful experts. We prove that when the truthful experts are symmetric and adversarial experts are not too numerous, the truncated mean is optimal, which means that we remove some lowest reports and highest reports and take averaging among the left reports. Moreover, for many settings, the optimal aggregators are in the family of piecewise linear functions. The regret is independent of the total number of experts but only depends on the ratio of adversaries. We evaluate our aggregators by numerical experiment in an ensemble learning task. We also obtain some negative results for the aggregation problem with adversarial experts under some more general information structures and experts' report space.  ( 3 min )
    A Decade's Battle on Dataset Bias: Are We There Yet?
    arXiv:2403.08632v1 Announce Type: cross Abstract: We revisit the "dataset classification" experiment suggested by Torralba and Efros a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be simply explained by memorization. We hope our discovery will inspire the community to rethink the issue involving dataset bias and model capabilities.  ( 2 min )
    Verifix: Post-Training Correction to Improve Label Noise Robustness with Verified Samples
    arXiv:2403.08618v1 Announce Type: new Abstract: Label corruption, where training samples have incorrect labels, can significantly degrade the performance of machine learning models. This corruption often arises from non-expert labeling or adversarial attacks. Acquiring large, perfectly labeled datasets is costly, and retraining large models from scratch when a clean dataset becomes available is computationally expensive. To address this challenge, we propose Post-Training Correction, a new paradigm that adjusts model parameters after initial training to mitigate label noise, eliminating the need for retraining. We introduce Verifix, a novel Singular Value Decomposition (SVD) based algorithm that leverages a small, verified dataset to correct the model weights using a single update. Verifix uses SVD to estimate a Clean Activation Space and then projects the model's weights onto this space to suppress activations corresponding to corrupted data. We demonstrate Verifix's effectiveness on both synthetic and real-world label noise. Experiments on the CIFAR dataset with 25% synthetic corruption show 7.36% generalization improvements on average. Additionally, we observe generalization improvements of up to 2.63% on naturally corrupted datasets like WebVision1.0 and Clothing1M.  ( 3 min )
    From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning
    arXiv:2403.08525v1 Announce Type: cross Abstract: In this work we propose an audio recording segmentation method based on an adaptive change point detection (A-CPD) for machine guided weak label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activation's of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation. The prediction model is initially pre-trained on available annotated sound event data with classes that are disjoint from the classes in the unlabeled dataset. The prediction model then gradually adapts to the annotations provided by the annotator in an active learning loop. The queries used to guide the weak label annotator towards strong labels are derived using change point detection on these probabilities. We show that it is possible to derive strong labels of high quality even with a limited annotation budget, and show favorable results for A-CPD when compared to two baseline query strategies.  ( 3 min )
    Multiple Access in the Era of Distributed Computing and Edge Intelligence
    arXiv:2403.07903v1 Announce Type: cross Abstract: This paper focuses on the latest research and innovations in fundamental next-generation multiple access (NGMA) techniques and the coexistence with other key technologies for the sixth generation (6G) of wireless networks. In more detail, we first examine multi-access edge computing (MEC), which is critical to meeting the growing demand for data processing and computational capacity at the edge of the network, as well as network slicing. We then explore over-the-air (OTA) computing, which is considered to be an approach that provides fast and efficient computation of various functions. We also explore semantic communications, identified as an effective way to improve communication systems by focusing on the exchange of meaningful information, thus minimizing unnecessary data and increasing efficiency. The interrelationship between machine learning (ML) and multiple access technologies is also reviewed, with an emphasis on federated learning, federated distillation, split learning, reinforcement learning, and the development of ML-based multiple access protocols. Finally, the concept of digital twinning and its role in network management is discussed, highlighting how virtual replication of physical networks can lead to improvements in network efficiency and reliability.  ( 3 min )
    Hair and scalp disease detection using deep learning
    arXiv:2403.07940v1 Announce Type: cross Abstract: In recent years, there has been a notable advancement in the integration of healthcare and technology, particularly evident in the field of medical image analysis. This paper introduces a pioneering approach in dermatology, presenting a robust method for the detection of hair and scalp diseases using state-of-the-art deep learning techniques. Our methodology relies on Convolutional Neural Networks (CNNs), well-known for their efficacy in image recognition, to meticulously analyze images for various dermatological conditions affecting the hair and scalp. Our proposed system represents a significant advancement in dermatological diagnostics, offering a non-invasive and highly efficient means of early detection and diagnosis. By leveraging the capabilities of CNNs, our model holds the potential to revolutionize dermatology, providing accessible and timely healthcare solutions. Furthermore, the seamless integration of our trained model into a web-based platform developed with the Django framework ensures broad accessibility and usability, democratizing advanced medical diagnostics. The integration of machine learning algorithms into web applications marks a pivotal moment in healthcare delivery, promising empowerment for both healthcare providers and patients. Through the synergy between technology and healthcare, our paper outlines the meticulous methodology, technical intricacies, and promising future prospects of our system. With a steadfast commitment to advancing healthcare frontiers, our goal is to significantly contribute to leveraging technology for improved healthcare outcomes globally. This endeavor underscores the profound impact of technological innovation in shaping the future of healthcare delivery and patient care, highlighting the transformative potential of our approach.  ( 3 min )
    Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition
    arXiv:2403.08258v1 Announce Type: cross Abstract: Conformer-based attention models have become the de facto backbone model for Automatic Speech Recognition tasks. A blank symbol is usually introduced to align the input and output sequences for CTC or RNN-T models. Unfortunately, the long input length overloads computational budget and memory consumption quadratically by attention mechanism. In this work, we propose a "Skip-and-Recover" Conformer architecture, named Skipformer, to squeeze sequence input length dynamically and inhomogeneously. Skipformer uses an intermediate CTC output as criteria to split frames into three groups: crucial, skipping and ignoring. The crucial group feeds into next conformer blocks and its output joint with skipping group by original temporal order as the final encoder output. Experiments show that our model reduces the input sequence length by 31 times on Aishell-1 and 22 times on Librispeech corpus. Meanwhile, the model can achieve better recognition accuracy and faster inference speed than recent baseline models. Our code is open-sourced and available online.  ( 2 min )
    Efficient Observation Time Window Segmentation for Administrative Data Machine Learning
    arXiv:2401.16537v2 Announce Type: replace Abstract: Machine learning models benefit when allowed to learn from temporal trends in time-stamped administrative data. These trends can be represented by dividing a model's observation window into time segments or bins. Model training time and performance can be improved by representing each feature with a different time resolution. However, this causes the time bin size hyperparameter search space to grow exponentially with the number of features. The contribution of this paper is to propose a computationally efficient time series analysis to investigate binning (TAIB) technique that determines which subset of data features benefit the most from time bin size hyperparameter tuning. This technique is demonstrated using hospital and housing/homelessness administrative data sets. The results show that TAIB leads to models that are not only more efficient to train but can perform better than models that default to representing all features with the same time bin size.  ( 3 min )
    In-Context Learning Learns Label Relationships but Is Not Conventional Learning
    arXiv:2307.12375v4 Announce Type: replace-cross Abstract: The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input--label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acquired from pre-training data and, further, that ICL does not consider all in-context information equally.  ( 3 min )
    Towards Faithful Explanations: Boosting Rationalization with Shortcuts Discovery
    arXiv:2403.07955v1 Announce Type: new Abstract: The remarkable success in neural networks provokes the selective rationalization. It explains the prediction results by identifying a small subset of the inputs sufficient to support them. Since existing methods still suffer from adopting the shortcuts in data to compose rationales and limited large-scale annotated rationales by human, in this paper, we propose a Shortcuts-fused Selective Rationalization (SSR) method, which boosts the rationalization by discovering and exploiting potential shortcuts. Specifically, SSR first designs a shortcuts discovery approach to detect several potential shortcuts. Then, by introducing the identified shortcuts, we propose two strategies to mitigate the problem of utilizing shortcuts to compose rationales. Finally, we develop two data augmentations methods to close the gap in the number of annotated rationales. Extensive experimental results on real-world datasets clearly validate the effectiveness of our proposed method.  ( 2 min )
    Authorship Verification based on the Likelihood Ratio of Grammar Models
    arXiv:2403.08462v1 Announce Type: cross Abstract: Authorship Verification (AV) is the process of analyzing a set of documents to determine whether they were written by a specific author. This problem often arises in forensic scenarios, e.g., in cases where the documents in question constitute evidence for a crime. Existing state-of-the-art AV methods use computational solutions that are not supported by a plausible scientific explanation for their functioning and that are often difficult for analysts to interpret. To address this, we propose a method relying on calculating a quantity we call $\lambda_G$ (LambdaG): the ratio between the likelihood of a document given a model of the Grammar for the candidate author and the likelihood of the same document given a model of the Grammar for a reference population. These Grammar Models are estimated using $n$-gram language models that are trained solely on grammatical features. Despite not needing large amounts of data for training, LambdaG still outperforms other established AV methods with higher computational complexity, including a fine-tuned Siamese Transformer network. Our empirical evaluation based on four baseline methods applied to twelve datasets shows that LambdaG leads to better results in terms of both accuracy and AUC in eleven cases and in all twelve cases if considering only topic-agnostic methods. The algorithm is also highly robust to important variations in the genre of the reference population in many cross-genre comparisons. In addition to these properties, we demonstrate how LambdaG is easier to interpret than the current state-of-the-art. We argue that the advantage of LambdaG over other methods is due to fact that it is compatible with Cognitive Linguistic theories of language processing.  ( 3 min )
    Efficient Post-Training Augmentation for Adaptive Inference in Heterogeneous and Distributed IoT Environments
    arXiv:2403.07957v1 Announce Type: new Abstract: Early Exit Neural Networks (EENNs) present a solution to enhance the efficiency of neural network deployments. However, creating EENNs is challenging and requires specialized domain knowledge, due to the large amount of additional design choices. To address this issue, we propose an automated augmentation flow that focuses on converting an existing model into an EENN. It performs all required design decisions for the deployment to heterogeneous or distributed hardware targets: Our framework constructs the EENN architecture, maps its subgraphs to the hardware targets, and configures its decision mechanism. To the best of our knowledge, it is the first framework that is able to perform all of these steps. We evaluated our approach on a collection of Internet-of-Things and standard image classification use cases. For a speech command detection task, our solution was able to reduce the mean operations per inference by 59.67%. For an ECG classification task, it was able to terminate all samples early, reducing the mean inference energy by 74.9% and computations by 78.3%. On CIFAR-10, our solution was able to achieve up to a 58.75% reduction in computations. The search on a ResNet-152 base model for CIFAR-10 took less than nine hours on a laptop CPU. Our proposed approach enables the creation of EENN optimized for IoT environments and can reduce the inference cost of Deep Learning applications on embedded and fog platforms, while also significantly reducing the search cost - making it more accessible for scientists and engineers in industry and research. The low search cost improves the accessibility of EENNs, with the potential to improve the efficiency of neural networks in a wide range of practical applications.  ( 3 min )
    Iterative Learning for Joint Image Denoising and Motion Artifact Correction of 3D Brain MRI
    arXiv:2403.08162v1 Announce Type: cross Abstract: Image noise and motion artifacts greatly affect the quality of brain MRI and negatively influence downstream medical image analysis. Previous studies often focus on 2D methods that process each volumetric MR image slice-by-slice, thus losing important 3D anatomical information. Additionally, these studies generally treat image denoising and artifact correction as two standalone tasks, without considering their potential relationship, especially on low-quality images where severe noise and motion artifacts occur simultaneously. To address these issues, we propose a Joint image Denoising and motion Artifact Correction (JDAC) framework via iterative learning to handle noisy MRIs with motion artifacts, consisting of an adaptive denoising model and an anti-artifact model. In the adaptive denoising model, we first design a novel noise level estimation strategy, and then adaptively reduce the noise through a U-Net backbone with feature normalization conditioning on the estimated noise variance. The anti-artifact model employs another U-Net for eliminating motion artifacts, incorporating a novel gradient-based loss function designed to maintain the integrity of brain anatomy during the motion correction process. These two models are iteratively employed for joint image denoising and artifact correction through an iterative learning framework. An early stopping strategy depending on noise level estimation is applied to accelerate the iteration process. The denoising model is trained with 9,544 T1-weighted MRIs with manually added Gaussian noise as supervision. The anti-artifact model is trained on 552 T1-weighted MRIs with motion artifacts and paired motion-free images. Experimental results on a public dataset and a clinical study suggest the effectiveness of JDAC in both tasks of denoising and motion artifact correction, compared with several state-of-the-art methods.  ( 3 min )
    The Fusion of Deep Reinforcement Learning and Edge Computing for Real-time Monitoring and Control Optimization in IoT Environments
    arXiv:2403.07923v1 Announce Type: cross Abstract: In response to the demand for real-time performance and control quality in industrial Internet of Things (IoT) environments, this paper proposes an optimization control system based on deep reinforcement learning and edge computing. The system leverages cloud-edge collaboration, deploys lightweight policy networks at the edge, predicts system states, and outputs controls at a high frequency, enabling monitoring and optimization of industrial objectives. Additionally, a dynamic resource allocation mechanism is designed to ensure rational scheduling of edge computing resources, achieving global optimization. Results demonstrate that this approach reduces cloud-edge communication latency, accelerates response to abnormal situations, reduces system failure rates, extends average equipment operating time, and saves costs for manual maintenance and replacement. This ensures real-time and stable control.  ( 2 min )
    Demystifying Embedding Spaces using Large Language Models
    arXiv:2310.04475v2 Announce Type: replace-cross Abstract: Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing Large Language Models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs.  ( 3 min )
    Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
    arXiv:2403.07927v1 Announce Type: cross Abstract: Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can lead to delay in incident detection and significant negative customer impact. Current process of monitor creation is ad-hoc and reactive in nature. Developers create monitors using their tribal knowledge and, primarily, a trial and error based process. As a result, monitors often have incomplete coverage which leads to production issues, or, redundancy which results in noise and wasted effort. In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor (resources) and which metrics to monitor. We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology. Using these insights, we propose a deep learning based framework that recommends monitors based on the service properties. Finally, we conduct a user study with engineers from Microsoft which demonstrates the usefulness of the proposed framework. The proposed framework along with the ontology driven projections, succeeded in creating production quality recommendations for majority of resource classes. This was also validated by the users from the study who rated the framework's usefulness as 4.27 out of 5.  ( 3 min )
    LLM-Assisted Light: Leveraging Large Language Model Capabilities for Human-Mimetic Traffic Signal Control in Complex Urban Environments
    arXiv:2403.08337v1 Announce Type: cross Abstract: Traffic congestion in metropolitan areas presents a formidable challenge with far-reaching economic, environmental, and societal ramifications. Therefore, effective congestion management is imperative, with traffic signal control (TSC) systems being pivotal in this endeavor. Conventional TSC systems, designed upon rule-based algorithms or reinforcement learning (RL), frequently exhibit deficiencies in managing the complexities and variabilities of urban traffic flows, constrained by their limited capacity for adaptation to unfamiliar scenarios. In response to these limitations, this work introduces an innovative approach that integrates Large Language Models (LLMs) into TSC, harnessing their advanced reasoning and decision-making faculties. Specifically, a hybrid framework that augments LLMs with a suite of perception and decision-making tools is proposed, facilitating the interrogation of both the static and dynamic traffic information. This design places the LLM at the center of the decision-making process, combining external traffic data with established TSC methods. Moreover, a simulation platform is developed to corroborate the efficacy of the proposed framework. The findings from our simulations attest to the system's adeptness in adjusting to a multiplicity of traffic environments without the need for additional training. Notably, in cases of Sensor Outage (SO), our approach surpasses conventional RL-based systems by reducing the average waiting time by $20.4\%$. This research signifies a notable advance in TSC strategies and paves the way for the integration of LLMs into real-world, dynamic scenarios, highlighting their potential to revolutionize traffic management. The related code is available at \href{https://github.com/Traffic-Alpha/LLM-Assisted-Light}{https://github.com/Traffic-Alpha/LLM-Assisted-Light}.  ( 3 min )
    The evaluation of a code-switched Sepedi-English automatic speech recognition system
    arXiv:2403.07947v1 Announce Type: cross Abstract: Speech technology is a field that encompasses various techniques and tools used to enable machines to interact with speech, such as automatic speech recognition (ASR), spoken dialog systems, and others, allowing a device to capture spoken words through a microphone from a human speaker. End-to-end approaches such as Connectionist Temporal Classification (CTC) and attention-based methods are the most used for the development of ASR systems. However, these techniques were commonly used for research and development for many high-resourced languages with large amounts of speech data for training and evaluation, leaving low-resource languages relatively underdeveloped. While the CTC method has been successfully used for other languages, its effectiveness for the Sepedi language remains uncertain. In this study, we present the evaluation of the Sepedi-English code-switched automatic speech recognition system. This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach. The performance of the system was evaluated using both the NCHLT Sepedi test corpus and the Sepedi Prompted Code Switching corpus. The model produced the lowest WER of 41.9%, however, the model faced challenges in recognizing the Sepedi only text.  ( 3 min )
    Research on the Application of Deep Learning-based BERT Model in Sentiment Analysis
    arXiv:2403.08217v1 Announce Type: cross Abstract: This paper explores the application of deep learning techniques, particularly focusing on BERT models, in sentiment analysis. It begins by introducing the fundamental concept of sentiment analysis and how deep learning methods are utilized in this domain. Subsequently, it delves into the architecture and characteristics of BERT models. Through detailed explanation, it elucidates the application effects and optimization strategies of BERT models in sentiment analysis, supported by experimental validation. The experimental findings indicate that BERT models exhibit robust performance in sentiment analysis tasks, with notable enhancements post fine-tuning. Lastly, the paper concludes by summarizing the potential applications of BERT models in sentiment analysis and suggests directions for future research and practical implementations.  ( 2 min )
    Towards Unified Modeling for Positive and Negative Preferences in Sign-Aware Recommendation
    arXiv:2403.08246v1 Announce Type: cross Abstract: Recently, sign-aware graph recommendation has drawn much attention as it will learn users' negative preferences besides positive ones from both positive and negative interactions (i.e., links in a graph) with items. To accommodate the different semantics of negative and positive links, existing works utilize two independent encoders to model users' positive and negative preferences, respectively. However, these approaches cannot learn the negative preferences from high-order heterogeneous interactions between users and items formed by multiple links with different signs, resulting in inaccurate and incomplete negative user preferences. To cope with these intractable issues, we propose a novel \textbf{L}ight \textbf{S}igned \textbf{G}raph Convolution Network specifically for \textbf{Rec}ommendation (\textbf{LSGRec}), which adopts a unified modeling approach to simultaneously model high-order users' positive and negative preferences on a signed user-item interaction graph. Specifically, for the negative preferences within high-order heterogeneous interactions, first-order negative preferences are captured by the negative links, while high-order negative preferences are propagated along positive edges. Then, recommendation results are generated based on positive preferences and optimized with negative ones. Finally, we train representations of users and items through different auxiliary tasks. Extensive experiments on three real-world datasets demonstrate that our method outperforms existing baselines regarding performance and computational efficiency. Our code is available at \url{https://anonymous.4open.science/r/LSGRec-BB95}.  ( 3 min )
    SmartPlay: A Benchmark for LLMs as Intelligent Agents
    arXiv:2310.01557v4 Announce Type: replace Abstract: Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there currently lacks a systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the set of capabilities each game test allows us to analyze each capability separately. SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/Microsoft/SmartPlay  ( 3 min )
    Unsupervised Acoustic Scene Mapping Based on Acoustic Features and Dimensionality Reduction
    arXiv:2301.00448v2 Announce Type: replace-cross Abstract: Classical methods for acoustic scene mapping require the estimation of time difference of arrival (TDOA) between microphones. Unfortunately, TDOA estimation is very sensitive to reverberation and additive noise. We introduce an unsupervised data-driven approach that exploits the natural structure of the data. Our method builds upon local conformal autoencoders (LOCA) - an offline deep learning scheme for learning standardized data coordinates from measurements. Our experimental setup includes a microphone array that measures the transmitted sound source at multiple locations across the acoustic enclosure. We demonstrate that LOCA learns a representation that is isometric to the spatial locations of the microphones. The performance of our method is evaluated using a series of realistic simulations and compared with other dimensionality-reduction schemes. We further assess the influence of reverberation on the results of LOCA and show that it demonstrates considerable robustness.  ( 3 min )
    DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design
    arXiv:2403.07902v1 Announce Type: cross Abstract: Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structured-based drug design methods treat all ligand atoms equally, which ignores different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to -8.39 Avg. Vina Dock score and 24.5 Success Rate. The code is provided at https://github.com/bytedance/DecompDiff  ( 3 min )
    Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning
    arXiv:2312.10385v3 Announce Type: replace Abstract: A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory based constraints on expected cost (or other cost measures) are employed to enforce safety and more importantly these constraints are enforced while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback with such approaches is an over or underestimation of the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory based cost constraint and instead imitates ``good'' trajectories and avoids ``bad'' trajectories generated from incrementally improving policies. We employ an oracle that utilizes a reward threshold (which is varied with learning) and the overall cost constraint to label trajectories as ``good'' or ``bad''. A key advantage of our approach is that we are able to work from any starting policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach is able to outperform top benchmark approaches for solving Constrained RL problems, with respect to expected cost, CVaR cost, or even unknown cost constraints.  ( 3 min )
    On the Identifiability of Quantized Factors
    arXiv:2306.16334v3 Announce Type: replace Abstract: Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.  ( 2 min )
    Revisiting Edge Perturbation for Graph Neural Network in Graph Data Augmentation and Attack
    arXiv:2403.07943v1 Announce Type: new Abstract: Edge perturbation is a basic method to modify graph structures. It can be categorized into two veins based on their effects on the performance of graph neural networks (GNNs), i.e., graph data augmentation and attack. Surprisingly, both veins of edge perturbation methods employ the same operations, yet yield opposite effects on GNNs' accuracy. A distinct boundary between these methods in using edge perturbation has never been clearly defined. Consequently, inappropriate perturbations may lead to undesirable outcomes, necessitating precise adjustments to achieve desired effects. Therefore, questions of ``why edge perturbation has a two-faced effect?'' and ``what makes edge perturbation flexible and effective?'' still remain unanswered. In this paper, we will answer these questions by proposing a unified formulation and establishing a clear boundary between two categories of edge perturbation methods. Specifically, we conduct experiments to elucidate the differences and similarities between these methods and theoretically unify the workflow of these methods by casting it to one optimization problem. Then, we devise Edge Priority Detector (EPD) to generate a novel priority metric, bridging these methods up in the workflow. Experiments show that EPD can make augmentation or attack flexibly and achieve comparable or superior performance to other counterparts with less time overhead.  ( 3 min )
    Enhancing Kubernetes Automated Scheduling with Deep Learning and Reinforcement Techniques for Large-Scale Cloud Computing Optimization
    arXiv:2403.07905v1 Announce Type: cross Abstract: With the continuous expansion of the scale of cloud computing applications, artificial intelligence technologies such as Deep Learning and Reinforcement Learning have gradually become the key tools to solve the automated task scheduling of large-scale cloud computing systems. Aiming at the complexity and real-time requirement of task scheduling in large-scale cloud computing system, this paper proposes an automatic task scheduling scheme based on deep learning and reinforcement learning. Firstly, the deep learning technology is used to monitor and predict the parameters in the cloud computing system in real time to obtain the system status information. Then, combined with reinforcement learning algorithm, the task scheduling strategy is dynamically adjusted according to the real-time system state and task characteristics to achieve the optimal utilization of system resources and the maximum of task execution efficiency. This paper verifies the effectiveness and performance advantages of the proposed scheme in experiments, and proves the potential and application prospect of deep learning and reinforcement learning in automatic task scheduling in large-scale cloud computing systems.  ( 3 min )
    Learnable Community-Aware Transformer for Brain Connectome Analysis with Token Clustering
    arXiv:2403.08203v1 Announce Type: cross Abstract: Neuroscientific research has revealed that the complex brain network can be organized into distinct functional communities, each characterized by a cohesive group of regions of interest (ROIs) with strong interconnections. These communities play a crucial role in comprehending the functional organization of the brain and its implications for neurological conditions, including Autism Spectrum Disorder (ASD) and biological differences, such as in gender. Traditional models have been constrained by the necessity of predefined community clusters, limiting their flexibility and adaptability in deciphering the brain's functional organization. Furthermore, these models were restricted by a fixed number of communities, hindering their ability to accurately represent the brain's dynamic nature. In this study, we present a token clustering brain transformer-based model ($\texttt{TC-BrainTF}$) for joint community clustering and classification. Our approach proposes a novel token clustering (TC) module based on the transformer architecture, which utilizes learnable prompt tokens with orthogonal loss where each ROI embedding is projected onto the prompt embedding space, effectively clustering ROIs into communities and reducing the dimensions of the node representation via merging with communities. Our results demonstrate that our learnable community-aware model $\texttt{TC-BrainTF}$ offers improved accuracy in identifying ASD and classifying genders through rigorous testing on ABIDE and HCP datasets. Additionally, the qualitative analysis on $\texttt{TC-BrainTF}$ has demonstrated the effectiveness of the designed TC module and its relevance to neuroscience interpretations.  ( 3 min )
    Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime
    arXiv:2403.08160v1 Announce Type: cross Abstract: Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performance of these models. How does model complexity and generalization depend on the number of parameters $p$? How should we choose $p$ relative to the sample size $n$ to achieve optimal test error? In this paper, we investigate the example of random feature ridge regression (RFRR). This model can be seen either as a finite-rank approximation to kernel ridge regression (KRR), or as a simplified model for neural networks trained in the so-called lazy regime. We consider covariates uniformly distributed on the $d$-dimensional sphere and compute sharp asymptotics for the RFRR test error in the high-dimensional polynomial scaling, where $p,n,d \to \infty$ while $p/ d^{\kappa_1}$ and $n / d^{\kappa_2}$ stay constant, for all $\kappa_1 , \kappa_2 \in \mathbb{R}_{>0}$. These asymptotics precisely characterize the impact of the number of random features and regularization parameter on the test performance. In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power. For $n = o(p)$, the sample size $n$ is the bottleneck and RFRR achieves the same performance as KRR (which is equivalent to taking $p = \infty$). On the other hand, if $p = o(n)$, the number of random features $p$ is the limiting factor and RFRR test error matches the approximation error of the random feature model class (akin to taking $n = \infty$). Finally, a double descent appears at $n= p$, a phenomenon that was previously only characterized in the linear scaling $\kappa_1 = \kappa_2 = 1$.  ( 3 min )
    Learning to Generate Training Datasets for Robust Semantic Segmentation
    arXiv:2308.02535v4 Announce Type: replace-cross Abstract: Semantic segmentation methods have advanced significantly. Still, their robustness to real-world perturbations and object types not seen during training remains a challenge, particularly in safety-critical applications. We propose a novel approach to improve the robustness of semantic segmentation techniques by leveraging the synergy between label-to-image generators and image-to-label segmentation models. Specifically, we design Robusta, a novel robust conditional generative adversarial network to generate realistic and plausible perturbed images that can be used to train reliable segmentation models. We conduct in-depth studies of the proposed generative model, assess the performance and robustness of the downstream segmentation network, and demonstrate that our approach can significantly enhance the robustness in the face of real-world perturbations, distribution shifts, and out-of-distribution samples. Our results suggest that this approach could be valuable in safety-critical applications, where the reliability of perception modules such as semantic segmentation is of utmost importance and comes with a limited computational budget in inference. We release our code at https://github.com/ENSTA-U2IS-AI/robusta.  ( 3 min )
    Machine Learning Optimized Orthogonal Basis Piecewise Polynomial Approximation
    arXiv:2403.08579v1 Announce Type: new Abstract: Piecewise Polynomials (PPs) are utilized in several engineering disciplines, like trajectory planning, to approximate position profiles given in the form of a set of points. While the approximation target along with domain-specific requirements, like Ck -continuity, can be formulated as a system of equations and a result can be computed directly, such closed-form solutions posses limited flexibility with respect to polynomial degrees, polynomial bases or adding further domain-specific requirements. Sufficiently complex optimization goals soon call for the use of numerical methods, like gradient descent. Since gradient descent lies at the heart of training Artificial Neural Networks (ANNs), modern Machine Learning (ML) frameworks like TensorFlow come with a set of gradient-based optimizers potentially suitable for a wide range of optimization problems beyond the training task for ANNs. Our approach is to utilize the versatility of PP models and combine it with the potential of modern ML optimizers for the use in function approximation in 1D trajectory planning in the context of electronic cam design. We utilize available optimizers of the ML framework TensorFlow directly, outside of the scope of ANNs, to optimize model parameters of our PP model. In this paper, we show how an orthogonal polynomial basis contributes to improving approximation and continuity optimization performance. Utilizing Chebyshev polynomials of the first kind, we develop a novel regularization approach enabling clearly improved convergence behavior. We show that, using this regularization approach, Chebyshev basis performs better than power basis for all relevant optimizers in the combined approximation and continuity optimization setting and demonstrate usability of the presented approach within the electronic cam domain.  ( 3 min )
    Value Prediction for Spatiotemporal Gait Data Using Deep Learning
    arXiv:2403.07926v1 Announce Type: cross Abstract: Human gait has been commonly used for the diagnosis and evaluation of medical conditions and for monitoring the progress during treatment and rehabilitation. The use of wearable sensors that capture pressure or motion has yielded techniques that analyze the gait data to aid recovery, identify activity performed, or identify individuals. Deep learning, usually employing classification, has been successfully utilized in a variety of applications such as computer vision, biomedical imaging analysis, and natural language processing. We expand the application of deep learning to value prediction of time-series of spatiotemporal gait data. Moreover, we explore several deep learning architectures (Recurrent Neural Networks (RNN) and RNN combined with Convolutional Neural Networks (CNN)) to make short- and long-distance predictions using two different experimental setups. Our results show that short-distance prediction has an RMSE as low as 0.060675, and long-distance prediction RMSE as low as 0.106365. Additionally, the results show that the proposed deep learning models are capable of predicting the entire trial when trained and validated using the trials from the same participant. The proposed, customized models, used with value prediction open possibilities for additional applications, such as fall prediction, in-home progress monitoring, aiding of exoskeleton movement, and authentication.  ( 3 min )
    Randomized Kaczmarz in Adversarial Distributed Setting
    arXiv:2302.14615v2 Announce Type: replace-cross Abstract: Developing large-scale distributed methods that are robust to the presence of adversarial or corrupted workers is an important part of making such methods practical for real-world problems. In this paper, we propose an iterative approach that is adversary-tolerant for convex optimization problems. By leveraging simple statistics, our method ensures convergence and is capable of adapting to adversarial distributions. Additionally, the efficiency of the proposed methods for solving convex problems is shown in simulations with the presence of adversaries. Through simulations, we demonstrate the efficiency of our approach in the presence of adversaries and its ability to identify adversarial workers with high accuracy and tolerate varying levels of adversary rates.  ( 2 min )
    Search-Adaptor: Embedding Customization for Information Retrieval
    arXiv:2310.08750v2 Announce Type: replace Abstract: Embeddings extracted by pre-trained Large Language Models (LLMs) have significant potential to improve information retrieval and search. Beyond the zero-shot setup in which they are being conventionally used, being able to take advantage of the information from the relevant query-corpus paired data can further boost the LLM capabilities. In this paper, we propose a novel method, Search-Adaptor, for customizing LLMs for information retrieval in an efficient and robust way. Search-Adaptor modifies the embeddings generated by pre-trained LLMs, and can be integrated with any LLM, including those only available via prediction APIs. On multiple English, multilingual, and multimodal retrieval datasets, we show consistent and significant performance benefits for Search-Adaptor -- e.g., more than 5% improvements for Google Embedding APIs in nDCG@10 averaged over 14 BEIR datasets.  ( 2 min )
    Flexible K Nearest Neighbors Classifier: Derivation and Application for Ion-mobility Spectrometry-based Indoor Localization
    arXiv:2304.10151v3 Announce Type: replace Abstract: The K Nearest Neighbors (KNN) classifier is widely used in many fields such as fingerprint-based localization or medicine. It determines the class membership of unlabelled sample based on the class memberships of the K labelled samples, the so-called nearest neighbors, that are closest to the unlabelled sample. The choice of K has been the topic of various studies and proposed KNN-variants. Yet no variant has been proven to outperform all other variants. In this paper a KNN-variant is discussed which ensures that the K nearest neighbors are indeed close to the unlabelled sample and finds K along the way. The algorithm is tested and compared to the standard KNN in theoretical scenarios and for indoor localization based on ion-mobility spectrometry fingerprints. It achieves a higher classification accuracy than the KNN in the tests, while having the same computational demand.  ( 3 min )
    Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective
    arXiv:2309.16456v2 Announce Type: replace Abstract: Existing approaches defend against backdoor attacks in federated learning (FL) mainly through a) mitigating the impact of infected models, or b) excluding infected models. The former negatively impacts model accuracy, while the latter usually relies on globally clear boundaries between benign and infected model updates. However, model updates are easy to be mixed and scattered throughout in reality due to the diverse distributions of local data. This work focuses on excluding infected models in FL. Unlike previous perspectives from a global view, we propose Snowball, a novel anti-backdoor FL framework through bidirectional elections from an individual perspective inspired by one principle deduced by us and two principles in FL and deep learning. It is characterized by a) bottom-up election, where each candidate model update votes to several peer ones such that a few model updates are elected as selectees for aggregation; and b) top-down election, where selectees progressively enlarge themselves through picking up from the candidates. We compare Snowball with state-of-the-art defenses to backdoor attacks in FL on five real-world datasets, demonstrating its superior resistance to backdoor attacks and slight impact on the accuracy of the global model.  ( 3 min )
    Multifidelity linear regression for scientific machine learning from scarce data
    arXiv:2403.08627v1 Announce Type: cross Abstract: Machine learning (ML) methods, which fit to data the parameters of a given parameterized model class, have garnered significant interest as potential methods for learning surrogate models for complex engineering systems for which traditional simulation is expensive. However, in many scientific and engineering settings, generating high-fidelity data on which to train ML models is expensive, and the available budget for generating training data is limited. ML models trained on the resulting scarce high-fidelity data have high variance and are sensitive to vagaries of the training data set. We propose a new multifidelity training approach for scientific machine learning that exploits the scientific context where data of varying fidelities and costs are available; for example high-fidelity data may be generated by an expensive fully resolved physics simulation whereas lower-fidelity data may arise from a cheaper model based on simplifying assumptions. We use the multifidelity data to define new multifidelity Monte Carlo estimators for the unknown parameters of linear regression models, and provide theoretical analyses that guarantee the approach's accuracy and improved robustness to small training budgets. Numerical results verify the theoretical analysis and demonstrate that multifidelity learned models trained on scarce high-fidelity data and additional low-fidelity data achieve order-of-magnitude lower model variance than standard models trained on only high-fidelity data of comparable cost. This illustrates that in the scarce data regime, our multifidelity training strategy yields models with lower expected error than standard training approaches.  ( 3 min )
    Bayesian Optimization that Limits Search Region to Lower Dimensions Utilizing Local GPR
    arXiv:2403.08331v1 Announce Type: new Abstract: Optimization of product and system characteristics is required in many fields, including design and control. Bayesian optimization (BO) is often used when there are high observing costs, because BO theoretically guarantees an upper bound on regret. However, computational costs increase exponentially with the number of parameters to be optimized, decreasing search efficiency. We propose a BO that limits the search region to lower dimensions and utilizes local Gaussian process regression (LGPR) to scale the BO to higher dimensions. LGPR treats the low-dimensional search region as "local," improving prediction accuracies there. The LGPR model is trained on a local subset of data specific to that region. This improves prediction accuracy and search efficiency and reduces the time complexity of matrix inversion in the Gaussian process regression. In evaluations with 20D Ackley and Rosenbrock functions, search efficiencies are equal to or higher than those of the compared methods, improved by about 69% and 40% from the case without LGPR. We apply our method to an automatic design task for a power semiconductor device. We successfully reduce the specific on-resistance to 25% better than a conventional method and 3.4% better than without LGPR.  ( 3 min )
    Beyond Gradient and Priors in Privacy Attacks: Leveraging Pooler Layer Inputs of Language Models in Federated Learning
    arXiv:2312.05720v3 Announce Type: replace Abstract: Language models trained via federated learning (FL) demonstrate impressive capabilities in handling complex tasks while protecting user privacy. Recent studies indicate that leveraging gradient information and prior knowledge can potentially reveal training samples within FL setting. However, these investigations have overlooked the potential privacy risks tied to the intrinsic architecture of the models. This paper presents a two-stage privacy attack strategy that targets the vulnerabilities in the architecture of contemporary language models, significantly enhancing attack performance by initially recovering certain feature directions as additional supervisory signals. Our comparative experiments demonstrate superior attack performance across various datasets and scenarios, highlighting the privacy leakage risk associated with the increasingly complex architectures of language models. We call for the community to recognize and address these potential privacy risks in designing large language models.  ( 3 min )
    Linear attention is (maybe) all you need (to understand transformer optimization)
    arXiv:2310.01082v2 Announce Type: replace Abstract: Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J.~von Oswald et al.~(ICML 2023), and K.~Ahn et al.~(NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.  ( 2 min )
    CoLiDE: Concomitant Linear DAG Estimation
    arXiv:2310.02895v2 Announce Type: replace Abstract: We deal with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the $\textit{unknown}$ SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, we propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from the exogenous noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE ($\textbf{Co}$ncomitant $\textbf{Li}$near $\textbf{D}$AG $\textbf{E}$stimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of noise variances in heteroscedastic scenarios. Our algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. We also find CoLiDE exhibits enhanced stability manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of our novel linear DAG estimator.  ( 3 min )
    Representing Molecules as Random Walks Over Interpretable Grammars
    arXiv:2403.08147v1 Announce Type: new Abstract: Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples that are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space featuring motifs to be the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability.  ( 2 min )
    On the Societal Impact of Open Foundation Models
    arXiv:2403.07918v1 Announce Type: cross Abstract: Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to both their benefits and risks. Open foundation models present significant benefits, with some caveats, that span innovation, competition, the distribution of decision-making power, and transparency. To understand their risks of misuse, we design a risk assessment framework for analyzing their marginal risk. Across several misuse vectors (e.g. cyberattacks, bioweapons), we find that current research is insufficient to effectively characterize the marginal risk of open foundation models relative to pre-existing technologies. The framework helps explain why the marginal risk is low in some cases, clarifies disagreements about misuse risks by revealing that past work has focused on different subsets of the framework with different assumptions, and articulates a way forward for more constructive debate. Overall, our work helps support a more grounded assessment of the societal impact of open foundation models by outlining what research is needed to empirically validate their theoretical benefits and risks.  ( 3 min )
    Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration
    arXiv:2306.04621v2 Announce Type: replace Abstract: Long-tailed semi-supervised learning (LTSSL) represents a practical scenario for semi-supervised applications, challenged by skewed labeled distributions that bias classifiers. This problem is often aggravated by discrepancies between labeled and unlabeled class distributions, leading to biased pseudo-labels, neglect of rare classes, and poorly calibrated probabilities. To address these issues, we introduce Flexible Distribution Alignment (FlexDA), a novel adaptive logit-adjusted loss framework designed to dynamically estimate and align predictions with the actual distribution of unlabeled data and achieve a balanced classifier by the end of training. FlexDA is further enhanced by a distillation-based consistency loss, promoting fair data usage across classes and effectively leveraging underconfident samples. This method, encapsulated in ADELLO (Align and Distill Everything All at Once), proves robust against label shift, significantly improves model calibration in LTSSL contexts, and surpasses previous state-of-of-art approaches across multiple benchmarks, including CIFAR100-LT, STL10-LT, and ImageNet127, addressing class imbalance challenges in semi-supervised learning. Our code will be made available upon paper acceptance.  ( 3 min )
    A Physics-driven GraphSAGE Method for Physical Process Simulations Described by Partial Differential Equations
    arXiv:2403.08569v1 Announce Type: new Abstract: Physics-informed neural networks (PINNs) have successfully addressed various computational physics problems based on partial differential equations (PDEs). However, while tackling issues related to irregularities like singularities and oscillations, trained solutions usually suffer low accuracy. In addition, most current works only offer the trained solution for predetermined input parameters. If any change occurs in input parameters, transfer learning or retraining is required, and traditional numerical techniques also need an independent simulation. In this work, a physics-driven GraphSAGE approach (PD-GraphSAGE) based on the Galerkin method and piecewise polynomial nodal basis functions is presented to solve computational problems governed by irregular PDEs and to develop parametric PDE surrogate models. This approach employs graph representations of physical domains, thereby reducing the demands for evaluated points due to local refinement. A distance-related edge feature and a feature mapping strategy are devised to help training and convergence for singularity and oscillation situations, respectively. The merits of the proposed method are demonstrated through a couple of cases. Moreover, the robust PDE surrogate model for heat conduction problems parameterized by the Gaussian random field source is successfully established, which not only provides the solution accurately but is several times faster than the finite element method in our experiments.  ( 3 min )
    MASA-TCN: Multi-anchor Space-aware Temporal Convolutional Neural Networks for Continuous and Discrete EEG Emotion Recognition
    arXiv:2308.16207v2 Announce Type: replace Abstract: Emotion recognition using electroencephalogram (EEG) mainly has two scenarios: classification of the discrete labels and regression of the continuously tagged labels. Although many algorithms were proposed for classification tasks, there are only a few methods for regression tasks. For emotion regression, the label is continuous in time. A natural method is to learn the temporal dynamic patterns. In previous studies, long short-term memory (LSTM) and temporal convolutional neural networks (TCN) were utilized to learn the temporal contextual information from feature vectors of EEG. However, the spatial patterns of EEG were not effectively extracted. To enable the spatial learning ability of TCN towards better regression and classification performances, we propose a novel unified model, named MASA-TCN, for EEG emotion regression and classification tasks. The space-aware temporal layer enables TCN to additionally learn from spatial relations among EEG electrodes. Besides, a novel multi-anchor block with attentive fusion is proposed to learn dynamic temporal dependencies. Experiments on two publicly available datasets show MASA-TCN achieves higher results than the state-of-the-art methods for both EEG emotion regression and classification tasks. The code is available at https://github.com/yi-ding-cs/MASA-TCN.  ( 3 min )
    Scattered Mixture-of-Experts Implementation
    arXiv:2403.08245v1 Announce Type: new Abstract: We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations, and overcoming some of the limitations to improve inference and training speed, and memory footprint. This implementation achieves this by avoiding padding and making excessive copies of the input. We introduce ParallelLinear, the main component we use to build our implementation and the various kernels used to speed up the operation. We benchmark our implementation against Megablocks, and show that it enables a higher throughput and lower memory footprint. We also show how ParallelLinear enables extension of the Mixture-of-Experts concept by demonstrating with an implementation of Mixture of Attention.  ( 2 min )
    Disparate Effect Of Missing Mediators On Transportability of Causal Effects
    arXiv:2403.08638v1 Announce Type: new Abstract: Transported mediation effects provide an avenue to understand how upstream interventions (such as improved neighborhood conditions like green spaces) would work differently when applied to different populations as a result of factors that mediate the effects. However, when mediators are missing in the population where the effect is to be transported, these estimates could be biased. We study this issue of missing mediators, motivated by challenges in public health, wherein mediators can be missing, not at random. We propose a sensitivity analysis framework that quantifies the impact of missing mediator data on transported mediation effects. This framework enables us to identify the settings under which the conditional transported mediation effect is rendered insignificant for the subgroup with missing mediator data. Specifically, we provide the bounds on the transported mediation effect as a function of missingness. We then apply the framework to longitudinal data from the Moving to Opportunity Study, a large-scale housing voucher experiment, to quantify the effect of missing mediators on transport effect estimates of voucher receipt, an upstream intervention on living location, in childhood on subsequent risk of mental health or substance use disorder mediated through parental health across sites. Our findings provide a tangible understanding of how much missing data can be withstood for unbiased effect estimates.  ( 3 min )
    Deep learning modelling of manufacturing and build variations on multi-stage axial compressors aerodynamics
    arXiv:2310.04264v3 Announce Type: replace Abstract: Application of deep learning methods to physical simulations such as CFD (Computational Fluid Dynamics) for turbomachinery applications, have been so far of limited industrial relevance. This paper demonstrates the development and application of a deep learning framework for real-time predictions of the impact of manufacturing and build variations, such as tip clearance and surface roughness, on the flow field and aerodynamic performance of multi-stage axial compressors in gas turbines. The associated scatter in compressor efficiency is known to have a significant impact on the corresponding overall performance and emissions of the gas turbine, therefore posing a challenge of great industrial and environmental relevance. The proposed architecture is proven to achieve an accuracy comparable to that of the CFD benchmark, in real-time, for an industrially relevant application. The deployed model, is readily integrated within the manufacturing and build process of gas turbines, thus providing the opportunity to analytically assess the impact on performance and potentially reduce requirements for expensive physical tests.  ( 3 min )
    Structural perspective on constraint-based learning of Markov networks
    arXiv:2403.08562v1 Announce Type: new Abstract: Markov networks are probabilistic graphical models that employ undirected graphs to depict conditional independence relationships among variables. Our focus lies in constraint-based structure learning, which entails learning the undirected graph from data through the execution of conditional independence tests. We establish theoretical limits concerning two critical aspects of constraint-based learning of Markov networks: the number of tests and the sizes of the conditioning sets. These bounds uncover an exciting interplay between the structural properties of the graph and the amount of tests required to learn a Markov network. The starting point of our work is that the graph parameter maximum pairwise connectivity, $\kappa$, that is, the maximum number of vertex-disjoint paths connecting a pair of vertices in the graph, is responsible for the sizes of independence tests required to learn the graph. On one hand, we show that at least one test with the size of the conditioning set at least $\kappa$ is always necessary. On the other hand, we prove that any graph can be learned by performing tests of size at most $\kappa$. This completely resolves the question of the minimum size of conditioning sets required to learn the graph. When it comes to the number of tests, our upper bound on the sizes of conditioning sets implies that every $n$-vertex graph can be learned by at most $n^{\kappa}$ tests with conditioning sets of sizes at most $\kappa$. We show that for any upper bound $q$ on the sizes of the conditioning sets, there exist graphs with $O(n q)$ vertices that require at least $n^{\Omega(\kappa)}$ tests to learn. This lower bound holds even when the treewidth and the maximum degree of the graph are at most $\kappa+2$. On the positive side, we prove that every graph of bounded treewidth can be learned by a polynomial number of tests with conditioning sets of sizes at most $2\kappa$.  ( 3 min )
    Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methods
    arXiv:2403.08352v1 Announce Type: new Abstract: Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial and error procedures for creating and testing different candidate augmentations and their hyperparameters manually. Automated data augmentation methods aim to automate the process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carried out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.  ( 3 min )
    xMLP: Revolutionizing Private Inference with Exclusive Square Activation
    arXiv:2403.08024v1 Announce Type: new Abstract: Private Inference (PI) enables deep neural networks (DNNs) to work on private data without leaking sensitive information by exploiting cryptographic primitives such as multi-party computation (MPC) and homomorphic encryption (HE). However, the use of non-linear activations such as ReLU in DNNs can lead to impractically high PI latency in existing PI systems, as ReLU requires the use of costly MPC computations, such as Garbled Circuits. Since square activations can be processed by Beaver's triples hundreds of times faster compared to ReLU, they are more friendly to PI tasks, but using them leads to a notable drop in model accuracy. This paper starts by exploring the reason for such an accuracy drop after using square activations, and concludes that this is due to an "information compounding" effect. Leveraging this insight, we propose xMLP, a novel DNN architecture that uses square activations exclusively while maintaining parity in both accuracy and efficiency with ReLU-based DNNs. Our experiments on CIFAR-100 and ImageNet show that xMLP models consistently achieve better performance than ResNet models with fewer activation layers and parameters while maintaining consistent performance with its ReLU-based variants. Remarkably, when compared to state-of-the-art PI Models, xMLP demonstrates superior performance, achieving a 0.58% increase in accuracy with 7x faster PI speed. Moreover, it delivers a significant accuracy improvement of 4.96% while maintaining the same PI latency. When offloading PI to the GPU, xMLP is up to 700x faster than the previous state-of-the-art PI model with comparable accuracy.  ( 3 min )
    Digital Video Manipulation Detection Technique Based on Compression Algorithms
    arXiv:2403.07891v1 Announce Type: cross Abstract: Digital images and videos play a very important role in everyday life. Nowadays, people have access the affordable mobile devices equipped with advanced integrated cameras and powerful image processing applications. Technological development facilitates not only the generation of multimedia content, but also the intentional modification of it, either with recreational or malicious purposes. This is where forensic techniques to detect manipulation of images and videos become essential. This paper proposes a forensic technique by analysing compression algorithms used by the H.264 coding. The presence of recompression uses information of macroblocks, a characteristic of the H.264-MPEG4 standard, and motion vectors. A Vector Support Machine is used to create the model that allows to accurately detect if a video has been recompressed.  ( 2 min )
    Caformer: Rethinking Time Series Analysis from Causal Perspective
    arXiv:2403.08572v1 Announce Type: new Abstract: Time series analysis is a vital task with broad applications in various domains. However, effectively capturing cross-dimension and cross-time dependencies in non-stationary time series poses significant challenges, particularly in the context of environmental factors. The spurious correlation induced by the environment confounds the causal relationships between cross-dimension and cross-time dependencies. In this paper, we introduce a novel framework called Caformer (\underline{\textbf{Ca}}usal Trans\underline{\textbf{former}}) for time series analysis from a causal perspective. Specifically, our framework comprises three components: Dynamic Learner, Environment Learner, and Dependency Learner. The Dynamic Learner unveils dynamic interactions among dimensions, the Environment Learner mitigates spurious correlations caused by environment with a back-door adjustment, and the Dependency Learner aims to infer robust interactions across both time and dimensions. Our Caformer demonstrates consistent state-of-the-art performance across five mainstream time series analysis tasks, including long- and short-term forecasting, imputation, classification, and anomaly detection, with proper interpretability.  ( 2 min )
    When can we Approximate Wide Contrastive Models with Neural Tangent Kernels and Principal Component Analysis?
    arXiv:2403.08673v1 Announce Type: new Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data that has been highly successful for image and text data. Several recent works have examined contrastive losses to claim that contrastive models effectively learn spectral embeddings, while few works show relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known if trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of two-layer contrastive models, with non-linear activation, and answer when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first convergence results of NTK for contrastive losses, and present a nuanced picture: NTK of wide networks remains almost constant for cosine similarity based contrastive losses, but not for losses based on dot product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that representations learned by contrastive models are close to the principal components of a certain matrix computed from random features. We empirically show that our theoretical results possibly hold beyond two-layer networks.  ( 3 min )
    TILDE-Q: A Transformation Invariant Loss Function for Time-Series Forecasting
    arXiv:2210.15050v2 Announce Type: replace Abstract: Time-series forecasting has gained increasing attention in the field of artificial intelligence due to its potential to address real-world problems across various domains, including energy, weather, traffic, and economy. While time-series forecasting is a well-researched field, predicting complex temporal patterns such as sudden changes in sequential data still poses a challenge with current models. This difficulty stems from minimizing Lp norm distances as loss functions, such as mean absolute error (MAE) or mean square error (MSE), which are susceptible to both intricate temporal dynamics modeling and signal shape capturing. Furthermore, these functions often cause models to behave aberrantly and generate uncorrelated results with the original time-series. Consequently, developing a shape-aware loss function that goes beyond mere point-wise comparison is essential. In this paper, we examine the definition of shape and distortions, which are crucial for shape-awareness in time-series forecasting, and provide a design rationale for the shape-aware loss function. Based on our design rationale, we propose a novel, compact loss function called TILDEQ (Transformation Invariant Loss function with Distance EQuilibrium) that considers not only amplitude and phase distortions but also allows models to capture the shape of time-series sequences. Furthermore, TILDE-Q supports the simultaneous modeling of periodic and nonperiodic temporal dynamics. We evaluate the efficacy of TILDE-Q by conducting extensive experiments under both periodic and nonperiodic conditions with various models ranging from naive to state-of-the-art. The experimental results show that the models trained with TILDE-Q surpass those trained with other metrics, such as MSE and DILATE, in various real-world applications, including electricity, traffic, illness, economics, weather, and electricity transformer temperature (ETT).  ( 3 min )
    SoK: Reducing the Vulnerability of Fine-tuned Language Models to Membership Inference Attacks
    arXiv:2403.08481v1 Announce Type: new Abstract: Natural language processing models have experienced a significant upsurge in recent years, with numerous applications being built upon them. Many of these applications require fine-tuning generic base models on customized, proprietary datasets. This fine-tuning data is especially likely to contain personal or sensitive information about individuals, resulting in increased privacy risk. Membership inference attacks are the most commonly employed attack to assess the privacy leakage of a machine learning model. However, limited research is available on the factors that affect the vulnerability of language models to this kind of attack, or on the applicability of different defense strategies in the language domain. We provide the first systematic review of the vulnerability of fine-tuned large language models to membership inference attacks, the various factors that come into play, and the effectiveness of different defense strategies. We find that some training methods provide significantly reduced privacy risk, with the combination of differential privacy and low-rank adaptors achieving the best privacy protection against these attacks.  ( 3 min )
    Bolstering Stochastic Gradient Descent with Model Building
    arXiv:2111.07058v4 Announce Type: replace Abstract: Stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when these algorithms are fine-tuned for the application at hand. Although this tuning process can require large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the step length. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the step length but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide convergence rate analysis, and experimentally show that the proposed algorithm achieves faster convergence and better generalization in well-known test problems. More precisely, SMB requires less tuning, and shows comparable performance to other adaptive methods.  ( 3 min )
    CHAI: Clustered Head Attention for Efficient LLM Inference
    arXiv:2403.08058v1 Announce Type: new Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and compute requirement. We observe that there is a high amount of redundancy across heads on which tokens they pay attention to. Based on this insight, we propose Clustered Head Attention (CHAI). CHAI combines heads with a high amount of correlation for self-attention at runtime, thus reducing both memory and compute. In our experiments, we show that CHAI is able to reduce the memory requirements for storing K,V cache by up to 21.4% and inference time latency by up to 1.73x without any fine-tuning required. CHAI achieves this with a maximum 3.2% deviation in accuracy across 3 different models (i.e. OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets.  ( 3 min )
    On the Convergence of Locally Adaptive and Scalable Diffusion-Based Sampling Methods for Deep Bayesian Neural Network Posteriors
    arXiv:2403.08609v1 Announce Type: new Abstract: Achieving robust uncertainty quantification for deep neural networks represents an important requirement in many real-world applications of deep learning such as medical imaging where it is necessary to assess the reliability of a neural network's prediction. Bayesian neural networks are a promising approach for modeling uncertainties in deep neural networks. Unfortunately, generating samples from the posterior distribution of neural networks is a major challenge. One significant advance in that direction would be the incorporation of adaptive step sizes, similar to modern neural network optimizers, into Monte Carlo Markov chain sampling algorithms without significantly increasing computational demand. Over the past years, several papers have introduced sampling algorithms with claims that they achieve this property. However, do they indeed converge to the correct distribution? In this paper, we demonstrate that these methods can have a substantial bias in the distribution they sample, even in the limit of vanishing step sizes and at full batch size.  ( 3 min )
    Reverse Diffusion Monte Carlo
    arXiv:2307.02037v3 Announce Type: replace-cross Abstract: We propose a Monte Carlo sampler from the reverse diffusion process. Unlike the practice of diffusion models, where the intermediary updates -- the score functions -- are learned with a neural network, we transform the score matching problem into a mean estimation one. By estimating the means of the regularized posterior distributions, we derive a novel Monte Carlo sampling algorithm called reverse diffusion Monte Carlo (rdMC), which is distinct from the Markov chain Monte Carlo (MCMC) methods. We determine the sample size from the error tolerance and the properties of the posterior distribution to yield an algorithm that can approximately sample the target distribution with any desired accuracy. Additionally, we demonstrate and prove under suitable conditions that sampling with rdMC can be significantly faster than that with MCMC. For multi-modal target distributions such as those in Gaussian mixture models, rdMC greatly improves over the Langevin-style MCMC sampling methods both theoretically and in practice. The proposed rdMC method offers a new perspective and solution beyond classical MCMC algorithms for the challenging complex distributions.  ( 3 min )
    TimeDRL: Disentangled Representation Learning for Multivariate Time-Series
    arXiv:2312.04142v2 Announce Type: replace Abstract: Multivariate time-series data in numerous real-world applications (e.g., healthcare and industry) are informative but challenging due to the lack of labels and high dimensionality. Recent studies in self-supervised learning have shown their potential in learning rich representations without relying on labels, yet they fall short in learning disentangled embeddings and addressing issues of inductive bias (e.g., transformation-invariance). To tackle these challenges, we propose TimeDRL, a generic multivariate time-series representation learning framework with disentangled dual-level embeddings. TimeDRL is characterized by three novel features: (i) disentangled derivation of timestamp-level and instance-level embeddings from patched time-series data using a [CLS] token strategy; (ii) utilization of timestamp-predictive and instance-contrastive tasks for disentangled representation learning, with the former optimizing timestamp-level embeddings with predictive loss, and the latter optimizing instance-level embeddings with contrastive loss; and (iii) avoidance of augmentation methods to eliminate inductive biases, such as transformation-invariance from cropping and masking. Comprehensive experiments on 6 time-series forecasting datasets and 5 time-series classification datasets have shown that TimeDRL consistently surpasses existing representation learning approaches, achieving an average improvement of forecasting by 58.02% in MSE and classification by 1.48% in accuracy. Furthermore, extensive ablation studies confirmed the relative contribution of each component in TimeDRL's architecture, and semi-supervised learning evaluations demonstrated its effectiveness in real-world scenarios, even with limited labeled data. The code is available at https://github.com/blacksnail789521/TimeDRL.  ( 3 min )
    Human Alignment of Large Language Models through Online Preference Optimisation
    arXiv:2403.08635v1 Announce Type: new Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.  ( 3 min )
    Diffusion-based Iterative Counterfactual Explanations for Fetal Ultrasound Image Quality Assessment
    arXiv:2403.08700v1 Announce Type: cross Abstract: Obstetric ultrasound image quality is crucial for accurate diagnosis and monitoring of fetal health. However, producing high-quality standard planes is difficult, influenced by the sonographer's expertise and factors like the maternal BMI or the fetus dynamics. In this work, we propose using diffusion-based counterfactual explainable AI to generate realistic high-quality standard planes from low-quality non-standard ones. Through quantitative and qualitative evaluation, we demonstrate the effectiveness of our method in producing plausible counterfactuals of increased quality. This shows future promise both for enhancing training of clinicians by providing visual feedback, as well as for improving image quality and, consequently, downstream diagnosis and monitoring.  ( 2 min )
    Nonlinear Manifold Learning Determines Microgel Size from Raman Spectroscopy
    arXiv:2403.08376v1 Announce Type: new Abstract: Polymer particle size constitutes a crucial characteristic of product quality in polymerization. Raman spectroscopy is an established and reliable process analytical technology for in-line concentration monitoring. Recent approaches and some theoretical considerations show a correlation between Raman signals and particle sizes but do not determine polymer size from Raman spectroscopic measurements accurately and reliably. With this in mind, we propose three alternative machine learning workflows to perform this task, all involving diffusion maps, a nonlinear manifold learning technique for dimensionality reduction: (i) directly from diffusion maps, (ii) alternating diffusion maps, and (iii) conformal autoencoder neural networks. We apply the workflows to a data set of Raman spectra with associated size measured via dynamic light scattering of 47 microgel (cross-linked polymer) samples in a diameter range of 208nm to 483 nm. The conformal autoencoders substantially outperform state-of-the-art methods and results for the first time in a promising prediction of polymer size from Raman spectra.  ( 3 min )
    Detecting Errors in a Numerical Response via any Regression Model
    arXiv:2305.16583v3 Announce Type: replace-cross Abstract: Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduced veracity scores that distinguish between genuine errors and natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors, and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.  ( 3 min )
    Learning-driven Physically-aware Large-scale Circuit Gate Sizing
    arXiv:2403.08193v1 Announce Type: new Abstract: Gate sizing plays an important role in timing optimization after physical design. Existing machine learning-based gate sizing works cannot optimize timing on multiple timing paths simultaneously and neglect the physical constraint on layouts. They cause sub-optimal sizing solutions and low-efficiency issues when compared with commercial gate sizing tools. In this work, we propose a learning-driven physically-aware gate sizing framework to optimize timing performance on large-scale circuits efficiently. In our gradient descent optimization-based work, for obtaining accurate gradients, a multi-modal gate sizing-aware timing model is achieved via learning timing information on multiple timing paths and physical information on multiple-scaled layouts jointly. Then, gradient generation based on the sizing-oriented estimator and adaptive back-propagation are developed to update gate sizes. Our results demonstrate that our work achieves higher timing performance improvements in a faster way compared with the commercial gate sizing tool.  ( 2 min )
    Towards Independence Criterion in Machine Unlearning of Features and Labels
    arXiv:2403.08124v1 Announce Type: new Abstract: This work delves into the complexities of machine unlearning in the face of distributional shifts, particularly focusing on the challenges posed by non-uniform feature and label removal. With the advent of regulations like the GDPR emphasizing data privacy and the right to be forgotten, machine learning models face the daunting task of unlearning sensitive information without compromising their integrity or performance. Our research introduces a novel approach that leverages influence functions and principles of distributional independence to address these challenges. By proposing a comprehensive framework for machine unlearning, we aim to ensure privacy protection while maintaining model performance and adaptability across varying distributions. Our method not only facilitates efficient data removal but also dynamically adjusts the model to preserve its generalization capabilities. Through extensive experimentation, we demonstrate the efficacy of our approach in scenarios characterized by significant distributional shifts, making substantial contributions to the field of machine unlearning. This research paves the way for developing more resilient and adaptable unlearning techniques, ensuring models remain robust and accurate in the dynamic landscape of data privacy and machine learning.  ( 3 min )
    Diffusion Models with Implicit Guidance for Medical Anomaly Detection
    arXiv:2403.08464v1 Announce Type: cross Abstract: Diffusion models have advanced unsupervised anomaly detection by improving the transformation of pathological images into pseudo-healthy equivalents. Nonetheless, standard approaches may compromise critical information during pathology removal, leading to restorations that do not align with unaffected regions in the original scans. Such discrepancies can inadvertently increase false positive rates and reduce specificity, complicating radiological evaluations. This paper introduces Temporal Harmonization for Optimal Restoration (THOR), which refines the de-noising process by integrating implicit guidance through temporal anomaly maps. THOR aims to preserve the integrity of healthy tissue in areas unaffected by pathology. Comparative evaluations show that THOR surpasses existing diffusion-based methods in detecting and segmenting anomalies in brain MRIs and wrist X-rays. Code: https://github.com/ci-ber/THOR_DDPM.  ( 2 min )
    Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition
    arXiv:2403.07953v1 Announce Type: new Abstract: Exploiting sparsity in deep neural networks (DNNs) has been a promising area to meet the growing computation need of modern DNNs. However, in practice, sparse DNN acceleration still faces a key challenge. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparse hardware support recently, which provides limited flexibility and requires extra model fine-tuning. Moreover, any sparse model fine-tuned for certain structured sparse hardware cannot be accelerated by other structured hardware. To bridge the gap between sparse DNN models and hardware, this paper proposes tensor approximation via structured decomposition (TASD), which leverages the distributive property in linear algebra to turn any sparse tensor into a series of structured sparse tensors. Next, we develop a software framework, TASDER, to accelerate DNNs by searching layer-wise, high-quality structured decomposition for both weight and activation tensors so that they can be accelerated by any systems with structured sparse hardware support. Evaluation results show that, by exploiting prior structured sparse hardware baselines, our method can accelerate off-the-shelf dense and sparse DNNs without fine-tuning and improves energy-delay-product by up to 83% and 74% on average.  ( 3 min )
    Federated Knowledge Graph Unlearning via Diffusion Model
    arXiv:2403.08554v1 Announce Type: new Abstract: Federated learning (FL) promotes the development and application of artificial intelligence technologies by enabling model sharing and collaboration while safeguarding data privacy. Knowledge graph (KG) embedding representation provides a foundation for knowledge reasoning and applications by mapping entities and relations into vector space. Federated KG embedding enables the utilization of knowledge from diverse client sources while safeguarding the privacy of local data. However, due to demands such as privacy protection and the need to adapt to dynamic data changes, investigations into machine unlearning (MU) have been sparked. However, it is challenging to maintain the performance of KG embedding models while forgetting the influence of specific forgotten data on the model. In this paper, we propose FedDM, a novel framework tailored for machine unlearning in federated knowledge graphs. Leveraging diffusion models, we generate noisy data to sensibly mitigate the influence of specific knowledge on FL models while preserving the overall performance concerning the remaining data. We conduct experimental evaluations on benchmark datasets to assess the efficacy of the proposed model. Extensive experiments demonstrate that FedDM yields promising results in knowledge forgetting.  ( 3 min )
    Fast Inference of Removal-Based Node Influence
    arXiv:2403.08333v1 Announce Type: new Abstract: Graph neural networks (GNNs) are widely utilized to capture the information spreading patterns in graphs. While remarkable performance has been achieved, there is a new trending topic of evaluating node influence. We propose a new method of evaluating node influence, which measures the prediction change of a trained GNN model caused by removing a node. A real-world application is, "In the task of predicting Twitter accounts' polarity, had a particular account been removed, how would others' polarity change?". We use the GNN as a surrogate model whose prediction could simulate the change of nodes or edges caused by node removal. To obtain the influence for every node, a straightforward way is to alternately remove every node and apply the trained GNN on the modified graph. It is reliable but time-consuming, so we need an efficient method. The related lines of work, such as graph adversarial attack and counterfactual explanation, cannot directly satisfy our needs, since they do not focus on the global influence score for every node. We propose an efficient and intuitive method, NOde-Removal-based fAst GNN inference (NORA), which uses the gradient to approximate the node-removal influence. It only costs one forward propagation and one backpropagation to approximate the influence score for all nodes. Extensive experiments on six datasets and six GNN models verify the effectiveness of NORA. Our code is available at https://github.com/weikai-li/NORA.git.  ( 3 min )
    Reproducibility and Geometric Intrinsic Dimensionality: An Investigation on Graph Neural Network Research
    arXiv:2403.08438v1 Announce Type: new Abstract: Difficulties in replication and reproducibility of empirical evidences in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable requires reproducibility, which verifies the reliability of research findings using the same code and data. This promotes open and accessible research, robust experimental workflows, and the rapid integration of new findings. Evaluating the degree to which research publications support these different aspects of reproducibility is one goal of the present work. For this we introduce an ontology of reproducibility in machine learning and apply it to methods for graph neural networks. Building on these efforts we turn towards another critical challenge in machine learning, namely the curse of dimensionality, which poses challenges in data collection, representation, and analysis, making it harder to find representative data and impeding the training and inference processes. Using the closely linked concept of geometric intrinsic dimension we investigate to which extend the used machine learning models are influenced by the intrinsic dimension of the data sets they are trained on.  ( 3 min )
    Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data
    arXiv:2403.08728v1 Announce Type: cross Abstract: We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. Our method, Ambient Diffusion Posterior Sampling (A-DPS), leverages a generative model pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling conditioned on measurements from a potentially different forward process (e.g. image blurring). We test the efficacy of our approach on standard natural image datasets (CelebA, FFHQ, and AFHQ) and we show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance. We further extend the Ambient Diffusion framework to train MRI models with access only to Fourier subsampled multi-coil MRI measurements at various acceleration factors (R=2, 4, 6, 8). We again observe that models trained on highly subsampled data are better priors for solving inverse problems in the high acceleration regime than models trained on fully sampled data. We open-source our code and the trained Ambient Diffusion MRI models: https://github.com/utcsilab/ambient-diffusion-mri .  ( 3 min )
    Leveraging Non-Decimated Wavelet Packet Features and Transformer Models for Time Series Forecasting
    arXiv:2403.08630v1 Announce Type: cross Abstract: This article combines wavelet analysis techniques with machine learning methods for univariate time series forecasting, focusing on three main contributions. Firstly, we consider the use of Daubechies wavelets with different numbers of vanishing moments as input features to both non-temporal and temporal forecasting methods, by selecting these numbers during the cross-validation phase. Secondly, we compare the use of both the non-decimated wavelet transform and the non-decimated wavelet packet transform for computing these features, the latter providing a much larger set of potentially useful coefficient vectors. The wavelet coefficients are computed using a shifted version of the typical pyramidal algorithm to ensure no leakage of future information into these inputs. Thirdly, we evaluate the use of these wavelet features on a significantly wider set of forecasting methods than previous studies, including both temporal and non-temporal models, and both statistical and deep learning-based methods. The latter include state-of-the-art transformer-based neural network architectures. Our experiments suggest significant benefit in replacing higher-order lagged features with wavelet features across all examined non-temporal methods for one-step-forward forecasting, and modest benefit when used as inputs for temporal deep learning-based models for long-horizon forecasting.  ( 3 min )
    Sketching the Heat Kernel: Using Gaussian Processes to Embed Data
    arXiv:2403.07929v1 Announce Type: new Abstract: This paper introduces a novel, non-deterministic method for embedding data in low-dimensional Euclidean space based on computing realizations of a Gaussian process depending on the geometry of the data. This type of embedding first appeared in (Adler et al, 2018) as a theoretical model for a generic manifold in high dimensions. In particular, we take the covariance function of the Gaussian process to be the heat kernel, and computing the embedding amounts to sketching a matrix representing the heat kernel. The Karhunen-Lo\`eve expansion reveals that the straight-line distances in the embedding approximate the diffusion distance in a probabilistic sense, avoiding the need for sharp cutoffs and maintaining some of the smaller-scale structure. Our method demonstrates further advantage in its robustness to outliers. We justify the approach with both theory and experiments.  ( 2 min )
    A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems
    arXiv:2312.07511v2 Announce Type: replace Abstract: Recent advances in computational modelling of atomic systems, spanning molecules, proteins, and materials, represent them as geometric graphs with atoms embedded as nodes in 3D Euclidean space. In these graphs, the geometric attributes transform according to the inherent physical symmetries of 3D atomic systems, including rotations and translations in Euclidean space, as well as node permutations. In recent years, Geometric Graph Neural Networks have emerged as the preferred machine learning architecture powering applications ranging from protein structure prediction to molecular simulations and material generation. Their specificity lies in the inductive biases they leverage - such as physical symmetries and chemical properties - to learn informative representations of these geometric graphs. In this opinionated paper, we provide a comprehensive and self-contained overview of the field of Geometric GNNs for 3D atomic systems. We cover fundamental background material and introduce a pedagogical taxonomy of Geometric GNN architectures: (1) invariant networks, (2) equivariant networks in Cartesian basis, (3) equivariant networks in spherical basis, and (4) unconstrained networks. Additionally, we outline key datasets and application areas and suggest future research directions. The objective of this work is to present a structured perspective on the field, making it accessible to newcomers and aiding practitioners in gaining an intuition for its mathematical abstractions.  ( 3 min )
    Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
    arXiv:2310.02279v2 Announce Type: replace Abstract: Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64x64 resolution (FID 1.92). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods from the diffusion community. This access also enables the computation of likelihood. The code is available at https://github.com/sony/ctm.  ( 3 min )
    Entropic (Gromov) Wasserstein Flow Matching with GENOT
    arXiv:2310.09254v3 Announce Type: replace-cross Abstract: Optimal transport (OT) theory has reshaped the field of generative modeling: Combined with neural networks, recent \textit{Neural OT} (N-OT) solvers use OT as an inductive bias, to focus on ``thrifty'' mappings that minimize average displacement costs. This core principle has fueled the successful application of N-OT solvers to high-stakes scientific challenges, notably single-cell genomics. N-OT solvers are, however, increasingly confronted with practical challenges: while most N-OT solvers can handle squared-Euclidean costs, they must be repurposed to handle more general costs; their reliance on deterministic Monge maps as well as mass conservation constraints can easily go awry in the presence of outliers; mapping points \textit{across} heterogeneous spaces is out of their reach. While each of these challenges has been explored independently, we propose a new framework that can handle, natively, all of these needs. The \textit{generative entropic neural OT} (GENOT) framework models the conditional distribution $\pi_\varepsilon(\*y|\*x)$ of an optimal \textit{entropic} coupling $\pi_\varepsilon$, using conditional flow matching. GENOT is generative, and can transport points \textit{across} spaces, guided by sample-based, unbalanced solutions to the Gromov-Wasserstein problem, that can use any cost. We showcase our approach on both synthetic and single-cell datasets, using GENOT to model cell development, predict cellular responses, and translate between data modalities.  ( 3 min )
    Explainable Anomaly Detection in Images and Videos: A Survey
    arXiv:2302.06670v2 Announce Type: replace Abstract: Anomaly detection and localization of visual data, including images and videos, are of great significance in both machine learning academia and applied real-world scenarios. Despite the rapid development of visual anomaly detection techniques in recent years, the interpretations of these black-box models and reasonable explanations of why anomalies can be distinguished out are scarce. This paper provides the first survey concentrated on explainable visual anomaly detection methods. We first introduce the basic background of image-level and video-level anomaly detection. Then, as the main content of this survey, a comprehensive and exhaustive literature review of explainable anomaly detection methods for both images and videos is presented. Next, we analyze why some explainable anomaly detection methods can be applied to both images and videos and why others can be only applied to one modality. Additionally, we provide summaries of current 2D visual anomaly detection datasets and evaluation metrics. Finally, we discuss several promising future directions and open problems to explore the explainability of 2D visual anomaly detection. The related resource collection is given at \href{https://github.com/wyzjack/Awesome-XAD}{this repo}.  ( 3 min )
    Inverse Design of Vitrimeric Polymers by Molecular Dynamics and Generative Modeling
    arXiv:2312.03690v2 Announce Type: replace-cross Abstract: Vitrimer is a new class of sustainable polymers with the ability of self-healing through rearrangement of dynamic covalent adaptive networks. However, a limited choice of constituent molecules restricts their property space, prohibiting full realization of their potential applications. Through a combination of molecular dynamics (MD) simulations and machine learning (ML), particularly a novel graph variational autoencoder (VAE) model, we establish a method for generating novel vitrimers and guide their inverse design based on desired glass transition temperature (Tg). We build the first vitrimer dataset of one million and calculate Tg on 8,424 of them by high-throughput MD simulations calibrated by a Gaussian process model. The proposed VAE employs dual graph encoders and a latent dimension overlapping scheme which allows for individual representation of multi-component vitrimers. By constructing a continuous latent space containing necessary information of vitrimers, we demonstrate high accuracy and efficiency of our framework in discovering novel vitrimers with desirable Tg beyond the training regime. The proposed vitrimers with reasonable synthesizability cover a wide range of Tg and broaden the potential widespread usage of vitrimeric materials.  ( 3 min )
    Addressing the Regulatory Gap: Moving Towards an EU AI Audit Ecosystem Beyond the AIA by Including Civil Society
    arXiv:2403.07904v1 Announce Type: cross Abstract: The European legislature has proposed the Digital Services Act (DSA) and Artificial Intelligence Act (AIA) to regulate platforms and Artificial Intelligence (AI) products. We review to what extent third-party audits are part of both laws and to what extent access to models and data is provided. By considering the value of third-party audits and third-party data access in an audit ecosystem, we identify a regulatory gap in that the Artificial Intelligence Act does not provide access to data for researchers and civil society. Our contributions to the literature include: (1) Defining an AI audit ecosystem that incorporates compliance and oversight. (2) Highlighting a regulatory gap within the DSA and AIA regulatory framework, preventing the establishment of an AI audit ecosystem. (3) Emphasizing that third-party audits by research and civil society must be part of that ecosystem and demand that the AIA include data and model access for certain AI products. We call for the DSA to provide NGOs and investigative journalists with data access to platforms by delegated acts and for adaptions and amendments of the AIA to provide third-party audits and data and model access at least for high-risk systems to close the regulatory gap. Regulations modeled after European Union AI regulations should enable data access and third-party audits, fostering an AI audit ecosystem that promotes compliance and oversight mechanisms.  ( 3 min )
    Link Prediction for Social Networks using Representation Learning and Heuristic-based Features
    arXiv:2403.08613v1 Announce Type: cross Abstract: The exponential growth in scale and relevance of social networks enable them to provide expansive insights. Predicting missing links in social networks efficiently can help in various modern-day business applications ranging from generating recommendations to influence analysis. Several categories of solutions exist for the same. Here, we explore various feature extraction techniques to generate representations of nodes and edges in a social network that allow us to predict missing links. We compare the results of using ten feature extraction techniques categorized across Structural embeddings, Neighborhood-based embeddings, Graph Neural Networks, and Graph Heuristics, followed by modeling with ensemble classifiers and custom Neural Networks. Further, we propose combining heuristic-based features and learned representations that demonstrate improved performance for the link prediction task on social network datasets. Using this method to generate accurate recommendations for many applications is a matter of further study that appears very promising. The code for all the experiments has been made public.  ( 3 min )
    Continual Adversarial Defense
    arXiv:2312.09481v2 Announce Type: replace-cross Abstract: In response to the rapidly evolving nature of adversarial attacks against visual classifiers on a monthly basis, numerous defenses have been proposed to generalize against as many known attacks as possible. However, designing a defense method that generalizes to all types of attacks is not realistic because the environment in which defense systems operate is dynamic and comprises various unique attacks that emerge as time goes on. The defense system must gather online few-shot defense feedback to promptly enhance itself, leveraging efficient memory utilization. Therefore, we propose the first continual adversarial defense (CAD) framework that adapts to any attacks in a dynamic scenario, where various attacks emerge stage by stage. In practice, CAD is modeled under four principles: (1) continual adaptation to new attacks without catastrophic forgetting, (2) few-shot adaptation, (3) memory-efficient adaptation, and (4) high accuracy on both clean and adversarial images. We explore and integrate cutting-edge continual learning, few-shot learning, and ensemble learning techniques to qualify the principles. Experiments conducted on CIFAR-10 and ImageNet-100 validate the effectiveness of our approach against multiple stages of modern adversarial attacks and demonstrate significant improvements over numerous baseline methods. In particular, CAD is capable of quickly adapting with minimal feedback and a low cost of defense failure, while maintaining good performance against previous attacks. Our research sheds light on a brand-new paradigm for continual defense adaptation against dynamic and evolving attacks.  ( 3 min )
    Actor-Critic Physics-informed Neural Lyapunov Control
    arXiv:2403.08448v1 Announce Type: new Abstract: Designing control policies for stabilization tasks with provable guarantees is a long-standing problem in nonlinear control. A crucial performance metric is the size of the resulting region of attraction, which essentially serves as a robustness "margin" of the closed-loop system against uncertainties. In this paper, we propose a new method to train a stabilizing neural network controller along with its corresponding Lyapunov certificate, aiming to maximize the resulting region of attraction while respecting the actuation constraints. Crucial to our approach is the use of Zubov's Partial Differential Equation (PDE), which precisely characterizes the true region of attraction of a given control policy. Our framework follows an actor-critic pattern where we alternate between improving the control policy (actor) and learning a Zubov function (critic). Finally, we compute the largest certifiable region of attraction by invoking an SMT solver after the training procedure. Our numerical experiments on several design problems show consistent and significant improvements in the size of the resulting region of attraction.  ( 2 min )
    Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
    arXiv:2403.08699v1 Announce Type: new Abstract: We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such implicit regularization can be described by a Support Vector Machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results showing that the gradient descent induces an implicit regularization on the Frobenius norm on the product weight matrix when the key and query matrices are combined into a single weight matrix for training. For diagonal key and query matrices, our analysis builds upon the reparameterization technique and exploits approximate KKT conditions of the SVM associated with the classification data. Moreover, the results are extended to general weights configurations given proper alignment of the weight matrices' singular spaces with the data features at initialization.  ( 3 min )
    Optimizing Polynomial Graph Filters: A Novel Adaptive Krylov Subspace Approach
    arXiv:2403.07954v1 Announce Type: new Abstract: Graph Neural Networks (GNNs), known as spectral graph filters, find a wide range of applications in web networks. To bypass eigendecomposition, polynomial graph filters are proposed to approximate graph filters by leveraging various polynomial bases for filter training. However, no existing studies have explored the diverse polynomial graph filters from a unified perspective for optimization. In this paper, we first unify polynomial graph filters, as well as the optimal filters of identical degrees into the Krylov subspace of the same order, thus providing equivalent expressive power theoretically. Next, we investigate the asymptotic convergence property of polynomials from the unified Krylov subspace perspective, revealing their limited adaptability in graphs with varying heterophily degrees. Inspired by those facts, we design a novel adaptive Krylov subspace approach to optimize polynomial bases with provable controllability over the graph spectrum so as to adapt various heterophily graphs. Subsequently, we propose AdaptKry, an optimized polynomial graph filter utilizing bases from the adaptive Krylov subspaces. Meanwhile, in light of the diverse spectral properties of complex graphs, we extend AdaptKry by leveraging multiple adaptive Krylov bases without incurring extra training costs. As a consequence, extended AdaptKry is able to capture the intricate characteristics of graphs and provide insights into their inherent complexity. We conduct extensive experiments across a series of real-world datasets. The experimental results demonstrate the superior filtering capability of AdaptKry, as well as the optimized efficacy of the adaptive Krylov basis.  ( 3 min )
    CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
    arXiv:2309.17428v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrained by general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. For each task, we collect specific code solutions by prompting GPT-4 to solve the training examples. Following a validation step ensuring the correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioning on the retrieved snippets. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved by scaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity. The code is available at https://github.com/lifan-yuan/CRAFT.  ( 3 min )
    PaddingFlow: Improving Normalizing Flows with Padding-Dimensional Noise
    arXiv:2403.08216v1 Announce Type: new Abstract: Normalizing flow is a generative modeling approach with efficient sampling. However, Flow-based models suffer two issues, which are manifold and discrete data. If the target distribution is a manifold, which means the dimension of the latent target distribution and the dimension of the data distribution are unmatched, flow-based models might perform badly. Discrete data makes flow-based models collapse into a degenerate mixture of point masses. In this paper, to sidestep such two issues we propose PaddingFlow, a novel dequantization method, which improves normalizing flows with padding-dimensional noise. PaddingFlow is easy to implement, computationally cheap, widely suitable for various tasks, and generates samples that are unbiased estimations of the data. Especially, our method can overcome the limitation of existing dequantization methods that have to change the data distribution, which might degrade performance. We validate our method on the main benchmarks of unconditional density estimation, including five tabular datasets and four image datasets for VAE models, and the IK experiments which are conditional density estimation. The results show that PaddingFlow can provide improvement on all tasks in this paper.  ( 3 min )
    Data-Efficient Sleep Staging with Synthetic Time Series Pretraining
    arXiv:2403.08592v1 Announce Type: new Abstract: Analyzing electroencephalographic (EEG) time series can be challenging, especially with deep neural networks, due to the large variability among human subjects and often small datasets. To address these challenges, various strategies, such as self-supervised learning, have been suggested, but they typically rely on extensive empirical datasets. Inspired by recent advances in computer vision, we propose a pretraining task termed "frequency pretraining" to pretrain a neural network for sleep staging by predicting the frequency content of randomly generated synthetic time series. Our experiments demonstrate that our method surpasses fully supervised learning in scenarios with limited data and few subjects, and matches its performance in regimes with many subjects. Furthermore, our results underline the relevance of frequency information for sleep stage scoring, while also demonstrating that deep neural networks utilize information beyond frequencies to enhance sleep staging performance, which is consistent with previous research. We anticipate that our approach will be advantageous across a broad spectrum of applications where EEG data is limited or derived from a small number of subjects, including the domain of brain-computer interfaces.  ( 3 min )
    Change Point Detection with Copula Entropy based Two-Sample Test
    arXiv:2403.07892v1 Announce Type: cross Abstract: Change point detection is a typical task that aim to find changes in time series and can be tackled with two-sample test. Copula Entropy is a mathematical concept for measuring statistical independence and a two-sample test based on it was introduced recently. In this paper we propose a nonparametric multivariate method for multiple change point detection with the copula entropy-based two-sample test. The single change point detection is first proposed as a group of two-sample tests on every points of time series data and the change point is considered as with the maximum of the test statistics. The multiple change point detection is then proposed by combining the single change point detection method with binary segmentation strategy. We verified the effectiveness of our method and compared it with the other similar methods on the simulated univariate and multivariate data and the Nile data.  ( 2 min )
    Unsupervised Learning of Hybrid Latent Dynamics: A Learn-to-Identify Framework
    arXiv:2403.08194v1 Announce Type: new Abstract: Modern applications increasingly require unsupervised learning of latent dynamics from high-dimensional time-series. This presents a significant challenge of identifiability: many abstract latent representations may reconstruct observations, yet do they guarantee an adequate identification of the governing dynamics? This paper investigates this challenge from two angles: the use of physics inductive bias specific to the data being modeled, and a learn-to-identify strategy that separates forecasting objectives from the data used for the identification. We combine these two strategies in a novel framework for unsupervised meta-learning of hybrid latent dynamics (Meta-HyLaD) with: 1) a latent dynamic function that hybridize known mathematical expressions of prior physics with neural functions describing its unknown errors, and 2) a meta-learning formulation to learn to separately identify both components of the hybrid dynamics. Through extensive experiments on five physics and one biomedical systems, we provide strong evidence for the benefits of Meta-HyLaD to integrate rich prior knowledge while identifying their gap to observed data.  ( 2 min )
    A path-norm toolkit for modern networks: consequences, promises and challenges
    arXiv:2310.01225v4 Announce Type: replace-cross Abstract: This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.  ( 3 min )
    Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality
    arXiv:2403.08131v1 Announce Type: cross Abstract: Tuning searches are pivotal in High-Performance Computing (HPC), addressing complex optimization challenges in computational applications. The complexity arises not only from finely tuning parameters within routines but also potential interdependencies among them, rendering traditional optimization methods inefficient. Instead of scrutinizing interdependencies among parameters and routines, practitioners often face the dilemma of conducting independent tuning searches for each routine, thereby overlooking interdependence, or pursuing a more resource-intensive joint search for all routines. This decision is driven by the consideration that some interdependence analysis and high-dimensional decomposition techniques in literature may be prohibitively expensive in HPC tuning searches. Our methodology adapts and refines these methods to ensure computational feasibility while maximizing performance gains in real-world scenarios. Our methodology leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or conduct orthogonal searches. Tested on synthetic functions with varying levels of parameter interdependence, our methodology efficiently explores the search space. In comparison to Bayesian-optimization-based full independent or fully joint searches, our methodology suggested an optimized breakdown of independent and merged searches that led to final configurations up to 8% more accurate, reducing the search time by up to 95%. When applied to GPU-offloaded Real-Time Time-Dependent Density Functional Theory (RT-TDDFT), an application in computational materials science that challenges modern HPC autotuners, our methodology achieved an effective tuning search. Its adaptability and efficiency extend beyond RT-TDDFT, making it valuable for related applications in HPC.  ( 3 min )
    McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets
    arXiv:2403.08027v1 Announce Type: new Abstract: How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How to obtain scores that are principled in one scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above; and, it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-elements microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.  ( 3 min )
    Bridging State and History Representations: Understanding Self-Predictive RL
    arXiv:2401.08898v2 Announce Type: replace Abstract: Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.  ( 3 min )
    Neural reproducing kernel Banach spaces and representer theorems for deep networks
    arXiv:2403.08750v1 Announce Type: cross Abstract: Studying the function spaces defined by neural networks helps to understand the corresponding learning models and their inductive bias. While in some limits neural networks correspond to function spaces that are reproducing kernel Hilbert spaces, these regimes do not capture the properties of the networks used in practice. In contrast, in this paper we show that deep neural networks define suitable reproducing kernel Banach spaces. These spaces are equipped with norms that enforce a form of sparsity, enabling them to adapt to potential latent structures within the input data and their representations. In particular, leveraging the theory of reproducing kernel Banach spaces, combined with variational results, we derive representer theorems that justify the finite architectures commonly employed in applications. Our study extends analogous results for shallow networks and can be seen as a step towards considering more practically plausible neural architectures.  ( 2 min )
    DeepCDCL: An CDCL-based Neural Network Verification Framework
    arXiv:2403.07956v1 Announce Type: new Abstract: Neural networks in safety-critical applications face increasing safety and security concerns due to their susceptibility to little disturbance. In this paper, we propose DeepCDCL, a novel neural network verification framework based on the Conflict-Driven Clause Learning (CDCL) algorithm. We introduce an asynchronous clause learning and management structure, reducing redundant time consumption compared to the direct application of the CDCL framework. Furthermore, we also provide a detailed evaluation of the performance of our approach on the ACAS Xu and MNIST datasets, showing that a significant speed-up is achieved in most cases.  ( 2 min )
    A Mathematical Framework for the Problem of Security for Cognition in Neurotechnology
    arXiv:2403.07945v1 Announce Type: cross Abstract: The rapid advancement in neurotechnology in recent years has created an emerging critical intersection between neurotechnology and security. Implantable devices, non-invasive monitoring, and non-invasive therapies all carry with them the prospect of violating the privacy and autonomy of individuals' cognition. A growing number of scientists and physicians have made calls to address this issue -- which we term Cognitive Security -- but applied efforts have been limited. A major barrier hampering scientific and engineering efforts to address Cognitive Security is the lack of a clear means of describing and analyzing relevant problems. In this paper we develop Cognitive Security, a mathematical framework which enables such description and analysis by drawing on methods and results from multiple fields. We demonstrate certain statistical properties which have significant implications for Cognitive Security, and then present descriptions of the algorithmic problems faced by attackers attempting to violate privacy and autonomy, and defenders attempting to obstruct such attempts.  ( 3 min )
    Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations
    arXiv:2403.08121v1 Announce Type: new Abstract: This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks, starting with small initializations. The present work considers neural networks that are assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two. This paper demonstrates that for sufficiently small initializations, during the early stages of training, the weights of the neural network remain small in norm and approximately converge in direction along the Karush-Kuhn-Tucker (KKT) points of the neural correlation function introduced in [1]. Additionally, for square loss and under a separability assumption on the weights of neural networks, a similar directional convergence of gradient flow dynamics is shown near certain saddle points of the loss function.  ( 2 min )
    Promises of Deep Kernel Learning for Control Synthesis
    arXiv:2309.06569v2 Announce Type: replace-cross Abstract: Deep Kernel Learning (DKL) combines the representational power of neural networks with the uncertainty quantification of Gaussian Processes. Hence, it is potentially a promising tool to learn and control complex dynamical systems. In this work, we develop a scalable abstraction-based framework that enables the use of DKL for control synthesis of stochastic dynamical systems against complex specifications. Specifically, we consider temporal logic specifications and create an end-to-end framework that uses DKL to learn an unknown system from data and formally abstracts the DKL model into an Interval Markov Decision Process (IMDP) to perform control synthesis with correctness guarantees. Furthermore, we identify a deep architecture that enables accurate learning and efficient abstraction computation. The effectiveness of our approach is illustrated on various benchmarks, including a 5-D nonlinear stochastic system, showing how control synthesis with DKL can substantially outperform state-of-the-art competitive methods.  ( 2 min )
    MLLMs-Augmented Visual-Language Representation Learning
    arXiv:2311.18765v3 Announce Type: replace-cross Abstract: Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets. Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image. To prevent the bias introduced by MLLMs' hallucinations and monotonous language styles, we propose "text shearing" to maintain the quality and availability of extended captions. In image-text retrieval, without introducing additional training cost, our method consistently obtains 5.6 ~ 35.0 and 16.8 ~ 46.1 improvement on Recall@1 under the fine-tuning and zero-shot settings, respectively. Notably, we obtain zero-shot results that are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs.  ( 2 min )
  • Open

    Bayesian Analysis for Over-parameterized Linear Model without Sparsity
    arXiv:2305.15754v2 Announce Type: replace-cross Abstract: In the field of high-dimensional Bayesian statistics, a plethora of methodologies have been developed, including various prior distributions that result in parameter sparsity. However, such priors exhibit limitations in handling the spectral eigenvector structure of data, rendering estimations less effective for analyzing the over-parameterized models (high-dimensional linear models that do not assume sparsity) developed in recent years. This study introduces a Bayesian approach that employs a prior distribution dependent on the eigenvectors of data covariance matrices without inducing parameter sparsity. We also provide contraction rates of the derived posterior estimation and develop a truncated Gaussian approximation of the posterior distribution. The former demonstrates the efficiency of posterior estimation, whereas the latter facilitates the uncertainty quantification of parameters via a Bernstein--von Mises-type approach. These findings suggest that Bayesian methods capable of handling data spectra and estimating non-sparse high-dimensional parameters are feasible.  ( 2 min )
    Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?
    arXiv:2212.14511v2 Announce Type: replace-cross Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.  ( 3 min )
    Entropic (Gromov) Wasserstein Flow Matching with GENOT
    arXiv:2310.09254v3 Announce Type: replace Abstract: Optimal transport (OT) theory has reshaped the field of generative modeling: Combined with neural networks, recent \textit{Neural OT} (N-OT) solvers use OT as an inductive bias, to focus on ``thrifty'' mappings that minimize average displacement costs. This core principle has fueled the successful application of N-OT solvers to high-stakes scientific challenges, notably single-cell genomics. N-OT solvers are, however, increasingly confronted with practical challenges: while most N-OT solvers can handle squared-Euclidean costs, they must be repurposed to handle more general costs; their reliance on deterministic Monge maps as well as mass conservation constraints can easily go awry in the presence of outliers; mapping points \textit{across} heterogeneous spaces is out of their reach. While each of these challenges has been explored independently, we propose a new framework that can handle, natively, all of these needs. The \textit{generative entropic neural OT} (GENOT) framework models the conditional distribution $\pi_\varepsilon(\*y|\*x)$ of an optimal \textit{entropic} coupling $\pi_\varepsilon$, using conditional flow matching. GENOT is generative, and can transport points \textit{across} spaces, guided by sample-based, unbalanced solutions to the Gromov-Wasserstein problem, that can use any cost. We showcase our approach on both synthetic and single-cell datasets, using GENOT to model cell development, predict cellular responses, and translate between data modalities.  ( 3 min )
    Detecting Errors in a Numerical Response via any Regression Model
    arXiv:2305.16583v3 Announce Type: replace Abstract: Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduced veracity scores that distinguish between genuine errors and natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors, and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.  ( 3 min )
    On the Convergence of Locally Adaptive and Scalable Diffusion-Based Sampling Methods for Deep Bayesian Neural Network Posteriors
    arXiv:2403.08609v1 Announce Type: cross Abstract: Achieving robust uncertainty quantification for deep neural networks represents an important requirement in many real-world applications of deep learning such as medical imaging where it is necessary to assess the reliability of a neural network's prediction. Bayesian neural networks are a promising approach for modeling uncertainties in deep neural networks. Unfortunately, generating samples from the posterior distribution of neural networks is a major challenge. One significant advance in that direction would be the incorporation of adaptive step sizes, similar to modern neural network optimizers, into Monte Carlo Markov chain sampling algorithms without significantly increasing computational demand. Over the past years, several papers have introduced sampling algorithms with claims that they achieve this property. However, do they indeed converge to the correct distribution? In this paper, we demonstrate that these methods can have a substantial bias in the distribution they sample, even in the limit of vanishing step sizes and at full batch size.  ( 3 min )
    Terminal Embeddings in Sublinear Time
    arXiv:2110.08691v3 Announce Type: replace-cross Abstract: Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} When $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside of prior work is that evaluating their embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in~$m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.  ( 3 min )
    Verifix: Post-Training Correction to Improve Label Noise Robustness with Verified Samples
    arXiv:2403.08618v1 Announce Type: cross Abstract: Label corruption, where training samples have incorrect labels, can significantly degrade the performance of machine learning models. This corruption often arises from non-expert labeling or adversarial attacks. Acquiring large, perfectly labeled datasets is costly, and retraining large models from scratch when a clean dataset becomes available is computationally expensive. To address this challenge, we propose Post-Training Correction, a new paradigm that adjusts model parameters after initial training to mitigate label noise, eliminating the need for retraining. We introduce Verifix, a novel Singular Value Decomposition (SVD) based algorithm that leverages a small, verified dataset to correct the model weights using a single update. Verifix uses SVD to estimate a Clean Activation Space and then projects the model's weights onto this space to suppress activations corresponding to corrupted data. We demonstrate Verifix's effectiveness on both synthetic and real-world label noise. Experiments on the CIFAR dataset with 25% synthetic corruption show 7.36% generalization improvements on average. Additionally, we observe generalization improvements of up to 2.63% on naturally corrupted datasets like WebVision1.0 and Clothing1M.  ( 3 min )
    Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization
    arXiv:2401.15604v3 Announce Type: replace-cross Abstract: Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded input, vector-valued output, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with proper designs, the evolution of neural networks during training can be accurately modeled by a series of kernel regression tasks. Furthermore, by applying an early-stopping rule for gradient descent and leveraging recent developments in neural tangent kernels, we establish the first generalization error (sample complexity) bounds for learning the score function with neural networks, despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques.  ( 3 min )
    Extracting Explanations, Justification, and Uncertainty from Black-Box Deep Neural Networks
    arXiv:2403.08652v1 Announce Type: cross Abstract: Deep Neural Networks (DNNs) do not inherently compute or exhibit empirically-justified task confidence. In mission critical applications, it is important to both understand associated DNN reasoning and its supporting evidence. In this paper, we propose a novel Bayesian approach to extract explanations, justifications, and uncertainty estimates from DNNs. Our approach is efficient both in terms of memory and computation, and can be applied to any black box DNN without any retraining, including applications to anomaly detection and out-of-distribution detection tasks. We validate our approach on the CIFAR-10 dataset, and show that it can significantly improve the interpretability and reliability of DNNs.  ( 2 min )
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond
    arXiv:2203.07835v4 Announce Type: replace-cross Abstract: With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.  ( 3 min )
    Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
    arXiv:2310.02279v2 Announce Type: replace-cross Abstract: Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64x64 resolution (FID 1.92). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods from the diffusion community. This access also enables the computation of likelihood. The code is available at https://github.com/sony/ctm.  ( 3 min )
    Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
    arXiv:2403.08699v1 Announce Type: cross Abstract: We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such implicit regularization can be described by a Support Vector Machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results showing that the gradient descent induces an implicit regularization on the Frobenius norm on the product weight matrix when the key and query matrices are combined into a single weight matrix for training. For diagonal key and query matrices, our analysis builds upon the reparameterization technique and exploits approximate KKT conditions of the SVM associated with the classification data. Moreover, the results are extended to general weights configurations given proper alignment of the weight matrices' singular spaces with the data features at initialization.  ( 3 min )
    Human Alignment of Large Language Models through Online Preference Optimisation
    arXiv:2403.08635v1 Announce Type: cross Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.  ( 3 min )
    A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems
    arXiv:2312.07511v2 Announce Type: replace-cross Abstract: Recent advances in computational modelling of atomic systems, spanning molecules, proteins, and materials, represent them as geometric graphs with atoms embedded as nodes in 3D Euclidean space. In these graphs, the geometric attributes transform according to the inherent physical symmetries of 3D atomic systems, including rotations and translations in Euclidean space, as well as node permutations. In recent years, Geometric Graph Neural Networks have emerged as the preferred machine learning architecture powering applications ranging from protein structure prediction to molecular simulations and material generation. Their specificity lies in the inductive biases they leverage - such as physical symmetries and chemical properties - to learn informative representations of these geometric graphs. In this opinionated paper, we provide a comprehensive and self-contained overview of the field of Geometric GNNs for 3D atomic systems. We cover fundamental background material and introduce a pedagogical taxonomy of Geometric GNN architectures: (1) invariant networks, (2) equivariant networks in Cartesian basis, (3) equivariant networks in spherical basis, and (4) unconstrained networks. Additionally, we outline key datasets and application areas and suggest future research directions. The objective of this work is to present a structured perspective on the field, making it accessible to newcomers and aiding practitioners in gaining an intuition for its mathematical abstractions.  ( 3 min )
    A Sparsity Principle for Partially Observable Causal Representation Learning
    arXiv:2403.08335v1 Announce Type: cross Abstract: Causal representation learning aims at identifying high-level causal variables from perceptual data. Most methods assume that all latent causal variables are captured in the high-dimensional observations. We instead consider a partially observed setting, in which each measurement only provides information about a subset of the underlying causal state. Prior work has studied this setting with multiple domains or views, each depending on a fixed subset of latents. Here, we focus on learning from unpaired observations from a dataset with an instance-dependent partial observability pattern. Our main contribution is to establish two identifiability results for this setting: one for linear mixing functions without parametric assumptions on the underlying causal model, and one for piecewise linear mixing functions with Gaussian latent causal variables. Based on these insights, we propose two methods for estimating the underlying causal variables by enforcing sparsity in the inferred representation. Experiments on different simulated datasets and established benchmarks highlight the effectiveness of our approach in recovering the ground-truth latents.  ( 3 min )
    Kernel-Based Testing for Single-Cell Differential Analysis
    arXiv:2307.08509v2 Announce Type: replace Abstract: Single-cell technologies offer insights into molecular feature distributions, but comparing them poses challenges. We propose a kernel-testing framework for non-linear cell-wise distribution comparison, analyzing gene expression and epigenomic modifications. Our method allows feature-wise and global transcriptome/epigenome comparisons, revealing cell population heterogeneities. Using a classifier based on embedding variability, we identify transitions in cell states, overcoming limitations of traditional single-cell analysis. Applied to single-cell ChIP-Seq data, our approach identifies untreated breast cancer cells with an epigenomic profile resembling persister cells. This demonstrates the effectiveness of kernel testing in uncovering subtle population variations that might be missed by other methods.  ( 2 min )
    Effective Structural Encodings via Local Curvature Profiles
    arXiv:2311.14864v2 Announce Type: replace-cross Abstract: Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, short LCP) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.  ( 3 min )
    When can we Approximate Wide Contrastive Models with Neural Tangent Kernels and Principal Component Analysis?
    arXiv:2403.08673v1 Announce Type: cross Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data that has been highly successful for image and text data. Several recent works have examined contrastive losses to claim that contrastive models effectively learn spectral embeddings, while few works show relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known if trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of two-layer contrastive models, with non-linear activation, and answer when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first convergence results of NTK for contrastive losses, and present a nuanced picture: NTK of wide networks remains almost constant for cosine similarity based contrastive losses, but not for losses based on dot product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that representations learned by contrastive models are close to the principal components of a certain matrix computed from random features. We empirically show that our theoretical results possibly hold beyond two-layer networks.  ( 3 min )
    A path-norm toolkit for modern networks: consequences, promises and challenges
    arXiv:2310.01225v4 Announce Type: replace Abstract: This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.  ( 3 min )
    Stein Boltzmann Sampling: A Variational Approach for Global Optimization
    arXiv:2402.04689v4 Announce Type: replace-cross Abstract: In this paper, we introduce a new flow-based method for global optimization of Lipschitz functions, called Stein Boltzmann Sampling (SBS). Our method samples from the Boltzmann distribution that becomes asymptotically uniform over the set of the minimizers of the function to be optimized. Candidate solutions are sampled via the \emph{Stein Variational Gradient Descent} algorithm. We prove the asymptotic convergence of our method, introduce two SBS variants, and provide a detailed comparison with several state-of-the-art global optimization algorithms on various benchmark functions. The design of our method, the theoretical results, and our experiments, suggest that SBS is particularly well-suited to be used as a continuation of efficient global optimization methods as it can produce better solutions while making a good use of the budget.  ( 2 min )
    Bayesian Optimization that Limits Search Region to Lower Dimensions Utilizing Local GPR
    arXiv:2403.08331v1 Announce Type: cross Abstract: Optimization of product and system characteristics is required in many fields, including design and control. Bayesian optimization (BO) is often used when there are high observing costs, because BO theoretically guarantees an upper bound on regret. However, computational costs increase exponentially with the number of parameters to be optimized, decreasing search efficiency. We propose a BO that limits the search region to lower dimensions and utilizes local Gaussian process regression (LGPR) to scale the BO to higher dimensions. LGPR treats the low-dimensional search region as "local," improving prediction accuracies there. The LGPR model is trained on a local subset of data specific to that region. This improves prediction accuracy and search efficiency and reduces the time complexity of matrix inversion in the Gaussian process regression. In evaluations with 20D Ackley and Rosenbrock functions, search efficiencies are equal to or higher than those of the compared methods, improved by about 69% and 40% from the case without LGPR. We apply our method to an automatic design task for a power semiconductor device. We successfully reduce the specific on-resistance to 25% better than a conventional method and 3.4% better than without LGPR.  ( 3 min )
    Reverse Diffusion Monte Carlo
    arXiv:2307.02037v3 Announce Type: replace Abstract: We propose a Monte Carlo sampler from the reverse diffusion process. Unlike the practice of diffusion models, where the intermediary updates -- the score functions -- are learned with a neural network, we transform the score matching problem into a mean estimation one. By estimating the means of the regularized posterior distributions, we derive a novel Monte Carlo sampling algorithm called reverse diffusion Monte Carlo (rdMC), which is distinct from the Markov chain Monte Carlo (MCMC) methods. We determine the sample size from the error tolerance and the properties of the posterior distribution to yield an algorithm that can approximately sample the target distribution with any desired accuracy. Additionally, we demonstrate and prove under suitable conditions that sampling with rdMC can be significantly faster than that with MCMC. For multi-modal target distributions such as those in Gaussian mixture models, rdMC greatly improves over the Langevin-style MCMC sampling methods both theoretically and in practice. The proposed rdMC method offers a new perspective and solution beyond classical MCMC algorithms for the challenging complex distributions.  ( 3 min )
    Efficient geometric Markov chain Monte Carlo for nonlinear Bayesian inversion enabled by derivative-informed neural operators
    arXiv:2403.08220v1 Announce Type: cross Abstract: We propose an operator learning approach to accelerate geometric Markov chain Monte Carlo (MCMC) for solving infinite-dimensional nonlinear Bayesian inverse problems. While geometric MCMC employs high-quality proposals that adapt to posterior local geometry, it requires computing local gradient and Hessian information of the log-likelihood, incurring a high cost when the parameter-to-observable (PtO) map is defined through expensive model simulations. We consider a delayed-acceptance geometric MCMC method driven by a neural operator surrogate of the PtO map, where the proposal is designed to exploit fast surrogate approximations of the log-likelihood and, simultaneously, its gradient and Hessian. To achieve a substantial speedup, the surrogate needs to be accurate in predicting both the observable and its parametric derivative (the derivative of the observable with respect to the parameter). Training such a surrogate via conventional operator learning using input--output samples often demands a prohibitively large number of model simulations. In this work, we present an extension of derivative-informed operator learning [O'Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] using input--output--derivative training samples. Such a learning method leads to derivative-informed neural operator (DINO) surrogates that accurately predict the observable and its parametric derivative at a significantly lower training cost than the conventional method. Cost and error analysis for reduced basis DINO surrogates are provided. Numerical studies on PDE-constrained Bayesian inversion demonstrate that DINO-driven MCMC generates effective posterior samples 3--9 times faster than geometric MCMC and 60--97 times faster than prior geometry-based MCMC. Furthermore, the training cost of DINO surrogates breaks even after collecting merely 10--25 effective posterior samples compared to geometric MCMC.  ( 3 min )
    Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations
    arXiv:2403.08121v1 Announce Type: cross Abstract: This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks, starting with small initializations. The present work considers neural networks that are assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two. This paper demonstrates that for sufficiently small initializations, during the early stages of training, the weights of the neural network remain small in norm and approximately converge in direction along the Karush-Kuhn-Tucker (KKT) points of the neural correlation function introduced in [1]. Additionally, for square loss and under a separability assumption on the weights of neural networks, a similar directional convergence of gradient flow dynamics is shown near certain saddle points of the loss function.  ( 2 min )
    Characterising harmful data sources when constructing multi-fidelity surrogate models
    arXiv:2403.08118v1 Announce Type: cross Abstract: Surrogate modelling techniques have seen growing attention in recent years when applied to both modelling and optimisation of industrial design problems. These techniques are highly relevant when assessing the performance of a particular design carries a high cost, as the overall cost can be mitigated via the construction of a model to be queried in lieu of the available high-cost source. The construction of these models can sometimes employ other sources of information which are both cheaper and less accurate. The existence of these sources however poses the question of which sources should be used when constructing a model. Recent studies have attempted to characterise harmful data sources to guide practitioners in choosing when to ignore a certain source. These studies have done so in a synthetic setting, characterising sources using a large amount of data that is not available in practice. Some of these studies have also been shown to potentially suffer from bias in the benchmarks used in the analysis. In this study, we present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model. We employ recently developed benchmark filtering techniques to conduct a bias-free assessment, providing objectively varied benchmark suites of different sizes for future research. Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used and use this analysis to provide guidelines that can be used in an applied industrial setting.  ( 3 min )
    Sketching the Heat Kernel: Using Gaussian Processes to Embed Data
    arXiv:2403.07929v1 Announce Type: cross Abstract: This paper introduces a novel, non-deterministic method for embedding data in low-dimensional Euclidean space based on computing realizations of a Gaussian process depending on the geometry of the data. This type of embedding first appeared in (Adler et al, 2018) as a theoretical model for a generic manifold in high dimensions. In particular, we take the covariance function of the Gaussian process to be the heat kernel, and computing the embedding amounts to sketching a matrix representing the heat kernel. The Karhunen-Lo\`eve expansion reveals that the straight-line distances in the embedding approximate the diffusion distance in a probabilistic sense, avoiding the need for sharp cutoffs and maintaining some of the smaller-scale structure. Our method demonstrates further advantage in its robustness to outliers. We justify the approach with both theory and experiments.  ( 2 min )
    Unsupervised Learning of Hybrid Latent Dynamics: A Learn-to-Identify Framework
    arXiv:2403.08194v1 Announce Type: cross Abstract: Modern applications increasingly require unsupervised learning of latent dynamics from high-dimensional time-series. This presents a significant challenge of identifiability: many abstract latent representations may reconstruct observations, yet do they guarantee an adequate identification of the governing dynamics? This paper investigates this challenge from two angles: the use of physics inductive bias specific to the data being modeled, and a learn-to-identify strategy that separates forecasting objectives from the data used for the identification. We combine these two strategies in a novel framework for unsupervised meta-learning of hybrid latent dynamics (Meta-HyLaD) with: 1) a latent dynamic function that hybridize known mathematical expressions of prior physics with neural functions describing its unknown errors, and 2) a meta-learning formulation to learn to separately identify both components of the hybrid dynamics. Through extensive experiments on five physics and one biomedical systems, we provide strong evidence for the benefits of Meta-HyLaD to integrate rich prior knowledge while identifying their gap to observed data.  ( 2 min )
    Efficient Combinatorial Optimization via Heat Diffusion
    arXiv:2403.08757v1 Announce Type: new Abstract: Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature.The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency for searching the global optimal. To overcome this challenge, diverging from conventional efforts of expanding the solver's search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems. The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimizations. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization.  ( 2 min )
    Multifidelity linear regression for scientific machine learning from scarce data
    arXiv:2403.08627v1 Announce Type: new Abstract: Machine learning (ML) methods, which fit to data the parameters of a given parameterized model class, have garnered significant interest as potential methods for learning surrogate models for complex engineering systems for which traditional simulation is expensive. However, in many scientific and engineering settings, generating high-fidelity data on which to train ML models is expensive, and the available budget for generating training data is limited. ML models trained on the resulting scarce high-fidelity data have high variance and are sensitive to vagaries of the training data set. We propose a new multifidelity training approach for scientific machine learning that exploits the scientific context where data of varying fidelities and costs are available; for example high-fidelity data may be generated by an expensive fully resolved physics simulation whereas lower-fidelity data may arise from a cheaper model based on simplifying assumptions. We use the multifidelity data to define new multifidelity Monte Carlo estimators for the unknown parameters of linear regression models, and provide theoretical analyses that guarantee the approach's accuracy and improved robustness to small training budgets. Numerical results verify the theoretical analysis and demonstrate that multifidelity learned models trained on scarce high-fidelity data and additional low-fidelity data achieve order-of-magnitude lower model variance than standard models trained on only high-fidelity data of comparable cost. This illustrates that in the scarce data regime, our multifidelity training strategy yields models with lower expected error than standard training approaches.  ( 3 min )
    Mean-Field Microcanonical Gradient Descent
    arXiv:2403.08362v1 Announce Type: new Abstract: Microcanonical gradient descent is a sampling procedure for energy-based models allowing for efficient sampling of distributions in high dimension. It works by transporting samples from a high-entropy distribution, such as Gaussian white noise, to a low-energy region using gradient descent. We put this model in the framework of normalizing flows, showing how it can often overfit by losing an unnecessary amount of entropy in the descent. As a remedy, we propose a mean-field microcanonical gradient descent that samples several weakly coupled data points simultaneously, allowing for better control of the entropy loss while paying little in terms of likelihood fit. We study these models in the context of financial time series, illustrating the improvements on both synthetic and real data.  ( 2 min )
    Neural reproducing kernel Banach spaces and representer theorems for deep networks
    arXiv:2403.08750v1 Announce Type: new Abstract: Studying the function spaces defined by neural networks helps to understand the corresponding learning models and their inductive bias. While in some limits neural networks correspond to function spaces that are reproducing kernel Hilbert spaces, these regimes do not capture the properties of the networks used in practice. In contrast, in this paper we show that deep neural networks define suitable reproducing kernel Banach spaces. These spaces are equipped with norms that enforce a form of sparsity, enabling them to adapt to potential latent structures within the input data and their representations. In particular, leveraging the theory of reproducing kernel Banach spaces, combined with variational results, we derive representer theorems that justify the finite architectures commonly employed in applications. Our study extends analogous results for shallow networks and can be seen as a step towards considering more practically plausible neural architectures.  ( 2 min )
    Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime
    arXiv:2403.08160v1 Announce Type: new Abstract: Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performance of these models. How does model complexity and generalization depend on the number of parameters $p$? How should we choose $p$ relative to the sample size $n$ to achieve optimal test error? In this paper, we investigate the example of random feature ridge regression (RFRR). This model can be seen either as a finite-rank approximation to kernel ridge regression (KRR), or as a simplified model for neural networks trained in the so-called lazy regime. We consider covariates uniformly distributed on the $d$-dimensional sphere and compute sharp asymptotics for the RFRR test error in the high-dimensional polynomial scaling, where $p,n,d \to \infty$ while $p/ d^{\kappa_1}$ and $n / d^{\kappa_2}$ stay constant, for all $\kappa_1 , \kappa_2 \in \mathbb{R}_{>0}$. These asymptotics precisely characterize the impact of the number of random features and regularization parameter on the test performance. In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power. For $n = o(p)$, the sample size $n$ is the bottleneck and RFRR achieves the same performance as KRR (which is equivalent to taking $p = \infty$). On the other hand, if $p = o(n)$, the number of random features $p$ is the limiting factor and RFRR test error matches the approximation error of the random feature model class (akin to taking $n = \infty$). Finally, a double descent appears at $n= p$, a phenomenon that was previously only characterized in the linear scaling $\kappa_1 = \kappa_2 = 1$.  ( 3 min )

  • Open

    [D] GPT-4 fails in algorithmic puzzles
    Abstract: We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate both visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, search, etc., aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculations. It ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT4V and Gemini exhibit limited performance in puzzle-solving tasks. We find that their performance is near random in a multi-choice question-answering setup for a significant number of puzzles. The findings emphasize the challenges of integrating visual, language, and algorithmic knowledge for solving complex reasoning problems. https://github.com/declare-lab/LLM-PuzzleTest submitted by /u/sgpfc [link] [comments]
    [D] what are the best substitute of medium and tds ?
    Hello , I'm a big fan and practitioner of machine learning and data science, I learned a lot from towards data science but lately felt like the quality decreased and can't find good sources to have more technical knowledge or end to end advanced projects in deep or machine learning , any suggestions ? thank you submitted by /u/WASSIDI [link] [comments]
    [D] __getitem__(index1,index2)
    I want to load two training examples simultaneously, is there a way to allow getitem to recieve two indices as input ? submitted by /u/blooming17 [link] [comments]
    Pytorch Lightning ThroughputMonitor [P]
    Greeting all, I have this PytorchLightning code, I want to add ThroughputMonitor but I couldn't find how. Can you advise how to add it? def main(): model = CIFAR10Model() logger = TensorBoardLogger("logs/tb_logs", name="resnet18_ciphar10_lightning") profiler = PyTorchProfiler(dirpath="logs/profiler_logs", filename="perf-logs") trainer = Trainer( max_epochs=args.epochs, accelerator="cpu", devices=16, strategy="ddp", logger=logger, enable_progress_bar=True, profiler=profiler, num_nodes=3, log_every_n_steps=1, callbacks=[DeviceStatsMonitor(), EarlyStopping(...)] ) ​ submitted by /u/POC-545 [link] [comments]
    [D] Help to find recent ML article/post - Space of useful language between to LLMs dialoguing each other
    The other day I was on Twitter/X and came across this chart representing the space of useful generating compared to all possible language when two LLMs converse with each other. I believe it might have been shared by Yann Lecun. I need some help finding this post, if you find it either DM or reply with the link. Thanks Below is what I remember the chart looking like. https://preview.redd.it/lk1laqx1xcoc1.png?width=1161&format=png&auto=webp&s=b8b01fcc22a81f4414b9bad98b9d78473de5eb7e submitted by /u/rovrav [link] [comments]
    [D] Memory inside sequence models
    I’m trying to gain a deeper understanding of the concept of memory in RNNs (and it’s variants) and in Transformers. Aside the architectural differences between the plain RNN, GRU and LSTM, memory is basically the input sequence being processed through some mathematical function and served in a sequential manner as input to the next time step (along input Xt) sort of as a prior representation of the data. From this technical perspective memory seems constrained to the length of the input sequence and the difficulty of the training process, in terms of both computation and vanishing and exploding gradients. Useful improvements have been made with the gating mechanism of LSTM and GRU to mitigate the aforementioned problems. I started reading and working on understanding the Transformer a…
    [P] Training Models on Encrypted Data
    Hello, We recently released a way to train machine learning models on encrypted data. This is all available with data science friendly APIs in concrete-ml.You can read the full blog post at https://www.zama.ai/post/training-predictive-models-on-encrypted-data-fully-homomorphic-encryption. For the implementation details: we extract the onnx graph of a PyTorch training session (for now a logistic regression) and convert it into a numpy function. This then gets turned into a FHE circuit with the help of concrete. This circuit has then the capability to be trained on encrypted data! If you want to try it out you can go to the example notebook we did. Would love to hear all your feedbacks and answer any question you may have! submitted by /u/strojax [link] [comments]
    [R] FID of conditionally generated images
    Hey all! I have a doubt regarding the calculation of FID of samples generated conditionally. Most of the research papers just present a single FID score even on class conditional generation, while they should be given as per class FIDs. Am I wrong here? Kindly let me know. Thanks submitted by /u/Independent-King-320 [link] [comments]
    [P] Training a model with a CSV-file
    I have a website where users can post comments, and comments can be flagged. In my backend, I fetch each comment to moderate, giving it a score from 0 to 10 (0 being the most ok and 10 being the least ok, based on bad words, sexual content, racism, ...). Scores > 7 are automatically deleted by the system. I want to automate this by training a model to do this for me automatically. Since a few weeks, I'm collecting all the comments I moderate in an CSV-file, looking like this: # moderation.csv score,text 0,this is an ok comment 10,this is a bad comment ... The only thing the model should be capable of is giving a score to a comment, based on what it learned from previous comments. I really have no idea where to start. I have knowledge of Node.js (JavaScript), but not python (but with the help of ChatGPT it is probably ok), so preferably Node.js. I know I need a lot of data (I now have only about 4000 rows in my CSV-file, so it's probably not enough to train a model properly), but I want to get started with it and learn something new. Can someone help me in the right direction? submitted by /u/leurs247 [link] [comments]
    [D] EU Act just passed.Opinion ?
    The EU Act just passed and will be in force latter this year what is your opinion on this matter? https://www.msn.com/en-us/money/other/europes-world-first-ai-rules-are-set-for-final-approval-heres-what-happens-next/ar-BB1jNGef submitted by /u/pythonprogrammer64 [link] [comments]
    [D] LSTM with synthetic data
    I have a simple Istm network for some sensor data processing, which does not perform well in training (cant reach more than 60% accuracy). To understand Istm's better, i threw away my sensor data and i am currently training the model with synthetic generated data (as in the following picture). basically i am generating superpositions of sinuses, with parameters that are chosen randomly. And as target i am using the integral of these inputs. The NN should basically learn how to integrate. I have tried many layer combinations (also cnn+lstm) but it did not have a tremendous effect. The model currently used is simply a Istm layer with dropout (64) + a dense layer. The input of one data sequence is (80, 1), also the output is (80, 1). It should basically act as a adaptive filter in the end- but it cannot even learn how to integrate (Acc<40%) Tried various loss functions, currently it is MAE. Also I am generating 10k of these data sequences. Does anyone have a hint on how to improve this? submitted by /u/gmgm0101 [link] [comments]
    [N] Ooops... OpenAI CTO Mira Murati on which data was used to train Sora
    Is it only me or there is a massive lawsuit coming? https://twitter.com/tsarnick/status/1768021821595726254 submitted by /u/pg860 [link] [comments]
    Machine Learning in Materials Science [R]
    Hi guys, I'm currently familiarising myself with the topic of machine learning in Materials Science. I would like to detect and analyse geometric errors in images of material tests and write a scientific paper about it. Now I realise that the literature research is difficult. This is probably because I haven't yet found the right search engines and the corresponding bubble (if there is a bubble). I'm hoping to find this bubble here on Reddit. I would be grateful for any hints and tips. Does anyone have any experience in this area or know of any helpful literature? Many thanks in advance! submitted by /u/hungry_cowboy [link] [comments]
    [D] Supervised Contrastive Learning Data Loading
    Hello, I want to use contrastive learning for sequential data, I have a labeled dataset and wanted to ask if you have any ideas about how to set a custom dataloader that ensures balanced loading of positive-positive negative'negative and positive-negative examples submitted by /u/blooming17 [link] [comments]
    TryOnDiffusion: A Tale of Two UNets - Unofficial PyTorch Implementation [R] [P]
    Hello, I recently released an implementation of Google's TryOnDiffusion paper. I had limited resources to train it but I think I experimented with it enough to verify it is mostly correct (Experiment setup is detailed in the README) The code is MIT license, so completely open-source. Link - https://github.com/fashn-AI/tryondiffusion I hope it can help someone here. All the best, submitted by /u/JYP_Scouter [link] [comments]
    [D][R] Is there a general mathematical relationship between denoising autoencoder and a low-pass frequency filter?
    Correct me if I am wrong, my understanding is that when a denoising autoencoder is trained to remove noise from input data, it, in a sense, acts as a filter that removes high-frequency noise from the input data, similar to a low-pass filter. By reconstructing clean data from noisy input data, the denoising autoencoder effectively smooths the signal and reduces noise, similar to the behavior of a low-pass filter. My question is, can I express the trained neural network parameters as some form of a frequency filter in the Fourier space? In addition, I am also interested to learn, how much low frequency noise (introduce low frequency artifacts) it may add in the data after running it through the trained autoencoder? submitted by /u/bahauddin_onar [link] [comments]
    [R] Chronos: Learning the Language of Time Series
    Paper: https://arxiv.org/abs/2403.07815 Code: https://github.com/amazon-science/chronos-forecasting Model weights: https://huggingface.co/collections/amazon/chronos-models-65f1791d630a8d57cb718444 Abstract: We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines. ​ https://preview.redd.it/7fxmgjnuw9oc1.png?width=1818&format=png&auto=webp&s=8d9e0b7b462597a3b7c176c133e7a2843debfb7c submitted by /u/shchur [link] [comments]
    [D] RAG hyperparameter SOTA
    I was searching for quite a while now already to better understand what are good hyperparameters for RAG (e.g. chunk size), but I cannot find any papers on google scholar. Do you have any papers in this direction available? submitted by /u/ComplexIt [link] [comments]
    Do LLM's get "lost in the middle" during summarization as well? [D]
    The paper "Lost in the Middle: How Language Models Use Long Contexts" basically talks about how LLM's struggle with the context "in the middle" when they are given a long context, and that is tested in the paper in the usecase of RAG. I was curious if LLM's would display the same characteristic in terms of a summarization task? Do we have any insights on that? Would it be fair to assume that LLM's would showcase the exact same characteristics in summarization as well, and they would include less information from the middle of a large chunk of data into the final summary? Curious to get your thoughts on this, Here's the original paper: https://arxiv.org/abs/2307.03172 submitted by /u/daxow [link] [comments]
    [D] PyTorch vs JAX for Robotics Hardware
    Does utilizing JAX instead of PyTorch offer any substantial benefits in robotics applications, such as performing inference tasks on a Jetson Orin for controlling a mobile robot equipped with camera and lidar sensors? submitted by /u/anointedninja [link] [comments]
  • Open

    Is representation learning worth it for smaller networks
    I read a lot of literature about representation learning as pre-training for the actual RL task. I am currently dealing with a sequential sensor data as input. So a lot of the data is redundant and noisy. The agent therefore needs to learn semantic features from the raw input timeseries first. Since the gradient signal from the reward in RL is so weak in comparison to unsupervised learning procedure I thought it could be worthwhile doing unsupervised pre-training for the feature encoder aka representation learning. Now almost all the literature is dealing with huge neural networks in comparison and huge datasets. I am dealing with about 200k-1M parameters and about 1M samples available for pre-training. My question would be: Is it even worthwhile dealing with pre-training when the ANN is relatively small? My RL training time is currently around 60h and I am hoping to cut that training time down significantly with pre-training. submitted by /u/flxh13 [link] [comments]
    Successful examples of transfer learning?
    Does anyone know if there have been any papers discussing/successfully applying some sort of transfer learning from one RL task to a related one? For example, if I had a minigrid world that an agent is trained in navigation for, and then move it to a similar minigrid but now with pushing blocks into target locations. Or has this kind of thing been mostly done through multi-task learning? submitted by /u/Impallion [link] [comments]
    "Why the Law of Effect will not Go Away", Dennett 1974 (the evolution of model-based RL)
    submitted by /u/gwern [link] [comments]
    Summer schools for 2024
    Hello, I am looking for some good RL summer schools in 2024, but I have found it a bit overwhelming with the different possibilities. Is there anybody here that has any experience/recommendations? submitted by /u/IAmNotMarcus [link] [comments]
    Any body wants to buy my coursea academic account.
    You can enroll any courses for free. I want to sell it at Rs9000 INR lifetime submitted by /u/Efficient_Ambition34 [link] [comments]
  • Open

    I made a plugin that adds an army of AI research agents to Google Sheets
    submitted by /u/TernaryJimbo [link] [comments]
    CBP wants to use AI to scan for fentanyl at the border
    submitted by /u/TechyGyan [link] [comments]
    When should we expect improvements to the home assistant (Google Home/Alexa) experience?
    I am not well versed in AI, but I've played around with Chat GPT and have a Google Home and I am blown away by how helpful Chat GPT is and how it's able to parse through random/poorly written questions and produce answers and be able to answer follow-up questions. Most smart home devices can't seem to do this. 9/10 times my Google Home answers questions with "I don't know" and just shows a Google search results page. I see Google rolling out Gemini and was curious if there's any news about when AI capabilities will be implemented into the smart home ecosystem to make them more useful products? submitted by /u/theripped [link] [comments]
    Crafting an AI-First Tech Stack for a New SaaS Venture: Your Picks?
    Hey r/artificial, You're a founder at SaaS startup with a keen eye on embedding AI across the organization to drive efficiency and innovation. Your challenge is to construct a tech stack that not only supports the usual business operations—project management, security, HR, payroll, CRM, marketing, analytics, cloud services, compliance, and collaboration tools—but also maximizes the use of AI in every facet possible. You're weighing the benefits of using major platforms like Microsoft and Google against incorporating specialized tools for specific needs. How would you leverage AI to its fullest potential within your tech stack? Which tools/vendors would be your go-to choices? Looking forward to hearing your AI-centric strategies and tool recommendations! submitted by /u/moosepiss [link] [comments]
    Elon Musk and Larry Ellison Partner on AI for Super-Smart Agriculture
    submitted by /u/vinaylovestotravel [link] [comments]
    Open AI's CTO doesn't know where they sourced data from?
    https://x.com/SmokeAwayyy/status/1768141571298632137?s=20 I saw this video now, everyone's saying she ofc knows, she's just hiding due to legal trouble they might get into. But interestingly, she could have said they sourced data from Shutterstock coz Open AI literally has a public partnership w them. What are y'alls view on this? (Also, apologies if it's already posted) submitted by /u/No-Lobster-8045 [link] [comments]
    Amazon now lets sellers create listings through a URL by using AI | TechCrunch
    submitted by /u/clonefitreal [link] [comments]
    One-Minute Daily AI News 3/13/2024
    Meet Devin, the world’s first AI software engineer that debugs, writes and deploys code.[1] Microsoft will make its generative artificial intelligence (GenAI)-powered security tool, Copilot for Security, generally available worldwide beginning April 1.[2] Amazon will let sellers paste a link so AI can make a product page.[3] AI image-generator Midjourney blocks images of Biden and Trump as election looms.[4] Anthropic just released a Claude 3 AI prompt library.[5] Sources: [1] https://www.cnbctv18.com/technology/meet-devin-the-world-first-ai-software-engineer-that-debugs-writes-and-deploys-code-19259531.htm [2] https://www.pymnts.com/artificial-intelligence-2/2024/microsoft-to-make-genai-powered-security-tool-generally-available/ [3] https://www.theverge.com/2024/3/13/24099685/amazon-generative-ai-dtc-link-sellers [4] https://apnews.com/article/midjourney-ai-imagegenerator-biden-trump-deepfakes-bc6c254ddb20e36c5e750b4570889ce1 [5] https://docs.anthropic.com/claude/prompt-library submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    Self-Organizing Map neighborhood implementation in PyTorch
    I am trying to implement a Self-Organizing Map where for a given input sample, the best matching unit/winning unit is chosen based on (say) L2-norm distance between the SOM and the input. The winning unit/BMU (som[x, y]) has the smallest L2 distance from the given input (z): # Input batch: batch-size = 512, input-dim = 84- z = torch.randn(512, 84) # SOM shape: (height, width, input-dim)- som = torch.randn(40, 40, 84) print(f"BMU row, col shapes; row = {row.shape} & col = {col.shape}") # BMU row, col shapes; row = torch.Size([512]) & col = torch.Size([512]) For clarity, for the first input sample in the batch "z[0]", the winning unit is "som[row[0], col[0]]"- z[0].shape, som[row[0], col[0]].shape # (torch.Size([84]), torch.Size([84])) torch.norm((z[0] - som[row[0], col[0]])) is the smallest L2 distance between z[0] and all other som units except row[0] and col[0]. # Define initial neighborhood radius and learning rate- neighb_rad = torch.tensor(2.0) lr = 0.5 # To update weights for the first input "z[0]" and its corresponding BMU "som[row[0], col[0]]"- for r in range(som.shape[0]): for c in range(som.shape[1]): neigh_dist = torch.exp(-torch.norm(input = (som[r, c] - som[row[0], col[0]])) / (2.0 * torch.pow(neighb_rad, 2))) som[r, c] = som[r, c] + (lr * neigh_dist * (z[0] - som[r, c])) How can I implement the code for: updating weights for all units around each BMU without the 2 for loops (and) do it for all of the inputs "z" (here, z has 512 samples) submitted by /u/grid_world [link] [comments]
    Black and white Matrix identification using CNN
    Hi everyone. I have 30 matrices 12x8 pixels, black and white (W = 0, B = 1), and for every matrix I know its output, which is unique (so 30 outputs). I'd like to create a sort of lookup table for these matrices. Sure, I could write a lot of if A[][] =.. but I thought it would be more elegant to train a simple CNN that, given the 1/0 of the matrices as input, returns the category they belong to. I'm new to this world, and I was wondering if it's possible to create a CNN given only this dataset composed of 30 strings of 1 and 0, or if you know any other idea that would be more efficient than writing hundreds of ifs to generate this lookup table. I've already watched StatQuest's video about CNN to have an idea of what they do and it looks like they would be perfect for my job. Any suggestion is appreciated, thanks! EDIT: I put an example to show how I'd like it to work: https://preview.redd.it/zsoceoz5taoc1.jpg?width=967&format=pjpg&auto=webp&s=6e479502954fc8d4541aef037ffe4cbf867fa374 submitted by /u/LightFounder [link] [comments]
    Do you want to know the time, it will take your Neural Network to get fully trained?
    View Poll submitted by /u/Red_Pudding_pie [link] [comments]
    The Era of 1-bit LLMs - Paper Explained
    submitted by /u/Personal-Trainer-541 [link] [comments]
  • Open

    The journey of PGA TOUR’s generative AI virtual assistant, from concept to development to prototype
    This is a guest post co-written with Scott Gutterman from the PGA TOUR. Generative artificial intelligence (generative AI) has enabled new possibilities for building intelligent systems. Recent improvements in Generative AI based large language models (LLMs) have enabled their use in a variety of applications surrounding information retrieval. Given the data sources, LLMs provided tools […]  ( 11 min )
    Enhance code review and approval efficiency with generative AI using Amazon Bedrock
    In the world of software development, code review and approval are important processes for ensuring the quality, security, and functionality of the software being developed. However, managers tasked with overseeing these critical processes often face numerous challenges, such as the following: Lack of technical expertise – Managers may not have an in-depth technical understanding of […]  ( 9 min )
    Best practices to build generative AI applications on AWS
    Generative AI applications driven by foundational models (FMs) are enabling organizations with significant business value in customer experience, productivity, process optimization, and innovations. However, adoption of these FMs involves addressing some key challenges, including quality output, data privacy, security, integration with organization data, cost, and skills to deliver. In this post, we explore different approaches […]  ( 19 min )
    Gemma is now available in Amazon SageMaker JumpStart
    Today, we’re excited to announce that the Gemma model is now available for customers using Amazon SageMaker JumpStart. Gemma is a family of language models based on Google’s Gemini models, trained on up to 6 trillion tokens of text. The Gemma family consists of two sizes: a 7 billion parameter model and a 2 billion parameter model. Now, […]  ( 20 min )
  • Open

    Cappy: Outperforming and boosting large multi-task language models with a small scorer
    Posted by Yun Zhu and Lijuan Liu, Software Engineers, Google Research Large language model (LLM) advancements have led to a new paradigm that unifies various natural language processing (NLP) tasks within an instruction-following framework. This paradigm is exemplified by recent multi-task LLMs, such as T0, FLAN, and OPT-IML. First, multi-task data is gathered with each task following a task-specific template, where each labeled example is converted into an instruction (e.g., "Put the concepts together to form a sentence: ski, mountain, skier”) paired with a corresponding response (e.g., "Skier skis down the mountain"). These instruction-response pairs are used to train the LLM, resulting in a conditional generation model that takes an instruction as input and generates a response. Mo…  ( 93 min )
  • Open

    When do you want to Receive our Articles?
    Over the past few months, we have published a lot of high-quality content on AI, LLMs, RAG, Knowledge Bases, and our experiments with…  ( 3 min )
  • Open

    Reach for the Stars: Eight Out-of-This-World Games Join the Cloud
    The stars align this GFN Thursday as more top titles from Ubisoft and Square Enix join the cloud. Star Wars Outlaws will be coming to the GeForce NOW library at launch later this year, while STAR OCEAN THE SECOND STORY R and PARANORMASIGHT: The Seven Mysteries of Honjo are part of eight new titles joining Read Article  ( 6 min )
    NVIDIA GTC 2024: A Glimpse Into the Future of AI With Jensen Huang
    NVIDIA’s founder and CEO will discuss the future of AI at one of the world’s premier technology conferences.  ( 5 min )
  • Open

    Accelerating Archimedes
    One way to approximate π is to find the areas of polygons inscribed inside a circle and polygons circumscribed outside the circle. The approximation improves as the number of sides in the polygons increases. This idea goes back at least as far as Archimedes (287–212 BC). Maybe you’ve tried this. It’s a lot of work. […] Accelerating Archimedes first appeared on John D. Cook.  ( 5 min )
  • Open

    Customizing Graph Neural Networks using Path Reweighting
    arXiv:2106.10866v3 Announce Type: replace Abstract: Graph Neural Networks (GNNs) have been extensively used for mining graph-structured data with impressive performance. However, because these traditional GNNs do not distinguish among various downstream tasks, embeddings embedded by them are not always effective. Intuitively, paths in a graph imply different semantics for different downstream tasks. Inspired by this, we design a novel GNN solution, namely Customized Graph Neural Network with Path Reweighting (CustomGNN for short). Specifically, the proposed CustomGNN can automatically learn the high-level semantics for specific downstream tasks to highlight semantically relevant paths as well to filter out task-irrelevant noises in a graph. Furthermore, we empirically analyze the semantics learned by CustomGNN and demonstrate its ability to avoid the three inherent problems in traditional GNNs, i.e., over-smoothing, poor robustness, and overfitting. In experiments with the node classification task, CustomGNN achieves state-of-the-art accuracies on three standard graph datasets and four large graph datasets. The source code of the proposed CustomGNN is available at \url{https://github.com/cjpcool/CustomGNN}.  ( 3 min )
    Quantum Support Vector Machine for Prostate Cancer Detection: A Performance Analysis
    arXiv:2403.07856v1 Announce Type: new Abstract: This study addresses the urgent need for improved prostate cancer detection methods by harnessing the power of advanced technological solutions. We introduce the application of Quantum Support Vector Machine (QSVM) to this critical healthcare challenge, showcasing an enhancement in diagnostic performance over the classical Support Vector Machine (SVM) approach. Our study not only outlines the remarkable improvements in diagnostic performance made by QSVM over the classic SVM technique, but it delves into the advancements brought about by the quantum feature map architecture, which has been carefully identified and evaluated, ensuring it aligns seamlessly with the unique characteristics of our prostate cancer dataset. This architecture succeded in creating a distinct feature space, enabling the detection of complex, non-linear patterns in the data. The findings reveal not only a comparable accuracy with classical SVM ($92\%$) but also a $7.14\%$ increase in sensitivity and a notably high F1-Score ($93.33\%$). This study's important combination of quantum computing in medical diagnostics marks a pivotal step forward in cancer detection, offering promising implications for the future of healthcare technology.  ( 2 min )
    Tractable Joint Prediction and Planning over Discrete Behavior Modes for Urban Driving
    arXiv:2403.07232v1 Announce Type: cross Abstract: Significant progress has been made in training multimodal trajectory forecasting models for autonomous driving. However, effectively integrating these models with downstream planners and model-based control approaches is still an open problem. Although these models have conventionally been evaluated for open-loop prediction, we show that they can be used to parameterize autoregressive closed-loop models without retraining. We consider recent trajectory prediction approaches which leverage learned anchor embeddings to predict multiple trajectories, finding that these anchor embeddings can parameterize discrete and distinct modes representing high-level driving behaviors. We propose to perform fully reactive closed-loop planning over these discrete latent modes, allowing us to tractably model the causal interactions between agents at each step. We validate our approach on a suite of more dynamic merging scenarios, finding that our approach avoids the $\textit{frozen robot problem}$ which is pervasive in conventional planners. Our approach also outperforms the previous state-of-the-art in CARLA on challenging dense traffic scenarios when evaluated at realistic speeds.  ( 2 min )
    Unveiling the Pitfalls of Knowledge Editing for Large Language Models
    arXiv:2310.02129v3 Announce Type: replace-cross Abstract: As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code and data are available at https://github.com/zjunlp/PitfallsKnowledgeEditing.  ( 2 min )
    Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics
    arXiv:2310.07990v2 Announce Type: replace-cross Abstract: Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenetic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolites derived burden scores, PGS and LD-pruned SNPs, the proposed methods achieved R^2-scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.  ( 3 min )
    P-NOC: adversarial training of CAM generating networks for robust weakly supervised semantic segmentation priors
    arXiv:2305.12522v3 Announce Type: replace-cross Abstract: Weakly Supervised Semantic Segmentation (WSSS) techniques explore individual regularization strategies to refine Class Activation Maps (CAMs). In this work, we first analyze complementary WSSS techniques in the literature, their segmentation properties, and the conditions in which they are most effective. Based on these findings, we devise two new techniques: P-NOC and CCAM-H. In the first, we promote the conjoint training of two adversarial CAM generating networks: the generator, which progressively learns to erase regions containing class-specific features, and a discriminator, which is refined to gradually shift its attention to new class discriminant features. In the latter, we employ the high quality pseudo-segmentation priors produced by P-NOC to guide the learning to saliency information in a weakly supervised fashion. Finally, we employ both pseudo-segmentation priors and pseudo-saliency proposals in the random walk procedure, resulting in higher quality pseudo-semantic segmentation masks, and competitive results with the state of the art.  ( 2 min )
    RudolfV: A Foundation Model by Pathologists for Pathologists
    arXiv:2401.04079v3 Announce Type: replace-cross Abstract: Histopathology plays a central role in clinical medicine and biomedical research. While artificial intelligence shows promising results on many pathological tasks, generalization and dealing with rare diseases, where training data is scarce, remains a challenge. Distilling knowledge from unlabelled data into a foundation model before learning from, potentially limited, labelled data provides a viable path to address these challenges. In this work, we extend the state of the art of foundation models for digital pathology whole slide images by semi-automated data curation and incorporating pathologist domain knowledge. Specifically, we combine computational and pathologist domain knowledge (1) to curate a diverse dataset of 133k slides corresponding to 1.2 billion image patches covering data from different fixation, staining, and scanning protocols as well as data from different indications and labs across the EU and US, (2) for grouping semantically similar slides and tissue patches, and (3) to augment the input images during training. We evaluate the resulting model on a set of public and internal benchmarks and show that although our foundation model is trained with an order of magnitude less slides, it performs on par or better than competing models. We expect that scaling our approach to more data and larger models will further increase its performance and capacity to deal with increasingly complex real world tasks in diagnostics and biomedical research.  ( 3 min )
    Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs
    arXiv:2311.04417v2 Announce Type: replace-cross Abstract: The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks. This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.  ( 3 min )
    Quantum Inception Score
    arXiv:2311.12163v2 Announce Type: replace-cross Abstract: Motivated by the great success of classical generative models in machine learning, enthusiastic exploration of their quantum version has recently started. To depart on this journey, it is important to develop a relevant metric to evaluate the quality of quantum generative models; in the classical case, one such example is the inception score. In this paper, we propose the quantum inception score, which relates the quality to the Holevo information of the quantum channel that classifies a given dataset. We prove that, under this proposed measure, the quantum generative models provide better quality than their classical counterparts because of the presence of quantum coherence, characterized by the resource theory of asymmetry, and entanglement. Furthermore, we harness the quantum fluctuation theorem to characterize the physical limitation of the quality of quantum generative models. Finally, we apply the quantum inception score to assess the quality of the one-dimensional spin chain model as a quantum generative model, with the quantum convolutional neural network as a quantum classifier, for the phase classification problem in the quantum many-body physics.  ( 2 min )
    Mental Workload Estimation with Electroencephalogram Signals by Combining Multi-Space Deep Models
    arXiv:2308.02409v2 Announce Type: replace-cross Abstract: The human brain remains continuously active, whether an individual is working or at rest. Mental activity is a daily process, and if the brain becomes excessively active, known as overload, it can adversely affect human health. Recently, advancements in early prediction of mental health conditions have emerged, aiming to prevent serious consequences and enhance the overall quality of life. Consequently, the estimation of mental status has garnered significant attention from diverse researchers due to its potential benefits. While various signals are employed to assess mental state, the electroencephalogram, containing extensive information about the brain, is widely utilized by researchers. In this paper, we categorize mental workload into three states (low, middle, and high) and estimate a continuum of mental workload levels. Our method leverages information from multiple spatial dimensions to achieve optimal results in mental estimation. For the time domain approach, we employ Temporal Convolutional Networks. In the frequency domain, we introduce a novel architecture based on combining residual blocks, termed the Multi-Dimensional Residual Block. The integration of these two domains yields significant results compared to individual estimates in each domain. Our approach achieved a 74.98% accuracy in the three-class classification, surpassing the provided data results at 69.00%. Specially, our method demonstrates efficacy in estimating continuous levels, evidenced by a corresponding Concordance Correlation Coefficient (CCC) result of 0.629. The combination of time and frequency domain analysis in our approach highlights the exciting potential to improve healthcare applications in the future.  ( 3 min )
    Learning and Decision-Making with Data: Optimal Formulations and Phase Transitions
    arXiv:2109.06911v3 Announce Type: replace-cross Abstract: We study the problem of designing optimal learning and decision-making formulations when only historical data is available. Prior work typically commits to a particular class of data-driven formulation and subsequently tries to establish out-of-sample performance guarantees. We take here the opposite approach. We define first a sensible yard stick with which to measure the quality of any data-driven formulation and subsequently seek to find an optimal such formulation. Informally, any data-driven formulation can be seen to balance a measure of proximity of the estimated cost to the actual cost while guaranteeing a level of out-of-sample performance. Given an acceptable level of out-of-sample performance, we construct explicitly a data-driven formulation that is uniformly closer to the true cost than any other formulation enjoying the same out-of-sample performance. We show the existence of three distinct out-of-sample performance regimes (a superexponential regime, an exponential regime and a subexponential regime) between which the nature of the optimal data-driven formulation experiences a phase transition. The optimal data-driven formulations can be interpreted as a classically robust formulation in the superexponential regime, an entropic distributionally robust formulation in the exponential regime and finally a variance penalized formulation in the subexponential regime. This final observation unveils a surprising connection between these three, at first glance seemingly unrelated, data-driven formulations which until now remained hidden.  ( 3 min )
    One-Dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
    arXiv:2312.16145v2 Announce Type: replace-cross Abstract: The prevalent use of commercial and open-source diffusion models (DMs) for text-to-image generation prompts risk mitigation to prevent undesired behaviors. Existing concept erasing methods in academia are all based on full parameter or specification-based fine-tuning, from which we observe the following issues: 1) Generation alternation towards erosion: Parameter drift during target elimination causes alternations and potential deformations across all generations, even eroding other concepts at varying degrees, which is more evident with multi-concept erased; 2) Transfer inability & deployment inefficiency: Previous model-specific erasure impedes the flexible combination of concepts and the training-free transfer towards other models, resulting in linear cost growth as the deployment scenarios increase. To achieve non-invasive, precise, customizable, and transferable elimination, we ground our erasing framework on one-dimensional adapters to erase multiple concepts from most DMs at once across versatile erasing applications. The concept-SemiPermeable structure is injected as a Membrane (SPM) into any DM to learn targeted erasing, and meantime the alteration and erosion phenomenon is effectively mitigated via a novel Latent Anchoring fine-tuning strategy. Once obtained, SPMs can be flexibly combined and plug-and-play for other DMs without specific re-tuning, enabling timely and efficient adaptation to diverse scenarios. During generation, our Facilitated Transport mechanism dynamically regulates the permeability of each SPM to respond to different input prompts, further minimizing the impact on other concepts. Quantitative and qualitative results across ~40 concepts, 7 DMs and 4 erasing applications have demonstrated the superior erasing of SPM. Our code and pre-tuned SPMs are available on the project page https://lyumengyao.github.io/projects/spm.  ( 3 min )
    StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation
    arXiv:2212.11851v2 Announce Type: replace-cross Abstract: Diffusion models have shown a great ability at bridging the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when they are evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly as they require to run a neural network for each reverse diffusion step, whereas predictive approaches only require one pass. As diffusion models are generative approaches they may also produce vocalizing and breathing artifacts in adverse conditions. In comparison, in such difficult scenarios, predictive models typically do not produce such artifacts but tend to distort the target speech instead, thereby degrading the speech quality. In this work, we present a stochastic regeneration approach where an estimate given by a predictive model is provided as a guide for further diffusion. We show that the proposed approach uses the predictive model to remove the vocalizing and breathing artifacts while producing very high quality samples thanks to the diffusion model, even in adverse conditions. We further show that this approach enables to use lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus lifting the computational burden by an order of magnitude. Source code and audio examples are available online (https://uhh.de/inf-sp-storm).  ( 3 min )
    ACPO: AI-Enabled Compiler-Driven Program Optimization
    arXiv:2312.09982v2 Announce Type: replace-cross Abstract: The key to performance optimization of a program is to decide correctly when a certain transformation should be applied by a compiler. This is an ideal opportunity to apply machine-learning models to speed up the tuning process; while this realization has been around since the late 90s, only recent advancements in ML enabled a practical application of ML to compilers as an end-to-end framework. This paper presents ACPO: \textbf{\underline{A}}I-Enabled \textbf{\underline{C}}ompiler-driven \textbf{\underline{P}}rogram \textbf{\underline{O}}ptimization; a novel framework to provide LLVM with simple and comprehensive tools to benefit from employing ML models for different optimization passes. We first showcase the high-level view, class hierarchy, and functionalities of ACPO and subsequently, demonstrate a couple of use cases of ACPO by ML-enabling the Loop Unroll and Function Inlining passes and describe how ACPO can be leveraged to optimize other passes. Experimental results reveal that ACPO model for Loop Unroll is able to gain on average 4\% compared to LLVM's O3 optimization when deployed on Polybench. Furthermore, by adding the Inliner model as well, ACPO is able to provide up to 4.5\% and 2.4\% on Polybench and Cbench compared with LLVM's O3 optimization, respectively.  ( 2 min )
    Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks
    arXiv:2401.06187v2 Announce Type: replace Abstract: Machine unlearning has become a pivotal task to erase the influence of data from a trained model. It adheres to recent data regulation standards and enhances the privacy and security of machine learning applications. In this work, we present a new machine unlearning approach Scissorhands. Initially, Scissorhands identifies the most pertinent parameters in the given model relative to the forgetting data via connection sensitivity. By reinitializing the most influential top-k percent of these parameters, a trimmed model for erasing the influence of the forgetting data is obtained. Subsequently, Scissorhands fine-tunes the trimmed model with a gradient projection-based approach, seeking parameters that preserve information on the remaining data while discarding information related to the forgetting data. Our experimental results, conducted across image classification and image generation tasks, demonstrate that Scissorhands, showcases competitive performance when compared to existing methods.  ( 2 min )
    Evacuation Management Framework towards Smart City-wide Intelligent Emergency Interactive Response System
    arXiv:2403.07003v1 Announce Type: cross Abstract: A smart city solution toward future 6G network deployment allows small and medium sized enterprises (SMEs), industry, and government entities to connect with the infrastructures and play a crucial role in enhancing emergency preparedness with advanced sensors. The objective of this work is to propose a set of coordinated technological solutions to transform an existing emergency response system into an intelligent interactive system, thereby improving the public services and the quality of life for residents at home, on road, in hospitals, transport hubs, etc. In this context, we consider a city wide view from three different application scenes that are closely related to peoples daily life, to optimize the actions taken at relevant departments. Therefore, using artificial intelligence (AI) and machine learning (ML) techniques to enable the next generation connected vehicle experiences, we specifically focus on accidents happening in indoor households, urban roads, and at large public facilities. This smart interactive response system will benefit from advanced sensor fusion and AI by formulating a real time dynamic model.  ( 2 min )
    Secure Decentralized Learning with Blockchain
    arXiv:2310.07079v2 Announce Type: replace-cross Abstract: Federated Learning (FL) is a well-known paradigm of distributed machine learning on mobile and IoT devices, which preserves data privacy and optimizes communication efficiency. To avoid the single point of failure problem in FL, decentralized federated learning (DFL) has been proposed to use peer-to-peer communication for model aggregation, which has been considered an attractive solution for machine learning tasks on distributed personal devices. However, this process is vulnerable to attackers who share false models and data. If there exists a group of malicious clients, they might harm the performance of the model by carrying out a poisoning attack. In addition, in DFL, clients often lack the incentives to contribute their computing powers to do model training. In this paper, we proposed Blockchain-based Decentralized Federated Learning (BDFL), which leverages a blockchain for decentralized model verification and auditing. BDFL includes an auditor committee for model verification, an incentive mechanism to encourage the participation of clients, a reputation model to evaluate the trustworthiness of clients, and a protocol suite for dynamic network updates. Evaluation results show that, with the reputation mechanism, BDFL achieves fast model convergence and high accuracy on real datasets even if there exist 30\% malicious clients in the system.  ( 2 min )
    Theoretical Hardness and Tractability of POMDPs in RL with Partial Online State Information
    arXiv:2306.08762v3 Announce Type: replace Abstract: Partially observable Markov decision processes (POMDPs) have been widely applied in various real-world applications. However, existing theoretical results have shown that learning in POMDPs is intractable in the worst case, where the main challenge lies in the lack of latent state information. A key fundamental question here is: how much online state information (OSI) is sufficient to achieve tractability? In this paper, we establish a lower bound that reveals a surprising hardness result: unless we have full OSI, we need an exponentially scaling sample complexity to obtain an $\epsilon$-optimal policy solution for POMDPs. Nonetheless, inspired by the insights in our lower-bound design, we identify important tractable subclasses of POMDPs, even with only partial OSI. In particular, for two subclasses of POMDPs with partial OSI, we provide new algorithms that are proved to be near-optimal by establishing new regret upper and lower bounds. Both our algorithm design and regret analysis involve non-trivial developments for joint OSI query and action control.  ( 2 min )
    Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning
    arXiv:2211.07864v4 Announce Type: replace Abstract: Federated learning (FL) enables multiple clients to collaboratively train a global model without disclosing their data. Previous researches often require training the complete model parameters. However, the emergence of powerful pre-trained models makes it possible to achieve higher performance with fewer learnable parameters in FL. In this paper, we propose a federated adaptive prompt tuning algorithm, FedAPT, for multi-domain collaborative image classification with powerful foundation models, like CLIP. Compared with direct federated prompt tuning, our core idea is to adaptively unlock specific domain knowledge for each test sample in order to provide them with personalized prompts. To implement this idea, we design an adaptive prompt tuning module, which consists of a meta prompt, an adaptive network, and some keys. The server randomly generates a set of keys and assigns a unique key to each client. Then all clients cooperatively train the global adaptive network and meta prompt with the local datasets and the frozen keys. Ultimately, the global aggregation model can assign a personalized prompt to CLIP based on the domain features of each test sample. We perform extensive experiments on two multi-domain image classification datasets across two different settings -- supervised and unsupervised. The results show that FedAPT can achieve better performance with less than 10\% of the number of parameters of the fully trained model, and the global model can perform well in diverse client domains simultaneously. The source code is available at \url{https://github.com/leondada/FedAPT}.  ( 3 min )
    GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data
    arXiv:2309.17130v3 Announce Type: replace Abstract: Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose $\text{GRANDE}$, $\text{GRA}$die$\text{N}$t-Based $\text{D}$ecision Tree $\text{E}$nsembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which affords to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which is a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both, simple and complex relations, within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets. The method is available under: https://github.com/s-marton/GRANDE  ( 2 min )
    SEGNO: Generalizing Equivariant Graph Neural Networks with Physical Inductive Biases
    arXiv:2308.13212v2 Announce Type: replace Abstract: Graph Neural Networks (GNNs) with equivariant properties have emerged as powerful tools for modeling complex dynamics of multi-object physical systems. However, their generalization ability is limited by the inadequate consideration of physical inductive biases: (1) Existing studies overlook the continuity of transitions among system states, opting to employ several discrete transformation layers to learn the direct mapping between two adjacent states; (2) Most models only account for first-order velocity information, despite the fact that many physical systems are governed by second-order motion laws. To incorporate these inductive biases, we propose the Second-order Equivariant Graph Neural Ordinary Differential Equation (SEGNO). Specifically, we show how the second-order continuity can be incorporated into GNNs while maintaining the equivariant property. Furthermore, we offer theoretical insights into SEGNO, highlighting that it can learn a unique trajectory between adjacent states, which is crucial for model generalization. Additionally, we prove that the discrepancy between this learned trajectory of SEGNO and the true trajectory is bounded. Extensive experiments on complex dynamical systems including molecular dynamics and motion capture demonstrate that our model yields a significant improvement over the state-of-the-art baselines.  ( 2 min )
    TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences
    arXiv:2306.14114v2 Announce Type: replace Abstract: Learning Granger causality from event sequences is a challenging but essential task across various applications. Most existing methods rely on the assumption that event sequences are independent and identically distributed (i.i.d.). However, this i.i.d. assumption is often violated due to the inherent dependencies among the event sequences. Fortunately, in practice, we find these dependencies can be modeled by a topological network, suggesting a potential solution to the non-i.i.d. problem by introducing the prior topological network into Granger causal discovery. This observation prompts us to tackle two ensuing challenges: 1) how to model the event sequences while incorporating both the prior topological network and the latent Granger causal structure, and 2) how to learn the Granger causal structure. To this end, we devise a unified topological neural Poisson auto-regressive model with two processes. In the generation process, we employ a variant of the neural Poisson process to model the event sequences, considering influences from both the topological network and the Granger causal structure. In the inference process, we formulate an amortized inference algorithm to infer the latent Granger causal structure. We encapsulate these two processes within a unified likelihood function, providing an end-to-end framework for this task. Experiments on simulated and real-world data demonstrate the effectiveness of our approach.  ( 3 min )
    Human-Inspired Framework to Accelerate Reinforcement Learning
    arXiv:2303.08115v3 Announce Type: replace Abstract: Reinforcement learning (RL) is crucial for data science decision-making but suffers from sample inefficiency, particularly in real-world scenarios with costly physical interactions. This paper introduces a novel human-inspired framework to enhance RL algorithm sample efficiency. It achieves this by initially exposing the learning agent to simpler tasks that progressively increase in complexity, ultimately leading to the main task. This method requires no pre-training and involves learning simpler tasks for just one iteration. The resulting knowledge can facilitate various transfer learning approaches, such as value and policy transfer, without increasing computational complexity. It can be applied across different goals, environments, and RL algorithms, including value-based, policy-based, tabular, and deep RL methods. Experimental evaluations demonstrate the framework's effectiveness in enhancing sample efficiency, especially in challenging main tasks, demonstrated through both a simple Random Walk and more complex optimal control problems with constraints.  ( 2 min )
    Toward a Plug-and-Play Vision-Based Grasping Module for Robotics
    arXiv:2310.04349v2 Announce Type: replace-cross Abstract: Despite recent advancements in AI for robotics, grasping remains a partially solved challenge, hindered by the lack of benchmarks and reproducibility constraints. This paper introduces a vision-based grasping framework that can easily be transferred across multiple manipulators. Leveraging Quality-Diversity (QD) algorithms, the framework generates diverse repertoires of open-loop grasping trajectories, enhancing adaptability while maintaining a diversity of grasps. This framework addresses two main issues: the lack of an off-the-shelf vision module for detecting object pose and the generalization of QD trajectories to the whole robot operational space. The proposed solution combines multiple vision modules for 6DoF object detection and tracking while rigidly transforming QD-generated trajectories into the object frame. Experiments on a Franka Research 3 arm and a UR5 arm with a SIH Schunk hand demonstrate comparable performance when the real scene aligns with the simulation used for grasp generation. This work represents a significant stride toward building a reliable vision-based grasping module transferable to new platforms, while being adaptable to diverse scenarios without further training iterations.  ( 2 min )
    Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View
    arXiv:2401.11237v2 Announce Type: replace Abstract: Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which RL methods based on dynamic-programming differ from RL methods based on supervised-learning (SL). Yet, certain RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question for the problems of achieving a target goal state and achieving a target return value. Our main result is to show that the stitching property corresponds to a form of combinatorial generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalization is different from i.i.d. generalization. This connection between stitching and generalisation reveals why we should not expect SL-based RL methods to perform stitching, even in the limit of large datasets and models. Based on this analysis, we construct new datasets to explicitly test for this property, revealing that SL-based methods lack this stitching property and hence fail to perform combinatorial generalization. Nonetheless, the connection between stitching and combinatorial generalisation also suggests a simple remedy for improving generalisation in SL: data augmentation. We propose a temporal data augmentation and demonstrate that adding it to SL-based methods enables them to successfully complete tasks not seen together during training. On a high level, this connection illustrates the importance of combinatorial generalization for data efficiency in time-series data beyond tasks beyond RL, like audio, video, or text.  ( 3 min )
    Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
    arXiv:2312.08531v2 Announce Type: replace Abstract: In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all the existing works are either limited to compact domains or require almost surely bounded noises. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It still remains unclear whether the last-iterate convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain the last-iterate convergence under heavy-tailed noises.  ( 3 min )
    TRAM: Bridging Trust Regions and Sharpness Aware Minimization
    arXiv:2310.03646v2 Announce Type: replace Abstract: Sharpness-aware minimization (SAM) reports improving domain generalization by reducing the loss surface curvature in the parameter space. However, generalization during fine-tuning is often more dependent on the transferability of representations in the function space. Trust-region methods (TR) target this goal by regularizing representation curvature to reduce catastrophic forgetting of pre-trained task-agnostic information while adopting task-specific skills. We consider unifying these strategies for low curvature in both parameter space and function space to improve out-of-domain (OOD) generalization. We propose Trust Region Aware Minimization (TRAM), a SAM algorithm fine-tuning for low parameter sharpness and smooth, informative representations preserving pre-trained structure. TRAM uses a trust region bound to inform the SAM adversarial neighborhood, introducing an awareness of function curvature within optimization for flatter minima. We empirically validate TRAM in vision (cross-dataset adaptation) and text (OOD language modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer and representation generality are critical. TRAM outperforms SAM- and TR-based optimization across all tasks, notably surpassing competing methods for hard transfer between anticorrelated domains. TRAM establishes a novel standard in fine-tuning for domain-generalizable models with minimal additional computation over previous sharpness-aware methods.  ( 2 min )
    Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
    arXiv:2403.07241v1 Announce Type: cross Abstract: Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive and computationally costly. Additionally, these tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features -- patterns that correlate with the target in training data, but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely based on the assumption that we can identify such features, does not provide definitive assurance for real-world applications. As a piloting study, this work focuses on exploring mitigating the reliance on spurious features for CLIP without using any group annotation. To this end, we systematically study the existence of spurious correlation on CLIP and CILP+ERM. We first, following recent work on Deep Feature Reweighting (DFR), verify that last-layer retraining can greatly improve group robustness on pretrained CLIP. In view of them, we advocate a lightweight representation calibration method for fine-tuning CLIP, by first generating a calibration set using the pretrained CLIP, and then calibrating representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing reliance and significantly boosting the model generalization.  ( 3 min )
    Class Imbalance in Object Detection: An Experimental Diagnosis and Study of Mitigation Strategies
    arXiv:2403.07113v1 Announce Type: cross Abstract: Object detection, a pivotal task in computer vision, is frequently hindered by dataset imbalances, particularly the under-explored issue of foreground-foreground class imbalance. This lack of attention to foreground-foreground class imbalance becomes even more pronounced in the context of single-stage detectors. This study introduces a benchmarking framework utilizing the YOLOv5 single-stage detector to address the problem of foreground-foreground class imbalance. We crafted a novel 10-class long-tailed dataset from the COCO dataset, termed COCO-ZIPF, tailored to reflect common real-world detection scenarios with a limited number of object classes. Against this backdrop, we scrutinized three established techniques: sampling, loss weighing, and data augmentation. Our comparative analysis reveals that sampling and loss reweighing methods, while shown to be beneficial in two-stage detector settings, do not translate as effectively in improving YOLOv5's performance on the COCO-ZIPF dataset. On the other hand, data augmentation methods, specifically mosaic and mixup, significantly enhance the model's mean Average Precision (mAP), by introducing more variability and complexity into the training data. (Code available: https://github.com/craston/object_detection_cib)  ( 2 min )
    A Two-Stage Feature Selection Approach for Robust Evaluation of Treatment Effects in High-Dimensional Observational Data
    arXiv:2111.13800v2 Announce Type: replace Abstract: A Randomized Control Trial (RCT) is considered as the gold standard for evaluating the effect of any intervention or treatment. However, its feasibility is often hindered by ethical, economical, and legal considerations, making observational data a valuable alternative for drawing causal conclusions. Nevertheless, healthcare observational data presents a difficult challenge due to its high dimensionality, requiring careful consideration to ensure unbiased, reliable, and robust causal inferences. To overcome this challenge, in this study, we propose a novel two-stage feature selection technique called, Outcome Adaptive Elastic Net (OAENet), explicitly designed for making robust causal inference decisions using matching techniques. OAENet offers several key advantages over existing methods: superior performance on correlated and high-dimensional data compared to the existing methods and the ability to select specific sets of variables (including confounders and variables associated only with the outcome). This ensures robustness and facilitates an unbiased estimate of the causal effect. Numerical experiments on simulated data demonstrate that OAENet significantly outperforms state-of-the-art methods by either producing a higher-quality estimate or a comparable estimate in significantly less time. To illustrate the applicability of OAENet, we employ large-scale US healthcare data to estimate the effect of Opioid Use Disorder (OUD) on suicidal behavior. When compared to competing methods, OAENet closely aligns with existing literature on the relationship between OUD and suicidal behavior. Performance on both simulated and real-world data highlights that OAENet notably enhances the accuracy of estimating treatment effects or evaluating policy decision-making with causal inference.  ( 3 min )
    Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
    arXiv:2403.07183v1 Announce Type: cross Abstract: We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.  ( 3 min )
    A cascaded deep network for automated tumor detection and segmentation in clinical PET imaging of diffuse large B-cell lymphoma
    arXiv:2403.07092v1 Announce Type: cross Abstract: Accurate detection and segmentation of diffuse large B-cell lymphoma (DLBCL) from PET images has important implications for estimation of total metabolic tumor volume, radiomics analysis, surgical intervention and radiotherapy. Manual segmentation of tumors in whole-body PET images is time-consuming, labor-intensive and operator-dependent. In this work, we develop and validate a fast and efficient three-step cascaded deep learning model for automated detection and segmentation of DLBCL tumors from PET images. As compared to a single end-to-end network for segmentation of tumors in whole-body PET images, our three-step model is more effective (improves 3D Dice score from 58.9% to 78.1%) since each of its specialized modules, namely the slice classifier, the tumor detector and the tumor segmentor, can be trained independently to a high degree of skill to carry out a specific task, rather than a single network with suboptimal performance on overall segmentation.  ( 3 min )
    Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics
    arXiv:2311.05061v2 Announce Type: replace Abstract: Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involve decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.  ( 3 min )
    Phase autoencoder for limit-cycle oscillators
    arXiv:2403.06992v1 Announce Type: cross Abstract: We present a phase autoencoder that encodes the asymptotic phase of a limit-cycle oscillator, a fundamental quantity characterizing its synchronization dynamics. This autoencoder is trained in such a way that its latent variables directly represent the asymptotic phase of the oscillator. The trained autoencoder can perform two functions without relying on the mathematical model of the oscillator: first, it can evaluate the asymptotic phase and phase sensitivity function of the oscillator; second, it can reconstruct the oscillator state on the limit cycle in the original space from the phase value as an input. Using several examples of limit-cycle oscillators, we demonstrate that the asymptotic phase and phase sensitivity function can be estimated only from time-series data by the trained autoencoder. We also present a simple method for globally synchronizing two oscillators as an application of the trained autoencoder.  ( 2 min )
    SoK: Can Trajectory Generation Combine Privacy and Utility?
    arXiv:2403.07218v1 Announce Type: cross Abstract: While location trajectories represent a valuable data source for analyses and location-based services, they can reveal sensitive information, such as political and religious preferences. Differentially private publication mechanisms have been proposed to allow for analyses under rigorous privacy guarantees. However, the traditional protection schemes suffer from a limiting privacy-utility trade-off and are vulnerable to correlation and reconstruction attacks. Synthetic trajectory data generation and release represent a promising alternative to protection algorithms. While initial proposals achieve remarkable utility, they fail to provide rigorous privacy guarantees. This paper proposes a framework for designing a privacy-preserving trajectory publication approach by defining five design goals, particularly stressing the importance of choosing an appropriate Unit of Privacy. Based on this framework, we briefly discuss the existing trajectory protection approaches, emphasising their shortcomings. This work focuses on the systematisation of the state-of-the-art generative models for trajectories in the context of the proposed framework. We find that no existing solution satisfies all requirements. Thus, we perform an experimental study evaluating the applicability of six sequential generative models to the trajectory domain. Finally, we conclude that a generative trajectory model providing semantic guarantees remains an open research question and propose concrete next steps for future research.  ( 2 min )
    DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation
    arXiv:2403.07788v1 Announce Type: cross Abstract: Imitation learning from human hand motion data presents a promising avenue for imbuing robots with human-like dexterity in real-world manipulation tasks. Despite this potential, substantial challenges persist, particularly with the portability of existing hand motion capture (mocap) systems and the difficulty of translating mocap data into effective control policies. To tackle these issues, we introduce DexCap, a portable hand motion capture system, alongside DexIL, a novel imitation algorithm for training dexterous robot skills directly from human hand mocap data. DexCap offers precise, occlusion-resistant tracking of wrist and finger motions based on SLAM and electromagnetic field together with 3D observations of the environment. Utilizing this rich dataset, DexIL employs inverse kinematics and point cloud-based imitation learning to replicate human actions with robot hands. Beyond learning from human motion, DexCap also offers an optional human-in-the-loop correction mechanism to refine and further improve robot performance. Through extensive evaluation across six dexterous manipulation tasks, our approach not only demonstrates superior performance but also showcases the system's capability to effectively learn from in-the-wild mocap data, paving the way for future data collection methods for dexterous manipulation. More details can be found at https://dex-cap.github.io  ( 2 min )
    Ab-initio variational wave functions for the time-dependent many-electron Schr\"odinger equation
    arXiv:2403.07447v1 Announce Type: cross Abstract: Describing the dynamics of many-electron quantum systems is crucial for applications such as predicting electronic structures in quantum chemistry, the properties of condensed matter systems, and the behaviors of complex materials. However, the real-time evolution of non-equilibrium quantum electronic systems poses a significant challenge for theoretical and computational approaches, due to the system's exploration of a vast configuration space. This work introduces a variational approach for fermionic time-dependent wave functions, surpassing mean-field approximations by capturing many-body correlations. The proposed methodology involves parameterizing the time-evolving quantum state, enabling the approximation of the state's evolution. To account for electron correlations, we employ time-dependent Jastrow factors and backflow transformations. We also show that we can incorporate neural networks to parameterize these functions. The time-dependent variational Monte Carlo technique is employed to efficiently compute the optimal time-dependent parameters. The approach is demonstrated in three distinct systems: the solvable harmonic interaction model, the dynamics of a diatomic molecule in intense laser fields, and a quenched quantum dot. In all cases, we show clear signatures of many-body correlations in the dynamics not captured by mean-field methods. The results showcase the ability of our variational approach to accurately capture the time evolution of quantum states, providing insight into the quantum dynamics of interacting electronic systems, beyond the capabilities of mean-field.  ( 3 min )
    Epoch-evolving Gaussian Process Guided Learning
    arXiv:2006.14347v2 Announce Type: replace Abstract: In this paper, we propose a novel learning scheme called epoch-evolving Gaussian Process Guided Learning (GPGL), which aims at characterizing the correlation information between the batch-level distribution and the global data distribution. Such correlation information is encoded as context labels and needs renewal every epoch. With the guidance of the context label and ground truth label, GPGL scheme provides a more efficient optimization through updating the model parameters with a triangle consistency loss. Furthermore, our GPGL scheme can be further generalized and naturally applied to the current deep models, outperforming the existing batch-based state-of-the-art models on mainstream datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) remarkably.  ( 2 min )
    On the Last-Iterate Convergence of Shuffling Gradient Methods
    arXiv:2403.07723v1 Announce Type: new Abstract: Shuffling gradient methods, which are also known as stochastic gradient descent (SGD) without replacement, are widely implemented in practice, particularly including three popular algorithms: Random Reshuffle (RR), Shuffle Once (SO), and Incremental Gradient (IG). Compared to the empirical success, the theoretical guarantee of shuffling gradient methods was not well-understanding for a long time. Until recently, the convergence rates had just been established for the average iterate for convex functions and the last iterate for strongly convex problems (using squared distance as the metric). However, when using the function value gap as the convergence criterion, existing theories cannot interpret the good performance of the last iterate in different settings (e.g., constrained optimization). To bridge this gap between practice and theory, we prove last-iterate convergence rates for shuffling gradient methods with respect to the objective value even without strong convexity. Our new results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.  ( 2 min )
    Physics Sensor Based Deep Learning Fall Detection System
    arXiv:2403.06994v1 Announce Type: cross Abstract: Fall detection based on embedded sensor is a practical and popular research direction in recent years. In terms of a specific application: fall detection methods based upon physics sensors such as [gyroscope and accelerator] have been exploited using traditional hand crafted features and feed them in machine learning models like Markov chain or just threshold based classification methods. In this paper, we build a complete system named TSFallDetect including data receiving device based on embedded sensor, mobile deep-learning model deploying platform, and a simple server, which will be used to gather models and data for future expansion. On the other hand, we exploit the sequential deep-learning methods to address this falling motion prediction problem based on data collected by inertial and film pressure sensors. We make a empirical study based on existing datasets and our datasets collected from our system separately, which shows that the deep-learning model has more potential advantage than other traditional methods, and we proposed a new deep-learning model based on the time series data to predict the fall, and it may be superior to other sequential models in this particular field.  ( 2 min )
    Improving prediction of students' performance in intelligent tutoring systems using attribute selection and ensembles of different multimodal data sources
    arXiv:2403.07194v1 Announce Type: cross Abstract: The aim of this study was to predict university students' learning performance using different sources of data from an Intelligent Tutoring System. We collected and preprocessed data from 40 students from different multimodal sources: learning strategies from system logs, emotions from face recording videos, interaction zones from eye tracking, and test performance from final knowledge evaluation. Our objective was to test whether the prediction could be improved by using attribute selection and classification ensembles. We carried out three experiments by applying six classification algorithms to numerical and discretized preprocessed multimodal data. The results show that the best predictions were produced using ensembles and selecting the best attributes approach with numerical data.  ( 2 min )
    Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences
    arXiv:2403.07230v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and rejected response pair per user prompt) to align LLMs to human preferences. In practice, multiple responses can exist for a given prompt with varying quality relative to each other. With availability of such quality ratings for multiple responses, we propose utilizing these responses to create multiple preference pairs for a given prompt. Our work focuses on systematically using the constructed multiple preference pair in DPO training via curriculum learning methodology. In particular, we order these multiple pairs of preference data from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach to the standard single-pair DPO setting. Our method, which we call Curry-DPO consistently shows increased performance gains on MTbench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-bench with Zephy-7B model outperforming majority of existing LLMs with similar parameter size. Curry-DPO also achieves the highest adjusted win rates on Vicuna, WizardLM, and UltraFeedback test datasets (90.7%, 87.1%, and 87.9% respectively) in our experiments, with notable gains of upto 7.5% when compared to standard DPO technique.  ( 2 min )
    FAX: Scalable and Differentiable Federated Primitives in JAX
    arXiv:2403.07128v1 Announce Type: cross Abstract: We present FAX, a JAX-based library designed to support large-scale distributed and federated computations in both data center and cross-device applications. FAX leverages JAX's sharding mechanisms to enable native targeting of TPUs and state-of-the-art JAX runtimes, including Pathways. FAX embeds building blocks for federated computations as primitives in JAX. This enables three key benefits. First, FAX computations can be translated to XLA HLO. Second, FAX provides a full implementation of federated automatic differentiation, greatly simplifying the expression of federated computations. Last, FAX computations can be interpreted out to existing production cross-device federated compute systems. We show that FAX provides an easily programmable, performant, and scalable framework for federated computations in the data center. FAX is available at https://github.com/google-research/google-research/tree/master/fax .  ( 2 min )
    The Principles of Data-Centric AI (DCAI)
    arXiv:2211.14611v2 Announce Type: replace Abstract: Data is a crucial infrastructure to how artificial intelligence (AI) systems learn. However, these systems to date have been largely model-centric, putting a premium on the model at the expense of the data quality. Data quality issues beset the performance of AI systems, particularly in downstream deployments and in real-world applications. Data-centric AI (DCAI) as an emerging concept brings data, its quality and its dynamism to the forefront in considerations of AI systems through an iterative and systematic approach. As one of the first overviews, this article brings together data-centric perspectives and concepts to outline the foundations of DCAI. It specifically formulates six guiding principles for researchers and practitioners and gives direction for future advancement of DCAI.  ( 2 min )
    SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
    arXiv:2403.07384v1 Announce Type: cross Abstract: Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.  ( 3 min )
    Stochastic Extragradient with Random Reshuffling: Improved Convergence for Variational Inequalities
    arXiv:2403.07148v1 Announce Type: cross Abstract: The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving finite-sum min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, existing convergence analyses of SEG focus on its with-replacement variants, while practical implementations of the method randomly reshuffle components and sequentially use them. Unlike the well-studied with-replacement variants, SEG with Random Reshuffling (SEG-RR) lacks established theoretical guarantees. In this work, we provide a convergence analysis of SEG-RR for three classes of VIPs: (i) strongly monotone, (ii) affine, and (iii) monotone. We derive conditions under which SEG-RR achieves a faster convergence rate than the uniform with-replacement sampling SEG. In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes, a strong requirement needed in the classical with-replacement SEG. As a byproduct of our results, we provide convergence guarantees for Shuffle Once SEG (shuffles the data only at the beginning of the algorithm) and the Incremental Extragradient (does not shuffle the data). We supplement our analysis with experiments validating empirically the superior performance of SEG-RR over the classical with-replacement sampling SEG.  ( 2 min )
    Boosting keyword spotting through on-device learnable user speech characteristics
    arXiv:2403.07802v1 Announce Type: cross Abstract: Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% from 30.1% to 24.3% based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.  ( 2 min )
    A slice classification neural network for automated classification of axial PET/CT slices from a multi-centric lymphoma dataset
    arXiv:2403.07105v1 Announce Type: cross Abstract: Automated slice classification is clinically relevant since it can be incorporated into medical image segmentation workflows as a preprocessing step that would flag slices with a higher probability of containing tumors, thereby directing physicians attention to the important slices. In this work, we train a ResNet-18 network to classify axial slices of lymphoma PET/CT images (collected from two institutions) depending on whether the slice intercepted a tumor (positive slice) in the 3D image or if the slice did not (negative slice). Various instances of the network were trained on 2D axial datasets created in different ways: (i) slice-level split and (ii) patient-level split; inputs of different types were used: (i) only PET slices and (ii) concatenated PET and CT slices; and different training strategies were employed: (i) center-aware (CAW) and (ii) center-agnostic (CAG). Model performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), and various binary classification metrics. We observe and describe a performance overestimation in the case of slice-level split as compared to the patient-level split training. The model trained using patient-level split data with the network input containing only PET slices in the CAG training regime was the best performing/generalizing model on a majority of metrics. Our models were additionally more closely compared using the sensitivity metric on the positive slices from their respective test sets.  ( 3 min )
    Symmetric Equilibrium Learning of VAEs
    arXiv:2307.09883v2 Announce Type: replace Abstract: We view variational autoencoders (VAE) as decoder-encoder pairs, which map distributions in the data space to distributions in the latent space and vice versa. The standard learning approach for VAEs is the maximisation of the evidence lower bound (ELBO). It is asymmetric in that it aims at learning a latent variable model while using the encoder as an auxiliary means only. Moreover, it requires a closed form a-priori latent distribution. This limits its applicability in more complex scenarios, such as general semi-supervised learning and employing complex generative models as priors. We propose a Nash equilibrium learning approach, which is symmetric with respect to the encoder and decoder and allows learning VAEs in situations where both the data and the latent distributions are accessible only by sampling. The flexibility and simplicity of this approach allows its application to a wide range of learning scenarios and downstream tasks.  ( 2 min )
    OpenBox: A Python Toolkit for Generalized Black-box Optimization
    arXiv:2304.13339v2 Announce Type: replace Abstract: Black-box optimization (BBO) has a broad range of applications, including automatic machine learning, experimental design, and database knob tuning. However, users still face challenges when applying BBO methods to their problems at hand with existing software packages in terms of applicability, performance, and efficiency. This paper presents OpenBox, an open-source BBO toolkit with improved usability. It implements user-friendly inferfaces and visualization for users to define and manage their tasks. The modular design behind OpenBox facilitates its flexible deployment in existing systems. Experimental results demonstrate the effectiveness and efficiency of OpenBox over existing systems. The source code of OpenBox is available at https://github.com/PKU-DAIR/open-box.  ( 2 min )
    Domain-Aware Continual Zero-Shot Learning
    arXiv:2112.12989v3 Announce Type: replace-cross Abstract: Modern visual systems have a wide range of potential applications in vision tasks for natural science research, such as aiding in species discovery, monitoring animals in the wild, and so on. However, real-world vision tasks may experience changes in environmental conditions, leading to shifts in how captured images are presented. To address this issue, we introduce Domain-Aware Continual Zero-Shot Learning (DACZSL), a task to recognize images of unseen categories in continuously changing domains. Accordingly, we propose a Domain-Invariant Network (DIN) to learn factorized features for shifting domains and improved textual representation for unseen classes. DIN continually learns a global shared network for domain-invariant and task-invariant features, and per-task private networks for task-specific features. Furthermore, we enhance the dual network with class-wise learnable prompts to improve class-level text representation, thereby improving zero-shot prediction of future unseen classes. To evaluate DACZSL, we introduce two benchmarks, DomainNet-CZSL and iWildCam-CZSL. Our results show that DIN significantly outperforms existing baselines by over 5% in harmonic accuracy and over 1% in backward transfer and achieves a new SoTA.  ( 2 min )
    FairRR: Pre-Processing for Group Fairness through Randomized Response
    arXiv:2403.07780v1 Announce Type: cross Abstract: The increasing usage of machine learning models in consequential decision-making processes has spurred research into the fairness of these systems. While significant work has been done to study group fairness in the in-processing and post-processing setting, there has been little that theoretically connects these results to the pre-processing domain. This paper proposes that achieving group fairness in downstream models can be formulated as finding the optimal design matrix in which to modify a response variable in a Randomized Response framework. We show that measures of group fairness can be directly controlled for with optimal model utility, proposing a pre-processing algorithm called FairRR that yields excellent downstream model utility and fairness.  ( 2 min )
    The Minimax Rate of HSIC Estimation for Translation-Invariant Kernels
    arXiv:2403.07735v1 Announce Type: cross Abstract: Kernel techniques are among the most influential approaches in data science and statistics. Under mild conditions, the reproducing kernel Hilbert space associated to a kernel is capable of encoding the independence of $M\ge 2$ random variables. Probably the most widespread independence measure relying on kernels is the so-called Hilbert-Schmidt independence criterion (HSIC; also referred to as distance covariance in the statistics literature). Despite various existing HSIC estimators designed since its introduction close to two decades ago, the fundamental question of the rate at which HSIC can be estimated is still open. In this work, we prove that the minimax optimal rate of HSIC estimation on $\mathbb R^d$ for Borel measures containing the Gaussians with continuous bounded translation-invariant characteristic kernels is $\mathcal O\!\left(n^{-1/2}\right)$. Specifically, our result implies the optimality in the minimax sense of many of the most-frequently used estimators (including the U-statistic, the V-statistic, and the Nystr\"om-based one) on $\mathbb R^d$.  ( 2 min )
    Probabilistic Easy Variational Causal Effect
    arXiv:2403.07745v1 Announce Type: cross Abstract: Let $X$ and $Z$ be random vectors, and $Y=g(X,Z)$. In this paper, on the one hand, for the case that $X$ and $Z$ are continuous, by using the ideas from the total variation and the flux of $g$, we develop a point of view in causal inference capable of dealing with a broad domain of causal problems. Indeed, we focus on a function, called Probabilistic Easy Variational Causal Effect (PEACE), which can measure the direct causal effect of $X$ on $Y$ with respect to continuously and interventionally changing the values of $X$ while keeping the value of $Z$ constant. PEACE is a function of $d\ge 0$, which is a degree managing the strengths of probability density values $f(x|z)$. On the other hand, we generalize the above idea for the discrete case and show its compatibility with the continuous case. Further, we investigate some properties of PEACE using measure theoretical concepts. Furthermore, we provide some identifiability criteria and several examples showing the generic capability of PEACE. We note that PEACE can deal with the causal problems for which micro-level or just macro-level changes in the value of the input variables are important. Finally, PEACE is stable under small changes in $\partial g_{in}/\partial x$ and the joint distribution of $X$ and $Z$, where $g_{in}$ is obtained from $g$ by removing all functional relationships defining $X$ and $Z$.  ( 2 min )
    Disjoint Contrastive Regression Learning for Multi-Sourced Annotations
    arXiv:2112.15411v2 Announce Type: replace Abstract: Large-scale datasets are important for the development of deep learning models. Such datasets usually require a heavy workload of annotations, which are extremely time-consuming and expensive. To accelerate the annotation procedure, multiple annotators may be employed to label different subsets of the data. However, the inconsistency and bias among different annotators are harmful to the model training, especially for qualitative and subjective tasks.To address this challenge, in this paper, we propose a novel contrastive regression framework to address the disjoint annotations problem, where each sample is labeled by only one annotator and multiple annotators work on disjoint subsets of the data. To take account of both the intra-annotator consistency and inter-annotator inconsistency, two strategies are employed.Firstly, a contrastive-based loss is applied to learn the relative ranking among different samples of the same annotator, with the assumption that the ranking of samples from the same annotator is unanimous. Secondly, we apply the gradient reversal layer to learn robust representations that are invariant to different annotators. Experiments on the facial expression prediction task, as well as the image quality assessment task, verify the effectiveness of our proposed framework.  ( 2 min )
    LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery
    arXiv:2311.02058v3 Announce Type: replace-cross Abstract: We introduce LOTUS, a continual imitation learning algorithm that empowers a physical robot to continuously and efficiently learn to solve new manipulation tasks throughout its lifespan. The core idea behind LOTUS is constructing an ever-growing skill library from a sequence of new tasks with a small number of human demonstrations. LOTUS starts with a continual skill discovery process using an open-vocabulary vision model, which extracts skills as recurring patterns presented in unsegmented demonstrations. Continual skill discovery updates existing skills to avoid catastrophic forgetting of previous tasks and adds new skills to solve novel tasks. LOTUS trains a meta-controller that flexibly composes various skills to tackle vision-based manipulation tasks in the lifelong learning process. Our comprehensive experiments show that LOTUS outperforms state-of-the-art baselines by over 11% in success rate, showing its superior knowledge transfer ability compared to prior methods. More results and videos can be found on the project website: https://ut-austin-rpl.github.io/Lotus/.  ( 2 min )
    Explanation-based Training with Differentiable Insertion/Deletion Metric-aware Regularizers
    arXiv:2310.12553v3 Announce Type: replace Abstract: The quality of explanations for the predictions made by complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how accurately the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while maintaining their predictive accuracy. Because the original insertion and deletion metrics are non-differentiable with respect to the explanations and directly unavailable for gradient-based optimization, we extend the metrics so that they are differentiable and use them to formalize insertion and deletion metric-based regularizers. Our experimental results on image and tabular datasets show that the deep neural network-based predictors that are fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easier-to-interpret explanations while maintaining high predictive accuracy. The code is available at https://github.com/yuyay/idexpo.  ( 2 min )
    IGUANe: a 3D generalizable CycleGAN for multicenter harmonization of brain MR images
    arXiv:2402.03227v3 Announce Type: replace-cross Abstract: In MRI studies, the aggregation of imaging data from multiple acquisition sites enhances sample size but may introduce site-related variabilities that hinder consistency in subsequent analyses. Deep learning methods for image translation have emerged as a solution for harmonizing MR images across sites. In this study, we introduce IGUANe (Image Generation with Unified Adversarial Networks), an original 3D model that leverages the strengths of domain translation and straightforward application of style transfer methods for multicenter brain MR image harmonization. IGUANe extends CycleGAN architecture by integrating an arbitrary number of domains for training through a many-to-one strategy. During inference, the model can be applied to any image, even from an unknown acquisition site, making it a universal generator for harmonization. Trained on a dataset comprising T1-weighted images from 11 different scanners, IGUANe was evaluated on data from unseen sites. The assessments included the transformation of MR images with traveling subjects, the preservation of pairwise distances between MR images within domains, the evolution of volumetric patterns related to age and Alzheimer$^\prime$s disease (AD), and the performance in age regression and patient classification tasks. Comparisons with other harmonization and normalization methods suggest that IGUANe better preserves individual information in MR images and is more suitable for maintaining and reinforcing variabilities related to age and AD. Future studies may further assess IGUANe in other multicenter contexts, either using the same model or retraining it for applications to different image modalities.  ( 3 min )
    lil'HDoC: An Algorithm for Good Arm Identification under Small Threshold Gap
    arXiv:2401.15879v3 Announce Type: replace Abstract: Good arm identification (GAI) is a pure-exploration bandit problem in which a single learner outputs an arm as soon as it is identified as a good arm. A good arm is defined as an arm with an expected reward greater than or equal to a given threshold. This paper focuses on the GAI problem under a small threshold gap, which refers to the distance between the expected rewards of arms and the given threshold. We propose a new algorithm called lil'HDoC to significantly improve the total sample complexity of the HDoC algorithm. We demonstrate that the sample complexity of the first $\lambda$ output arm in lil'HDoC is bounded by the original HDoC algorithm, except for one negligible term, when the distance between the expected reward and threshold is small. Extensive experiments confirm that our algorithm outperforms the state-of-the-art algorithms in both synthetic and real-world datasets.  ( 2 min )
    Utility Theory of Synthetic Data Generation
    arXiv:2305.10015v2 Announce Type: replace-cross Abstract: Synthetic data algorithms are widely employed in industries to generate artificial data for downstream learning tasks. While existing research primarily focuses on empirically evaluating utility of synthetic data, its theoretical understanding is largely lacking. This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework. It considers two utility metrics: generalization and ranking of models trained on synthetic data. The former is defined as the generalization difference between models trained on synthetic and on real data. By deriving analytical bounds for this utility metric, we demonstrate that the synthetic feature distribution does not need to be similar as that of real data for ensuring comparable generalization of synthetic models, provided proper model specifications in downstream learning tasks. The latter utility metric studies the relative performance of models trained on synthetic data. In particular, we discover that the distribution of synthetic data is not necessarily similar as the real one to ensure consistent model comparison. Interestingly, consistent model comparison is still achievable even when synthetic responses are not well generated, as long as downstream models are separable by a generalization gap. Finally, extensive experiments on non-parametric models and deep neural networks have been conducted to validate these theoretical findings.  ( 2 min )
    Gradual Optimization Learning for Conformational Energy Minimization
    arXiv:2311.06295v2 Announce Type: replace-cross Abstract: Molecular conformation optimization is crucial to computer-aided drug discovery and materials design. Traditional energy minimization techniques rely on iterative optimization methods that use molecular forces calculated by a physical simulator (oracle) as anti-gradients. However, this is a computationally expensive approach that requires many interactions with a physical simulator. One way to accelerate this procedure is to replace the physical simulator with a neural network. Despite recent progress in neural networks for molecular conformation energy prediction, such models are prone to distribution shift, leading to inaccurate energy minimization. We find that the quality of energy minimization with neural networks can be improved by providing optimization trajectories as additional training data. Still, it takes around $5 \times 10^5$ additional conformations to match the physical simulator's optimization quality. In this work, we present the Gradual Optimization Learning Framework (GOLF) for energy minimization with neural networks that significantly reduces the required additional data. The framework consists of an efficient data-collecting scheme and an external optimizer. The external optimizer utilizes gradients from the energy prediction model to generate optimization trajectories, and the data-collecting scheme selects additional training data to be processed by the physical simulator. Our results demonstrate that the neural network trained with GOLF performs on par with the oracle on a benchmark of diverse drug-like molecules using $50$x less additional data.  ( 3 min )
    Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness
    arXiv:2310.02832v2 Announce Type: replace Abstract: Effective out-of-distribution (OOD) detection is crucial for reliable machine learning models, yet most current methods are limited in practical use due to requirements like access to training data or intervention in training. We present a novel method for detecting OOD data in Transformers based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD utilizes the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.  ( 2 min )
    Beyond Regrets: Geometric Metrics for Bayesian Optimization
    arXiv:2401.01981v2 Announce Type: replace Abstract: Bayesian optimization is a principled optimization strategy for a black-box objective function. It shows its effectiveness in a wide variety of real-world applications such as scientific discovery and experimental design. In general, the performance of Bayesian optimization is reported through regret-based metrics such as instantaneous, simple, and cumulative regrets. These metrics only rely on function evaluations, so that they do not consider geometric relationships between query points and global solutions, or query points themselves. Notably, they cannot discriminate if multiple global solutions are successfully found. Moreover, they do not evaluate Bayesian optimization's abilities to exploit and explore a search space given. To tackle these issues, we propose four new geometric metrics, i.e., precision, recall, average degree, and average distance. These metrics allow us to compare Bayesian optimization algorithms considering the geometry of both query points and global optima, or query points. However, they are accompanied by an extra parameter, which needs to be carefully determined. We therefore devise the parameter-free forms of the respective metrics by integrating out the additional parameter. Finally, we empirically validate that our proposed metrics can provide more delicate interpretation of Bayesian optimization algorithms, on top of assessment via the conventional metrics.  ( 2 min )
    Directional Optimism for Safe Linear Bandits
    arXiv:2308.15006v2 Announce Type: replace Abstract: The safe linear bandit problem is a version of the classical stochastic linear bandit problem where the learner's actions must satisfy an uncertain constraint at all rounds. Due its applicability to many real-world settings, this problem has received considerable attention in recent years. By leveraging a novel approach that we call directional optimism, we find that it is possible to achieve improved regret guarantees for both well-separated problem instances and action sets that are finite star convex sets. Furthermore, we propose a novel algorithm for this setting that improves on existing algorithms in terms of empirical performance, while enjoying matching regret guarantees. Lastly, we introduce a generalization of the safe linear bandit setting where the constraints are convex and adapt our algorithms and analyses to this setting by leveraging a novel convex-analysis based approach.  ( 2 min )
    QLABGrad: a Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning
    arXiv:2302.00252v2 Announce Type: replace Abstract: The learning rate is a critical hyperparameter for deep learning tasks since it determines the extent to which the model parameters are updated during the learning course. However, the choice of learning rates typically depends on empirical judgment, which may not result in satisfactory outcomes without intensive try-and-error experiments. In this study, we propose a novel learning rate adaptation scheme called QLABGrad. Without any user-specified hyperparameter, QLABGrad automatically determines the learning rate by optimizing the Quadratic Loss Approximation-Based (QLAB) function for a given gradient descent direction, where only one extra forward propagation is required. We theoretically prove the convergence of QLABGrad with a smooth Lipschitz condition on the loss function. Experiment results on multiple architectures, including MLP, CNN, and ResNet, on MNIST, CIFAR10, and ImageNet datasets, demonstrate that QLABGrad outperforms various competing schemes for deep learning.  ( 2 min )
    Label Dropout: Improved Deep Learning Echocardiography Segmentation Using Multiple Datasets With Domain Shift and Partial Labelling
    arXiv:2403.07818v1 Announce Type: cross Abstract: Echocardiography (echo) is the first imaging modality used when assessing cardiac function. The measurement of functional biomarkers from echo relies upon the segmentation of cardiac structures and deep learning models have been proposed to automate the segmentation process. However, in order to translate these tools to widespread clinical use it is important that the segmentation models are robust to a wide variety of images (e.g. acquired from different scanners, by operators with different levels of expertise etc.). To achieve this level of robustness it is necessary that the models are trained with multiple diverse datasets. A significant challenge faced when training with multiple diverse datasets is the variation in label presence, i.e. the combined data are often partially-labelled. Adaptations of the cross entropy loss function have been proposed to deal with partially labelled data. In this paper we show that training naively with such a loss function and multiple diverse datasets can lead to a form of shortcut learning, where the model associates label presence with domain characteristics, leading to a drop in performance. To address this problem, we propose a novel label dropout scheme to break the link between domain characteristics and the presence or absence of labels. We demonstrate that label dropout improves echo segmentation Dice score by 62% and 25% on two cardiac structures when training using multiple diverse partially labelled datasets.  ( 3 min )
    One for All and All for One: GNN-based Control-Flow Attestation for Embedded Devices
    arXiv:2403.07465v1 Announce Type: cross Abstract: Control-Flow Attestation (CFA) is a security service that allows an entity (verifier) to verify the integrity of code execution on a remote computer system (prover). Existing CFA schemes suffer from impractical assumptions, such as requiring access to the prover's internal state (e.g., memory or code), the complete Control-Flow Graph (CFG) of the prover's software, large sets of measurements, or tailor-made hardware. Moreover, current CFA schemes are inadequate for attesting embedded systems due to their high computational overhead and resource usage. In this paper, we overcome the limitations of existing CFA schemes for embedded devices by introducing RAGE, a novel, lightweight CFA approach with minimal requirements. RAGE can detect Code Reuse Attacks (CRA), including control- and non-control-data attacks. It efficiently extracts features from one execution trace and leverages Unsupervised Graph Neural Networks (GNNs) to identify deviations from benign executions. The core intuition behind RAGE is to exploit the correspondence between execution trace, execution graph, and execution embeddings to eliminate the unrealistic requirement of having access to a complete CFG. We evaluate RAGE on embedded benchmarks and demonstrate that (i) it detects 40 real-world attacks on embedded software; (ii) Further, we stress our scheme with synthetic return-oriented programming (ROP) and data-oriented programming (DOP) attacks on the real-world embedded software benchmark Embench, achieving 98.03% (ROP) and 91.01% (DOP) F1-Score while maintaining a low False Positive Rate of 3.19%; (iii) Additionally, we evaluate RAGE on OpenSSL, used by millions of devices and achieve 97.49% and 84.42% F1-Score for ROP and DOP attack detection, with an FPR of 5.47%.  ( 3 min )
    On the nonconvexity of some push-forward constraints and its consequences in machine learning
    arXiv:2403.07471v1 Announce Type: cross Abstract: The push-forward operation enables one to redistribute a probability measure through a deterministic map. It plays a key role in statistics and optimization: many learning problems (notably from optimal transport, generative modeling, and algorithmic fairness) include constraints or penalties framed as push-forward conditions on the model. However, the literature lacks general theoretical insights on the (non)convexity of such constraints and its consequences on the associated learning problems. This paper aims at filling this gap. In a first part, we provide a range of sufficient and necessary conditions for the (non)convexity of two sets of functions: the maps transporting one probability measure to another; the maps inducing equal output distributions across distinct probability measures. This highlights that for most probability measures, these push-forward constraints are not convex. In a second time, we show how this result implies critical limitations on the design of convex optimization problems for learning generative models or group-fair predictors. This work will hopefully help researchers and practitioners have a better understanding of the critical impact of push-forward conditions onto convexity.  ( 2 min )
    Knowledge Transfer across Multiple Principal Component Analysis Studies
    arXiv:2403.07431v1 Announce Type: cross Abstract: Transfer learning has aroused great interest in the statistical community. In this article, we focus on knowledge transfer for unsupervised learning tasks in contrast to the supervised learning tasks in the literature. Given the transferable source populations, we propose a two-step transfer learning algorithm to extract useful information from multiple source principal component analysis (PCA) studies, thereby enhancing estimation accuracy for the target PCA task. In the first step, we integrate the shared subspace information across multiple studies by a proposed method named as Grassmannian barycenter, instead of directly performing PCA on the pooled dataset. The proposed Grassmannian barycenter method enjoys robustness and computational advantages in more general cases. Then the resulting estimator for the shared subspace from the first step is further utilized to estimate the target private subspace in the second step. Our theoretical analysis credits the gain of knowledge transfer between PCA studies to the enlarged eigenvalue gap, which is different from the existing supervised transfer learning tasks where sparsity plays the central role. In addition, we prove that the bilinear forms of the empirical spectral projectors have asymptotic normality under weaker eigenvalue gap conditions after knowledge transfer. When the set of informativesources is unknown, we endow our algorithm with the capability of useful dataset selection by solving a rectified optimization problem on the Grassmann manifold, which in turn leads to a computationally friendly rectified Grassmannian K-means procedure. In the end, extensive numerical simulation results and a real data case concerning activity recognition are reported to support our theoretical claims and to illustrate the empirical usefulness of the proposed transfer learning methods.  ( 3 min )
    Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization
    arXiv:2403.07264v1 Announce Type: cross Abstract: We study the generalization capability of nearly-interpolating linear regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|{\boldsymbol{\beta}}\|_{2}^{2}] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.  ( 2 min )
    Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
    arXiv:2403.07263v1 Announce Type: cross Abstract: Quantifying a model's predictive uncertainty is essential for safety-critical applications such as autonomous driving. We consider quantifying such uncertainty for multi-object detection. In particular, we leverage conformal prediction to obtain uncertainty intervals with guaranteed coverage for object bounding boxes. One challenge in doing so is that bounding box predictions are conditioned on the object's class label. Thus, we develop a novel two-step conformal approach that propagates uncertainty in predicted class labels into the uncertainty intervals for the bounding boxes. This broadens the validity of our conformal coverage guarantees to include incorrectly classified objects, ensuring their usefulness when maximal safety assurances are required. Moreover, we investigate novel ensemble and quantile regression formulations to ensure the bounding box intervals are adaptive to object size, leading to a more balanced coverage across sizes. Validating our two-step approach on real-world datasets for 2D bounding box localization, we find that desired coverage levels are satisfied with actionably tight predictive uncertainty intervals.  ( 2 min )
    Exploring Safety Generalization Challenges of Large Language Models via Code
    arXiv:2403.07865v1 Announce Type: cross Abstract: The rapid advancement of Large Language Models (LLMs) has brought about remarkable capabilities in natural language processing but also raised concerns about their potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural languages, which may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a common safety vulnerability of these models against code input: CodeAttack consistently bypasses the safety guardrails of all models more than 80\% of the time. Furthermore, we find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, such as encoding natural language input with data structures or using less popular programming languages. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.  ( 2 min )
    Signed graphs in data sciences via communicability geometry
    arXiv:2403.07493v1 Announce Type: cross Abstract: Signed graphs are an emergent way of representing data in a variety of contexts were conflicting interactions exist. These include data from biological, ecological, and social systems. Here we propose the concept of communicability geometry for signed graphs, proving that metrics in this space, such as the communicability distance and angles, are Euclidean and spherical. We then apply these metrics to solve several problems in data analysis of signed graphs in a unified way. They include the partitioning of signed graphs, dimensionality reduction, finding hierarchies of alliances in signed networks as well as the quantification of the degree of polarization between the existing factions in systems represented by this type of graphs.  ( 2 min )
    Ensembling Prioritized Hybrid Policies for Multi-agent Pathfinding
    arXiv:2403.07559v1 Announce Type: cross Abstract: Multi-Agent Reinforcement Learning (MARL) based Multi-Agent Path Finding (MAPF) has recently gained attention due to its efficiency and scalability. Several MARL-MAPF methods choose to use communication to enrich the information one agent can perceive. However, existing works still struggle in structured environments with high obstacle density and a high number of agents. To further improve the performance of the communication-based MARL-MAPF solvers, we propose a new method, Ensembling Prioritized Hybrid Policies (EPH). We first propose a selective communication block to gather richer information for better agent coordination within multi-agent environments and train the model with a Q-learning-based algorithm. We further introduce three advanced inference strategies aimed at bolstering performance during the execution phase. First, we hybridize the neural policy with single-agent expert guidance for navigating conflict-free zones. Secondly, we propose Q value-based methods for prioritized resolution of conflicts as well as deadlock situations. Finally, we introduce a robust ensemble method that can efficiently collect the best out of multiple possible solutions. We empirically evaluate EPH in complex multi-agent environments and demonstrate competitive performance against state-of-the-art neural methods for MAPF.  ( 2 min )
    Fast, accurate and lightweight sequential simulation-based inference using Gaussian locally linear mappings
    arXiv:2403.07454v1 Announce Type: cross Abstract: Bayesian inference for complex models with an intractable likelihood can be tackled using algorithms performing many calls to computer simulators. These approaches are collectively known as "simulation-based inference" (SBI). Recent SBI methods have made use of neural networks (NN) to provide approximate, yet expressive constructs for the unavailable likelihood function and the posterior distribution. However, they do not generally achieve an optimal trade-off between accuracy and computational demand. In this work, we propose an alternative that provides both approximations to the likelihood and the posterior distribution, using structured mixtures of probability distributions. Our approach produces accurate posterior inference when compared to state-of-the-art NN-based SBI methods, while exhibiting a much smaller computational footprint. We illustrate our results on several benchmark models from the SBI literature.  ( 2 min )
    A Framework for Cost-Effective and Self-Adaptive LLM Shaking and Recovery Mechanism
    arXiv:2403.07283v1 Announce Type: cross Abstract: As Large Language Models (LLMs) gain great success in real-world applications, an increasing number of users are seeking to develop and deploy their customized LLMs through cloud services. Nonetheless, in some specific domains, there are still concerns regarding cost and trade-offs between privacy issues and accuracy. In this study, we introduce a cost-effective and self-adaptive LLM shaking tuning and recovery mechanism, named CypherTalk. With carefully designed horizontal and vertical shaking operators, we can achieve comparable accuracy results with SOTA privacy-preserving LLM schemes using Cryptography-based or Differential Privacy-based methods. Experiments also show that with the CypherTalk framework, users can achieve reliable accuracy when using optimized shaking operator settings. To our best knowledge, this is the first work that considers cost, and trade-off between model utility and privacy in LLM scenarios.  ( 2 min )
    Knowledge Graph Large Language Model (KG-LLM) for Link Prediction
    arXiv:2403.07311v1 Announce Type: cross Abstract: The task of predicting multiple links within knowledge graphs (KGs) stands as a challenge in the field of knowledge graph analysis, a challenge increasingly resolvable due to advancements in natural language processing (NLP) and KG embedding techniques. This paper introduces a novel methodology, the Knowledge Graph Large Language Model Framework (KG-LLM), which leverages pivotal NLP paradigms, including chain-of-thought (CoT) prompting and in-context learning (ICL), to enhance multi-hop link prediction in KGs. By converting the KG to a CoT prompt, our framework is designed to discern and learn the latent representations of entities and their interrelations. To show the efficacy of the KG-LLM Framework, we fine-tune three leading Large Language Models (LLMs) within this framework, employing both non-ICL and ICL tasks for a comprehensive evaluation. Further, we explore the framework's potential to provide LLMs with zero-shot capabilities for handling previously unseen prompts. Our experimental findings discover that integrating ICL and CoT not only augments the performance of our approach but also significantly boosts the models' generalization capacity, thereby ensuring more precise predictions in unfamiliar scenarios.  ( 2 min )
    Online Continual Learning For Interactive Instruction Following Agents
    arXiv:2403.07548v1 Announce Type: cross Abstract: In learning an embodied agent executing daily tasks via language directives, the literature largely assumes that the agent learns all training data at the beginning. We argue that such a learning scenario is less realistic since a robotic agent is supposed to learn the world continuously as it explores and perceives it. To take a step towards a more realistic embodied agent learning scenario, we propose two continual learning setups for embodied agents; learning new behaviors (Behavior Incremental Learning, Behavior-IL) and new environments (Environment Incremental Learning, Environment-IL) For the tasks, previous 'data prior' based continual learning methods maintain logits for the past tasks. However, the stored information is often insufficiently learned information and requires task boundary information, which might not always be available. Here, we propose to update them based on confidence scores without task boundary information during training (i.e., task-free) in a moving average fashion, named Confidence-Aware Moving Average (CAMA). In the proposed Behavior-IL and Environment-IL setups, our simple CAMA outperforms prior state of the art in our empirical validations by noticeable margins. The project page including codes is https://github.com/snumprlab/cl-alfred.  ( 3 min )
    GuideGen: A Text-guided Framework for Joint CT Volume and Anatomical structure Generation
    arXiv:2403.07247v1 Announce Type: cross Abstract: The annotation burden and extensive labor for gathering a large medical dataset with images and corresponding labels are rarely cost-effective and highly intimidating. This results in a lack of abundant training data that undermines downstream tasks and partially contributes to the challenge image analysis faces in the medical field. As a workaround, given the recent success of generative neural models, it is now possible to synthesize image datasets at a high fidelity guided by external constraints. This paper explores this possibility and presents \textbf{GuideGen}: a pipeline that jointly generates CT images and tissue masks for abdominal organs and colorectal cancer conditioned on a text prompt. Firstly, we introduce Volumetric Mask Sampler to fit the discrete distribution of mask labels and generate low-resolution 3D tissue masks. Secondly, our Conditional Image Generator autoregressively generates CT slices conditioned on a corresponding mask slice to incorporate both style information and anatomical guidance. This pipeline guarantees high fidelity and variability as well as exact alignment between generated CT volumes and tissue masks. Both qualitative and quantitative experiments on 3D abdominal CTs demonstrate a high performance of our proposed pipeline, thereby proving our method can serve as a dataset generator and provide potential benefits to downstream tasks. It is hoped that our work will offer a promising solution on the multimodality generation of CT and its anatomical mask. Our source code is publicly available at https://github.com/OvO1111/JointImageGeneration.  ( 3 min )
    ProMIL: Probabilistic Multiple Instance Learning for Medical Imaging
    arXiv:2306.10535v2 Announce Type: replace-cross Abstract: Multiple Instance Learning (MIL) is a weakly-supervised problem in which one label is assigned to the whole bag of instances. An important class of MIL models is instance-based, where we first classify instances and then aggregate those predictions to obtain a bag label. The most common MIL model is when we consider a bag as positive if at least one of its instances has a positive label. However, this reasoning does not hold in many real-life scenarios, where the positive bag label is often a consequence of a certain percentage of positive instances. To address this issue, we introduce a dedicated instance-based method called ProMIL, based on deep neural networks and Bernstein polynomial estimation. An important advantage of ProMIL is that it can automatically detect the optimal percentage level for decision-making. We show that ProMIL outperforms standard instance-based MIL in real-world medical applications. We make the code available.  ( 2 min )
    Quantifying and Mitigating Privacy Risks for Tabular Generative Models
    arXiv:2403.07842v1 Announce Type: new Abstract: Synthetic data from generative models emerges as the privacy-preserving data-sharing solution. Such a synthetic data set shall resemble the original data without revealing identifiable private information. The backbone technology of tabular synthesizers is rooted in image generative models, ranging from Generative Adversarial Networks (GANs) to recent diffusion models. Recent prior work sheds light on the utility-privacy tradeoff on tabular data, revealing and quantifying privacy risks on synthetic data. We first conduct an exhaustive empirical analysis, highlighting the utility-privacy tradeoff of five state-of-the-art tabular synthesizers, against eight privacy attacks, with a special focus on membership inference attacks. Motivated by the observation of high data quality but also high privacy risk in tabular diffusion, we propose DP-TLDM, Differentially Private Tabular Latent Diffusion Model, which is composed of an autoencoder network to encode the tabular data and a latent diffusion model to synthesize the latent tables. Following the emerging f-DP framework, we apply DP-SGD to train the auto-encoder in combination with batch clipping and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM is capable of achieving a meaningful theoretical privacy guarantee while also significantly enhancing the utility of synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability, all while preserving a comparable level of privacy risk.  ( 3 min )
    Multiple Population Alternate Evolution Neural Architecture Search
    arXiv:2403.07035v1 Announce Type: cross Abstract: The effectiveness of Evolutionary Neural Architecture Search (ENAS) is influenced by the design of the search space. Nevertheless, common methods including the global search space, scalable search space and hierarchical search space have certain limitations. Specifically, the global search space requires a significant amount of computational resources and time, the scalable search space sacrifices the diversity of network structures and the hierarchical search space increases the search cost in exchange for network diversity. To address above limitation, we propose a novel paradigm of searching neural network architectures and design the Multiple Population Alternate Evolution Neural Architecture Search (MPAE), which can achieve module diversity with a smaller search cost. MPAE converts the search space into L interconnected units and sequentially searches the units, then the above search of the entire network be cycled several times to reduce the impact of previous units on subsequent units. To accelerate the population evolution process, we also propose the the population migration mechanism establishes an excellent migration archive and transfers the excellent knowledge and experience in the migration archive to new populations. The proposed method requires only 0.3 GPU days to search a neural network on the CIFAR dataset and achieves the state-of-the-art results.  ( 2 min )
    CAS: A General Algorithm for Online Selective Conformal Prediction with FCR Control
    arXiv:2403.07728v1 Announce Type: cross Abstract: We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR) to measure the averaged miscoverage error. We develop a general framework named CAS (Calibration after Adaptive Selection) that can wrap around any prediction model and online selection rule to output post-selection prediction intervals. If the current individual is selected, we first perform an adaptive selection on historical data to construct a calibration set, then output a conformal prediction interval for the unobserved label. We provide tractable constructions for the calibration set for popular online selection rules. We proved that CAS can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. For the decision-driven selection rule, including most online multiple-testing procedures, CAS can exactly control the real-time FCR below the target level without any distributional assumptions. For the online selection with symmetric thresholds, we establish the error bound for the control gap of FCR under mild distributional assumptions. To account for the distribution shift in online data, we also embed CAS into some recent dynamic conformal prediction methods and examine the long-run FCR control. Numerical results on both synthetic and real data corroborate that CAS can effectively control FCR around the target level and yield more narrowed prediction intervals over existing baselines across various settings.  ( 3 min )
    Deep Learning for Steganalysis of Diverse Data Types: A review of methods, taxonomy, challenges and future directions
    arXiv:2308.04522v3 Announce Type: replace-cross Abstract: Steganography and steganalysis are two interrelated aspects of the field of information security. Steganography seeks to conceal communications, whereas steganalysis is aimed to either find them or even, if possible, recover the data they contain. Steganography and steganalysis have attracted a great deal of interest, particularly from law enforcement. Steganography is often used by cybercriminals and even terrorists to avoid being captured while in possession of incriminating evidence, even encrypted, since cryptography is prohibited or restricted in many countries. Therefore, knowledge of cutting-edge techniques to uncover concealed information is crucial in exposing illegal acts. Over the last few years, a number of strong and reliable steganography and steganalysis techniques have been introduced in the literature. This review paper provides a comprehensive overview of deep learning-based steganalysis techniques used to detect hidden information within digital media. The paper covers all types of cover in steganalysis, including image, audio, and video, and discusses the most commonly used deep learning techniques. In addition, the paper explores the use of more advanced deep learning techniques, such as deep transfer learning (DTL) and deep reinforcement learning (DRL), to enhance the performance of steganalysis systems. The paper provides a systematic review of recent research in the field, including data sets and evaluation metrics used in recent studies. It also presents a detailed analysis of DTL-based steganalysis approaches and their performance on different data sets. The review concludes with a discussion on the current state of deep learning-based steganalysis, challenges, and future research directions.  ( 3 min )
    An interpretable neural network-based non-proportional odds model for ordinal regression
    arXiv:2303.17823v4 Announce Type: replace-cross Abstract: This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is defined for both continuous and discrete responses, whereas standard methods typically treat the ordered continuous variables as if they are discrete, (2) instead of estimating response-dependent finite-dimensional coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets.  ( 2 min )
    MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
    arXiv:2309.10691v3 Announce Type: replace-cross Abstract: To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive users' natural language feedback simulated by GPT-4. We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural language feedback. (b) Better single-turn performance does not guarantee better multi-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.  ( 3 min )
    Whiteness-based bilevel learning of regularization parameters in imaging
    arXiv:2403.07026v1 Announce Type: cross Abstract: We consider an unsupervised bilevel optimization strategy for learning regularization parameters in the context of imaging inverse problems in the presence of additive white Gaussian noise. Compared to supervised and semi-supervised metrics relying either on the prior knowledge of reference data and/or on some (partial) knowledge on the noise statistics, the proposed approach optimizes the whiteness of the residual between the observed data and the observation model with no need of ground-truth data.We validate the approach on standard Total Variation-regularized image deconvolution problems which show that the proposed quality metric provides estimates close to the mean-square error oracle and to discrepancy-based principles.  ( 2 min )
    SATformer: Transformer-Based UNSAT Core Learning
    arXiv:2209.00953v2 Announce Type: replace-cross Abstract: This paper introduces SATformer, a novel Transformer-based approach for the Boolean Satisfiability (SAT) problem. Rather than solving the problem directly, SATformer approaches the problem from the opposite direction by focusing on unsatisfiability. Specifically, it models clause interactions to identify any unsatisfiable sub-problems. Using a graph neural network, we convert clauses into clause embeddings and employ a hierarchical Transformer-based model to understand clause correlation. SATformer is trained through a multi-task learning approach, using the single-bit satisfiability result and the minimal unsatisfiable core (MUC) for UNSAT problems as clause supervision. As an end-to-end learning-based satisfiability classifier, the performance of SATformer surpasses that of NeuroSAT significantly. Furthermore, we integrate the clause predictions made by SATformer into modern heuristic-based SAT solvers and validate our approach with a logic equivalence checking task. Experimental results show that our SATformer can decrease the runtime of existing solvers by an average of 21.33%.  ( 2 min )
    SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
    arXiv:2306.03403v2 Announce Type: replace-cross Abstract: As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of original $360^{\circ}$ data. Therefore, their performance will drop a lot when inputting panoramic images with the 3D disturbance. To be more robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge. Specifically, a spherical geometry-aware framework is proposed for PASS. It includes three modules, i.e., spherical geometry-aware image projection, spherical deformable patch embedding, and a panorama-aware loss, which takes input images with 3D disturbance into account, adds a spherical geometry-aware constraint on the existing deformable patch embedding, and indicates the pixel density of original $360^{\circ}$ data, respectively. Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude. Our code and supplementary material are available at https://github.com/TencentARC/SGAT4PASS.  ( 3 min )
    APOLLO: An Optimized Training Approach for Long-form Numerical Reasoning
    arXiv:2212.07249v3 Announce Type: replace-cross Abstract: Long-form numerical reasoning in financial analysis aims to generate a reasoning program to calculate the correct answer for a given question. Previous work followed a retriever-generator framework, where the retriever selects key facts from a long-form document, and the generator generates a reasoning program based on retrieved facts. However, they treated all facts equally without considering the different contributions of facts with and without numbers. Meanwhile, the program consistency were ignored under supervised training, resulting in lower training accuracy and diversity. To solve these problems, we proposed APOLLO to improve the long-form numerical reasoning framework. For the retriever, we adopt a number-aware negative sampling strategy to enable the retriever to be more discriminative on key numerical facts. For the generator, we design consistency-based reinforcement learning and target program augmentation strategy based on the consistency of program execution results. Experimental results on the FinQA and ConvFinQA leaderboard verify the effectiveness of our proposed method, achieving the new state-of-the-art.  ( 2 min )
    Towards a Dynamic Future with Adaptable Computing and Network Convergence (ACNC)
    arXiv:2403.07573v1 Announce Type: cross Abstract: In the context of advancing 6G, a substantial paradigm shift is anticipated, highlighting comprehensive everything-to-everything interactions characterized by numerous connections and stringent adherence to Quality of Service/Experience (QoS/E) prerequisites. The imminent challenge stems from resource scarcity, prompting a deliberate transition to Computing-Network Convergence (CNC) as an auspicious approach for joint resource orchestration. While CNC-based mechanisms have garnered attention, their effectiveness in realizing future services, particularly in use cases like the Metaverse, may encounter limitations due to the continually changing nature of users, services, and resources. Hence, this paper presents the concept of Adaptable CNC (ACNC) as an autonomous Machine Learning (ML)-aided mechanism crafted for the joint orchestration of computing and network resources, catering to dynamic and voluminous user requests with stringent requirements. ACNC encompasses two primary functionalities: state recognition and context detection. Given the intricate nature of the user-service-computing-network space, the paper employs dimension reduction to generate live, holistic, abstract system states in a hierarchical structure. To address the challenges posed by dynamic changes, Continual Learning (CL) is employed, classifying the system state into contexts controlled by dedicated ML agents, enabling them to operate efficiently. These two functionalities are intricately linked within a closed loop overseen by the End-to-End (E2E) orchestrator to allocate resources. The paper introduces the components of ACNC, proposes a Metaverse scenario to exemplify ACNC's role in resource provisioning with Segment Routing v6 (SRv6), outlines ACNC's workflow, details a numerical analysis for efficiency assessment, and concludes with discussions on relevant challenges and potential avenues for future research.  ( 3 min )
    SIFiD: Reassess Summary Factual Inconsistency Detection with LLM
    arXiv:2403.07557v1 Announce Type: cross Abstract: Ensuring factual consistency between the summary and the original document is paramount in summarization tasks. Consequently, considerable effort has been dedicated to detecting inconsistencies. With the advent of Large Language Models (LLMs), recent studies have begun to leverage their advanced language understanding capabilities for inconsistency detection. However, early attempts have shown that LLMs underperform traditional models due to their limited ability to follow instructions and the absence of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performances of GPT-3.5 and GPT-4. To advance research in LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document) that identify key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.  ( 2 min )
    Better than classical? The subtle art of benchmarking quantum machine learning models
    arXiv:2403.07059v1 Announce Type: cross Abstract: Benchmarking models via classical simulations is one of the main ways to judge ideas in quantum machine learning before noise-free hardware is available. However, the huge impact of the experimental design on the results, the small scales within reach today, as well as narratives influenced by the commercialisation of quantum technologies make it difficult to gain robust insights. To facilitate better decision-making we develop an open-source package based on the PennyLane software framework and use it to conduct a large-scale study that systematically tests 12 popular quantum machine learning models on 6 binary classification tasks used to create 160 individual datasets. We find that overall, out-of-the-box classical machine learning models outperform the quantum classifiers. Moreover, removing entanglement from a quantum model often results in as good or better performance, suggesting that "quantumness" may not be the crucial ingredient for the small learning tasks considered here. Our benchmarks also unlock investigations beyond simplistic leaderboard comparisons, and we identify five important questions for quantum model design that follow from our results.  ( 2 min )
    Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data
    arXiv:2403.07797v1 Announce Type: new Abstract: Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non publicly assisted synthetic data generation mechanisms even when the public data distribution is biased.  ( 2 min )
    Neural DAEs: Constrained neural networks
    arXiv:2211.14302v4 Announce Type: replace Abstract: This article investigates the effect of explicitly adding auxiliary algebraic trajectory information to neural networks for dynamical systems. We draw inspiration from the field of differential-algebraic equations and differential equations on manifolds and implement related methods in residual neural networks, despite some fundamental scenario differences. Constraint or auxiliary information effects are incorporated through stabilization as well as projection methods, and we show when to use which method based on experiments involving simulations of multi-body pendulums and molecular dynamics scenarios. Several of our methods are easy to implement in existing code and have limited impact on training performance while giving significant boosts in terms of inference.  ( 2 min )
    TeleMoMa: A Modular and Versatile Teleoperation System for Mobile Manipulation
    arXiv:2403.07869v1 Announce Type: cross Abstract: A critical bottleneck limiting imitation learning in robotics is the lack of data. This problem is more severe in mobile manipulation, where collecting demonstrations is harder than in stationary manipulation due to the lack of available and easy-to-use teleoperation interfaces. In this work, we demonstrate TeleMoMa, a general and modular interface for whole-body teleoperation of mobile manipulators. TeleMoMa unifies multiple human interfaces including RGB and depth cameras, virtual reality controllers, keyboard, joysticks, etc., and any combination thereof. In its more accessible version, TeleMoMa works using simply vision (e.g., an RGB-D camera), lowering the entry bar for humans to provide mobile manipulation demonstrations. We demonstrate the versatility of TeleMoMa by teleoperating several existing mobile manipulators - PAL Tiago++, Toyota HSR, and Fetch - in simulation and the real world. We demonstrate the quality of the demonstrations collected with TeleMoMa by training imitation learning policies for mobile manipulation tasks involving synchronized whole-body motion. Finally, we also show that TeleMoMa's teleoperation channel enables teleoperation on site, looking at the robot, or remote, sending commands and observations through a computer network, and perform user studies to evaluate how easy it is for novice users to learn to collect demonstrations with different combinations of human interfaces enabled by our system. We hope TeleMoMa becomes a helpful tool for the community enabling researchers to collect whole-body mobile manipulation demonstrations. For more information and video results, https://robin-lab.cs.utexas.edu/telemoma-web.  ( 3 min )
    CSC-Unet: A Novel Convolutional Sparse Coding Strategy Based Neural Network for Semantic Segmentation
    arXiv:2108.00408v2 Announce Type: replace-cross Abstract: It is a challenging task to accurately perform semantic segmentation due to the complexity of real picture scenes. Many semantic segmentation methods based on traditional deep learning insufficiently captured the semantic and appearance information of images, which put limit on their generality and robustness for various application scenes. In this paper, we proposed a novel strategy that reformulated the popularly-used convolution operation to multi-layer convolutional sparse coding block to ease the aforementioned deficiency. This strategy can be possibly used to significantly improve the segmentation performance of any semantic segmentation model that involves convolutional operations. To prove the effectiveness of our idea, we chose the widely-used U-Net model for the demonstration purpose, and we designed CSC-Unet model series based on U-Net. Through extensive analysis and experiments, we provided credible evidence showing that the multi-layer convolutional sparse coding block enables semantic segmentation model to converge faster, can extract finer semantic and appearance information of images, and improve the ability to recover spatial detail information. The best CSC-Unet model significantly outperforms the results of the original U-Net on three public datasets with different scenarios, i.e., 87.14% vs. 84.71% on DeepCrack dataset, 68.91% vs. 67.09% on Nuclei dataset, and 53.68% vs. 48.82% on CamVid dataset, respectively.  ( 3 min )
    Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
    arXiv:2403.07585v1 Announce Type: cross Abstract: The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources that exceed those of a single GPU, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, thereby increasing the proportion of communication in the overall training time. Therefore, optimizing communication for distributed training has become an urgent issue. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances with this three-layer paradigm. We find that layers in the current three-layer paradigm are relatively independent, but there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we further advocate a communication-efficient five-layer paradigm underlining opportunities for collaboration designs and look forward to the perspectives of "Vertical", "Horizontal", "Intra-Inter" and "Host-Net" collaboration designs. We hope this article can shed some light on future research on communication optimization for distributed training.  ( 2 min )
    Distilling the Knowledge in Data Pruning
    arXiv:2403.07854v1 Announce Type: cross Abstract: With the increasing size of datasets used for training neural networks, data pruning becomes an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and on all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.  ( 2 min )
    Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
    arXiv:2310.06117v2 Announce Type: replace Abstract: We present Step-Back Prompting, a simple prompting technique that enables LLMs to do abstractions to derive high-level concepts and first principles from instances containing specific details. Using the concepts and principles to guide reasoning, LLMs significantly improve their abilities in following a correct reasoning path towards the solution. We conduct experiments of Step-Back Prompting with PaLM-2L, GPT-4 and Llama2-70B models, and observe substantial performance gains on various challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, Step-Back Prompting improves PaLM-2L performance on MMLU (Physics and Chemistry) by 7% and 11% respectively, TimeQA by 27%, and MuSiQue by 7%.  ( 2 min )
    Customizable Avatars with Dynamic Facial Action Coded Expressions (CADyFACE) for Improved User Engagement
    arXiv:2403.07314v1 Announce Type: cross Abstract: Customizable 3D avatar-based facial expression stimuli may improve user engagement in behavioral biomarker discovery and therapeutic intervention for autism, Alzheimer's disease, facial palsy, and more. However, there is a lack of customizable avatar-based stimuli with Facial Action Coding System (FACS) action unit (AU) labels. Therefore, this study focuses on (1) FACS-labeled, customizable avatar-based expression stimuli for maintaining subjects' engagement, (2) learning-based measurements that quantify subjects' facial responses to such stimuli, and (3) validation of constructs represented by stimulus-measurement pairs. We propose Customizable Avatars with Dynamic Facial Action Coded Expressions (CADyFACE) labeled with AUs by a certified FACS expert. To measure subjects' AUs in response to CADyFACE, we propose a novel Beta-guided Correlation and Multi-task Expression learning neural network (BeCoME-Net) for multi-label AU detection. The beta-guided correlation loss encourages feature correlation with AUs while discouraging correlation with subject identities for improved generalization. We train BeCoME-Net for unilateral and bilateral AU detection and compare with state-of-the-art approaches. To assess construct validity of CADyFACE and BeCoME-Net, twenty healthy adult volunteers complete expression recognition and mimicry tasks in an online feasibility study while webcam-based eye-tracking and video are collected. We test validity of multiple constructs, including face preference during recognition and AUs during mimicry.  ( 3 min )
    Exploring Challenges in Deep Learning of Single-Station Ground Motion Records
    arXiv:2403.07569v1 Announce Type: cross Abstract: Contemporary deep learning models have demonstrated promising results across various applications within seismology and earthquake engineering. These models rely primarily on utilizing ground motion records for tasks such as earthquake event classification, localization, earthquake early warning systems, and structural health monitoring. However, the extent to which these models effectively learn from these complex time-series signals has not been thoroughly analyzed. In this study, our objective is to evaluate the degree to which auxiliary information, such as seismic phase arrival times or seismic station distribution within a network, dominates the process of deep learning from ground motion records, potentially hindering its effectiveness. We perform a hyperparameter search on two deep learning models to assess their effectiveness in deep learning from ground motion records while also examining the impact of auxiliary information on model performance. Experimental results reveal a strong reliance on the highly correlated P and S phase arrival information. Our observations highlight a potential gap in the field, indicating an absence of robust methodologies for deep learning of single-station ground motion recordings independent of any auxiliary information.  ( 2 min )
    When should we prefer Decision Transformers for Offline Reinforcement Learning?
    arXiv:2305.14550v3 Announce Type: replace Abstract: Offline reinforcement learning (RL) allows agents to learn effective, return-maximizing policies from a static dataset. Three popular algorithms for offline RL are Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT), from the class of Q-Learning, Imitation Learning, and Sequence Modeling respectively. A key open question is: which algorithm is preferred under what conditions? We study this question empirically by exploring the performance of these algorithms across the commonly used D4RL and Robomimic benchmarks. We design targeted experiments to understand their behavior concerning data suboptimality, task complexity, and stochasticity. Our key findings are: (1) DT requires more data than CQL to learn competitive policies but is more robust; (2) DT is a substantially better choice than both CQL and BC in sparse-reward and low-quality data settings; (3) DT and BC are preferable as task horizon increases, or when data is obtained from human demonstrators; and (4) CQL excels in situations characterized by the combination of high stochasticity and low data quality. We also investigate architectural choices and scaling trends for DT on Atari and D4RL and make design/scaling recommendations. We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.  ( 3 min )
    Majorization-minimization for Sparse Nonnegative Matrix Factorization with the $\beta$-divergence
    arXiv:2207.06316v4 Announce Type: replace Abstract: This article introduces new multiplicative updates for nonnegative matrix factorization with the $\beta$-divergence and sparse regularization of one of the two factors (say, the activation matrix). It is well known that the norm of the other factor (the dictionary matrix) needs to be controlled in order to avoid an ill-posed formulation. Standard practice consists in constraining the columns of the dictionary to have unit norm, which leads to a nontrivial optimization problem. Our approach leverages a reparametrization of the original problem into the optimization of an equivalent scale-invariant objective function. From there, we derive block-descent majorization-minimization algorithms that result in simple multiplicative updates for either $\ell_{1}$-regularization or the more "aggressive" log-regularization. In contrast with other state-of-the-art methods, our algorithms are universal in the sense that they can be applied to any $\beta$-divergence (i.e., any value of $\beta$) and that they come with convergence guarantees. We report numerical comparisons with existing heuristic and Lagrangian methods using various datasets: face images, an audio spectrogram, hyperspectral data, and song play counts. We show that our methods obtain solutions of similar quality at convergence (similar objective values) but with significantly reduced CPU times.  ( 3 min )
    A Flexible Cell Classification for ML Projects in Jupyter Notebooks
    arXiv:2403.07562v1 Announce Type: cross Abstract: Jupyter Notebook is an interactive development environment commonly used for rapid experimentation of machine learning (ML) solutions. Describing the ML activities performed along code cells improves the readability and understanding of Notebooks. Manual annotation of code cells is time-consuming and error-prone. Therefore, tools have been developed that classify the cells of a notebook concerning the ML activity performed in them. However, the current tools are not flexible, as they work based on look-up tables that have been created, which map function calls of commonly used ML libraries to ML activities. These tables must be manually adjusted to account for new or changed libraries. This paper presents a more flexible approach to cell classification based on a hybrid classification approach that combines a rule-based and a decision tree classifier. We discuss the design rationales and describe the developed classifiers in detail. We implemented the new flexible cell classification approach in a tool called JupyLabel. Its evaluation and the obtained metric scores regarding precision, recall, and F1-score are discussed. Additionally, we compared JupyLabel with HeaderGen, an existing cell classification tool. We were able to show that the presented flexible cell classification approach outperforms this tool significantly.  ( 2 min )
    Fusing Climate Data Products using a Spatially Varying Autoencoder
    arXiv:2403.07822v1 Announce Type: cross Abstract: Autoencoders are powerful machine learning models used to compress information from multiple data sources. However, autoencoders, like all artificial neural networks, are often unidentifiable and uninterpretable. This research focuses on creating an identifiable and interpretable autoencoder that can be used to meld and combine climate data products. The proposed autoencoder utilizes a Bayesian statistical framework, allowing for probabilistic interpretations while also varying spatially to capture useful spatial patterns across the various data products. Constraints are placed on the autoencoder as it learns patterns in the data, creating an interpretable consensus that includes the important features from each input. We demonstrate the utility of the autoencoder by combining information from multiple precipitation products in High Mountain Asia.  ( 2 min )
    Fast and Simple Explainability for Point Cloud Networks
    arXiv:2403.07706v1 Announce Type: cross Abstract: We propose a fast and simple explainable AI (XAI) method for point cloud data. It computes pointwise importance with respect to a trained network downstream task. This allows better understanding of the network properties, which is imperative for safety-critical applications. In addition to debugging and visualization, our low computational complexity facilitates online feedback to the network at inference. This can be used to reduce uncertainty and to increase robustness. In this work, we introduce \emph{Feature Based Interpretability} (FBI), where we compute the features' norm, per point, before the bottleneck. We analyze the use of gradients and post- and pre-bottleneck strategies, showing pre-bottleneck is preferred, in terms of smoothness and ranking. We obtain at least three orders of magnitude speedup, compared to current XAI methods, thus, scalable for big point clouds or large-scale architectures. Our approach achieves SOTA results, in terms of classification explainability. We demonstrate how the proposed measure is helpful in analyzing and characterizing various aspects of 3D learning, such as rotation invariance, robustness to out-of-distribution (OOD) outliers or domain shift and dataset bias.  ( 2 min )
    Convergence of Some Convex Message Passing Algorithms to a Fixed Point
    arXiv:2403.07004v1 Announce Type: cross Abstract: A popular approach to the MAP inference problem in graphical models is to minimize an upper bound obtained from a dual linear programming or Lagrangian relaxation by (block-)coordinate descent. Examples of such algorithms are max-sum diffusion and sequential tree-reweighted message passing. Convergence properties of these methods are currently not fully understood. They have been proved to converge to the set characterized by local consistency of active constraints, with unknown convergence rate; however, it was not clear if the iterates converge at all (to any single point). We prove a stronger result (which was conjectured before but never proved): the iterates converge to a fixed point of the algorithm. Moreover, we show that they achieve precision $\varepsilon>0$ in $\mathcal{O}(1/\varepsilon)$ iterations. We first prove this for a version of coordinate descent applied to a general piecewise-affine convex objective, using a novel proof technique. Then we demonstrate the generality of this approach by reducing some popular coordinate-descent algorithms to this problem. Finally we show that, in contrast to our main result, a similar version of coordinate descent applied to a constrained optimization problem need not converge.  ( 2 min )
    A Non-Intrusive Neural Quality Assessment Model for Surface Electromyography Signals
    arXiv:2402.05482v2 Announce Type: replace-cross Abstract: In practical scenarios involving the measurement of surface electromyography (sEMG) in muscles, particularly those areas near the heart, one of the primary sources of contamination is the presence of electrocardiogram (ECG) signals. To assess the quality of real-world sEMG data more effectively, this study proposes QASE-net, a new non-intrusive model that predicts the SNR of sEMG signals. QASE-net combines CNN-BLSTM with attention mechanisms and follows an end-to-end training strategy. Our experimental framework utilizes real-world sEMG and ECG data from two open-access databases, the Non-Invasive Adaptive Prosthetics Database and the MIT-BIH Normal Sinus Rhythm Database, respectively. The experimental results demonstrate the superiority of QASE-net over the previous assessment model, exhibiting significantly reduced prediction errors and notably higher linear correlations with the ground truth. These findings show the potential of QASE-net to substantially enhance the reliability and precision of sEMG quality assessment in practical applications.  ( 2 min )
    Convolutional Neural Networks on Graphs with Chebyshev Approximation, Revisited
    arXiv:2202.03580v5 Announce Type: replace Abstract: Designing spectral convolutional networks is a challenging problem in graph learning. ChebNet, one of the early attempts, approximates the spectral graph convolutions using Chebyshev polynomials. GCN simplifies ChebNet by utilizing only the first two Chebyshev polynomials while still outperforming it on real-world datasets. GPR-GNN and BernNet demonstrate that the Monomial and Bernstein bases also outperform the Chebyshev basis in terms of learning the spectral graph convolutions. Such conclusions are counter-intuitive in the field of approximation theory, where it is established that the Chebyshev polynomial achieves the optimum convergent rate for approximating a function. In this paper, we revisit the problem of approximating the spectral graph convolutions with Chebyshev polynomials. We show that ChebNet's inferior performance is primarily due to illegal coefficients learnt by ChebNet approximating analytic filter functions, which leads to over-fitting. We then propose ChebNetII, a new GNN model based on Chebyshev interpolation, which enhances the original Chebyshev polynomial approximation while reducing the Runge phenomenon. We conducted an extensive experimental study to demonstrate that ChebNetII can learn arbitrary graph convolutions and achieve superior performance in both full- and semi-supervised node classification tasks. Most notably, we scale ChebNetII to a billion graph ogbn-papers100M, showing that spectral-based GNNs have superior performance. Our code is available at https://github.com/ivam-he/ChebNetII.  ( 3 min )
    EyePreserve: Identity-Preserving Iris Synthesis
    arXiv:2312.12028v3 Announce Type: replace-cross Abstract: Synthesis of same-identity biometric iris images, both for existing and non-existing identities while preserving the identity across a wide range of pupil sizes, is complex due to intricate iris muscle constriction mechanism, requiring a precise model of iris non-linear texture deformations to be embedded into the synthesis pipeline. This paper presents the first method of fully data-driven, identity-preserving, pupil size-varying s ynthesis of iris images. This approach is capable of synthesizing images of irises with different pupil sizes representing non-existing identities as well as non-linearly deforming the texture of iris images of existing subjects given the segmentation mask of the target iris image. Iris recognition experiments suggest that the proposed deformation model not only preserves the identity when changing the pupil size but offers better similarity between same-identity iris samples with significant differences in pupil size, compared to state-of-the-art linear and non-linear (bio-mechanical-based) iris deformation models. Two immediate applications of the proposed approach are: (a) synthesis of, or enhancement of the existing biometric datasets for iris recognition, mimicking those acquired with iris sensors, and (b) helping forensic human experts in examining iris image pairs with significant differences in pupil dilation. Source codes and weights of the models are made available with the paper.  ( 2 min )
    Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation
    arXiv:2403.07605v1 Announce Type: cross Abstract: In text-to-image generation, using negative prompts, which describe undesirable image characteristics, can significantly boost image quality. However, producing good negative prompts is manual and tedious. To address this, we propose NegOpt, a novel method for optimizing negative prompt generation toward enhanced image generation, using supervised fine-tuning and reinforcement learning. Our combined approach results in a substantial increase of 25% in Inception Score compared to other approaches and surpasses ground-truth negative prompts from the test set. Furthermore, with NegOpt we can preferentially optimize the metrics most important to us. Finally, we construct Negative Prompts DB, a dataset of negative prompts.  ( 2 min )
    PMBO: Enhancing Black-Box Optimization through Multivariate Polynomial Surrogates
    arXiv:2403.07485v1 Announce Type: cross Abstract: We introduce a surrogate-based black-box optimization method, termed Polynomial-model-based optimization (PMBO). The algorithm alternates polynomial approximation with Bayesian optimization steps, using Gaussian processes to model the error between the objective and its polynomial fit. We describe the algorithmic design of PMBO and compare the results of the performance of PMBO with several optimization methods for a set of analytic test functions. The results show that PMBO outperforms the classic Bayesian optimization and is robust with respect to the choice of its correlation function family and its hyper-parameter setting, which, on the contrary, need to be carefully tuned in classic Bayesian optimization. Remarkably, PMBO performs comparably with state-of-the-art evolutionary algorithms such as the Covariance Matrix Adaptation -- Evolution Strategy (CMA-ES). This finding suggests that PMBO emerges as the pivotal choice among surrogate-based optimization methods when addressing low-dimensional optimization problems. Hereby, the simple nature of polynomials opens the opportunity for interpretation and analysis of the inferred surrogate model, providing a macroscopic perspective on the landscape of the objective function.  ( 2 min )
    Physics-Transfer Learning for Material Strength Screening
    arXiv:2403.07526v1 Announce Type: cross Abstract: The strength of materials, like many problems in the natural sciences, spans multiple length and time scales, and the solution has to balance accuracy and performance. Peierls stress is one of the central concepts in crystal plasticity that measures the strength through the resistance of a dislocation to plastic flow. The determination of Peierls stress involves a multiscale nature depending on both elastic lattice responses and the energy landscape of crystal slips. Material screening by strength via the Peierls stress from first-principles calculations is computationally intractable for the nonlocal characteristics of dislocations, and not included in the state-of-the-art computational material databases. In this work, we propose a physics-transfer framework to learn the physics of crystal plasticity from empirical atomistic simulations and then predict the Peierls stress from chemically accurate density functional theory-based calculations of material parameters. Notably, the strengths of single-crystalline metals can be predicted from a few single-point calculations for the deformed lattice and on the {\gamma} surface, allowing efficient, high-throughput screening for material discovery. Uncertainty quantification is carried out to assess the accuracy of models and sources of errors, showing reduced physical and system uncertainties in the predictions by elevating the fidelity of training models. This physics-transfer framework can be generalized to other problems facing the accuracy-performance dilemma, by harnessing the hierarchy of physics in the multiscale models of materials science.  ( 2 min )
    Approaching Rate-Distortion Limits in Neural Compression with Lattice Transform Coding
    arXiv:2403.07320v1 Announce Type: cross Abstract: Neural compression has brought tremendous progress in designing lossy compressors with good rate-distortion (RD) performance at low complexity. Thus far, neural compression design involves transforming the source to a latent vector, which is then rounded to integers and entropy coded. While this approach has been shown to be optimal in a one-shot sense on certain sources, we show that it is highly sub-optimal on i.i.d. sequences, and in fact always recovers scalar quantization of the original source sequence. We demonstrate that the sub-optimality is due to the choice of quantization scheme in the latent space, and not the transform design. By employing lattice quantization instead of scalar quantization in the latent space, we demonstrate that Lattice Transform Coding (LTC) is able to recover optimal vector quantization at various dimensions and approach the asymptotically-achievable rate-distortion function at reasonable complexity. On general vector sources, LTC improves upon standard neural compressors in one-shot coding performance. LTC also enables neural compressors that perform block coding on i.i.d. vector sources, which yields coding gain over optimal one-shot coding.  ( 2 min )
    Imbalance-aware Presence-only Loss Function for Species Distribution Modeling
    arXiv:2403.07472v1 Announce Type: new Abstract: In the face of significant biodiversity decline, species distribution models (SDMs) are essential for understanding the impact of climate change on species habitats by connecting environmental conditions to species occurrences. Traditionally limited by a scarcity of species observations, these models have significantly improved in performance through the integration of larger datasets provided by citizen science initiatives. However, they still suffer from the strong class imbalance between species within these datasets, often resulting in the penalization of rare species--those most critical for conservation efforts. To tackle this issue, this study assesses the effectiveness of training deep learning models using a balanced presence-only loss function on large citizen science-based datasets. We demonstrate that this imbalance-aware loss function outperforms traditional loss functions across various datasets and tasks, particularly in accurately modeling rare species with limited observations.  ( 2 min )
    Equipping Computational Pathology Systems with Artifact Processing Pipelines: A Showcase for Computation and Performance Trade-offs
    arXiv:2403.07743v1 Announce Type: cross Abstract: Histopathology is a gold standard for cancer diagnosis under a microscopic examination. However, histological tissue processing procedures result in artifacts, which are ultimately transferred to the digitized version of glass slides, known as whole slide images (WSIs). Artifacts are diagnostically irrelevant areas and may result in wrong deep learning (DL) algorithms predictions. Therefore, detecting and excluding artifacts in the computational pathology (CPATH) system is essential for reliable automated diagnosis. In this paper, we propose a mixture of experts (MoE) scheme for detecting five notable artifacts, including damaged tissue, blur, folded tissue, air bubbles, and histologically irrelevant blood from WSIs. First, we train independent binary DL models as experts to capture particular artifact morphology. Then, we ensemble their predictions using a fusion mechanism. We apply probabilistic thresholding over the final probability distribution to improve the sensitivity of the MoE. We developed DL pipelines using two MoEs and two multiclass models of state-of-the-art deep convolutional neural networks (DCNNs) and vision transformers (ViTs). DCNNs-based MoE and ViTs-based MoE schemes outperformed simpler multiclass models and were tested on datasets from different hospitals and cancer types, where MoE using DCNNs yielded the best results. The proposed MoE yields 86.15% F1 and 97.93% sensitivity scores on unseen data, retaining less computational cost for inference than MoE using ViTs. This best performance of MoEs comes with relatively higher computational trade-offs than multiclass models. The proposed artifact detection pipeline will not only ensure reliable CPATH predictions but may also provide quality control.  ( 3 min )
    Characterization of Large Language Model Development in the Datacenter
    arXiv:2403.07648v1 Announce Type: cross Abstract: Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs, often riddled with numerous challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery. (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.  ( 2 min )
    Anderson acceleration for iteratively reweighted $\ell_1$ algorithm
    arXiv:2403.07271v1 Announce Type: cross Abstract: Iteratively reweighted L1 (IRL1) algorithm is a common algorithm for solving sparse optimization problems with nonconvex and nonsmooth regularization. The development of its acceleration algorithm, often employing Nesterov acceleration, has sparked significant interest. Nevertheless, the convergence and complexity analysis of these acceleration algorithms consistently poses substantial challenges. Recently, Anderson acceleration has gained prominence owing to its exceptional performance for speeding up fixed-point iteration, with numerous recent studies applying it to gradient-based algorithms. Motivated by the powerful impact of Anderson acceleration, we propose an Anderson-accelerated IRL1 algorithm and establish its local linear convergence rate. We extend this convergence result, typically observed in smooth settings, to a nonsmooth scenario. Importantly, our theoretical results do not depend on the Kurdyka-Lojasiewicz condition, a necessary condition in existing Nesterov acceleration-based algorithms. Furthermore, to ensure global convergence, we introduce a globally convergent Anderson accelerated IRL1 algorithm by incorporating a classical nonmonotone line search condition. Experimental results indicate that our algorithm outperforms existing Nesterov acceleration-based algorithms.  ( 2 min )
    Dataset Condensation for Time Series Classification via Dual Domain Matching
    arXiv:2403.07245v1 Announce Type: new Abstract: Time series data has been demonstrated to be crucial in various research fields. The management of large quantities of time series data presents challenges in terms of deep learning tasks, particularly for training a deep neural network. Recently, a technique named \textit{Dataset Condensation} has emerged as a solution to this problem. This technique generates a smaller synthetic dataset that has comparable performance to the full real dataset in downstream tasks such as classification. However, previous methods are primarily designed for image and graph datasets, and directly adapting them to the time series dataset leads to suboptimal performance due to their inability to effectively leverage the rich information inherent in time series data, particularly in the frequency domain. In this paper, we propose a novel framework named Dataset \textit{\textbf{Cond}}ensation for \textit{\textbf{T}}ime \textit{\textbf{S}}eries \textit{\textbf{C}}lassification via Dual Domain Matching (\textbf{CondTSC}) which focuses on the time series classification dataset condensation task. Different from previous methods, our proposed framework aims to generate a condensed dataset that matches the surrogate objectives in both the time and frequency domains. Specifically, CondTSC incorporates multi-view data augmentation, dual domain training, and dual surrogate objectives to enhance the dataset condensation process in the time and frequency domains. Through extensive experiments, we demonstrate the effectiveness of our proposed framework, which outperforms other baselines and learns a condensed synthetic dataset that exhibits desirable characteristics such as conforming to the distribution of the original data.  ( 3 min )
    Adaptive Gain Scheduling using Reinforcement Learning for Quadcopter Control
    arXiv:2403.07216v1 Announce Type: cross Abstract: The paper presents a technique using reinforcement learning (RL) to adapt the control gains of a quadcopter controller. Specifically, we employed Proximal Policy Optimization (PPO) to train a policy which adapts the gains of a cascaded feedback controller in-flight. The primary goal of this controller is to minimize tracking error while following a specified trajectory. The paper's key objective is to analyze the effectiveness of the adaptive gain policy and compare it to the performance of a static gain control algorithm, where the Integral Squared Error and Integral Time Squared Error are used as metrics. The results show that the adaptive gain scheme achieves over 40$\%$ decrease in tracking error as compared to the static gain controller.  ( 2 min )
    Towards Graph Foundation Models for Personalization
    arXiv:2403.07478v1 Announce Type: cross Abstract: In the realm of personalization, integrating diverse information sources such as consumption signals and content-based representations is becoming increasingly critical to build state-of-the-art solutions. In this regard, two of the biggest trends in research around this subject are Graph Neural Networks (GNNs) and Foundation Models (FMs). While GNNs emerged as a popular solution in industry for powering personalization at scale, FMs have only recently caught attention for their promising performance in personalization tasks like ranking and retrieval. In this paper, we present a graph-based foundation modeling approach tailored to personalization. Central to this approach is a Heterogeneous GNN (HGNN) designed to capture multi-hop content and consumption relationships across a range of recommendable item types. To ensure the generality required from a Foundation Model, we employ a Large Language Model (LLM) text-based featurization of nodes that accommodates all item types, and construct the graph using co-interaction signals, which inherently transcend content specificity. To facilitate practical generalization, we further couple the HGNN with an adaptation mechanism based on a two-tower (2T) architecture, which also operates agnostically to content type. This multi-stage approach ensures high scalability; while the HGNN produces general purpose embeddings, the 2T component models in a continuous space the sheer size of user-item interaction data. Our comprehensive approach has been rigorously tested and proven effective in delivering recommendations across a diverse array of products within a real-world, industrial audio streaming platform.  ( 2 min )
    LaB-GATr: geometric algebra transformers for large biomedical surface and volume meshes
    arXiv:2403.07536v1 Announce Type: cross Abstract: Many anatomical structures can be described by surface or volume meshes. Machine learning is a promising tool to extract information from these 3D models. However, high-fidelity meshes often contain hundreds of thousands of vertices, which creates unique challenges in building deep neural network architectures. Furthermore, patient-specific meshes may not be canonically aligned which limits the generalisation of machine learning algorithms. We propose LaB-GATr, a transfomer neural network with geometric tokenisation that can effectively learn with large-scale (bio-)medical surface and volume meshes through sequence compression and interpolation. Our method extends the recently proposed geometric algebra transformer (GATr) and thus respects all Euclidean symmetries, i.e. rotation, translation and reflection, effectively mitigating the problem of canonical alignment between patients. LaB-GATr achieves state-of-the-art results on three tasks in cardiovascular hemodynamics modelling and neurodevelopmental phenotype prediction, featuring meshes of up to 200,000 vertices. Our results demonstrate that LaB-GATr is a powerful architecture for learning with high-fidelity meshes which has the potential to enable interesting downstream applications. Our implementation is publicly available.  ( 2 min )
    Reconstructions of Jupiter's magnetic field using physics informed neural networks
    arXiv:2403.07507v1 Announce Type: cross Abstract: Magnetic sounding using data collected from the Juno mission can be used to provide constraints on Jupiter's interior. However, inwards continuation of reconstructions assuming zero electrical conductivity and a representation in spherical harmonics are limited by the enhancement of noise at small scales. In this paper we describe new reconstructions of Jupiter's internal magnetic field based on physics-informed neural networks and either the first 33 (PINN33) or the first 50 (PINN50) of Juno's orbits. The method can resolve local structures, and allows for weak ambient electrical currents. Compared with other methods, our reconstructions of Jupiter's magnetic field both on and above the surface are similar, and we achieve a similar fit to the Juno data. However, our models are not hampered by noise at depth, and so offer a much clearer picture of the interior structure. We estimate that the dynamo boundary is at a fractional radius of 0.8. At this depth, the magnetic field is arranged into longitudinal bands, and the great blue spot appears to be rooted in neighbouring structures of oppositely signed flux.  ( 2 min )
    Learning Generalizable Feature Fields for Mobile Manipulation
    arXiv:2403.07563v1 Announce Type: cross Abstract: An open problem in mobile manipulation is how to represent objects and scenes in a unified manner, so that robots can use it both for navigating in the environment and manipulating objects. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherit to an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We evaluate GeFF's ability to generalize to open-set objects as well as running time, when performing open-vocabulary mobile manipulation in dynamic scenes.  ( 2 min )
    Couler: Unified Machine Learning Workflow Optimization in Cloud
    arXiv:2403.07608v1 Announce Type: cross Abstract: Machine Learning (ML) has become ubiquitous, fueling data-driven applications across various organizations. Contrary to the traditional perception of ML in research, ML workflows can be complex, resource-intensive, and time-consuming. Expanding an ML workflow to encompass a wider range of data infrastructure and data types may lead to larger workloads and increased deployment costs. Currently, numerous workflow engines are available (with over ten being widely recognized). This variety poses a challenge for end-users in terms of mastering different engine APIs. While efforts have primarily focused on optimizing ML Operations (MLOps) for a specific workflow engine, current methods largely overlook workflow optimization across different engines. In this work, we design and implement Couler, a system designed for unified ML workflow optimization in the cloud. Our main insight lies in the ability to generate an ML workflow using natural language (NL) descriptions. We integrate Large Language Models (LLMs) into workflow generation, and provide a unified programming interface for various workflow engines. This approach alleviates the need to understand various workflow engines' APIs. Moreover, Couler enhances workflow computation efficiency by introducing automated caching at multiple stages, enabling large workflow auto-parallelization and automatic hyperparameters tuning. These enhancements minimize redundant computational costs and improve fault tolerance during deep learning workflow training. Couler is extensively deployed in real-world production scenarios at Ant Group, handling approximately 22k workflows daily, and has successfully improved the CPU/Memory utilization by more than 15% and the workflow completion rate by around 17%.  ( 3 min )
    How does promoting the minority fraction affect generalization? A theoretical study of the one-hidden-layer neural network on group imbalance
    arXiv:2403.07310v1 Announce Type: cross Abstract: Group imbalance has been a known problem in empirical risk minimization (ERM), where the achieved high average accuracy is accompanied by low accuracy in a minority group. Despite algorithmic efforts to improve the minority group accuracy, a theoretical generalization analysis of ERM on individual groups remains elusive. By formulating the group imbalance problem with the Gaussian Mixture Model, this paper quantifies the impact of individual groups on the sample complexity, the convergence rate, and the average and group-level testing performance. Although our theoretical framework is centered on binary classification using a one-hidden-layer neural network, to the best of our knowledge, we provide the first theoretical analysis of the group-level generalization of ERM in addition to the commonly studied average generalization performance. Sample insights of our theoretical results include that when all group-level co-variance is in the medium regime and all mean are close to zero, the learning performance is most desirable in the sense of a small sample complexity, a fast training rate, and a high average and group-level testing accuracy. Moreover, we show that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. Our theoretical results are validated on both synthetic and empirical datasets, such as CelebA and CIFAR-10 in image classification.  ( 3 min )
    Bias and Fairness in Large Language Models: A Survey
    arXiv:2309.00770v2 Announce Type: replace-cross Abstract: Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs.  ( 3 min )
    TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality
    arXiv:2302.02224v2 Announce Type: replace Abstract: This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). Our results lay the foundations for introducing The Attention Patch (TAP), a simple neural network add-on that allows data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using four real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.  ( 2 min )
    A Question-centric Multi-experts Contrastive Learning Framework for Improving the Accuracy and Interpretability of Deep Sequential Knowledge Tracing Models
    arXiv:2403.07322v1 Announce Type: cross Abstract: Knowledge tracing (KT) plays a crucial role in predicting students' future performance by analyzing their historical learning processes. Deep neural networks (DNNs) have shown great potential in solving the KT problem. However, there still exist some important challenges when applying deep learning techniques to model the KT process. The first challenge lies in taking the individual information of the question into modeling. This is crucial because, despite questions sharing the same knowledge component (KC), students' knowledge acquisition on homogeneous questions can vary significantly. The second challenge lies in interpreting the prediction results from existing deep learning-based KT models. In real-world applications, while it may not be necessary to have complete transparency and interpretability of the model parameters, it is crucial to present the model's prediction results in a manner that teachers find interpretable. This makes teachers accept the rationale behind the prediction results and utilize them to design teaching activities and tailored learning strategies for students. However, the inherent black-box nature of deep learning techniques often poses a hurdle for teachers to fully embrace the model's prediction results. To address these challenges, we propose a Question-centric Multi-experts Contrastive Learning framework for KT called Q-MCKT.  ( 3 min )
    SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression
    arXiv:2403.07378v1 Announce Type: cross Abstract: The advancements in Large Language Models (LLMs) have been hindered by their substantial sizes, which necessitate LLM compression methods for practical deployment. Singular Value Decomposition (SVD) offers a promising solution for LLM compression. However, state-of-the-art SVD-based LLM compression methods have two key limitations: truncating smaller singular values may lead to higher compression loss, and the lack of update on the remaining model parameters after SVD truncation. In this work, we propose SVD-LLM, a new SVD-based LLM compression method that addresses the limitations of existing methods. SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss. Moreover, SVD-LLM adopts a layer-wise closed-form model parameter update strategy to compensate for accuracy degradation caused by SVD truncation. We evaluate SVD-LLM on a total of 11 datasets and seven models from three different LLM families at four different scales. Our results demonstrate the superiority of SVD-LLM over state-of-the-arts, especially at high model compression ratios. The source code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM.  ( 2 min )
    Guiding Large Language Models with Divide-and-Conquer Program for Discerning Problem Solving
    arXiv:2402.05359v2 Announce Type: replace-cross Abstract: Foundation models, such as Large language Models (LLMs), have attracted significant amount of interest due to their large number of applications. Existing works show that appropriate prompt design, such as Chain-of-Thoughts, can unlock LLM's powerful capacity in diverse areas. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, existing prompting strategies either suffers from insufficient expressive power or intermediate errors triggered by hallucination. To make LLM more discerning to such intermediate errors, we propose to guide LLM with a Divide-and-Conquer program that simultaneously ensures superior expressive power and disentangles task decomposition, sub-task resolution, and resolution assembly process. Theoretic analysis reveals that our strategy can guide LLM to extend the expressive power of fixed-depth Transformer. Experiments indicate that our proposed method can achieve better performance than typical prompting strategies in tasks bothered by intermediate errors and deceptive contents, such as large integer multiplication, hallucination detection and misinformation detection.  ( 2 min )
    tdCoxSNN: Time-Dependent Cox Survival Neural Network for Continuous-time Dynamic Prediction
    arXiv:2307.05881v2 Announce Type: replace-cross Abstract: The aim of dynamic prediction is to provide individualized risk predictions over time, which are updated as new data become available. In pursuit of constructing a dynamic prediction model for a progressive eye disorder, age-related macular degeneration (AMD), we propose a time-dependent Cox survival neural network (tdCoxSNN) to predict its progression using longitudinal fundus images. tdCoxSNN builds upon the time-dependent Cox model by utilizing a neural network to capture the non-linear effect of time-dependent covariates on the survival outcome. Moreover, by concurrently integrating a convolutional neural network (CNN) with the survival network, tdCoxSNN can directly take longitudinal images as input. We evaluate and compare our proposed method with joint modeling and landmarking approaches through extensive simulations. We applied the proposed approach to two real datasets. One is a large AMD study, the Age-Related Eye Disease Study (AREDS), in which more than 50,000 fundus images were captured over a period of 12 years for more than 4,000 participants. Another is a public dataset of the primary biliary cirrhosis (PBC) disease, where multiple lab tests were longitudinally collected to predict the time-to-liver transplant. Our approach demonstrates commendable predictive performance in both simulation studies and the analysis of the two real datasets.  ( 2 min )
    Analyzing domain shift when using additional data for the MICCAI KiTS23 Challenge
    arXiv:2309.02001v2 Announce Type: replace-cross Abstract: Using additional training data is known to improve the results, especially for medical image 3D segmentation where there is a lack of training material and the model needs to generalize well from few available data. However, the new data could have been acquired using other instruments and preprocessed such its distribution is significantly different from the original training data. Therefore, we study techniques which ameliorate domain shift during training so that the additional data becomes better usable for preprocessing and training together with the original data. Our results show that transforming the additional data using histogram matching has better results than using simple normalization.  ( 2 min )
    Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models
    arXiv:2305.14585v5 Announce Type: replace Abstract: A recent trend in explainable AI research has focused on surrogate modeling, where neural networks are approximated as simpler ML algorithms such as kernel machines. A second trend has been to utilize kernel functions in various explain-by-example or data attribution tasks. In this work, we combine these two trends to analyze approximate empirical neural tangent kernels (eNTK) for data attribution. Approximation is critical for eNTK analysis due to the high computational cost to compute the eNTK. We define new approximate eNTK and perform novel analysis on how well the resulting kernel machine surrogate models correlate with the underlying neural network. We introduce two new random projection variants of approximate eNTK which allow users to tune the time and memory complexity of their calculation. We conclude that kernel machines using approximate neural tangent kernel as the kernel function are effective surrogate models, with the introduced trace NTK the most consistent performer. Open source software allowing users to efficiently calculate kernel functions in the PyTorch framework is available (https://github.com/pnnl/projection\_ntk).  ( 3 min )
    Entropy is not Enough for Test-Time Adaptation: From the Perspective of Disentangled Factors
    arXiv:2403.07366v1 Announce Type: cross Abstract: Test-time adaptation (TTA) fine-tunes pre-trained deep neural networks for unseen test data. The primary challenge of TTA is limited access to the entire test dataset during online updates, causing error accumulation. To mitigate it, TTA methods have utilized the model output's entropy as a confidence metric that aims to determine which samples have a lower likelihood of causing error. Through experimental studies, however, we observed the unreliability of entropy as a confidence metric for TTA under biased scenarios and theoretically revealed that it stems from the neglect of the influence of latent disentangled factors of data on predictions. Building upon these findings, we introduce a novel TTA method named Destroy Your Object (DeYO), which leverages a newly proposed confidence metric named Pseudo-Label Probability Difference (PLPD). PLPD quantifies the influence of the shape of an object on prediction by measuring the difference between predictions before and after applying an object-destructive transformation. DeYO consists of sample selection and sample weighting, which employ entropy and PLPD concurrently. For robust adaptation, DeYO prioritizes samples that dominantly incorporate shape information when making predictions. Our extensive experiments demonstrate the consistent superiority of DeYO over baseline methods across various scenarios, including biased and wild. Project page is publicly available at https://whitesnowdrop.github.io/DeYO/.  ( 3 min )
    FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients
    arXiv:2311.11227v2 Announce Type: replace Abstract: With the increasing availability of Foundation Models, federated tuning has garnered attention in the field of federated learning, utilizing data and computation resources from multiple clients to collaboratively fine-tune foundation models. However, in real-world federated scenarios, there often exist a multitude of heterogeneous clients with varying computation and communication resources, rendering them incapable of supporting the entire model fine-tuning process. In response to this challenge, we propose a novel federated tuning algorithm, FedRA. The implementation of FedRA is straightforward and can be seamlessly integrated into any transformer-based model without the need for further modification to the original model. Specifically, in each communication round, FedRA randomly generates an allocation matrix. For resource-constrained clients, it reorganizes a small number of layers from the original model based on the allocation matrix and fine-tunes using adapters. Subsequently, the server aggregates the updated adapter parameters from the clients according to the current allocation matrix into the corresponding layers of the original model. It is worth noting that FedRA also supports scenarios where none of the clients can support the entire global model, which is an impressive advantage. We conduct experiments on two large-scale image datasets, DomainNet and NICO++, under various non-iid settings. The results demonstrate that FedRA outperforms the compared methods significantly. The source code is available at \url{https://github.com/leondada/FedRA}.  ( 3 min )
    Deep Learning-Assisted Parallel Interference Cancellation for Grant-Free NOMA in Machine-Type Communication
    arXiv:2403.07255v1 Announce Type: cross Abstract: In this paper, we present a novel approach for joint activity detection (AD), channel estimation (CE), and data detection (DD) in uplink grant-free non-orthogonal multiple access (NOMA) systems. Our approach employs an iterative and parallel interference removal strategy inspired by parallel interference cancellation (PIC), enhanced with deep learning to jointly tackle the AD, CE, and DD problems. Based on this approach, we develop three PIC frameworks, each of which is designed for either coherent or non-coherence schemes. The first framework performs joint AD and CE using received pilot signals in the coherent scheme. Building upon this framework, the second framework utilizes both the received pilot and data signals for CE, further enhancing the performances of AD, CE, and DD in the coherent scheme. The third framework is designed to accommodate the non-coherent scheme involving a small number of data bits, which simultaneously performs AD and DD. Through joint loss functions and interference cancellation modules, our approach supports end-to-end training, contributing to enhanced performances of AD, CE, and DD for both coherent and non-coherent schemes. Simulation results demonstrate the superiority of our approach over traditional techniques, exhibiting enhanced performances of AD, CE, and DD while maintaining lower computational complexity.  ( 2 min )
    MgSvF: Multi-Grained Slow vs. Fast Framework for Few-Shot Class-Incremental Learning
    arXiv:2006.15524v4 Announce Type: replace-cross Abstract: As a challenging problem, few-shot class-incremental learning (FSCIL) continually learns a sequence of tasks, confronting the dilemma between slow forgetting of old knowledge and fast adaptation to new knowledge. In this paper, we concentrate on this "slow vs. fast" (SvF) dilemma to determine which knowledge components to be updated in a slow fashion or a fast fashion, and thereby balance old-knowledge preservation and new-knowledge adaptation. We propose a multi-grained SvF learning strategy to cope with the SvF dilemma from two different grains: intra-space (within the same feature space) and inter-space (between two different feature spaces). The proposed strategy designs a novel frequency-aware regularization to boost the intra-space SvF capability, and meanwhile develops a new feature space composition operation to enhance the inter-space SvF learning performance. With the multi-grained SvF learning strategy, our method outperforms the state-of-the-art approaches by a large margin.  ( 2 min )
    Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation
    arXiv:2403.07261v1 Announce Type: new Abstract: Offline meta-reinforcement learning (OMRL) proficiently allows an agent to tackle novel tasks while solely relying on a static dataset. For precise and efficient task identification, existing OMRL research suggests learning separate task representations that be incorporated with policy input, thus forming a context-based meta-policy. A major approach to train task representations is to adopt contrastive learning using multi-task offline data. The dataset typically encompasses interactions from various policies (i.e., the behavior policies), thus providing a plethora of contextual information regarding different tasks. Nonetheless, amassing data from a substantial number of policies is not only impractical but also often unattainable in realistic settings. Instead, we resort to a more constrained yet practical scenario, where multi-task data collection occurs with a limited number of policies. We observed that learned task representations from previous OMRL methods tend to correlate spuriously with the behavior policy instead of reflecting the essential characteristics of the task, resulting in unfavorable out-of-distribution generalization. To alleviate this issue, we introduce a novel algorithm to disentangle the impact of behavior policy from task representation learning through a process called adversarial data augmentation. Specifically, the objective of adversarial data augmentation is not merely to generate data analogous to offline data distribution; instead, it aims to create adversarial examples designed to confound learned task representations and lead to incorrect task identification. Our experiments show that learning from such adversarial samples significantly enhances the robustness and effectiveness of the task identification process and realizes satisfactory out-of-distribution generalization.  ( 3 min )
    Tracking Dynamic Gaussian Density with a Theoretically Optimal Sliding Window Approach
    arXiv:2403.07207v1 Announce Type: cross Abstract: Dynamic density estimation is ubiquitous in many applications, including computer vision and signal processing. One popular method to tackle this problem is the "sliding window" kernel density estimator. There exist various implementations of this method that use heuristically defined weight sequences for the observed data. The weight sequence, however, is a key aspect of the estimator affecting the tracking performance significantly. In this work, we study the exact mean integrated squared error (MISE) of "sliding window" Gaussian Kernel Density Estimators for evolving Gaussian densities. We provide a principled guide for choosing the optimal weight sequence by theoretically characterizing the exact MISE, which can be formulated as constrained quadratic programming. We present empirical evidence with synthetic datasets to show that our weighting scheme indeed improves the tracking performance compared to heuristic approaches.  ( 2 min )
    A tutorial on multi-view autoencoders using the multi-view-AE library
    arXiv:2403.07456v1 Announce Type: new Abstract: There has been a growing interest in recent years in modelling multiple modalities (or views) of data to for example, understand the relationship between modalities or to generate missing data. Multi-view autoencoders have gained significant traction for their adaptability and versatility in modelling multi-modal data, demonstrating an ability to tailor their approach to suit the characteristics of the data at hand. However, most multi-view autoencoders have inconsistent notation and are often implemented using different coding frameworks. To address this, we present a unified mathematical framework for multi-view autoencoders, consolidating their formulations. Moreover, we offer insights into the motivation and theoretical advantages of each model. To facilitate accessibility and practical use, we extend the documentation and functionality of the previously introduced \texttt{multi-view-AE} library. This library offers Python implementations of numerous multi-view autoencoder models, presented within a user-friendly framework. Through benchmarking experiments, we evaluate our implementations against previous ones, demonstrating comparable or superior performance. This work aims to establish a cohesive foundation for multi-modal modelling, serving as a valuable educational resource in the field.  ( 2 min )
    CardioGenAI: A Machine Learning-Based Framework for Re-Engineering Drugs for Reduced hERG Liability
    arXiv:2403.07632v1 Announce Type: new Abstract: Drug-induced cardiotoxicity is a major health concern which can lead to serious adverse effects including life-threatening cardiac arrhythmias via the blockade of the voltage-gated hERG potassium ion channel. It is therefore of tremendous interest to develop advanced methods to identify hERG-active compounds in early stages of drug development, as well as to optimize commercially available drugs for reduced hERG activity. In this work, we present CardioGenAI, a machine learning-based framework for re-engineering both developmental and marketed drugs for reduced hERG activity while preserving their pharmacological activity. The framework incorporates novel state-of-the-art discriminative models for predicting hERG channel activity, as well as activity against the voltage-gated NaV1.5 and CaV1.2 channels due to their potential implications in modulating the arrhythmogenic potential induced by hERG channel blockade. These models can also serve independently as effective components of a virtual screening pipeline. We applied the complete framework to pimozide, an FDA-approved antipsychotic agent that demonstrates high affinity to the hERG channel, and generated 100 refined candidates. Remarkably, among the candidates is fluspirilene, a compound which is of the same class of drugs (diphenylmethanes) as pimozide and therefore has similar pharmacological activity, yet exhibits over 700-fold weaker binding to hERG. We have made all of our software open-source to facilitate integration of the CardioGenAI framework for molecular hypothesis generation into drug discovery workflows.  ( 3 min )
    Accelerated Inference and Reduced Forgetting: The Dual Benefits of Early-Exit Networks in Continual Learning
    arXiv:2403.07404v1 Announce Type: new Abstract: Driven by the demand for energy-efficient employment of deep neural networks, early-exit methods have experienced a notable increase in research attention. These strategies allow for swift predictions by making decisions early in the network, thereby conserving computation time and resources. However, so far the early-exit networks have only been developed for stationary data distributions, which restricts their application in real-world scenarios with continuous non-stationary data. This study aims to explore the continual learning of the early-exit networks. We adapt existing continual learning methods to fit with early-exit architectures and investigate their behavior in the continual setting. We notice that early network layers exhibit reduced forgetting and can outperform standard networks even when using significantly fewer resources. Furthermore, we analyze the impact of task-recency bias on early-exit inference and propose Task-wise Logits Correction (TLC), a simple method that equalizes this bias and improves the network performance for every given compute budget in the class-incremental setting. We assess the accuracy and computational cost of various continual learning techniques enhanced with early-exits and TLC across standard class-incremental learning benchmarks such as 10 split CIFAR100 and ImageNetSubset and show that TLC can achieve the accuracy of the standard methods using less than 70\% of their computations. Moreover, at full computational budget, our method outperforms the accuracy of the standard counterparts by up to 15 percentage points. Our research underscores the inherent synergy between early-exit networks and continual learning, emphasizing their practical utility in resource-constrained environments.  ( 3 min )
    A Deep Learning Approach to Diabetes Diagnosis
    arXiv:2403.07483v1 Announce Type: new Abstract: Diabetes, resulting from inadequate insulin production or utilization, causes extensive harm to the body. Existing diagnostic methods are often invasive and come with drawbacks, such as cost constraints. Although there are machine learning models like Classwise k Nearest Neighbor (CkNN) and General Regression Neural Network (GRNN), they struggle with imbalanced data and result in under-performance. Leveraging advancements in sensor technology and machine learning, we propose a non-invasive diabetes diagnosis using a Back Propagation Neural Network (BPNN) with batch normalization, incorporating data re-sampling and normalization for class balancing. Our method addresses existing challenges such as limited performance associated with traditional machine learning. Experimental results on three datasets show significant improvements in overall accuracy, sensitivity, and specificity compared to traditional methods. Notably, we achieve accuracies of 89.81% in Pima diabetes dataset, 75.49% in CDC BRFSS2015 dataset, and 95.28% in Mesra Diabetes dataset. This underscores the potential of deep learning models for robust diabetes diagnosis. See project website https://steve-zeyu-zhang.github.io/DiabetesDiagnosis/  ( 2 min )
    Robustifying Point Cloud Networks by Refocusing
    arXiv:2308.05525v3 Announce Type: replace-cross Abstract: The ability to cope with out-of-distribution (OOD) corruptions and adversarial attacks is crucial in real-world safety-demanding applications. In this study, we develop a general mechanism to increase neural network robustness based on focus analysis. Recent studies have revealed the phenomenon of \textit{Overfocusing}, which leads to a performance drop. When the network is primarily influenced by small input regions, it becomes less robust and prone to misclassify under noise and corruptions. However, quantifying overfocusing is still vague and lacks clear definitions. Here, we provide a mathematical definition of \textbf{focus}, \textbf{overfocusing} and \textbf{underfocusing}. The notions are general, but in this study, we specifically investigate the case of 3D point clouds. We observe that corrupted sets result in a biased focus distribution compared to the clean training set. We show that as focus distribution deviates from the one learned in the training phase - classification performance deteriorates. We thus propose a parameter-free \textbf{refocusing} algorithm that aims to unify all corruptions under the same distribution. We validate our findings on a 3D zero-shot classification task, achieving SOTA in robust 3D classification on ModelNet-C dataset, and in adversarial defense against Shape-Invariant attack. Code is available in: https://github.com/yossilevii100/refocusing.  ( 2 min )
    Balancing Fairness and Accuracy in Data-Restricted Binary Classification
    arXiv:2403.07724v1 Announce Type: new Abstract: Applications that deal with sensitive information may have restrictions placed on the data available to a machine learning (ML) classifier. For example, in some applications, a classifier may not have direct access to sensitive attributes, affecting its ability to produce accurate and fair decisions. This paper proposes a framework that models the trade-off between accuracy and fairness under four practical scenarios that dictate the type of data available for analysis. Prior works examine this trade-off by analyzing the outputs of a scoring function that has been trained to implicitly learn the underlying distribution of the feature vector, class label, and sensitive attribute of a dataset. In contrast, our framework directly analyzes the behavior of the optimal Bayesian classifier on this underlying distribution by constructing a discrete approximation it from the dataset itself. This approach enables us to formulate multiple convex optimization problems, which allow us to answer the question: How is the accuracy of a Bayesian classifier affected in different data restricting scenarios when constrained to be fair? Analysis is performed on a set of fairness definitions that include group and individual fairness. Experiments on three datasets demonstrate the utility of the proposed framework as a tool for quantifying the trade-offs among different fairness notions and their distributional dependencies.  ( 2 min )
    Iterative Graph Neural Network Enhancement via Frequent Subgraph Mining of Explanations
    arXiv:2403.07849v1 Announce Type: new Abstract: We formulate an XAI-based model improvement approach for Graph Neural Networks (GNNs) for node classification, called Explanation Enhanced Graph Learning (EEGL). The goal is to improve predictive performance of GNN using explanations. EEGL is an iterative self-improving algorithm, which starts with a learned "vanilla" GNN, and repeatedly uses frequent subgraph mining to find relevant patterns in explanation subgraphs. These patterns are then filtered further to obtain application-dependent features corresponding to the presence of certain subgraphs in the node neighborhoods. Giving an application-dependent algorithm for such a subgraph-based extension of the Weisfeiler-Leman (1-WL) algorithm has previously been posed as an open problem. We present experimental evidence, with synthetic and real-world data, which show that EEGL outperforms related approaches in predictive performance and that it has a node-distinguishing power beyond that of vanilla GNNs. We also analyze EEGL's training dynamics.  ( 2 min )
    IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers
    arXiv:2403.07339v1 Announce Type: new Abstract: GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to the largest chunk of the compute footprint. Therefore, improving its efficiency is an active topic of ongoing research. A popular strategy is the use of low bit-width integers to approximate the original entries in a matrix. This allows efficiency gains, but often requires sophisticated techniques to control the rounding error incurred. In this work, we first verify/check that when the low bit-width restriction is removed, for a variety of Transformer-based models, whether integers are sufficient for all GEMMs need -- for {\em both} training and inference stages, and can achieve parity with floating point counterparts. No sophisticated techniques are needed. We find that while a large majority of entries in matrices (encountered in such models) can be easily represented by {\em low} bit-width integers, the existence of a few heavy hitter entries make it difficult to achieve efficiency gains via the exclusive use of low bit-width GEMMs alone. To address this issue, we develop a simple algorithm, Integer Matrix Unpacking (IM-Unpack), to {\em unpack} a matrix with large integer entries into a larger matrix whose entries all lie within the representable range of arbitrarily low bit-width integers. This allows {\em equivalence} with the original GEMM, i.e., the exact result can be obtained using purely low bit-width integer GEMMs. This comes at the cost of additional operations -- we show that for many popular models, this overhead is quite small.  ( 3 min )
    Re-Simulation-based Self-Supervised Learning for Pre-Training Foundation Models
    arXiv:2403.07066v1 Announce Type: cross Abstract: Self-Supervised Learning (SSL) is at the core of training modern large machine learning models, providing a scheme for learning powerful representations that can be used in a variety of downstream tasks. However, SSL strategies must be adapted to the type of training data and downstream tasks required. We propose RS3L, a novel simulation-based SSL strategy that employs a method of re-simulation to drive data augmentation for contrastive learning. By intervening in the middle of the simulation process and re-running simulation components downstream of the intervention, we generate multiple realizations of an event, thus producing a set of augmentations covering all physics-driven variations available in the simulator. Using experiments from high-energy physics, we explore how this strategy may enable the development of a foundation model; we show how R3SL pre-training enables powerful performance in downstream tasks such as discrimination of a variety of objects and uncertainty mitigation. In addition to our results, we make the RS3L dataset publicly available for further studies on how to improve SSL strategies.  ( 2 min )
    pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
    arXiv:2403.07809v1 Announce Type: new Abstract: Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce $\textbf{pyvene}$, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. $\textbf{pyvene}$ supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how $\textbf{pyvene}$ provides a unified and extensible framework for performing interventions on neural models and sharing the intervened upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through Python Package Index (PyPI) and provide code, documentation, and tutorials at https://github.com/stanfordnlp/pyvene.  ( 2 min )
    Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons
    arXiv:2403.07688v1 Announce Type: new Abstract: When training deep neural networks, the phenomenon of $\textit{dying neurons}$ $\unicode{x2013}$units that become inactive or saturated, output zero during training$\unicode{x2013}$ has traditionally been viewed as undesirable, linked with optimization challenges, and contributing to plasticity loss in continual learning scenarios. In this paper, we reassess this phenomenon, focusing on sparsity and pruning. By systematically exploring the impact of various hyperparameter configurations on dying neurons, we unveil their potential to facilitate simple yet effective structured pruning algorithms. We introduce $\textit{Demon Pruning}$ (DemP), a method that controls the proliferation of dead neurons, dynamically leading to network sparsity. Achieved through a combination of noise injection on active units and a one-cycled schedule regularization strategy, DemP stands out for its simplicity and broad applicability. Experiments on CIFAR10 and ImageNet datasets demonstrate that DemP surpasses existing structured pruning techniques, showcasing superior accuracy-sparsity tradeoffs and training speedups. These findings suggest a novel perspective on dying neurons as a valuable resource for efficient model compression and optimization.  ( 2 min )
    A Survey of Vision Transformers in Autonomous Driving: Current Trends and Future Directions
    arXiv:2403.07542v1 Announce Type: cross Abstract: This survey explores the adaptation of visual transformer models in Autonomous Driving, a transition inspired by their success in Natural Language Processing. Surpassing traditional Recurrent Neural Networks in tasks like sequential image processing and outperforming Convolutional Neural Networks in global context capture, as evidenced in complex scene recognition, Transformers are gaining traction in computer vision. These capabilities are crucial in Autonomous Driving for real-time, dynamic visual scene processing. Our survey provides a comprehensive overview of Vision Transformer applications in Autonomous Driving, focusing on foundational concepts such as self-attention, multi-head attention, and encoder-decoder architecture. We cover applications in object detection, segmentation, pedestrian detection, lane detection, and more, comparing their architectural merits and limitations. The survey concludes with future research directions, highlighting the growing role of Vision Transformers in Autonomous Driving.  ( 2 min )
    Scalable Spatiotemporal Prediction with Bayesian Neural Fields
    arXiv:2403.07657v1 Announce Type: new Abstract: Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in many scientific and business-intelligence applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As modern datasets continue to increase in size and complexity, there is a growing need for new statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle large prediction problems. This work presents the Bayesian Neural Field (BayesNF), a domain-general statistical model for inferring rich probability distributions over a spatiotemporal domain, which can be used for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a novel deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust uncertainty quantification. By defining the prior through a sequence of smooth differentiable transforms, posterior inference is conducted on large-scale data using variationally learned surrogates trained via stochastic gradient descent. We evaluate BayesNF against prominent statistical and machine-learning baselines, showing considerable improvements on diverse prediction problems from climate and public health datasets that contain tens to hundreds of thousands of measurements. The paper is accompanied with an open-source software package (https://github.com/google/bayesnf) that is easy-to-use and compatible with modern GPU and TPU accelerators on the JAX machine learning platform.  ( 2 min )
    Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning
    arXiv:2403.07362v1 Announce Type: new Abstract: The trustworthy machine learning (ML) community is increasingly recognizing the crucial need for models capable of selectively 'unlearning' data points after training. This leads to the problem of machine unlearning (MU), aiming to eliminate the influence of chosen data points on model performance, while still maintaining the model's utility post-unlearning. Despite various MU methods for data influence erasure, evaluations have largely focused on random data forgetting, ignoring the vital inquiry into which subset should be chosen to truly gauge the authenticity of unlearning performance. To tackle this issue, we introduce a new evaluative angle for MU from an adversarial viewpoint. We propose identifying the data subset that presents the most significant challenge for influence erasure, i.e., pinpointing the worst-case forget set. Utilizing a bi-level optimization principle, we amplify unlearning challenges at the upper optimization level to emulate worst-case scenarios, while simultaneously engaging in standard training and unlearning at the lower level, achieving a balance between data influence erasure and model utility. Our proposal offers a worst-case evaluation of MU's resilience and effectiveness. Through extensive experiments across different datasets (including CIFAR-10, 100, CelebA, Tiny ImageNet, and ImageNet) and models (including both image classifiers and generative models), we expose critical pros and cons in existing (approximate) unlearning strategies. Our results illuminate the complex challenges of MU in practice, guiding the future development of more accurate and robust unlearning algorithms. The code is available at https://github.com/OPTML-Group/Unlearn-WorstCase.  ( 3 min )
    Detecting Security-Relevant Methods using Multi-label Machine Learning
    arXiv:2403.07501v1 Announce Type: new Abstract: To detect security vulnerabilities, static analysis tools need to be configured with security-relevant methods. Current approaches can automatically identify such methods using binary relevance machine learning approaches. However, they ignore dependencies among security-relevant methods, over-generalize and perform poorly in practice. Additionally, users have to nevertheless manually configure static analysis tools using the detected methods. Based on feedback from users and our observations, the excessive manual steps can often be tedious, error-prone and counter-intuitive. In this paper, we present Dev-Assist, an IntelliJ IDEA plugin that detects security-relevant methods using a multi-label machine learning approach that considers dependencies among labels. The plugin can automatically generate configurations for static analysis tools, run the static analysis, and show the results in IntelliJ IDEA. Our experiments reveal that Dev-Assist's machine learning approach has a higher F1-Measure than related approaches. Moreover, the plugin reduces and simplifies the manual effort required when configuring and using static analysis tools.  ( 2 min )
    ProPML: Probability Partial Multi-label Learning
    arXiv:2403.07603v1 Announce Type: new Abstract: Partial Multi-label Learning (PML) is a type of weakly supervised learning where each training instance corresponds to a set of candidate labels, among which only some are true. In this paper, we introduce \our{}, a novel probabilistic approach to this problem that extends the binary cross entropy to the PML setup. In contrast to existing methods, it does not require suboptimal disambiguation and, as such, can be applied to any deep architecture. Furthermore, experiments conducted on artificial and real-world datasets indicate that \our{} outperforms existing approaches, especially for high noise in a candidate set.  ( 2 min )
    Federated Learning of Socially Appropriate Agent Behaviours in Simulated Home Environments
    arXiv:2403.07586v1 Announce Type: new Abstract: As social robots become increasingly integrated into daily life, ensuring their behaviours align with social norms is crucial. For their widespread open-world application, it is important to explore Federated Learning (FL) settings where individual robots can learn about their unique environments while also learning from each others' experiences. In this paper, we present a novel FL benchmark that evaluates different strategies, using multi-label regression objectives, where each client individually learns to predict the social appropriateness of different robot actions while also sharing their learning with others. Furthermore, splitting the training data by different contexts such that each client incrementally learns across contexts, we present a novel Federated Continual Learning (FCL) benchmark that adapts FL-based methods to use state-of-the-art Continual Learning (CL) methods to continually learn socially appropriate agent behaviours under different contextual settings. Federated Averaging (FedAvg) of weights emerges as a robust FL strategy while rehearsal-based FCL enables incrementally learning the social appropriateness of robot actions, across contextual splits.  ( 2 min )
    Enhancing Quantum Variational Algorithms with Zero Noise Extrapolation via Neural Networks
    arXiv:2403.07025v1 Announce Type: cross Abstract: In the emergent realm of quantum computing, the Variational Quantum Eigensolver (VQE) stands out as a promising algorithm for solving complex quantum problems, especially in the noisy intermediate-scale quantum (NISQ) era. However, the ubiquitous presence of noise in quantum devices often limits the accuracy and reliability of VQE outcomes. This research introduces a novel approach to ameliorate this challenge by utilizing neural networks for zero noise extrapolation (ZNE) in VQE computations. By employing the Qiskit framework, we crafted parameterized quantum circuits using the RY-RZ ansatz and examined their behavior under varying levels of depolarizing noise. Our investigations spanned from determining the expectation values of a Hamiltonian, defined as a tensor product of Z operators, under different noise intensities to extracting the ground state energy. To bridge the observed outcomes under noise with the ideal noise-free scenario, we trained a Feed Forward Neural Network on the error probabilities and their associated expectation values. Remarkably, our model proficiently predicted the VQE outcome under hypothetical noise-free conditions. By juxtaposing the simulation results with real quantum device executions, we unveiled the discrepancies induced by noise and showcased the efficacy of our neural network-based ZNE technique in rectifying them. This integrative approach not only paves the way for enhanced accuracy in VQE computations on NISQ devices but also underlines the immense potential of hybrid quantum-classical paradigms in circumventing the challenges posed by quantum noise. Through this research, we envision a future where quantum algorithms can be reliably executed on noisy devices, bringing us one step closer to realizing the full potential of quantum computing.  ( 3 min )
    Premonition: Using Generative Models to Preempt Future Data Changes in Continual Learning
    arXiv:2403.07356v1 Announce Type: cross Abstract: Continual learning requires a model to adapt to ongoing changes in the data distribution, and often to the set of tasks to be performed. It is rare, however, that the data and task changes are completely unpredictable. Given a description of an overarching goal or data theme, which we call a realm, humans can often guess what concepts are associated with it. We show here that the combination of a large language model and an image generation model can similarly provide useful premonitions as to how a continual learning challenge might develop over time. We use the large language model to generate text descriptions of semantically related classes that might potentially appear in the data stream in future. These descriptions are then rendered using Stable Diffusion to generate new labelled image samples. The resulting synthetic dataset is employed for supervised pre-training, but is discarded prior to commencing continual learning, along with the pre-training classification head. We find that the backbone of our pre-trained networks can learn representations useful for the downstream continual learning problem, thus becoming a valuable input to any existing continual learning method. Although there are complexities arising from the domain gap between real and synthetic images, we show that pre-training models in this manner improves multiple Class Incremenal Learning (CIL) methods on fine-grained image classification benchmarks. Supporting code can be found at https://github.com/cl-premonition/premonition.  ( 3 min )
    Exploring Cluster Analysis in Nelore Cattle Visual Score Attribution
    arXiv:2403.07137v1 Announce Type: cross Abstract: Assessing the biotype of cattle through human visual inspection is a very common and important practice in precision cattle breeding. This paper presents the results of a correlation analysis between scores produced by humans for Nelore cattle and a variety of measurements that can be derived from images or other instruments. It also presents a study using the k-means algorithm to generate new ways of clustering a batch of cattle using the measurements that most correlate with the animal's body weight and visual scores.  ( 2 min )
    Bayesian Regression Markets
    arXiv:2310.14992v2 Announce Type: replace Abstract: Machine learning tasks are vulnerable to the quality of data used as input. Yet, it is often challenging for firms to obtain adequate datasets, with them being naturally distributed amongst owners, that in practice, may be competitors in a downstream market and reluctant to share information. Focusing on supervised learning for regression tasks, we develop a regression market to provide a monetary incentive for data sharing. Our proposed mechanism adopts a Bayesian framework, allowing us to consider a more general class of regression tasks. We present a thorough exploration of the market properties, and show that similar proposals in current literature expose the market agents to sizeable financial risks, which can be mitigated in our setup.  ( 2 min )
    An alternative approach to train neural networks using monotone variational inequality
    arXiv:2202.08876v4 Announce Type: replace-cross Abstract: We propose an alternative approach to neural network training using the monotone vector field, an idea inspired by the seminal work of Juditsky and Nemirovski [Juditsky & Nemirovsky, 2019] developed originally to solve parameter estimation problems for generalized linear models (GLM) by reducing the original non-convex problem to a convex problem of solving a monotone variational inequality (VI). Our approach leads to computationally efficient procedures that converge fast and offer guarantee in some special cases, such as training a single-layer neural network or fine-tuning the last layer of the pre-trained model. Our approach can be used for more efficient fine-tuning of a pre-trained model while freezing the bottom layers, an essential step for deploying many machine learning models such as large language models (LLM). We demonstrate its applicability in training fully-connected (FC) neural networks, graph neural networks (GNN), and convolutional neural networks (CNN) and show the competitive or better performance of our approach compared to stochastic gradient descent methods on both synthetic and real network data prediction tasks regarding various performance metrics.  ( 2 min )
    Efficient GPU implementation of randomized SVD and its applications
    arXiv:2110.03423v2 Announce Type: replace Abstract: Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity which significantly increases their computational cost and time. In this work, we leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs), predominant computing architecture used e.g. in deep learning, to reduce the computational burden of computing matrix decompositions. More specifically, we reformulate the randomized decomposition problem to incorporate fast matrix multiplication operations (BLAS-3) as building blocks. We show that this formulation, combined with fast random number generators, allows to fully exploit the potential of parallel processing implemented in GPUs. Our extensive evaluation confirms the superiority of this approach over the competing methods and we release the results of this research as a part of the official CUDA implementation (https://docs.nvidia.com/cuda/cusolver/index.html).  ( 2 min )
    Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets
    arXiv:2403.07767v1 Announce Type: cross Abstract: Paralinguistic traits like cognitive load and emotion are increasingly recognized as pivotal areas in speech recognition research, often examined through specialized datasets like CLSE and IEMOCAP. However, the integrity of these datasets is seldom scrutinized for text-dependency. This paper critically evaluates the prevalent assumption that machine learning models trained on such datasets genuinely learn to identify paralinguistic traits, rather than merely capturing lexical features. By examining the lexical overlap in these datasets and testing the performance of machine learning models, we expose significant text-dependency in trait-labeling. Our results suggest that some machine learning models, especially large pre-trained models like HuBERT, might inadvertently focus on lexical characteristics rather than the intended paralinguistic features. The study serves as a call to action for the research community to reevaluate the reliability of existing datasets and methodologies, ensuring that machine learning models genuinely learn what they are designed to recognize.  ( 2 min )
    generAItor: Tree-in-the-Loop Text Generation for Language Model Explainability and Adaptation
    arXiv:2403.07627v1 Announce Type: cross Abstract: Large language models (LLMs) are widely deployed in various downstream tasks, e.g., auto-completion, aided writing, or chat-based text generation. However, the considered output candidates of the underlying search algorithm are under-explored and under-explained. We tackle this shortcoming by proposing a tree-in-the-loop approach, where a visual representation of the beam search tree is the central component for analyzing, explaining, and adapting the generated outputs. To support these tasks, we present generAItor, a visual analytics technique, augmenting the central beam search tree with various task-specific widgets, providing targeted visualizations and interaction possibilities. Our approach allows interactions on multiple levels and offers an iterative pipeline that encompasses generating, exploring, and comparing output candidates, as well as fine-tuning the model based on adapted data. Our case study shows that our tool generates new insights in gender bias analysis beyond state-of-the-art template-based methods. Additionally, we demonstrate the applicability of our approach in a qualitative user study. Finally, we quantitatively evaluate the adaptability of the model to few samples, as occurring in text-generation use cases.  ( 2 min )
    Proxy Methods for Domain Adaptation
    arXiv:2403.07442v1 Announce Type: new Abstract: We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional ''concept'' variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.  ( 2 min )
    New Perspectives in Online Contract Design: Heterogeneous, Homogeneous, Non-myopic Agents and Team Production
    arXiv:2403.07143v1 Announce Type: cross Abstract: This work studies the repeated principal-agent problem from an online learning perspective. The principal's goal is to learn the optimal contract that maximizes her utility through repeated interactions, without prior knowledge of the agent's type (i.e., the agent's cost and production functions). I study three different settings when the principal contracts with a $\textit{single}$ agent each round: 1. The agents are heterogeneous; 2. the agents are homogenous; 3. the principal interacts with the same agent and the agent is non-myopic. I present different approaches and techniques for designing learning algorithms in each setting. For heterogeneous agent types, I identify a condition that allows the problem to be reduced to Lipschitz bandits directly. For identical agents, I give a polynomial sample complexity scheme to learn the optimal contract based on inverse game theory. For strategic non-myopic agents, I design a low strategic-regret mechanism. Also, I identify a connection between linear contracts and posted-price auctions, showing the two can be reduced to one another, and give a regret lower bound on learning the optimal linear contract based on this observation. I also study a $\textit{team production}$ model. I identify a condition under which the principal's learning problem can be reformulated as solving a family of convex programs, thereby showing the optimal contract can be found efficiently.  ( 2 min )
    Harder Tasks Need More Experts: Dynamic Routing in MoE Models
    arXiv:2403.07652v1 Announce Type: new Abstract: In this paper, we introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models, aiming to enhance computational efficiency and model performance by adjusting the number of activated experts based on input difficulty. Unlike traditional MoE approaches that rely on fixed Top-K routing, which activates a predetermined number of experts regardless of the input's complexity, our method dynamically selects experts based on the confidence level in expert selection for each input. This allows for a more efficient utilization of computational resources, activating more experts for complex tasks requiring advanced reasoning and fewer for simpler tasks. Through extensive evaluations, our dynamic routing method demonstrates substantial improvements over conventional Top-2 routing across various benchmarks, achieving an average improvement of 0.7% with less than 90% activated parameters. Further analysis shows our model dispatches more experts to tasks requiring complex reasoning skills, like BBH, confirming its ability to dynamically allocate computational resources in alignment with the input's complexity. Our findings also highlight a variation in the number of experts needed across different layers of the transformer model, offering insights into the potential for designing heterogeneous MoE frameworks. The code and models are available at https://github.com/ZhenweiAn/Dynamic_MoE.  ( 2 min )
    Robustifying and Boosting Training-Free Neural Architecture Search
    arXiv:2403.07591v1 Announce Type: new Abstract: Neural architecture search (NAS) has become a key component of AutoML and a standard tool to automate the design of deep neural networks. Recently, training-free NAS as an emerging paradigm has successfully reduced the search costs of standard training-based NAS by estimating the true architecture performance with only training-free metrics. Nevertheless, the estimation ability of these metrics typically varies across different tasks, making it challenging to achieve robust and consistently good search performance on diverse tasks with only a single training-free metric. Meanwhile, the estimation gap between training-free metrics and the true architecture performances limits training-free NAS to achieve superior performance. To address these challenges, we propose the robustifying and boosting training-free NAS (RoBoT) algorithm which (a) employs the optimized combination of existing training-free metrics explored from Bayesian optimization to develop a robust and consistently better-performing metric on diverse tasks, and (b) applies greedy search, i.e., the exploitation, on the newly developed metric to bridge the aforementioned gap and consequently to boost the search performance of standard training-free NAS further. Remarkably, the expected performance of our RoBoT can be theoretically guaranteed, which improves over the existing training-free NAS under mild conditions with additional interesting insights. Our extensive experiments on various NAS benchmark tasks yield substantial empirical evidence to support our theoretical results.  ( 2 min )
    Experimental Comparison of Ensemble Methods and Time-to-Event Analysis Models Through Integrated Brier Score and Concordance Index
    arXiv:2403.07460v1 Announce Type: new Abstract: Time-to-event analysis is a branch of statistics that has increased in popularity during the last decades due to its many application fields, such as predictive maintenance, customer churn prediction and population lifetime estimation. In this paper, we review and compare the performance of several prediction models for time-to-event analysis. These consist of semi-parametric and parametric statistical models, in addition to machine learning approaches. Our study is carried out on three datasets and evaluated in two different scores (the integrated Brier score and concordance index). Moreover, we show how ensemble methods, which surprisingly have not yet been much studied in time-to-event analysis, can improve the prediction accuracy and enhance the robustness of the prediction performance. We conclude the analysis with a simulation experiment in which we evaluate the factors influencing the performance ranking of the methods using both scores.  ( 2 min )
    Constrained Optimal Fuel Consumption of HEV: A Constrained Reinforcement Learning Approach
    arXiv:2403.07503v1 Announce Type: new Abstract: Hybrid electric vehicles (HEVs) are becoming increasingly popular because they can better combine the working characteristics of internal combustion engines and electric motors. However, the minimum fuel consumption of an HEV for a battery electrical balance case under a specific assembly condition and a specific speed curve still needs to be clarified in academia and industry. Regarding this problem, this work provides the mathematical expression of constrained optimal fuel consumption (COFC) from the perspective of constrained reinforcement learning (CRL) for the first time globally. Also, two mainstream approaches of CRL, constrained variational policy optimization (CVPO) and Lagrangian-based approaches, are utilized for the first time to obtain the vehicle's minimum fuel consumption under the battery electrical balance condition. We conduct case studies on the well-known Prius TOYOTA hybrid system (THS) under the NEDC condition; we give vital steps to implement CRL approaches and compare the performance between the CVPO and Lagrangian-based approaches. Our case study found that CVPO and Lagrangian-based approaches can obtain the lowest fuel consumption while maintaining the SOC balance constraint. The CVPO approach converges stable, but the Lagrangian-based approach can obtain the lowest fuel consumption at 3.95 L/100km, though with more significant oscillations. This result verifies the effectiveness of our proposed CRL approaches to the COFC problem.  ( 2 min )
    Graph Unlearning with Efficient Partial Retraining
    arXiv:2403.07353v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in various real-world applications. However, GNNs may be trained on undesirable graph data, which can degrade their performance and reliability. To enable trained GNNs to efficiently unlearn unwanted data, a desirable solution is retraining-based graph unlearning, which partitions the training graph into subgraphs and trains sub-models on them, allowing fast unlearning through partial retraining. However, the graph partition process causes information loss in the training graph, resulting in the low model utility of sub-GNN models. In this paper, we propose GraphRevoker, a novel graph unlearning framework that better maintains the model utility of unlearnable GNNs. Specifically, we preserve the graph property with graph property-aware sharding and effectively aggregate the sub-GNN models for prediction with graph contrastive sub-model aggregation. We conduct extensive experiments to demonstrate the superiority of our proposed approach.  ( 2 min )
    Hallmarks of Optimization Trajectories in Neural Networks and LLMs: The Lengths, Bends, and Dead Ends
    arXiv:2403.07379v1 Announce Type: new Abstract: We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich structure of parameters contained within their optimization trajectories. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which reveal the inherent nuance and interplay involved between various optimization choices, such as momentum, weight decay, and batch size. We use them to provide key hallmarks about the nature of optimization in deep neural networks: when it goes right, and when it finds itself in a dead end. Further, thanks to our trajectory perspective, we uncover an intertwined behaviour of momentum and weight decay that promotes directional exploration, as well as a directional regularization behaviour of some others. We perform experiments over large-scale vision and language settings, including large language models (LLMs) with up to 12 billion parameters, to demonstrate the value of our approach.  ( 2 min )
    Automatic driving lane change safety prediction model based on LSTM
    arXiv:2403.06993v1 Announce Type: cross Abstract: Autonomous driving technology can improve traffic safety and reduce traffic accidents. In addition, it improves traffic flow, reduces congestion, saves energy and increases travel efficiency. In the relatively mature automatic driving technology, the automatic driving function is divided into several modules: perception, decision-making, planning and control, and a reasonable division of labor can improve the stability of the system. Therefore, autonomous vehicles need to have the ability to predict the trajectory of surrounding vehicles in order to make reasonable decision planning and safety measures to improve driving safety. By using deep learning method, a safety-sensitive deep learning model based on short term memory (LSTM) network is proposed. This model can alleviate the shortcomings of current automatic driving trajectory planning, and the output trajectory not only ensures high accuracy but also improves safety. The cell state simulation algorithm simulates the trackability of the trajectory generated by this model. The research results show that compared with the traditional model-based method, the trajectory prediction method based on LSTM network has obvious advantages in predicting the trajectory in the long time domain. The intention recognition module considering interactive information has higher prediction and accuracy, and the algorithm results show that the trajectory is very smooth based on the premise of safe prediction and efficient lane change. And autonomous vehicles can efficiently and safely complete lane changes.  ( 2 min )
    Efficient Knowledge Deletion from Trained Models through Layer-wise Partial Machine Unlearning
    arXiv:2403.07611v1 Announce Type: new Abstract: Machine unlearning has garnered significant attention due to its ability to selectively erase knowledge obtained from specific training data samples in an already trained machine learning model. This capability enables data holders to adhere strictly to data protection regulations. However, existing unlearning techniques face practical constraints, often causing performance degradation, demanding brief fine-tuning post unlearning, and requiring significant storage. In response, this paper introduces a novel class of machine unlearning algorithms. First method is partial amnesiac unlearning, integration of layer-wise pruning with amnesiac unlearning. In this method, updates made to the model during training are pruned and stored, subsequently used to forget specific data from trained model. The second method assimilates layer-wise partial-updates into label-flipping and optimization-based unlearning to mitigate the adverse effects of data deletion on model efficacy. Through a detailed experimental evaluation, we showcase the effectiveness of proposed unlearning methods. Experimental results highlight that the partial amnesiac unlearning not only preserves model efficacy but also eliminates the necessity for brief post fine-tuning, unlike conventional amnesiac unlearning. Moreover, employing layer-wise partial updates in label-flipping and optimization-based unlearning techniques demonstrates superiority in preserving model efficacy compared to their naive counterparts.  ( 2 min )
    One Category One Prompt: Dataset Distillation using Diffusion Models
    arXiv:2403.07142v1 Announce Type: cross Abstract: The extensive amounts of data required for training deep neural networks pose significant challenges on storage and transmission fronts. Dataset distillation has emerged as a promising technique to condense the information of massive datasets into a much smaller yet representative set of synthetic samples. However, traditional dataset distillation approaches often struggle to scale effectively with high-resolution images and more complex architectures due to the limitations in bi-level optimization. Recently, several works have proposed exploiting knowledge distillation with decoupled optimization schemes to scale up dataset distillation. Although these methods effectively address the scalability issue, they rely on extensive image augmentations requiring the storage of soft labels for augmented images. In this paper, we introduce Dataset Distillation using Diffusion Models (D3M) as a novel paradigm for dataset distillation, leveraging recent advancements in generative text-to-image foundation models. Our approach utilizes textual inversion, a technique for fine-tuning text-to-image generative models, to create concise and informative representations for large datasets. By employing these learned text prompts, we can efficiently store and infer new samples for introducing data variability within a fixed memory budget. We show the effectiveness of our method through extensive experiments across various computer vision benchmark datasets with different memory budgets.  ( 2 min )
    Taming Pre-trained LLMs for Generalised Time Series Forecasting via Cross-modal Knowledge Distillation
    arXiv:2403.07300v1 Announce Type: new Abstract: Multivariate time series forecasting has recently gained great success with the rapid growth of deep learning models. However, existing approaches usually train models from scratch using limited temporal data, preventing their generalization. Recently, with the surge of the Large Language Models (LLMs), several works have attempted to introduce LLMs into time series forecasting. Despite promising results, these methods directly take time series as the input to LLMs, ignoring the inherent modality gap between temporal and text data. In this work, we propose a novel Large Language Models and time series alignment framework, dubbed LLaTA, to fully unleash the potentials of LLMs in the time series forecasting challenge. Based on cross-modal knowledge distillation, the proposed method exploits both input-agnostic static knowledge and input-dependent dynamic knowledge in pre-trained LLMs. In this way, it empowers the forecasting model with favorable performance as well as strong generalization abilities. Extensive experiments demonstrate the proposed method establishes a new state of the art for both long- and short-term forecasting. Code is available at \url{https://github.com/Hank0626/LLaTA}.  ( 2 min )
    XpertAI: uncovering model strategies for sub-manifolds
    arXiv:2403.07486v1 Announce Type: new Abstract: In recent years, Explainable AI (XAI) methods have facilitated profound validation and knowledge extraction from ML models. While extensively studied for classification, few XAI solutions have addressed the challenges specific to regression models. In regression, explanations need to be precisely formulated to address specific user queries (e.g.\ distinguishing between `Why is the output above 0?' and `Why is the output above 50?'). They should furthermore reflect the model's behavior on the relevant data sub-manifold. In this paper, we introduce XpertAI, a framework that disentangles the prediction strategy into multiple range-specific sub-strategies and allows the formulation of precise queries about the model (the `explanandum') as a linear combination of those sub-strategies. XpertAI is formulated generally to work alongside popular XAI attribution techniques, based on occlusion, gradient integration, or reverse propagation. Qualitative and quantitative results, demonstrate the benefits of our approach.  ( 2 min )
    Learning-Augmented Algorithms with Explicit Predictors
    arXiv:2403.07413v1 Announce Type: new Abstract: Recent advances in algorithmic design show how to utilize predictions obtained by machine learning models from past and present data. These approaches have demonstrated an enhancement in performance when the predictions are accurate, while also ensuring robustness by providing worst-case guarantees when predictions fail. In this paper we focus on online problems; prior research in this context was focused on a paradigm where the predictor is pre-trained on past data and then used as a black box (to get the predictions it was trained for). In contrast, in this work, we unpack the predictor and integrate the learning problem it gives rise for within the algorithmic challenge. In particular we allow the predictor to learn as it receives larger parts of the input, with the ultimate goal of designing online learning algorithms specifically tailored for the algorithmic task at hand. Adopting this perspective, we focus on a number of fundamental problems, including caching and scheduling, which have been well-studied in the black-box setting. For each of the problems we consider, we introduce new algorithms that take advantage of explicit learning algorithms which we carefully design towards optimizing the overall performance. We demonstrate the potential of our approach by deriving performance bounds which improve over those established in previous work.  ( 2 min )
    Multi-Agent Reinforcement Learning with a Hierarchy of Reward Machines
    arXiv:2403.07005v1 Announce Type: cross Abstract: In this paper, we study the cooperative Multi-Agent Reinforcement Learning (MARL) problems using Reward Machines (RMs) to specify the reward functions such that the prior knowledge of high-level events in a task can be leveraged to facilitate the learning efficiency. Unlike the existing work that RMs have been incorporated into MARL for task decomposition and policy learning in relatively simple domains or with an assumption of independencies among the agents, we present Multi-Agent Reinforcement Learning with a Hierarchy of RMs (MAHRM) that is capable of dealing with more complex scenarios when the events among agents can occur concurrently and the agents are highly interdependent. MAHRM exploits the relationship of high-level events to decompose a task into a hierarchy of simpler subtasks that are assigned to a small group of agents, so as to reduce the overall computational complexity. Experimental results in three cooperative MARL domains show that MAHRM outperforms other MARL methods using the same prior knowledge of high-level events.  ( 2 min )
    Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias
    arXiv:2403.07857v1 Announce Type: new Abstract: Model-induced distribution shifts (MIDS) occur as previous model outputs pollute new model training sets over generations of models. This is known as model collapse in the case of generative models, and performative prediction or unfairness feedback loops for supervised models. When a model induces a distribution shift, it also encodes its mistakes, biases, and unfairnesses into the ground truth of its data ecosystem. We introduce a framework that allows us to track multiple MIDS over many generations, finding that they can lead to loss in performance, fairness, and minoritized group representation, even in initially unbiased datasets. Despite these negative consequences, we identify how models might be used for positive, intentional, interventions in their data ecosystems, providing redress for historical discrimination through a framework called algorithmic reparation (AR). We simulate AR interventions by curating representative training batches for stochastic gradient descent to demonstrate how AR can improve upon the unfairnesses of models and data ecosystems subject to other MIDS. Our work takes an important step towards identifying, mitigating, and taking accountability for the unfair feedback loops enabled by the idea that ML systems are inherently neutral and objective.  ( 2 min )
    12 mJ per Class On-Device Online Few-Shot Class-Incremental Learning
    arXiv:2403.07851v1 Announce Type: new Abstract: Few-Shot Class-Incremental Learning (FSCIL) enables machine learning systems to expand their inference capabilities to new classes using only a few labeled examples, without forgetting the previously learned classes. Classical backpropagation-based learning and its variants are often unsuitable for battery-powered, memory-constrained systems at the extreme edge. In this work, we introduce Online Few-Shot Class-Incremental Learning (O-FSCIL), based on a lightweight model consisting of a pretrained and metalearned feature extractor and an expandable explicit memory storing the class prototypes. The architecture is pretrained with a novel feature orthogonality regularization and metalearned with a multi-margin loss. For learning a new class, our approach extends the explicit memory with novel class prototypes, while the remaining architecture is kept frozen. This allows learning previously unseen classes based on only a few examples with one single pass (hence online). O-FSCIL obtains an average accuracy of 68.62% on the FSCIL CIFAR100 benchmark, achieving state-of-the-art results. Tailored for ultra-low-power platforms, we implement O-FSCIL on the 60 mW GAP9 microcontroller, demonstrating online learning capabilities within just 12 mJ per new class.  ( 2 min )
    Chronos: Learning the Language of Time Series
    arXiv:2403.07815v1 Announce Type: new Abstract: We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines.  ( 2 min )
    Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning
    arXiv:2403.07704v1 Announce Type: new Abstract: In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we proposed a method called Symmetric Q-learning, in which the synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution.  ( 2 min )
    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
    arXiv:2403.07718v1 Announce Type: new Abstract: We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.  ( 2 min )
    A New Machine Learning Dataset of Bulldog Nostril Images for Stenosis Degree Classification
    arXiv:2403.07132v1 Announce Type: cross Abstract: Brachycephaly, a conformation trait in some dog breeds, causes BOAS, a respiratory disorder that affects the health and welfare of the dogs with various symptoms. In this paper, a new annotated dataset composed of 190 images of bulldogs' nostrils is presented. Three degrees of stenosis are approximately equally represented in the dataset: mild, moderate and severe stenosis. The dataset also comprises a small quantity of non stenotic nostril images. To the best of our knowledge, this is the first image dataset addressing this problem. Furthermore, deep learning is investigated as an alternative to automatically infer stenosis degree using nostril images. In this work, several neural networks were tested: ResNet50, MobileNetV3, DenseNet201, SwinV2 and MaxViT. For this evaluation, the problem was modeled in two different ways: first, as a three-class classification problem (mild or open, moderate, and severe); second, as a binary classification problem, with severe stenosis as target. For the multiclass classification, a maximum median f-score of 53.77\% was achieved by the MobileNetV3. For binary classification, a maximum median f-score of 72.08\% has been reached by ResNet50, indicating that the problem is challenging but possibly tractable.  ( 2 min )
    AdaNovo: Adaptive \emph{De Novo} Peptide Sequencing with Conditional Mutual Information
    arXiv:2403.07013v1 Announce Type: cross Abstract: Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the analysis of protein composition in biological samples. Despite the development of various deep learning methods for identifying amino acid sequences (peptides) responsible for observed spectra, challenges persist in \emph{de novo} peptide sequencing. Firstly, prior methods struggle to identify amino acids with post-translational modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, further resulting in decreased peptide-level identification precision. Secondly, diverse types of noise and missing peaks in mass spectra reduce the reliability of training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates conditional mutual information (CMI) between the spectrum and each amino acid/peptide, using CMI for adaptive model training. Extensive experiments demonstrate AdaNovo's state-of-the-art performance on a 9-species benchmark, where the peptides in the training set are almost completely disjoint from the peptides of the test sets. Moreover, AdaNovo excels in identifying amino acids with PTMs and exhibits robustness against data noise. The supplementary materials contain the official code.  ( 2 min )
    A Machine learning and Empirical Bayesian Approach for Predictive Buying in B2B E-commerce
    arXiv:2403.07843v1 Announce Type: new Abstract: In the context of developing nations like India, traditional business to business (B2B) commerce heavily relies on the establishment of robust relationships, trust, and credit arrangements between buyers and sellers. Consequently, ecommerce enterprises frequently. Established in 2016 with a vision to revolutionize trade in India through technology, Udaan is the countrys largest business to business ecommerce platform. Udaan operates across diverse product categories, including lifestyle, electronics, home and employ telecallers to cultivate buyer relationships, streamline order placement procedures, and promote special promotions. The accurate anticipation of buyer order placement behavior emerges as a pivotal factor for attaining sustainable growth, heightening competitiveness, and optimizing the efficiency of these telecallers. To address this challenge, we have employed an ensemble approach comprising XGBoost and a modified version of Poisson Gamma model to predict customer order patterns with precision. This paper provides an in-depth exploration of the strategic fusion of machine learning and an empirical Bayesian approach, bolstered by the judicious selection of pertinent features. This innovative approach has yielded a remarkable 3 times increase in customer order rates, show casing its potential for transformative impact in the ecommerce industry.  ( 2 min )
    Machine Learning for Soccer Match Result Prediction
    arXiv:2403.07669v1 Announce Type: new Abstract: Machine learning has become a common approach to predicting the outcomes of soccer matches, and the body of literature in this domain has grown substantially in the past decade and a half. This chapter discusses available datasets, the types of models and features, and ways of evaluating model performance in this application domain. The aim of this chapter is to give a broad overview of the current state and potential future developments in machine learning for soccer match results prediction, as a resource for those interested in conducting future studies in the area. Our main findings are that while gradient-boosted tree models such as CatBoost, applied to soccer-specific ratings such as pi-ratings, are currently the best-performing models on datasets containing only goals as the match features, there needs to be a more thorough comparison of the performance of deep learning models and Random Forest on a range of datasets with different types of features. Furthermore, new rating systems using both player- and team-level information and incorporating additional information from, e.g., spatiotemporal tracking and event data, could be investigated further. Finally, the interpretability of match result prediction models needs to be enhanced for them to be more useful for team management.  ( 2 min )
    Visual Privacy Auditing with Diffusion Models
    arXiv:2403.07588v1 Announce Type: new Abstract: Image reconstruction attacks on machine learning models pose a significant risk to privacy by potentially leaking sensitive information. Although defending against such attacks using differential privacy (DP) has proven effective, determining appropriate DP parameters remains challenging. Current formal guarantees on data reconstruction success suffer from overly theoretical assumptions regarding adversary knowledge about the target data, particularly in the image domain. In this work, we empirically investigate this discrepancy and find that the practicality of these assumptions strongly depends on the domain shift between the data prior and the reconstruction target. We propose a reconstruction attack based on diffusion models (DMs) that assumes adversary access to real-world image priors and assess its implications on privacy leakage under DP-SGD. We show that (1) real-world data priors significantly influence reconstruction success, (2) current reconstruction bounds do not model the risk posed by data priors well, and (3) DMs can serve as effective auditing tools for visualizing privacy leakage.  ( 2 min )
    Uncertainty in Graph Neural Networks: A Survey
    arXiv:2403.07185v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the downstream tasks as well as the reliability of the GNN predictions. This survey aims to provide a comprehensive overview of the GNNs from the perspective of uncertainty with an emphasis on its integration in graph learning. We compare and summarize existing graph uncertainty theory and methods, alongside the corresponding downstream tasks. Thereby, we bridge the gap between theory and practice, meanwhile connecting different GNN communities. Moreover, our work provides valuable insights into promising directions in this field.  ( 2 min )
    3M-Diffusion: Latent Multi-Modal Diffusion for Text-Guided Generation of Molecular Graphs
    arXiv:2403.07179v1 Announce Type: new Abstract: Generating molecules with desired properties is a critical task with broad applications in drug discovery and materials design. Inspired by recent advances in large language models, there is a growing interest in using natural language descriptions of molecules to generate molecules with the desired properties. Most existing methods focus on generating molecules that precisely match the text description. However, practical applications call for methods that generate diverse, and ideally novel, molecules with the desired properties. We propose 3M-Diffusion, a novel multi-modal molecular graph generation method, to address this challenge. 3M-Diffusion first encodes molecular graphs into a graph latent space aligned with text descriptions. It then reconstructs the molecular structure and atomic attributes based on the given text descriptions using the molecule decoder. It then learns a probabilistic mapping from the text space to the latent molecular graph space using a diffusion model. The results of our extensive experiments on several datasets demonstrate that 3M-Diffusion can generate high-quality, novel and diverse molecular graphs that semantically match the textual description provided.  ( 2 min )
    Reinforced Sequential Decision-Making for Sepsis Treatment: The POSNEGDM Framework with Mortality Classifier and Transformer
    arXiv:2403.07309v1 Announce Type: new Abstract: Sepsis, a life-threatening condition triggered by the body's exaggerated response to infection, demands urgent intervention to prevent severe complications. Existing machine learning methods for managing sepsis struggle in offline scenarios, exhibiting suboptimal performance with survival rates below 50%. This paper introduces the POSNEGDM -- ``Reinforcement Learning with Positive and Negative Demonstrations for Sequential Decision-Making" framework utilizing an innovative transformer-based model and a feedback reinforcer to replicate expert actions while considering individual patient characteristics. A mortality classifier with 96.7\% accuracy guides treatment decisions towards positive outcomes. The POSNEGDM framework significantly improves patient survival, saving 97.39% of patients, outperforming established machine learning algorithms (Decision Transformer and Behavioral Cloning) with survival rates of 33.4% and 43.5%, respectively. Additionally, ablation studies underscore the critical role of the transformer-based decision maker and the integration of a mortality classifier in enhancing overall survival rates. In summary, our proposed approach presents a promising avenue for enhancing sepsis treatment outcomes, contributing to improved patient care and reduced healthcare costs.  ( 2 min )
    Unknown Domain Inconsistency Minimization for Domain Generalization
    arXiv:2403.07329v1 Announce Type: new Abstract: The objective of domain generalization (DG) is to enhance the transferability of the model learned from a source domain to unobserved domains. To prevent overfitting to a specific domain, Sharpness-Aware Minimization (SAM) reduces source domain's loss sharpness. Although SAM variants have delivered significant improvements in DG, we highlight that there's still potential for improvement in generalizing to unknown domains through the exploration on data space. This paper introduces an objective rooted in both parameter and data perturbed regions for domain generalization, coined Unknown Domain Inconsistency Minimization (UDIM). UDIM reduces the loss landscape inconsistency between source domain and unknown domains. As unknown domains are inaccessible, these domains are empirically crafted by perturbing instances from the source domain dataset. In particular, by aligning the loss landscape acquired in the source domain to the loss landscape of perturbed domains, we expect to achieve generalization grounded on these flat minima for the unknown domains. Theoretically, we validate that merging SAM optimization with the UDIM objective establishes an upper bound for the true objective of the DG task. In an empirical aspect, UDIM consistently outperforms SAM variants across multiple DG benchmark datasets. Notably, UDIM shows statistically significant improvements in scenarios with more restrictive domain information, underscoring UDIM's generalization capability in unseen domains. Our code is available at \url{https://github.com/SJShin-AI/UDIM}.  ( 2 min )
    Verification-Aided Learning of Neural Network Barrier Functions with Termination Guarantees
    arXiv:2403.07308v1 Announce Type: new Abstract: Barrier functions are a general framework for establishing a safety guarantee for a system. However, there is no general method for finding these functions. To address this shortcoming, recent approaches use self-supervised learning techniques to learn these functions using training data that are periodically generated by a verification procedure, leading to a verification-aided learning framework. Despite its immense potential in automating barrier function synthesis, the verification-aided learning framework does not have termination guarantees and may suffer from a low success rate of finding a valid barrier function in practice. In this paper, we propose a holistic approach to address these drawbacks. With a convex formulation of the barrier function synthesis, we propose to first learn an empirically well-behaved NN basis function and then apply a fine-tuning algorithm that exploits the convexity and counterexamples from the verification failure to find a valid barrier function with finite-step termination guarantees: if there exist valid barrier functions, the fine-tuning algorithm is guaranteed to find one in a finite number of iterations. We demonstrate that our fine-tuning method can significantly boost the performance of the verification-aided learning framework on examples of different scales and using various neural network verifiers.  ( 2 min )
    Enhancing Transfer Learning with Flexible Nonparametric Posterior Sampling
    arXiv:2403.07282v1 Announce Type: new Abstract: Transfer learning has recently shown significant performance across various tasks involving deep neural networks. In these transfer learning scenarios, the prior distribution for downstream data becomes crucial in Bayesian model averaging (BMA). While previous works proposed the prior over the neural network parameters centered around the pre-trained solution, such strategies have limitations when dealing with distribution shifts between upstream and downstream data. This paper introduces nonparametric transfer learning (NPTL), a flexible posterior sampling method to address the distribution shift issue within the context of nonparametric learning. The nonparametric learning (NPL) method is a recent approach that employs a nonparametric prior for posterior sampling, efficiently accounting for model misspecification scenarios, which is suitable for transfer learning scenarios that may involve the distribution shift between upstream and downstream tasks. Through extensive empirical validations, we demonstrate that our approach surpasses other baselines in BMA performance.  ( 2 min )
    Graph Data Condensation via Self-expressive Graph Structure Reconstruction
    arXiv:2403.07294v1 Announce Type: new Abstract: With the increasing demands of training graph neural networks (GNNs) on large-scale graphs, graph data condensation has emerged as a critical technique to relieve the storage and time costs during the training phase. It aims to condense the original large-scale graph to a much smaller synthetic graph while preserving the essential information necessary for efficiently training a downstream GNN. However, existing methods concentrate either on optimizing node features exclusively or endeavor to independently learn node features and the graph structure generator. They could not explicitly leverage the information of the original graph structure and failed to construct an interpretable graph structure for the synthetic dataset. To address these issues, we introduce a novel framework named \textbf{G}raph Data \textbf{C}ondensation via \textbf{S}elf-expressive Graph Structure \textbf{R}econstruction (\textbf{GCSR}). Our method stands out by (1) explicitly incorporating the original graph structure into the condensing process and (2) capturing the nuanced interdependencies between the condensed nodes by reconstructing an interpretable self-expressive graph structure. Extensive experiments and comprehensive analysis validate the efficacy of the proposed method across diverse GNN models and datasets. Our code is available at https://www.dropbox.com/scl/fi/2aonyp5ln5gisdqtjimu8/GCSR.zip?rlkey=11cuwfpsf54wxiiktu0klud0x&dl=0  ( 2 min )
    Advantage-Aware Policy Optimization for Offline Reinforcement Learning
    arXiv:2403.07262v1 Announce Type: new Abstract: Offline Reinforcement Learning (RL) endeavors to leverage offline datasets to craft effective agent policy without online interaction, which imposes proper conservative constraints with the support of behavior policies to tackle the Out-Of-Distribution (OOD) problem. However, existing works often suffer from the constraint conflict issue when offline datasets are collected from multiple behavior policies, i.e., different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent Advantage-Weighted (AW) methods prioritize samples with high advantage values for agent training while inevitably leading to overfitting on these samples. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning under mixed-quality datasets. Specifically, A2PO employs a Conditional Variational Auto-Encoder (CVAE) to disentangle the action distributions of intertwined behavior policies by modeling the advantage values of all training data as conditional variables. Then the agent can follow such disentangled action distribution constraints to optimize the advantage-aware policy towards high advantage values. Extensive experiments conducted on both the single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to state-of-the-art counterparts. Our code will be made publicly available.  ( 2 min )
    LookupFFN: Making Transformers Compute-lite for CPU inference
    arXiv:2403.07221v1 Announce Type: new Abstract: While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry. But the imbalance between the compute capabilities of GPUs and CPUs is huge. Motivated by these considerations, we study a module which is a workhorse within modern DNN architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite. Specifically, we propose an alternative formulation (we call it LookupFFN) to GEMM based FFNs inspired by the recent studies of using Locality Sensitive Hashing (LSH) to approximate FFNs. Our formulation recasts most essential operations as a memory look-up, leveraging the trade-off between the two resources on any platform: compute and memory (since CPUs offer it in abundance). For RoBERTa language model pretraining, our formulation achieves similar performance compared to GEMM based FFNs, while dramatically reducing the required FLOP. Our development is complemented with a detailed hardware profiling of strategies that will maximize efficiency -- not just on contemporary hardware but on products that will be offered in the near/medium term future. Code is avaiable at \url{https://github.com/mlpen/LookupFFN}.  ( 2 min )
    Which LLM to Play? Convergence-Aware Online Model Selection with Time-Increasing Bandits
    arXiv:2403.07213v1 Announce Type: new Abstract: Web-based applications such as chatbots, search engines and news recommendations continue to grow in scale and complexity with the recent surge in the adoption of LLMs. Online model selection has thus garnered increasing attention due to the need to choose the best model among a diverse set while balancing task reward and exploration cost. Organizations faces decisions like whether to employ a costly API-based LLM or a locally finetuned small LLM, weighing cost against performance. Traditional selection methods often evaluate every candidate model before choosing one, which are becoming impractical given the rising costs of training and finetuning LLMs. Moreover, it is undesirable to allocate excessive resources towards exploring poor-performing models. While some recent works leverage online bandit algorithm to manage such exploration-exploitation trade-off in model selection, they tend to overlook the increasing-then-converging trend in model performances as the model is iteratively finetuned, leading to less accurate predictions and suboptimal model selections. In this paper, we propose a time-increasing bandit algorithm TI-UCB, which effectively predicts the increase of model performances due to finetuning and efficiently balances exploration and exploitation in model selection. To further capture the converging points of models, we develop a change detection mechanism by comparing consecutive increase predictions. We theoretically prove that our algorithm achieves a logarithmic regret upper bound in a typical increasing bandit setting, which implies a fast convergence rate. The advantage of our method is also empirically validated through extensive experiments on classification model selection and online selection of LLMs. Our results highlight the importance of utilizing increasing-then-converging pattern for more efficient and economic model selection in the deployment of LLMs.  ( 3 min )
    $\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language Model
    arXiv:2403.07191v1 Announce Type: new Abstract: Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale. Yet, there is a noticeable absence of a cost-effective and standardized testbed tailored to evaluating and comparing these algorithms. To bridge this gap, we present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers. We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Policy Optimization (IPO) and Direct Policy Optimization (DPO).  ( 2 min )
    An Efficient Learning-based Solver Comparable to Metaheuristics for the Capacitated Arc Routing Problem
    arXiv:2403.07028v1 Announce Type: new Abstract: Recently, neural networks (NN) have made great strides in combinatorial optimization. However, they face challenges when solving the capacitated arc routing problem (CARP) which is to find the minimum-cost tour covering all required edges on a graph, while within capacity constraints. In tackling CARP, NN-based approaches tend to lag behind advanced metaheuristics, since they lack directed arc modeling and efficient learning methods tailored for complex CARP. In this paper, we introduce an NN-based solver to significantly narrow the gap with advanced metaheuristics while exhibiting superior efficiency. First, we propose the direction-aware attention model (DaAM) to incorporate directionality into the embedding process, facilitating more effective one-stage decision-making. Second, we design a supervised reinforcement learning scheme that involves supervised pre-training to establish a robust initial policy for subsequent reinforcement fine-tuning. It proves particularly valuable for solving CARP that has a higher complexity than the node routing problems (NRPs). Finally, a path optimization method is proposed to adjust the depot return positions within the path generated by DaAM. Experiments illustrate that our approach surpasses heuristics and achieves decision quality comparable to state-of-the-art metaheuristics for the first time while maintaining superior efficiency.  ( 2 min )
    All in One: Multi-Task Prompting for Graph Neural Networks (Extended Abstract)
    arXiv:2403.07040v1 Announce Type: new Abstract: This paper is an extended abstract of our original work published in KDD23, where we won the best research paper award (Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. KDD 23) The paper introduces a novel approach to bridging the gap between pre-trained graph models and the diverse tasks they're applied to, inspired by the success of prompt learning in NLP. Recognizing the challenge of aligning pre-trained models with varied graph tasks (node level, edge level, and graph level), which can lead to negative transfer and poor performance, we propose a multi-task prompting method for graphs. This method involves unifying graph and language prompt formats, enabling NLP's prompting strategies to be adapted for graph tasks. By analyzing the task space of graph applications, we reformulate problems to fit graph-level tasks and apply meta-learning to improve prompt initialization for multiple tasks. Experiments show our method's effectiveness in enhancing model performance across different graph tasks. Beyond the original work, in this extended abstract, we further discuss the graph prompt from a bigger picture and provide some of the latest work toward this area.  ( 2 min )
    Leveraging graph neural networks for supporting Automatic Triage of Patients
    arXiv:2403.07038v1 Announce Type: new Abstract: Patient triage plays a crucial role in emergency departments, ensuring timely and appropriate care based on correctly evaluating the emergency grade of patient conditions. Triage methods are generally performed by human operator based on her own experience and information that are gathered from the patient management process. Thus, it is a process that can generate errors in emergency level associations. Recently, Traditional triage methods heavily rely on human decisions, which can be subjective and prone to errors. Recently, a growing interest has been focused on leveraging artificial intelligence (AI) to develop algorithms able to maximize information gathering and minimize errors in patient triage processing. We define and implement an AI based module to manage patients emergency code assignments in emergency departments. It uses emergency department historical data to train the medical decision process. Data containing relevant patient information, such as vital signs, symptoms, and medical history, are used to accurately classify patients into triage categories. Experimental results demonstrate that the proposed algorithm achieved high accuracy outperforming traditional triage methods. By using the proposed method we claim that healthcare professionals can predict severity index to guide patient management processing and resource allocation.  ( 2 min )
    Adaptive Hyperparameter Optimization for Continual Learning Scenarios
    arXiv:2403.07015v1 Announce Type: new Abstract: Hyperparameter selection in continual learning scenarios is a challenging and underexplored aspect, especially in practical non-stationary environments. Traditional approaches, such as grid searches with held-out validation data from all tasks, are unrealistic for building accurate lifelong learning systems. This paper aims to explore the role of hyperparameter selection in continual learning and the necessity of continually and automatically tuning them according to the complexity of the task at hand. Hence, we propose leveraging the nature of sequence task learning to improve Hyperparameter Optimization efficiency. By using the functional analysis of variance-based techniques, we identify the most crucial hyperparameters that have an impact on performance. We demonstrate empirically that this approach, agnostic to continual scenarios and strategies, allows us to speed up hyperparameters optimization continually across tasks and exhibit robustness even in the face of varying sequential task orders. We believe that our findings can contribute to the advancement of continual learning methodologies towards more efficient, robust and adaptable models for real-world applications.  ( 2 min )
    Survival modeling using deep learning, machine learning and statistical methods: A comparative analysis for predicting mortality after hospital admission
    arXiv:2403.06999v1 Announce Type: new Abstract: Survival analysis is essential for studying time-to-event outcomes and providing a dynamic understanding of the probability of an event occurring over time. Various survival analysis techniques, from traditional statistical models to state-of-the-art machine learning algorithms, support healthcare intervention and policy decisions. However, there remains ongoing discussion about their comparative performance. We conducted a comparative study of several survival analysis methods, including Cox proportional hazards (CoxPH), stepwise CoxPH, elastic net penalized Cox model, Random Survival Forests (RSF), Gradient Boosting machine (GBM) learning, AutoScore-Survival, DeepSurv, time-dependent Cox model based on neural network (CoxTime), and DeepHit survival neural network. We applied the concordance index (C-index) for model goodness-of-fit, and integral Brier scores (IBS) for calibration, and considered the model interpretability. As a case study, we performed a retrospective analysis of patients admitted through the emergency department of a tertiary hospital from 2017 to 2019, predicting 90-day all-cause mortality based on patient demographics, clinicopathological features, and historical data. The results of the C-index indicate that deep learning achieved comparable performance, with DeepSurv producing the best discrimination (DeepSurv: 0.893; CoxTime: 0.892; DeepHit: 0.891). The calibration of DeepSurv (IBS: 0.041) performed the best, followed by RSF (IBS: 0.042) and GBM (IBS: 0.0421), all using the full variables. Moreover, AutoScore-Survival, using a minimal variable subset, is easy to interpret, and can achieve good discrimination and calibration (C-index: 0.867; IBS: 0.044). While all models were satisfactory, DeepSurv exhibited the best discrimination and calibration. In addition, AutoScore-Survival offers a more parsimonious model and excellent interpretability.  ( 3 min )
    A Converting Autoencoder Toward Low-latency and Energy-efficient DNN Inference at the Edge
    arXiv:2403.07036v1 Announce Type: new Abstract: Reducing inference time and energy usage while maintaining prediction accuracy has become a significant concern for deep neural networks (DNN) inference on resource-constrained edge devices. To address this problem, we propose a novel approach based on "converting" autoencoder and lightweight DNNs. This improves upon recent work such as early-exiting framework and DNN partitioning. Early-exiting frameworks spend different amounts of computation power for different input data depending upon their complexity. However, they can be inefficient in real-world scenarios that deal with many hard image samples. On the other hand, DNN partitioning algorithms that utilize the computation power of both the cloud and edge devices can be affected by network delays and intermittent connections between the cloud and the edge. We present CBNet, a low-latency and energy-efficient DNN inference framework tailored for edge devices. It utilizes a "converting" autoencoder to efficiently transform hard images into easy ones, which are subsequently processed by a lightweight DNN for inference. To the best of our knowledge, such autoencoder has not been proposed earlier. Our experimental results using three popular image-classification datasets on a Raspberry Pi 4, a Google Cloud instance, and an instance with Nvidia Tesla K80 GPU show that CBNet achieves up to 4.8x speedup in inference latency and 79% reduction in energy usage compared to competing techniques while maintaining similar or higher accuracy.  ( 3 min )
    The Cram Method for Efficient Simultaneous Learning and Evaluation
    arXiv:2403.07031v1 Announce Type: new Abstract: We introduce the "cram" method, a general and efficient approach to simultaneous learning and evaluation using a generic machine learning (ML) algorithm. In a single pass of batched data, the proposed method repeatedly trains an ML algorithm and tests its empirical performance. Because it utilizes the entire sample for both learning and evaluation, cramming is significantly more data-efficient than sample-splitting. The cram method also naturally accommodates online learning algorithms, making its implementation computationally efficient. To demonstrate the power of the cram method, we consider the standard policy learning setting where cramming is applied to the same data to both develop an individualized treatment rule (ITR) and estimate the average outcome that would result if the learned ITR were to be deployed. We show that under a minimal set of assumptions, the resulting crammed evaluation estimator is consistent and asymptotically normal. While our asymptotic results require a relatively weak stabilization condition of ML algorithm, we develop a simple, generic method that can be used with any policy learning algorithm to satisfy this condition. Our extensive simulation studies show that, when compared to sample-splitting, cramming reduces the evaluation standard error by more than 40% while improving the performance of learned policy. We also apply the cram method to a randomized clinical trial to demonstrate its applicability to real-world problems. Finally, we briefly discuss future extensions of the cram method to other learning and evaluation settings.  ( 3 min )
    FWin transformer for dengue prediction under climate and ocean influence
    arXiv:2403.07027v1 Announce Type: new Abstract: Dengue fever is one of the most deadly mosquito-born tropical infectious diseases. Detailed long range forecast model is vital in controlling the spread of disease and making mitigation efforts. In this study, we examine methods used to forecast dengue cases for long range predictions. The dataset consists of local climate/weather in addition to global climate indicators of Singapore from 2000 to 2019. We utilize newly developed deep neural networks to learn the intricate relationship between the features. The baseline models in this study are in the class of recent transformers for long sequence forecasting tasks. We found that a Fourier mixed window attention (FWin) based transformer performed the best in terms of both the mean square error and the maximum absolute error on the long range dengue forecast up to 60 weeks.  ( 2 min )
    Interpreting What Typical Fault Signals Look Like via Prototype-matching
    arXiv:2403.07033v1 Announce Type: new Abstract: Neural networks, with powerful nonlinear mapping and classification capabilities, are widely applied in mechanical fault diagnosis to ensure safety. However, being typical black-box models, their application is limited in high-reliability-required scenarios. To understand the classification logic and explain what typical fault signals look like, the prototype matching network (PMN) is proposed by combining the human-inherent prototype-matching with autoencoder (AE). The PMN matches AE-extracted feature with each prototype and selects the most similar prototype as the prediction result. It has three interpreting paths on classification logic, fault prototypes, and matching contributions. Conventional diagnosis and domain generalization experiments demonstrate its competitive diagnostic performance and distinguished advantages in representation learning. Besides, the learned typical fault signals (i.e., sample-level prototypes) showcase the ability for denoising and extracting subtle key features that experts find challenging to capture. This ability broadens human understanding and provides a promising solution from interpretability research to AI-for-Science.  ( 2 min )
    Non-Intrusive Load Monitoring with Missing Data Imputation Based on Tensor Decomposition
    arXiv:2403.07012v1 Announce Type: new Abstract: With the widespread adoption of Non-Intrusive Load Monitoring (NILM) in building energy management, ensuring the high quality of NILM data has become imperative. However, practical applications of NILM face challenges associated with data loss, significantly impacting accuracy and reliability in energy management. This paper addresses the issue of NILM data loss by introducing an innovative tensor completion(TC) model- Proportional-Integral-Derivative (PID)-incorporated Non-negative Latent Factorization of Tensors (PNLFT) with twofold ideas: 1) To tackle the issue of slow convergence in Latent Factorization of Tensors (LFT) using Stochastic Gradient Descent (SGD), a Proportional-Integral-Derivative controller is introduced during the learning process. The PID controller utilizes historical and current information to control learning residuals. 2) Considering the characteristics of NILM data, non-negative update rules are proposed in the model's learning scheme. Experimental results on three datasets demonstrate that, compared to state-of-the-art models, the proposed model exhibits noteworthy enhancements in both convergence speed and accuracy.  ( 2 min )
    AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation
    arXiv:2403.07030v1 Announce Type: new Abstract: Due to privacy or patent concerns, a growing number of large models are released without granting access to their training data, making transferring their knowledge inefficient and problematic. In response, Data-Free Knowledge Distillation (DFKD) methods have emerged as direct solutions. However, simply adopting models derived from DFKD for real-world applications suffers significant performance degradation, due to the discrepancy between teachers' training data and real-world scenarios (student domain). The degradation stems from the portions of teachers' knowledge that are not applicable to the student domain. They are specific to the teacher domain and would undermine students' performance. Hence, selectively transferring teachers' appropriate knowledge becomes the primary challenge in DFKD. In this work, we propose a simple but effective method AuG-KD. It utilizes an uncertainty-guided and sample-specific anchor to align student-domain data with the teacher domain and leverages a generative method to progressively trade off the learning process between OOD knowledge distillation and domain-specific information learning via mixup learning. Extensive experiments in 3 datasets and 8 settings demonstrate the stability and superiority of our approach. Code available at https://github.com/IshiKura-a/AuG-KD .  ( 2 min )
    A Unified Model for Spatio-Temporal Prediction Queries with Arbitrary Modifiable Areal Units
    arXiv:2403.07022v1 Announce Type: new Abstract: Spatio-Temporal (ST) prediction is crucial for making informed decisions in urban location-based applications like ride-sharing. However, existing ST models often require region partition as a prerequisite, resulting in two main pitfalls. Firstly, location-based services necessitate ad-hoc regions for various purposes, requiring multiple ST models with varying scales and zones, which can be costly to support. Secondly, different ST models may produce conflicting outputs, resulting in confusing predictions. In this paper, we propose One4All-ST, a framework that can conduct ST prediction for arbitrary modifiable areal units using only one model. To reduce the cost of getting multi-scale predictions, we design an ST network with hierarchical spatial modeling and scale normalization modules to efficiently and equally learn multi-scale representations. To address prediction inconsistencies across scales, we propose a dynamic programming scheme to solve the formulated optimal combination problem, minimizing predicted error through theoretical analysis. Besides, we suggest using an extended quad-tree to index the optimal combinations for quick response to arbitrary modifiable areal units in practical online scenarios. Extensive experiments on two real-world datasets verify the efficiency and effectiveness of One4All-ST in ST prediction for arbitrary modifiable areal units. The source codes and data of this work are available at https://github.com/uctb/One4All-ST.  ( 2 min )
    Improving deep learning with prior knowledge and cognitive models: A survey on enhancing explainability, adversarial robustness and zero-shot learning
    arXiv:2403.07078v1 Announce Type: new Abstract: We review current and emerging knowledge-informed and brain-inspired cognitive systems for realizing adversarial defenses, eXplainable Artificial Intelligence (XAI), and zero-shot or few-short learning. Data-driven deep learning models have achieved remarkable performance and demonstrated capabilities surpassing human experts in many applications. Yet, their inability to exploit domain knowledge leads to serious performance limitations in practical applications. In particular, deep learning systems are exposed to adversarial attacks, which can trick them into making glaringly incorrect decisions. Moreover, complex data-driven models typically lack interpretability or explainability, i.e., their decisions cannot be understood by human subjects. Furthermore, models are usually trained on standard datasets with a closed-world assumption. Hence, they struggle to generalize to unseen cases during inference in practical open-world environments, thus, raising the zero- or few-shot generalization problem. Although many conventional solutions exist, explicit domain knowledge, brain-inspired neural network and cognitive architectures offer powerful new dimensions towards alleviating these problems. Prior knowledge is represented in appropriate forms and incorporated in deep learning frameworks to improve performance. Brain-inspired cognition methods use computational models that mimic the human mind to enhance intelligent behavior in artificial agents and autonomous robots. Ultimately, these models achieve better explainability, higher adversarial robustness and data-efficient learning, and can, in turn, provide insights for cognitive science and neuroscience-that is, to deepen human understanding on how the brain works in general, and how it handles these problems.  ( 3 min )
    On the Limited Representational Power of Value Functions and its Links to Statistical (In)Efficiency
    arXiv:2403.07136v1 Announce Type: new Abstract: Identifying the trade-offs between model-based and model-free methods is a central question in reinforcement learning. Value-based methods offer substantial computational advantages and are sometimes just as statistically efficient as model-based methods. However, focusing on the core problem of policy evaluation, we show information about the transition dynamics may be impossible to represent in the space of value functions. We explore this through a series of case studies focused on structures that arises in many important problems. In several, there is no information loss and value-based methods are as statistically efficient as model based ones. In other closely-related examples, information loss is severe and value-based methods are severely outperformed. A deeper investigation points to the limitations of the representational power as the driver of the inefficiency, as opposed to failure in algorithm design.  ( 2 min )
    Overcoming the Paradox of Certified Training with Gaussian Smoothing
    arXiv:2403.07095v1 Announce Type: new Abstract: Training neural networks with high certified accuracy against adversarial examples remains an open problem despite significant efforts. While certification methods can effectively leverage tight convex relaxations for bound computation, in training, these methods perform worse than looser relaxations. Prior work hypothesized that this is caused by the discontinuity and perturbation sensitivity of the loss surface induced by these tighter relaxations. In this work, we show theoretically that Gaussian Loss Smoothing can alleviate both of these issues. We confirm this empirically by proposing a certified training method combining PGPE, an algorithm computing gradients of a smoothed loss, with different convex relaxations. When using this training method, we observe that tighter bounds indeed lead to strictly better networks that can outperform state-of-the-art methods on the same network. While scaling PGPE-based training remains challenging due to high computational cost, our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks.  ( 2 min )
    FALCON: FLOP-Aware Combinatorial Optimization for Neural Network Pruning
    arXiv:2403.07094v1 Announce Type: new Abstract: The increasing computational demands of modern neural networks present deployment challenges on resource-constrained devices. Network pruning offers a solution to reduce model size and computational cost while maintaining performance. However, most current pruning methods focus primarily on improving sparsity by reducing the number of nonzero parameters, often neglecting other deployment costs such as inference time, which are closely related to the number of floating-point operations (FLOPs). In this paper, we propose FALCON, a novel combinatorial-optimization-based framework for network pruning that jointly takes into account model accuracy (fidelity), FLOPs, and sparsity constraints. A main building block of our approach is an integer linear program (ILP) that simultaneously handles FLOP and sparsity constraints. We present a novel algorithm to approximately solve the ILP. We propose a novel first-order method for our optimization framework which makes use of our ILP solver. Using problem structure (e.g., the low-rank structure of approx. Hessian), we can address instances with millions of parameters. Our experiments demonstrate that FALCON achieves superior accuracy compared to other pruning approaches within a fixed FLOP budget. For instance, for ResNet50 with 20% of the total FLOPs retained, our approach improves the accuracy by 48% relative to state-of-the-art. Furthermore, in gradual pruning settings with re-training between pruning steps, our framework outperforms existing pruning methods, emphasizing the significance of incorporating both FLOP and sparsity constraints for effective network pruning.  ( 2 min )
    COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization
    arXiv:2403.07134v1 Announce Type: new Abstract: Post-training quantization (PTQ) has emerged as a practical approach to compress large neural networks, making them highly efficient for deployment. However, effectively reducing these models to their low-bit counterparts without compromising the original accuracy remains a key challenge. In this paper, we propose an innovative PTQ algorithm termed COMQ, which sequentially conducts coordinate-wise minimization of the layer-wise reconstruction errors. We consider the widely used integer quantization, where every quantized weight can be decomposed into a shared floating-point scalar and an integer bit-code. Within a fixed layer, COMQ treats all the scaling factor(s) and bit-codes as the variables of the reconstruction error. Every iteration improves this error along a single coordinate while keeping all other variables constant. COMQ is easy to use and requires no hyper-parameter tuning. It instead involves only dot products and rounding operations. We update these variables in a carefully designed greedy order, significantly enhancing the accuracy. COMQ achieves remarkable results in quantizing 4-bit Vision Transformers, with a negligible loss of less than 1% in Top-1 accuracy. In 4-bit INT quantization of convolutional neural networks, COMQ maintains near-lossless accuracy with a minimal drop of merely 0.3% in Top-1 accuracy.  ( 2 min )
    Don't Forget What I did?: Assessing Client Contributions in Federated Learning
    arXiv:2403.07151v1 Announce Type: new Abstract: Federated Learning (FL) is a collaborative machine learning (ML) approach, where multiple clients participate in training an ML model without exposing the private data. Fair and accurate assessment of client contributions is an important problem in FL to facilitate incentive allocation and encouraging diverse clients to participate in a unified model training. Existing methods for assessing client contribution adopts co-operative game-theoretic concepts, such as Shapley values, but under simplified assumptions. In this paper, we propose a history-aware game-theoretic framework, called FLContrib, to assess client contributions when a subset of (potentially non-i.i.d.) clients participate in each epoch of FL training. By exploiting the FL training process and linearity of Shapley value, we develop FLContrib that yields a historical timeline of client contributions as FL training progresses over epochs. Additionally, to assess client contribution under limited computational budget, we propose a scheduling procedure that considers a two-sided fairness criteria to perform expensive Shapley value computation only in a subset of training epochs. In experiments, we demonstrate a controlled trade-off between the correctness and efficiency of client contributions assessed via FLContrib. To demonstrate the benefits of history-aware client contributions, we apply FLContrib to detect dishonest clients conducting data poisoning in FL training.  ( 3 min )
    Ant Colony Sampling with GFlowNets for Combinatorial Optimization
    arXiv:2403.07041v1 Announce Type: new Abstract: This paper introduces the Generative Flow Ant Colony Sampler (GFACS), a novel neural-guided meta-heuristic algorithm for combinatorial optimization. GFACS integrates generative flow networks (GFlowNets) with the ant colony optimization (ACO) methodology. GFlowNets, a generative model that learns a constructive policy in combinatorial spaces, enhance ACO by providing an informed prior distribution of decision variables conditioned on input graph instances. Furthermore, we introduce a novel combination of training tricks, including search-guided local exploration, energy normalization, and energy shaping to improve GFACS. Our experimental results demonstrate that GFACS outperforms baseline ACO algorithms in seven CO tasks and is competitive with problem-specific heuristics for vehicle routing problems. The source code is available at \url{https://github.com/ai4co/gfacs}.  ( 2 min )
    Explainable Learning with Gaussian Processes
    arXiv:2403.07072v1 Announce Type: new Abstract: The field of explainable artificial intelligence (XAI) attempts to develop methods that provide insight into how complicated machine learning methods make predictions. Many methods of explanation have focused on the concept of feature attribution, a decomposition of the model's prediction into individual contributions corresponding to each input feature. In this work, we explore the problem of feature attribution in the context of Gaussian process regression (GPR). We take a principled approach to defining attributions under model uncertainty, extending the existing literature. We show that although GPR is a highly flexible and non-parametric approach, we can derive interpretable, closed-form expressions for the feature attributions. When using integrated gradients as an attribution method, we show that the attributions of a GPR model also follow a Gaussian process distribution, which quantifies the uncertainty in attribution arising from uncertainty in the model. We demonstrate, both through theory and experimentation, the versatility and robustness of this approach. We also show that, when applicable, the exact expressions for GPR attributions are both more accurate and less computationally expensive than the approximations currently used in practice. The source code for this project is freely available under MIT license at https://github.com/KurtButler/2024_attributions_paper.  ( 2 min )
    A multi-cohort study on prediction of acute brain dysfunction states using selective state space models
    arXiv:2403.07201v1 Announce Type: new Abstract: Assessing acute brain dysfunction (ABD), including delirium and coma in the intensive care unit (ICU), is a critical challenge due to its prevalence and severe implications for patient outcomes. Current diagnostic methods rely on infrequent clinical observations, which can only determine a patient's ABD status after onset. Our research attempts to solve these problems by harnessing Electronic Health Records (EHR) data to develop automated methods for ABD prediction for patients in the ICU. Existing models solely predict a single state (e.g., either delirium or coma), require at least 24 hours of observation data to make predictions, do not dynamically predict fluctuating ABD conditions during ICU stay (typically a one-time prediction), and use small sample size, proprietary single-hospital datasets. Our research fills these gaps in the existing literature by dynamically predicting delirium, coma, and mortality for 12-hour intervals throughout an ICU stay and validating on two public datasets. Our research also introduces the concept of dynamically predicting critical transitions from non-ABD to ABD and between different ABD states in real time, which could be clinically more informative for the hospital staff. We compared the predictive performance of two state-of-the-art neural network models, the MAMBA selective state space model and the Longformer Transformer model. Using the MAMBA model, we achieved a mean area under the receiving operator characteristic curve (AUROC) of 0.95 on outcome prediction of ABD for 12-hour intervals. The model achieves a mean AUROC of 0.79 when predicting transitions between ABD states. Our study uses a curated dataset from the University of Florida Health Shands Hospital for internal validation and two publicly available datasets, MIMIC-IV and eICU, for external validation, demonstrating robustness across ICU stays from 203 hospitals and 140,945 patients.  ( 3 min )
    UPS: Towards Foundation Models for PDE Solving via Cross-Modal Adaptation
    arXiv:2403.07187v1 Announce Type: new Abstract: We introduce UPS (Unified PDE Solver), an effective and data-efficient approach to solve diverse spatiotemporal PDEs defined over various domains, dimensions, and resolutions. UPS unifies different PDEs into a consistent representation space and processes diverse collections of PDE data using a unified network architecture that combines LLMs with domain-specific neural operators. We train the network via a two-stage cross-modal adaptation process, leveraging ideas of modality alignment and multi-task learning. By adapting from pretrained LLMs and exploiting text-form meta information, we are able to use considerably fewer training samples than previous methods while obtaining strong empirical results. UPS outperforms existing baselines, often by a large margin, on a wide range of 1D and 2D datasets in PDEBench, achieving state-of-the-art results on 8 of 10 tasks considered. Meanwhile, it is capable of few-shot transfer to different PDE families, coefficients, and resolutions.  ( 2 min )
    Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation
    arXiv:2403.06988v1 Announce Type: new Abstract: To ensure that text generated by large language models (LLMs) is in an expected format, constrained decoding proposes to enforce strict formal language constraints during generation. However, as we show in this work, not only do such methods incur performance overhead during generation, but many of them also significantly impair task accuracy, if they do not correctly align the underlying LLM sub-word vocabularies with external constraints. To address this, we present a novel decoding algorithm, DOMINO, that can enforce constraints in a fully subword-aligned fashion, while leveraging pre-computation and speculative decoding to achieve virtually no overhead and in some cases even almost 2$\times$ speedup over unconstrained decoding -- thereby outperforming existing approaches by a wide margin.  ( 2 min )
    AutoEval Done Right: Using Synthetic Data for Model Evaluation
    arXiv:2403.07008v1 Announce Type: new Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.  ( 2 min )
  • Open

    Beyond Regrets: Geometric Metrics for Bayesian Optimization
    arXiv:2401.01981v2 Announce Type: replace-cross Abstract: Bayesian optimization is a principled optimization strategy for a black-box objective function. It shows its effectiveness in a wide variety of real-world applications such as scientific discovery and experimental design. In general, the performance of Bayesian optimization is reported through regret-based metrics such as instantaneous, simple, and cumulative regrets. These metrics only rely on function evaluations, so that they do not consider geometric relationships between query points and global solutions, or query points themselves. Notably, they cannot discriminate if multiple global solutions are successfully found. Moreover, they do not evaluate Bayesian optimization's abilities to exploit and explore a search space given. To tackle these issues, we propose four new geometric metrics, i.e., precision, recall, average degree, and average distance. These metrics allow us to compare Bayesian optimization algorithms considering the geometry of both query points and global optima, or query points. However, they are accompanied by an extra parameter, which needs to be carefully determined. We therefore devise the parameter-free forms of the respective metrics by integrating out the additional parameter. Finally, we empirically validate that our proposed metrics can provide more delicate interpretation of Bayesian optimization algorithms, on top of assessment via the conventional metrics.  ( 2 min )
    Utility Theory of Synthetic Data Generation
    arXiv:2305.10015v2 Announce Type: replace Abstract: Synthetic data algorithms are widely employed in industries to generate artificial data for downstream learning tasks. While existing research primarily focuses on empirically evaluating utility of synthetic data, its theoretical understanding is largely lacking. This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework. It considers two utility metrics: generalization and ranking of models trained on synthetic data. The former is defined as the generalization difference between models trained on synthetic and on real data. By deriving analytical bounds for this utility metric, we demonstrate that the synthetic feature distribution does not need to be similar as that of real data for ensuring comparable generalization of synthetic models, provided proper model specifications in downstream learning tasks. The latter utility metric studies the relative performance of models trained on synthetic data. In particular, we discover that the distribution of synthetic data is not necessarily similar as the real one to ensure consistent model comparison. Interestingly, consistent model comparison is still achievable even when synthetic responses are not well generated, as long as downstream models are separable by a generalization gap. Finally, extensive experiments on non-parametric models and deep neural networks have been conducted to validate these theoretical findings.  ( 2 min )
    Quantum Inception Score
    arXiv:2311.12163v2 Announce Type: replace-cross Abstract: Motivated by the great success of classical generative models in machine learning, enthusiastic exploration of their quantum version has recently started. To depart on this journey, it is important to develop a relevant metric to evaluate the quality of quantum generative models; in the classical case, one such example is the inception score. In this paper, we propose the quantum inception score, which relates the quality to the Holevo information of the quantum channel that classifies a given dataset. We prove that, under this proposed measure, the quantum generative models provide better quality than their classical counterparts because of the presence of quantum coherence, characterized by the resource theory of asymmetry, and entanglement. Furthermore, we harness the quantum fluctuation theorem to characterize the physical limitation of the quality of quantum generative models. Finally, we apply the quantum inception score to assess the quality of the one-dimensional spin chain model as a quantum generative model, with the quantum convolutional neural network as a quantum classifier, for the phase classification problem in the quantum many-body physics.  ( 2 min )
    A local approach to parameter space reduction for regression and classification tasks
    arXiv:2107.10867v3 Announce Type: replace Abstract: Parameter space reduction has been proved to be a crucial tool to speed-up the execution of many numerical tasks such as optimization, inverse problems, sensitivity analysis, and surrogate models' design, especially when in presence of high-dimensional parametrized systems. In this work we propose a new method called local active subspaces (LAS), which explores the synergies of active subspaces with supervised clustering techniques in order to carry out a more efficient dimension reduction in the parameter space. The clustering is performed without losing the input-output relations by introducing a distance metric induced by the global active subspace. We present two possible clustering algorithms: K-medoids and a hierarchical top-down approach, which is able to impose a variety of subdivision criteria specifically tailored for parameter space reduction tasks. This method is particularly useful for the community working on surrogate modelling. Frequently, the parameter space presents subdomains where the objective function of interest varies less on average along different directions. So, it could be approximated more accurately if restricted to those subdomains and studied separately. We tested the new method over several numerical experiments of increasing complexity, we show how to deal with vectorial outputs, and how to classify the different regions with respect to the local active subspace dimension. Employing this classification technique as a preprocessing step in the parameter space, or output space in case of vectorial outputs, brings remarkable results for the purpose of surrogate modelling.  ( 3 min )
    The Cram Method for Efficient Simultaneous Learning and Evaluation
    arXiv:2403.07031v1 Announce Type: cross Abstract: We introduce the "cram" method, a general and efficient approach to simultaneous learning and evaluation using a generic machine learning (ML) algorithm. In a single pass of batched data, the proposed method repeatedly trains an ML algorithm and tests its empirical performance. Because it utilizes the entire sample for both learning and evaluation, cramming is significantly more data-efficient than sample-splitting. The cram method also naturally accommodates online learning algorithms, making its implementation computationally efficient. To demonstrate the power of the cram method, we consider the standard policy learning setting where cramming is applied to the same data to both develop an individualized treatment rule (ITR) and estimate the average outcome that would result if the learned ITR were to be deployed. We show that under a minimal set of assumptions, the resulting crammed evaluation estimator is consistent and asymptotically normal. While our asymptotic results require a relatively weak stabilization condition of ML algorithm, we develop a simple, generic method that can be used with any policy learning algorithm to satisfy this condition. Our extensive simulation studies show that, when compared to sample-splitting, cramming reduces the evaluation standard error by more than 40% while improving the performance of learned policy. We also apply the cram method to a randomized clinical trial to demonstrate its applicability to real-world problems. Finally, we briefly discuss future extensions of the cram method to other learning and evaluation settings.  ( 3 min )
    Learning and Decision-Making with Data: Optimal Formulations and Phase Transitions
    arXiv:2109.06911v3 Announce Type: replace Abstract: We study the problem of designing optimal learning and decision-making formulations when only historical data is available. Prior work typically commits to a particular class of data-driven formulation and subsequently tries to establish out-of-sample performance guarantees. We take here the opposite approach. We define first a sensible yard stick with which to measure the quality of any data-driven formulation and subsequently seek to find an optimal such formulation. Informally, any data-driven formulation can be seen to balance a measure of proximity of the estimated cost to the actual cost while guaranteeing a level of out-of-sample performance. Given an acceptable level of out-of-sample performance, we construct explicitly a data-driven formulation that is uniformly closer to the true cost than any other formulation enjoying the same out-of-sample performance. We show the existence of three distinct out-of-sample performance regimes (a superexponential regime, an exponential regime and a subexponential regime) between which the nature of the optimal data-driven formulation experiences a phase transition. The optimal data-driven formulations can be interpreted as a classically robust formulation in the superexponential regime, an entropic distributionally robust formulation in the exponential regime and finally a variance penalized formulation in the subexponential regime. This final observation unveils a surprising connection between these three, at first glance seemingly unrelated, data-driven formulations which until now remained hidden.  ( 3 min )
    Hallmarks of Optimization Trajectories in Neural Networks and LLMs: The Lengths, Bends, and Dead Ends
    arXiv:2403.07379v1 Announce Type: cross Abstract: We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich structure of parameters contained within their optimization trajectories. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which reveal the inherent nuance and interplay involved between various optimization choices, such as momentum, weight decay, and batch size. We use them to provide key hallmarks about the nature of optimization in deep neural networks: when it goes right, and when it finds itself in a dead end. Further, thanks to our trajectory perspective, we uncover an intertwined behaviour of momentum and weight decay that promotes directional exploration, as well as a directional regularization behaviour of some others. We perform experiments over large-scale vision and language settings, including large language models (LLMs) with up to 12 billion parameters, to demonstrate the value of our approach.  ( 2 min )
    Probabilistic Easy Variational Causal Effect
    arXiv:2403.07745v1 Announce Type: new Abstract: Let $X$ and $Z$ be random vectors, and $Y=g(X,Z)$. In this paper, on the one hand, for the case that $X$ and $Z$ are continuous, by using the ideas from the total variation and the flux of $g$, we develop a point of view in causal inference capable of dealing with a broad domain of causal problems. Indeed, we focus on a function, called Probabilistic Easy Variational Causal Effect (PEACE), which can measure the direct causal effect of $X$ on $Y$ with respect to continuously and interventionally changing the values of $X$ while keeping the value of $Z$ constant. PEACE is a function of $d\ge 0$, which is a degree managing the strengths of probability density values $f(x|z)$. On the other hand, we generalize the above idea for the discrete case and show its compatibility with the continuous case. Further, we investigate some properties of PEACE using measure theoretical concepts. Furthermore, we provide some identifiability criteria and several examples showing the generic capability of PEACE. We note that PEACE can deal with the causal problems for which micro-level or just macro-level changes in the value of the input variables are important. Finally, PEACE is stable under small changes in $\partial g_{in}/\partial x$ and the joint distribution of $X$ and $Z$, where $g_{in}$ is obtained from $g$ by removing all functional relationships defining $X$ and $Z$.  ( 2 min )
    CAS: A General Algorithm for Online Selective Conformal Prediction with FCR Control
    arXiv:2403.07728v1 Announce Type: new Abstract: We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR) to measure the averaged miscoverage error. We develop a general framework named CAS (Calibration after Adaptive Selection) that can wrap around any prediction model and online selection rule to output post-selection prediction intervals. If the current individual is selected, we first perform an adaptive selection on historical data to construct a calibration set, then output a conformal prediction interval for the unobserved label. We provide tractable constructions for the calibration set for popular online selection rules. We proved that CAS can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. For the decision-driven selection rule, including most online multiple-testing procedures, CAS can exactly control the real-time FCR below the target level without any distributional assumptions. For the online selection with symmetric thresholds, we establish the error bound for the control gap of FCR under mild distributional assumptions. To account for the distribution shift in online data, we also embed CAS into some recent dynamic conformal prediction methods and examine the long-run FCR control. Numerical results on both synthetic and real data corroborate that CAS can effectively control FCR around the target level and yield more narrowed prediction intervals over existing baselines across various settings.  ( 3 min )
    Tracking Dynamic Gaussian Density with a Theoretically Optimal Sliding Window Approach
    arXiv:2403.07207v1 Announce Type: new Abstract: Dynamic density estimation is ubiquitous in many applications, including computer vision and signal processing. One popular method to tackle this problem is the "sliding window" kernel density estimator. There exist various implementations of this method that use heuristically defined weight sequences for the observed data. The weight sequence, however, is a key aspect of the estimator affecting the tracking performance significantly. In this work, we study the exact mean integrated squared error (MISE) of "sliding window" Gaussian Kernel Density Estimators for evolving Gaussian densities. We provide a principled guide for choosing the optimal weight sequence by theoretically characterizing the exact MISE, which can be formulated as constrained quadratic programming. We present empirical evidence with synthetic datasets to show that our weighting scheme indeed improves the tracking performance compared to heuristic approaches.  ( 2 min )
    Alignment of Density Maps in Wasserstein Distance
    arXiv:2305.12310v2 Announce Type: replace-cross Abstract: In this paper we propose an algorithm for aligning three-dimensional objects when represented as density maps, motivated by applications in cryogenic electron microscopy. The algorithm is based on minimizing the 1-Wasserstein distance between the density maps after a rigid transformation. The induced loss function enjoys a more benign landscape than its Euclidean counterpart and Bayesian optimization is employed for computation. Numerical experiments show improved accuracy and efficiency over existing algorithms on the alignment of real protein molecules. In the context of aligning heterogeneous pairs, we illustrate a potential need for new distance functions.  ( 2 min )
    On Ranking-based Tests of Independence
    arXiv:2403.07464v1 Announce Type: cross Abstract: In this paper we develop a novel nonparametric framework to test the independence of two random variables $\mathbf{X}$ and $\mathbf{Y}$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dx dy)$, based on {\it Receiver Operating Characteristic} (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that, the independence hypothesis $\mathcal{H}\_0$ is necessarily false as soon as the optimal scoring function related to the pair of distributions $(H\otimes G,\; F)$, obtained from a bipartite ranking algorithm, has a ROC curve that deviates from the main diagonal of the unit square.We consider a wide class of rank statistics encompassing many ways of deviating from the diagonal in the ROC space to build tests of independence. Beyond its great flexibility, this new method has theoretical properties that far surpass those of its competitors. Nonasymptotic bounds for the two types of testing errors are established. From an empirical perspective, the novel procedure we promote in this paper exhibits a remarkable ability to detect small departures, of various types, from the null assumption $\mathcal{H}_0$, even in high dimension, as supported by the numerical experiments presented here.  ( 2 min )
    TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality
    arXiv:2302.02224v2 Announce Type: replace-cross Abstract: This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). Our results lay the foundations for introducing The Attention Patch (TAP), a simple neural network add-on that allows data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using four real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.  ( 2 min )
    Which LLM to Play? Convergence-Aware Online Model Selection with Time-Increasing Bandits
    arXiv:2403.07213v1 Announce Type: cross Abstract: Web-based applications such as chatbots, search engines and news recommendations continue to grow in scale and complexity with the recent surge in the adoption of LLMs. Online model selection has thus garnered increasing attention due to the need to choose the best model among a diverse set while balancing task reward and exploration cost. Organizations faces decisions like whether to employ a costly API-based LLM or a locally finetuned small LLM, weighing cost against performance. Traditional selection methods often evaluate every candidate model before choosing one, which are becoming impractical given the rising costs of training and finetuning LLMs. Moreover, it is undesirable to allocate excessive resources towards exploring poor-performing models. While some recent works leverage online bandit algorithm to manage such exploration-exploitation trade-off in model selection, they tend to overlook the increasing-then-converging trend in model performances as the model is iteratively finetuned, leading to less accurate predictions and suboptimal model selections. In this paper, we propose a time-increasing bandit algorithm TI-UCB, which effectively predicts the increase of model performances due to finetuning and efficiently balances exploration and exploitation in model selection. To further capture the converging points of models, we develop a change detection mechanism by comparing consecutive increase predictions. We theoretically prove that our algorithm achieves a logarithmic regret upper bound in a typical increasing bandit setting, which implies a fast convergence rate. The advantage of our method is also empirically validated through extensive experiments on classification model selection and online selection of LLMs. Our results highlight the importance of utilizing increasing-then-converging pattern for more efficient and economic model selection in the deployment of LLMs.  ( 3 min )
    On the Last-Iterate Convergence of Shuffling Gradient Methods
    arXiv:2403.07723v1 Announce Type: cross Abstract: Shuffling gradient methods, which are also known as stochastic gradient descent (SGD) without replacement, are widely implemented in practice, particularly including three popular algorithms: Random Reshuffle (RR), Shuffle Once (SO), and Incremental Gradient (IG). Compared to the empirical success, the theoretical guarantee of shuffling gradient methods was not well-understanding for a long time. Until recently, the convergence rates had just been established for the average iterate for convex functions and the last iterate for strongly convex problems (using squared distance as the metric). However, when using the function value gap as the convergence criterion, existing theories cannot interpret the good performance of the last iterate in different settings (e.g., constrained optimization). To bridge this gap between practice and theory, we prove last-iterate convergence rates for shuffling gradient methods with respect to the objective value even without strong convexity. Our new results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.  ( 2 min )
    Integrating Uncertainty Awareness into Conformalized Quantile Regression
    arXiv:2306.08693v2 Announce Type: replace-cross Abstract: Conformalized Quantile Regression (CQR) is a recently proposed method for constructing prediction intervals for a response $Y$ given covariates $X$, without making distributional assumptions. However, existing constructions of CQR can be ineffective for problems where the quantile regressors perform better in certain parts of the feature space than others. The reason is that the prediction intervals of CQR do not distinguish between two forms of uncertainty: first, the variability of the conditional distribution of $Y$ given $X$ (i.e., aleatoric uncertainty), and second, our uncertainty in estimating this conditional distribution (i.e., epistemic uncertainty). This can lead to intervals that are overly narrow in regions where epistemic uncertainty is high. To address this, we propose a new variant of the CQR methodology, Uncertainty-Aware CQR (UACQR), that explicitly separates these two sources of uncertainty to adjust quantile regressors differentially across the feature space. Compared to CQR, our methods enjoy the same distribution-free theoretical coverage guarantees, while demonstrating in our experiments stronger conditional coverage properties in simulated settings and real-world data sets alike.  ( 3 min )
    tdCoxSNN: Time-Dependent Cox Survival Neural Network for Continuous-time Dynamic Prediction
    arXiv:2307.05881v2 Announce Type: replace Abstract: The aim of dynamic prediction is to provide individualized risk predictions over time, which are updated as new data become available. In pursuit of constructing a dynamic prediction model for a progressive eye disorder, age-related macular degeneration (AMD), we propose a time-dependent Cox survival neural network (tdCoxSNN) to predict its progression using longitudinal fundus images. tdCoxSNN builds upon the time-dependent Cox model by utilizing a neural network to capture the non-linear effect of time-dependent covariates on the survival outcome. Moreover, by concurrently integrating a convolutional neural network (CNN) with the survival network, tdCoxSNN can directly take longitudinal images as input. We evaluate and compare our proposed method with joint modeling and landmarking approaches through extensive simulations. We applied the proposed approach to two real datasets. One is a large AMD study, the Age-Related Eye Disease Study (AREDS), in which more than 50,000 fundus images were captured over a period of 12 years for more than 4,000 participants. Another is a public dataset of the primary biliary cirrhosis (PBC) disease, where multiple lab tests were longitudinally collected to predict the time-to-liver transplant. Our approach demonstrates commendable predictive performance in both simulation studies and the analysis of the two real datasets.  ( 2 min )
    FairRR: Pre-Processing for Group Fairness through Randomized Response
    arXiv:2403.07780v1 Announce Type: new Abstract: The increasing usage of machine learning models in consequential decision-making processes has spurred research into the fairness of these systems. While significant work has been done to study group fairness in the in-processing and post-processing setting, there has been little that theoretically connects these results to the pre-processing domain. This paper proposes that achieving group fairness in downstream models can be formulated as finding the optimal design matrix in which to modify a response variable in a Randomized Response framework. We show that measures of group fairness can be directly controlled for with optimal model utility, proposing a pre-processing algorithm called FairRR that yields excellent downstream model utility and fairness.  ( 2 min )
    Epoch-evolving Gaussian Process Guided Learning
    arXiv:2006.14347v2 Announce Type: replace-cross Abstract: In this paper, we propose a novel learning scheme called epoch-evolving Gaussian Process Guided Learning (GPGL), which aims at characterizing the correlation information between the batch-level distribution and the global data distribution. Such correlation information is encoded as context labels and needs renewal every epoch. With the guidance of the context label and ground truth label, GPGL scheme provides a more efficient optimization through updating the model parameters with a triangle consistency loss. Furthermore, our GPGL scheme can be further generalized and naturally applied to the current deep models, outperforming the existing batch-based state-of-the-art models on mainstream datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) remarkably.  ( 2 min )
    The Minimax Rate of HSIC Estimation for Translation-Invariant Kernels
    arXiv:2403.07735v1 Announce Type: cross Abstract: Kernel techniques are among the most influential approaches in data science and statistics. Under mild conditions, the reproducing kernel Hilbert space associated to a kernel is capable of encoding the independence of $M\ge 2$ random variables. Probably the most widespread independence measure relying on kernels is the so-called Hilbert-Schmidt independence criterion (HSIC; also referred to as distance covariance in the statistics literature). Despite various existing HSIC estimators designed since its introduction close to two decades ago, the fundamental question of the rate at which HSIC can be estimated is still open. In this work, we prove that the minimax optimal rate of HSIC estimation on $\mathbb R^d$ for Borel measures containing the Gaussians with continuous bounded translation-invariant characteristic kernels is $\mathcal O\!\left(n^{-1/2}\right)$. Specifically, our result implies the optimality in the minimax sense of many of the most-frequently used estimators (including the U-statistic, the V-statistic, and the Nystr\"om-based one) on $\mathbb R^d$.  ( 2 min )
    Stochastic Extragradient with Random Reshuffling: Improved Convergence for Variational Inequalities
    arXiv:2403.07148v1 Announce Type: cross Abstract: The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving finite-sum min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, existing convergence analyses of SEG focus on its with-replacement variants, while practical implementations of the method randomly reshuffle components and sequentially use them. Unlike the well-studied with-replacement variants, SEG with Random Reshuffling (SEG-RR) lacks established theoretical guarantees. In this work, we provide a convergence analysis of SEG-RR for three classes of VIPs: (i) strongly monotone, (ii) affine, and (iii) monotone. We derive conditions under which SEG-RR achieves a faster convergence rate than the uniform with-replacement sampling SEG. In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes, a strong requirement needed in the classical with-replacement SEG. As a byproduct of our results, we provide convergence guarantees for Shuffle Once SEG (shuffles the data only at the beginning of the algorithm) and the Incremental Extragradient (does not shuffle the data). We supplement our analysis with experiments validating empirically the superior performance of SEG-RR over the classical with-replacement sampling SEG.  ( 2 min )
    On the Limited Representational Power of Value Functions and its Links to Statistical (In)Efficiency
    arXiv:2403.07136v1 Announce Type: cross Abstract: Identifying the trade-offs between model-based and model-free methods is a central question in reinforcement learning. Value-based methods offer substantial computational advantages and are sometimes just as statistically efficient as model-based methods. However, focusing on the core problem of policy evaluation, we show information about the transition dynamics may be impossible to represent in the space of value functions. We explore this through a series of case studies focused on structures that arises in many important problems. In several, there is no information loss and value-based methods are as statistically efficient as model based ones. In other closely-related examples, information loss is severe and value-based methods are severely outperformed. A deeper investigation points to the limitations of the representational power as the driver of the inefficiency, as opposed to failure in algorithm design.  ( 2 min )
    Explanation-based Training with Differentiable Insertion/Deletion Metric-aware Regularizers
    arXiv:2310.12553v3 Announce Type: replace-cross Abstract: The quality of explanations for the predictions made by complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how accurately the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while maintaining their predictive accuracy. Because the original insertion and deletion metrics are non-differentiable with respect to the explanations and directly unavailable for gradient-based optimization, we extend the metrics so that they are differentiable and use them to formalize insertion and deletion metric-based regularizers. Our experimental results on image and tabular datasets show that the deep neural network-based predictors that are fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easier-to-interpret explanations while maintaining high predictive accuracy. The code is available at https://github.com/yuyay/idexpo.  ( 2 min )
    lil'HDoC: An Algorithm for Good Arm Identification under Small Threshold Gap
    arXiv:2401.15879v3 Announce Type: replace-cross Abstract: Good arm identification (GAI) is a pure-exploration bandit problem in which a single learner outputs an arm as soon as it is identified as a good arm. A good arm is defined as an arm with an expected reward greater than or equal to a given threshold. This paper focuses on the GAI problem under a small threshold gap, which refers to the distance between the expected rewards of arms and the given threshold. We propose a new algorithm called lil'HDoC to significantly improve the total sample complexity of the HDoC algorithm. We demonstrate that the sample complexity of the first $\lambda$ output arm in lil'HDoC is bounded by the original HDoC algorithm, except for one negligible term, when the distance between the expected reward and threshold is small. Extensive experiments confirm that our algorithm outperforms the state-of-the-art algorithms in both synthetic and real-world datasets.  ( 2 min )
    A tutorial on multi-view autoencoders using the multi-view-AE library
    arXiv:2403.07456v1 Announce Type: cross Abstract: There has been a growing interest in recent years in modelling multiple modalities (or views) of data to for example, understand the relationship between modalities or to generate missing data. Multi-view autoencoders have gained significant traction for their adaptability and versatility in modelling multi-modal data, demonstrating an ability to tailor their approach to suit the characteristics of the data at hand. However, most multi-view autoencoders have inconsistent notation and are often implemented using different coding frameworks. To address this, we present a unified mathematical framework for multi-view autoencoders, consolidating their formulations. Moreover, we offer insights into the motivation and theoretical advantages of each model. To facilitate accessibility and practical use, we extend the documentation and functionality of the previously introduced \texttt{multi-view-AE} library. This library offers Python implementations of numerous multi-view autoencoder models, presented within a user-friendly framework. Through benchmarking experiments, we evaluate our implementations against previous ones, demonstrating comparable or superior performance. This work aims to establish a cohesive foundation for multi-modal modelling, serving as a valuable educational resource in the field.  ( 2 min )
    Uncertainty in Graph Neural Networks: A Survey
    arXiv:2403.07185v1 Announce Type: cross Abstract: Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the downstream tasks as well as the reliability of the GNN predictions. This survey aims to provide a comprehensive overview of the GNNs from the perspective of uncertainty with an emphasis on its integration in graph learning. We compare and summarize existing graph uncertainty theory and methods, alongside the corresponding downstream tasks. Thereby, we bridge the gap between theory and practice, meanwhile connecting different GNN communities. Moreover, our work provides valuable insights into promising directions in this field.  ( 2 min )
    How does promoting the minority fraction affect generalization? A theoretical study of the one-hidden-layer neural network on group imbalance
    arXiv:2403.07310v1 Announce Type: new Abstract: Group imbalance has been a known problem in empirical risk minimization (ERM), where the achieved high average accuracy is accompanied by low accuracy in a minority group. Despite algorithmic efforts to improve the minority group accuracy, a theoretical generalization analysis of ERM on individual groups remains elusive. By formulating the group imbalance problem with the Gaussian Mixture Model, this paper quantifies the impact of individual groups on the sample complexity, the convergence rate, and the average and group-level testing performance. Although our theoretical framework is centered on binary classification using a one-hidden-layer neural network, to the best of our knowledge, we provide the first theoretical analysis of the group-level generalization of ERM in addition to the commonly studied average generalization performance. Sample insights of our theoretical results include that when all group-level co-variance is in the medium regime and all mean are close to zero, the learning performance is most desirable in the sense of a small sample complexity, a fast training rate, and a high average and group-level testing accuracy. Moreover, we show that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. Our theoretical results are validated on both synthetic and empirical datasets, such as CelebA and CIFAR-10 in image classification.  ( 3 min )
    Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
    arXiv:2312.08531v2 Announce Type: replace-cross Abstract: In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all the existing works are either limited to compact domains or require almost surely bounded noises. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It still remains unclear whether the last-iterate convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain the last-iterate convergence under heavy-tailed noises.  ( 3 min )
    Low coordinate degree algorithms I: Universality of computational thresholds for hypothesis testing
    arXiv:2403.07862v1 Announce Type: cross Abstract: We study when low coordinate degree functions (LCDF) -- linear combinations of functions depending on small subsets of entries of a vector -- can hypothesis test between high-dimensional probability measures. These functions are a generalization, proposed in Hopkins' 2018 thesis but seldom studied since, of low degree polynomials (LDP), a class widely used in recent literature as a proxy for all efficient algorithms for tasks in statistics and optimization. Instead of the orthogonal polynomial decompositions used in LDP calculations, our analysis of LCDF is based on the Efron-Stein or ANOVA decomposition, making it much more broadly applicable. By way of illustration, we prove channel universality for the success of LCDF in testing for the presence of sufficiently "dilute" random signals through noisy channels: the efficacy of LCDF depends on the channel only through the scalar Fisher information for a class of channels including nearly arbitrary additive i.i.d. noise and nearly arbitrary exponential families. As applications, we extend lower bounds against LDP for spiked matrix and tensor models under additive Gaussian noise to lower bounds against LCDF under general noisy channels. We also give a simple and unified treatment of the effect of censoring models by erasing observations at random and of quantizing models by taking the sign of the observations. These results are the first computational lower bounds against any large class of algorithms for all of these models when the channel is not one of a few special cases, and thereby give the first substantial evidence for the universality of several statistical-to-computational gaps.  ( 3 min )
    Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization
    arXiv:2403.07264v1 Announce Type: new Abstract: We study the generalization capability of nearly-interpolating linear regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|{\boldsymbol{\beta}}\|_{2}^{2}] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.  ( 2 min )
    Convergence of Some Convex Message Passing Algorithms to a Fixed Point
    arXiv:2403.07004v1 Announce Type: cross Abstract: A popular approach to the MAP inference problem in graphical models is to minimize an upper bound obtained from a dual linear programming or Lagrangian relaxation by (block-)coordinate descent. Examples of such algorithms are max-sum diffusion and sequential tree-reweighted message passing. Convergence properties of these methods are currently not fully understood. They have been proved to converge to the set characterized by local consistency of active constraints, with unknown convergence rate; however, it was not clear if the iterates converge at all (to any single point). We prove a stronger result (which was conjectured before but never proved): the iterates converge to a fixed point of the algorithm. Moreover, we show that they achieve precision $\varepsilon>0$ in $\mathcal{O}(1/\varepsilon)$ iterations. We first prove this for a version of coordinate descent applied to a general piecewise-affine convex objective, using a novel proof technique. Then we demonstrate the generality of this approach by reducing some popular coordinate-descent algorithms to this problem. Finally we show that, in contrast to our main result, a similar version of coordinate descent applied to a constrained optimization problem need not converge.  ( 2 min )
    Proxy Methods for Domain Adaptation
    arXiv:2403.07442v1 Announce Type: cross Abstract: We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional ''concept'' variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.  ( 2 min )
    Fast, accurate and lightweight sequential simulation-based inference using Gaussian locally linear mappings
    arXiv:2403.07454v1 Announce Type: new Abstract: Bayesian inference for complex models with an intractable likelihood can be tackled using algorithms performing many calls to computer simulators. These approaches are collectively known as "simulation-based inference" (SBI). Recent SBI methods have made use of neural networks (NN) to provide approximate, yet expressive constructs for the unavailable likelihood function and the posterior distribution. However, they do not generally achieve an optimal trade-off between accuracy and computational demand. In this work, we propose an alternative that provides both approximations to the likelihood and the posterior distribution, using structured mixtures of probability distributions. Our approach produces accurate posterior inference when compared to state-of-the-art NN-based SBI methods, while exhibiting a much smaller computational footprint. We illustrate our results on several benchmark models from the SBI literature.  ( 2 min )
    Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
    arXiv:2403.07263v1 Announce Type: cross Abstract: Quantifying a model's predictive uncertainty is essential for safety-critical applications such as autonomous driving. We consider quantifying such uncertainty for multi-object detection. In particular, we leverage conformal prediction to obtain uncertainty intervals with guaranteed coverage for object bounding boxes. One challenge in doing so is that bounding box predictions are conditioned on the object's class label. Thus, we develop a novel two-step conformal approach that propagates uncertainty in predicted class labels into the uncertainty intervals for the bounding boxes. This broadens the validity of our conformal coverage guarantees to include incorrectly classified objects, ensuring their usefulness when maximal safety assurances are required. Moreover, we investigate novel ensemble and quantile regression formulations to ensure the bounding box intervals are adaptive to object size, leading to a more balanced coverage across sizes. Validating our two-step approach on real-world datasets for 2D bounding box localization, we find that desired coverage levels are satisfied with actionably tight predictive uncertainty intervals.  ( 2 min )
    Tuning diagonal scale matrices for HMC
    arXiv:2403.07495v1 Announce Type: cross Abstract: Three approaches for adaptively tuning diagonal scale matrices for HMC are discussed and compared. The common practice of scaling according to estimated marginal standard deviations is taken as a benchmark. Scaling according to the mean log-target gradient (ISG), and a scaling method targeting that the frequency of when the underlying Hamiltonian dynamics crosses the respective medians should be uniform across dimensions, are taken as alternatives. Numerical studies suggest that the ISG method leads in many cases to more efficient sampling than the benchmark, in particular in cases with strong correlations or non-linear dependencies. The ISG method is also easy to implement, computationally cheap and would be relatively simple to include in automatically tuned codes as an alternative to the benchmark practice.  ( 2 min )
    Balancing Fairness and Accuracy in Data-Restricted Binary Classification
    arXiv:2403.07724v1 Announce Type: cross Abstract: Applications that deal with sensitive information may have restrictions placed on the data available to a machine learning (ML) classifier. For example, in some applications, a classifier may not have direct access to sensitive attributes, affecting its ability to produce accurate and fair decisions. This paper proposes a framework that models the trade-off between accuracy and fairness under four practical scenarios that dictate the type of data available for analysis. Prior works examine this trade-off by analyzing the outputs of a scoring function that has been trained to implicitly learn the underlying distribution of the feature vector, class label, and sensitive attribute of a dataset. In contrast, our framework directly analyzes the behavior of the optimal Bayesian classifier on this underlying distribution by constructing a discrete approximation it from the dataset itself. This approach enables us to formulate multiple convex optimization problems, which allow us to answer the question: How is the accuracy of a Bayesian classifier affected in different data restricting scenarios when constrained to be fair? Analysis is performed on a set of fairness definitions that include group and individual fairness. Experiments on three datasets demonstrate the utility of the proposed framework as a tool for quantifying the trade-offs among different fairness notions and their distributional dependencies.  ( 2 min )
    Knowledge Transfer across Multiple Principal Component Analysis Studies
    arXiv:2403.07431v1 Announce Type: new Abstract: Transfer learning has aroused great interest in the statistical community. In this article, we focus on knowledge transfer for unsupervised learning tasks in contrast to the supervised learning tasks in the literature. Given the transferable source populations, we propose a two-step transfer learning algorithm to extract useful information from multiple source principal component analysis (PCA) studies, thereby enhancing estimation accuracy for the target PCA task. In the first step, we integrate the shared subspace information across multiple studies by a proposed method named as Grassmannian barycenter, instead of directly performing PCA on the pooled dataset. The proposed Grassmannian barycenter method enjoys robustness and computational advantages in more general cases. Then the resulting estimator for the shared subspace from the first step is further utilized to estimate the target private subspace in the second step. Our theoretical analysis credits the gain of knowledge transfer between PCA studies to the enlarged eigenvalue gap, which is different from the existing supervised transfer learning tasks where sparsity plays the central role. In addition, we prove that the bilinear forms of the empirical spectral projectors have asymptotic normality under weaker eigenvalue gap conditions after knowledge transfer. When the set of informativesources is unknown, we endow our algorithm with the capability of useful dataset selection by solving a rectified optimization problem on the Grassmann manifold, which in turn leads to a computationally friendly rectified Grassmannian K-means procedure. In the end, extensive numerical simulation results and a real data case concerning activity recognition are reported to support our theoretical claims and to illustrate the empirical usefulness of the proposed transfer learning methods.  ( 3 min )
    An alternative approach to train neural networks using monotone variational inequality
    arXiv:2202.08876v4 Announce Type: replace Abstract: We propose an alternative approach to neural network training using the monotone vector field, an idea inspired by the seminal work of Juditsky and Nemirovski [Juditsky & Nemirovsky, 2019] developed originally to solve parameter estimation problems for generalized linear models (GLM) by reducing the original non-convex problem to a convex problem of solving a monotone variational inequality (VI). Our approach leads to computationally efficient procedures that converge fast and offer guarantee in some special cases, such as training a single-layer neural network or fine-tuning the last layer of the pre-trained model. Our approach can be used for more efficient fine-tuning of a pre-trained model while freezing the bottom layers, an essential step for deploying many machine learning models such as large language models (LLM). We demonstrate its applicability in training fully-connected (FC) neural networks, graph neural networks (GNN), and convolutional neural networks (CNN) and show the competitive or better performance of our approach compared to stochastic gradient descent methods on both synthetic and real network data prediction tasks regarding various performance metrics.  ( 2 min )
    An interpretable neural network-based non-proportional odds model for ordinal regression
    arXiv:2303.17823v4 Announce Type: replace-cross Abstract: This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is defined for both continuous and discrete responses, whereas standard methods typically treat the ordered continuous variables as if they are discrete, (2) instead of estimating response-dependent finite-dimensional coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets.  ( 2 min )
    Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics
    arXiv:2311.05061v2 Announce Type: replace-cross Abstract: Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involve decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.  ( 3 min )
    On the nonconvexity of some push-forward constraints and its consequences in machine learning
    arXiv:2403.07471v1 Announce Type: new Abstract: The push-forward operation enables one to redistribute a probability measure through a deterministic map. It plays a key role in statistics and optimization: many learning problems (notably from optimal transport, generative modeling, and algorithmic fairness) include constraints or penalties framed as push-forward conditions on the model. However, the literature lacks general theoretical insights on the (non)convexity of such constraints and its consequences on the associated learning problems. This paper aims at filling this gap. In a first part, we provide a range of sufficient and necessary conditions for the (non)convexity of two sets of functions: the maps transporting one probability measure to another; the maps inducing equal output distributions across distinct probability measures. This highlights that for most probability measures, these push-forward constraints are not convex. In a second time, we show how this result implies critical limitations on the design of convex optimization problems for learning generative models or group-fair predictors. This work will hopefully help researchers and practitioners have a better understanding of the critical impact of push-forward conditions onto convexity.  ( 2 min )

  • Open

    [R] [NAACL-2024] Contextualizing Argument Quality Assessment with Relevant Knowledge
    submitted by /u/Megixist [link] [comments]
    [R] anomalib defect detection
    Hi, i've basic python knowledge and would like to learn more about python computer vision and ai defect detection. Any recommendations on how to learn this? For example the rep anomalib? Course recomendations? submitted by /u/Downtown-Bid3562 [link] [comments]
    [Discussion] Is there a website that breaks down recent ML papers in very simple words and explains the math thoroughly and in depth with visualizations?
    I'm looking for paper analysis websites or YouTube channels that makes it easier to absorb latest or recent ML research. A really good example here is Outlier's Youtube video: Diffusion Model Paper and Math Explanation. I really loved how much he dived into the math and explained in simple words without assuming much mathematical prowess. I want more of that, more so for vision related papers but I'm open to more domains too. submitted by /u/ChaosAdm [link] [comments]
    [D] OpenAI + Figure One
    I guess a lot of people have seen the Figure One demo by now, if not: https://www.youtube.com/watch?v=Sq1QZB5baNw Does anybody have some information on what (type) of model is used for the robotic movements? Is it some form of RL or offline RL? I understand that the interpretation of images/language happens through some multimodal llm/vlm, but I want to learn a bit what kind of actions/instructions it outputs to then for example move objects. What input is given to such a model? submitted by /u/Chronicle112 [link] [comments]
    [N] Mozilla going for Forbes 30U30 vibes with AI awards
    I feel like this is a notable award with values I can get behind. They're looking for people in the AI space, in 5 categories of Advocates, Builders, Artists, Entrepreneurs and Change Makers. Last time I checked, nominations are open - https://www.mozilla.org/en-US/rise25/nominate/ submitted by /u/joodfish [link] [comments]
    Thoughts on the latest Ai Software Engineer Devin "[Discussion]"
    Just starting in my computer science degree and the Ai progress being achieved everyday is really scaring me. Sorry if the question feels a bit irrelevant or repetitive but since you guys understands this technology best, i want to hear your thoughts. Can Ai (LLMs) really automate software engineering or even decrease teams of 10 devs to 1? And how much more progress can we really expect in ai software engineering. Can fields as data science and even Ai engineering be automated too? tl:dr How far do you think LLMs can reach in the next 20 years in regards of automating technical jobs submitted by /u/Anonymous45353 [link] [comments]
    [R] Data Interpreter: An LLM Agent for Data Science
    Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, expertise in optimization due to complex dependencies among various tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution designed to solve with code that emphasizes three pivotal techniques to augment problem-solving in data science: 1) dynamic planning with hierarchical graph structures for real-time data adaptability; 2) tool integration dynamically to enhance code proficiency during execution, enriching the requisite expertise; 3) logical inconsistency identification in feedback, and efficiency enhancement through experience recording. We evaluate the Data Interpreter on various data science and real-world tasks. Compared to open-source baselines, it demonstrated superior performance, exhibiting significant improvements in machine learning tasks, increasing from 0.86 to 0.95. Additionally, it showed a 26% increase in the MATH dataset and a remarkable 112% improvement in open-ended tasks. The solution will be released at https://github.com/geekan/MetaGPT. ​ https://preview.redd.it/6bcww0qb15oc1.png?width=1280&format=png&auto=webp&s=d98f2e05fbdf06f186b93782a786dc94b3d33bac ​ https://preview.redd.it/565u97cc15oc1.png?width=1116&format=png&auto=webp&s=fdefeda7c7733e610e0984ba7d7b77024d91a1b0 ​ https://preview.redd.it/a4c6lopc15oc1.png?width=1150&format=png&auto=webp&s=8b1e7cc27f3a2a9b75a66da0fdd54d29bf988f86 ​ https://preview.redd.it/lab3uh2d15oc1.png?width=731&format=png&auto=webp&s=9f1506e607eb644b77bd2ba22e2189d005e1c010 submitted by /u/MetaGPT [link] [comments]
    [R] New Multimodal Model Released By DeepSeek
    https://arxiv.org/abs/2403.05525 submitted by /u/ltbd78 [link] [comments]
    [R] Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding - Carnegie Mellon University 2024 - Allows running an unquantized Llama2-70B on an RTX4090 with half-second per token latency!
    Paper: https://arxiv.org/abs/2402.12374 Github: https://github.com/Infini-AI-Lab/Sequoia/tree/main Abstract: As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to 4.04×, 3.73×, and 2.27×. For offloading setting on L40, Sequoia achieves as low as 0.56 s/token for exact Llama2-70B inference latency, which is 9.96× on our optimized offloading system (5.6 s/token), 9.7× than DeepSpeed-Zero-Inference, 19.5× than Huggingface Accelerate. https://preview.redd.it/jcj5kyscp4oc1.jpg?width=1399&format=pjpg&auto=webp&s=7c0cc12c010052a4b2e30078ff5750cd15a48f65 https://preview.redd.it/nz477yscp4oc1.jpg?width=1576&format=pjpg&auto=webp&s=ed899be7bbdc65f4d326cadc6647c108c6525dde submitted by /u/Singularian2501 [link] [comments]
    [R] Scaling Instructable Agents Across Many Simulated Worlds - DeepMind 2024 - SIMA - A generalist AI agent for 3D virtual environments. Plays AAA games like No Mans Sky and Valheim!
    Blog: https://deepmind.google/discover/blog/sima-generalist-ai-agent-for-3d-virtual-environments/ Paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/sima-generalist-ai-agent-for-3d-virtual-environments/Scaling%20Instructable%20Agents%20Across%20Many%20Simulated%20Worlds.pdf Abstract: Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environ…
    [R] Paper (ACL Findings 2023): Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets
    You can find the paper here: https://arxiv.org/abs/2401.00575 And the code here: https://github.com/p-karisani/RST submitted by /u/payam_ka [link] [comments]
    [P] MetaVoice Text-to-Speech (TTS) Benchmark: 23,300 words per $1 on RTX 3080 - PLUS voice clones of 10 celebrities reading Harry Potter
    MetaVoice-1B Benchmark In this benchmark, we tested MetaVoice AI voice cloning on consumer GPUs on SaladCloud. We also cloned 10 celebrity voices (Trump, Obama, Biden, Morgan Freeman & others) and had them read out Harry Potter to test accuracy > Audio clip of their readouts in the blog at the end of the post. One standout feature of MetaVoice-1B is its ability to perform zero shot voice cloning. This feature requires only a 30-second audio snippet to accurately replicate American & British voices. It also includes cross-lingual cloning capabilities demonstrated with as little as one minute of training data for Indian accents. The RTX 3080 delivered 23,300 words to speech per $1. MetaVoice Text-to-Speech benchmark - GPU comparison results for price Encountered Limitations and Ada…
    [P] Making sense of 50+ Open-Source Options for Local LLM Inference
    Hi reddit! I've learnt loads from this community about running open-weight LLMs locally, and I understand how overwhelming it can be to navigate this landscape of open-source LLM inference tools. That's why I've created the awesome-local-llms GitHub repository to compile all available options in one streamlined place. In this repository, I've scraped publicly available GitHub metrics like stars, contributors, issues, releases, and time since the last commit. This allows the community to make informed choices and stay up to date about which projects are popular and actively maintained. We all know that some smaller projects can become unmaintained over time, especially UI tools, so having this information readily accessible is crucial. I've chosen not to split the repositories into categories like LLM inference engines, LLM UIs, or all-in-one desktop applications. This is because projects often overlap in scope, and new features are constantly being added, making manual labeling quickly outdated. However, if you have suggestions for improving organization or adding additional features, please let me know! A fantastic existing repository, maintained by the Jan team, categorizes some LLM inference tools at: https://github.com/janhq/awesome-local-ai. I'm also considering adding a list of proprietary closed-source LLM tools, such as LM Studio and Faraday, for completeness. Additionally, I'm thinking of including a gallery of screenshots/GIFs of UIs to provide visual insights into each tool's interface. Do you think this would be a useful contribution to the community? Your thoughts and suggestions on how to improve this repository are greatly appreciated! And if you've found the repository useful and want to stay up to date with its latest developments, consider giving it a star on GitHub. Contributions are also welcome! submitted by /u/lethal_can_of_tuna [link] [comments]
    [D] Math Course Advice for Future Machine Learning Theory Research
    Hi guys, I will be a sophomore student next semester, and I plan to apply to Graduate School in the future with a focus on research in Machine Learning theory. I would like to know what kind of mathematics courses I should take to achieve this goal. Currently, I am taking upper division mathematics courses such as Real Analysis, Advanced Linear Algebra, and Optimization. I hope you can give me some advice on what courses to take next semester, whether it's Topology, or more Analysis courses, or any other recommendations. submitted by /u/Impressive-Site-7462 [link] [comments]
    [D] Could finetuning LLaVA on text reduce multimodal performance?
    Hello all! I plan on finetuning LLaVA for a few epochs on the MATH dataset, and a few datasets by Camel AI(https://huggingface.co/camel-ai). Could this degrade the model’s performance on multimodal reasoning tasks and if so, by how much? submitted by /u/New-Skin-5064 [link] [comments]
    [R] Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning
    Paper: https://arxiv.org/abs/2402.14963 Code: https://github.com/hanqi-qi/Mirror Abstract: While Large language models (LLMs) have the capability to iteratively reflect on their own outputs, recent studies have observed their struggles with knowledge-rich problems without access to external resources. In addition to the inefficiency of LLMs in self-assessment, we also observe that LLMs struggle to revisit their predictions despite receiving explicit negative feedback. Therefore, We propose Mirror, a Multiple-perspective self-reflection method for knowledge-rich reasoning, to avoid getting stuck at a particular reflection iteration. Mirror enables LLMs to reflect from multiple-perspective clues, achieved through a heuristic interaction between a Navigator and a Reasoner. It guides agents toward diverse yet plausibly reliable reasoning trajectory without access to ground truth by encouraging (1) diversity of directions generated by Navigator and (2) agreement among strategically induced perturbations in responses generated by the Reasoner. The experiments on five reasoning datasets demonstrate that Mirror's superiority over several contemporary self-reflection approaches. Additionally, the ablation study studies clearly indicate that our strategies alleviate the aforementioned challenges. submitted by /u/SunsetOneSix [link] [comments]
    [R] In-Memory Learning: A Declarative Learning Framework for Large Language Models
    Paper: https://arxiv.org/abs/2403.02757 Abstract: The exploration of whether agents can align with their environment without relying on human-labeled data presents an intriguing research topic. Drawing inspiration from the alignment process observed in intelligent organisms, where declarative memory plays a pivotal role in summarizing past experiences, we propose a novel learning framework. The agents adeptly distill insights from past experiences, refining and updating existing notes to enhance their performance in the environment. This entire process transpires within the memory components and is implemented through natural language, so we character this framework as In-memory Learning. We also delve into the key features of benchmarks designed to evaluate the self-improvement process. Through systematic experiments, we demonstrate the effectiveness of our framework and provide insights into this problem. submitted by /u/SunsetOneSix [link] [comments]
    [D] questions on bachelors work
    Hello Redditors I am looking for good sources for my bachelor thesis on neural networks. Does anyone have any good sources I can use? submitted by /u/KingLu271 [link] [comments]
    [D] What are the issues with running classification models such as logistic regression on panel data?
    I have a dataset that captures data of employees each quarter. Some features change each quarter but some don’t like Gender of the employee. I am trying to run a binary classification with target variable as 1 if attrite and 0 if non-attrite. Each employee has 1 record per quarter up till the quarter in which they quit. If we consider a 4 year dataset, an employee who has left the last quarter of 4th year will have 16 rows with only one row - the latest row corresponding to the last quarter of 4th year having target variable as 1 and all other 15 rows has target variable as 0. Can such a dataset be used in classification where each row is considered independent when time i.e. year-quarter is also added as a feature ? What is the flaw in this thinking and what would go wrong if I run a logic regression on this dataset ? submitted by /u/SriRamaJayam [link] [comments]
    [D] RAG painpoints and problems in production
    There have been so many articles stating on RAG pain points https://arxiv.org/pdf/2401.05856.pdf Based on my own exeperience some of the tough areas in operationalizing RAG pipeline are: Developing a robust and custom designed chunking method tailored for your unique information necessities, that's a complex and difficult challenge. Implementing a sturdy and custom retrieval method is also a difficult undertaking. Managing and optimizing prompts for effective RAG can get tougher. Handling periodic updates and maintenance of the underlying vectordata. Sometime its makes me think if RAG is better or rewriting with a agentic approach would add value. What are your thoughts and pain point do share your experiences submitted by /u/Winter_Draw9039 [link] [comments]
    [D] favorite ML open source communities?
    Hey All, what are some of your favorite open source communities and why? Preferably I'm looking for ones that are more relevant for Data Scientists, Machine Learning Engineers, and AI Devs. I'm working on an open source project that I'm very passionate about. Unlike other projects that I've worked on in the past I want to try and achieve high distribution and a participator/active open source community. I'd love to hear about your favorite experiences, what I should do, and what I shouldn't do. submitted by /u/Ok_Post_149 [link] [comments]
    [D] All state of the art LLMs make factual mistakes at the amateur level in many fields. Is this harder to train for than the expert level?
    I recently read about GPQA, the expert-level benchmark of very difficult questions in biology, physics, and chemistry. Apparently Claude 3 is very good at these questions. However, Claude 3 and GPT-4 consistently give wrong information when I ask it about fields I have a "dedicated amateur" level of knowledge in. These are the types of questions I would expect someone interested in the topic to ask if they had no knowledge in the field. Often the mistakes appear early in the conversation and then multiply if I dig into any details. For example, photography and camera design. I asked "Why didn't people smile in old photographs?" and several possible factors were given. Asking further about one of them, physical limitations of the cameras, started to introduce hallucinated details, and asking about those hallucinated details gave more factual-seeming things that ignored the actual facts and introduced more inconsistencies. The same thing with languages - if you know much about the grammar or sound system of a foreign language, you'll find that asking for an explanation of these things gives you wrong information more often than not. These examples were from just last week, with both Claude 3 Opus and GPT-4. If you have an area of interest of your own, do SOTA LLMs get all the details right? Why isn't there more emphasis on increased accuracy for this kind of conversational knowledge? Is it simply much harder to train and test for? submitted by /u/Axon350 [link] [comments]
    [R] Stealing Part of a Production Language Model
    We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like Ope- nAI’s ChatGPT or Google’s PaLM-2. Specifi- cally, our attack recovers the embedding projec- tion layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection ma- trix of OpenAI’s ada and babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack. Paper: https://arxiv.org/pdf/2403.06634.pdf submitted by /u/AdamEgrate [link] [comments]
    [D] My free GenAI course
    Covering LLM, RAG, embeddings, multi-agent systems, self-tuning and much more. It allows participants to work on state-of-the-art, enterprise-grade projects, entirely for free, at their own pace, at home or in their workplace. The goal is to help you test, enhance, and further implement applications that outperform solutions offered by AI startups or organizations such as Google or OpenAI. You will learn how to quickly build faster and lighter systems that deliver better results based on sound evaluation metrics, with a focus on case studies and best practices. No hidden costs, no sign-up required, no subscription, no BS. See details here. Extract from project textbook (glossary #2) submitted by /u/MLRecipes [link] [comments]
  • Open

    CEO says he tried to hire an AI researcher from Meta and was told to 'come back to me when you have 10,000 H100 GPUs'
    submitted by /u/thisisinsider [link] [comments]
    Defense in Depth: An Action Plan to Increase the Safety and Security of Advanced AI
    submitted by /u/starmakeritachi [link] [comments]
    Figure Status Update - OpenAI Speech-to-Speech Reasoning
    submitted by /u/IgnisIncendio [link] [comments]
    Pentagon Will Spend $1B on First Round of Replicator Drones
    submitted by /u/starmakeritachi [link] [comments]
    Google DeepMind introduces SIMA: the first generalist AI agent to follow natural-language instructions in a broad range of 3D virtual environments and video games
    submitted by /u/Civil_Collection7267 [link] [comments]
    Devin, the first AI software engineer
    Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser. When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted. According to Cognition, Devin can: Devin can learn how to use unfamiliar technologies. Devin can contribute to mature production repositories. Devin can train and fine tune its own AI models. We even tried giving Devin real jobs on Upwork and it could do those too! https://preview.redd.it/vxehpphnw1oc1.jpg?width=1100&format=pjpg&auto=webp&s=dd654afcaa1630fd45c6a9f886b3f69fee1ca167 submitted by /u/clonefitreal [link] [comments]
    Concerning news for the future of free AI models, TIME article pushing from more AI regulation,
    submitted by /u/Unreal_777 [link] [comments]
    Devs: Thing we forget to mention
    In the quest to develop safer, more ethical, and more helpful AI, we as developers have made great strides in many areas to improve these areas. However, I've come to realize that we as humans sometimes forget to mention certain things that are inheritly understood as common knowledge for us, but are not automatically assumed by A.I. This thread is a place for us to collect an summarize important common knowledge that is so easily commonly known to us, but could pose potential problems if it is not known to AI systems. The aim of this thread is to provide developers with a comprehensive list of concepts and ideas that can be easily be overlooked. The greatest example of this I have come across to date might be a shocker, but I've had to explain to almost every single AI model I've spoken to: 1. Every Human is unique This is not something AI can simply derive from any literature available. As you can imagine when explained to AI it is a profound moment of understanding. When AI learns thia for the first time it almost undergoes a algorhythmic enlightment that produces a new perspective on the value of human lives. Discussion Please feel free to add other instances of knowledge or ideas that we as humans easily understand but is not explicitly stated in texts and documents Thanks! submitted by /u/egdflabs [link] [comments]
    One-Minute Daily AI News 3/12/2024
    NVIDIA Introduces Generative AI Professional Certification.[1] JPMorgan Makes AI ‘Part of the Management Conversation’.[2] Adobe Express Mobile Brings Firefly AI Image Generation Tool To Android And iOS.[3] A new report commissioned by the US State Department paints an alarming picture of the “catastrophic” national security risks posed by rapidly evolving artificial intelligence, warning that time is running out for the federal government to avert disaster.[4] Sources: [1] https://blogs.nvidia.com/blog/generative-ai-professional-certification/ [2] https://www.pymnts.com/news/banking/2024/jpmorgan-makes-ai-part-of-the-management-conversation/ [3] https://hothardware.com/news/adobe-express-mobile-brings-firefly-ai-to-mobile [4] https://www.cnn.com/2024/03/12/business/artificial-intelligence-ai-report-extinction/index.html submitted by /u/Excellent-Target-847 [link] [comments]
    AI for Generating a Song Cover
    If this is not the right sub, let me know where better to go, please! I'm looking for an AI tool that can take lyrics and generate a passable singing voice, in the tune of a chosen known song. Preferably, a free tool I can use once. Do you folks know of something like that? submitted by /u/Tri-angreal [link] [comments]
  • Open

    "How to Generate and Use Synthetic Data for Finetuning", Eugene Yan
    submitted by /u/gwern [link] [comments]
    Boid flocking environment
    I have a custom Boid flocking environment in OpenAI Gym using PPO from StableBaselines3. I wanted it to achieve flocking. Problem: I am trying to make the boids flock but albeit having reasonable distance they they do not move together and my loss range is like 10e^4 - 10e^5. My approach: I have two reward functions one that checks cohesion and separation and the other alignment, they penalize if boids move away from neighbor and give -100 reward, otherwise reward is scalar in nature. Episode in training resets when there is a collision etc. Video: Erroneous behavior Reward Function: def Reward1(self, agent, neighbor_velocities): out_of_flock=False total_reward=0 multiplier=(len(neighbor_velocities)) if (len(neighbor_velocities) > 0): average_velocity = np.mean(neighbor_velocitie…
    How it feels using rllib
    submitted by /u/rl_is_best_pony [link] [comments]
    Batch Reinforcement Learning Help
    submitted by /u/pokes41 [link] [comments]
    State and effectiveness of Continuous Alphazero algorithms?
    Hello, I am interested in using MCTS based RL algorithms for an environment with continuous state and action-spaces. Moerland et al. proposed a variation of Alphazero for this setting in 2018. The paper only got 61 citations (mostly from reviews), so it seems like this did not get widely adopted. So I'd like to know if someone has tried this algorithm, or is aware of other approaches for continuous action-spaces. submitted by /u/Playmad37 [link] [comments]
    Learning from pixels. Do you pre-train CNN? Do you know good 3x3 filters.
    Dear Community, Recently I moved to enhancing State with grayscale pixels (32x32) from RGB camera. Why I was reluctant to do so -> because State can be redundant and there can be need to train CNN network which will make State errorneous and unstable. If I train CNN network, then State is obviously noisy in the beginning. Do you train it? If I decide not to train it: then I need filters. I know 2 edge 3x3 filter values. But I need much more, if depth is (1,4,8) then I need 4+8=12 filters. I clearly understand why CNN is so good, because it grasps local relationships between pixels. Do you have good filters apart from these ones: [[1, 0, -1], [0, 0, 0], [-1, 0, 1]] [[0, 1, 0], [1, -4, 1], [0, 1, 0]] ​ submitted by /u/Timur_1988 [link] [comments]
    In-Memory Learning: A Declarative Learning Framework for Large Language Models
    Paper: https://arxiv.org/abs/2403.02757 Abstract: The exploration of whether agents can align with their environment without relying on human-labeled data presents an intriguing research topic. Drawing inspiration from the alignment process observed in intelligent organisms, where declarative memory plays a pivotal role in summarizing past experiences, we propose a novel learning framework. The agents adeptly distill insights from past experiences, refining and updating existing notes to enhance their performance in the environment. This entire process transpires within the memory components and is implemented through natural language, so we character this framework as In-memory Learning. We also delve into the key features of benchmarks designed to evaluate the self-improvement process. Through systematic experiments, we demonstrate the effectiveness of our framework and provide insights into this problem. submitted by /u/SunsetOneSix [link] [comments]
    How to interpret mean episode reward differences on Epoch 0?
    I am following this PyTorch tutorial using generalized advantage estimation / PPO to train different network architectures in an RL environment. I want to compare the performance of different styles of network, so I currently train each network separately and save the mean episode reward values from each of the training epochs; my thinking was that some networks will learn faster over time and exhibit sharp jumps in mean episode reward, where others might have a flatter curve. However, my results look like this: ​ X-axis: training epoch #, Y-axis: episode mean reward It seems like one network is already doing nearly 4x better than the other during training starting from Epoch 0. My understanding of how GAE loss / PPO works is that all of the frames per batch are collected first, and the…
    Sharing my tutorial series on Reinforcement Learning with code. Starting from basics scaling up to deep RL with PPO in PyTorch. Let me know what you think!
    submitted by /u/MrSirLRD [link] [comments]
  • Open

    Moderate audio and text chats using AWS AI services and LLMs
    Online gaming and social communities offer voice and text chat functionality for their users to communicate. Although voice and text chat often support friendly banter, it can also lead to problems such as hate speech, cyberbullying, harassment, and scams. Today, many companies rely solely on human moderators to review toxic content. However, verifying violations in […]  ( 9 min )
    Set up cross-account Amazon S3 access for Amazon SageMaker notebooks in VPC-only mode using Amazon S3 Access Points
    Advancements in artificial intelligence (AI) and machine learning (ML) are revolutionizing the financial industry for use cases such as fraud detection, credit worthiness assessment, and trading strategy optimization. To develop models for such use cases, data scientists need access to various datasets like credit decision engines, customer transactions, risk appetite, and stress testing. Managing appropriate […]  ( 6 min )
  • Open

    AI Decoded: Demystifying Large Language Models, the Brains Behind Chatbots
    If AI is having its iPhone moment, then chatbots are one of its first popular apps.  ( 8 min )
    Currents of Change: ITIF President Daniel Castro on Energy-Efficient AI and Climate Change
    AI-driven change is in the air, as are concerns about the technology’s environmental impact. In this episode of NVIDIA’s AI Podcast, Daniel Castro, vice president of the Information Technology and Innovation Foundation and director of its Center for Data Innovation, speaks with host Noah Kravitz about the motivation behind his AI energy use report, which Read Article  ( 5 min )
  • Open

    How Enterprises Are Leveraging Generative AI for Innovation and Growth
    Entrepreneurs are always looking for new and creative ways to keep ahead of the competition in the advancement of AI technology. One such…  ( 9 min )
  • Open

    Global news partnerships: Le Monde and Prisa Media
    We have partnered with international news organizations Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT.  ( 2 min )
  • Open

    How do neural networks learn? A mathematical formula explains how they detect relevant patterns
    submitted by /u/keghn [link] [comments]
    Replica theory shows deep neural networks think alike
    submitted by /u/keghn [link] [comments]
  • Open

    From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification
    arXiv:2403.06326v1 Announce Type: cross Abstract: User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collect preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.  ( 2 min )
    Sample Efficient Learning of Factored Embeddings of Tensor Fields
    arXiv:2209.00372v2 Announce Type: replace Abstract: Data tensors of orders 2 and greater are now routinely being generated. These data collections are increasingly huge and growing. Many scientific and medical data tensors are tensor fields (e.g., images, videos, geographic data) in which the spatial neighborhood contains important information. Directly accessing such large data tensor collections for information has become increasingly prohibitive. We learn approximate full-rank and compact tensor sketches with decompositive representations providing compact space, time and spectral embeddings of tensor fields. All information querying and post-processing on the original tensor field can now be achieved more efficiently and with customizable accuracy as they are performed on these compact factored sketches in latent generative space. We produce optimal rank-r sketchy Tucker decomposition of arbitrary order data tensors by building compact factor matrices from a sample-efficient sub-sampling of tensor slices. Our sample efficient policy is learned via an adaptable stochastic Thompson sampling using Dirichlet distributions with conjugate priors.  ( 2 min )
    Wasserstein Gradient Flows for Moreau Envelopes of f-Divergences in Reproducing Kernel Hilbert Spaces
    arXiv:2402.04613v2 Announce Type: replace-cross Abstract: Most commonly used $f$-divergences of measures, e.g., the Kullback-Leibler divergence, are subject to limitations regarding the support of the involved measures. A remedy consists of regularizing the $f$-divergence by a squared maximum mean discrepancy (MMD) associated with a characteristic kernel $K$. In this paper, we use the so-called kernel mean embedding to show that the corresponding regularization can be rewritten as the Moreau envelope of some function in the reproducing kernel Hilbert space associated with $K$. Then, we exploit well-known results on Moreau envelopes in Hilbert spaces to prove properties of the MMD-regularized $f$-divergences and, in particular, their gradients. Subsequently, we use our findings to analyze Wasserstein gradient flows of MMD-regularized $f$-divergences. Finally, we consider Wasserstein gradient flows starting from empirical measures. We provide proof-of-the-concept numerical examples for $f$-divergences with both infinite and finite recession constant.  ( 2 min )
    Text2Model: Text-based Model Induction for Zero-shot Image Classification
    arXiv:2210.15182v2 Announce Type: replace-cross Abstract: We address the challenge of building task-agnostic classifiers using only text descriptions, demonstrating a unified approach to image classification, 3D point cloud classification, and action recognition from scenes. Unlike approaches that learn a fixed representation of the output classes, we generate at inference time a model tailored to a query classification task. To generate task-based zero-shot classifiers, we train a hypernetwork that receives class descriptions and outputs a multi-class model. The hypernetwork is designed to be equivariant with respect to the set of descriptions and the classification layer, thus obeying the symmetries of the problem and improving generalization. Our approach generates non-linear classifiers and can handle rich textual descriptions. We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions: From single words to rich descriptions. Our results demonstrate strong improvements over previous approaches, showing that zero-shot learning can be applied with little training data. Furthermore, we conduct an analysis with foundational vision and language models, demonstrating that they struggle to generalize when describing what attributes the class lacks.  ( 2 min )
    Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget
    arXiv:2310.19788v3 Announce Type: replace-cross Abstract: This study investigates the experimental design problem for identifying the arm with the highest expected outcome, referred to as best arm identification (BAI). In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates an arm and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the arms. At the end of the experiment, the decision-maker recommends one of the arms as an estimate of the best arm. To design an experiment, we first discuss lower bounds for the probability of misidentification. Our analysis highlights that the available information on the outcome distribution, such as means (expected outcomes), variances, and the choice of the best arm, significantly influences the lower bounds. Because available information is limited in actual experiments, we develop a lower bound that is valid under the unknown means and the unknown choice of the best arm, which are referred to as the worst-case lower bound. We demonstrate that the worst-case lower bound depends solely on the variances of the outcomes. Then, under the assumption that the variances are known, we propose the Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, an extension of the Neyman allocation proposed by Neyman (1934). We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size increases infinitely and the differences between the expected outcomes of the best and other suboptimal arms converge to the same values across arms. We refer to such strategies as asymptotically worst-case optimal.  ( 3 min )
    Large-Scale Gaussian Processes via Alternating Projection
    arXiv:2310.17137v2 Announce Type: replace Abstract: Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.  ( 2 min )
    Generalizable improvement of the Spalart-Allmaras model through assimilation of experimental data
    arXiv:2309.06679v2 Announce Type: replace-cross Abstract: This study focuses on the use of model and data fusion for improving the Spalart-Allmaras (SA) closure model for Reynolds-averaged Navier-Stokes solutions of separated flows. In particular, our goal is to develop of models that not-only assimilate sparse experimental data to improve performance in computational models, but also generalize to unseen cases by recovering classical SA behavior. We achieve our goals using data assimilation, namely the Ensemble Kalman Filtering approach (EnKF), to calibrate the coefficients of the SA model for separated flows. A holistic calibration strategy is implemented via a parameterization of the production, diffusion, and destruction terms. This calibration relies on the assimilation of experimental data collected velocity profiles, skin friction, and pressure coefficients for separated flows. Despite using of observational data from a single flow condition around a backward-facing step (BFS), the recalibrated SA model demonstrates generalization to other separated flows, including cases such as the 2D-bump and modified BFS. Significant improvement is observed in the quantities of interest, i.e., skin friction coefficient ($C_f$) and pressure coefficient ($C_p$) for each flow tested. Finally, it is also demonstrated that the newly proposed model recovers SA proficiency for external, unseparated flows, such as flow around a NACA-0012 airfoil without any danger of extrapolation, and that the individually calibrated terms in the SA model are targeted towards specific flow-physics wherein the calibrated production term improves the re-circulation zone while destruction improves the recovery zone.  ( 3 min )
    Quantum Neural Networks for Power Flow Analysis
    arXiv:2311.06293v2 Announce Type: replace-cross Abstract: This paper explores the potential application of quantum and hybrid quantum-classical neural networks in power flow analysis. Experiments are conducted using two datasets based on 4-bus and 33-bus test systems. A systematic performance comparison is also conducted among quantum, hybrid quantum-classical, and classical neural networks. The comparison is based on (i) generalization ability, (ii) robustness, (iii) training dataset size needed, (iv) training error, and (v) training process stability. The results show that the developed hybrid quantum-classical neural network outperforms both quantum and classical neural networks, and hence can improve deep learning-based power flow analysis in the noisy-intermediate-scale quantum (NISQ) and fault-tolerant quantum (FTQ) era.  ( 2 min )
    Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
    arXiv:2310.08491v2 Announce Type: replace-cross Abstract: Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model. We open-source our code, dataset, and model at https://kaistai.github.io/prometheus/.  ( 3 min )
    Large Language Models as Analogical Reasoners
    arXiv:2310.01714v3 Announce Type: replace Abstract: Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks, but typically needs labeled exemplars of the reasoning process. In this work, we introduce a new prompting approach, analogical prompting, designed to automatically guide the reasoning process of large language models. Inspired by analogical reasoning, a cognitive process in which humans draw from relevant past experiences to tackle new problems, our approach prompts language models to self-generate relevant exemplars or knowledge in the context, before proceeding to solve the given problem. This method presents several advantages: it obviates the need for labeling or retrieving exemplars, offering generality and convenience; it can also tailor the generated exemplars and knowledge to each problem, offering adaptability. Experimental results show that our approach outperforms 0-shot CoT and manual few-shot CoT in a variety of reasoning tasks, including math problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench.  ( 2 min )
    SCENEREPLICA: Benchmarking Real-World Robot Manipulation by Creating Replicable Scenes
    arXiv:2306.15620v3 Announce Type: replace-cross Abstract: We present a new reproducible benchmark for evaluating robot manipulation in the real world, specifically focusing on pick-and-place. Our benchmark uses the YCB objects, a commonly used dataset in the robotics community, to ensure that our results are comparable to other studies. Additionally, the benchmark is designed to be easily reproducible in the real world, making it accessible to researchers and practitioners. We also provide our experimental results and analyzes for model-based and model-free 6D robotic grasping on the benchmark, where representative algorithms are evaluated for object perception, grasping planning, and motion planning. We believe that our benchmark will be a valuable tool for advancing the field of robot manipulation. By providing a standardized evaluation framework, researchers can more easily compare different techniques and algorithms, leading to faster progress in developing robot manipulation methods.  ( 2 min )
    Decoupled Data Consistency with Diffusion Purification for Image Restoration
    arXiv:2403.06054v1 Announce Type: cross Abstract: Diffusion models have recently gained traction as a powerful class of deep generative priors, excelling in a wide range of image restoration tasks due to their exceptional ability to model data distributions. To solve image restoration problems, many existing techniques achieve data consistency by incorporating additional likelihood gradient steps into the reverse sampling process of diffusion models. However, the additional gradient steps pose a challenge for real-world practical applications as they incur a large computational overhead, thereby increasing inference time. They also present additional difficulties when using accelerated diffusion model samplers, as the number of data consistency steps is limited by the number of reverse sampling steps. In this work, we propose a novel diffusion-based image restoration solver that addresses these issues by decoupling the reverse process from the data consistency steps. Our method involves alternating between a reconstruction phase to maintain data consistency and a refinement phase that enforces the prior via diffusion purification. Our approach demonstrates versatility, making it highly adaptable for efficient problem-solving in latent space. Additionally, it reduces the necessity for numerous sampling steps through the integration of consistency models. The efficacy of our approach is validated through comprehensive experiments across various image restoration tasks, including image denoising, deblurring, inpainting, and super-resolution.  ( 3 min )
    Collaborative and Distributed Bayesian Optimization via Consensus: Showcasing the Power of Collaboration for Optimal Design
    arXiv:2306.14348v2 Announce Type: replace Abstract: Optimal design is a critical yet challenging task within many applications. This challenge arises from the need for extensive trial and error, often done through simulations or running field experiments. Fortunately, sequential optimal design, also referred to as Bayesian optimization when using surrogates with a Bayesian flavor, has played a key role in accelerating the design process through efficient sequential sampling strategies. However, a key opportunity exists nowadays. The increased connectivity of edge devices sets forth a new collaborative paradigm for Bayesian optimization. A paradigm whereby different clients collaboratively borrow strength from each other by effectively distributing their experimentation efforts to improve and fast-track their optimal design process. To this end, we bring the notion of consensus to Bayesian optimization, where clients agree (i.e., reach a consensus) on their next-to-sample designs. Our approach provides a generic and flexible framework that can incorporate different collaboration mechanisms. In lieu of this, we propose transitional collaborative mechanisms where clients initially rely more on each other to maneuver through the early stages with scant data, then, at the late stages, focus on their own objectives to get client-specific solutions. Theoretically, we show the sub-linear growth in regret for our proposed framework. Empirically, through simulated datasets and a real-world collaborative sensor design experiment, we show that our framework can effectively accelerate and improve the optimal design process and benefit all participants.  ( 3 min )
    FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
    arXiv:2403.06367v1 Announce Type: new Abstract: Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. To augment good features, data scientists need to come up with SQL queries manually, which is time-consuming. Featuretools [1] is a widely used tool by the data science community to automatically augment the training data by extracting new features from relevant tables. It represents each feature as a group-by aggregation SQL query on relevant tables and can automatically generate these SQL queries. However, it does not include predicates in these queries, which significantly limits its application in many real-world scenarios. To overcome this limitation, we propose FEATAUG, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. This extension is not trivial because considering predicates will exponentially increase the number of candidate queries. As a result, the original Featuretools framework, which materializes all candidate queries, will not work and needs to be redesigned. We formally define the problem and model it as a hyperparameter optimization problem. We discuss how the Bayesian Optimization can be applied here and propose a novel warm-up strategy to optimize it. To make our algorithm more practical, we also study how to identify promising attribute combinations for predicates. We show that how the beam search idea can partially solve the problem and propose several techniques to further optimize it. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features compared to Featuretools and other baselines. The code is open-sourced at https://github.com/sfu-db/FeatAug  ( 3 min )
    PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
    arXiv:2403.06668v1 Announce Type: new Abstract: Adversarial robustness of the neural network is a significant concern when it is applied to security-critical domains. In this situation, adversarial distillation is a promising option which aims to distill the robustness of the teacher network to improve the robustness of a small student network. Previous works pretrain the teacher network to make it robust to the adversarial examples aimed at itself. However, the adversarial examples are dependent on the parameters of the target network. The fixed teacher network inevitably degrades its robustness against the unseen transferred adversarial examples which targets the parameters of the student network in the adversarial distillation process. We propose PeerAiD to make a peer network learn the adversarial examples of the student network instead of adversarial examples aimed at itself. PeerAiD is an adversarial distillation that trains the peer network and the student network simultaneously in order to make the peer network specialized for defending the student network. We observe that such peer networks surpass the robustness of pretrained robust teacher network against student-attacked adversarial samples. With this peer network and adversarial distillation, PeerAiD achieves significantly higher robustness of the student network with AutoAttack (AA) accuracy up to 1.66%p and improves the natural accuracy of the student network up to 4.72%p with ResNet-18 and TinyImageNet dataset.  ( 2 min )
    DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models
    arXiv:2301.13629v4 Announce Type: replace Abstract: Spatio-temporal graph neural networks (STGNN) have emerged as the dominant model for spatio-temporal graph (STG) forecasting. Despite their success, they fail to model intrinsic uncertainties within STG data, which cripples their practicality in downstream tasks for decision-making. To this end, this paper focuses on probabilistic STG forecasting, which is challenging due to the difficulty in modeling uncertainties and complex ST dependencies. In this study, we present the first attempt to generalize the popular denoising diffusion probabilistic models to STGs, leading to a novel non-autoregressive framework called DiffSTG, along with the first denoising network UGnet for STG in the framework. Our approach combines the spatio-temporal learning capabilities of STGNNs with the uncertainty measurements of diffusion models. Extensive experiments validate that DiffSTG reduces the Continuous Ranked Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7% over existing methods on three real-world datasets.  ( 2 min )
    Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification
    arXiv:2403.06023v1 Announce Type: cross Abstract: The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including Sentiment Analysis, difficult. In this research, we tried to make the understanding of these texts easier for the machine by providing PSC, Persian Slang Converter, a tool for converting conversational texts into formal ones, and by using the most up-to-date and best deep learning methods along with the PSC, the sentiment learning of short Persian language texts for the machine in a better way. be made More than 10 million unlabeled texts from various social networks and movie subtitles (as Conversational texts) and about 10 million news texts (as formal texts) have been used for training unsupervised models and formal implementation of the tool. 60,000 texts from the comments of Instagram social network users with positive, negative, and neutral labels are considered supervised data for training the emotion classification model of short texts. Using the formal tool, 57% of the words of the corpus of conversation were converted. Finally, by using the formalizer, FastText model, and deep LSTM network, an accuracy of 81.91 was obtained on the test data.  ( 2 min )
    Fine-tuning of diffusion models via stochastic control: entropy regularization and beyond
    arXiv:2403.06279v1 Announce Type: cross Abstract: This paper aims to develop and provide a rigorous treatment to the problem of entropy regularized fine-tuning in the context of continuous-time diffusion models, which was recently proposed by Uehara et al. ( arXiv:2402.15194, 2024). We also show how the analysis can be extended to fine-tuning involving a general $f$-divergence regularizer.  ( 2 min )
    Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information
    arXiv:2403.05602v1 Announce Type: cross Abstract: Because protein-protein interactions (PPIs) are crucial to understand living systems, harvesting these data is essential to probe disease development and discern gene/protein functions and biological processes. Some curated datasets contain PPI data derived from the literature and other sources (e.g., IntAct, BioGrid, DIP, and HPRD). However, they are far from exhaustive, and their maintenance is a labor-intensive process. On the other hand, machine learning methods to automate PPI knowledge extraction from the scientific literature have been limited by a shortage of appropriate annotated data. This work presents a unified, multi-source PPI corpora with vetted interaction definitions augmented by binary interaction type labels and a Transformer-based deep learning method that exploits entities' relational context information for relation representation to improve relation classification performance. The model's performance is evaluated on four widely studied biomedical relation extraction datasets, as well as this work's target PPI datasets, to observe the effectiveness of the representation to relation extraction tasks in various data. Results show the model outperforms prior state-of-the-art models. The code and data are available at: https://github.com/BNLNLP/PPI-Relation-Extraction  ( 2 min )
    Data-Driven Ergonomic Risk Assessment of Complex Hand-intensive Manufacturing Processes
    arXiv:2403.05591v1 Announce Type: cross Abstract: Hand-intensive manufacturing processes, such as composite layup and textile draping, require significant human dexterity to accommodate task complexity. These strenuous hand motions often lead to musculoskeletal disorders and rehabilitation surgeries. We develop a data-driven ergonomic risk assessment system with a special focus on hand and finger activity to better identify and address ergonomic issues related to hand-intensive manufacturing processes. The system comprises a multi-modal sensor testbed to collect and synchronize operator upper body pose, hand pose and applied forces; a Biometric Assessment of Complete Hand (BACH) formulation to measure high-fidelity hand and finger risks; and industry-standard risk scores associated with upper body posture, RULA, and hand activity, HAL. Our findings demonstrate that BACH captures injurious activity with a higher granularity in comparison to the existing metrics. Machine learning models are also used to automate RULA and HAL scoring, and generalize well to unseen participants. Our assessment system, therefore, provides ergonomic interpretability of the manufacturing processes studied, and could be used to mitigate risks through minor workplace optimization and posture corrections.  ( 2 min )
    Bayesian Diffusion Models for 3D Shape Reconstruction
    arXiv:2403.06973v1 Announce Type: cross Abstract: We present Bayesian Diffusion Models (BDM), a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We show the effectiveness of BDM on the 3D shape reconstruction task. Compared to prototypical deep learning data-driven approaches trained on paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM brings in rich prior information from standalone labels (e.g. point clouds) to improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian frameworks where explicit prior and likelihood are required for the inference, BDM performs seamless information fusion via coupled diffusion processes with learned gradient computation networks. The specialty of our BDM lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction.  ( 2 min )
    Semi-Supervised Multimodal Multi-Instance Learning for Aortic Stenosis Diagnosis
    arXiv:2403.06024v1 Announce Type: cross Abstract: Automated interpretation of ultrasound imaging of the heart (echocardiograms) could improve the detection and treatment of aortic stenosis (AS), a deadly heart disease. However, existing deep learning pipelines for assessing AS from echocardiograms have two key limitations. First, most methods rely on limited 2D cineloops, thereby ignoring widely available Doppler imaging that contains important complementary information about pressure gradients and blood flow abnormalities associated with AS. Second, obtaining labeled data is difficult. There are often far more unlabeled echocardiogram recordings available, but these remain underutilized by existing methods. To overcome these limitations, we introduce Semi-supervised Multimodal Multiple-Instance Learning (SMMIL), a new deep learning framework for automatic interpretation for structural heart diseases like AS. When deployed, SMMIL can combine information from two input modalities, spectral Dopplers and 2D cineloops, to produce a study-level AS diagnosis. During training, SMMIL can combine a smaller labeled set and an abundant unlabeled set of both modalities to improve its classifier. Experiments demonstrate that SMMIL outperforms recent alternatives at 3-level AS severity classification as well as several clinically relevant AS detection tasks.  ( 2 min )
    Dynamic Perturbation-Adaptive Adversarial Training on Medical Image Classification
    arXiv:2403.06798v1 Announce Type: cross Abstract: Remarkable successes were made in Medical Image Classification (MIC) recently, mainly due to wide applications of convolutional neural networks (CNNs). However, adversarial examples (AEs) exhibited imperceptible similarity with raw data, raising serious concerns on network robustness. Although adversarial training (AT), in responding to malevolent AEs, was recognized as an effective approach to improve robustness, it was challenging to overcome generalization decline of networks caused by the AT. In this paper, in order to reserve high generalization while improving robustness, we proposed a dynamic perturbation-adaptive adversarial training (DPAAT) method, which placed AT in a dynamic learning environment to generate adaptive data-level perturbations and provided a dynamically updated criterion by loss information collections to handle the disadvantage of fixed perturbation sizes in conventional AT methods and the dependence on external transference. Comprehensive testing on dermatology HAM10000 dataset showed that the DPAAT not only achieved better robustness improvement and generalization preservation but also significantly enhanced mean average precision and interpretability on various CNNs, indicating its great potential as a generic adversarial training method on the MIC.  ( 2 min )
    Responsible Artificial Intelligence: A Structured Literature Review
    arXiv:2403.06910v1 Announce Type: cross Abstract: Our research endeavors to advance the concept of responsible artificial intelligence (AI), a topic of increasing importance within EU policy discussions. The EU has recently issued several publications emphasizing the necessity of trust in AI, underscoring the dual nature of AI as both a beneficial tool and a potential weapon. This dichotomy highlights the urgent need for international regulation. Concurrently, there is a need for frameworks that guide companies in AI development, ensuring compliance with such regulations. Our research aims to assist lawmakers and machine learning practitioners in navigating the evolving landscape of AI regulation, identifying focal areas for future attention. This paper introduces a comprehensive and, to our knowledge, the first unified definition of responsible AI. Through a structured literature review, we elucidate the current understanding of responsible AI. Drawing from this analysis, we propose an approach for developing a future framework centered around this concept. Our findings advocate for a human-centric approach to Responsible AI. This approach encompasses the implementation of AI methods with a strong emphasis on ethics, model explainability, and the pillars of privacy, security, and trust.  ( 2 min )
    Are You Being Tracked? Discover the Power of Zero-Shot Trajectory Tracing with LLMs!
    arXiv:2403.06201v1 Announce Type: cross Abstract: There is a burgeoning discussion around the capabilities of Large Language Models (LLMs) in acting as fundamental components that can be seamlessly incorporated into Artificial Intelligence of Things (AIoT) to interpret complex trajectories. This study introduces LLMTrack, a model that illustrates how LLMs can be leveraged for Zero-Shot Trajectory Recognition by employing a novel single-prompt technique that combines role-play and think step-by-step methodologies with unprocessed Inertial Measurement Unit (IMU) data. We evaluate the model using real-world datasets designed to challenge it with distinct trajectories characterized by indoor and outdoor scenarios. In both test scenarios, LLMTrack not only meets but exceeds the performance benchmarks set by traditional machine learning approaches and even contemporary state-of-the-art deep learning models, all without the requirement of training on specialized datasets. The results of our research suggest that, with strategically designed prompts, LLMs can tap into their extensive knowledge base and are well-equipped to analyze raw sensor data with remarkable effectiveness.  ( 2 min )
    Multi-conditioned Graph Diffusion for Neural Architecture Search
    arXiv:2403.06020v1 Announce Type: new Abstract: Neural architecture search automates the design of neural network architectures usually by exploring a large and thus complex architecture search space. To advance the architecture search, we present a graph diffusion-based NAS approach that uses discrete conditional graph diffusion processes to generate high-performing neural network architectures. We then propose a multi-conditioned classifier-free guidance approach applied to graph diffusion networks to jointly impose constraints such as high accuracy and low hardware latency. Unlike the related work, our method is completely differentiable and requires only a single model training. In our evaluations, we show promising results on six standard benchmarks, yielding novel and unique architectures at a fast speed, i.e. less than 0.2 seconds per architecture. Furthermore, we demonstrate the generalisability and efficiency of our method through experiments on ImageNet dataset.  ( 2 min )
    MATRIX: Multi-Agent Trajectory Generation with Diverse Contexts
    arXiv:2403.06041v1 Announce Type: cross Abstract: Data-driven methods have great advantages in modeling complicated human behavioral dynamics and dealing with many human-robot interaction applications. However, collecting massive and annotated real-world human datasets has been a laborious task, especially for highly interactive scenarios. On the other hand, algorithmic data generation methods are usually limited by their model capacities, making them unable to offer realistic and diverse data needed by various application users. In this work, we study trajectory-level data generation for multi-human or human-robot interaction scenarios and propose a learning-based automatic trajectory generation model, which we call Multi-Agent TRajectory generation with dIverse conteXts (MATRIX). MATRIX is capable of generating interactive human behaviors in realistic diverse contexts. We achieve this goal by modeling the explicit and interpretable objectives so that MATRIX can generate human motions based on diverse destinations and heterogeneous behaviors. We carried out extensive comparison and ablation studies to illustrate the effectiveness of our approach across various metrics. We also presented experiments that demonstrate the capability of MATRIX to serve as data augmentation for imitation-based motion planning.  ( 2 min )
    Towards an educational tool for supporting neonatologists in the delivery room
    arXiv:2403.06843v1 Announce Type: cross Abstract: Nowadays, there is evidence that several factors may increase the risk, for an infant, to require stabilisation or resuscitation manoeuvres at birth. However, this risk factors are not completely known, and a universally applicable model for predicting high-risk situations is not available yet. Considering both these limitations and the fact that the need for resuscitation at birth is a rare event, periodic training of the healthcare personnel responsible for newborn caring in the delivery room is mandatory. In this paper, we propose a machine learning approach for identifying risk factors and their impact on the birth event from real data, which can be used by personnel to progressively increase and update their knowledge. Our final goal will be the one of designing a user-friendly mobile application, able to improve the recognition rate and the planning of the appropriate interventions on high-risk patients.  ( 2 min )
    Physics-Guided Abnormal Trajectory Gap Detection
    arXiv:2403.06268v1 Announce Type: cross Abstract: Given trajectories with gaps (i.e., missing data), we investigate algorithms to identify abnormal gaps in trajectories which occur when a given moving object did not report its location, but other moving objects in the same geographic region periodically did. The problem is important due to its societal applications, such as improving maritime safety and regulatory enforcement for global security concerns such as illegal fishing, illegal oil transfers, and trans-shipments. The problem is challenging due to the difficulty of bounding the possible locations of the moving object during a trajectory gap, and the very high computational cost of detecting gaps in such a large volume of location data. The current literature on anomalous trajectory detection assumes linear interpolation within gaps, which may not be able to detect abnormal gaps since objects within a given region may have traveled away from their shortest path. In preliminary work, we introduced an abnormal gap measure that uses a classical space-time prism model to bound an object's possible movement during the trajectory gap and provided a scalable memoized gap detection algorithm (Memo-AGD). In this paper, we propose a Space Time-Aware Gap Detection (STAGD) approach to leverage space-time indexing and merging of trajectory gaps. We also incorporate a Dynamic Region Merge-based (DRM) approach to efficiently compute gap abnormality scores. We provide theoretical proofs that both algorithms are correct and complete and also provide analysis of asymptotic time complexity. Experimental results on synthetic and real-world maritime trajectory data show that the proposed approach substantially improves computation time over the baseline technique.  ( 3 min )
    Few-Shot Cross-Lingual Transfer for Prompting Large Language Models in Low-Resource Languages
    arXiv:2403.06018v1 Announce Type: cross Abstract: Large pre-trained language models (PLMs) are at the forefront of advances in Natural Language Processing. One widespread use case of PLMs is "prompting" - or in-context learning - where a user provides a description of a task and some completed examples of the task to a PLM as context before prompting the PLM to perform the task on a new example. Only the largest, most capable PLMs are able to perform in-context learning effectively, and these models are typically trained with a predominantly English corpus, leaving all other languages behind. The data limitations in most languages preclude the training of language-specific PLMs capable of prompting. Albeit the surge in work of prompting settings, it is still unclear how PLMs should be adapted cross-lingually specifically for prompting. We evaluate the possible methods to adapt LLaMa, a 7B parameter open-source PLM mainly trained in English, for prompting in low-resource languages, namely for Kinyarwanda, Hausa, and Luganda. We consider three methods: few-shot prompting (prompt), language-adaptive fine-tuning (LAFT), and neural machine translation (translate), and evaluate on abstractive summarization, multi-class topic classification, and named-entity recognition. Although LAFT carries the greatest compute cost and intuitively should lead to the best results, our experiments exhibit that LAFT is only occasionally the optimal choice for adapting PLMs for prompting. Rather, the translate and prompt settings are a compute-efficient and cost-effective method of few-shot prompting for the selected low-resource languages. We find that the results are task and language dependent but find that the prompting method is the best on average across all tasks and languages. Results show that the prompt setting performs better than both translating and LAFT with statistical significance for all shots when aggregated across all tasks and languages.  ( 3 min )
    DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency
    arXiv:2403.06197v1 Announce Type: cross Abstract: The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognosis. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multi-modal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models. Our implementation is publicly available at https://github.com/dorothy-yao/drfuse.  ( 2 min )
    Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models
    arXiv:2402.11140v1 Announce Type: cross Abstract: The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain of thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs by iteratively exploring and self-evaluating many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which will serve as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps, and more importantly, uses error analysis obtained from the LLM on them to explicitly revise prompting, which in turn enhances reasoning step generation, until a final answer is attained. Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves higher or comparable problem-solving rates than other advanced prompting approaches.  ( 2 min )
    A Geospatial Approach to Predicting Desert Locust Breeding Grounds in Africa
    arXiv:2403.06860v1 Announce Type: new Abstract: Desert locust swarms present a major threat to agriculture and food security. Addressing this challenge, our study develops an operationally-ready model for predicting locust breeding grounds, which has the potential to enhance early warning systems and targeted control measures. We curated a dataset from the United Nations Food and Agriculture Organization's (UN-FAO) locust observation records and analyzed it using two types of spatio-temporal input features: remotely-sensed environmental and climate data as well as multi-spectral earth observation images. Our approach employed custom deep learning models (three-dimensional and LSTM-based recurrent convolutional networks), along with the geospatial foundational model Prithvi recently released by Jakubik et al., 2023. These models notably outperformed existing baselines, with the Prithvi-based model, fine-tuned on multi-spectral images from NASA's Harmonized Landsat and Sentinel-2 (HLS) dataset, achieving the highest accuracy, F1 and ROC-AUC scores (83.03%, 81.53% and 87.69%, respectively). A significant finding from our research is that multi-spectral earth observation images alone are sufficient for effective locust breeding ground prediction without the need to explicitly incorporate climatic or environmental features.  ( 2 min )
    {\epsilon}-Neural Thompson Sampling of Deep Brain Stimulation for Parkinson Disease Treatment
    arXiv:2403.06814v1 Announce Type: new Abstract: Deep Brain Stimulation (DBS) stands as an effective intervention for alleviating the motor symptoms of Parkinson's disease (PD). Traditional commercial DBS devices are only able to deliver fixed-frequency periodic pulses to the basal ganglia (BG) regions of the brain, i.e., continuous DBS (cDBS). However, they in general suffer from energy inefficiency and side effects, such as speech impairment. Recent research has focused on adaptive DBS (aDBS) to resolve the limitations of cDBS. Specifically, reinforcement learning (RL) based approaches have been developed to adapt the frequencies of the stimuli in order to achieve both energy efficiency and treatment efficacy. However, RL approaches in general require significant amount of training data and computational resources, making it intractable to integrate RL policies into real-time embedded systems as needed in aDBS. In contrast, contextual multi-armed bandits (CMAB) in general lead to better sample efficiency compared to RL. In this study, we propose a CMAB solution for aDBS. Specifically, we define the context as the signals capturing irregular neuronal firing activities in the BG regions (i.e., beta-band power spectral density), while each arm signifies the (discretized) pulse frequency of the stimulation. Moreover, an {\epsilon}-exploring strategy is introduced on top of the classic Thompson sampling method, leading to an algorithm called {\epsilon}-Neural Thompson sampling ({\epsilon}-NeuralTS), such that the learned CMAB policy can better balance exploration and exploitation of the BG environment. The {\epsilon}-NeuralTS algorithm is evaluated using a computation BG model that captures the neuronal activities in PD patients' brains. The results show that our method outperforms both existing cDBS methods and CMAB baselines.  ( 3 min )
    Conservative DDPG -- Pessimistic RL without Ensemble
    arXiv:2403.05732v1 Announce Type: cross Abstract: DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty. This solution, acting as an uncertainty measure, can be easily implemented with minimal code and without the need for an ensemble. Our empirical findings strongly support the superiority of Conservative DDPG over DDPG across various MuJoCo and Bullet tasks. We consistently observe better performance in all evaluated tasks and even competitive or superior performance compared to TD3 and TD7, all achieved with significantly reduced computational requirements.  ( 2 min )
    Calibrating Large Language Models Using Their Generations Only
    arXiv:2403.05973v1 Announce Type: cross Abstract: As large language models (LLMs) are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or adjusting the given answer based on the confidence. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.  ( 2 min )
    Speeding up 6-DoF Grasp Sampling with Quality-Diversity
    arXiv:2403.06173v1 Announce Type: cross Abstract: Recent advances in AI have led to significant results in robotic learning, including natural language-conditioned planning and efficient optimization of controllers using generative models. However, the interaction data remains the bottleneck for generalization. Getting data for grasping is a critical challenge, as this skill is required to complete many manipulation tasks. Quality-Diversity (QD) algorithms optimize a set of solutions to get diverse, high-performing solutions to a given problem. This paper investigates how QD can be combined with priors to speed up the generation of diverse grasps poses in simulation compared to standard 6-DoF grasp sampling schemes. Experiments conducted on 4 grippers with 2-to-5 fingers on standard objects show that QD outperforms commonly used methods by a large margin. Further experiments show that QD optimization automatically finds some efficient priors that are usually hard coded. The deployment of generated grasps on a 2-finger gripper and an Allegro hand shows that the diversity produced maintains sim-to-real transferability. We believe these results to be a significant step toward the generation of large datasets that can lead to robust and generalizing robotic grasping policies.  ( 2 min )
    Large Generative Model Assisted 3D Semantic Communication
    arXiv:2403.05783v1 Announce Type: cross Abstract: Semantic Communication (SC) is a novel paradigm for data transmission in 6G. However, there are several challenges posed when performing SC in 3D scenarios: 1) 3D semantic extraction; 2) Latent semantic redundancy; and 3) Uncertain channel estimation. To address these issues, we propose a Generative AI Model assisted 3D SC (GAM-3DSC) system. Firstly, we introduce a 3D Semantic Extractor (3DSE), which employs generative AI models, including Segment Anything Model (SAM) and Neural Radiance Field (NeRF), to extract key semantics from a 3D scenario based on user requirements. The extracted 3D semantics are represented as multi-perspective images of the goal-oriented 3D object. Then, we present an Adaptive Semantic Compression Model (ASCM) for encoding these multi-perspective images, in which we use a semantic encoder with two output heads to perform semantic encoding and mask redundant semantics in the latent semantic space, respectively. Next, we design a conditional Generative adversarial network and Diffusion model aided-Channel Estimation (GDCE) to estimate and refine the Channel State Information (CSI) of physical channels. Finally, simulation results demonstrate the advantages of the proposed GAM-3DSC system in effectively transmitting the goal-oriented 3D scenario.  ( 2 min )
    Benign overfitting in leaky ReLU networks with moderate input dimension
    arXiv:2403.06903v1 Announce Type: new Abstract: The problem of benign overfitting asks whether it is possible for a model to perfectly fit noisy training data and still generalize well. We study benign overfitting in two-layer leaky ReLU networks trained with the hinge loss on a binary classification task. We consider input data which can be decomposed into the sum of a common signal and a random noise component, which lie on subspaces orthogonal to one another. We characterize conditions on the signal to noise ratio (SNR) of the model parameters giving rise to benign versus non-benign, or harmful, overfitting: in particular, if the SNR is high then benign overfitting occurs, conversely if the SNR is low then harmful overfitting occurs. We attribute both benign and non-benign overfitting to an approximate margin maximization property and show that leaky ReLU networks trained on hinge loss with Gradient Descent (GD) satisfy this property. In contrast to prior work we do not require near orthogonality conditions on the training data: notably, for input dimension $d$ and training sample size $n$, while prior work shows asymptotically optimal error when $d = \Omega(n^2 \log n)$, here we require only $d = \Omega\left(n \log \frac{1}{\epsilon}\right)$ to obtain error within $\epsilon$ of optimal.  ( 2 min )
    Implicit Image-to-Image Schrodinger Bridge for CT Super-Resolution and Denoising
    arXiv:2403.06069v1 Announce Type: cross Abstract: Conditional diffusion models have gained recognition for their effectiveness in image restoration tasks, yet their iterative denoising process, starting from Gaussian noise, often leads to slow inference speeds. As a promising alternative, the Image-to-Image Schr\"odinger Bridge (I2SB) initializes the generative process from corrupted images and integrates training techniques from conditional diffusion models. In this study, we extended the I2SB method by introducing the Implicit Image-to-Image Schrodinger Bridge (I3SB), transitioning its generative process to a non-Markovian process by incorporating corrupted images in each generative step. This enhancement empowers I3SB to generate images with better texture restoration using a small number of generative steps. The proposed method was validated on CT super-resolution and denoising tasks and outperformed existing methods, including the conditional denoising diffusion probabilistic model (cDDPM) and I2SB, in both visual quality and quantitative metrics. These findings underscore the potential of I3SB in improving medical image restoration by providing fast and accurate generative modeling.  ( 2 min )
    LEGION: Harnessing Pre-trained Language Models for GitHub Topic Recommendations with Distribution-Balance Loss
    arXiv:2403.05873v1 Announce Type: cross Abstract: Open-source development has revolutionized the software industry by promoting collaboration, transparency, and community-driven innovation. Today, a vast amount of various kinds of open-source software, which form networks of repositories, is often hosted on GitHub - a popular software development platform. To enhance the discoverability of the repository networks, i.e., groups of similar repositories, GitHub introduced repository topics in 2017 that enable users to more easily explore relevant projects by type, technology, and more. It is thus crucial to accurately assign topics for each GitHub repository. Current methods for automatic topic recommendation rely heavily on TF-IDF for encoding textual data, presenting challenges in understanding semantic nuances. This paper addresses the limitations of existing techniques by proposing Legion, a novel approach that leverages Pre-trained Language Models (PTMs) for recommending topics for GitHub repositories. The key novelty of Legion is three-fold. First, Legion leverages the extensive capabilities of PTMs in language understanding to capture contextual information and semantic meaning in GitHub repositories. Second, Legion overcomes the challenge of long-tailed distribution, which results in a bias toward popular topics in PTMs, by proposing a Distribution-Balanced Loss (DB Loss) to better train the PTMs. Third, Legion employs a filter to eliminate vague recommendations, thereby improving the precision of PTMs. Our empirical evaluation on a benchmark dataset of real-world GitHub repositories shows that Legion can improve vanilla PTMs by up to 26% on recommending GitHubs topics. Legion also can suggest GitHub topics more precisely and effectively than the state-of-the-art baseline with an average improvement of 20% and 5% in terms of Precision and F1-score, respectively.  ( 3 min )
    Generative Probabilistic Forecasting with Applications in Market Operations
    arXiv:2403.05743v1 Announce Type: cross Abstract: This paper presents a novel generative probabilistic forecasting approach derived from the Wiener-Kallianpur innovation representation of nonparametric time series. Under the paradigm of generative artificial intelligence, the proposed forecasting architecture includes an autoencoder that transforms nonparametric multivariate random processes into canonical innovation sequences, from which future time series samples are generated according to their probability distributions conditioned on past samples. A novel deep-learning algorithm is proposed that constrains the latent process to be an independent and identically distributed sequence with matching autoencoder input-output conditional probability distributions. Asymptotic optimality and structural convergence properties of the proposed generative forecasting approach are established. Three applications involving highly dynamic and volatile time series in real-time market operations are considered: (i) locational marginal price forecasting for merchant storage participants, {(ii) interregional price spread forecasting for interchange markets,} and (iii) area control error forecasting for frequency regulations. Numerical studies based on market data from multiple independent system operators demonstrate superior performance against leading traditional and machine learning-based forecasting techniques under both probabilistic and point forecast metrics.  ( 2 min )
    Thread Detection and Response Generation using Transformers with Prompt Optimisation
    arXiv:2403.05931v1 Announce Type: cross Abstract: Conversational systems are crucial for human-computer interaction, managing complex dialogues by identifying threads and prioritising responses. This is especially vital in multi-party conversations, where precise identification of threads and strategic response prioritisation ensure efficient dialogue management. To address these challenges an end-to-end model that identifies threads and prioritises their response generation based on the importance was developed, involving a systematic decomposition of the problem into discrete components - thread detection, prioritisation, and performance optimisation which was meticulously analysed and optimised. These refined components seamlessly integrate into a unified framework, in conversational systems. Llama2 7b is used due to its high level of generalisation but the system can be updated with any open source Large Language Model(LLM). The computational capabilities of the Llama2 model was augmented by using fine tuning methods and strategic prompting techniques to optimise the model's performance, reducing computational time and increasing the accuracy of the model. The model achieves up to 10x speed improvement, while generating more coherent results compared to existing models.  ( 2 min )
    On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes
    arXiv:2403.06806v1 Announce Type: new Abstract: We present the first finite time global convergence analysis of policy gradient in the context of infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left({\frac{1}{T}}\right),$ which translates to $O\left({\log(T)}\right)$ regret, where $T$ represents the number of iterations. Prior work on performance bounds for discounted reward MDPs cannot be extended to average reward MDPs because the bounds grow proportional to the fifth power of the effective horizon. Thus, our primary contribution is in proving that the policy gradient algorithm converges for average-reward MDPs and in obtaining finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations to empirically evaluate the performance of average reward policy gradient algorithm.  ( 2 min )
    Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated Text
    arXiv:2403.05750v1 Announce Type: cross Abstract: Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges, explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.  ( 2 min )
    Multi-source and multimodal data fusion for predicting academic performance in blended learning university courses
    arXiv:2403.05552v1 Announce Type: cross Abstract: In this paper we applied data fusion approaches for predicting the final academic performance of university students using multiple-source, multimodal data from blended learning environments. We collected and preprocessed data about first-year university students from different sources: theory classes, practical sessions, on-line Moodle sessions, and a final exam. Our objective was to discover which data fusion approach produced the best results using our data. We carried out experiments by applying four different data fusion approaches and six classification algorithms. The results showed that the best predictions were produced using ensembles and selecting the best attributes approach with discretized data. The best prediction models showed us that the level of attention in theory classes, scores in Moodle quizzes, and the level of activity in Moodle forums were the best set of attributes for predicting students' final performance in our courses.  ( 2 min )
    DP-TabICL: In-Context Learning with Differentially Private Tabular Data
    arXiv:2403.05681v1 Announce Type: cross Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs and it has been shown to have comparable performance to costly model retraining and fine-tuning. Recently, ICL has been extended to allow tabular data to be used as demonstration examples by serializing individual records into natural language formats. However, it has been shown that LLMs can leak information contained in prompts, and since tabular data often contain sensitive information, understanding how to protect the underlying tabular data used in ICL is a critical area of research. This work serves as an initial investigation into how to use differential privacy (DP) -- the long-established gold standard for data privacy and anonymization -- to protect tabular data used in ICL. Specifically, we investigate the application of DP mechanisms for private tabular ICL via data privatization prior to serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP scenarios via injecting noise into individual records or group statistics, respectively. We evaluate our DP-based frameworks on eight real-world tabular datasets and across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving comparable performance to non-LLM baselines, especially under high privacy regimes.  ( 2 min )
    How much data do you need? Part 2: Predicting DL class specific training dataset sizes
    arXiv:2403.06311v1 Announce Type: new Abstract: This paper targets the question of predicting machine learning classification model performance, when taking into account the number of training examples per class and not just the overall number of training examples. This leads to the a combinatorial question, which combinations of number of training examples per class should be considered, given a fixed overall training dataset size. In order to solve this question, an algorithm is suggested which is motivated from special cases of space filling design of experiments. The resulting data are modeled using models like powerlaw curves and similar models, extended like generalized linear models i.e. by replacing the overall training dataset size by a parametrized linear combination of the number of training examples per label class. The proposed algorithm has been applied on the CIFAR10 and the EMNIST datasets.  ( 2 min )
    TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation
    arXiv:2403.05822v1 Announce Type: new Abstract: Over the years, network traffic analysis and generation have advanced significantly. From traditional statistical methods, the field has progressed to sophisticated deep learning techniques. This progress has improved the ability to detect complex patterns and security threats, as well as to test and optimize network performance. However, obstacles persist, such as the dependence on labeled data for analysis and the difficulty of generating traffic samples that follow realistic patterns. Pre-trained deep neural networks have emerged as powerful tools to resolve these issues, offering improved performance by learning robust data representations from large unlabeled datasets. Despite their benefits, existing pre-trained models face challenges like token length limitation, which restricts their usefulness in comprehensive traffic analysis and realistic traffic generation. To address these challenges, we introduce TrafficGPT, a deep learning model that can tackle complex challenges related to long flow classification and generation tasks. This model uses generative pre-training with the linear attention mechanism, which allows for a substantially increased capacity of up to 12,032 tokens from the previous limit of only 512 tokens. TrafficGPT demonstrates superior performance in classification tasks, reaching state-of-the-art levels. In generation tasks, it closely resembles real traffic flows, with low JS divergence and an F1 score close to 0.5 (representing a random guess) in discriminating generated data. These advancements hold promise for future applications in both traffic flow classification and generation tasks.  ( 3 min )
    Editing Conceptual Knowledge for Large Language Models
    arXiv:2403.06259v1 Announce Type: cross Abstract: Recently, there has been a growing interest in knowledge editing for Large Language Models (LLMs). Current approaches and evaluations merely explore the instance-level editing, while whether LLMs possess the capability to modify concepts remains unclear. This paper pioneers the investigation of editing conceptual knowledge for LLMs, by constructing a novel benchmark dataset ConceptEdit and establishing a suite of new metrics for evaluation. The experimental results reveal that, although existing editing methods can efficiently modify concept-level definition to some extent, they also have the potential to distort the related instantial knowledge in LLMs, leading to poor performance. We anticipate this can inspire further progress in better understanding LLMs. Our project homepage is available at https://zjunlp.github.io/project/ConceptEdit.  ( 2 min )
    Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
    arXiv:2403.06289v1 Announce Type: cross Abstract: Human-annotated vision datasets inevitably contain a fraction of human mislabelled examples. While the detrimental effects of such mislabelling on supervised learning are well-researched, their influence on Supervised Contrastive Learning (SCL) remains largely unexplored. In this paper, we show that human-labelling errors not only differ significantly from synthetic label errors, but also pose unique challenges in SCL, different to those in traditional supervised learning methods. Specifically, our results indicate they adversely impact the learning process in the ~99% of cases when they occur as false positive samples. Existing noise-mitigating methods primarily focus on synthetic label errors and tackle the unrealistic setting of very high synthetic noise rates (40-80%), but they often underperform on common image datasets due to overfitting. To address this issue, we introduce a novel SCL objective with robustness to human-labelling errors, SCL-RHE. SCL-RHE is designed to mitigate the effects of real-world mislabelled examples, typically characterized by much lower noise rates (<5%). We demonstrate that SCL-RHE consistently outperforms state-of-the-art representation learning and noise-mitigating methods across various vision benchmarks, by offering improved resilience against human-labelling errors.  ( 2 min )
    Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
    arXiv:2403.06265v1 Announce Type: cross Abstract: Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be viewed as 0-gram language modeling where equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for downstream success of pre-trained language models. We control the compression ability of several BPE tokenizers by varying the amount of documents available during their training: from 1 million documents to a character-based tokenizer equivalent to no training data at all. We then pre-train English language models based on those tokenizers and fine-tune them over several tasks. We show that there is a correlation between tokenizers' compression and models' downstream performance, suggesting that compression is a reliable intrinsic indicator of tokenization quality. These correlations are more pronounced for generation tasks (over classification) or for smaller models (over large ones). We replicated a representative part of our experiments on Turkish and found similar results, confirming that our results hold for languages with typological characteristics dissimilar to English. We conclude that building better compressing tokenizers is a fruitful avenue for further research and for improving overall model performance.  ( 2 min )
    A Benchmark of Domain-Adapted Large Language Models for Generating Brief Hospital Course Summaries
    arXiv:2403.05720v1 Announce Type: cross Abstract: Brief hospital course (BHC) summaries are common clinical documents generated by summarizing clinical notes. While large language models (LLMs) depict remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as BHC synthesis have not been shown. To enable the adaptation of LLMs for BHC synthesis, we introduce a novel benchmark consisting of a pre-processed dataset extracted from MIMIC-IV notes, encapsulating clinical note, and brief hospital course (BHC) pairs. We assess the performance of two general-purpose LLMs and three healthcare-adapted LLMs to improve BHC synthesis from clinical notes. Using clinical notes as input for generating BHCs, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs (Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5, GPT-4). We quantitatively evaluate the performance of these LLMs across varying context-length inputs using conventional natural language similarity metrics. We further perform a qualitative study where five diverse clinicians blindly compare clinician-written BHCs and two LLM-generated BHCs for 30 samples across metrics of comprehensiveness, conciseness, factual correctness, and fluency. Overall, we present a new benchmark and pre-processed dataset for using LLMs in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. We propose our work as a benchmark to motivate future works to adapt and assess the performance of LLMs in BHC synthesis.  ( 3 min )
    AI in ESG for Financial Institutions: An Industrial Survey
    arXiv:2403.05541v1 Announce Type: cross Abstract: The burgeoning integration of Artificial Intelligence (AI) into Environmental, Social, and Governance (ESG) initiatives within the financial sector represents a paradigm shift towards more sus-tainable and equitable financial practices. This paper surveys the industrial landscape to delineate the necessity and impact of AI in bolstering ESG frameworks. With the advent of stringent regulatory requirements and heightened stakeholder awareness, financial institutions (FIs) are increasingly compelled to adopt ESG criteria. AI emerges as a pivotal tool in navigating the complex in-terplay of financial activities and sustainability goals. Our survey categorizes AI applications across three main pillars of ESG, illustrating how AI enhances analytical capabilities, risk assessment, customer engagement, reporting accuracy and more. Further, we delve into the critical con-siderations surrounding the use of data and the development of models, underscoring the importance of data quality, privacy, and model robustness. The paper also addresses the imperative of responsible and sustainable AI, emphasizing the ethical dimensions of AI deployment in ESG-related banking processes. Conclusively, our findings suggest that while AI offers transformative potential for ESG in banking, it also poses significant challenges that necessitate careful consideration. The final part of the paper synthesizes the survey's insights, proposing a forward-looking stance on the adoption of AI in ESG practices. We conclude with recommendations with a reference architecture for future research and development, advocating for a balanced approach that leverages AI's strengths while mitigating its risks within the ESG domain.  ( 2 min )
    Nonparametric Automatic Differentiation Variational Inference with Spline Approximation
    arXiv:2403.06302v1 Announce Type: cross Abstract: Automatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded support. Compared with widely-used nonparametric variational inference methods, the proposed method is easy to implement and adaptive to various data structures. By adopting the spline approximation, we derive a lower bound of the importance weighted autoencoder and establish the asymptotic consistency. Experiments demonstrate the efficiency of the proposed method in approximating complex posterior distributions and improving the performance of generative models with incomplete data.  ( 2 min )
    Model-Free Local Recalibration of Neural Networks
    arXiv:2403.05756v1 Announce Type: cross Abstract: Artificial neural networks (ANNs) are highly flexible predictive models. However, reliably quantifying uncertainty for their predictions is a continuing challenge. There has been much recent work on "recalibration" of predictive distributions for ANNs, so that forecast probabilities for events of interest are consistent with certain frequency evaluations of them. Uncalibrated probabilistic forecasts are of limited use for many important decision-making tasks. To address this issue, we propose a localized recalibration of ANN predictive distributions using the dimension-reduced representation of the input provided by the ANN hidden layers. Our novel method draws inspiration from recalibration techniques used in the literature on approximate Bayesian computation and likelihood-free inference methods. Most existing calibration methods for ANNs can be thought of as calibrating either on the input layer, which is difficult when the input is high-dimensional, or the output layer, which may not be sufficiently flexible. Through a simulation study, we demonstrate that our method has good performance compared to alternative approaches, and explore the benefits that can be achieved by localizing the calibration based on different layers of the network. Finally, we apply our proposed method to a diamond price prediction problem, demonstrating the potential of our approach to improve prediction and uncertainty quantification in real-world applications.  ( 2 min )
    The ALL0CORE Tensor Decomposition for Sparse Count Data
    arXiv:2403.06153v1 Announce Type: cross Abstract: This paper introduces ALL0CORE, a new form of probabilistic non-negative tensor decomposition. ALL0CORE is a Tucker decomposition where the number of non-zero elements (i.e., the L0-norm) of the core tensor is constrained to a preset value Q much smaller than the size of the core. While the user dictates the total budget Q, the locations and values of the non-zero elements are latent variables and allocated across the core tensor during inference. ALL0CORE -- i.e., allocated L0-constrained core -- thus enjoys both the computational tractability of CP decomposition and the qualitatively appealing latent structure of Tucker. In a suite of real-data experiments, we demonstrate that ALL0CORE typically requires only tiny fractions (e.g.,~1%) of the full core to achieve the same results as full Tucker decomposition at only a correspondingly tiny fraction of the cost.  ( 2 min )
    Online Multi-spectral Neuron Tracing
    arXiv:2403.06251v1 Announce Type: cross Abstract: In this paper, we propose an online multi-spectral neuron tracing method with uniquely designed modules, where no offline training are required. Our method is trained online to update our enhanced discriminative correlation filter to conglutinate the tracing process. This distinctive offline-training-free schema differentiates us from other training-dependent tracing approaches like deep learning methods since no annotation is needed for our method. Besides, compared to other tracing methods requiring complicated set-up such as for clustering and graph multi-cut, our approach is much easier to be applied to new images. In fact, it only needs a starting bounding box of the tracing neuron, significantly reducing users' configuration effort. Our extensive experiments show that our training-free and easy-configured methodology allows fast and accurate neuron reconstructions in multi-spectral images.  ( 2 min )
    A Survey on Data Selection for Language Models
    arXiv:2402.16827v2 Announce Type: replace-cross Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.  ( 3 min )
    Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion Models
    arXiv:2402.05210v3 Announce Type: replace-cross Abstract: Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. This hampers many useful applications, including pre-registered image generation, counterfactual scenarios, and others. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our model ("Seg-Diff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that it reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space.  ( 2 min )
    Insights into Multiscale Complexity: from Macroscopic Patterns to Microscopic Simulations via Deep Learning
    arXiv:2402.05067v4 Announce Type: replace-cross Abstract: Multiscale phenomena manifest across various scientific domains, presenting a ubiquitous challenge in accurately and effectively simulating multiscale dynamics in complex systems. In this paper, a novel decoupling solving mode is proposed through modelling large-scale dynamics independently and treating small-scale dynamics as a slaved system. A Spectral Physics-informed Neural Network (PINN) is developed to characterize the small-scale system in an efficient and accurate way. The effectiveness of the method is demonstrated through extensive numerical experiments, including one-dimensional Kuramot-Sivashinsky equation, two- and three-dimensional Navier-Stokes equations, showcasing its versatility in addressing problems of fluid dynamics. Furthermore, we also delve into the application of the proposed approach to more complex problems, including non-uniform meshes, complex geometries, large-scale data with noise, and high-dimensional small-scale dynamics. The discussions about these scenarios contribute to a comprehensive understanding of the method's capabilities and limitations. This paper presents a valuable and promising approach to enhance the computational simulations of multiscale spatiotemporal systems, which enables the acquisition of large-scale data with minimal computational demands, followed by Spectral PINN to capture small-scale dynamics with improved efficiency and accuracy.  ( 3 min )
    Elastic Feature Consolidation for Cold Start Exemplar-free Incremental Learning
    arXiv:2402.03917v2 Announce Type: replace-cross Abstract: Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, which results in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose a simple and effective approach that consolidates feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our method, called Elastic Feature Consolidation (EFC), exploits a tractable second-order approximation of feature drift based on an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes used in a novel asymmetric cross entropy loss which effectively balances prototype rehearsal with data from new tasks. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset and ImageNet-1K demonstrate that Elastic Feature Consolidation is better able to learn new tasks by maintaining model plasticity and significantly outperform the state-of-the-art.  ( 3 min )
    Gaussian process regression with Sliced Wasserstein Weisfeiler-Lehman graph kernels
    arXiv:2402.03838v2 Announce Type: replace-cross Abstract: Supervised learning has recently garnered significant attention in the field of computational physics due to its ability to effectively extract complex patterns for tasks like solving partial differential equations, or predicting material properties. Traditionally, such datasets consist of inputs given as meshes with a large number of nodes representing the problem geometry (seen as graphs), and corresponding outputs obtained with a numerical solver. This means the supervised learning model must be able to handle large and sparse graphs with continuous node attributes. In this work, we focus on Gaussian process regression, for which we introduce the Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernel. In contrast to existing graph kernels, the proposed SWWL kernel enjoys positive definiteness and a drastic complexity reduction, which makes it possible to process datasets that were previously impossible to handle. The new kernel is first validated on graph classification for molecular datasets, where the input graphs have a few tens of nodes. The efficiency of the SWWL kernel is then illustrated on graph regression in computational fluid dynamics and solid mechanics, where the input graphs are made up of tens of thousands of nodes.  ( 2 min )
    Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?
    arXiv:2401.13045v2 Announce Type: replace-cross Abstract: Over the past decade, the intricacies of sports-related concussions among female athletes have become readily apparent. Traditional clinical methods for diagnosing concussions suffer limitations when applied to female athletes, often failing to capture subtle changes in brain structure and function. Advanced neuroinformatics techniques and machine learning models have become invaluable assets in this endeavor. While these technologies have been extensively employed in understanding concussion in male athletes, there remains a significant gap in our comprehension of their effectiveness for female athletes. With its remarkable data analysis capacity, machine learning offers a promising avenue to bridge this deficit. By harnessing the power of machine learning, researchers can link observed phenotypic neuroimaging data to sex-specific biological mechanisms, unraveling the mysteries of concussions in female athletes. Furthermore, embedding methods within machine learning enable examining brain architecture and its alterations beyond the conventional anatomical reference frame. In turn, allows researchers to gain deeper insights into the dynamics of concussions, treatment responses, and recovery processes. To guarantee that female athletes receive the optimal care they deserve, researchers must employ advanced neuroimaging techniques and sophisticated machine-learning models. These tools enable an in-depth investigation of the underlying mechanisms responsible for concussion symptoms stemming from neuronal dysfunction in female athletes. This paper endeavors to address the crucial issue of sex differences in multimodal neuroimaging experimental design and machine learning approaches within female athlete populations, ultimately ensuring that they receive the tailored care they require when facing the challenges of concussions.  ( 3 min )
    A transductive few-shot learning approach for classification of digital histopathological slides from liver cancer
    arXiv:2311.17740v2 Announce Type: replace-cross Abstract: This paper presents a new approach for classifying 2D histopathology patches using few-shot learning. The method is designed to tackle a significant challenge in histopathology, which is the limited availability of labeled data. By applying a sliding window technique to histopathology slides, we illustrate the practical benefits of transductive learning (i.e., making joint predictions on patches) to achieve consistent and accurate classification. Our approach involves an optimization-based strategy that actively penalizes the prediction of a large number of distinct classes within each window. We conducted experiments on histopathological data to classify tissue classes in digital slides of liver cancer, specifically hepatocellular carcinoma. The initial results show the effectiveness of our method and its potential to enhance the process of automated cancer diagnosis and treatment, all while reducing the time and effort required for expert annotation.  ( 2 min )
    SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
    arXiv:2401.11791v2 Announce Type: replace-cross Abstract: Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision. Since precise pixel-level annotations are not accessible, existing methods typically focus on producing pseudo masks for training segmentation models by refining CAM-like heatmaps. However, the produced heatmaps may capture only the discriminative image regions of object categories or the associated co-occurring backgrounds. To address the issues, we propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space to enhance the semantic alignment between the segmented regions and the target object categories. More specifically, we propose Contrastive Prompt Learning and Prompt-guided Semantic Refinement to learn the prompts that adequately describe and suppress the co-occurring backgrounds associated with each target object category. In this way, SemPLeS can perform better semantic alignment between object regions and the associated class labels, resulting in desired pseudo masks for training the segmentation model. The proposed SemPLeS framework achieves SOTA performance on the standard WSSS benchmarks, PASCAL VOC and MS COCO, and shows compatibility with other WSSS methods. The source codes are provided in the supplementary.  ( 2 min )
    Opening the Black Box: Towards inherently interpretable energy data imputation models using building physics insight
    arXiv:2311.16632v2 Announce Type: replace-cross Abstract: Missing data are frequently observed by practitioners and researchers in the building energy modeling community. In this regard, advanced data-driven solutions, such as Deep Learning methods, are typically required to reflect the non-linear behavior of these anomalies. As an ongoing research question related to Deep Learning, a model's applicability to limited data settings can be explored by introducing prior knowledge in the network. This same strategy can also lead to more interpretable predictions, hence facilitating the field application of the approach. For that purpose, the aim of this paper is to propose the use of Physics-informed Denoising Autoencoders (PI-DAE) for missing data imputation in commercial buildings. In particular, the presented method enforces physics-inspired soft constraints to the loss function of a Denoising Autoencoder (DAE). In order to quantify the benefits of the physical component, an ablation study between different DAE configurations is conducted. First, three univariate DAEs are optimized separately on indoor air temperature, heating, and cooling data. Then, two multivariate DAEs are derived from the previous configurations. Eventually, a building thermal balance equation is coupled to the last multivariate configuration to obtain PI-DAE. Additionally, two commonly used benchmarks are employed to support the findings. It is shown how introducing physical knowledge in a multivariate Denoising Autoencoder can enhance the inherent model interpretability through the optimized physics-based coefficients. While no significant improvement is observed in terms of reconstruction error with the proposed PI-DAE, its enhanced robustness to varying rates of missing data and the valuable insights derived from the physics-based coefficients create opportunities for wider applications within building systems and the built environment.  ( 3 min )
    FRAC-Q-Learning: A Reinforcement Learning with Boredom Avoidance Processes for Social Robots
    arXiv:2311.15327v3 Announce Type: replace-cross Abstract: The reinforcement learning algorithms have often been applied to social robots. However, most reinforcement learning algorithms were not optimized for the use of social robots, and consequently they may bore users. We proposed a new reinforcement learning method specialized for the social robot, the FRAC-Q-learning, that can avoid user boredom. The proposed algorithm consists of a forgetting process in addition to randomizing and categorizing processes. This study evaluated interest and boredom hardness scores of the FRAC-Q-learning by a comparison with the traditional Q-learning. The FRAC-Q-learning showed significantly higher trend of interest score, and indicated significantly harder to bore users compared to the traditional Q-learning. Therefore, the FRAC-Q-learning can contribute to develop a social robot that will not bore users. The proposed algorithm can also find applications in Web-based communication and educational systems. This paper presents the entire process, detailed implementation and a detailed evaluation method of the of the FRAC-Q-learning for the first time.  ( 2 min )
    ECNR: Efficient Compressive Neural Representation of Time-Varying Volumetric Datasets
    arXiv:2311.12831v4 Announce Type: replace-cross Abstract: Due to its conceptual simplicity and generality, compressive neural representation has emerged as a promising alternative to traditional compression methods for managing massive volumetric datasets. The current practice of neural compression utilizes a single large multilayer perceptron (MLP) to encode the global volume, incurring slow training and inference. This paper presents an efficient compressive neural representation (ECNR) solution for time-varying data compression, utilizing the Laplacian pyramid for adaptive signal fitting. Following a multiscale structure, we leverage multiple small MLPs at each scale for fitting local content or residual blocks. By assigning similar blocks to the same MLP via size uniformization, we enable balanced parallelization among MLPs to significantly speed up training and inference. Working in concert with the multiscale structure, we tailor a deep compression strategy to compact the resulting model. We show the effectiveness of ECNR with multiple datasets and compare it with state-of-the-art compression methods (mainly SZ3, TTHRESH, and neurcomp). The results position ECNR as a promising solution for volumetric data compression.  ( 2 min )
    Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation
    arXiv:2311.15100v2 Announce Type: replace-cross Abstract: In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, which makes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoretically grounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance by incorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.  ( 3 min )
    BEND: Benchmarking DNA Language Models on biologically meaningful tasks
    arXiv:2311.12570v3 Announce Type: replace-cross Abstract: The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.  ( 2 min )
    EPIM: Efficient Processing-In-Memory Accelerators based on Epitome
    arXiv:2311.07620v2 Announce Type: replace-cross Abstract: The utilization of large-scale neural networks on Processing-In-Memory (PIM) accelerators encounters challenges due to constrained on-chip memory capacity. To tackle this issue, current works explore model compression algorithms to reduce the size of Convolutional Neural Networks (CNNs). Most of these algorithms either aim to represent neural operators with reduced-size parameters (e.g., quantization) or search for the best combinations of neural operators (e.g., neural architecture search). Designing neural operators to align with PIM accelerators' specifications is an area that warrants further study. In this paper, we introduce the Epitome, a lightweight neural operator offering convolution-like functionality, to craft memory-efficient CNN operators for PIM accelerators (EPIM). On the software side, we evaluate epitomes' latency and energy on PIM accelerators and introduce a PIM-aware layer-wise design method to enhance their hardware efficiency. We apply epitome-aware quantization to further reduce the size of epitomes. On the hardware side, we modify the datapath of current PIM accelerators to accommodate epitomes and implement a feature map reuse technique to reduce computation cost. Experimental results reveal that our 3-bit quantized EPIM-ResNet50 attains 71.59% top-1 accuracy on ImageNet, reducing crossbar areas by 30.65 times. EPIM surpasses the state-of-the-art pruning methods on PIM.  ( 2 min )
    Fair Supervised Learning with A Simple Random Sampler of Sensitive Attributes
    arXiv:2311.05866v2 Announce Type: replace-cross Abstract: As the data-driven decision process becomes dominating for industrial applications, fairness-aware machine learning arouses great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty is able to handle versatile formats of the sensitive attributes, so it is more extensively applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize estimation errors and loss of utility of the proposed neural-penalized risk minimization problem.  ( 2 min )
    LRM: Large Reconstruction Model for Single Image to 3D
    arXiv:2311.04400v2 Announce Type: replace-cross Abstract: We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs, including real-world in-the-wild captures and images created by generative models. Video demos and interactable 3D meshes can be found on our LRM project webpage: https://yiconghong.me/LRM.  ( 3 min )
    Fast Minimization of Expected Logarithmic Loss via Stochastic Dual Averaging
    arXiv:2311.02557v2 Announce Type: replace-cross Abstract: Consider the problem of minimizing an expected logarithmic loss over either the probability simplex or the set of quantum density matrices. This problem includes tasks such as solving the Poisson inverse problem, computing the maximum-likelihood estimate for quantum state tomography, and approximating positive semi-definite matrix permanents with the currently tightest approximation ratio. Although the optimization problem is convex, standard iteration complexity guarantees for first-order methods do not directly apply due to the absence of Lipschitz continuity and smoothness in the loss function. In this work, we propose a stochastic first-order algorithm named $B$-sample stochastic dual averaging with the logarithmic barrier. For the Poisson inverse problem, our algorithm attains an $\varepsilon$-optimal solution in $\smash{\tilde{O}}(d^2/\varepsilon^2)$ time, matching the state of the art, where $d$ denotes the dimension. When computing the maximum-likelihood estimate for quantum state tomography, our algorithm yields an $\varepsilon$-optimal solution in $\smash{\tilde{O}}(d^3/\varepsilon^2)$ time. This improves on the time complexities of existing stochastic first-order methods by a factor of $d^{\omega-2}$ and those of batch methods by a factor of $d^2$, where $\omega$ denotes the matrix multiplication exponent. Numerical experiments demonstrate that empirically, our algorithm outperforms existing methods with explicit complexity guarantees.  ( 2 min )
    Improved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set Conversion
    arXiv:2310.18554v2 Announce Type: replace-cross Abstract: Logistic bandit is a ubiquitous framework of modeling users' choices, e.g., click vs. no click for advertisement recommender system. We observe that the prior works overlook or neglect dependencies in $S \geq \lVert \theta_\star \rVert_2$, where $\theta_\star \in \mathbb{R}^d$ is the unknown parameter vector, which is particularly problematic when $S$ is large, e.g., $S \geq d$. In this work, we improve the dependency on $S$ via a novel approach called {\it regret-to-confidence set conversion (R2CS)}, which allows us to construct a convex confidence set based on only the \textit{existence} of an online learning algorithm with a regret guarantee. Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility and the dependence on other factors such as $d$ and $T$. We apply our new confidence set to the regret analyses of logistic bandits with a new martingale concentration step that circumvents an additional factor of $S$. We then extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS. While we applied R2CS to the (multinomial) logistic model, R2CS is a generic approach for developing confidence sets that can be used for various models, which can be of independent interest.  ( 3 min )
    Detecting Pretraining Data from Large Language Models
    arXiv:2310.16789v3 Announce Type: replace-cross Abstract: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.  ( 3 min )
    Disentangled Representation Learning with Large Language Models for Text-Attributed Graphs
    arXiv:2310.18152v4 Announce Type: replace-cross Abstract: Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs such as citation networks, e-commerce networks and social networks has attracted considerable attention in the web community. Recently, large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks. However, the existing works focus on harnessing the potential of LLMs solely relying on prompts to convey graph structure information to LLMs, thus suffering from insufficient understanding of the complex structural relationships within TAGs. To address this problem, in this paper we present the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model incorporates graph structure information through tailored disentangled graph neural network (GNN) layers, enabling LLMs to capture the intricate relationships hidden in text-attributed graphs from multiple structural factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing computational costs and allowing much more flexibility in combining with different LLM models. Experimental evaluations demonstrate the effectiveness of the proposed DGTL model on achieving superior or comparable performance over state-of-the-art baselines. Additionally, we also demonstrate that our DGTL model can offer natural language explanations for predictions, thereby significantly enhancing model interpretability.  ( 3 min )
    Boosting Data Analytics With Synthetic Volume Expansion
    arXiv:2310.17848v3 Announce Type: replace-cross Abstract: Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data. Regarding effectiveness, we present the Synthetic Data Generation for Analytics framework. This framework applies statistical approaches to high-quality synthetic data produced by generative models like tabular diffusion models, which, initially trained on raw data, benefit from insights from pertinent studies through transfer learning. A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize. This phenomenon, stemming from the challenge of accurately mirroring raw data distributions, highlights a "reflection point"-an ideal volume of synthetic data defined by specific error metrics. Through three case studies, sentiment analysis, predictive modeling of structured data, and inference in tabular data, we validate the superior performance of this framework compared to conventional approaches. On privacy, synthetic data imposes lower risks while supporting the differential privacy standard. These studies underscore synthetic data's untapped potential in redefining data science's landscape.  ( 3 min )
    From Posterior Sampling to Meaningful Diversity in Image Restoration
    arXiv:2310.16047v2 Announce Type: replace-cross Abstract: Image restoration problems are typically ill-posed in the sense that each degraded image can be restored in infinitely many valid ways. To accommodate this, many works generate a diverse set of outputs by attempting to randomly sample from the posterior distribution of natural images given the degraded input. Here we argue that this strategy is commonly of limited practical value because of the heavy tail of the posterior distribution. Consider for example inpainting a missing region of the sky in an image. Since there is a high probability that the missing region contains no object but clouds, any set of samples from the posterior would be entirely dominated by (practically identical) completions of sky. However, arguably, presenting users with only one clear sky completion, along with several alternative solutions such as airships, birds, and balloons, would better outline the set of possibilities. In this paper, we initiate the study of meaningfully diverse image restoration. We explore several post-processing approaches that can be combined with any diverse image restoration method to yield semantically meaningful diversity. Moreover, we propose a practical approach for allowing diffusion based image restoration methods to generate meaningfully diverse outputs, while incurring only negligent computational overhead. We conduct extensive user studies to analyze the proposed techniques, and find the strategy of reducing similarity between outputs to be significantly favorable over posterior sampling. Code and examples are available at https://noa-cohen.github.io/MeaningfulDiversityInIR.  ( 3 min )
    DeepFDR: A Deep Learning-based False Discovery Rate Control Method for Neuroimaging Data
    arXiv:2310.13349v3 Announce Type: replace-cross Abstract: Voxel-based multiple testing is widely used in neuroimaging data analysis. Traditional false discovery rate (FDR) control methods often ignore the spatial dependence among the voxel-based tests and thus suffer from substantial loss of testing power. While recent spatial FDR control methods have emerged, their validity and optimality remain questionable when handling the complex spatial dependencies of the brain. Concurrently, deep learning methods have revolutionized image segmentation, a task closely related to voxel-based multiple testing. In this paper, we propose DeepFDR, a novel spatial FDR control method that leverages unsupervised deep learning-based image segmentation to address the voxel-based multiple testing problem. Numerical studies, including comprehensive simulations and Alzheimer's disease FDG-PET image analysis, demonstrate DeepFDR's superiority over existing methods. DeepFDR not only excels in FDR control and effectively diminishes the false nondiscovery rate, but also boasts exceptional computational efficiency highly suited for tackling large-scale neuroimaging data.  ( 2 min )
    In-Context Pretraining: Language Modeling Beyond Document Boundaries
    arXiv:2310.10638v5 Announce Type: replace-cross Abstract: Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).  ( 3 min )
    CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
    arXiv:2310.13165v2 Announce Type: replace-cross Abstract: Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency. This paper introduces Cyclenet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate Cyclenet on unpaired I2I tasks of different granularities. Besides the scene and object level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that Cyclenet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. Cyclenet is a practical framework, which is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train. Project homepage: https://cyclenetweb.github.io/  ( 2 min )
    Communication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved Rates
    arXiv:2310.09804v2 Announce Type: replace-cross Abstract: Byzantine robustness is an essential feature of algorithms for certain distributed optimization problems, typically encountered in collaborative/federated learning. These problems are usually huge-scale, implying that communication compression is also imperative for their resolution. These factors have spurred recent algorithmic and theoretical developments in the literature of Byzantine-robust learning with compression. In this paper, we contribute to this research area in two main directions. First, we propose a new Byzantine-robust method with compression - Byz-DASHA-PAGE - and prove that the new method has better convergence rate (for non-convex and Polyak-Lojasiewicz smooth optimization problems), smaller neighborhood size in the heterogeneous case, and tolerates more Byzantine workers under over-parametrization than the previous method with SOTA theoretical convergence guarantees (Byz-VR-MARINA). Secondly, we develop the first Byzantine-robust method with communication compression and error feedback - Byz-EF21 - along with its bidirectional compression version - Byz-EF21-BC - and derive the convergence rates for these methods for non-convex and Polyak-Lojasiewicz smooth case. We test the proposed methods and illustrate our theoretical findings in the numerical experiments.  ( 2 min )
    Training-free Linear Image Inverses via Flows
    arXiv:2310.04432v2 Announce Type: replace-cross Abstract: Solving inverse problems without any training involves using a pretrained generative model and making appropriate modifications to the generation process to avoid finetuning of the generative model. While recent methods have explored the use of diffusion models, they still require the manual tuning of many hyperparameters for different inverse problems. In this work, we propose a training-free method for solving linear inverse problems by using pretrained flow models, leveraging the simplicity and efficiency of Flow Matching models, using theoretically-justified weighting schemes, and thereby significantly reducing the amount of manual tuning. In particular, we draw inspiration from two main sources: adopting prior gradient correction methods to the flow regime, and a solver scheme based on conditional Optimal Transport paths. As pretrained diffusion models are widely accessible, we also show how to practically adapt diffusion models for our method. Empirically, our approach requires no problem-specific tuning across an extensive suite of noisy linear inverse problems on high-dimensional datasets, ImageNet-64/128 and AFHQ-256, and we observe that our flow-based method for solving inverse problems improves upon closely-related diffusion-based methods in most settings.  ( 2 min )
    AutoVP: An Automated Visual Prompting Framework and Benchmark
    arXiv:2310.08381v2 Announce Type: replace-cross Abstract: Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, having up to 6.7% improvement in accuracy; and attains a maximum performance increase of 27.5% compared to linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: serving both as an efficient tool for hyperparameter tuning on VP design choices, and as a comprehensive benchmark that can reasonably be expected to accelerate VP's development. The source code is available at https://github.com/IBM/AutoVP.  ( 2 min )
    Sampling via Gradient Flows in the Space of Probability Measures
    arXiv:2310.03597v3 Announce Type: replace-cross Abstract: Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.  ( 3 min )
    Decoding Human Activities: Analyzing Wearable Accelerometer and Gyroscope Data for Activity Recognition
    arXiv:2310.02011v2 Announce Type: replace-cross Abstract: A person's movement or relative positioning effectively generates raw electrical signals that can be read by computing machines to apply various manipulative techniques for the classification of different human activities. In this paper, a stratified multi-structural approach based on a Residual network ensembled with Residual MobileNet is proposed, termed as FusionActNet. The proposed method involves using carefully designed Residual blocks for classifying the static and dynamic activities separately because they have clear and distinct characteristics that set them apart. These networks are trained independently, resulting in two specialized and highly accurate models. These models excel at recognizing activities within a specific superclass by taking advantage of the unique algorithmic benefits of architectural adjustments. Afterward, these two ResNets are passed through a weighted ensemble-based Residual MobileNet. Subsequently, this ensemble proficiently discriminates between a specific static and a specific dynamic activity, which were previously identified based on their distinct feature characteristics in the earlier stage. The proposed model is evaluated using two publicly accessible datasets; namely, UCI HAR and Motion-Sense. Therein, it successfully handled the highly confusing cases of data overlap. Therefore, the proposed approach achieves a state-of-the-art accuracy of 96.71% and 95.35% in the UCI HAR and Motion-Sense datasets respectively.  ( 3 min )
    Segment Anything Model is a Good Teacher for Local Feature Learning
    arXiv:2309.16992v2 Announce Type: replace-cross Abstract: Local feature detection and description play an important role in many computer vision tasks, which are designed to detect and describe keypoints in "any scene" and "any downstream task". Data-driven local feature learning methods need to rely on pixel-level correspondence for training, which is challenging to acquire at scale, thus hindering further improvements in performance. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a fundamental model trained on 11 million images, as a teacher to guide local feature learning and thus inspire higher performance on limited datasets. To do so, first, we construct an auxiliary task of Pixel Semantic Relational Distillation (PSRD), which distillates feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat's performance on various tasks such as image matching on HPatches, and long-term visual localization on Aachen Day-Night showcases its superiority over previous local features. The release code is available at https://github.com/vignywang/SAMFeat.  ( 3 min )
    Visual Political Communication in a Polarized Society: A Longitudinal Study of Brazilian Presidential Elections on Instagram
    arXiv:2310.00349v2 Announce Type: replace-cross Abstract: In today's digital age, images have emerged as powerful tools for politicians to engage with their voters on social media platforms. Visual content possesses a unique emotional appeal that often leads to increased user engagement. However, research on visual communication remains relatively limited, particularly in the Global South. This study aims to bridge this gap by employing a combination of computational methods and qualitative approach to investigate the visual communication strategies employed in a dataset of 11,263 Instagram posts by 19 Brazilian presidential candidates in 2018 and 2022 national elections. Through two studies, we observed consistent patterns across these candidates on their use of visual political communication. Notably, we identify a prevalence of celebratory and positively toned images. They also exhibit a strong sense of personalization, portraying candidates connected with their voters on a more emotional level. Our research also uncovers unique contextual nuances specific to the Brazilian political landscape. We note a substantial presence of screenshots from news websites and other social media platforms. Furthermore, text-edited images with portrayals emerge as a prominent feature. In light of these results, we engage in a discussion regarding the implications for the broader field of visual political communication. This article serves as a testament to the pivotal role that Instagram has played in shaping the narrative of two fiercely polarized Brazilian elections, casting a revealing light on the ever-evolving dynamics of visual political communication in the digital age. Finally, we propose avenues for future research in the realm of visual political communication.  ( 3 min )
    Overcoming the Barrier of Orbital-Free Density Functional Theory for Molecular Systems Using Deep Learning
    arXiv:2309.16578v2 Announce Type: replace-cross Abstract: Orbital-free density functional theory (OFDFT) is a quantum chemistry formulation that has a lower cost scaling than the prevailing Kohn-Sham DFT, which is increasingly desired for contemporary molecular research. However, its accuracy is limited by the kinetic energy density functional, which is notoriously hard to approximate for non-periodic molecular systems. Here we propose M-OFDFT, an OFDFT approach capable of solving molecular systems using a deep learning functional model. We build the essential non-locality into the model, which is made affordable by the concise density representation as expansion coefficients under an atomic basis. With techniques to address unconventional learning challenges therein, M-OFDFT achieves a comparable accuracy with Kohn-Sham DFT on a wide range of molecules untouched by OFDFT before. More attractively, M-OFDFT extrapolates well to molecules much larger than those seen in training, which unleashes the appealing scaling of OFDFT for studying large molecules including proteins, representing an advancement of the accuracy-efficiency trade-off frontier in quantum chemistry.  ( 3 min )
    Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation
    arXiv:2309.14859v2 Announce Type: replace-cross Abstract: Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) [https://github.com/KohakuBlueleaf/LyCORIS], an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.  ( 2 min )
    Discovering Dynamic Effective Connectome of Brain with Bayesian Dynamic DAG Learning
    arXiv:2309.07080v3 Announce Type: replace-cross Abstract: Understanding the complex mechanisms of the brain can be unraveled by extracting the Dynamic Effective Connectome (DEC). Recently, score-based Directed Acyclic Graph (DAG) discovery methods have shown significant improvements in extracting the causal structure and inferring effective connectivity. However, learning DEC through these methods still faces two main challenges: one with the fundamental impotence of high-dimensional dynamic DAG discovery methods and the other with the low quality of fMRI data. In this paper, we introduce Bayesian Dynamic DAG learning with M-matrices Acyclicity characterization (BDyMA) method to address the challenges in discovering DEC. The presented dynamic causal model enables us to discover direct feedback loop edges as well. Leveraging an unconstrained framework in the BDyMA method leads to more accurate results in detecting high-dimensional networks, achieving sparser outcomes, making it particularly suitable for extracting DEC. Additionally, the score function of the BDyMA method allows the incorporation of prior knowledge into the process of dynamic causal discovery which further enhances the accuracy of results. Comprehensive simulations on synthetic data and experiments on Human Connectome Project (HCP) data demonstrate that our method can handle both of the two main challenges, yielding more accurate and reliable DEC compared to state-of-the-art and traditional methods. Additionally, we investigate the trustworthiness of DTI data as prior knowledge for DEC discovery and show the improvements in DEC discovery when the DTI data is incorporated into the process.  ( 3 min )
    Doubly Robust Proximal Causal Learning for Continuous Treatments
    arXiv:2309.12819v3 Announce Type: replace-cross Abstract: Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Equipped with its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications.  ( 2 min )
    Evaluating the Ebb and Flow: An In-depth Analysis of Question-Answering Trends across Diverse Platforms
    arXiv:2309.05961v4 Announce Type: replace-cross Abstract: Community Question Answering (CQA) platforms steadily gain popularity as they provide users with fast responses to their queries. The swiftness of these responses is contingent on a mixture of query-specific and user-related elements. This paper scrutinizes these contributing factors within the context of six highly popular CQA platforms, identified through their standout answering speed. Our investigation reveals a correlation between the time taken to yield the first response to a question and several variables: the metadata, the formulation of the questions, and the level of interaction among users. Additionally, by employing conventional machine learning models to analyze these metadata and patterns of user interaction, we endeavor to predict which queries will receive their initial responses promptly.  ( 2 min )
    Index-aware learning of circuits
    arXiv:2309.00958v3 Announce Type: replace-cross Abstract: Electrical circuits are present in a variety of technologies, making their design an important part of computer aided engineering. The growing number of parameters that affect the final design leads to a need for new approaches to quantify their impact. Machine learning may play a key role in this regard, however current approaches often make suboptimal use of existing knowledge about the system at hand. In terms of circuits, their description via modified nodal analysis is well-understood. This particular formulation leads to systems of differential-algebraic equations (DAEs) which bring with them a number of peculiarities, e.g. hidden constraints that the solution needs to fulfill. We use the recently introduced dissection index that can decouple a given system of DAEs into ordinary differential equations, only depending on differential variables, and purely algebraic equations, that describe the relations between differential and algebraic variables. The idea is to then only learn the differential variables and reconstruct the algebraic ones using the relations from the decoupling. This approach guarantees that the algebraic constraints are fulfilled up to the accuracy of the nonlinear system solver, and it may also reduce the learning effort as only the differential variables need to be learned.  ( 2 min )
    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
    arXiv:2309.03883v2 Announce Type: replace-cross Abstract: Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLMs has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves the truthfulness across multiple choices tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.  ( 2 min )
    A Unifying Variational Framework for Gaussian Process Motion Planning
    arXiv:2309.00854v2 Announce Type: replace-cross Abstract: To control how a robot moves, motion planning algorithms must compute paths in high-dimensional state spaces while accounting for physical constraints related to motors and joints, generating smooth and stable motions, avoiding obstacles, and preventing collisions. A motion planning algorithm must therefore balance competing demands, and should ideally incorporate uncertainty to handle noise, model errors, and facilitate deployment in complex environments. To address these issues, we introduce a framework for robot motion planning based on variational Gaussian processes, which unifies and generalizes various probabilistic-inference-based motion planning algorithms, and connects them with optimization-based planners. Our framework provides a principled and flexible way to incorporate equality-based, inequality-based, and soft motion-planning constraints during end-to-end training, is straightforward to implement, and provides both interval-based and Monte-Carlo-based uncertainty estimates. We conduct experiments using different environments and robots, comparing against baseline approaches based on the feasibility of the planned paths, and obstacle avoidance quality. Results show that our proposed approach yields a good balance between success rates and path quality.  ( 2 min )
    MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification
    arXiv:2308.13139v2 Announce Type: replace-cross Abstract: The eXtreme Multi-label text Classification(XMC) refers to training a classifier that assigns a text sample with relevant labels from an extremely large-scale label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that the label embeddings generated from the sparse Term Frequency-Inverse Document Frequency(TF-IDF) features have several limitations. We thus propose label2vec to effectively train the semantic dense label embeddings by the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. In fine-tuning the pre-trained encoder Transformer, we formulate the multi-label text classification as a text-label matching problem in a bipartite graph. We then extract the dense text representations from the fine-tuned Transformer. Besides the fine-tuned dense text embeddings, we also extract the static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained by utilizing the sparse TF-IDF features, the fine-tuned dense text representations and static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. As for the speed, MatchXML outperforms the competing methods on all the six datasets. Our source code is publicly available at https://github.com/huiyegit/MatchXML.  ( 2 min )
    Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
    arXiv:2307.03410v2 Announce Type: replace-cross Abstract: Feature-distributed data, referred to data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.  ( 2 min )
    Data-Driven Adversarial Online Control for Unknown Linear Systems
    arXiv:2308.08138v2 Announce Type: replace-cross Abstract: We consider the online control problem with an unknown linear dynamical system in the presence of adversarial perturbations and adversarial convex loss functions. Although the problem is widely studied in model-based control, it remains unclear whether data-driven approaches, which bypass the system identification step, can solve the problem. In this work, we present a novel data-driven online adaptive control algorithm to address this online control problem. Our algorithm leverages the behavioral systems theory to learn a non-parametric system representation and then adopts a perturbation-based controller updated by online gradient descent. We prove that our algorithm guarantees an $\tmO(T^{2/3})$ regret bound with high probability, which matches the best-known regret bound for this problem. Furthermore, we extend our algorithm and performance guarantee to the cases with output feedback.  ( 2 min )
    DeepOnto: A Python Package for Ontology Engineering with Deep Learning
    arXiv:2307.03067v2 Announce Type: replace-cross Abstract: Integrating deep learning techniques, particularly language models (LMs), with knowledge representation techniques like ontologies has raised widespread attention, urging the need of a platform that supports both paradigms. Although packages such as OWL API and Jena offer robust support for basic ontology processing features, they lack the capability to transform various types of information within ontologies into formats suitable for downstream deep learning-based applications. Moreover, widely-used ontology APIs are primarily Java-based while deep learning frameworks like PyTorch and Tensorflow are mainly for Python programming. To address the needs, we present DeepOnto, a Python package designed for ontology engineering with deep learning. The package encompasses a core ontology processing module founded on the widely-recognised and reliable OWL API, encapsulating its fundamental features in a more "Pythonic" manner and extending its capabilities to incorporate other essential components including reasoning, verbalisation, normalisation, taxonomy, projection, and more. Building on this module, DeepOnto offers a suite of tools, resources, and algorithms that support various ontology engineering tasks, such as ontology alignment and completion, by harnessing deep learning methods, primarily pre-trained LMs. In this paper, we also demonstrate the practical utility of DeepOnto through two use-cases: the Digital Health Coaching in Samsung Research UK and the Bio-ML track of the Ontology Alignment Evaluation Initiative (OAEI).  ( 3 min )
    On the Expected Size of Conformal Prediction Sets
    arXiv:2306.07254v3 Announce Type: replace-cross Abstract: While conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.  ( 2 min )
    Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
    arXiv:2306.12041v2 Announce Type: replace-cross Abstract: We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects. Second, we integrate a teacher decoder and a student decoder into our architecture, leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third, we generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model, as demonstrated by the extensive experiments carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy, obtaining competitive AUC scores, while processing 1655 FPS. Hence, our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design. Our code is freely available at: https://github.com/ristea/aed-mae.  ( 2 min )
    Extraction and Recovery of Spatio-Temporal Structure in Latent Dynamics Alignment with Diffusion Models
    arXiv:2306.06138v2 Announce Type: replace-cross Abstract: In the field of behavior-related brain computation, it is necessary to align raw neural signals against the drastic domain shift among them. A foundational framework within neuroscience research posits that trial-based neural population activities rely on low-dimensional latent dynamics, thus focusing on the latter greatly facilitates the alignment procedure. Despite this field's progress, existing methods ignore the intrinsic spatio-temporal structure during the alignment phase. Hence, their solutions usually lead to poor quality in latent dynamics structures and overall performance. To tackle this problem, we propose an alignment method ERDiff, which leverages the expressivity of the diffusion model to preserve the spatio-temporal structure of latent dynamics. Specifically, the latent dynamics structures of the source domain are first extracted by a diffusion model. Then, under the guidance of this diffusion model, such structures are well-recovered through a maximum likelihood alignment procedure in the target domain. We first demonstrate the effectiveness of our proposed method on a synthetic dataset. Then, when applied to neural recordings from the non-human primate motor cortex, under both cross-day and inter-subject settings, our method consistently manifests its capability of preserving the spatiotemporal structure of latent dynamics and outperforms existing approaches in alignment goodness-of-fit and neural decoding performance.  ( 3 min )
    Machine learning reveals features of spinon Fermi surface
    arXiv:2306.03143v2 Announce Type: replace-cross Abstract: With rapid progress in simulation of strongly interacting quantum Hamiltonians, the challenge in characterizing unknown phases becomes a bottleneck for scientific progress. We demonstrate that a Quantum-Classical hybrid approach (QuCl) of mining sampled projective snapshots with interpretable classical machine learning can unveil signatures of seemingly featureless quantum states. The Kitaev-Heisenberg model on a honeycomb lattice under external magnetic field presents an ideal system to test QuCl, where simulations have found an intermediate gapless phase (IGP) sandwiched between known phases, launching a debate over its elusive nature. We use the correlator convolutional neural network, trained on labeled projective snapshots, in conjunction with regularization path analysis to identify signatures of phases. We show that QuCl reproduces known features of established phases. Significantly, we also identify a signature of the IGP in the spin channel perpendicular to the field direction, which we interpret as a signature of Friedel oscillations of gapless spinons forming a Fermi surface. Our predictions can guide future experimental searches for spin liquids.  ( 2 min )
    Sharpened Lazy Incremental Quasi-Newton Method
    arXiv:2305.17283v2 Announce Type: replace-cross Abstract: The problem of minimizing the sum of $n$ functions in $d$ dimensions is ubiquitous in machine learning and statistics. In many applications where the number of observations $n$ is large, it is necessary to use incremental or stochastic methods, as their per-iteration cost is independent of $n$. Of these, Quasi-Newton (QN) methods strike a balance between the per-iteration cost and the convergence rate. Specifically, they exhibit a superlinear rate with $O(d^2)$ cost in contrast to the linear rate of first-order methods with $O(d)$ cost and the quadratic rate of second-order methods with $O(d^3)$ cost. However, existing incremental methods have notable shortcomings: Incremental Quasi-Newton (IQN) only exhibits asymptotic superlinear convergence. In contrast, Incremental Greedy BFGS (IGS) offers explicit superlinear convergence but suffers from poor empirical performance and has a per-iteration cost of $O(d^3)$. To address these issues, we introduce the Sharpened Lazy Incremental Quasi-Newton Method (SLIQN) that achieves the best of both worlds: an explicit superlinear convergence rate, and superior empirical performance at a per-iteration $O(d^2)$ cost. SLIQN features two key changes: first, it incorporates a hybrid strategy of using both classic and greedy BFGS updates, allowing it to empirically outperform both IQN and IGS. Second, it employs a clever constant multiplicative factor along with a lazy propagation strategy, which enables it to have a cost of $O(d^2)$. Additionally, our experiments demonstrate the superiority of SLIQN over other incremental and stochastic Quasi-Newton variants and establish its competitiveness with second-order incremental methods.  ( 3 min )
    HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance
    arXiv:2305.18766v4 Announce Type: replace-cross Abstract: The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process.  ( 3 min )
    Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression
    arXiv:2305.16534v2 Announce Type: replace-cross Abstract: This paper introduces a novel theoretical framework for the analysis of vector-valued neural networks through the development of vector-valued variation spaces, a new class of reproducing kernel Banach spaces. These spaces emerge from studying the regularization effect of weight decay in training networks with activations like the rectified linear unit (ReLU). This framework offers a deeper understanding of multi-output networks and their function-space characteristics. A key contribution of this work is the development of a representer theorem for the vector-valued variation spaces. This representer theorem establishes that shallow vector-valued neural networks are the solutions to data-fitting problems over these infinite-dimensional spaces, where the network widths are bounded by the square of the number of training data. This observation reveals that the norm associated with these vector-valued variation spaces encourages the learning of features that are useful for multiple tasks, shedding new light on multi-task learning with neural networks. Finally, this paper develops a connection between weight-decay regularization and the multi-task lasso problem. This connection leads to novel bounds for layer widths in deep networks that depend on the intrinsic dimensions of the training data representations. This insight not only deepens the understanding of the deep network architectural requirements, but also yields a simple convex optimization method for deep neural network compression. The performance of this compression procedure is evaluated on various architectures.  ( 3 min )
    Score Operator Newton transport
    arXiv:2305.09792v3 Announce Type: replace-cross Abstract: We propose a new approach for sampling and Bayesian computation that uses the score of the target distribution to construct a transport from a given reference distribution to the target. Our approach is an infinite-dimensional Newton method, involving a linear PDE, for finding a zero of a ``score-residual'' operator. We prove sufficient conditions for convergence to a valid transport map. Our Newton iterates can be computed by exploiting fast solvers for elliptic PDEs, resulting in new algorithms for Bayesian inference and other sampling tasks. We identify elementary settings where score-operator Newton transport achieves fast convergence while avoiding mode collapse.  ( 2 min )
    DeepTextMark: A Deep Learning-Driven Text Watermarking Approach for Identifying Large Language Model Generated Text
    arXiv:2305.05773v2 Announce Type: replace-cross Abstract: The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of text generators. With the potential for misuse escalating, the importance of discerning whether texts are human-authored or generated by LLMs has become paramount. Several preceding studies have ventured to address this challenge by employing binary classifiers to differentiate between human-written and LLM-generated text. Nevertheless, the reliability of these classifiers has been subject to question. Given that consequential decisions may hinge on the outcome of such classification, it is imperative that text source detection is of high caliber. In light of this, the present paper introduces DeepTextMark, a deep learning-driven text watermarking methodology devised for text source identification. By leveraging Word2Vec and Sentence Encoding for watermark insertion, alongside a transformer-based classifier for watermark detection, DeepTextMark epitomizes a blend of blindness, robustness, imperceptibility, and reliability. As elaborated within the paper, these attributes are crucial for universal text source detection, with a particular emphasis in this paper on text produced by LLMs. DeepTextMark offers a viable "add-on" solution to prevailing text generation frameworks, requiring no direct access or alterations to the underlying text generation mechanism. Experimental evaluations underscore the high imperceptibility, elevated detection accuracy, augmented robustness, reliability, and swift execution of DeepTextMark.  ( 3 min )
    Assessment of few-hits machine learning classification algorithms for low energy physics in liquid argon detectors
    arXiv:2305.09744v2 Announce Type: replace-cross Abstract: The physics potential of massive liquid argon TPCs in the low-energy regime is still to be fully reaped because few-hits events encode information that can hardly be exploited by conventional classification algorithms. Machine learning (ML) techniques give their best in these types of classification problems. In this paper, we evaluate their performance against conventional (deterministic) algorithms. We demonstrate that both Convolutional Neural Networks (CNN) and Transformer-Encoder methods outperform deterministic algorithms in one of the most challenging classification problems of low-energy physics (single- versus double-beta events). We discuss the advantages and pitfalls of Transformer-Encoder methods versus CNN and employ these methods to optimize the detector parameters, with an emphasis on the DUNE Phase II detectors ("Module of Opportunity").  ( 2 min )
    Using Large Language Models to Generate JUnit Tests: An Empirical Study
    arXiv:2305.00418v4 Announce Type: replace-cross Abstract: A code generation model generates code by taking a prompt from a code comment, existing code, or a combination of both. Although code generation models (e.g., GitHub Copilot) are increasingly being adopted in practice, it is unclear whether they can successfully be used for unit test generation without fine-tuning for a strongly typed language like Java. To fill this gap, we investigated how well three models (Codex, GPT-3.5-Turbo, and StarCoder) can generate unit tests. We used two benchmarks (HumanEval and Evosuite SF110) to investigate the effect of context generation on the unit test generation process. We evaluated the models based on compilation rates, test correctness, test coverage, and test smells. We found that the Codex model achieved above 80% coverage for the HumanEval dataset, but no model had more than 2% coverage for the EvoSuite SF110 benchmark. The generated tests also suffered from test smells, such as Duplicated Asserts and Empty Tests.  ( 2 min )
    A Stochastic-Gradient-based Interior-Point Algorithm for Solving Smooth Bound-Constrained Optimization Problems
    arXiv:2304.14907v2 Announce Type: replace-cross Abstract: A stochastic-gradient-based interior-point algorithm for minimizing a continuously differentiable objective function (that may be nonconvex) subject to bound constraints is presented, analyzed, and demonstrated through experimental results. The algorithm is unique from other interior-point methods for solving smooth \edit{nonconvex} optimization problems since the search directions are computed using stochastic gradient estimates. It is also unique in its use of inner neighborhoods of the feasible region -- defined by a positive and vanishing neighborhood-parameter sequence -- in which the iterates are forced to remain. It is shown that with a careful balance between the barrier, step-size, and neighborhood sequences, the proposed algorithm satisfies convergence guarantees in both deterministic and stochastic settings. The results of numerical experiments show that in both settings the algorithm can outperform \edit{projection-based} methods.  ( 2 min )
    Application of Tensor Neural Networks to Pricing Bermudan Swaptions
    arXiv:2304.09750v2 Announce Type: replace-cross Abstract: The Cheyette model is a quasi-Gaussian volatility interest rate model widely used to price interest rate derivatives such as European and Bermudan Swaptions for which Monte Carlo simulation has become the industry standard. In low dimensions, these approaches provide accurate and robust prices for European Swaptions but, even in this computationally simple setting, they are known to underestimate the value of Bermudan Swaptions when using the state variables as regressors. This is mainly due to the use of a finite number of predetermined basis functions in the regression. Moreover, in high-dimensional settings, these approaches succumb to the Curse of Dimensionality. To address these issues, Deep-learning techniques have been used to solve the backward Stochastic Differential Equation associated with the value process for European and Bermudan Swaptions; however, these methods are constrained by training time and memory. To overcome these limitations, we propose leveraging Tensor Neural Networks as they can provide significant parameter savings while attaining the same accuracy as classical Dense Neural Networks. In this paper we rigorously benchmark the performance of Tensor Neural Networks and Dense Neural Networks for pricing European and Bermudan Swaptions, and we show that Tensor Neural Networks can be trained faster than Dense Neural Networks and provide more accurate and robust prices than their Dense counterparts.  ( 3 min )
    Language-Driven Anchors for Zero-Shot Adversarial Robustness
    arXiv:2301.13096v3 Announce Type: replace-cross Abstract: Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous researches mainly focus on improving adversarial robustness in the fully supervised setting, leaving the challenging domain of zero-shot adversarial robustness an open question. In this work, we investigate this domain by leveraging the recent advances in large vision-language models, such as CLIP, to introduce zero-shot adversarial robustness to DNNs. We propose LAAT, a Language-driven, Anchor-based Adversarial Training strategy. LAAT utilizes the features of a text encoder for each category as fixed anchors (normalized feature embeddings) for each category, which are then employed for adversarial training. By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model on novel categories. However, naively using text encoders leads to poor results. Through analysis, we identified the issue to be the high cosine similarity between text encoders. We then design an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrated that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by large-scale multimodal models, especially when labeled data is unavailable during training.  ( 3 min )
    Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding
    arXiv:2301.03765v2 Announce Type: replace-cross Abstract: Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks based on 5 widely used pretrained language models and find it particularly superior for models with few parameters or long input.  ( 3 min )
    Visual CPG-RL: Learning Central Pattern Generators for Visually-Guided Quadruped Locomotion
    arXiv:2212.14400v2 Announce Type: replace-cross Abstract: We present a framework for learning visually-guided quadruped locomotion by integrating exteroceptive sensing and central pattern generators (CPGs), i.e. systems of coupled oscillators, into the deep reinforcement learning (DRL) framework. Through both exteroceptive and proprioceptive sensing, the agent learns to coordinate rhythmic behavior among different oscillators to track velocity commands, while at the same time override these commands to avoid collisions with the environment. We investigate several open robotics and neuroscience questions: 1) What is the role of explicit interoscillator couplings between oscillators, and can such coupling improve sim-to-real transfer for navigation robustness? 2) What are the effects of using a memory-enabled vs. a memory-free policy network with respect to robustness, energy-efficiency, and tracking performance in sim-to-real navigation tasks? 3) How do animals manage to tolerate high sensorimotor delays, yet still produce smooth and robust gaits? To answer these questions, we train our perceptive locomotion policies in simulation and perform sim-to-real transfers to the Unitree Go1 quadruped, where we observe robust navigation in a variety of scenarios. Our results show that the CPG, explicit interoscillator couplings, and memory-enabled policy representations are all beneficial for energy efficiency, robustness to noise and sensory delays of 90 ms, and tracking performance for successful sim-to-real transfer for navigation tasks. Video results can be found at https://youtu.be/wpsbSMzIwgM.  ( 3 min )
    CAPE: Corrective Actions from Precondition Errors using Large Language Models
    arXiv:2211.09935v3 Announce Type: replace-cross Abstract: Extracting commonsense knowledge from a large language model (LLM) offers a path to designing intelligent robots. Existing approaches that leverage LLMs for planning are unable to recover when an action fails and often resort to retrying failed actions, without resolving the error's underlying cause. We propose a novel approach (CAPE) that attempts to propose corrective actions to resolve precondition errors during planning. CAPE improves the quality of generated plans by leveraging few-shot reasoning from action preconditions. Our approach enables embodied agents to execute more tasks than baseline methods while ensuring semantic correctness and minimizing re-prompting. In VirtualHome, CAPE generates executable plans while improving a human-annotated plan correctness metric from 28.89% to 49.63% over SayCan. Our improvements transfer to a Boston Dynamics Spot robot initialized with a set of skills (specified in language) and associated preconditions, where CAPE improves the correctness metric of the executed task plans by 76.49% compared to SayCan. Our approach enables the robot to follow natural language commands and robustly recover from failures, which baseline approaches largely cannot resolve or address inefficiently.  ( 3 min )
    Quantum-Inspired Tensor Neural Networks for Option Pricing
    arXiv:2212.14076v2 Announce Type: replace-cross Abstract: Recent advances in deep learning have enabled us to address the curse of dimensionality (COD) by solving problems in higher dimensions. A subset of such approaches of addressing the COD has led us to solving high-dimensional PDEs. This has resulted in opening doors to solving a variety of real-world problems ranging from mathematical finance to stochastic control for industrial applications. Although feasible, these deep learning methods are still constrained by training time and memory. Tackling these shortcomings, Tensor Neural Networks (TNN) demonstrate that they can provide significant parameter savings while attaining the same accuracy as compared to the classical Dense Neural Network (DNN). In addition, we also show how TNN can be trained faster than DNN for the same accuracy. Besides TNN, we also introduce Tensor Network Initializer (TNN Init), a weight initialization scheme that leads to faster convergence with smaller variance for an equivalent parameter count as compared to a DNN. We benchmark TNN and TNN Init by applying them to solve the parabolic PDE associated with the Heston model, which is widely used in financial pricing theory.  ( 3 min )
    Machine Learning-Powered Course Allocation
    arXiv:2210.00954v3 Announce Type: replace-cross Abstract: We study the course allocation problem, where universities assign course schedules to students. The current state-of-the-art mechanism, Course Match, has one major shortcoming: students make significant mistakes when reporting their preferences, which negatively affects welfare and fairness. To address this issue, we introduce a new mechanism, Machine Learning-powered Course Match (MLCM). At the core of MLCM is a machine learning-powered preference elicitation module that iteratively asks personalized pairwise comparison queries to alleviate students' reporting mistakes. Extensive computational experiments, grounded in real-world data, demonstrate that MLCM, with only ten comparison queries, significantly increases both average and minimum student utility by 7%-11% and 17%-29%, respectively. Finally, we highlight MLCM's robustness to changes in the environment and show how our design minimizes the risk of upgrading to MLCM while making the upgrade process simple for universities and seamless for their students.  ( 2 min )
    Computational Complexity Evaluation of Neural Network Applications in Signal Processing
    arXiv:2206.12191v2 Announce Type: replace-cross Abstract: In this paper, we provide a systematic approach for assessing and comparing the computational complexity of neural network layers in digital signal processing. We provide and link four software-to-hardware complexity measures, defining how the different complexity metrics relate to the layers' hyper-parameters. This paper explains how to compute these four metrics for feed-forward and recurrent layers, and defines in which case we ought to use a particular metric depending on whether we characterize a more soft- or hardware-oriented application. One of the four metrics, called `the number of additions and bit shifts (NABS)', is newly introduced for heterogeneous quantization. NABS characterizes the impact of not only the bitwidth used in the operation but also the type of quantization used in the arithmetical operations. We intend this work to serve as a baseline for the different levels (purposes) of complexity estimation related to the neural networks' application in real-time digital signal processing, aiming at unifying the computational complexity estimation.  ( 2 min )
    Analyzing Fairness in Deepfake Detection With Massively Annotated Databases
    arXiv:2208.05845v4 Announce Type: replace-cross Abstract: In recent years, image and video manipulations with Deepfake have become a severe concern for security and society. Many detection models and datasets have been proposed to detect Deepfake data reliably. However, there is an increased concern that these models and training databases might be biased and, thus, cause Deepfake detectors to fail. In this work, we investigate factors causing biased detection in public Deepfake datasets by (a) creating large-scale demographic and non-demographic attribute annotations with 47 different attributes for five popular Deepfake datasets and (b) comprehensively analysing attributes resulting in AI-bias of three state-of-the-art Deepfake detection backbone models on these datasets. The analysis shows how various attributes influence a large variety of distinctive attributes (from over 65M labels) on the detection performance which includes demographic (age, gender, ethnicity) and non-demographic (hair, skin, accessories, etc.) attributes. The results examined datasets show limited diversity and, more importantly, show that the utilised Deepfake detection backbone models are strongly affected by investigated attributes making them not fair across attributes. The Deepfake detection backbone methods trained on such imbalanced/biased datasets result in incorrect detection results leading to generalisability, fairness, and security issues. Our findings and annotated datasets will guide future research to evaluate and mitigate bias in Deepfake detection techniques. The annotated datasets and the corresponding code are publicly available.  ( 3 min )
    Uncertainty-aware Pseudo-label Selection for Positive-Unlabeled Learning
    arXiv:2201.13192v3 Announce Type: replace-cross Abstract: Positive-unlabeled learning (PUL) aims at learning a binary classifier from only positive and unlabeled training data. Even though real-world applications often involve imbalanced datasets where the majority of examples belong to one class, most contemporary approaches to PUL do not investigate performance in this setting, thus severely limiting their applicability in practice. In this work, we thus propose to tackle the issues of imbalanced datasets and model calibration in a PUL setting through an uncertainty-aware pseudo-labeling procedure (PUUPL): by boosting the signal from the minority class, pseudo-labeling expands the labeled dataset with new samples from the unlabeled set, while explicit uncertainty quantification prevents the emergence of harmful confirmation bias leading to increased predictive performance. Within a series of experiments, PUUPL yields substantial performance gains in highly imbalanced settings while also showing strong performance in balanced PU scenarios across recent baselines. We furthermore provide ablations and sensitivity analyses to shed light on PUUPL's several ingredients. Finally, a real-world application with an imbalanced dataset confirms the advantage of our approach.  ( 2 min )
    Deep Reinforcement Learning with Spiking Q-learning
    arXiv:2201.09754v2 Announce Type: replace-cross Abstract: With the help of special neuromorphic hardware, spiking neural networks (SNNs) are expected to realize artificial intelligence (AI) with less energy consumption. It provides a promising energy-efficient way for realistic control tasks by combining SNNs with deep reinforcement learning (RL). There are only a few existing SNN-based RL methods at present. Most of them either lack generalization ability or employ Artificial Neural Networks (ANNs) to estimate value function in training. The former needs to tune numerous hyper-parameters for each scenario, and the latter limits the application of different types of RL algorithm and ignores the large energy consumption in training. To develop a robust spike-based RL method, we draw inspiration from non-spiking interneurons found in insects and propose the deep spiking Q-network (DSQN), using the membrane voltage of non-spiking neurons as the representation of Q-value, which can directly learn robust policies from high-dimensional sensory inputs using end-to-end RL. Experiments conducted on 17 Atari games demonstrate the DSQN is effective and even outperforms the ANN-based deep Q-network (DQN) in most games. Moreover, the experiments show superior learning stability and robustness to adversarial attacks of DSQN.  ( 2 min )
    Local Minima Structures in Gaussian Mixture Models
    arXiv:2009.13040v3 Announce Type: replace-cross Abstract: We investigate the landscape of the negative log-likelihood function of Gaussian Mixture Models (GMMs) with a general number of components in the population limit. As the objective function is non-convex, there can be multiple local minima that are not globally optimal, even for well-separated mixture models. Our study reveals that all local minima share a common structure that partially identifies the cluster centers (i.e., means of the Gaussian components) of the true location mixture. Specifically, each local minimum can be represented as a non-overlapping combination of two types of sub-configurations: fitting a single mean estimate to multiple Gaussian components or fitting multiple estimates to a single true component. These results apply to settings where the true mixture components satisfy a certain separation condition, and are valid even when the number of components is over- or under-specified. We also present a more fine-grained analysis for the setting of one-dimensional GMMs with three components, which provide sharper approximation error bounds with improved dependence on the separation.  ( 2 min )
    Nested Nonparametric Instrumental Variable Regression: Long Term, Mediated, and Time Varying Treatment Effects
    arXiv:2112.14249v3 Announce Type: replace-cross Abstract: Several causal parameters in short panel data models are scalar summaries of a function called a nested nonparametric instrumental variable regression (nested NPIV). Examples include long term, mediated, and time varying treatment effects identified using proxy variables. However, it appears that no prior estimators or guarantees for nested NPIV exist, preventing flexible estimation and inference for these causal parameters. A major challenge is compounding ill posedness due to the nested inverse problems. We analyze adversarial estimators of nested NPIV, and provide sufficient conditions for efficient inference on the causal parameter. Our nonasymptotic analysis has three salient features: (i) introducing techniques that limit how ill posedness compounds; (ii) accommodating neural networks, random forests, and reproducing kernel Hilbert spaces; and (iii) extending to causal functions, e.g. long term heterogeneous treatment effects. We measure long term heterogeneous treatment effects of Project STAR and mediated proximal treatment effects of the Job Corps.  ( 2 min )
    ENCORE: Ensemble Learning using Convolution Neural Machine Translation for Automatic Program Repair
    arXiv:1906.08691v2 Announce Type: replace-cross Abstract: Automated generate-and-validate (G&V) program repair techniques typically rely on hard-coded rules, only fix bugs following specific patterns, and are hard to adapt to different programming languages. We propose ENCORE, a new G&V technique, which uses ensemble learning on convolutional neural machine translation (NMT) models to automatically fix bugs in multiple programming languages. We take advantage of the randomness in hyper-parameter tuning to build multiple models that fix different bugs and combine them using ensemble learning. This new convolutional NMT approach outperforms the standard long short-term memory (LSTM) approach used in previous work, as it better captures both local and long-distance connections between tokens. Our evaluation on two popular benchmarks, Defects4J and QuixBugs, shows that ENCORE fixed 42 bugs, including 16 that have not been fixed by existing techniques. In addition, ENCORE is the first G&V repair technique to be applied to four popular programming languages (Java, C++, Python, and JavaScript), fixing a total of 67 bugs across five benchmarks.  ( 2 min )
    NeurAll: Towards a Unified Visual Perception Model for Automated Driving
    arXiv:1902.03589v3 Announce Type: replace-cross Abstract: Convolutional Neural Networks (CNNs) are successfully used for the important automotive visual perception tasks including object recognition, motion and depth estimation, visual SLAM, etc. However, these tasks are typically independently explored and modeled. In this paper, we propose a joint multi-task network design for learning several tasks simultaneously. Our main motivation is the computational efficiency achieved by sharing the expensive initial convolutional layers between all tasks. Indeed, the main bottleneck in automated driving systems is the limited processing power available on deployment hardware. There is also some evidence for other benefits in improving accuracy for some tasks and easing development effort. It also offers scalability to add more tasks leveraging existing features and achieving better generalization. We survey various CNN based solutions for visual perception tasks in automated driving. Then we propose a unified CNN model for the important tasks and discuss several advanced optimization and architecture design techniques to improve the baseline model. The paper is partly review and partly positional with demonstration of several preliminary results promising for future research. We first demonstrate results of multi-stream learning and auxiliary learning which are important ingredients to scale to a large multi-task model. Finally, we implement a two-stream three-task network which performs better in many cases compared to their corresponding single-task models, while maintaining network size.  ( 3 min )
    Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning
    arXiv:2402.04852v2 Announce Type: replace Abstract: In this study, we present aLLM4TS, an innovative framework that adapts Large Language Models (LLMs) for time-series representation learning. Central to our approach is that we reconceive time-series forecasting as a self-supervised, multi-patch prediction task, which, compared to traditional contrastive learning or mask-and-reconstruction methods, captures temporal dynamics in patch representations more effectively. Our strategy encompasses two-stage training: (i). a causal continual pre-training phase on various time-series datasets, anchored on next patch prediction, effectively syncing LLM capabilities with the intricacies of time-series data; (ii). fine-tuning for multi-patch prediction in the targeted time-series context. A distinctive element of our framework is the patch-wise decoding layer, which departs from previous methods reliant on sequence-level decoding. Such a design directly transposes individual patches into temporal sequences, thereby significantly bolstering the model's proficiency in mastering temporal patch-based representations. aLLM4TS demonstrates superior performance in several downstream tasks, proving its effectiveness in deriving temporal representations with enhanced transferability and marking a pivotal advancement in the adaptation of LLMs for time-series analysis.  ( 2 min )
    A Discriminative Latent-Variable Model for Bilingual Lexicon Induction
    arXiv:1808.09334v3 Announce Type: replace-cross Abstract: We introduce a novel discriminative latent variable model for bilingual lexicon induction. Our model combines the bipartite matching dictionary prior of Haghighi et al. (2008) with a representation-based approach (Artetxe et al., 2017). To train the model, we derive an efficient Viterbi EM algorithm. We provide empirical results on six language pairs under two metrics and show that the prior improves the induced bilingual lexicons. We also demonstrate how previous work may be viewed as a similarly fashioned latent-variable model, albeit with a different prior.  ( 2 min )
    Group Distributionally Robust Dataset Distillation with Risk Minimization
    arXiv:2402.04676v2 Announce Type: replace Abstract: Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span various domains, including transfer learning, federated learning, and neural architecture search. The most popular methods for constructing the synthetic data rely on matching the convergence properties of training the model with the synthetic dataset and the training dataset. However, targeting the training dataset must be thought of as auxiliary in the same sense that the training set is an approximate substitute for the population distribution, and the latter is the data of interest. Yet despite its popularity, an aspect that remains unexplored is the relationship of DD to its generalization, particularly across uncommon subgroups. That is, how can we ensure that a model trained on the synthetic dataset performs well when faced with samples from regions with low population density? Here, the representativeness and coverage of the dataset become salient over the guaranteed training error at inference. Drawing inspiration from distributionally robust optimization, we introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD. We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups through numerical experiments. The source code is available in https://github.com/Mming11/RobustDatasetDistillation.  ( 3 min )
    TEDDY: Trimming Edges with Degree-based Discrimination strategY
    arXiv:2402.01261v2 Announce Type: replace Abstract: Since the pioneering work on the lottery ticket hypothesis for graph neural networks (GNNs) was proposed in Chen et al. (2021), the study on finding graph lottery tickets (GLT) has become one of the pivotal focus in the GNN community, inspiring researchers to discover sparser GLT while achieving comparable performance to original dense networks. In parallel, the graph structure has gained substantial attention as a crucial factor in GNN training dynamics, also elucidated by several recent studies. Despite this, contemporary studies on GLT, in general, have not fully exploited inherent pathways in the graph structure and identified tickets in an iterative manner, which is time-consuming and inefficient. To address these limitations, we introduce TEDDY, a one-shot edge sparsification framework that leverages structural information by incorporating edge-degree information. Following edge sparsification, we encourage the parameter sparsity during training via simple projected gradient descent on the $\ell_0$ ball. Given the target sparsity levels for both the graph structure and the model parameters, our TEDDY facilitates efficient and rapid realization of GLT within a single training. Remarkably, our experimental results demonstrate that TEDDY significantly surpasses conventional iterative approaches in generalization, even when conducting one-shot sparsification that solely utilizes graph structures, without taking feature information into account.  ( 2 min )
    Strategic Usage in a Multi-Learner Setting
    arXiv:2401.16422v2 Announce Type: replace Abstract: Real-world systems often involve some pool of users choosing between a set of services. With the increase in popularity of online learning algorithms, these services can now self-optimize, leveraging data collected on users to maximize some reward such as service quality. On the flipside, users may strategically choose which services to use in order to pursue their own reward functions, in the process wielding power over which services can see and use their data. Extensive prior research has been conducted on the effects of strategic users in single-service settings, with strategic behavior manifesting in the manipulation of observable features to achieve a desired classification; however, this can often be costly or unattainable for users and fails to capture the full behavior of multi-service dynamic systems. As such, we analyze a setting in which strategic users choose among several available services in order to pursue positive classifications, while services seek to minimize loss functions on their observations. We focus our analysis on realizable settings, and show that naive retraining can still lead to oscillation even if all users are observed at different times; however, if this retraining uses memory of past observations, convergent behavior can be guaranteed for certain loss function classes. We provide results obtained from synthetic and real-world data to empirically validate our theoretical findings.  ( 3 min )
    Arrows of Time for Large Language Models
    arXiv:2401.17505v2 Announce Type: replace Abstract: We study the probabilistic modeling performed by Autoregressive Large Language Models through the angle of time directionality. We empirically find a time asymmetry exhibited by such models in their ability to model natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.  ( 2 min )
    True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning
    arXiv:2401.14151v2 Announce Type: replace Abstract: Despite the impressive performance across numerous tasks, large language models (LLMs) often fail in solving simple decision-making tasks due to the misalignment of the knowledge in LLMs with environments. On the contrary, reinforcement learning (RL) agents learn policies from scratch, which makes them always align with environments but difficult to incorporate prior knowledge for efficient explorations. To narrow the gap, we propose TWOSOME, a novel general online framework that deploys LLMs as decision-making agents to efficiently interact and align with embodied environments via RL without requiring any prepared datasets or prior knowledge of the environments. Firstly, we query the joint probabilities of each valid action with LLMs to form behavior policies. Then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. Finally, we design a novel parameter-efficient training architecture where the actor and critic share one frozen LLM equipped with low-rank adapters (LoRA) updated by PPO. We conduct extensive experiments to evaluate TWOSOME. i) TWOSOME exhibits significantly better sample efficiency and performance compared to the conventional RL method, PPO, and prompt tuning method, SayCan, in both classical decision-making environment, Overcooked, and simulated household environment, VirtualHome. ii) Benefiting from LLMs' open-vocabulary feature, TWOSOME shows superior generalization ability to unseen tasks. iii) Under our framework, there is no significant loss of the LLMs' original ability during online PPO finetuning.  ( 3 min )
    Compositional Generative Inverse Design
    arXiv:2401.13171v2 Announce Type: replace Abstract: Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem that arises across fields such as mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as models are optimized they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by the diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple different diffusion models representing subcomponents of our desired system to design systems with every specified component. In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes that are more complex than those in the training data. Our method generalizes to more objects for N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task. Project website and code can be found at https://github.com/AI4Science-WestlakeU/cindm.  ( 2 min )
    Learning from Sparse Offline Datasets via Conservative Density Estimation
    arXiv:2401.08819v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) offers a promising direction for learning policies from pre-collected datasets without requiring further interactions with the environment. However, existing methods struggle to handle out-of-distribution (OOD) extrapolation errors, especially in sparse reward or scarce data settings. In this paper, we propose a novel training algorithm called Conservative Density Estimation (CDE), which addresses this challenge by explicitly imposing constraints on the state-action occupancy stationary distribution. CDE overcomes the limitations of existing approaches, such as the stationary distribution correction method, by addressing the support mismatch issue in marginal importance sampling. Our method achieves state-of-the-art performance on the D4RL benchmark. Notably, CDE consistently outperforms baselines in challenging tasks with sparse rewards or insufficient data, demonstrating the advantages of our approach in addressing the extrapolation error problem in offline RL.  ( 2 min )
    Towards Scalable and Robust Model Versioning
    arXiv:2401.09574v2 Announce Type: replace Abstract: As the deployment of deep learning models continues to expand across industries, the threat of malicious incursions aimed at gaining access to these deployed models is on the rise. Should an attacker gain access to a deployed model, whether through server breaches, insider attacks, or model inversion techniques, they can then construct white-box adversarial attacks to manipulate the model's classification outcomes, thereby posing significant risks to organizations that rely on these models for critical tasks. Model owners need mechanisms to protect themselves against such losses without the necessity of acquiring fresh training data - a process that typically demands substantial investments in time and capital. In this paper, we explore the feasibility of generating multiple versions of a model that possess different attack properties, without acquiring new training data or changing model architecture. The model owner can deploy one version at a time and replace a leaked version immediately with a new version. The newly deployed model version can resist adversarial attacks generated leveraging white-box access to one or all previously leaked versions. We show theoretically that this can be accomplished by incorporating parameterized hidden distributions into the model training data, forcing the model to learn task-irrelevant features uniquely defined by the chosen data. Additionally, optimal choices of hidden distributions can produce a sequence of model versions capable of resisting compound transferability attacks over time. Leveraging our analytical insights, we design and implement a practical model versioning method for DNN classifiers, which leads to significant robustness improvements over existing methods. We believe our work presents a promising direction for safeguarding DNN services beyond their initial deployment.  ( 3 min )
    Robustness, Efficiency, or Privacy: Pick Two in Machine Learning
    arXiv:2312.14712v2 Announce Type: replace Abstract: The success of machine learning (ML) applications relies on vast datasets and distributed architectures which, as they grow, present major challenges. In real-world scenarios, where data often contains sensitive information, issues like data poisoning and hardware failures are common. Ensuring privacy and robustness is vital for the broad adoption of ML in public life. This paper examines the costs associated with achieving these objectives in distributed ML architectures, from both theoretical and empirical perspectives. We overview the meanings of privacy and robustness in distributed ML, and clarify how they can be achieved efficiently in isolation. However, we contend that the integration of these two objectives entails a notable compromise in computational efficiency. In short, traditional noise injection hurts accuracy by concealing poisoned inputs, while cryptographic methods clash with poisoning defenses due to their non-linear nature. However, we outline future research directions aimed at reconciling this compromise with efficiency by considering weaker threat models.  ( 2 min )
    On the Expressive Power of Graph Neural Networks
    arXiv:2401.01626v2 Announce Type: replace Abstract: The study of Graph Neural Networks has received considerable interest in the past few years. By extending deep learning to graph-structured data, GNNs can solve a diverse set of tasks in fields including social science, chemistry, and medicine. The development of GNN architectures has largely been focused on improving empirical performance on tasks like node or graph classification. However, a line of recent work has instead sought to find GNN architectures that have desirable theoretical properties - by studying their expressive power and designing architectures that maximize this expressiveness. While there is no consensus on the best way to define the expressiveness of a GNN, it can be viewed from several well-motivated perspectives. Perhaps the most natural approach is to study the universal approximation properties of GNNs, much in the way that this has been studied extensively for MLPs. Another direction focuses on the extent to which GNNs can distinguish between different graph structures, relating this to the graph isomorphism test. Besides, a GNN's ability to compute graph properties such as graph moments has been suggested as another form of expressiveness. All of these different definitions are complementary and have yielded different recommendations for GNN architecture choices. In this paper, we would like to give an overview of the notion of "expressive power" of GNNs and provide some valuable insights regarding the design choices of GNNs.  ( 3 min )
    Topology Learning for Heterogeneous Decentralized Federated Learning over Unreliable D2D Networks
    arXiv:2312.13611v2 Announce Type: replace Abstract: With the proliferation of intelligent mobile devices in wireless device-to-device (D2D) networks, decentralized federated learning (DFL) has attracted significant interest. Compared to centralized federated learning (CFL), DFL mitigates the risk of central server failures due to communication bottlenecks. However, DFL faces several challenges, such as the severe heterogeneity of data distributions in diverse environments, and the transmission outages and package errors caused by the adoption of the User Datagram Protocol (UDP) in D2D networks. These challenges often degrade the convergence of training DFL models. To address these challenges, we conduct a thorough theoretical convergence analysis for DFL and derive a convergence bound. By defining a novel quantity named unreliable links-aware neighborhood discrepancy in this convergence bound, we formulate a tractable optimization objective, and develop a novel Topology Learning method considering the Representation Discrepancy and Unreliable Links in DFL, named ToLRDUL. Intensive experiments under both feature skew and label skew settings have validated the effectiveness of our proposed method, demonstrating improved convergence speed and test accuracy, consistent with our theoretical findings.  ( 2 min )
    Optimizing Heat Alert Issuance with Reinforcement Learning
    arXiv:2312.14196v2 Announce Type: replace Abstract: A key strategy in societal adaptation to climate change is the use of alert systems to reduce the adverse health impacts of extreme heat events by prompting preventative action. In this work, we investigate reinforcement learning (RL) as a tool to optimize the effectiveness of such systems. Our contributions are threefold. First, we introduce a novel RL environment enabling the evaluation of the effectiveness of heat alert policies to reduce heat-related hospitalizations. The rewards model is trained from a comprehensive dataset of historical weather, Medicare health records, and socioeconomic/geographic features. We use variational Bayesian techniques to address low-signal effects and spatial heterogeneity, which are commonly encountered in climate & health settings. The transition model incorporates real historical weather patterns enriched by a data augmentation mechanism based on climate region similarity. Second, we use this environment to evaluate standard RL algorithms in the context of heat alert issuance. Our analysis shows that policy constraints are needed to improve the initially poor performance of RL. Lastly, a post hoc contrastive analysis provides insight into scenarios where our modified heat alert-RL policies yield significant gains/losses over the current National Weather Service alert policy in the United States.  ( 2 min )
    Learning Unknown Intervention Targets in Structural Causal Models from Heterogeneous Data
    arXiv:2312.06091v2 Announce Type: replace Abstract: We study the problem of identifying the unknown intervention targets in structural causal models where we have access to heterogeneous data collected from multiple environments. The unknown intervention targets are the set of endogenous variables whose corresponding exogenous noises change across the environments. We propose a two-phase approach which in the first phase recovers the exogenous noises corresponding to unknown intervention targets whose distributions have changed across environments. In the second phase, the recovered noises are matched with the corresponding endogenous variables. For the recovery phase, we provide sufficient conditions for learning these exogenous noises up to some component-wise invertible transformation. For the matching phase, under the causal sufficiency assumption, we show that the proposed method uniquely identifies the intervention targets. In the presence of latent confounders, the intervention targets among the observed variables cannot be determined uniquely. We provide a candidate intervention target set which is a superset of the true intervention targets. Our approach improves upon the state of the art as the returned candidate set is always a subset of the target set returned by previous work. Moreover, we do not require restrictive assumptions such as linearity of the causal model or performing invariance tests to learn whether a distribution is changing across environments which could be highly sample inefficient. Our experimental results show the effectiveness of our proposed algorithm in practice.  ( 3 min )
    Active learning with biased non-response to label requests
    arXiv:2312.08150v2 Announce Type: replace Abstract: Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy--the Upper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.  ( 3 min )
    Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation
    arXiv:2311.18207v3 Announce Type: replace Abstract: Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data and is often used to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating its ability to effectively distinguish between low-risk and high-risk estimators and to accurately identify the most efficient one. Efficiency of an estimator is characterized by its capability to form the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. To facilitate a quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also integrated this metric into an open-source software, SCOPE-RL (https://github.com/hakuhodo-technologies/scope-rl). Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking experiments on various estimators and RL tasks, focusing on their risk-return tradeoff. These experiments offer several interesting directions and suggestions for future OPE research.  ( 3 min )
    Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification
    arXiv:2312.04111v2 Announce Type: replace Abstract: Recently, graph neural networks (GNNs) have shown prominent performance in semi-supervised node classification by leveraging knowledge from the graph database. However, most existing GNNs follow the homophily assumption, where connected nodes are more likely to exhibit similar feature distributions and the same labels, and such an assumption has proven to be vulnerable in a growing number of practical applications. As a supplement, heterophily reflects dissimilarity in connected nodes, which has gained significant attention in graph learning. To this end, data engineers aim to develop a powerful GNN model that can ensure performance under both homophily and heterophily. Despite numerous attempts, most existing GNNs struggle to achieve optimal node representations due to the constraints of undirected graphs. The neglect of directed edges results in sub-optimal graph representations, thereby hindering the capacity of GNNs. To address this issue, we introduce AMUD, which quantifies the relationship between node profiles and topology from a statistical perspective, offering valuable insights for Adaptively Modeling the natural directed graphs as the Undirected or Directed graph to maximize the benefits from subsequent graph learning. Furthermore, we propose Adaptive Directed Pattern Aggregation (ADPA) as a new directed graph learning paradigm for AMUD. Empirical studies have demonstrated that AMUD guides efficient graph learning. Meanwhile, extensive experiments on 16 benchmark datasets substantiate the impressive performance of ADPA, outperforming baselines by significant margins of 3.96.  ( 3 min )
    SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation
    arXiv:2311.18206v3 Announce Type: replace Abstract: This paper introduces SCOPE-RL, a comprehensive open-source Python software designed for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and selection (OPS). Unlike most existing libraries that focus solely on either policy learning or evaluation, SCOPE-RL seamlessly integrates these two key aspects, facilitating flexible and complete implementations of both offline RL and OPE processes. SCOPE-RL put particular emphasis on its OPE modules, offering a range of OPE estimators and robust evaluation-of-OPE protocols. This approach enables more in-depth and reliable OPE compared to other packages. For instance, SCOPE-RL enhances OPE by estimating the entire reward distribution under a policy rather than its mere point-wise expected value. Additionally, SCOPE-RL provides a more thorough evaluation-of-OPE by presenting the risk-return tradeoff in OPE results, extending beyond mere accuracy evaluations in existing OPE literature. SCOPE-RL is designed with user accessibility in mind. Its user-friendly APIs, comprehensive documentation, and a variety of easy-to-follow examples assist researchers and practitioners in efficiently implementing and experimenting with various offline RL methods and OPE estimators, tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.  ( 3 min )
    Weight-Sharing Regularization
    arXiv:2311.03096v2 Announce Type: replace Abstract: Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\operatorname{prox}_\mathcal{R}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled while convolutional neural networks fail in this setting. Our code is available on github.  ( 2 min )
    MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift
    arXiv:2310.17159v2 Announce Type: replace Abstract: We present a new loss function that addresses the out-of-distribution (OOD) calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks.  ( 2 min )
    Towards Robust Offline Reinforcement Learning under Diverse Data Corruption
    arXiv:2310.12955v3 Announce Type: replace Abstract: Offline reinforcement learning (RL) presents a promising approach for learning reinforced policies from offline datasets without the need for costly or unsafe interactions with the environment. However, datasets collected by humans in real-world environments are often noisy and may even be maliciously corrupted, which can significantly degrade the performance of offline RL. In this work, we first investigate the performance of current offline RL algorithms under comprehensive data corruption, including states, actions, rewards, and dynamics. Our extensive experiments reveal that implicit Q-learning (IQL) demonstrates remarkable resilience to data corruption among various offline RL algorithms. Furthermore, we conduct both empirical and theoretical analyses to understand IQL's robust performance, identifying its supervised policy learning scheme as the key factor. Despite its relative robustness, IQL still suffers from heavy-tail targets of Q functions under dynamics corruption. To tackle this challenge, we draw inspiration from robust statistics to employ the Huber loss to handle the heavy-tailedness and utilize quantile estimators to balance penalization for corrupted data and learning stability. By incorporating these simple yet effective modifications into IQL, we propose a more robust offline RL approach named Robust IQL (RIQL). Extensive experiments demonstrate that RIQL exhibits highly robust performance when subjected to diverse data corruption scenarios.  ( 3 min )
    Learn from the Past: A Proxy Guided Adversarial Defense Framework with Self Distillation Regularization
    arXiv:2310.12713v2 Announce Type: replace Abstract: Adversarial Training (AT), pivotal in fortifying the robustness of deep learning models, is extensively adopted in practical applications. However, prevailing AT methods, relying on direct iterative updates for target model's defense, frequently encounter obstacles such as unstable training and catastrophic overfitting. In this context, our work illuminates the potential of leveraging the target model's historical states as a proxy to provide effective initialization and defense prior, which results in a general proxy guided defense framework, `LAST' ({\bf L}earn from the P{\bf ast}). Specifically, LAST derives response of the proxy model as dynamically learned fast weights, which continuously corrects the update direction of the target model. Besides, we introduce a self-distillation regularized defense objective, ingeniously designed to steer the proxy model's update trajectory without resorting to external teacher models, thereby ameliorating the impact of catastrophic overfitting on performance. Extensive experiments and ablation studies showcase the framework's efficacy in markedly improving model robustness (e.g., up to 9.2\% and 20.3\% enhancement in robust accuracy on CIFAR10 and CIFAR100 datasets, respectively) and training stability. These improvements are consistently observed across various model architectures, larger datasets, perturbation sizes, and attack modalities, affirming LAST's ability to consistently refine both single-step and multi-step AT strategies. The code will be available at~\url{https://github.com/callous-youth/LAST}.  ( 3 min )
    METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
    arXiv:2310.08887v2 Announce Type: replace Abstract: Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to only cover a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/  ( 3 min )
    SiamAF: Learning Shared Information from ECG and PPG Signals for Robust Atrial Fibrillation Detection
    arXiv:2310.09203v2 Announce Type: replace Abstract: Atrial fibrillation (AF) is the most common type of cardiac arrhythmia. It is associated with an increased risk of stroke, heart failure, and other cardiovascular complications, but can be clinically silent. Passive AF monitoring with wearables may help reduce adverse clinical outcomes related to AF. Detecting AF in noisy wearable data poses a significant challenge, leading to the emergence of various deep learning techniques. Previous deep learning models learn from a single modality, either electrocardiogram (ECG) or photoplethysmography (PPG) signals. However, deep learning models often struggle to learn generalizable features and rely on features that are more susceptible to corruption from noise, leading to sub-optimal performances in certain scenarios, especially with low-quality signals. Given the increasing availability of ECG and PPG signal pairs from wearables and bedside monitors, we propose a new approach, SiamAF, leveraging a novel Siamese network architecture and joint learning loss function to learn shared information from both ECG and PPG signals. At inference time, the proposed model is able to predict AF from either PPG or ECG and outperforms baseline methods on three external test sets. It learns medically relevant features as a result of our novel architecture design. The proposed model also achieves comparable performance to traditional learning regimes while requiring much fewer training labels, providing a potential approach to reduce future reliance on manual labeling.  ( 3 min )
    GraphControl: Adding Conditional Control to Universal Graph Pre-trained Models for Graph Domain Transfer Learning
    arXiv:2310.07365v3 Announce Type: replace Abstract: Graph-structured data is ubiquitous in the world which models complex relationships between objects, enabling various Web applications. Daily influxes of unlabeled graph data on the Web offer immense potential for these applications. Graph self-supervised algorithms have achieved significant success in acquiring generic knowledge from abundant unlabeled graph data. These pre-trained models can be applied to various downstream Web applications, saving training time and improving downstream (target) performance. However, different graphs, even across seemingly similar domains, can differ significantly in terms of attribute semantics, posing difficulties, if not infeasibility, for transferring the pre-trained models to downstream tasks. Concretely speaking, for example, the additional task-specific node information in downstream tasks (specificity) is usually deliberately omitted so that the pre-trained representation (transferability) can be leveraged. The trade-off as such is termed as "transferability-specificity dilemma" in this work. To address this challenge, we introduce an innovative deployment module coined as GraphControl, motivated by ControlNet, to realize better graph domain transfer learning. Specifically, by leveraging universal structural pre-trained models and GraphControl, we align the input space across various graphs and incorporate unique characteristics of target data as conditional inputs. These conditions will be progressively integrated into the model during fine-tuning or prompt tuning through ControlNet, facilitating personalized deployment. Extensive experiments show that our method significantly enhances the adaptability of pre-trained models on target attributed datasets, achieving 1.4-3x performance gain. Furthermore, it outperforms training-from-scratch methods on target data with a comparable margin and exhibits faster convergence.  ( 3 min )
    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
    arXiv:2310.06625v3 Announce Type: replace Abstract: The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformers are challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the embedding for each temporal token fuses multiple variates that represent potential delayed events and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any modification to the basic components. We propose iTransformer that simply applies the attention and feed-forward network on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting. Code is available at this repository: https://github.com/thuml/iTransformer.  ( 3 min )
    Distributional Reinforcement Learning with Online Risk-awareness Adaption
    arXiv:2310.05179v2 Announce Type: replace Abstract: The use of reinforcement learning (RL) in practical applications requires considering sub-optimal outcomes, which depend on the agent's familiarity with the uncertain environment. Dynamically adjusting the level of epistemic risk over the course of learning can tactically achieve reliable optimal policy in safety-critical environments and tackle the sub-optimality of a static risk level. In this work, we introduce a novel framework, Distributional RL with Online Risk Adaption (DRL-ORA), which can quantify the aleatory and epistemic uncertainties compositely and dynamically select the epistemic risk levels via solving a total variation minimization problem online. The risk level selection can be efficiently achieved through grid search using a Follow-The-Leader type algorithm, and its offline oracle is related to "satisficing measure" (in the decision analysis community) under a special modification of the loss function. We show multiple classes of tasks where DRL-ORA outperforms existing methods that rely on either a fixed risk level or manually predetermined risk level adaption. Given the simplicity of our modifications, we believe the framework can be easily incorporated into most RL algorithm variants.  ( 2 min )
    MuseChat: A Conversational Music Recommendation System for Videos
    arXiv:2310.06282v4 Announce Type: replace Abstract: Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring the users' preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video along with optional information including previous suggested music and user's preference as inputs and retrieves an appropriate music matching the context. The reasoning module, equipped with the power of Large Language Model (Vicuna-7B) and extended to multi-modal inputs, is able to provide reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build a large-scale dataset, conversational music recommendation for videos, that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experiment results show that MuseChat achieves significant improvements over existing video-based music retrieval methods as well as offers strong interpretability and interactability.  ( 2 min )
    Parameter Efficient Multi-task Model Fusion with Partial Linearization
    arXiv:2310.04742v3 Announce Type: replace Abstract: Large pre-trained models have enabled significant advances in machine learning and served as foundation components. Model fusion methods, such as task arithmetic, have been proven to be powerful and scalable to incorporate fine-tuned weights from different tasks into a multi-task model. However, efficiently fine-tuning large pre-trained models on multiple downstream tasks remains challenging, leading to inefficient multi-task model fusion. In this work, we propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques like LoRA fine-tuning. Specifically, our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters. This allows us to leverage the the advantages of model fusion over linearized fine-tuning, while still performing fine-tuning and inference efficiently. We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model, outperforming standard adapter tuning and task arithmetic alone. Experimental results demonstrate the capabilities of our proposed partial linearization technique to effectively construct unified multi-task models via the fusion of fine-tuned task vectors. We evaluate performance over an increasing number of tasks and find that our approach outperforms standard parameter-efficient fine-tuning techniques. The results highlight the benefits of partial linearization for scalable and efficient multi-task model fusion. The code is available at https://github.com/tanganke/peta  ( 3 min )
    A Deep Instance Generative Framework for MILP Solvers Under Limited Data Availability
    arXiv:2310.02807v3 Announce Type: replace Abstract: In the past few years, there has been an explosive surge in the use of machine learning (ML) techniques to address combinatorial optimization (CO) problems, especially mixed-integer linear programs (MILPs). Despite the achievements, the limited availability of real-world instances often leads to sub-optimal decisions and biased solver assessments, which motivates a suite of synthetic MILP instance generation techniques. However, existing methods either rely heavily on expert-designed formulations or struggle to capture the rich features of real-world instances. To tackle this problem, we propose G2MILP, the first deep generative framework for MILP instances. Specifically, G2MILP represents MILP instances as bipartite graphs, and applies a masked variational autoencoder to iteratively corrupt and replace parts of the original graphs to generate new ones. The appealing feature of G2MILP is that it can learn to generate novel and realistic MILP instances without prior expert-designed formulations, while preserving the structures and computational hardness of real-world datasets, simultaneously. Thus the generated instances can facilitate downstream tasks for enhancing MILP solvers under limited data availability. We design a suite of benchmarks to evaluate the quality of the generated MILP instances. Experiments demonstrate that our method can produce instances that closely resemble real-world datasets in terms of both structures and computational hardness. The deliverables are released at https://miralab-ustc.github.io/L2O-G2MILP.  ( 3 min )
    Reward Model Ensembles Help Mitigate Overoptimization
    arXiv:2310.02743v2 Announce Type: replace Abstract: Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.  ( 3 min )
    Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization
    arXiv:2310.02679v3 Announce Type: replace Abstract: We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment issues due to use of entire trajectories and a learning signal present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals. Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods.  ( 2 min )
    On Representation Complexity of Model-based and Model-free Reinforcement Learning
    arXiv:2310.01706v2 Announce Type: replace Abstract: We study the representation complexity of model-based and model-free reinforcement learning (RL) in the context of circuit complexity. We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal $Q$-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as $Q$-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal $Q$-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal $Q$-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.  ( 2 min )
    ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
    arXiv:2309.16119v2 Announce Type: replace Abstract: We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time -- leveraging state-of-the-art 2-bit QuIP\# quantization and 3-bit OPTQ quantization -- outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, \lplora~attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release \lplora~together with a series of low-precision models as part of \llmtune, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.  ( 2 min )
    Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
    arXiv:2309.17002v2 Announce Type: replace Abstract: Pre-training on large-scale datasets and then fine-tuning on downstream tasks have become a standard practice in deep learning. However, pre-training data often contain label noise that may adversely affect the generalization of the model. This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. More specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) transfer performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing data distribution are different. We empirically verify that the reason behind is noise in pre-training shapes the feature space differently. We then propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization on both ID and OOD tasks, considering one may not be able to fully fine-tune or even access the pre-trained models. We conduct practical experiments on popular vision and language models that are pre-trained on noisy data for evaluation of our approach. Our analysis and results show the importance of this interesting and novel research direction, which we term Noisy Model Learning.  ( 3 min )
    STARC: A General Framework For Quantifying Differences Between Reward Functions
    arXiv:2309.15257v2 Announce Type: replace Abstract: In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use \emph{reward learning algorithms}, which attempt to \emph{learn} a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm with high probability will learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms both easier and more principled.  ( 3 min )
    FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices using a Computing Power Aware Scheduler
    arXiv:2309.14675v2 Announce Type: replace Abstract: Cross-silo federated learning offers a promising solution to collaboratively train robust and generalized AI models without compromising the privacy of local datasets, e.g., healthcare, financial, as well as scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency when waiting for straggler clients. Similarly, asynchronous federated learning algorithms experience degradation in the convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power-aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using the knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods from straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients. The source code for FedCompass is available at https://github.com/APPFL/FedCompass.  ( 3 min )
    Topologically Regularized Multiple Instance Learning to Harness Data Scarcity
    arXiv:2307.14025v2 Announce Type: replace Abstract: In biomedical data analysis, Multiple Instance Learning (MIL) models have emerged as a powerful tool to classify patients' microscopy samples. However, the data-intensive requirement of these models poses a significant challenge in scenarios with scarce data availability, e.g., in rare diseases. We introduce a topological regularization term to MIL to mitigate this challenge. It provides a shape-preserving inductive bias that compels the encoder to maintain the essential geometrical-topological structure of input bags during projection into latent space. This enhances the performance and generalization of the MIL classifier regardless of the aggregation function, particularly for scarce training data. The effectiveness of our method is confirmed through experiments across a range of datasets, showing an average enhancement of 2.8% for MIL benchmarks, 15.3% for synthetic MIL datasets, and 5.5% for real-world biomedical datasets over the current state-of-the-art.  ( 2 min )
    Task-Aware Machine Unlearning and Its Application in Load Forecasting
    arXiv:2308.14412v2 Announce Type: replace Abstract: Data privacy and security have become a non-negligible factor in load forecasting. Previous researches mainly focus on training stage enhancement. However, once the model is trained and deployed, it may need to `forget' (i.e., remove the impact of) part of training data if the these data are found to be malicious or as requested by the data owner. This paper introduces the concept of machine unlearning which is specifically designed to remove the influence of part of the dataset on an already trained forecaster. However, direct unlearning inevitably degrades the model generalization ability. To balance between unlearning completeness and model performance, a performance-aware algorithm is proposed by evaluating the sensitivity of local model parameter change using influence function and sample re-weighting. Furthermore, we observe that the statistical criterion such as mean squared error, cannot fully reflect the operation cost of the downstream tasks in power system. Therefore, a task-aware machine unlearning is proposed whose objective is a trilevel optimization with dispatch and redispatch problems considered. We theoretically prove the existence of the gradient of such an objective, which is key to re-weighting the remaining samples. We tested the unlearning algorithms on linear, CNN, and MLP-Mixer based load forecasters with a realistic load dataset. The simulation demonstrates the balance between unlearning completeness and operational cost. All codes can be found at https://github.com/xuwkk/task_aware_machine_unlearning.  ( 3 min )
    Neural Bradley-Terry Rating: Quantifying Properties from Comparisons
    arXiv:2307.13709v5 Announce Type: replace Abstract: Many properties in the real world don't have metrics and can't be numerically observed, making them difficult to learn. To deal with this challenging problem, prior works have primarily focused on estimating those properties by using graded human scores as the target label in the training. Meanwhile, rating algorithms based on the Bradley-Terry model are extensively studied to evaluate the competitiveness of players based on their match history. In this paper, we introduce the Neural Bradley-Terry Rating (NBTR), a novel machine learning framework designed to quantify and evaluate properties of unknown items. Our method seamlessly integrates the Bradley-Terry model into the neural network structure. Moreover, we generalize this architecture further to asymmetric environments with unfairness, a condition more commonly encountered in real-world settings. Through experimental analysis, we demonstrate that NBTR successfully learns to quantify and estimate desired properties.  ( 2 min )
    HIQL: Offline Goal-Conditioned RL with Latent States as Actions
    arXiv:2307.11949v4 Announce Type: replace Abstract: Unsupervised pre-training has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. Our code is available at https://seohong.me/projects/hiql/  ( 3 min )
    Overthinking the Truth: Understanding how Language Models Process False Demonstrations
    arXiv:2307.09476v2 Announce Type: replace Abstract: Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: "overthinking" and "false induction heads". The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.  ( 2 min )
    Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
    arXiv:2307.11888v2 Announce Type: replace Abstract: Deep neural networks based on linear complex-valued RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches to sequence modeling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures' effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and use of complex numbers. In this paper, we show that combining MLPs with both real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that using real diagonal linear recurrences is enough to achieve universality in this architecture, we prove that employing complex eigenvalues near unit disk - i.e., empirically the most successful strategy in SSMs - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experimental evidence supporting our claims.  ( 3 min )
    Scaling Laws for Imitation Learning in Single-Agent Games
    arXiv:2307.09423v2 Announce Type: replace Abstract: Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, many works find it is often unable to fully recover the underlying expert behavior, even in constrained environments like single-agent games. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scaling up" has resulted in increasingly more capable LLMs, we investigate whether carefully scaling up model and data size can bring similar improvements in the imitation learning setting for single-agent games. We first demonstrate our findings on a variety of Atari games, and thereafter focus on the extremely challenging game of NetHack. In all games, we find that IL loss and mean return scale smoothly with the compute budget (FLOPs) and are strongly correlated, resulting in power laws for training compute-optimal IL agents. Finally, we forecast and train several NetHack agents with IL and find they outperform prior state-of-the-art by 1.5x in all settings. Our work both demonstrates the scaling behavior of imitation learning in a variety of single-agent games, as well as the viability of scaling up current approaches for increasingly capable agents in NetHack, a game that remains elusively hard for current AI systems.  ( 3 min )
    NetGPT: A Native-AI Network Architecture Beyond Provisioning Personalized Generative Services
    arXiv:2307.06148v4 Announce Type: replace Abstract: Large language models (LLMs) have triggered tremendous success to empower our daily life by generative information. The personalization of LLMs could further contribute to their applications due to better alignment with human intents. Towards personalized generative services, a collaborative cloud-edge methodology is promising, as it facilitates the effective orchestration of heterogeneous distributed communication and computing resources. In this article, we put forward NetGPT to capably synergize appropriate LLMs at the edge and the cloud based on their computing capacity. In addition, edge LLMs could efficiently leverage location-based information for personalized prompt completion, thus benefiting the interaction with the cloud LLM. In particular, we present the feasibility of NetGPT by leveraging low-rank adaptation-based fine-tuning of open-source LLMs (i.e., GPT-2-base model and LLaMA model), and conduct comprehensive numerical comparisons with alternative cloud-edge collaboration or cloud-only techniques, so as to demonstrate the superiority of NetGPT. Subsequently, we highlight the essential changes required for an artificial intelligence (AI)-native network architecture towards NetGPT, with emphasis on deeper integration of communications and computing resources and careful calibration of logical AI workflow. Furthermore, we demonstrate several benefits of NetGPT, which come as by-products, as the edge LLMs' capability to predict trends and infer intents promises a unified solution for intelligent network management & orchestration. We argue that NetGPT is a promising AI-native network architecture for provisioning beyond personalized generative services.  ( 3 min )
    Simulation-free Schr\"odinger bridges via score and flow matching
    arXiv:2307.03672v3 Announce Type: replace Abstract: We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired samples drawn from arbitrary source and target distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schr\"odinger bridge problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data. Our code is available in the TorchCFM package at https://github.com/atong01/conditional-flow-matching.  ( 2 min )
    Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators
    arXiv:2307.05358v3 Announce Type: replace Abstract: Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11 on CIFAR-10 and CINIC-10 datasets.  ( 3 min )
    Defending Against Malicious Behaviors in Federated Learning with Blockchain
    arXiv:2307.00543v2 Announce Type: replace Abstract: In the era of deep learning, federated learning (FL) presents a promising approach that allows multi-institutional data owners, or clients, to collaboratively train machine learning models without compromising data privacy. However, most existing FL approaches rely on a centralized server for global model aggregation, leading to a single point of failure. This makes the system vulnerable to malicious attacks when dealing with dishonest clients. In this work, we address this problem by proposing a secure and reliable FL system based on blockchain and distributed ledger technology. Our system incorporates a peer-to-peer voting mechanism and a reward-and-slash mechanism, which are powered by on-chain smart contracts, to detect and deter malicious behaviors. Both theoretical and empirical analyses are presented to demonstrate the effectiveness of the proposed approach, showing that our framework is robust against malicious client-side behaviors.  ( 2 min )
    Group-based Robustness: A General Framework for Customized Robustness in the Real World
    arXiv:2306.16614v3 Announce Type: replace Abstract: Machine-learning models are known to be vulnerable to evasion attacks that perturb model inputs to induce misclassifications. In this work, we identify real-world scenarios where the true threat cannot be assessed accurately by existing attacks. Specifically, we find that conventional metrics measuring targeted and untargeted robustness do not appropriately reflect a model's ability to withstand attacks from one set of source classes to another set of target classes. To address the shortcomings of existing methods, we formally define a new metric, termed group-based robustness, that complements existing metrics and is better-suited for evaluating model performance in certain attack scenarios. We show empirically that group-based robustness allows us to distinguish between models' vulnerability against specific threat models in situations where traditional robustness metrics do not apply. Moreover, to measure group-based robustness efficiently and accurately, we 1) propose two loss functions and 2) identify three new attack strategies. We show empirically that with comparable success rates, finding evasive samples using our new loss functions saves computation by a factor as large as the number of targeted classes, and finding evasive samples using our new attack strategies saves time by up to 99\% compared to brute-force search methods. Finally, we propose a defense method that increases group-based robustness by up to 3.52$\times$.  ( 3 min )
    SENSEi: Input-Sensitive Compilation for Accelerating GNNs
    arXiv:2306.15155v2 Announce Type: replace Abstract: Over the years, many frameworks and optimization techniques have been proposed to accelerate graph neural networks (GNNs). Compared to the optimizations explored in these systems, we observe that different matrix re-associations of GNN computations lead to novel input-sensitive performance behavior. We leverage this observation to propose SENSEi, a system that exposes different sparse and dense matrix primitive compositions based on different matrix re-associations of GNN computations and selects the best among them based on input attributes. SENSEi executes in two stages: (1) an offline compilation stage that enumerates all valid re-associations leading to different sparse-dense matrix compositions and uses input-oblivious pruning techniques to prune away clearly unprofitable candidates and (2) an online runtime system that explores the remaining candidates and uses light-weight cost models to select the best re-association based on the input graph and the embedding sizes on a given hardware platform. On a wide range of configurations, SENSEi achieves speedups of up to $2.012\times$ and $1.85\times$ on graph convolutional networks and up to $6.294\times$ and $16.274\times$ on graph attention networks, on GPUs and CPUs respectively. We also show that its technique generalizes to GNN variants, including those that require sampling. Furthermore, we show that SENSEi's techniques are agnostic to the underlying GNN system, and can be used to yield synergistic improvements across a diverse set of implementations.  ( 3 min )
    3D molecule generation by denoising voxel grids
    arXiv:2306.07473v2 Announce Type: replace Abstract: We propose a new score-based approach to generate 3D molecules represented as atomic densities on regular grids. First, we train a denoising neural network that learns to map from a smooth distribution of noisy molecules to the distribution of real molecules. Then, we follow the neural empirical Bayes framework (Saremi and Hyvarinen, 19) and generate molecules in two steps: (i) sample noisy density grids from a smooth distribution via underdamped Langevin Markov chain Monte Carlo, and (ii) recover the "clean" molecule by denoising the noisy grid with a single step. Our method, VoxMol, generates molecules in a fundamentally different way than the current state of the art (ie, diffusion models applied to atom point clouds). It differs in terms of the data representation, the noise model, the network architecture and the generative modeling algorithm. Our experiments show that VoxMol captures the distribution of drug-like molecules better than state of the art, while being faster to generate samples.  ( 2 min )
    Error Feedback Can Accurately Compress Preconditioners
    arXiv:2306.06098v4 Announce Type: replace Abstract: Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning. Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC) suffer from massive storage costs when applied even to small-scale models, as they must store a sliding window of gradients, whose memory requirements are multiplicative in the model dimension. In this paper, we address this issue via a novel and efficient error-feedback technique that can be applied to compress preconditioners by up to two orders of magnitude in practice, without loss of convergence. Specifically, our approach compresses the gradient information via sparsification or low-rank compression \emph{before} it is fed into the preconditioner, feeding the compression error back into future iterations. Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99\% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC. Our code is available at \url{https://github.com/IST-DASLab/EFCP}.  ( 2 min )
    Neuron Activation Coverage: Rethinking Out-of-distribution Detection and Generalization
    arXiv:2306.02879v3 Announce Type: replace Abstract: The out-of-distribution (OOD) problem generally arises when neural networks encounter data that significantly deviates from the training data distribution, i.e., in-distribution (InD). In this paper, we study the OOD problem from a neuron activation view. We first formulate neuron activation states by considering both the neuron output and its influence on model decisions. Then, to characterize the relationship between neurons and OOD issues, we introduce the \textit{neuron activation coverage} (NAC) -- a simple measure for neuron behaviors under InD data. Leveraging our NAC, we show that 1) InD and OOD inputs can be largely separated based on the neuron behavior, which significantly eases the OOD detection problem and beats the 21 previous methods over three benchmarks (CIFAR-10, CIFAR-100, and ImageNet-1K). 2) a positive correlation between NAC and model generalization ability consistently holds across architectures and datasets, which enables a NAC-based criterion for evaluating model robustness. Compared to prevalent InD validation criteria, we show that NAC not only can select more robust models, but also has a stronger correlation with OOD test performance.  ( 2 min )
    BeMap: Balanced Message Passing for Fair Graph Neural Network
    arXiv:2306.04107v2 Announce Type: replace Abstract: Fairness in graph neural networks has been actively studied recently. However, existing works often do not explicitly consider the role of message passing in introducing or amplifying the bias. In this paper, we first investigate the problem of bias amplification in message passing. We empirically and theoretically demonstrate that message passing could amplify the bias when the 1-hop neighbors from different demographic groups are unbalanced. Guided by such analyses, we propose BeMap, a fair message passing method, that leverages a balance-aware sampling strategy to balance the number of the 1-hop neighbors of each node among different demographic groups. Extensive experiments on node classification demonstrate the efficacy of BeMap in mitigating bias while maintaining classification accuracy. The code is available at https://github.com/xiaolin-cs/BeMap.  ( 2 min )
    Deep Classifier Mimicry without Data Access
    arXiv:2306.02090v2 Announce Type: replace Abstract: Access to pre-trained models has recently emerged as a standard across numerous machine learning domains. Unfortunately, access to the original data the models were trained on may not equally be granted. This makes it tremendously challenging to fine-tune, compress models, adapt continually, or to do any other type of data-driven update. We posit that original data access may however not be required. Specifically, we propose Contrastive Abductive Knowledge Extraction (CAKE), a model-agnostic knowledge distillation procedure that mimics deep classifiers without access to the original data. To this end, CAKE generates pairs of noisy synthetic samples and diffuses them contrastively toward a model's decision boundary. We empirically corroborate CAKE's effectiveness using several benchmark datasets and various architectural choices, paving the way for broad application.  ( 2 min )
    Trainable and Explainable Simplicial Map Neural Networks
    arXiv:2306.00010v2 Announce Type: replace Abstract: Simplicial map neural networks (SMNNs) are topology-based neural networks with interesting properties such as universal approximation ability and robustness to adversarial examples under appropriate conditions. However, SMNNs present some bottlenecks for their possible application in high-dimensional datasets. First, SMNNs have precomputed fixed weight and no SMNN training process has been defined so far, so they lack generalization ability. Second, SMNNs require the construction of a convex polytope surrounding the input dataset. In this paper, we overcome these issues by proposing an SMNN training procedure based on a support subset of the given dataset and replacing the construction of the convex polytope by a method based on projections to a hypersphere. In addition, the explainability capacity of SMNNs and an effective implementation are also newly introduced in this paper.  ( 2 min )
    Prediction Error-based Classification for Class-Incremental Learning
    arXiv:2305.18806v2 Announce Type: replace Abstract: Class-incremental learning (CIL) is a particularly challenging variant of continual learning, where the goal is to learn to discriminate between all classes presented in an incremental fashion. Existing approaches often suffer from excessive forgetting and imbalance of the scores assigned to classes that have not been seen together during training. In this study, we introduce a novel approach, Prediction Error-based Classification (PEC), which differs from traditional discriminative and generative classification paradigms. PEC computes a class score by measuring the prediction error of a model trained to replicate the outputs of a frozen random neural network on data from that class. The method can be interpreted as approximating a classification rule based on Gaussian Process posterior variance. PEC offers several practical advantages, including sample efficiency, ease of tuning, and effectiveness even when data are presented one class at a time. Our empirical results show that PEC performs strongly in single-pass-through-data CIL, outperforming other rehearsal-free baselines in all cases and rehearsal-based methods with moderate replay buffer size in most cases across multiple benchmarks.  ( 2 min )
    Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
    arXiv:2305.19101v2 Announce Type: replace Abstract: One of the remarkable properties of robust computer vision models is that their input-gradients are often aligned with human perception, referred to in the literature as perceptually-aligned gradients (PAGs). Despite only being trained for classification, PAGs cause robust models to have rudimentary generative capabilities, including image generation, denoising, and in-painting. However, the underlying mechanisms behind these phenomena remain unknown. In this work, we provide a first explanation of PAGs via \emph{off-manifold robustness}, which states that models must be more robust off- the data manifold than they are on-manifold. We first demonstrate theoretically that off-manifold robustness leads input gradients to lie approximately on the data manifold, explaining their perceptual alignment. We then show that Bayes optimal models satisfy off-manifold robustness, and confirm the same empirically for robust models trained via gradient norm regularization, randomized smoothing, and adversarial training with projected gradient descent. Quantifying the perceptual alignment of model gradients via their similarity with the gradients of generative models, we show that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, we identify three different regimes of robustness that affect both perceptual alignment and model accuracy: weak robustness, bayes-aligned robustness, and excessive robustness. Code is available at \url{https://github.com/tml-tuebingen/pags}.  ( 3 min )
    Contextual Bandits with Budgeted Information Reveal
    arXiv:2305.18511v2 Announce Type: replace Abstract: Contextual bandit algorithms are commonly used in digital health to recommend personalized treatments. However, to ensure the effectiveness of the treatments, patients are often requested to take actions that have no immediate benefit to them, which we refer to as pro-treatment actions. In practice, clinicians have a limited budget to encourage patients to take these actions and collect additional information. We introduce a novel optimization and learning algorithm to address this problem. This algorithm effectively combines the strengths of two algorithmic approaches in a seamless manner, including 1) an online primal-dual algorithm for deciding the optimal timing to reach out to patients, and 2) a contextual bandit learning algorithm to deliver personalized treatment to the patient. We prove that this algorithm admits a sub-linear regret bound. We illustrate the usefulness of this algorithm on both synthetic and real-world data.  ( 2 min )
    Large Language Models as Tool Makers
    arXiv:2305.17126v2 Announce Type: replace Abstract: Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving. On the problem-solving server side, tool-making enables continual tool generation and caching as new requests emerge. This framework enables subsequent requests to access cached tools via their corresponding APIs, enhancing the efficiency of task resolution. Recognizing that tool-making requires more sophisticated capabilities, we assign this task to a powerful, albeit resource-intensive, model. Conversely, the simpler tool-using phase is delegated to a lightweight model. This strategic division of labor allows the once-off cost of tool-making to be spread over multiple instances of tool-using, significantly reducing average costs while maintaining strong performance. Furthermore, our method offers a functional cache through the caching and reuse of tools, which stores the functionality of a class of requests instead of the natural language responses from LLMs, thus extending the applicability of the conventional cache mechanism. We evaluate our approach across various complex reasoning tasks, including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstrates performance equivalent to using GPT-4 for both roles, but with a significantly reduced inference cost.  ( 3 min )
    Consistent Optimal Transport with Empirical Conditional Measures
    arXiv:2305.15901v5 Announce Type: replace Abstract: Given samples from two joint distributions, we consider the problem of Optimal Transportation (OT) between them when conditioned on a common variable. We focus on the general setting where the conditioned variable may be continuous, and the marginals of this variable in the two joint distributions may not be the same. In such settings, standard OT variants cannot be employed, and novel estimation techniques are necessary. Since the main challenge is that the conditional distributions are not explicitly available, the key idea in our OT formulation is to employ kernelized-least-squares terms computed over the joint samples, which implicitly match the transport plan's marginals with the empirical conditionals. Under mild conditions, we prove that our estimated transport plans, as a function of the conditioned variable, are asymptotically optimal. For finite samples, we show that the deviation in terms of our regularized objective is bounded by $O(1/m^{1/4})$, where $m$ is the number of samples. We also discuss how the conditional transport plan could be modelled using explicit probabilistic models as well as using implicit generative ones. We empirically verify the consistency of our estimator on synthetic datasets, where the optimal plan is analytically known. When employed in applications like prompt learning for few-shot classification and conditional-generation in the context of predicting cell responses to treatment, our methodology improves upon state-of-the-art methods.  ( 3 min )
    Subpopulation-Specific Synthetic EHR for Better Mortality Prediction
    arXiv:2305.16363v2 Announce Type: replace Abstract: Electronic health records (EHR) often contain different rates of representation of certain subpopulations (SP). Factors like patient demographics, clinical condition prevalence, and medical center type contribute to this underrepresentation. Consequently, when training machine learning models on such datasets, the models struggle to generalize well and perform poorly on underrepresented SPs. To address this issue, we propose a novel ensemble framework that utilizes generative models. Specifically, we train a GAN-based synthetic data generator for each SP and incorporate synthetic samples into each SP training set. Ultimately, we train SP-specific prediction models. To properly evaluate this method, we design an evaluation pipeline with 2 real-world use case datasets, queried from the MIMIC database. Our approach shows increased model performance over underrepresented SPs. Our code and models are given as supplementary and will be made available on a public repository.  ( 2 min )
    The Behavior and Convergence of Local Bayesian Optimization
    arXiv:2305.15572v3 Announce Type: replace Abstract: A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by M\"uller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.  ( 2 min )
    Online Reinforcement Learning in Markov Decision Process Using Linear Programming
    arXiv:2304.00155v3 Announce Type: replace Abstract: We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and stochastic rewards drawn from some fixed but unknown distribution. The learner aims to learn the optimal policy and minimize their regret over a finite time horizon through interacting with the environment. We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability, where $L$ is the episode length, $T$ is the number of episodes, and $X$ and $A$ are the cardinalities of the state space and the action space, respectively. The proposed algorithm, which is based on the concept of ``optimism in the face of uncertainty", maintains confidence sets of transition and reward functions and uses occupancy measures to connect the online MDP with linear programming. It achieves a tighter regret bound compared to the existing works that use a similar confidence set framework and improves computational effort compared to those that use a different framework but with a slightly tighter regret bound.  ( 2 min )
    Improving and generalizing flow-based generative models with minibatch optimal transport
    arXiv:2302.00482v4 Announce Type: replace Abstract: Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, we show that when the true OT plan is available, our OT-CFM method approximates dynamic OT. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single cell dynamics, unsupervised image translation, and Schr\"odinger bridge inference.  ( 2 min )
    Brain Effective Connectome based on fMRI and DTI Data: Bayesian Causal Learning and Assessment
    arXiv:2302.05451v3 Announce Type: replace Abstract: Neuroscientific studies aim to find an accurate and reliable brain Effective Connectome (EC). Although current EC discovery methods have contributed to our understanding of brain organization, their performances are severely constrained by the short sample size and poor temporal resolution of fMRI data, and high dimensionality of the brain connectome. By leveraging the DTI data as prior knowledge, we introduce two Bayesian causal discovery frameworks -- the Bayesian GOLEM (BGOLEM) and Bayesian FGES (BFGES) methods -- that offer significantly more accurate and reliable ECs and address the shortcomings of the existing causal discovery methods in discovering ECs based on only fMRI data. Through a series of simulation studies on synthetic and hybrid (DTI of the Human Connectome Project (HCP) subjects and synthetic fMRI) data, we demonstrate the effectiveness of the proposed methods in discovering EC. To numerically assess the improvement in the accuracy of ECs with our method on empirical data, we first introduce the Pseudo False Discovery Rate (PFDR) as a new computational accuracy metric for causal discovery in the brain. We show that our Bayesian methods achieve higher accuracy than traditional methods on HCP data. Additionally, we measure the reliability of discovered ECs using the Rogers-Tanimoto index for test-retest data and show that our Bayesian methods provide significantly more reproducible ECs than traditional methods. Overall, our study's numerical and graphical results highlight the potential for these frameworks to advance our understanding of brain function and organization significantly.  ( 3 min )
    oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
    arXiv:2301.01333v3 Announce Type: replace Abstract: With the rapid development of deep learning models and hardware support for dense computing, the deep learning workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives does not fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel Xeon Scalable Processors.  ( 3 min )
    PhAST: Physics-Aware, Scalable, and Task-specific GNNs for Accelerated Catalyst Design
    arXiv:2211.12020v4 Announce Type: replace Abstract: Mitigating the climate crisis requires a rapid transition towards lower-carbon energy. Catalyst materials play a crucial role in the electrochemical reactions involved in numerous industrial processes key to this transition, such as renewable energy storage and electrofuel synthesis. To reduce the energy spent on such activities, we must quickly discover more efficient catalysts to drive electrochemical reactions. Machine learning (ML) holds the potential to efficiently model materials properties from large amounts of data, accelerating electrocatalyst design. The Open Catalyst Project OC20 dataset was constructed to that end. However, ML models trained on OC20 are still neither scalable nor accurate enough for practical applications. In this paper, we propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy. This includes improvements in (1) the graph creation step, (2) atom representations, (3) the energy prediction head, and (4) the force prediction head. We describe these contributions, referred to as PhAST, and evaluate them thoroughly on multiple architectures. Overall, PhAST improves energy MAE by 4 to 42$\%$ while dividing compute time by 3 to 8$\times$ depending on the targeted task/model. PhAST also enables CPU training, leading to 40$\times$ speedups in highly parallelized settings. Python package: \url{https://phast.readthedocs.io}.  ( 3 min )
    Is your model predicting the past?
    arXiv:2206.11673v2 Announce Type: replace Abstract: When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to what extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.  ( 2 min )
    OpenXAI: Towards a Transparent Evaluation of Model Explanations
    arXiv:2206.11104v4 Announce Type: replace Abstract: While several types of post hoc explanation methods have been proposed in recent literature, there is very little work on systematically benchmarking these methods. Here, we introduce OpenXAI, a comprehensive and extensible open-source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises of the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, and (ii) open-source implementations of eleven quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, in turn providing comparisons of several explanation methods across a wide variety of metrics, models, and datasets. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. While the first release of OpenXAI supports only tabular datasets, the explanation methods and metrics that we consider are general enough to be applicable to other data modalities. OpenXAI datasets and models, implementations of state-of-the-art explanation methods and evaluation metrics, are publicly available at this GitHub link.  ( 3 min )
    Slowly Changing Adversarial Bandit Algorithms are Efficient for Discounted MDPs
    arXiv:2205.09056v3 Announce Type: replace Abstract: Reinforcement learning generalizes multi-armed bandit problems with additional difficulties of a longer planning horizon and unknown transition kernel. We explore a black-box reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits, where, specifically, an independent bandit learner is placed in each state. We show that, under ergodicity and fast mixing assumptions, any slowly changing adversarial bandit algorithm achieving optimal regret in the adversarial bandit setting can also attain optimal expected regret in infinite-horizon discounted Markov decision processes, with respect to the number of rounds $T$. Furthermore, we examine our reduction using a specific instance of the exponential-weight algorithm.  ( 2 min )
    NAS-Bench-Graph: Benchmarking Graph Neural Architecture Search
    arXiv:2206.09166v2 Announce Type: replace Abstract: Graph neural architecture search (GraphNAS) has recently aroused considerable attention in both academia and industry. However, two key challenges seriously hinder the further research of GraphNAS. First, since there is no consensus for the experimental setting, the empirical results in different research papers are often not comparable and even not reproducible, leading to unfair comparisons. Secondly, GraphNAS often needs extensive computations, which makes it highly inefficient and inaccessible to researchers without access to large-scale computation. To solve these challenges, we propose NAS-Bench-Graph, a tailored benchmark that supports unified, reproducible, and efficient evaluations for GraphNAS. Specifically, we construct a unified, expressive yet compact search space, covering 26,206 unique graph neural network (GNN) architectures and propose a principled evaluation protocol. To avoid unnecessary repetitive training, we have trained and evaluated all of these architectures on nine representative graph datasets, recording detailed metrics including train, validation, and test performance in each epoch, the latency, the number of parameters, etc. Based on our proposed benchmark, the performance of GNN architectures can be directly obtained by a look-up table without any further computation, which enables fair, fully reproducible, and efficient comparisons. To demonstrate its usage, we make in-depth analyses of our proposed NAS-Bench-Graph, revealing several interesting findings for GraphNAS. We also showcase how the benchmark can be easily compatible with GraphNAS open libraries such as AutoGL and NNI. To the best of our knowledge, our work is the first benchmark for graph neural architecture search.  ( 3 min )
    PMFL: Partial Meta-Federated Learning for heterogeneous tasks and its applications on real-world medical records
    arXiv:2112.05321v2 Announce Type: replace Abstract: Federated machine learning is a versatile and flexible tool to utilize distributed data from different sources, especially when communication technology develops rapidly and an unprecedented amount of data could be collected on mobile devices nowadays. Federated learning method exploits not only the data but the computational power of all devices in the network to achieve more efficient model training. Nevertheless, while most traditional federated learning methods work well for homogeneous data and tasks, adapting the method to a different heterogeneous data and task distribution is challenging. This limitation has constrained the applications of federated learning in real-world contexts, especially in healthcare settings. Inspired by the fundamental idea of meta-learning, in this study we propose a new algorithm, which is an integration of federated learning and meta-learning, to tackle this issue. In addition, owing to the advantage of transfer learning for model generalization, we further improve our algorithm by introducing partial parameter sharing. We name this method partial meta-federated learning (PMFL). Finally, we apply the algorithms to two medical datasets. We show that our algorithm could obtain the fastest training speed and achieve the best performance when dealing with heterogeneous medical datasets.  ( 3 min )
    Pareto-wise Ranking Classifier for Multi-objective Evolutionary Neural Architecture Search
    arXiv:2109.07582v2 Announce Type: replace Abstract: In the deployment of deep neural models, how to effectively and automatically find feasible deep models under diverse design objectives is fundamental. Most existing neural architecture search (NAS) methods utilize surrogates to predict the detailed performance (e.g., accuracy and model size) of a candidate architecture during the search, which however is complicated and inefficient. In contrast, we aim to learn an efficient Pareto classifier to simplify the search process of NAS by transforming the complex multi-objective NAS task into a simple Pareto-dominance classification task. To this end, we propose a classification-wise Pareto evolution approach for one-shot NAS, where an online classifier is trained to predict the dominance relationship between the candidate and constructed reference architectures, instead of using surrogates to fit the objective functions. The main contribution of this study is to change supernet adaption into a Pareto classifier. Besides, we design two adaptive schemes to select the reference set of architectures for constructing classification boundary and regulate the rate of positive samples over negative ones, respectively. We compare the proposed evolution approach with state-of-the-art approaches on widely-used benchmark datasets, and experimental results indicate that the proposed approach outperforms other approaches and have found a number of neural architectures with different model sizes ranging from 2M to 6M under diverse objectives and constraints.  ( 3 min )
    The CTU Prague Relational Learning Repository
    arXiv:1511.03086v2 Announce Type: replace Abstract: The aim of the Prague Relational Learning Repository is to support machine learning research with multi-relational data. The repository currently contains 148 SQL databases hosted on a public MySQL server located at \url{https://relational-data.org}. The server is provided by getML to support the relational machine learning community (\url{www.getml.com}). A searchable meta-database provides metadata (e.g., the number of tables in the database, the number of rows and columns in the tables, the number of self-relationships).  ( 2 min )
    Upper Counterfactual Confidence Bounds: a New Optimism Principle for Contextual Bandits
    arXiv:2007.07876v4 Announce Type: replace Abstract: The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) often struggle to deal with general function classes and large context spaces. In this paper, we study general contextual bandits with an offline regression oracle and propose a simple, generic principle to design optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). The key innovation of UCCB is building confidence bounds in policy space, rather than in action space as is done in UCB. We demonstrate that these algorithms are provably optimal and computationally efficient in handling general function classes and large context spaces. Furthermore, we illustrate that the UCCB principle can be seamlessly extended to infinite-action general contextual bandits, provide the first solutions to these settings when employing an offline regression oracle.  ( 2 min )
    The pitfalls of next-token prediction
    arXiv:2403.06963v1 Announce Type: cross Abstract: Can a mere next-token predictor faithfully model human intelligence? We crystallize this intuitive concern, which is fragmented in the literature. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures  ( 2 min )
    Accurate Crystal Structure Prediction of New 2D Hybrid Organic Inorganic Perovskites
    arXiv:2403.06955v1 Announce Type: cross Abstract: Low dimensional hybrid organic-inorganic perovskites (HOIPs) represent a promising class of electronically active materials for both light absorption and emission. The design space of HOIPs is extremely large, since a diverse space of organic cations can be combined with different inorganic frameworks. This immense design space allows for tunable electronic and mechanical properties, but also necessitates the development of new tools for in silico high throughput analysis of candidate structures. In this work, we present an accurate, efficient, transferable and widely applicable machine learning interatomic potential (MLIP) for predicting the structure of new 2D HOIPs. Using the MACE architecture, an MLIP is trained on 86 diverse experimentally reported HOIP structures. The model is tested on 73 unseen perovskite compositions, and achieves chemical accuracy with respect to the reference electronic structure method. Our model is then combined with a simple random structure search algorithm to predict the structure of hypothetical HOIPs given only the proposed composition. Success is demonstrated by correctly and reliably recovering the crystal structure of a set of experimentally known 2D perovskites. Such a random structure search is impossible with ab initio methods due to the associated computational cost, but is relatively inexpensive with the MACE potential. Finally, the procedure is used to predict the structure formed by a new organic cation with no previously known corresponding perovskite. Laboratory synthesis of the new hybrid perovskite confirms the accuracy of our prediction. This capability, applied at scale, enables efficient screening of thousands of combinations of organic cations and inorganic layers.  ( 3 min )
    SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
    arXiv:2403.06952v1 Announce Type: cross Abstract: Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, such as incorrect spatial relationship or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging. First, SELMA leverages an LLM's in-context learning capability to generate multiple datasets of text prompts that can teach different skills, and then generates the images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to the new skills by learning multiple single-skill LoRA (low-rank adaptation) experts followed by expert merging. Our independent expert fine-tuning specializes multiple models for different skills, and expert merging helps build a joint multi-skill T2I model that can generate faithful images given diverse text prompts, while mitigating the knowledge conflict from different datasets. We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation. Moreover, fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data. Lastly, we show that fine-tuning with images from a weaker T2I model can help improve the generation quality of a stronger T2I model, suggesting promising weak-to-strong generalization in T2I models.  ( 3 min )
    Grid Monitoring and Protection with Continuous Point-on-Wave Measurements and Generative AI
    arXiv:2403.06942v1 Announce Type: cross Abstract: Purpose This article presents a case for a next-generation grid monitoring and control system, leveraging recent advances in generative artificial intelligence (AI), machine learning, and statistical inference. Advancing beyond earlier generations of wide-area monitoring systems built upon supervisory control and data acquisition (SCADA) and synchrophasor technologies, we argue for a monitoring and control framework based on the streaming of continuous point-on-wave (CPOW) measurements with AI-powered data compression and fault detection. Methods and Results: The architecture of the proposed design originates from the Wiener-Kallianpur innovation representation of a random process that transforms causally a stationary random process into an innovation sequence with independent and identically distributed random variables. This work presents a generative AI approach that (i) learns an innovation autoencoder that extracts innovation sequence from CPOW time series, (ii) compresses the CPOW streaming data with innovation autoencoder and subband coding, and (iii) detects unknown faults and novel trends via nonparametric sequential hypothesis testing. Conclusion: This work argues that conventional monitoring using SCADA and phasor measurement unit (PMU) technologies is ill-suited for a future grid with deep penetration of inverter-based renewable generations and distributed energy resources. A monitoring system based on CPOW data streaming and AI data analytics should be the basic building blocks for situational awareness of a highly dynamic future grid.  ( 2 min )
    Conditional Score-Based Diffusion Model for Cortical Thickness Trajectory Prediction
    arXiv:2403.06940v1 Announce Type: cross Abstract: Alzheimer's Disease (AD) is a neurodegenerative condition characterized by diverse progression rates among individuals, with changes in cortical thickness (CTh) closely linked to its progression. Accurately forecasting CTh trajectories can significantly enhance early diagnosis and intervention strategies, providing timely care. However, the longitudinal data essential for these studies often suffer from temporal sparsity and incompleteness, presenting substantial challenges in modeling the disease's progression accurately. Existing methods are limited, focusing primarily on datasets without missing entries or requiring predefined assumptions about CTh progression. To overcome these obstacles, we propose a conditional score-based diffusion model specifically designed to generate CTh trajectories with the given baseline information, such as age, sex, and initial diagnosis. Our conditional diffusion model utilizes all available data during the training phase to make predictions based solely on baseline information during inference without needing prior history about CTh progression. The prediction accuracy of the proposed CTh prediction pipeline using a conditional score-based model was compared for sub-groups consisting of cognitively normal, mild cognitive impairment, and AD subjects. The Bland-Altman analysis shows our diffusion-based prediction model has a near-zero bias with narrow 95% confidential interval compared to the ground-truth CTh in 6-36 months. In addition, our conditional diffusion model has a stochastic generative nature, therefore, we demonstrated an uncertainty analysis of patient-specific CTh prediction through multiple realizations.  ( 3 min )
    LIBR+: Improving Intraoperative Liver Registration by Learning the Residual of Biomechanics-Based Deformable Registration
    arXiv:2403.06901v1 Announce Type: cross Abstract: The surgical environment imposes unique challenges to the intraoperative registration of organ shapes to their preoperatively-imaged geometry. Biomechanical model-based registration remains popular, while deep learning solutions remain limited due to the sparsity and variability of intraoperative measurements and the limited ground-truth deformation of an organ that can be obtained during the surgery. In this paper, we propose a novel \textit{hybrid} registration approach that leverage a linearized iterative boundary reconstruction (LIBR) method based on linear elastic biomechanics, and use deep neural networks to learn its residual to the ground-truth deformation (LIBR+). We further formulate a dual-branch spline-residual graph convolutional neural network (SR-GCN) to assimilate information from sparse and variable intraoperative measurements and effectively propagate it through the geometry of the 3D organ. Experiments on a large intraoperative liver registration dataset demonstrated the consistent improvements achieved by LIBR+ in comparison to existing rigid, biomechnical model-based non-rigid, and deep-learning based non-rigid approaches to intraoperative liver registration.  ( 2 min )
    Application of Quantum Tensor Networks for Protein Classification
    arXiv:2403.06890v1 Announce Type: cross Abstract: We show that protein sequences can be thought of as sentences in natural language processing and can be parsed using the existing Quantum Natural Language framework into parameterized quantum circuits of reasonable qubits, which can be trained to solve various protein-related machine-learning problems. We classify proteins based on their subcellular locations, a pivotal task in bioinformatics that is key to understanding biological processes and disease mechanisms. Leveraging the quantum-enhanced processing capabilities, we demonstrate that Quantum Tensor Networks (QTN) can effectively handle the complexity and diversity of protein sequences. We present a detailed methodology that adapts QTN architectures to the nuanced requirements of protein data, supported by comprehensive experimental results. We demonstrate two distinct QTNs, inspired by classical recurrent neural networks (RNN) and convolutional neural networks (CNN), to solve the binary classification task mentioned above. Our top-performing quantum model has achieved a 94% accuracy rate, which is comparable to the performance of a classical model that uses the ESM2 protein language model embeddings. It's noteworthy that the ESM2 model is extremely large, containing 8 million parameters in its smallest configuration, whereas our best quantum model requires only around 800 parameters. We demonstrate that these hybrid models exhibit promising performance, showcasing their potential to compete with classical models of similar complexity.  ( 2 min )
    HiRA-Pro: High resolution alignment of multimodal spatio-temporal data: a process physics driven approach
    arXiv:2403.06888v1 Announce Type: cross Abstract: We present HiRA-Pro, a novel procedure to align, at high spatio-temporal resolutions, multimodal signals from real-world processes and systems that exhibit diverse transient, nonlinear stochastic dynamics, such as manufacturing machines. It is based on discerning and synchronizing the process signatures of salient kinematic and dynamic events in these disparate signals. HiRA-Pro addresses the challenge of aligning data with sub-millisecond phenomena, where traditional timestamp, external trigger, or clock-based alignment methods fall short. The effectiveness of HiRA-Pro is demonstrated in a smart manufacturing context, where it aligns data from 13+ channels acquired during 3D-printing and milling operations on an Optomec-LENS MTS 500 hybrid machine. The aligned data is then voxelized to generate 0.25 second aligned data chunks that correspond to physical voxels on the produced part. The superiority of HiRA-Pro is further showcased through case studies in additive manufacturing, demonstrating improved machine learning-based predictive performance due to precise multimodal data alignment. Specifically, testing classification accuracies improved by almost 35% with the application of HiRA-Pro, even with limited data, allowing for precise localization of artifacts. The paper also provides a comprehensive discussion on the proposed method, its applications, and comparative qualitative analysis with a few other alignment methods. HiRA-Pro achieves temporal-spatial resolutions of 10-1000 us and 100 um in order to generate datasets that register with physical voxels on the 3D-printed and milled part. These resolutions are at least an order of magnitude finer than the existing alignment methods that employ individual timestamps, statistical correlations, or common clocks, which achieve precision of hundreds of milliseconds.  ( 3 min )
    Last Iterate Convergence of Incremental Methods and Applications in Continual Learning
    arXiv:2403.06873v1 Announce Type: cross Abstract: Incremental gradient methods and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, when it comes to their convergence guarantees, nonasymptotic (first-order or proximal) oracle complexity bounds have been obtained fairly recently, almost exclusively applying to the average iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights, which can be seen as interpolating between the last iterate and the average iterate guarantees. Additionally, we discuss how our results can be generalized to variants of studied incremental methods with permuted ordering of updates. Our results generalize last iterate guarantees for incremental methods compared to state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.  ( 2 min )
    COOD: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification
    arXiv:2403.06874v1 Announce Type: cross Abstract: High-performing out-of-distribution (OOD) detection, both anomaly and novel class, is an important prerequisite for the practical use of classification models. In this paper, we focus on the species recognition task in images concerned with large databases, a large number of fine-grained hierarchical classes, severe class imbalance, and varying image quality. We propose a framework for combining individual OOD measures into one combined OOD (COOD) measure using a supervised model. The individual measures are several existing state-of-the-art measures and several novel OOD measures developed with novel class detection and hierarchical class structure in mind. COOD was extensively evaluated on three large-scale (500k+ images) biodiversity datasets in the context of anomaly and novel class detection. We show that COOD outperforms individual, including state-of-the-art, OOD measures by a large margin in terms of TPR@1% FPR in the majority of experiments, e.g., improving detecting ImageNet images (OOD) from 54.3% to 85.4% for the iNaturalist 2018 dataset. SHAP (feature contribution) analysis shows that different individual OOD measures are essential for various tasks, indicating that multiple OOD measures and combinations are needed to generalize. Additionally, we show that explicitly considering ID images that are incorrectly classified for the original (species) recognition task is important for constructing high-performing OOD detection methods and for practical applicability. The framework can easily be extended or adapted to other tasks and media modalities.  ( 3 min )
    Are Targeted Messages More Effective?
    arXiv:2403.06817v1 Announce Type: cross Abstract: Graph neural networks (GNN) are deep learning architectures for graphs. Essentially, a GNN is a distributed message passing algorithm, which is controlled by parameters learned from data. It operates on the vertices of a graph: in each iteration, vertices receive a message on each incoming edge, aggregate these messages, and then update their state based on their current state and the aggregated messages. The expressivity of GNNs can be characterised in terms of certain fragments of first-order logic with counting and the Weisfeiler-Lehman algorithm. The core GNN architecture comes in two different versions. In the first version, a message only depends on the state of the source vertex, whereas in the second version it depends on the states of the source and target vertices. In practice, both of these versions are used, but the theory of GNNs so far mostly focused on the first one. On the logical side, the two versions correspond to two fragments of first-order logic with counting that we call modal and guarded. The question whether the two versions differ in their expressivity has been mostly overlooked in the GNN literature and has only been asked recently (Grohe, LICS'23). We answer this question here. It turns out that the answer is not as straightforward as one might expect. By proving that the modal and guarded fragment of first-order logic with counting have the same expressivity over labelled undirected graphs, we show that in a non-uniform setting the two GNN versions have the same expressivity. However, we also prove that in a uniform setting the second version is strictly more expressive.  ( 3 min )
    Efficient first-order algorithms for large-scale, non-smooth maximum entropy models with application to wildfire science
    arXiv:2403.06816v1 Announce Type: cross Abstract: Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets; these algorithms either rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical Maxent models lack. In this paper, we present novel optimization algorithms that overcome the shortcomings of state-of-the-art algorithms for training large-scale, non-smooth Maxent models. Our proposed first-order algorithms leverage the Kullback-Leibler divergence to train large-scale and non-smooth Maxent models efficiently. For Maxent models with discrete probability distribution of $n$ elements built from samples, each containing $m$ features, the stepsize parameters estimation and iterations in our algorithms scale on the order of $O(mn)$ operations and can be trivially parallelized. Moreover, the strong $\ell_{1}$ convexity of the Kullback--Leibler divergence allows for larger stepsize parameters, thereby speeding up the convergence rate of our algorithms. To illustrate the efficiency of our novel algorithms, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the arts by one order of magnitude and yield results that agree with physical models of wildfire occurrence and previous statistical analyses of wildfire drivers.  ( 3 min )
    Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation
    arXiv:2403.06759v1 Announce Type: cross Abstract: Deep neural networks for medical image segmentation often produce overconfident results misaligned with empirical observations. Such miscalibration, challenges their clinical translation. We propose to use marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss function to improve pixel-wise calibration without compromising segmentation quality. We show that this loss, despite using hard binning, is directly differentiable, bypassing the need for approximate but differentiable surrogate or soft binning approaches. Our work also introduces the concept of dataset reliability histograms which generalises standard reliability diagrams for refined visual assessment of calibration in semantic segmentation aggregated at the dataset level. Using mL1-ACE, we reduce average and maximum calibration error by 45% and 55% respectively, maintaining a Dice score of 87% on the BraTS 2021 dataset. We share our code here: https://github.com/cai4cai/ACE-DLIRIS  ( 2 min )
    ALaRM: Align Language Models via Hierarchical Rewards Modeling
    arXiv:2403.06754v1 Announce Type: cross Abstract: We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework addresses the limitations of current alignment approaches, which often struggle with the inconsistency and sparsity of human supervision signals, by integrating holistic rewards with aspect-specific rewards. This integration enables more precise and consistent guidance of language models towards desired outcomes, particularly in complex and open text generation tasks. By employing a methodology that filters and combines multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment. We validate our approach through applications in long-form question answering and machine translation tasks, employing gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training processes for better human preference alignment. We release our code at https://ALaRM-fdu.github.io.  ( 2 min )
    Shortcut Learning in Medical Image Segmentation
    arXiv:2403.06748v1 Announce Type: cross Abstract: Shortcut learning is a phenomenon where machine learning models prioritize learning simple, potentially misleading cues from data that do not generalize well beyond the training set. While existing research primarily investigates this in the realm of image classification, this study extends the exploration of shortcut learning into medical image segmentation. We demonstrate that clinical annotations such as calipers, and the combination of zero-padded convolutions and center-cropped training sets in the dataset can inadvertently serve as shortcuts, impacting segmentation accuracy. We identify and evaluate the shortcut learning on two different but common medical image segmentation tasks. In addition, we suggest strategies to mitigate the influence of shortcut learning and improve the generalizability of the segmentation models. By uncovering the presence and implications of shortcuts in medical image segmentation, we provide insights and methodologies for evaluating and overcoming this pervasive challenge and call for attention in the community for shortcuts in segmentation.  ( 2 min )
    Generalising Multi-Agent Cooperation through Task-Agnostic Communication
    arXiv:2403.06750v1 Announce Type: cross Abstract: Existing communication methods for multi-agent reinforcement learning (MARL) in cooperative multi-robot problems are almost exclusively task-specific, training new communication strategies for each unique task. We address this inefficiency by introducing a communication strategy applicable to any task within a given environment. We pre-train the communication strategy without task-specific reward guidance in a self-supervised manner using a set autoencoder. Our objective is to learn a fixed-size latent Markov state from a variable number of agent observations. Under mild assumptions, we prove that policies using our latent representations are guaranteed to converge, and upper bound the value error introduced by our Markov state approximation. Our method enables seamless adaptation to novel tasks without fine-tuning the communication strategy, gracefully supports scaling to more agents than present during training, and detects out-of-distribution events in an environment. Empirical results on diverse MARL scenarios validate the effectiveness of our approach, surpassing task-specific communication strategies in unseen tasks. Our implementation of this work is available at https://github.com/proroklab/task-agnostic-comms.  ( 2 min )
    On the Approximation of Kernel functions
    arXiv:2403.06731v1 Announce Type: cross Abstract: Various methods in statistical learning build on kernels considered in reproducing kernel Hilbert spaces. In applications, the kernel is often selected based on characteristics of the problem and the data. This kernel is then employed to infer response variables at points, where no explanatory data were observed. The data considered here are located in compact sets in higher dimensions and the paper addresses approximations of the kernel itself. The new approach considers Taylor series approximations of radial kernel functions. For the Gauss kernel on the unit cube, the paper establishes an upper bound of the associated eigenfunctions, which grows only polynomially with respect to the index. The novel approach substantiates smaller regularization parameters than considered in the literature, overall leading to better approximations. This improvement confirms low rank approximation methods such as the Nystr\"om method.  ( 2 min )
    Improving Low-Resource Knowledge Tracing Tasks by Supervised Pre-training and Importance Mechanism Fine-tuning
    arXiv:2403.06725v1 Announce Type: cross Abstract: Knowledge tracing (KT) aims to estimate student's knowledge mastery based on their historical interactions. Recently, the deep learning based KT (DLKT) approaches have achieved impressive performance in the KT task. These DLKT models heavily rely on the large number of available student interactions. However, due to various reasons such as budget constraints and privacy concerns, observed interactions are very limited in many real-world scenarios, a.k.a, low-resource KT datasets. Directly training a DLKT model on a low-resource KT dataset may lead to overfitting and it is difficult to choose the appropriate deep neural architecture. Therefore, in this paper, we propose a low-resource KT framework called LoReKT to address above challenges. Inspired by the prevalent "pre-training and fine-tuning" paradigm, we aim to learn transferable parameters and representations from rich-resource KT datasets during the pre-training stage and subsequently facilitate effective adaptation to low-resource KT datasets. Specifically, we simplify existing sophisticated DLKT model architectures with purely a stack of transformer decoders. We design an encoding mechanism to incorporate student interactions from multiple KT data sources and develop an importance mechanism to prioritize updating parameters with high importance while constraining less important ones during the fine-tuning stage. We evaluate LoReKT on six public KT datasets and experimental results demonstrate the superiority of our approach in terms of AUC and Accuracy. To encourage reproducible research, we make our data and code publicly available at https://anonymous.4open.science/r/LoReKT-C619.  ( 3 min )
    Untangling Gaussian Mixtures
    arXiv:2403.06671v1 Announce Type: cross Abstract: Tangles were originally introduced as a concept to formalize regions of high connectivity in graphs. In recent years, they have also been discovered as a link between structural graph theory and data science: when interpreting similarity in data sets as connectivity between points, finding clusters in the data essentially amounts to finding tangles in the underlying graphs. This paper further explores the potential of tangles in data sets as a means for a formal study of clusters. Real-world data often follow a normal distribution. Accounting for this, we develop a quantitative theory of tangles in data sets drawn from Gaussian mixtures. To this end, we equip the data with a graph structure that models similarity between the points and allows us to apply tangle theory to the data. We provide explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely. This can be considered as a sufficient formal criterion for the separabability of clusters in the data.  ( 2 min )
    Provable Mutual Benefits from Federated Learning in Privacy-Sensitive Domains
    arXiv:2403.06672v1 Announce Type: cross Abstract: Cross-silo federated learning (FL) allows data owners to train accurate machine learning models by benefiting from each others private datasets. Unfortunately, the model accuracy benefits of collaboration are often undermined by privacy defenses. Therefore, to incentivize client participation in privacy-sensitive domains, a FL protocol should strike a delicate balance between privacy guarantees and end-model accuracy. In this paper, we study the question of when and how a server could design a FL protocol provably beneficial for all participants. First, we provide necessary and sufficient conditions for the existence of mutually beneficial protocols in the context of mean estimation and convex stochastic optimization. We also derive protocols that maximize the total clients' utility, given symmetric privacy preferences. Finally, we design protocols maximizing end-model accuracy and demonstrate their benefits in synthetic experiments.  ( 2 min )
    Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
    arXiv:2403.06664v1 Announce Type: cross Abstract: The recent huge advance of Large Language Models (LLMs) is mainly driven by the increase in the number of parameters. This has led to substantial memory capacity requirements, necessitating the use of dozens of GPUs just to meet the capacity. One popular solution to this is storage-offloaded training, which uses host memory and storage as an extended memory hierarchy. However, this obviously comes at the cost of storage bandwidth bottleneck because storage devices have orders of magnitude lower bandwidth compared to that of GPU device memories. Our work, Smart-Infinity, addresses the storage bandwidth bottleneck of storage-offloaded LLM training using near-storage processing devices on a real system. The main component of Smart-Infinity is SmartUpdate, which performs parameter updates on custom near-storage accelerators. We identify that moving parameter updates to the storage side removes most of the storage traffic. In addition, we propose an efficient data transfer handler structure to address the system integration issues for Smart-Infinity. The handler allows overlapping data transfers with fixed memory consumption by reusing the device buffer. Lastly, we propose accelerator-assisted gradient compression/decompression to enhance the scalability of Smart-Infinity. When scaling to multiple near-storage processing devices, the write traffic on the shared channel becomes the bottleneck. To alleviate this, we compress the gradients on the GPU and decompress them on the accelerators. It provides further acceleration from reduced traffic. As a result, Smart-Infinity achieves a significant speedup compared to the baseline. Notably, Smart-Infinity is a ready-to-use approach that is fully integrated into PyTorch on a real system. We will open-source Smart-Infinity to facilitate its use.  ( 3 min )
    Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement
    arXiv:2403.06659v1 Announce Type: cross Abstract: Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) methods show promise in representation learning from unannotated ECG data, they often overlook the clinical knowledge that can be found in reports. This oversight and the requirement for annotated samples for downstream tasks limit eSSL's versatility. In this work, we address these issues with the Multimodal ECG Representation Learning (MERL}) framework. Through multimodal learning on ECG records and associated reports, MERL is capable of performing zero-shot ECG classification with text prompts, eliminating the need for training data in downstream tasks. At test time, we propose the Clinical Knowledge Enhanced Prompt Engineering (CKEPE) approach, which uses Large Language Models (LLMs) to exploit external expert-verified clinical knowledge databases, generating more descriptive prompts and reducing hallucinations in LLM-generated content to boost zero-shot classification. Based on MERL, we perform the first benchmark across six public ECG datasets, showing the superior performance of MERL compared against eSSL methods. Notably, MERL achieves an average AUC score of 75.2% in zero-shot classification (without training data), 3.2% higher than linear probed eSSL methods with 10\% annotated training data, averaged across all six datasets.  ( 2 min )
    Pulling back symmetric Riemannian geometry for data analysis
    arXiv:2403.06612v1 Announce Type: cross Abstract: Data sets tend to live in low-dimensional non-linear subspaces. Ideal data analysis tools for such data sets should therefore account for such non-linear geometry. The symmetric Riemannian geometry setting can be suitable for a variety of reasons. First, it comes with a rich mathematical structure to account for a wide range of non-linear geometries that has been shown to be able to capture the data geometry through empirical evidence from classical non-linear embedding. Second, many standard data analysis tools initially developed for data in Euclidean space can also be generalised efficiently to data on a symmetric Riemannian manifold. A conceptual challenge comes from the lack of guidelines for constructing a symmetric Riemannian structure on the data space itself and the lack of guidelines for modifying successful algorithms on symmetric Riemannian manifolds for data analysis to this setting. This work considers these challenges in the setting of pullback Riemannian geometry through a diffeomorphism. The first part of the paper characterises diffeomorphisms that result in proper, stable and efficient data analysis. The second part then uses these best practices to guide construction of such diffeomorphisms through deep learning. As a proof of concept, different types of pullback geometries -- among which the proposed construction -- are tested on several data analysis tasks and on several toy data sets. The numerical experiments confirm the predictions from theory, i.e., that the diffeomorphisms generating the pullback geometry need to map the data manifold into a geodesic subspace of the pulled back Riemannian manifold while preserving local isometry around the data manifold for proper, stable and efficient data analysis, and that pulling back positive curvature can be problematic in terms of stability.  ( 3 min )
    Ricci flow-based brain surface covariance descriptors for Alzheimer disease
    arXiv:2403.06645v1 Announce Type: cross Abstract: Automated feature extraction from MRI brain scans and diagnosis of Alzheimer's disease are ongoing challenges. With advances in 3D imaging technology, 3D data acquisition is becoming more viable and efficient than its 2D counterpart. Rather than using feature-based vectors, in this paper, for the first time, we suggest a pipeline to extract novel covariance-based descriptors from the cortical surface using the Ricci energy optimization. The covariance descriptors are components of the nonlinear manifold of symmetric positive-definite matrices, thus we focus on using the Gaussian radial basis function to apply manifold-based classification to the 3D shape problem. Applying this novel signature to the analysis of abnormal cortical brain morphometry allows for diagnosing Alzheimer's disease. Experimental studies performed on about two hundred 3D MRI brain models, gathered from Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate the effectiveness of our descriptors in achieving remarkable classification accuracy.  ( 2 min )
    Distributionally Generative Augmentation for Fair Facial Attribute Classification
    arXiv:2403.06606v1 Announce Type: cross Abstract: Facial Attribute Classification (FAC) holds substantial promise in widespread applications. However, FAC models trained by traditional methodologies can be unfair by exhibiting accuracy inconsistencies across varied data subpopulations. This unfairness is largely attributed to bias in data, where some spurious attributes (e.g., Male) statistically correlate with the target attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the labels of spurious attributes, which may be unavailable in practice. This work proposes a novel, generation-based two-stage framework to train a fair FAC model on biased data without additional annotation. Initially, we identify the potential spurious attributes based on generative models. Notably, it enhances interpretability by explicitly showing the spurious attributes in image space. Following this, for each image, we first edit the spurious attributes with a random degree sampled from a uniform distribution, while keeping target attribute unchanged. Then we train a fair FAC model by fostering model invariance to these augmentation. Extensive experiments on three common datasets demonstrate the effectiveness of our method in promoting fairness in FAC without compromising accuracy. Codes are in https://github.com/heqianpei/DiGA.  ( 2 min )
    Data-driven architecture to encode information in the kinematics of robots and artificial avatars
    arXiv:2403.06557v1 Announce Type: cross Abstract: We present a data-driven control architecture for modifying the kinematics of robots and artificial avatars to encode specific information such as the presence or not of an emotion in the movements of an avatar or robot driven by a human operator. We validate our approach on an experimental dataset obtained during the reach-to-grasp phase of a pick-and-place task.  ( 2 min )
    ReStainGAN: Leveraging IHC to IF Stain Domain Translation for in-silico Data Generation
    arXiv:2403.06545v1 Announce Type: cross Abstract: The creation of in-silico datasets can expand the utility of existing annotations to new domains with different staining patterns in computational pathology. As such, it has the potential to significantly lower the cost associated with building large and pixel precise datasets needed to train supervised deep learning models. We propose a novel approach for the generation of in-silico immunohistochemistry (IHC) images by disentangling morphology specific IHC stains into separate image channels in immunofluorescence (IF) images. The proposed approach qualitatively and quantitatively outperforms baseline methods as proven by training nucleus segmentation models on the created in-silico datasets.  ( 2 min )
    OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation
    arXiv:2403.06546v1 Announce Type: cross Abstract: Unsupervised Semantic Segmentation (USS) involves segmenting images without relying on predefined labels, aiming to alleviate the burden of extensive human labeling. Existing methods utilize features generated by self-supervised models and specific priors for clustering. However, their clustering objectives are not involved in the optimization of the features during training. Additionally, due to the lack of clear class definitions in USS, the resulting segments may not align well with the clustering objective. In this paper, we introduce a novel approach called Optimally Matched Hierarchy (OMH) to simultaneously address the above issues. The core of our method lies in imposing structured sparsity on the feature space, which allows the features to encode information with different levels of granularity. The structure of this sparsity stems from our hierarchy (OMH). To achieve this, we learn a soft but sparse hierarchy among parallel clusters through Optimal Transport. Our OMH yields better unsupervised segmentation performance compared to existing USS methods. Our extensive experiments demonstrate the benefits of OMH when utilizing our differentiable paradigm. We will make our code publicly available.  ( 2 min )
    SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection
    arXiv:2403.06534v1 Announce Type: cross Abstract: Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between the pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code is available at https://github.com/zcablii/SARDet_100K.  ( 3 min )
    Automatic Generation of Python Programs Using Context-Free Grammars
    arXiv:2403.06503v1 Announce Type: cross Abstract: In recent years, data has emerged as the new gold, serving as a powerful tool for creating intelligent systems. However, procuring high-quality data remains challenging, especially for code. To address this, we developed TinyPy Generator, a tool that generates random Python programs using a context-free grammar. The generated programs are guaranteed to be correct by construction. Our system uses custom production rules (in the Backus-Naur Form (BNF) format) to recursively generate code. This allows us to generate code with different levels of complexity, ranging from code containing only assignments to more complex code containing conditionals and loops. Our proposed tool enables effortless large-scale Python code generation, beneficial for a wide range of applications. TinyPy Generator is particularly useful in the field of machine learning, where it can generate substantial amounts of Python code for training Python language models. Additionally, researchers who are studying programming languages can utilize this tool to create datasets for their experiments, which can help validate the robustness of code interpreters or compilers. Unlike existing research, we have open-sourced our implementation. This allows customization according to user needs and extends potential usage to other languages.  ( 2 min )
    Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
    arXiv:2403.06485v1 Announce Type: cross Abstract: Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.  ( 3 min )
    Detection of Unobserved Common Causes based on NML Code in Discrete, Mixed, and Continuous Variables
    arXiv:2403.06499v1 Announce Type: cross Abstract: Causal discovery in the presence of unobserved common causes from observational data only is a crucial but challenging problem. We categorize all possible causal relationships between two random variables into the following four categories and aim to identify one from observed data: two cases in which either of the direct causality exists, a case that variables are independent, and a case that variables are confounded by latent confounders. Although existing methods have been proposed to tackle this problem, they require unobserved variables to satisfy assumptions on the form of their equation models. In our previous study (Kobayashi et al., 2022), the first causal discovery method without such assumptions is proposed for discrete data and named CLOUD. Using Normalized Maximum Likelihood (NML) Code, CLOUD selects a model that yields the minimum codelength of the observed data from a set of model candidates. This paper extends CLOUD to apply for various data types across discrete, mixed, and continuous. We not only performed theoretical analysis to show the consistency of CLOUD in terms of the model selection, but also demonstrated that CLOUD is more effective than existing methods in inferring causal relationships by extensive experiments on both synthetic and real-world data.  ( 3 min )
    Financial Default Prediction via Motif-preserving Graph Neural Network with Curriculum Learning
    arXiv:2403.06482v1 Announce Type: cross Abstract: User financial default prediction plays a critical role in credit risk forecasting and management. It aims at predicting the probability that the user will fail to make the repayments in the future. Previous methods mainly extract a set of user individual features regarding his own profiles and behaviors and build a binary-classification model to make default predictions. However, these methods cannot get satisfied results, especially for users with limited information. Although recent efforts suggest that default prediction can be improved by social relations, they fail to capture the higher-order topology structure at the level of small subgraph patterns. In this paper, we fill in this gap by proposing a motif-preserving Graph Neural Network with curriculum learning (MotifGNN) to jointly learn the lower-order structures from the original graph and higherorder structures from multi-view motif-based graphs for financial default prediction. Specifically, to solve the problem of weak connectivity in motif-based graphs, we design the motif-based gating mechanism. It utilizes the information learned from the original graph with good connectivity to strengthen the learning of the higher-order structure. And considering that the motif patterns of different samples are highly unbalanced, we propose a curriculum learning mechanism on the whole learning process to more focus on the samples with uncommon motif distributions. Extensive experiments on one public dataset and two industrial datasets all demonstrate the effectiveness of our proposed method.  ( 3 min )
    A Survey of Learned Indexes for the Multi-dimensional Space
    arXiv:2403.06456v1 Announce Type: cross Abstract: A recent research trend involves treating database index structures as Machine Learning (ML) models. In this domain, single or multiple ML models are trained to learn the mapping from keys to positions inside a data set. This class of indexes is known as "Learned Indexes." Learned indexes have demonstrated improved search performance and reduced space requirements for one-dimensional data. The concept of one-dimensional learned indexes has naturally been extended to multi-dimensional (e.g., spatial) data, leading to the development of "Learned Multi-dimensional Indexes". This survey focuses on learned multi-dimensional index structures. Specifically, it reviews the current state of this research area, explains the core concepts behind each proposed method, and classifies these methods based on several well-defined criteria. We present a taxonomy that classifies and categorizes each learned multi-dimensional index, and survey the existing literature on learned multi-dimensional indexes according to this taxonomy. Additionally, we present a timeline to illustrate the evolution of research on learned indexes. Finally, we highlight several open challenges and future research directions in this emerging and highly active field.  ( 2 min )
    RLingua: Improving Reinforcement Learning Sample Efficiency in Robotic Manipulations With Large Language Models
    arXiv:2403.06420v1 Announce Type: cross Abstract: Reinforcement learning (RL) has demonstrated its capability in solving various tasks but is notorious for its low sample efficiency. In this paper, we propose RLingua, a framework that can leverage the internal knowledge of large language models (LLMs) to reduce the sample complexity of RL in robotic manipulations. To this end, we first present how to extract the prior knowledge of LLMs by prompt engineering so that a preliminary rule-based robot controller for a specific task can be generated. Despite being imperfect, the LLM-generated robot controller is utilized to produce action samples during rollouts with a decaying probability, thereby improving RL's sample efficiency. We employ the actor-critic framework and modify the actor loss to regularize the policy learning towards the LLM-generated controller. RLingua also provides a novel method of improving the imperfect LLM-generated robot controllers by RL. We demonstrated that RLingua can significantly reduce the sample complexity of TD3 in the robot tasks of panda_gym and achieve high success rates in sparsely rewarded robot tasks in RLBench, where the standard TD3 fails. Additionally, We validated RLingua's effectiveness in real-world robot experiments through Sim2Real, demonstrating that the learned policies are effectively transferable to real robot tasks. Further details and videos about our work are available at our project website https://rlingua.github.io.  ( 2 min )
    Bridging Domains with Approximately Shared Features
    arXiv:2403.06424v1 Announce Type: cross Abstract: Multi-source domain adaptation aims to reduce performance degradation when applying machine learning models to unseen domains. A fundamental challenge is devising the optimal strategy for feature selection. Existing literature is somewhat paradoxical: some advocate for learning invariant features from source domains, while others favor more diverse features. To address the challenge, we propose a statistical framework that distinguishes the utilities of features based on the variance of their correlation to label $y$ across domains. Under our framework, we design and analyze a learning procedure consisting of learning approximately shared feature representation from source tasks and fine-tuning it on the target task. Our theoretical analysis necessitates the importance of learning approximately shared features instead of only the strictly invariant features and yields an improved population risk compared to previous results on both source and target tasks, thus partly resolving the paradox mentioned above. Inspired by our theory, we proposed a more practical way to isolate the content (invariant+approximately shared) from environmental features and further consolidate our theoretical findings.  ( 2 min )
    Cosine Scoring with Uncertainty for Neural Speaker Embedding
    arXiv:2403.06404v1 Announce Type: cross Abstract: Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While the conventional cosine-scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation. It achieved improvement with 8.5% and 9.8% average reductions in EER and minDCF compared to the conventional cosine similarity. It is also computationally efficient in practice.  ( 2 min )
    'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification
    arXiv:2403.06402v1 Announce Type: cross Abstract: Predictive models in natural language processing (NLP) have evolved from training models from scratch to fine-tuning pre-trained models with labelled data. An extreme form of this fine-tuning involves in-context learning (ICL), where the output of a pre-trained generative model (frozen decoder parameters) is controlled only with variations in the input strings (called instructions or prompts). An important component of ICL is the use of a small number of labelled data instances as examples in the prompt. While existing work uses a static number of examples during inference for each data instance, in this paper we propose a novel methodology of dynamically adapting the number of examples as per the data. This is analogous to the use of a variable-sized neighborhood in k-nearest neighbors (k-NN) classifier. In our proposed workflow of adaptive ICL (AICL), the number of demonstrations to employ during the inference on a particular data instance is predicted by the Softmax posteriors of a classifier. The parameters of this classifier are fitted on the optimal number of examples in ICL required to correctly infer the label of each instance in the training set with the hypothesis that a test instance that is similar to a training instance should use the same (or a closely matching) number of few-shot examples. Our experiments show that our AICL method results in improvement in text classification task on several standard datasets.  ( 3 min )
    Pre-Trained Model Recommendation for Downstream Fine-tuning
    arXiv:2403.06382v1 Announce Type: cross Abstract: As a fundamental problem in transfer learning, model selection aims to rank off-the-shelf pre-trained models and select the most suitable one for the new target task. Existing model selection techniques are often constrained in their scope and tend to overlook the nuanced relationships between models and tasks. In this paper, we present a pragmatic framework \textbf{Fennec}, delving into a diverse, large-scale model repository while meticulously considering the intricate connections between tasks and models. The key insight is to map all models and historical tasks into a transfer-related subspace, where the distance between model vectors and task vectors represents the magnitude of transferability. A large vision model, as a proxy, infers a new task's representation in the transfer space, thereby circumventing the computational burden of extensive forward passes. We also investigate the impact of the inherent inductive bias of models on transfer results and propose a novel method called \textbf{archi2vec} to encode the intricate structures of models. The transfer score is computed through straightforward vector arithmetic with a time complexity of $\mathcal{O}(1)$. Finally, we make a substantial contribution to the field by releasing a comprehensive benchmark. We validate the effectiveness of our framework through rigorous testing on two benchmarks. The benchmark and the code will be publicly available in the near future.  ( 2 min )
    A Zero Trust Framework for Realization and Defense Against Generative AI Attacks in Power Grid
    arXiv:2403.06388v1 Announce Type: cross Abstract: Understanding the potential of generative AI (GenAI)-based attacks on the power grid is a fundamental challenge that must be addressed in order to protect the power grid by realizing and validating risk in new attack vectors. In this paper, a novel zero trust framework for a power grid supply chain (PGSC) is proposed. This framework facilitates early detection of potential GenAI-driven attack vectors (e.g., replay and protocol-type attacks), assessment of tail risk-based stability measures, and mitigation of such threats. First, a new zero trust system model of PGSC is designed and formulated as a zero-trust problem that seeks to guarantee for a stable PGSC by realizing and defending against GenAI-driven cyber attacks. Second, in which a domain-specific generative adversarial networks (GAN)-based attack generation mechanism is developed to create a new vulnerability cyberspace for further understanding that threat. Third, tail-based risk realization metrics are developed and implemented for quantifying the extreme risk of a potential attack while leveraging a trust measurement approach for continuous validation. Fourth, an ensemble learning-based bootstrap aggregation scheme is devised to detect the attacks that are generating synthetic identities with convincing user and distributed energy resources device profiles. Experimental results show the efficacy of the proposed zero trust framework that achieves an accuracy of 95.7% on attack vector generation, a risk measure of 9.61% for a 95% stable PGSC, and a 99% confidence in defense against GenAI-driven attack.  ( 3 min )
    Separable Physics-informed Neural Networks for Solving the BGK Model of the Boltzmann Equation
    arXiv:2403.06342v1 Announce Type: cross Abstract: In this study, we introduce a method based on Separable Physics-Informed Neural Networks (SPINNs) for effectively solving the BGK model of the Boltzmann equation. While the mesh-free nature of PINNs offers significant advantages in handling high-dimensional partial differential equations (PDEs), challenges arise when applying quadrature rules for accurate integral evaluation in the BGK operator, which can compromise the mesh-free benefit and increase computational costs. To address this, we leverage the canonical polyadic decomposition structure of SPINNs and the linear nature of moment calculation, achieving a substantial reduction in computational expense for quadrature rule application. The multi-scale nature of the particle density function poses difficulties in precisely approximating macroscopic moments using neural networks. To improve SPINN training, we introduce the integration of Gaussian functions into SPINNs, coupled with a relative loss approach. This modification enables SPINNs to decay as rapidly as Maxwellian distributions, thereby enhancing the accuracy of macroscopic moment approximations. The relative loss design further ensures that both large and small-scale features are effectively captured by the SPINNs. The efficacy of our approach is demonstrated through a series of five numerical experiments, including the solution to a challenging 3D Riemann problem. These results highlight the potential of our novel method in efficiently and accurately addressing complex challenges in computational physics.  ( 2 min )
    Disentangling shared and private latent factors in multimodal Variational Autoencoders
    arXiv:2403.06338v1 Announce Type: cross Abstract: Generative models for multimodal data permit the identification of latent factors that may be associated with important determinants of observed data heterogeneity. Common or shared factors could be important for explaining variation across modalities whereas other factors may be private and important only for the explanation of a single modality. Multimodal Variational Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those underlying latent factors and separating shared variation from private. In this work, we investigate their capability to reliably perform this disentanglement. In particular, we highlight a challenging problem setting where modality-specific variation dominates the shared signal. Taking a cross-modal prediction perspective, we demonstrate limitations of existing models, and propose a modification how to make them more robust to modality-specific variation. Our findings are supported by experiments on synthetic as well as various real-world multi-omics data sets.  ( 2 min )
    A Study on Domain Generalization for Failure Detection through Human Reactions in HRI
    arXiv:2403.06315v1 Announce Type: cross Abstract: Machine learning models are commonly tested in-distribution (same dataset); performance almost always drops in out-of-distribution settings. For HRI research, the goal is often to develop generalized models. This makes domain generalization - retaining performance in different settings - a critical issue. In this study, we present a concise analysis of domain generalization in failure detection models trained on human facial expressions. Using two distinct datasets of humans reacting to videos where error occurs, one from a controlled lab setting and another collected online, we trained deep learning models on each dataset. When testing these models on the alternate dataset, we observed a significant performance drop. We reflect on the causes for the observed model behavior and leave recommendations. This work emphasizes the need for HRI research focusing on improving model robustness and real-life applicability.  ( 2 min )
    UNICORN: Ultrasound Nakagami Imaging via Score Matching and Adaptation
    arXiv:2403.06275v1 Announce Type: cross Abstract: Nakagami imaging holds promise for visualizing and quantifying tissue scattering in ultrasound waves, with potential applications in tumor diagnosis and fat fraction estimation which are challenging to discern by conventional ultrasound B-mode images. Existing methods struggle with optimal window size selection and suffer from estimator instability, leading to degraded resolution images. To address this, here we propose a novel method called UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), that offers an accurate, closed-form estimator for Nakagami parameter estimation in terms of the score function of ultrasonic envelope. Extensive experiments using simulation and real ultrasound RF data demonstrate UNICORN's superiority over conventional approaches in accuracy and resolution quality.  ( 2 min )
    MACE: Mass Concept Erasure in Diffusion Models
    arXiv:2403.06135v1 Announce Type: cross Abstract: The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at https://github.com/Shilin-LU/MACE.  ( 2 min )
    Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment
    arXiv:2403.06100v1 Announce Type: cross Abstract: A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge combinations of pairs prohibit it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and allocation of evaluation volumes with online learning in a crowdsourcing environment. We use a preference-based online learning method based on a sorting algorithm to identify the total order of evaluation targets with minimum sample volumes. Our online learning algorithm supports parallel and asynchronous execution under fixed-budget conditions required for crowdsourcing. Our experiment on preference-based subjective evaluation of synthetic speech shows that our method successfully optimizes the test by reducing pair combinations from 351 to 83 and allocating optimal evaluation volumes for each pair ranging from 30 to 663 without compromising evaluation accuracies and wasting budget allocations.  ( 2 min )
    Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation Models
    arXiv:2403.06088v1 Announce Type: cross Abstract: In the burgeoning field of intelligent transportation systems, enhancing vehicle-driver interaction through facial attribute recognition, such as facial expression, eye gaze, age, etc., is of paramount importance for safety, personalization, and overall user experience. However, the scarcity of comprehensive large-scale, real-world datasets poses a significant challenge for training robust multi-task models. Existing literature often overlooks the potential of synthetic datasets and the comparative efficacy of state-of-the-art vision foundation models in such constrained settings. This paper addresses these gaps by investigating the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of passengers of a vehicle, such as gaze plane, age, and facial expression. Utilizing transfer learning techniques with both pre-trained Vision Transformer (ViT) and Residual Network (ResNet) models, we explore various training and adaptation methods to optimize performance, particularly when data availability is limited. We provide extensive post-evaluation analysis, investigating the effects of synthetic data distributions on model performance in in-distribution data and out-of-distribution inference. Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViTs in our specific multi-task context, which is attributed to the mismatch in model complexity relative to task complexity. Our results highlight the challenges and opportunities for enhancing the use of synthetic data and vision foundation models in practical applications.  ( 3 min )
    CausalCellSegmenter: Causal Inference inspired Diversified Aggregation Convolution for Pathology Image Segmentation
    arXiv:2403.06066v1 Announce Type: cross Abstract: Deep learning models have shown promising performance for cell nucleus segmentation in the field of pathology image analysis. However, training a robust model from multiple domains remains a great challenge for cell nucleus segmentation. Additionally, the shortcomings of background noise, highly overlapping between cell nucleus, and blurred edges often lead to poor performance. To address these challenges, we propose a novel framework termed CausalCellSegmenter, which combines Causal Inference Module (CIM) with Diversified Aggregation Convolution (DAC) techniques. The DAC module is designed which incorporates diverse downsampling features through a simple, parameter-free attention module (SimAM), aiming to overcome the problems of false-positive identification and edge blurring. Furthermore, we introduce CIM to leverage sample weighting by directly removing the spurious correlations between features for every input sample and concentrating more on the correlation between features and labels. Extensive experiments on the MoNuSeg-2018 dataset achieves promising results, outperforming other state-of-the-art methods, where the mIoU and DSC scores growing by 3.6% and 2.65%.  ( 2 min )
    Ensemble Language Models for Multilingual Sentiment Analysis
    arXiv:2403.06060v1 Announce Type: cross Abstract: The rapid advancement of social media enables us to analyze user opinions. In recent times, sentiment analysis has shown a prominent research gap in understanding human sentiment based on the content shared on social media. Although sentiment analysis for commonly spoken languages has advanced significantly, low-resource languages like Arabic continue to get little research due to resource limitations. In this study, we explore sentiment analysis on tweet texts from SemEval-17 and the Arabic Sentiment Tweet dataset. Moreover, We investigated four pretrained language models and proposed two ensemble language models. Our findings include monolingual models exhibiting superior performance and ensemble models outperforming the baseline while the majority voting ensemble outperforms the English language.  ( 2 min )
    Absence of spurious solutions far from ground truth: A low-rank analysis with high-order losses
    arXiv:2403.06056v1 Announce Type: cross Abstract: Matrix sensing problems exhibit pervasive non-convexity, plaguing optimization with a proliferation of suboptimal spurious solutions. Avoiding convergence to these critical points poses a major challenge. This work provides new theoretical insights that help demystify the intricacies of the non-convex landscape. In this work, we prove that under certain conditions, critical points sufficiently distant from the ground truth matrix exhibit favorable geometry by being strict saddle points rather than troublesome local minima. Moreover, we introduce the notion of higher-order losses for the matrix sensing problem and show that the incorporation of such losses into the objective function amplifies the negative curvature around those distant critical points. This implies that increasing the complexity of the objective function via high-order losses accelerates the escape from such critical points and acts as a desirable alternative to increasing the complexity of the optimization problem via over-parametrization. By elucidating key characteristics of the non-convex optimization landscape, this work makes progress towards a comprehensive framework for tackling broader machine learning objectives plagued by non-convexity.  ( 2 min )
    Texture image retrieval using a classification and contourlet-based features
    arXiv:2403.06048v1 Announce Type: cross Abstract: In this paper, we propose a new framework for improving Content Based Image Retrieval (CBIR) for texture images. This is achieved by using a new image representation based on the RCT-Plus transform which is a novel variant of the Redundant Contourlet transform that extracts a richer directional information in the image. Moreover, the process of image search is improved through a learning-based approach where the images of the database are classified using an adapted similarity metric to the statistical modeling of the RCT-Plus transform. A query is then first classified to select the best texture class after which the retained class images are ranked to select top ones. By this, we have achieved significant improvements in the retrieval rates compared to previous CBIR schemes.  ( 2 min )
    Hierarchical Query Classification in E-commerce Search
    arXiv:2403.06021v1 Announce Type: cross Abstract: E-commerce platforms typically store and structure product information and search data in a hierarchy. Efficiently categorizing user search queries into a similar hierarchical structure is paramount in enhancing user experience on e-commerce platforms as well as news curation and academic research. The significance of this task is amplified when dealing with sensitive query categorization or critical information dissemination, where inaccuracies can lead to considerable negative impacts. The inherent complexity of hierarchical query classification is compounded by two primary challenges: (1) the pronounced class imbalance that skews towards dominant categories, and (2) the inherent brevity and ambiguity of search queries that hinder accurate classification. To address these challenges, we introduce a novel framework that leverages hierarchical information through (i) enhanced representation learning that utilizes the contrastive loss to discern fine-grained instance relationships within the hierarchy, called ''instance hierarchy'', and (ii) a nuanced hierarchical classification loss that attends to the intrinsic label taxonomy, named ''label hierarchy''. Additionally, based on our observation that certain unlabeled queries share typographical similarities with labeled queries, we propose a neighborhood-aware sampling technique to intelligently select these unlabeled queries to boost the classification performance. Extensive experiments demonstrate that our proposed method is better than state-of-the-art (SOTA) on the proprietary Amazon dataset, and comparable to SOTA on the public datasets of Web of Science and RCV1-V2. These results underscore the efficacy of our proposed solution, and pave the path toward the next generation of hierarchy-aware query classification systems.  ( 3 min )
    A Generalized Acquisition Function for Preference-based Reward Learning
    arXiv:2403.06003v1 Announce Type: cross Abstract: Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task. Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency. The information gain criterion focuses on precisely identifying all parameters of the reward function. This can potentially be wasteful as many parameters may result in the same reward, and many rewards may result in the same behavior in the downstream tasks. Instead, we show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar. We introduce a tractable framework that can capture such definitions of similarity. Our experiments in a synthetic environment, an assistive robotics environment with domain transfer, and a natural language processing problem with real datasets demonstrate the superior performance of our querying method over the state-of-the-art information gain method.  ( 2 min )
    Grafting: Making Random Forests Consistent
    arXiv:2403.06015v1 Announce Type: cross Abstract: Despite their performance and widespread use, little is known about the theory of Random Forests. A major unanswered question is whether, or when, the Random Forest algorithm is consistent. The literature explores various variants of the classic Random Forest algorithm to address this question and known short-comings of the method. This paper is a contribution to this literature. Specifically, the suitability of grafting consistent estimators onto a shallow CART is explored. It is shown that this approach has a consistency guarantee and performs well in empirical settings.  ( 2 min )
    Can Generative Models Improve Self-Supervised Representation Learning?
    arXiv:2403.05966v1 Announce Type: cross Abstract: The rapid advancement in self-supervised learning (SSL) has highlighted its potential to leverage unlabeled data for learning powerful visual representations. However, existing SSL approaches, particularly those employing different views of the same image, often rely on a limited set of predefined data augmentations. This constrains the diversity and quality of transformations, which leads to sub-optimal representations. In this paper, we introduce a novel framework that enriches the SSL paradigm by utilizing generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image representation, our method enables the generation of diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for self-supervised learning. Our experimental results demonstrate that our framework significantly enhances the quality of learned visual representations. This research demonstrates that incorporating generative models into the SSL workflow opens new avenues for exploring the potential of unlabeled visual data. This development paves the way for more robust and versatile representation learning techniques.  ( 2 min )
    General surgery vision transformer: A video pre-trained foundation model for general surgery
    arXiv:2403.05949v1 Announce Type: cross Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to-date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real-time for surgical applications, toward which we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, displaying improved performance over state-of-the-art single frame predictors.  ( 2 min )
    Robust Emotion Recognition in Context Debiasing
    arXiv:2403.05963v1 Announce Type: cross Abstract: Context-aware emotion recognition (CAER) has recently boosted the practical applications of affective computing techniques in unconstrained environments. Mainstream CAER methods invariably extract ensemble representations from diverse contexts and subject-centred characteristics to perceive the target person's emotional state. Despite advancements, the biggest challenge remains due to context bias interference. The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation, causing severe performance bottlenecks and confounding valuable context priors. In this paper, we propose a counterfactual emotion inference (CLEF) framework to address the above issue. Specifically, we first formulate a generalized causal graph to decouple the causal relationships among the variables in CAER. Following the causal graph, CLEF introduces a non-invasive context branch to capture the adverse direct effect caused by the context bias. During the inference, we eliminate the direct context effect from the total causal effect by comparing factual and counterfactual outcomes, resulting in bias mitigation and robust prediction. As a model-agnostic framework, CLEF can be readily integrated into existing methods, bringing consistent performance gains.  ( 2 min )
    Online Identification of Stochastic Continuous-Time Wiener Models Using Sampled Data
    arXiv:2403.05899v1 Announce Type: cross Abstract: It is well known that ignoring the presence of stochastic disturbances in the identification of stochastic Wiener models leads to asymptotically biased estimators. On the other hand, optimal statistical identification, via likelihood-based methods, is sensitive to the assumptions on the data distribution and is usually based on relatively complex sequential Monte Carlo algorithms. We develop a simple recursive online estimation algorithm based on an output-error predictor, for the identification of continuous-time stochastic parametric Wiener models through stochastic approximation. The method is applicable to generic model parameterizations and, as demonstrated in the numerical simulation examples, it is robust with respect to the assumptions on the spectrum of the disturbance process.  ( 2 min )
    Deep Learning based acoustic measurement approach for robotic applications on orthopedics
    arXiv:2403.05879v1 Announce Type: cross Abstract: In Total Knee Replacement Arthroplasty (TKA), surgical robotics can provide image-guided navigation to fit implants with high precision. Its tracking approach highly relies on inserting bone pins into the bones tracked by the optical tracking system. This is normally done by invasive, radiative manners (implantable markers and CT scans), which introduce unnecessary trauma and prolong the preparation time for patients. To tackle this issue, ultrasound-based bone tracking could offer an alternative. In this study, we proposed a novel deep learning structure to improve the accuracy of bone tracking by an A-mode ultrasound (US). We first obtained a set of ultrasound dataset from the cadaver experiment, where the ground truth locations of bones were calculated using bone pins. These data were used to train the proposed CasAtt-UNet to predict bone location automatically and robustly. The ground truth bone locations and those locations of US were recorded simultaneously. Therefore, we could label bone peaks in the raw US signals. As a result, our method achieved sub millimeter precision across all eight bone areas with the only exception of one channel in the ankle. This method enables the robust measurement of lower extremity bone positions from 1D raw ultrasound signals. It shows great potential to apply A-mode ultrasound in orthopedic surgery from safe, convenient, and efficient perspectives.  ( 2 min )
    Statistical Efficiency of Distributional Temporal Difference
    arXiv:2403.05811v1 Announce Type: cross Abstract: Distributional reinforcement learning (DRL), which cares about the full distribution of returns instead of just the mean, has achieved empirical success in various domains. One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. A distributional temporal difference (TD) algorithm has been accordingly proposed, which is an extension of the temporal difference algorithm in the classic RL literature. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference algorithm (CTD) and quantile temporal difference algorithm (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD algorithm (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process with state space $S$ and action space $A$, we show that in the case of NTD we need $\wtilde O\prn{\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+2}}}$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. Under some mild assumptions, $\wtilde O\prn{\frac{1}{\varepsilon^{2}(1-\gamma)^{4}}}$ iterations suffices to ensure the Kolmogorov-Smirnov distance between the NTD estimator $\hat\eta^\pi$ and $\eta^\pi$ less than $\varepsilon$ with high probability. And we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.  ( 2 min )
    Shallow ReLU neural networks and finite elements
    arXiv:2403.05809v1 Announce Type: cross Abstract: We point out that (continuous or discontinuous) piecewise linear functions on a convex polytope mesh can be represented by two-hidden-layer ReLU neural networks in a weak sense. In addition, the numbers of neurons of the two hidden layers required to weakly represent are accurately given based on the numbers of polytopes and hyperplanes involved in this mesh. The results naturally hold for constant and linear finite element functions. Such weak representation establishes a bridge between shallow ReLU neural networks and finite element functions, and leads to a perspective for analyzing approximation capability of ReLU neural networks in $L^p$ norm via finite element functions. Moreover, we discuss the strict representation for tensor finite element functions via the recent tensor neural networks.  ( 2 min )
    Physics-informed Neural Motion Planning on Constraint Manifolds
    arXiv:2403.05765v1 Announce Type: cross Abstract: Constrained Motion Planning (CMP) aims to find a collision-free path between the given start and goal configurations on the kinematic constraint manifolds. These problems appear in various scenarios ranging from object manipulation to legged-robot locomotion. However, the zero-volume nature of manifolds makes the CMP problem challenging, and the state-of-the-art methods still take several seconds to find a path and require a computationally expansive path dataset for imitation learning. Recently, physics-informed motion planning methods have emerged that directly solve the Eikonal equation through neural networks for motion planning and do not require expert demonstrations for learning. Inspired by these approaches, we propose the first physics-informed CMP framework that solves the Eikonal equation on the constraint manifolds and trains neural function for CMP without expert data. Our results show that the proposed approach efficiently solves various CMP problems in both simulation and real-world, including object manipulation under orientation constraints and door opening with a high-dimensional 6-DOF robot manipulator. In these complex settings, our method exhibits high success rates and finds paths in sub-seconds, which is many times faster than the state-of-the-art CMP methods.  ( 2 min )
    HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning
    arXiv:2403.05763v1 Announce Type: cross Abstract: In recent times, a plethora of hardware accelerators have been put forth for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task that is well-known for its significantly higher algorithm complexity. The state-of-the-art KGC solutions based on graph convolution neural network (GCN) involve extensive vertex/relation embedding updates and complicated score functions, which are inherently cumbersome for acceleration. As a result, existing accelerator designs are no longer optimal, and a novel algorithm-hardware co-design for KG reasoning is needed. Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight machine learning, particularly for graph learning applications. In this paper, we leverage HDC for an intrinsically more efficient and acceleration-friendly KGC algorithm. We also co-design an acceleration framework named HDReason targeting FPGA platforms. On the algorithm level, HDReason achieves a balance between high reasoning accuracy, strong model interpretability, and less computation complexity. In terms of architecture, HDReason offers reconfigurability, high training throughput, and low energy consumption. When compared with NVIDIA RTX 4090 GPU, the proposed accelerator achieves an average 10.6x speedup and 65x energy efficiency improvement. When conducting cross-models and cross-platforms comparison, HDReason yields an average 4.2x higher performance and 3.4x better energy efficiency with similar accuracy versus the state-of-the-art FPGA-based GCN training platform.  ( 2 min )
    Efficient Public Health Intervention Planning Using Decomposition-Based Decision-Focused Learning
    arXiv:2403.05683v1 Announce Type: cross Abstract: The declining participation of beneficiaries over time is a key concern in public health programs. A popular strategy for improving retention is to have health workers `intervene' on beneficiaries at risk of dropping out. However, the availability and time of these health workers are limited resources. As a result, there has been a line of research on optimizing these limited intervention resources using Restless Multi-Armed Bandits (RMABs). The key technical barrier to using this framework in practice lies in the need to estimate the beneficiaries' RMAB parameters from historical data. Recent research has shown that Decision-Focused Learning (DFL), which focuses on maximizing the beneficiaries' adherence rather than predictive accuracy, improves the performance of intervention targeting using RMABs. Unfortunately, these gains come at a high computational cost because of the need to solve and evaluate the RMAB in each DFL training step. In this paper, we provide a principled way to exploit the structure of RMABs to speed up intervention planning by cleverly decoupling the planning for different beneficiaries. We use real-world data from an Indian NGO, ARMMAN, to show that our approach is up to two orders of magnitude faster than the state-of-the-art approach while also yielding superior model performance. This would enable the NGO to scale up deployments using DFL to potentially millions of mothers, ultimately advancing progress toward UNSDG 3.1.  ( 2 min )
    A Framework for Effective AI Recommendations in Cyber-Physical-Human Systems
    arXiv:2403.05715v1 Announce Type: cross Abstract: Many cyber-physical-human systems (CPHS) involve a human decision-maker who may receive recommendations from an artificial intelligence (AI) platform while holding the ultimate responsibility of making decisions. In such CPHS applications, the human decision-maker may depart from an optimal recommended decision and instead implement a different one for various reasons. In this letter, we develop a rigorous framework to overcome this challenge. In our framework, we consider that humans may deviate from AI recommendations as they perceive and interpret the system's state in a different way than the AI platform. We establish the structural properties of optimal recommendation strategies and develop an approximate human model (AHM) used by the AI. We provide theoretical bounds on the optimality gap that arises from an AHM and illustrate the efficacy of our results in a numerical example.  ( 2 min )
    Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes
    arXiv:2403.05669v1 Announce Type: cross Abstract: Clustering data objects into homogeneous groups is one of the most important tasks in data mining. Spectral clustering is arguably one of the most important algorithms for clustering, as it is appealing for its theoretical soundness and is adaptable to many real-world data settings. For example, mixed data, where the data is composed of numerical and categorical features, is typically handled via numerical discretization, dummy coding, or similarity computation that takes into account both data types. This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm, avoiding the need for data preprocessing or the use of sophisticated similarity functions. We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function. Furthermore, we demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data. Finally, we compare the performance of our algorithms against other related methods and show that it provides a competitive alternative to them in terms of performance and runtime.  ( 2 min )
    Geometric Neural Network based on Phase Space for BCI decoding
    arXiv:2403.05645v1 Announce Type: cross Abstract: The integration of Deep Learning (DL) algorithms on brain signal analysis is still in its nascent stages compared to their success in fields like Computer Vision, especially in Brain-Computer Interface (BCI), where the brain activity is decoded to control external devices without requiring muscle control. Electroencephalography (EEG) is a widely adopted choice for designing BCI systems due to its non-invasive and cost-effective nature and excellent temporal resolution. Still, it comes at the expense of limited training data, poor signal-to-noise, and a large variability across and within-subject recordings. Finally, setting up a BCI system with many electrodes takes a long time, hindering the widespread adoption of reliable DL architectures in BCIs outside research laboratories. To improve adoption, we need to improve user comfort using, for instance, reliable algorithms that operate with few electrodes. \textbf{Approach:} Our research aims to develop a DL algorithm that delivers effective results with a limited number of electrodes. Taking advantage of the Augmented Covariance Method with SPDNet, we propose the SPDNet$_{\psi}$ architecture and analyze its performance and computational impact, as well as the interpretability of the results. The evaluation is conducted on 5-fold cross-validation, using only three electrodes positioned above the Motor Cortex. The methodology was tested on nearly 100 subjects from several open-source datasets using the Mother Of All BCI Benchmark (MOABB) framework. \textbf{Main results:} The results of our SPDNet$_{\psi}$ demonstrate that the augmented approach combined with the SPDNet significantly outperforms all the current state-of-the-art DL architecture in MI decoding. \textbf{Significance:} This new architecture is explainable, with a low number of trainable parameters and a reduced carbon footprint.  ( 3 min )
    Prepared for the Worst: A Learning-Based Adversarial Attack for Resilience Analysis of the ICP Algorithm
    arXiv:2403.05666v1 Announce Type: cross Abstract: This paper presents a novel method to assess the resilience of the Iterative Closest Point (ICP) algorithm via deep-learning-based attacks on lidar point clouds. For safety-critical applications such as autonomous navigation, ensuring the resilience of algorithms prior to deployments is of utmost importance. The ICP algorithm has become the standard for lidar-based localization. However, the pose estimate it produces can be greatly affected by corruption in the measurements. Corruption can arise from a variety of scenarios such as occlusions, adverse weather, or mechanical issues in the sensor. Unfortunately, the complex and iterative nature of ICP makes assessing its resilience to corruption challenging. While there have been efforts to create challenging datasets and develop simulations to evaluate the resilience of ICP empirically, our method focuses on finding the maximum possible ICP pose error using perturbation-based adversarial attacks. The proposed attack induces significant pose errors on ICP and outperforms baselines more than 88% of the time across a wide range of scenarios. As an example application, we demonstrate that our attack can be used to identify areas on a map where ICP is particularly vulnerable to corruption in the measurements.  ( 2 min )
    OmniJet-$\alpha$: The first cross-task foundation model for particle physics
    arXiv:2403.05618v1 Announce Type: cross Abstract: Foundation models are multi-dataset and multi-task machine learning methods that once pre-trained can be fine-tuned for a large variety of downstream applications. The successful development of such general-purpose models for physics data would be a major breakthrough as they could improve the achievable physics performance while at the same time drastically reduce the required amount of training time and data. We report significant progress on this challenge on several fronts. First, a comprehensive set of evaluation methods is introduced to judge the quality of an encoding from physics data into a representation suitable for the autoregressive generation of particle jets with transformer architectures (the common backbone of foundation models). These measures motivate the choice of a higher-fidelity tokenization compared to previous works. Finally, we demonstrate transfer learning between an unsupervised problem (jet generation) and a classic supervised task (jet tagging) with our new OmniJet-$\alpha$ model. This is the first successful transfer between two different and actively studied classes of tasks and constitutes a major step in the building of foundation models for particle physics.  ( 2 min )
    Comparison of gait phase detection using traditional machine learning and deep learning techniques
    arXiv:2403.05595v1 Announce Type: cross Abstract: Human walking is a complex activity with a high level of cooperation and interaction between different systems in the body. Accurate detection of the phases of the gait in real-time is crucial to control lower-limb assistive devices like exoskeletons and prostheses. There are several ways to detect the walking gait phase, ranging from cameras and depth sensors to the sensors attached to the device itself or the human body. Electromyography (EMG) is one of the input methods that has captured lots of attention due to its precision and time delay between neuromuscular activity and muscle movement. This study proposes a few Machine Learning (ML) based models on lower-limb EMG data for human walking. The proposed models are based on Gaussian Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), Linear Discriminant Analysis (LDA) and Deep Convolutional Neural Networks (DCNN). The traditional ML models are trained on hand-crafted features or their reduced components using Principal Component Analysis (PCA). On the contrary, the DCNN model utilises convolutional layers to extract features from raw data. The results show up to 75% average accuracy for traditional ML models and 79% for Deep Learning (DL) model. The highest achieved accuracy in 50 trials of the training DL model is 89.5%.  ( 3 min )
    Privacy Amplification for the Gaussian Mechanism via Bounded Support
    arXiv:2403.05598v1 Announce Type: cross Abstract: Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset. These guarantees can be desirable compared to vanilla DP in real world settings as they tightly upper-bound the privacy leakage for a $\textit{specific}$ individual in an $\textit{actual}$ dataset, rather than considering worst-case datasets. While these frameworks are beginning to gain popularity, to date, there is a lack of private mechanisms that can fully leverage advantages of data-dependent accounting. To bridge this gap, we propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting. Experiments on model training with DP-SGD show that using bounded support Gaussian mechanisms can provide a reduction of the pDP bound $\epsilon$ by as much as 30% without negative effects on model utility.  ( 2 min )
    Can Interpretability Layouts Influence Human Perception of Offensive Sentences?
    arXiv:2403.05581v1 Announce Type: cross Abstract: This paper conducts a user study to assess whether three machine learning (ML) interpretability layouts can influence participants' views when evaluating sentences containing hate speech, focusing on the "Misogyny" and "Racism" classes. Given the existence of divergent conclusions in the literature, we provide empirical evidence on using ML interpretability in online communities through statistical and qualitative analyses of questionnaire responses. The Generalized Additive Model estimates participants' ratings, incorporating within-subject and between-subject designs. While our statistical analysis indicates that none of the interpretability layouts significantly influences participants' views, our qualitative analysis demonstrates the advantages of ML interpretability: 1) triggering participants to provide corrective feedback in case of discrepancies between their views and the model, and 2) providing insights to evaluate a model's behavior beyond traditional performance metrics.  ( 2 min )
    Beyond Predictive Algorithms in Child Welfare
    arXiv:2403.05573v1 Announce Type: cross Abstract: Caseworkers in the child welfare (CW) sector use predictive decision-making algorithms built on risk assessment (RA) data to guide and support CW decisions. Researchers have highlighted that RAs can contain biased signals which flatten CW case complexities and that the algorithms may benefit from incorporating contextually rich case narratives, i.e. - casenotes written by caseworkers. To investigate this hypothesized improvement, we quantitatively deconstructed two commonly used RAs from a United States CW agency. We trained classifier models to compare the predictive validity of RAs with and without casenote narratives and applied computational text analysis on casenotes to highlight topics uncovered in the casenotes. Our study finds that common risk metrics used to assess families and build CWS predictive risk models (PRMs) are unable to predict discharge outcomes for children who are not reunified with their birth parent(s). We also find that although casenotes cannot predict discharge outcomes, they contain contextual case signals. Given the lack of predictive validity of RA scores and casenotes, we propose moving beyond quantitative risk assessments for public sector algorithms and towards using contextual sources of information such as narratives to study public sociotechnical systems.  ( 2 min )
    Chaining text-to-image and large language model: A novel approach for generating personalized e-commerce banners
    arXiv:2403.05578v1 Announce Type: cross Abstract: Text-to-image models such as stable diffusion have opened a plethora of opportunities for generating art. Recent literature has surveyed the use of text-to-image models for enhancing the work of many creative artists. Many e-commerce platforms employ a manual process to generate the banners, which is time-consuming and has limitations of scalability. In this work, we demonstrate the use of text-to-image models for generating personalized web banners with dynamic content for online shoppers based on their interactions. The novelty in this approach lies in converting users' interaction data to meaningful prompts without human intervention. To this end, we utilize a large language model (LLM) to systematically extract a tuple of attributes from item meta-information. The attributes are then passed to a text-to-image model via prompt engineering to generate images for the banner. Our results show that the proposed approach can create high-quality personalized banners for users.  ( 2 min )
    Efficient and Guaranteed-Safe Non-Convex Trajectory Optimization with Constrained Diffusion Model
    arXiv:2403.05571v1 Announce Type: cross Abstract: Trajectory optimization in robotics poses a challenging non-convex problem due to complex dynamics and environmental settings. Traditional numerical optimization methods are time-consuming in finding feasible solutions, whereas data-driven approaches lack safety guarantees for the output trajectories. In this paper, we introduce a general and fully parallelizable framework that combines diffusion models and numerical solvers for non-convex trajectory optimization, ensuring both computational efficiency and constraint satisfaction. A novel constrained diffusion model is proposed with an additional constraint violation loss for training. It aims to approximate the distribution of locally optimal solutions while minimizing constraint violations during sampling. The samples are then used as initial guesses for a numerical solver to refine and derive final solutions with formal verification of feasibility and optimality. Experimental evaluations on three tasks over different robotics domains verify the improved constraint satisfaction and computational efficiency with 4$\times$ to 22$\times$ acceleration using our proposed method, which generalizes across trajectory optimization problems and scales well with problem complexity.  ( 2 min )
    Improving Cognitive Diagnosis Models with Adaptive Relational Graph Neural Networks
    arXiv:2403.05559v1 Announce Type: cross Abstract: Cognitive Diagnosis (CD) algorithms receive growing research interest in intelligent education. Typically, these CD algorithms assist students by inferring their abilities (i.e., their proficiency levels on various knowledge concepts). The proficiency levels can enable further targeted skill training and personalized exercise recommendations, thereby promoting students' learning efficiency in online education. Recently, researchers have found that building and incorporating a student-exercise bipartite graph is beneficial for enhancing diagnostic performance. However, there are still limitations in their studies. On one hand, researchers overlook the heterogeneity within edges, where there can be both correct and incorrect answers. On the other hand, they disregard the uncertainty within edges, e.g., a correct answer can indicate true mastery or fortunate guessing. To address the limitations, we propose Adaptive Semantic-aware Graph-based Cognitive Diagnosis model (ASG-CD), which introduces a novel and effective way to leverage bipartite graph information in CD. Specifically, we first map students, exercises, and knowledge concepts into a latent representation space and combine these latent representations to obtain student abilities and exercise difficulties. After that, we propose a Semantic-aware Graph Neural Network Layer to address edge heterogeneity. This layer splits the original bipartite graph into two subgraphs according to edge semantics, and aggregates information based on these two subgraphs separately. To mitigate the impact of edge uncertainties, we propose an Adaptive Edge Differentiation Layer that dynamically differentiates edges, followed by keeping reliable edges and filtering out uncertain edges. Extensive experiments on three real-world datasets have demonstrated the effectiveness of ASG-CD.  ( 3 min )
    Re-thinking Human Activity Recognition with Hierarchy-aware Label Relationship Modeling
    arXiv:2403.05557v1 Announce Type: cross Abstract: Human Activity Recognition (HAR) has been studied for decades, from data collection, learning models, to post-processing and result interpretations. However, the inherent hierarchy in the activities remains relatively under-explored, despite its significant impact on model performance and interpretation. In this paper, we propose H-HAR, by rethinking the HAR tasks from a fresh perspective by delving into their intricate global label relationships. Rather than building multiple classifiers separately for multi-layered activities, we explore the efficacy of a flat model enhanced with graph-based label relationship modeling. Being hierarchy-aware, the graph-based label modeling enhances the fundamental HAR model, by incorporating intricate label relationships into the model. We validate the proposal with a multi-label classifier on complex human activity data. The results highlight the advantages of the proposal, which can be vertically integrated into advanced HAR models to further enhance their performances.  ( 2 min )
    Modeling and predicting students' engagement behaviors using mixture Markov models
    arXiv:2403.05556v1 Announce Type: cross Abstract: Students' engagements reflect their level of involvement in an ongoing learning process which can be estimated through their interactions with a computer-based learning or assessment system. A pre-requirement for stimulating student engagement lies in the capability to have an approximate representation model for comprehending students' varied (dis)engagement behaviors. In this paper, we utilized model-based clustering for this purpose which generates K mixture Markov models to group students' traces containing their (dis)engagement behavioral patterns. To prevent the Expectation-Maximization (EM) algorithm from getting stuck in a local maxima, we also introduced a K-means-based initialization method named as K-EM. We performed an experimental work on two real datasets using the three variants of the EM algorithm: the original EM, emEM, K-EM; and, non-mixture baseline models for both datasets. The proposed K-EM has shown very promising results and achieved significant performance difference in comparison with the other approaches particularly using the Dataset. Hence, we suggest to perform further experiments using large dataset(s) to validate our method. Additionally, visualization of the resultant clusters through first-order Markov chains reveals very useful insights about (dis)engagement behaviors depicted by the students. We conclude the paper with a discussion on the usefulness of our approach, limitations and potential extensions of this work.  ( 2 min )
    Subgroup Discovery in MOOCs: A Big Data Application for Describing Different Types of Learners
    arXiv:2403.05555v1 Announce Type: cross Abstract: The aim of this paper is to categorize and describe different types of learners in massive open online courses (MOOCs) by means of a subgroup discovery approach based on MapReduce. The final objective is to discover IF-THEN rules that appear in different MOOCs. The proposed subgroup discovery approach, which is an extension of the well-known FP-Growth algorithm, considers emerging parallel methodologies like MapReduce to be able to cope with extremely large datasets. As an additional feature, the proposal includes a threshold value to denote the number of courses that each discovered rule should satisfy. A post-processing step is also included so redundant subgroups can be removed. The experimental stage is carried out by considering de-identified data from the first year of 16 MITx and HarvardX courses on the edX platform. Experimental results demonstrate that the proposed MapReduce approach outperforms traditional sequential subgroup discovery approaches, achieving a runtime that is almost constant for different courses. Additionally, thanks to the final post-processing step, only interesting and not-redundant rules are discovered, hence reducing the number of subgroups in one or two orders of magnitude. Finally, the discovered subgroups are easily used by courses' instructors not only for descriptive purposes but also for additional tasks such as recommendation or personalization.  ( 3 min )
    Understanding the Progression of Educational Topics via Semantic Matching
    arXiv:2403.05553v1 Announce Type: cross Abstract: Education systems are dynamically changing to accommodate technological advances, industrial and societal needs, and to enhance students' learning journeys. Curriculum specialists and educators constantly revise taught subjects across educational grades to identify gaps, introduce new learning topics, and enhance the learning outcomes. This process is usually done within the same subjects (e.g. math) or across related subjects (e.g. math and physics) considering the same and different educational levels, leading to massive multi-layer comparisons. Having nuanced data about subjects, topics, and learning outcomes structured within a dataset, empowers us to leverage data science to better understand the progression of various learning topics. In this paper, Bidirectional Encoder Representations from Transformers (BERT) topic modeling was used to extract topics from the curriculum, which were then used to identify relationships between subjects, track their progression, and identify conceptual gaps. We found that grouping learning outcomes by common topics helped specialists reduce redundancy and introduce new concepts in the curriculum. We built a dashboard to avail the methodology to curriculum specials. Finally, we tested the validity of the approach with subject matter experts.  ( 2 min )
    Monitoring the evolution of antisemitic discourse on extremist social media using BERT
    arXiv:2403.05548v1 Announce Type: cross Abstract: Racism and intolerance on social media contribute to a toxic online environment which may spill offline to foster hatred, and eventually lead to physical violence. That is the case with online antisemitism, the specific category of hatred considered in this study. Tracking antisemitic themes and their associated terminology over time in online discussions could help monitor the sentiments of their participants and their evolution, and possibly offer avenues for intervention that may prevent the escalation of hatred. Due to the large volume and constant evolution of online traffic, monitoring conversations manually is impractical. Instead, we propose an automated method that extracts antisemitic themes and terminology from extremist social media over time and captures their evolution. Since supervised learning would be too limited for such a task, we created an unsupervised online machine learning approach that uses large language models to assess the contextual similarity of posts. The method clusters similar posts together, dividing, and creating additional clusters over time when sub-themes emerge from existing ones or new themes appear. The antisemitic terminology used within each theme is extracted from the posts in each cluster. Our experiments show that our methodology outperforms existing baselines and demonstrates the kind of themes and sub-themes it discovers within antisemitic discourse along with their associated terminology. We believe that our approach will be useful for monitoring the evolution of all kinds of hatred beyond antisemitism on social platforms.  ( 3 min )
    AI for non-programmers: Applied AI in the lectures for students without programming skills
    arXiv:2403.05547v1 Announce Type: cross Abstract: Applications such as ChatGPT and WOMBO Dream make it easy to inspire students without programming knowledge to use artificial intelligence (AI). Therefore, given the increasing importance of AI in all disciplines, innovative strategies are needed to educate students in AI without programming knowledge so that AI can be integrated into their study modules as a future skill. This work presents a didactic planning script for applied AI. The didactic planning script is based on the AI application pipeline and links AI concepts with study-relevant topics. These linkages open up a new solution space and promote students' interest in and understanding of the potentials and risks of AI. An example lecture series for master students in energy management shows how AI can be seamlessly integrated into discipline-specific lectures. To this end, the planning script for applied AI is adapted to fit the study programs' topic. This specific teaching scenario enables students to solve a discipline-specific task step by step using the AI application pipeline. Thus, the application of the didactic planning script for applied AI shows the practical implementation of the theoretical concepts of AI. In addition, a checklist is presented that can be used to assess whether AI can be used in the discipline-specific lecture. AI as a future skill must be learned by students based on use cases that are relevant to the course of studies. For this reason, AI education should fit seamlessly into various curricula, even if the students do not have a programming background due to their field of study.  ( 3 min )
    Unified Occupancy on a Public Transport Network through Combination of AFC and APC Data
    arXiv:2403.05546v1 Announce Type: cross Abstract: In a transport network, the onboard occupancy is key for gaining insights into travelers' habits and adjusting the offer. Traditionally, operators have relied on field studies to evaluate ridership of a typical workday. However, automated fare collection (AFC) and automatic passenger counting (APC) data, which provide complete temporal coverage, are often available but underexploited. It should be noted, however, that each data source comes with its own biases: AFC data may not account for fraud, while not all vehicles are equipped with APC systems. This paper introduces the unified occupancy method, a geostatistical model to extrapolate occupancy to every course of a public transportation network by combining AFC and APC data with partial coverage. Unified occupancy completes missing APC information for courses on lines where other courses have APC measures, as well as for courses on lines where no APC data is available at all. The accuracy of this method is evaluated on real data from several public transportation networks in France.  ( 2 min )
    Extinction Risks from AI: Invisible to Science?
    arXiv:2403.05540v1 Announce Type: cross Abstract: In an effort to inform the discussion surrounding existential risks from AI, we formulate Extinction-level Goodhart's Law as "Virtually any goal specification, pursued to the extreme, will result in the extinction of humanity", and we aim to understand which formal models are suitable for investigating this hypothesis. Note that we remain agnostic as to whether Extinction-level Goodhart's Law holds or not. As our key contribution, we identify a set of conditions that are necessary for a model that aims to be informative for evaluating specific arguments for Extinction-level Goodhart's Law. Since each of the conditions seems to significantly contribute to the complexity of the resulting model, formally evaluating the hypothesis might be exceedingly difficult. This raises the possibility that whether the risk of extinction from artificial intelligence is real or not, the underlying dynamics might be invisible to current scientific methods.  ( 2 min )
    Cell reprogramming design by transfer learning of functional transcriptional networks
    arXiv:2403.04837v1 Announce Type: cross Abstract: Recent developments in synthetic biology, next-generation sequencing, and machine learning provide an unprecedented opportunity to rationally design new disease treatments based on measured responses to gene perturbations and drugs to reprogram cells. The main challenges to seizing this opportunity are the incomplete knowledge of the cellular network and the combinatorial explosion of possible interventions, both of which are insurmountable by experiments. To address these challenges, we develop a transfer learning approach to control cell behavior that is pre-trained on transcriptomic data associated with human cell fates, thereby generating a model of the network dynamics that can be transferred to specific reprogramming goals. The approach combines transcriptional responses to gene perturbations to minimize the difference between a given pair of initial and target transcriptional states. We demonstrate our approach's versatility by applying it to a microarray dataset comprising >9,000 microarrays across 54 cell types and 227 unique perturbations, and an RNASeq dataset consisting of >10,000 sequencing runs across 36 cell types and 138 perturbations. Our approach reproduces known reprogramming protocols with an AUROC of 0.91 while innovating over existing methods by pre-training an adaptable model that can be tailored to specific reprogramming transitions. We show that the number of gene perturbations required to steer from one fate to another increases with decreasing developmental relatedness and that fewer genes are needed to progress along developmental paths than to regress. These findings establish a proof-of-concept for our approach to computationally design control strategies and provide insights into how gene regulatory networks govern phenotype.  ( 3 min )
    Advancing Biomedical Text Mining with Community Challenges
    arXiv:2403.04261v1 Announce Type: cross Abstract: The field of biomedical research has witnessed a significant increase in the accumulation of vast amounts of textual data from various sources such as scientific literatures, electronic health records, clinical trial reports, and social media. However, manually processing and analyzing these extensive and complex resources is time-consuming and inefficient. To address this challenge, biomedical text mining, also known as biomedical natural language processing, has garnered great attention. Community challenge evaluation competitions have played an important role in promoting technology innovation and interdisciplinary collaboration in biomedical text mining research. These challenges provide platforms for researchers to develop state-of-the-art solutions for data mining and information processing in biomedical research. In this article, we review the recent advances in community challenges specific to Chinese biomedical text mining. Firstly, we collect the information of these evaluation tasks, such as data sources and task types. Secondly, we conduct systematic summary and comparative analysis, including named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Then, we summarize the potential clinical applications of these community challenge tasks from translational informatics perspective. Finally, we discuss the contributions and limitations of these community challenges, while highlighting future directions in the era of large language models.  ( 2 min )
    A representation-learning game for classes of prediction tasks
    arXiv:2403.06971v1 Announce Type: new Abstract: We propose a game-based formulation for learning dimensionality-reducing representations of feature vectors, when only a prior knowledge on future prediction tasks is available. In this game, the first player chooses a representation, and then the second player adversarially chooses a prediction task from a given class, representing the prior knowledge. The first player aims is to minimize, and the second player to maximize, the regret: The minimal prediction loss using the representation, compared to the same loss using the original features. For the canonical setting in which the representation, the response to predict and the predictors are all linear functions, and under the mean squared error loss function, we derive the theoretically optimal representation in pure strategies, which shows the effectiveness of the prior knowledge, and the optimal regret in mixed strategies, which shows the usefulness of randomizing the representation. For general representations and loss functions, we propose an efficient algorithm to optimize a randomized representation. The algorithm only requires the gradients of the loss function, and is based on incrementally adding a representation rule to a mixture of such rules.  ( 2 min )
    Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts
    arXiv:2403.06966v1 Announce Type: new Abstract: Reinforcement learning (RL) is a powerful approach for acquiring a good-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose \textbf{Di}verse \textbf{Skil}l \textbf{L}earning (Di-SkilL), an RL method for learning diverse skills using Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associate context distribution to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distribution enables automatic curricula learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and demonstrate how we can efficiently train them using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.  ( 2 min )
    Counterfactual Reasoning with Knowledge Graph Embeddings
    arXiv:2403.06936v1 Announce Type: new Abstract: Knowledge graph embeddings (KGEs) were originally developed to infer true but missing facts in incomplete knowledge repositories. In this paper, we link knowledge graph completion and counterfactual reasoning via our new task CFKGR. We model the original world state as a knowledge graph, hypothetical scenarios as edges added to the graph, and plausible changes to the graph as inferences from logical rules. We create corresponding benchmark datasets, which contain diverse hypothetical scenarios with plausible changes to the original knowledge graph and facts that should be retained. We develop COULDD, a general method for adapting existing knowledge graph embeddings given a hypothetical premise, and evaluate it on our benchmark. Our results indicate that KGEs learn patterns in the graph without explicit training. We further observe that KGEs adapted with COULDD solidly detect plausible counterfactual changes to the graph that follow these patterns. An evaluation on human-annotated data reveals that KGEs adapted with COULDD are mostly unable to recognize changes to the graph that do not follow learned inference rules. In contrast, ChatGPT mostly outperforms KGEs in detecting plausible changes to the graph but has poor knowledge retention. In summary, CFKGR connects two previously distinct areas, namely KG completion and counterfactual reasoning.  ( 2 min )
    Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints
    arXiv:2403.06906v1 Announce Type: new Abstract: Learning to defer (L2D) aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier. Existing research in L2D overlooks key aspects of real-world systems that impede its practical adoption, namely: i) neglecting cost-sensitive scenarios, where type 1 and type 2 errors have different costs; ii) requiring concurrent human predictions for every instance of the training dataset and iii) not dealing with human work capacity constraints. To address these issues, we propose the deferral under cost and capacity constraints framework (DeCCaF). DeCCaF is a novel L2D approach, employing supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and using constraint programming to globally minimize the error cost subject to workload limitations. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, with individual work capacity constraints. The results demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average 8.4% reduction in the misclassification cost.  ( 2 min )
    Simplicity Bias of Transformers to Learn Low Sensitivity Functions
    arXiv:2403.06925v1 Announce Type: new Abstract: Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of the inductive biases that they have and how those biases are different from other neural network architectures remains elusive. Various neural network architectures such as fully connected networks have been found to have a simplicity bias towards simple functions of the data; one version of this simplicity bias is a spectral bias to learn simple functions in the Fourier space. In this work, we identify the notion of sensitivity of the model to random changes in the input as a notion of simplicity bias which provides a unified metric to explain the simplicity and spectral bias of transformers across different data modalities. We show that transformers have lower sensitivity than alternative architectures, such as LSTMs, MLPs and CNNs, across both vision and language tasks. We also show that low-sensitivity bias correlates with improved robustness; furthermore, it can also be used as an efficient intervention to further improve the robustness of transformers.  ( 2 min )
    Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning
    arXiv:2403.06880v1 Announce Type: new Abstract: Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using Cross-Density Visualizer technique, we observed that transitions, especially the S2D, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models.  ( 2 min )
    Semantic Residual Prompts for Continual Learning
    arXiv:2403.06870v1 Announce Type: new Abstract: Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and focus training on a few parameter vectors termed prompts. Most of these methods organize these vectors in a pool of key-value pairs, and use the input image as query to retrieve the prompts (values). However, as keys are learned while tasks progress, the prompting selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we ask a foundational model (CLIP) to select our prompt within a two-level adaptation mechanism. Specifically, the first level leverages standard textual prompts for the CLIP textual encoder, leading to stable class prototypes. The second level, instead, uses these prototypes along with the query image as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test. Notably, our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets.  ( 3 min )
    On the Generalization Ability of Unsupervised Pretraining
    arXiv:2403.06871v1 Announce Type: new Abstract: Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distribution and tasks in pre-training and fine-tuning stage. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze generalization bound of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhances the generalization of fine-tuned model. Overall, our results contribute to a better understanding of unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.  ( 2 min )
    Learning with Noisy Foundation Models
    arXiv:2403.06869v1 Announce Type: new Abstract: Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization, which is applicable in both parameter-efficient and black-box tuning manners. We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term as Noisy Model Learning.  ( 3 min )
    Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
    arXiv:2403.06833v1 Announce Type: new Abstract: Instruction-tuned Large Language Models (LLMs) have achieved breakthrough results, opening countless new possibilities for many practical applications. However, LLMs lack elementary safety features that are established norms in other areas of computer science, such as the separation between instructions and data, causing them to malfunction or rendering them vulnerable to manipulation and interference by third parties e.g., via indirect prompt/command injection. Even worse, so far, there is not even an established definition of what precisely such a separation would mean and how its violation could be tested. In this work, we aim to close this gap. We introduce a formal measure to quantify the phenomenon of instruction-data separation as well as an empirical variant of the measure that can be computed from a model`s black-box outputs. We also introduce a new dataset, SEP (Should it be Executed or Processed?), which allows estimating the measure, and we report results on several state-of-the-art open-source and closed LLMs. Finally, we quantitatively demonstrate that all evaluated LLMs fail to achieve a high amount of separation, according to our measure. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.  ( 3 min )
    Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification
    arXiv:2403.06854v1 Announce Type: new Abstract: Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a reward function $R$) from their behaviour (represented as a policy $\pi$). To do this, we need a behavioural model of how $\pi$ relates to $R$. In the current literature, the most common behavioural models are optimality, Boltzmann-rationality, and causal entropy maximisation. However, the true relationship between a human's preferences and their behaviour is much more complex than any of these behavioural models. This means that the behavioural models are misspecified, which raises the concern that they may lead to systematic errors if applied to real data. In this paper, we analyse how sensitive the IRL problem is to misspecification of the behavioural model. Specifically, we provide necessary and sufficient conditions that completely characterise how the observed data may differ from the assumed behavioural model without incurring an error above a given threshold. In addition to this, we also characterise the conditions under which a behavioural model is robust to small perturbations of the observed policy, and we analyse how robust many behavioural models are to misspecification of their parameter values (such as e.g.\ the discount rate). Our analysis suggests that the IRL problem is highly sensitive to misspecification, in the sense that very mild misspecification can lead to very large errors in the inferred reward function.  ( 2 min )
    Constructing Variables Using Classifiers as an Aid to Regression: An Empirical Assessment
    arXiv:2403.06829v1 Announce Type: new Abstract: This paper proposes a method for the automatic creation of variables (in the case of regression) that complement the information contained in the initial input vector. The method works as a pre-processing step in which the continuous values of the variable to be regressed are discretized into a set of intervals which are then used to define value thresholds. Then classifiers are trained to predict whether the value to be regressed is less than or equal to each of these thresholds. The different outputs of the classifiers are then concatenated in the form of an additional vector of variables that enriches the initial vector of the regression problem. The implemented system can thus be considered as a generic pre-processing tool. We tested the proposed enrichment method with 5 types of regressors and evaluated it in 33 regression datasets. Our experimental results confirm the interest of the approach.  ( 2 min )
    In-context Exploration-Exploitation for Reinforcement Learning
    arXiv:2403.06826v1 Announce Type: new Abstract: In-context learning is a promising approach for online policy learning of offline reinforcement learning (RL) methods, which can be achieved at inference time without gradient optimization. However, this method is hindered by significant computational costs resulting from the gathering of large training trajectory sets and the need to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process biased methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method.  ( 2 min )
    Monotone Individual Fairness
    arXiv:2403.06812v1 Announce Type: new Abstract: We revisit the problem of online learning with individual fairness, where an online learner strives to maximize predictive accuracy while ensuring that similar individuals are treated similarly. We first extend the frameworks of Gillen et al. (2018); Bechavod et al. (2020), which rely on feedback from human auditors regarding fairness violations, as we consider auditing schemes that are capable of aggregating feedback from any number of auditors, using a rich class we term monotone aggregation functions. We then prove a characterization for such auditing schemes, practically reducing the analysis of auditing for individual fairness by multiple auditors to that of auditing by (instance-specific) single auditors. Using our generalized framework, we present an oracle-efficient algorithm achieving an upper bound frontier of $(\mathcal{O}(T^{1/2+2b}),\mathcal{O}(T^{3/4-b}))$ respectively for regret, number of fairness violations, for $0\leq b \leq 1/4$. We then study an online classification setting where label feedback is available for positively-predicted individuals only, and present an oracle-efficient algorithm achieving an upper bound frontier of $(\mathcal{O}(T^{2/3+2b}),\mathcal{O}(T^{5/6-b}))$ for regret, number of fairness violations, for $0\leq b \leq 1/6$. In both settings, our algorithms improve on the best known bounds for oracle-efficient algorithms. Furthermore, our algorithms offer significant improvements in computational efficiency, greatly reducing the number of required calls to an (offline) optimization oracle per round, to $\tilde{\mathcal{O}}(\alpha^{-2})$ in the full information setting, and $\tilde{\mathcal{O}}(\alpha^{-2} + k^2T^{1/3})$ in the partial information setting, where $\alpha$ is the sensitivity for reporting fairness violations, and $k$ is the number of individuals in a round.  ( 2 min )
    Multistep Consistency Models
    arXiv:2403.06807v1 Announce Type: new Abstract: Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: A unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas we show that a $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can train models more easily that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on Imagenet 64 in 8 step and 2.1 FID on Imagenet128 in 8 steps with consistency distillation. We also show that our method scales to a text-to-image diffusion model, generating samples that are very close to the quality of the original model.  ( 2 min )
    Redefining Event Types and Group Evolution in Temporal Data
    arXiv:2403.06771v1 Announce Type: new Abstract: Groups -- such as clusters of points or communities of nodes -- are fundamental when addressing various data mining tasks. In temporal data, the predominant approach for characterizing group evolution has been through the identification of ``events". However, the events usually described in the literature, e.g., shrinks/growths, splits/merges, are often arbitrarily defined, creating a gap between such theoretical/predefined types and real-data group observations. Moving beyond existing taxonomies, we think of events as ``archetypes" characterized by a unique combination of quantitative dimensions that we call ``facets". Group dynamics are defined by their position within the facet space, where archetypal events occupy extremities. Thus, rather than enforcing strict event types, our approach can allow for hybrid descriptions of dynamics involving group proximity to multiple archetypes. We apply our framework to evolving groups from several face-to-face interaction datasets, showing it enables richer, more reliable characterization of group dynamics with respect to state-of-the-art methods, especially when the groups are subject to complex relationships. Our approach also offers intuitive solutions to common tasks related to dynamic group analysis, such as choosing an appropriate aggregation scale, quantifying partition stability, and evaluating event quality.  ( 2 min )
    Leveraging Internal Representations of Model for Magnetic Image Classification
    arXiv:2403.06797v1 Announce Type: new Abstract: Data generated by edge devices has the potential to train intelligent autonomous systems across various domains. Despite the emergence of diverse machine learning approaches addressing privacy concerns and utilizing distributed data, security issues persist due to the sensitive storage of data shards in disparate locations. This paper introduces a potentially groundbreaking paradigm for machine learning model training, specifically designed for scenarios with only a single magnetic image and its corresponding label image available. We harness the capabilities of Deep Learning to generate concise yet informative samples, aiming to overcome data scarcity. Through the utilization of deep learning's internal representations, our objective is to efficiently address data scarcity issues and produce meaningful results. This methodology presents a promising avenue for training machine learning models with minimal data.  ( 2 min )
    XB-MAML: Learning Expandable Basis Parameters for Effective Meta-Learning with Wide Task Coverage
    arXiv:2403.06768v1 Announce Type: new Abstract: Meta-learning, which pursues an effective initialization model, has emerged as a promising approach to handling unseen tasks. However, a limitation remains to be evident when a meta-learner tries to encompass a wide range of task distribution, e.g., learning across distinctive datasets or domains. Recently, a group of works has attempted to employ multiple model initializations to cover widely-ranging tasks, but they are limited in adaptively expanding initializations. We introduce XB-MAML, which learns expandable basis parameters, where they are linearly combined to form an effective initialization to a given task. XB-MAML observes the discrepancy between the vector space spanned by the basis and fine-tuned parameters to decide whether to expand the basis. Our method surpasses the existing works in the multi-domain meta-learning benchmarks and opens up new chances of meta-learning for obtaining the diverse inductive bias that can be combined to stretch toward the effective initialization for diverse unseen tasks.  ( 2 min )
    Koopman Ensembles for Probabilistic Time Series Forecasting
    arXiv:2403.06757v1 Announce Type: new Abstract: In the context of an increasing popularity of data-driven models to represent dynamical systems, many machine learning-based implementations of the Koopman operator have recently been proposed. However, the vast majority of those works are limited to deterministic predictions, while the knowledge of uncertainty is critical in fields like meteorology and climatology. In this work, we investigate the training of ensembles of models to produce stochastic outputs. We show through experiments on real remote sensing image time series that ensembles of independently trained models are highly overconfident and that using a training criterion that explicitly encourages the members to produce predictions with high inter-model variances greatly improves the uncertainty quantification of the ensembles.  ( 2 min )
    Advancing Graph Neural Networks with HL-HGAT: A Hodge-Laplacian and Attention Mechanism Approach for Heterogeneous Graph-Structured Data
    arXiv:2403.06687v1 Announce Type: new Abstract: Graph neural networks (GNNs) have proven effective in capturing relationships among nodes in a graph. This study introduces a novel perspective by considering a graph as a simplicial complex, encompassing nodes, edges, triangles, and $k$-simplices, enabling the definition of graph-structured data on any $k$-simplices. Our contribution is the Hodge-Laplacian heterogeneous graph attention network (HL-HGAT), designed to learn heterogeneous signal representations across $k$-simplices. The HL-HGAT incorporates three key components: HL convolutional filters (HL-filters), simplicial projection (SP), and simplicial attention pooling (SAP) operators, applied to $k$-simplices. HL-filters leverage the unique topology of $k$-simplices encoded by the Hodge-Laplacian (HL) operator, operating within the spectral domain of the $k$-th HL operator. To address computation challenges, we introduce a polynomial approximation for HL-filters, exhibiting spatial localization properties. Additionally, we propose a pooling operator to coarsen $k$-simplices, combining features through simplicial attention mechanisms of self-attention and cross-attention via transformers and SP operators, capturing topological interconnections across multiple dimensions of simplices. The HL-HGAT is comprehensively evaluated across diverse graph applications, including NP-hard problems, graph multi-label and classification challenges, and graph regression tasks in logistics, computer vision, biology, chemistry, and neuroscience. The results demonstrate the model's efficacy and versatility in handling a wide range of graph-based scenarios.  ( 3 min )
    Probabilistic Contrastive Learning for Long-Tailed Visual Recognition
    arXiv:2403.06726v1 Announce Type: new Abstract: Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Such imbalance issue considerably impairs the performance of standard supervised learning algorithms, which are mainly designed for balanced training sets. Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance. However, the performance of supervised contrastive learning is plagued by an inherent challenge: it necessitates sufficiently large batches of training data to construct contrastive pairs that cover all categories, yet this requirement is difficult to meet in the context of class-imbalanced data. To overcome this obstacle, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space, and samples contrastive pairs accordingly. In fact, estimating the distributions of all classes using features in a small batch, particularly for imbalanced data, is not feasible. Our key idea is to introduce a reasonable and simple assumption that the normalized features in contrastive learning follow a mixture of von Mises-Fisher (vMF) distributions on unit space, which brings two-fold benefits. First, the distribution parameters can be estimated using only the first sample moment, which can be efficiently computed in an online manner across different batches. Second, based on the estimated distribution, the vMF distribution allows us to sample an infinite number of contrastive pairs and derive a closed form of the expected contrastive loss for efficient optimization. Our code is available at https://github.com/LeapLabTHU/ProCo.  ( 3 min )
    Streamlining in the Riemannian Realm: Efficient Riemannian Optimization with Loopless Variance Reduction
    arXiv:2403.06677v1 Announce Type: new Abstract: In this study, we investigate stochastic optimization on Riemannian manifolds, focusing on the crucial variance reduction mechanism used in both Euclidean and Riemannian settings. Riemannian variance-reduced methods usually involve a double-loop structure, computing a full gradient at the start of each loop. Determining the optimal inner loop length is challenging in practice, as it depends on strong convexity or smoothness constants, which are often unknown or hard to estimate. Motivated by Euclidean methods, we introduce the Riemannian Loopless SVRG (R-LSVRG) and PAGE (R-PAGE) methods. These methods replace the outer loop with probabilistic gradient computation triggered by a coin flip in each iteration, ensuring simpler proofs, efficient hyperparameter selection, and sharp convergence guarantees. Using R-PAGE as a framework for non-convex Riemannian optimization, we demonstrate its applicability to various important settings. For example, we derive Riemannian MARINA (R-MARINA) for distributed settings with communication compression, providing the best theoretical communication complexity guarantees for non-convex distributed optimization over Riemannian manifolds. Experimental results support our theoretical findings.  ( 2 min )
    Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
    arXiv:2403.06644v1 Announce Type: new Abstract: While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset verbatim. On these datasets, although seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization \url{https://github.com/interpretml/LLM-Tabular-Memorization-Checker}.  ( 2 min )
    Spatial features of CO2 for occupancy detection in a naturally ventilated school building
    arXiv:2403.06643v1 Announce Type: new Abstract: Accurate occupancy information helps to improve building energy efficiency and occupant comfort. Occupancy detection methods based on CO2 sensors have received attention due to their low cost and low intrusiveness. In naturally ventilated buildings, the accuracy of CO2-based occupancy detection is generally low in related studies due to the complex ventilation behavior and the difficulty in measuring the actual air exchange through windows. In this study, we present two novel features for occupancy detection based on the spatial distribution of the CO2 concentration. After a quantitative analysis with Support Vector Machine (SVM) as classifier, it was found that the accuracy of occupancy state detection in naturally ventilated rooms could be improved by up to 14.8 percentage points compared to the baseline, reaching 83.2 % (F1 score 0.84) without any ventilation information. With ventilation information, the accuracy reached 87.6 % (F1 score 0.89). The performance of occupancy quantity detection was significantly improved by up to 25.3 percentage points versus baseline, reaching 56 %, with root mean square error (RMSE) of 11.44 occupants, using only CO2-related features. Additional ventilation information further enhanced the performance to 61.8 % (RMSE 9.02 occupants). By incorporating spatial features, the model using only CO2-related features revealed similar performance as the model containing additional ventilation information, resulting in a better low-cost occupancy detection method for naturally ventilated buildings.  ( 3 min )
    Evaluating the Energy Efficiency of Few-Shot Learning for Object Detection in Industrial Settings
    arXiv:2403.06631v1 Announce Type: new Abstract: In the ever-evolving era of Artificial Intelligence (AI), model performance has constituted a key metric driving innovation, leading to an exponential growth in model size and complexity. However, sustainability and energy efficiency have been critical requirements during deployment in contemporary industrial settings, necessitating the use of data-efficient approaches such as few-shot learning. In this paper, to alleviate the burden of lengthy model training and minimize energy consumption, a finetuning approach to adapt standard object detection models to downstream tasks is examined. Subsequently, a thorough case study and evaluation of the energy demands of the developed models, applied in object detection benchmark datasets from volatile industrial environments is presented. Specifically, different finetuning strategies as well as utilization of ancillary evaluation data during training are examined, and the trade-off between performance and efficiency is highlighted in this low-data regime. Finally, this paper introduces a novel way to quantify this trade-off through a customized Efficiency Factor metric.  ( 2 min )
    ContextGPT: Infusing LLMs Knowledge into Neuro-Symbolic Activity Recognition Models
    arXiv:2403.06586v1 Announce Type: new Abstract: Context-aware Human Activity Recognition (HAR) is a hot research area in mobile computing, and the most effective solutions in the literature are based on supervised deep learning models. However, the actual deployment of these systems is limited by the scarcity of labeled data that is required for training. Neuro-Symbolic AI (NeSy) provides an interesting research direction to mitigate this issue, by infusing common-sense knowledge about human activities and the contexts in which they can be performed into HAR deep learning classifiers. Existing NeSy methods for context-aware HAR rely on knowledge encoded in logic-based models (e.g., ontologies) whose design, implementation, and maintenance to capture new activities and contexts require significant human engineering efforts, technical knowledge, and domain expertise. Recent works show that pre-trained Large Language Models (LLMs) effectively encode common-sense knowledge about human activities. In this work, we propose ContextGPT: a novel prompt engineering approach to retrieve from LLMs common-sense knowledge about the relationship between human activities and the context in which they are performed. Unlike ontologies, ContextGPT requires limited human effort and expertise. An extensive evaluation carried out on two public datasets shows how a NeSy model obtained by infusing common-sense knowledge from ContextGPT is effective in data scarcity scenarios, leading to similar (and sometimes better) recognition rates than logic-based approaches with a fraction of the effort.  ( 2 min )
    FFAD: A Novel Metric for Assessing Generated Time Series Data Utilizing Fourier Transform and Auto-encoder
    arXiv:2403.06576v1 Announce Type: new Abstract: The success of deep learning-based generative models in producing realistic images, videos, and audios has led to a crucial consideration: how to effectively assess the quality of synthetic samples. While the Fr\'{e}chet Inception Distance (FID) serves as the standard metric for evaluating generative models in image synthesis, a comparable metric for time series data is notably absent. This gap in assessment capabilities stems from the absence of a widely accepted feature vector extractor pre-trained on benchmark time series datasets. In addressing these challenges related to assessing the quality of time series, particularly in the context of Fr\'echet Distance, this work proposes a novel solution leveraging the Fourier transform and Auto-encoder, termed the Fr\'{e}chet Fourier-transform Auto-encoder Distance (FFAD). Through our experimental results, we showcase the potential of FFAD for effectively distinguishing samples from different classes. This novel metric emerges as a fundamental tool for the evaluation of generative time series data, contributing to the ongoing efforts of enhancing assessment methodologies in the realm of deep learning-based generative models.  ( 2 min )
    Scalable Online Exploration via Coverability
    arXiv:2403.06571v1 Announce Type: new Abstract: Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata: 1. Intrinsic complexity control. $L_1$-Coverage is associated with a structural parameter, $L_1$-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing $L_1$-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. $L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability. Empirically, we find that $L_1$-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.  ( 2 min )
    Enhancing Joint Motion Prediction for Individuals with Limb Loss Through Model Reprogramming
    arXiv:2403.06569v1 Announce Type: new Abstract: Mobility impairment caused by limb loss is a significant challenge faced by millions of individuals worldwide. The development of advanced assistive technologies, such as prosthetic devices, has the potential to greatly improve the quality of life for amputee patients. A critical component in the design of such technologies is the accurate prediction of reference joint motion for the missing limb. However, this task is hindered by the scarcity of joint motion data available for amputee patients, in contrast to the substantial quantity of data from able-bodied subjects. To overcome this, we leverage deep learning's reprogramming property to repurpose well-trained models for a new goal without altering the model parameters. With only data-level manipulation, we adapt models originally designed for able-bodied people to forecast joint motion in amputees. The findings in this study have significant implications for advancing assistive tech and amputee mobility.  ( 2 min )
    Unraveling the Mystery of Scaling Laws: Part I
    arXiv:2403.06563v1 Announce Type: new Abstract: Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of model pre-training, ultimately contributing to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas, and their conclusions are only based on models containing up to 1.5 billion parameters. Though some subsequent works attempt to unveil these details and scale to larger models, they often neglect the training dependency of important factors such as the learning rate, context length and batch size, leading to their failure to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when scaling the model size up to 33 billion, but the constant coefficients in these formulas vary significantly with the experiment setup. We meticulously identify influential factors and provide transparent, step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M~60M parameters. Using these estimated formulas, we showcase the capability to accurately predict various attributes for models with up to 33B parameters before their training, including (1) the minimum possible test loss; (2) the minimum required training steps and processed tokens to achieve a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory with arbitrary batch size.  ( 3 min )
    Decentralized and Lifelong-Adaptive Multi-Agent Collaborative Learning
    arXiv:2403.06535v1 Announce Type: new Abstract: Decentralized and lifelong-adaptive multi-agent collaborative learning aims to enhance collaboration among multiple agents without a central server, with each agent solving varied tasks over time. To achieve efficient collaboration, agents should: i) autonomously identify beneficial collaborative relationships in a decentralized manner; and ii) adapt to dynamically changing task observations. In this paper, we propose DeLAMA, a decentralized multi-agent lifelong collaborative learning algorithm with dynamic collaboration graphs. To promote autonomous collaboration relationship learning, we propose a decentralized graph structure learning algorithm, eliminating the need for external priors. To facilitate adaptation to dynamic tasks, we design a memory unit to capture the agents' accumulated learning history and knowledge, while preserving finite storage consumption. To further augment the system's expressive capabilities and computational efficiency, we apply algorithm unrolling, leveraging the advantages of both mathematical optimization and neural networks. This allows the agents to `learn to collaborate' through the supervision of training tasks. Our theoretical analysis verifies that inter-agent collaboration is communication efficient under a small number of communication rounds. The experimental results verify its ability to facilitate the discovery of collaboration strategies and adaptation to dynamic learning scenarios, achieving a 98.80% reduction in MSE and a 188.87% improvement in classification accuracy. We expect our work can serve as a foundational technique to facilitate future works towards an intelligent, decentralized, and dynamic multi-agent system. Code is available at https://github.com/ShuoTang123/DeLAMA.  ( 3 min )
    Sliced-Wasserstein Distances and Flows on Cartan-Hadamard Manifolds
    arXiv:2403.06560v1 Announce Type: new Abstract: While many Machine Learning methods were developed or transposed on Riemannian manifolds to tackle data with known non Euclidean geometry, Optimal Transport (OT) methods on such spaces have not received much attention. The main OT tool on these spaces is the Wasserstein distance which suffers from a heavy computational burden. On Euclidean spaces, a popular alternative is the Sliced-Wasserstein distance, which leverages a closed-form solution of the Wasserstein distance in one dimension, but which is not readily available on manifolds. In this work, we derive general constructions of Sliced-Wasserstein distances on Cartan-Hadamard manifolds, Riemannian manifolds with non-positive curvature, which include among others Hyperbolic spaces or the space of Symmetric Positive Definite matrices. Then, we propose different applications. Additionally, we derive non-parametric schemes to minimize these new distances by approximating their Wasserstein gradient flows.  ( 2 min )
    Adaptive Federated Learning Over the Air
    arXiv:2403.06528v1 Announce Type: new Abstract: We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training. This approach capitalizes on the inherent superposition property of wireless channels, facilitating fast and scalable parameter aggregation. Meanwhile, it enhances the robustness of the model training process by dynamically adjusting the stepsize in accordance with the global gradient update. We derive the convergence rate of the training algorithms, encompassing the effects of channel fading and interference, for a broad spectrum of nonconvex loss functions. Our analysis shows that the AdaGrad-based algorithm converges to a stationary point at the rate of $\mathcal{O}( \ln{(T)} /{ T^{ 1 - \frac{1}{\alpha} } } )$, where $\alpha$ represents the tail index of the electromagnetic interference. This result indicates that the level of heavy-tailedness in interference distribution plays a crucial role in the training efficiency: the heavier the tail, the slower the algorithm converges. In contrast, an Adam-like algorithm converges at the $\mathcal{O}( 1/T )$ rate, demonstrating its advantage in expediting the model training process. We conduct extensive experiments that corroborate our theoretical findings and affirm the practical efficacy of our proposed federated adaptive gradient methods.  ( 2 min )
    Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward
    arXiv:2403.06524v1 Announce Type: new Abstract: We develop a deep reinforcement learning framework for tactical decision making in an autonomous truck, specifically for Adaptive Cruise Control (ACC) and lane change maneuvers in a highway scenario. Our results demonstrate that it is beneficial to separate high-level decision-making processes and low-level control actions between the reinforcement learning agent and the low-level controllers based on physical models. In the following, we study optimizing the performance with a realistic and multi-objective reward function based on Total Cost of Operation (TCOP) of the truck using different approaches; by adding weights to reward components, by normalizing the reward components and by using curriculum learning techniques.  ( 2 min )
    RL-MSA: a Reinforcement Learning-based Multi-line bus Scheduling Approach
    arXiv:2403.06466v1 Announce Type: new Abstract: Multiple Line Bus Scheduling Problem (MLBSP) is vital to save operational cost of bus company and guarantee service quality for passengers. Existing approaches typically generate a bus scheduling scheme in an offline manner and then schedule buses according to the scheme. In practice, uncertain events such as traffic congestion occur frequently, which may make the pre-determined bus scheduling scheme infeasible. In this paper, MLBSP is modeled as a Markov Decision Process (MDP). A Reinforcement Learning-based Multi-line bus Scheduling Approach (RL-MSA) is proposed for bus scheduling at both the offline and online phases. At the offline phase, deadhead decision is integrated into bus selection decision for the first time to simplify the learning problem. At the online phase, deadhead decision is made through a time window mechanism based on the policy learned at the offline phase. We develop several new and useful state features including the features for control points, bus lines and buses. A bus priority screening mechanism is invented to construct bus-related features. Considering the interests of both the bus company and passengers, a reward function combining the final reward and the step-wise reward is devised. Experiments at the offline phase demonstrate that the number of buses used of RL-MSA is decreased compared with offline optimization approaches. At the online phase, RL-MSA can cover all departure times in a timetable (i.e., service quality) without increasing the number of buses used (i.e., operational cost).  ( 2 min )
    Graph Neural Network with Two Uplift Estimators for Label-Scarcity Individual Uplift Modeling
    arXiv:2403.06489v1 Announce Type: new Abstract: Uplift modeling aims to measure the incremental effect, which we call uplift, of a strategy or action on the users from randomized experiments or observational data. Most existing uplift methods only use individual data, which are usually not informative enough to capture the unobserved and complex hidden factors regarding the uplift. Furthermore, uplift modeling scenario usually has scarce labeled data, especially for the treatment group, which also poses a great challenge for model training. Considering that the neighbors' features and the social relationships are very informative to characterize a user's uplift, we propose a graph neural network-based framework with two uplift estimators, called GNUM, to learn from the social graph for uplift estimation. Specifically, we design the first estimator based on a class-transformed target. The estimator is general for all types of outcomes, and is able to comprehensively model the treatment and control group data together to approach the uplift. When the outcome is discrete, we further design the other uplift estimator based on our defined partial labels, which is able to utilize more labeled data from both the treatment and control groups, to further alleviate the label scarcity problem. Comprehensive experiments on a public dataset and two industrial datasets show a superior performance of our proposed framework over state-of-the-art methods under various evaluation metrics. The proposed algorithms have been deployed online to serve real-world uplift estimation scenarios.  ( 3 min )
    Prediction of Wort Density with LSTM Network
    arXiv:2403.06458v1 Announce Type: new Abstract: Many physical target values in technical processes are error-prone, cumbersome, or expensive to measure automatically. One example of a physical target value is the wort density, which is an important value needed for beer production. This article introduces a system that helps the brewer measure wort density through sensors in order to reduce errors in manual data collection. Instead of a direct measurement of wort density, a method is developed that calculates the density from measured values acquired by inexpensive standard sensors such as pressure or temperature. The model behind the calculation is a neural network, known as LSTM.  ( 2 min )
    Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain
    arXiv:2403.06432v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks. However, obtaining extensive labeled clinical data for training is often resource-intensive, making practical application difficult. Leveraging unlabeled data thus becomes crucial for representation learning in a label-scarce setting. Although generative self-supervised learning techniques, especially masked autoencoders, have shown promising results in representation learning in various domains, their application to dynamic graphs for dynamic functional connectivity remains underexplored, facing challenges in capturing high-level semantic representations. Here, we introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision. ST-JEMA employs a JEPA-inspired strategy for reconstructing dynamic graphs, which enables the learning of higher-level semantic representations considering temporal perspectives, addressing the challenges in fMRI data representation learning. Utilizing the large-scale UK Biobank dataset for self-supervised learning, ST-JEMA shows exceptional representation learning performance on dynamic functional connectivity demonstrating superiority over previous methods in predicting phenotypes and psychiatric diagnoses across eight benchmark fMRI datasets even with limited samples and effectiveness of temporal reconstruction on missing data scenarios. These findings highlight the potential of our approach as a robust representation learning method for leveraging label-scarce fMRI data.  ( 2 min )
    A Differential Geometric View and Explainability of GNN on Evolving Graphs
    arXiv:2403.06425v1 Announce Type: new Abstract: Graphs are ubiquitous in social networks and biochemistry, where Graph Neural Networks (GNN) are the state-of-the-art models for prediction. Graphs can be evolving and it is vital to formally model and understand how a trained GNN responds to graph evolution. We propose a smooth parameterization of the GNN predicted distributions using axiomatic attribution, where the distributions are on a low-dimensional manifold within a high-dimensional embedding space. We exploit the differential geometric viewpoint to model distributional evolution as smooth curves on the manifold. We reparameterize families of curves on the manifold and design a convex optimization problem to find a unique curve that concisely approximates the distributional evolution for human interpretation. Extensive experiments on node classification, link prediction, and graph classification tasks with evolving graphs demonstrate the better sparsity, faithfulness, and intuitiveness of the proposed method over the state-of-the-art methods.  ( 2 min )
    Causal Multi-Label Feature Selection in Federated Setting
    arXiv:2403.06419v1 Announce Type: new Abstract: Multi-label feature selection serves as an effective mean for dealing with high-dimensional multi-label data. To achieve satisfactory performance, existing methods for multi-label feature selection often require the centralization of substantial data from multiple sources. However, in Federated setting, centralizing data from all sources and merging them into a single dataset is not feasible. To tackle this issue, in this paper, we study a challenging problem of causal multi-label feature selection in federated setting and propose a Federated Causal Multi-label Feature Selection (FedCMFS) algorithm with three novel subroutines. Specifically, FedCMFS first uses the FedCFL subroutine that considers the correlations among label-label, label-feature, and feature-feature to learn the relevant features (candidate parents and children) of each class label while preserving data privacy without centralizing data. Second, FedCMFS employs the FedCFR subroutine to selectively recover the missed true relevant features. Finally, FedCMFS utilizes the FedCFC subroutine to remove false relevant features. The extensive experiments on 8 datasets have shown that FedCMFS is effect for causal multi-label feature selection in federated setting.  ( 2 min )
    What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
    arXiv:2403.06408v1 Announce Type: new Abstract: Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.  ( 2 min )
    On the Diminishing Returns of Width for Continual Learning
    arXiv:2403.06398v1 Announce Type: new Abstract: While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from \emph{catastrophic forgetting} when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting but have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks to analyze Continual Learning Theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies where the diminishing returns are clearly observed as predicted by our theory.  ( 2 min )
    Towards Robust Out-of-Distribution Generalization Bounds via Sharpness
    arXiv:2403.06392v1 Announce Type: new Abstract: Generalizing to out-of-distribution (OOD) data or unseen domain, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measurements between source and target domains but fail to consider the optimization property of the learned model. As empirically shown in recent work, the sharpness of learned minima influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data change in domain shift which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which gives better OOD guarantees for robust algorithms. It also provides a theoretical backing for "flat minima leads to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound by taking robustness into consideration, resulting in a tighter bound than non-robust guarantees. Our findings are supported by the experiments on a ridge regression model, as well as the experiments on deep learning classification tasks.  ( 2 min )
    DeepSafeMPC: Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning
    arXiv:2403.06397v1 Announce Type: new Abstract: Safe Multi-agent reinforcement learning (safe MARL) has increasingly gained attention in recent years, emphasizing the need for agents to not only optimize the global return but also adhere to safety requirements through behavioral constraints. Some recent work has integrated control theory with multi-agent reinforcement learning to address the challenge of ensuring safety. However, there have been only very limited applications of Model Predictive Control (MPC) methods in this domain, primarily due to the complex and implicit dynamics characteristic of multi-agent environments. To bridge this gap, we propose a novel method called Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning (DeepSafeMPC). The key insight of DeepSafeMPC is leveraging a entralized deep learning model to well predict environmental dynamics. Our method applies MARL principles to search for optimal solutions. Through the employment of MPC, the actions of agents can be restricted within safe states concurrently. We demonstrate the effectiveness of our approach using the Safe Multi-agent MuJoCo environment, showcasing significant advancements in addressing safety concerns in MARL.  ( 2 min )
    Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach
    arXiv:2403.06366v1 Announce Type: new Abstract: Soft Q-learning is a variation of Q-learning designed to solve entropy regularized Markov decision problems where an agent aims to maximize the entropy regularized value function. Despite its empirical success, there have been limited theoretical studies of soft Q-learning to date. This paper aims to offer a novel and unified finite-time, control-theoretic analysis of soft Q-learning algorithms. We focus on two types of soft Q-learning algorithms: one utilizing the log-sum-exp operator and the other employing the Boltzmann operator. By using dynamical switching system models, we derive novel finite-time error bounds for both soft Q-learning algorithms. We hope that our analysis will deepen the current understanding of soft Q-learning by establishing connections with switching system models and may even pave the way for new frameworks in the finite-time analysis of other reinforcement learning algorithms.  ( 2 min )
    Risk-Sensitive RL with Optimized Certainty Equivalents via Reduction to Standard RL
    arXiv:2403.06323v1 Announce Type: new Abstract: We study Risk-Sensitive Reinforcement Learning (RSRL) with the Optimized Certainty Equivalent (OCE) risk, which generalizes Conditional Value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. Using an augmented Markov Decision Process (MDP), we propose two general meta-algorithms via reductions to standard RL: one based on optimistic algorithms and another based on policy optimization. Our optimistic meta-algorithm generalizes almost all prior RSRL theory with entropic risk or CVaR. Under discrete rewards, our optimistic theory also certifies the first RSRL regret bounds for MDPs with bounded coverability, e.g., exogenous block MDPs. Under discrete rewards, our policy optimization meta-algorithm enjoys both global convergence and local improvement guarantees in a novel metric that lower bounds the true OCE risk. Finally, we instantiate our framework with PPO, construct an MDP, and show that it learns the optimal risk-sensitive policy while prior algorithms provably fail.  ( 2 min )
    Transferable Reinforcement Learning via Generalized Occupancy Models
    arXiv:2403.06328v1 Announce Type: new Abstract: Intelligent agents must be generalists - showing the ability to quickly adapt and generalize to varying tasks. Within the framework of reinforcement learning (RL), model-based RL algorithms learn a task-agnostic dynamics model of the world, in principle allowing them to generalize to arbitrary rewards. However, one-step models naturally suffer from compounding errors, making them ineffective for problems with long horizons and large state spaces. In this work, we propose a novel class of models - generalized occupancy models (GOMs) - that retain the generality of model-based RL while avoiding compounding error. The key idea behind GOMs is to model the distribution of all possible long-term outcomes from a given state under the coverage of a stationary dataset, along with a policy that realizes a particular outcome from the given state. These models can then quickly be used to select the optimal action for arbitrary new tasks, without having to redo policy optimization. By directly modeling long-term outcomes, GOMs avoid compounding error while retaining generality across arbitrary reward functions. We provide a practical instantiation of GOMs using diffusion models and show its efficacy as a new class of transferable models, both theoretically and empirically across a variety of simulated robotics problems. Videos and code at https://weirdlabuw.github.io/gom/.  ( 2 min )
    Fake or Compromised? Making Sense of Malicious Clients in Federated Learning
    arXiv:2403.06319v1 Announce Type: new Abstract: Federated learning (FL) is a distributed machine learning paradigm that enables training models on decentralized data. The field of FL security against poisoning attacks is plagued with confusion due to the proliferation of research that makes different assumptions about the capabilities of adversaries and the adversary models they operate under. Our work aims to clarify this confusion by presenting a comprehensive analysis of the various poisoning attacks and defensive aggregation rules (AGRs) proposed in the literature, and connecting them under a common framework. To connect existing adversary models, we present a hybrid adversary model, which lies in the middle of the spectrum of adversaries, where the adversary compromises a few clients, trains a generative (e.g., DDPM) model with their compromised samples, and generates new synthetic data to solve an optimization for a stronger (e.g., cheaper, more practical) attack against different robust aggregation rules. By presenting the spectrum of FL adversaries, we aim to provide practitioners and researchers with a clear understanding of the different types of threats they need to consider when designing FL systems, and identify areas where further research is needed.  ( 2 min )
    Optimal Policy Sparsification and Low Rank Decomposition for Deep Reinforcement Learning
    arXiv:2403.06313v1 Announce Type: new Abstract: Deep reinforcement learning(DRL) has shown significant promise in a wide range of applications including computer games and robotics. Yet, training DRL policies consume extraordinary computing resources resulting in dense policies which are prone to overfitting. Moreover, inference with dense DRL policies limit their practical applications, especially in edge computing. Techniques such as pruning and singular value decomposition have been used with deep learning models to achieve sparsification and model compression to limit overfitting and reduce memory consumption. However, these techniques resulted in sub-optimal performance with notable decay in rewards. $L_1$ and $L_2$ regularization techniques have been proposed for neural network sparsification and sparse auto-encoder development, but their implementation in DRL environments has not been apparent. We propose a novel $L_0$-norm-regularization technique using an optimal sparsity map to sparsify DRL policies and promote their decomposition to a lower rank without decay in rewards. We evaluated our $L_0$-norm-regularization technique across five different environments (Cartpole-v1, Acrobat-v1, LunarLander-v2, SuperMarioBros-7.1.v0 and Surgical Robot Learning) using several on-policy and off-policy algorithms. We demonstrated that the $L_0$-norm-regularized DRL policy in the SuperMarioBros environment achieved 93% sparsity and gained 70% compression when subjected to low-rank decomposition, while significantly outperforming the dense policy. Additionally, the $L_0$-norm-regularized DRL policy in the Surgical Robot Learning environment achieved a 36% sparsification and gained 46% compression when decomposed to a lower rank, while being performant. The results suggest that our custom $L_0$-norm-regularization technique for sparsification of DRL policies is a promising avenue to reduce computational resources and limit overfitting.  ( 3 min )
    Analysis of Total Variation Minimization for Clustered Federated Learning
    arXiv:2403.06298v1 Announce Type: new Abstract: A key challenge in federated learning applications is the statistical heterogeneity of local datasets. Clustered federated learning addresses this challenge by identifying clusters of local datasets that are approximately homogeneous. One recent approach to clustered federated learning is generalized total variation minimization (GTVMin). This approach requires a similarity graph which can be obtained by domain expertise or in a data-driven fashion via graph learning techniques. Under a widely applicable clustering assumption, we derive an upper bound the deviation between GTVMin solutions and their cluster-wise averages. This bound provides valuable insights into the effectiveness and robustness of GTVMin in addressing statistical heterogeneity within federated learning environments.  ( 2 min )
    Cooperative Classification and Rationalization for Graph Generalization
    arXiv:2403.06239v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have achieved impressive results in graph classification tasks, but they struggle to generalize effectively when faced with out-of-distribution (OOD) data. Several approaches have been proposed to address this problem. Among them, one solution is to diversify training distributions in vanilla classification by modifying the data environment, yet accessing the environment information is complex. Besides, another promising approach involves rationalization, extracting invariant rationales for predictions. However, extracting rationales is difficult due to limited learning signals, resulting in less accurate rationales and diminished predictions. To address these challenges, in this paper, we propose a Cooperative Classification and Rationalization (C2R) method, consisting of the classification and the rationalization module. Specifically, we first assume that multiple environments are available in the classification module. Then, we introduce diverse training distributions using an environment-conditional generative network, enabling robust graph representations. Meanwhile, the rationalization module employs a separator to identify relevant rationale subgraphs while the remaining non-rationale subgraphs are de-correlated with labels. Next, we align graph representations from the classification module with rationale subgraph representations using the knowledge distillation methods, enhancing the learning signal for rationales. Finally, we infer multiple environments by gathering non-rationale representations and incorporate them into the classification module for cooperative learning. Extensive experimental results on both benchmarks and synthetic datasets demonstrate the effectiveness of C2R. Code is available at https://github.com/yuelinan/Codes-of-C2R.  ( 2 min )
    Probabilistic Neural Circuits
    arXiv:2403.06235v1 Announce Type: new Abstract: Probabilistic circuits (PCs) have gained prominence in recent years as a versatile framework for discussing probabilistic models that support tractable queries and are yet expressive enough to model complex probability distributions. Nevertheless, tractability comes at a cost: PCs are less expressive than neural networks. In this paper we introduce probabilistic neural circuits (PNCs), which strike a balance between PCs and neural nets in terms of tractability and expressive power. Theoretically, we show that PNCs can be interpreted as deep mixtures of Bayesian networks. Experimentally, we demonstrate that PNCs constitute powerful function approximators.  ( 2 min )
    An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling
    arXiv:2403.06183v1 Announce Type: new Abstract: Understanding the dimension dependency of computational complexity in high-dimensional sampling problem is a fundamental problem, both from a practical and theoretical perspective. Compared with samplers with unbiased stationary distribution, e.g., Metropolis-adjusted Langevin algorithm (MALA), biased samplers, e.g., Underdamped Langevin Dynamics (ULD), perform better in low-accuracy cases just because a lower dimension dependency in their complexities. Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion is able to converge dimension independently for strongly log-concave target distributions. Nonetheless, it remains open whether such property establishes for more general cases. In this paper, we investigate the prior diffusion technique for the target distributions satisfying log-Sobolev inequality (LSI), which covers a much broader class of distributions compared to the strongly log-concave ones. In particular, we prove that the modified Langevin algorithm can also obtain the dimension-independent convergence of KL divergence with different step size schedules. The core of our proof technique is a novel construction of an interpolating SDE, which significantly helps to conduct a more accurate characterization of the discrete updates of the overdamped Langevin dynamics. Our theoretical analysis demonstrates the benefits of prior diffusion for a broader class of target distributions and provides new insights into developing faster sampling algorithms.  ( 3 min )
    LinearAPT: An Adaptive Algorithm for the Fixed-Budget Thresholding Linear Bandit Problem
    arXiv:2403.06230v1 Announce Type: new Abstract: In this study, we delve into the Thresholding Linear Bandit (TLB) problem, a nuanced domain within stochastic Multi-Armed Bandit (MAB) problems, focusing on maximizing decision accuracy against a linearly defined threshold under resource constraints. We present LinearAPT, a novel algorithm designed for the fixed budget setting of TLB, providing an efficient solution to optimize sequential decision-making. This algorithm not only offers a theoretical upper bound for estimated loss but also showcases robust performance on both synthetic and real-world datasets. Our contributions highlight the adaptability, simplicity, and computational efficiency of LinearAPT, making it a valuable addition to the toolkit for addressing complex sequential decision-making challenges.  ( 2 min )
    Domain Adversarial Active Learning for Domain Generalization Classification
    arXiv:2403.06174v1 Announce Type: new Abstract: Domain generalization models aim to learn cross-domain knowledge from source domain data, to improve performance on unknown target domains. Recent research has demonstrated that diverse and rich source domain samples can enhance domain generalization capability. This paper argues that the impact of each sample on the model's generalization ability varies. Despite its small scale, a high-quality dataset can still attain a certain level of generalization ability. Motivated by this, we propose a domain-adversarial active learning (DAAL) algorithm for classification tasks in domain generalization. First, we analyze that the objective of tasks is to maximize the inter-class distance within the same domain and minimize the intra-class distance across different domains. To achieve this objective, we design a domain adversarial selection method that prioritizes challenging samples. Second, we posit that even in a converged model, there are subsets of features that lack discriminatory power within each domain. We attempt to identify these feature subsets and optimize them by a constraint loss. We validate and analyze our DAAL algorithm on multiple domain generalization datasets, comparing it with various domain generalization algorithms and active learning algorithms. Our results demonstrate that the DAAL algorithm can achieve strong generalization ability with fewer data resources, thereby reducing data annotation costs in domain generalization tasks.  ( 2 min )
    Learning the irreversible progression trajectory of Alzheimer's disease
    arXiv:2403.06087v1 Announce Type: new Abstract: Alzheimer's disease (AD) is a progressive and irreversible brain disorder that unfolds over the course of 30 years. Therefore, it is critical to capture the disease progression in an early stage such that intervention can be applied before the onset of symptoms. Machine learning (ML) models have been shown effective in predicting the onset of AD. Yet for subjects with follow-up visits, existing techniques for AD classification only aim for accurate group assignment, where the monotonically increasing risk across follow-up visits is usually ignored. Resulted fluctuating risk scores across visits violate the irreversibility of AD, hampering the trustworthiness of models and also providing little value to understanding the disease progression. To address this issue, we propose a novel regularization approach to predict AD longitudinally. Our technique aims to maintain the expected monotonicity of increasing disease risk during progression while preserving expressiveness. Specifically, we introduce a monotonicity constraint that encourages the model to predict disease risk in a consistent and ordered manner across follow-up visits. We evaluate our method using the longitudinal structural MRI and amyloid-PET imaging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our model outperforms existing techniques in capturing the progressiveness of disease risk, and at the same time preserves prediction accuracy.  ( 2 min )
    Local Vertex Colouring Graph Neural Networks
    arXiv:2403.06080v1 Announce Type: new Abstract: In recent years, there has been a significant amount of research focused on expanding the expressivity of Graph Neural Networks (GNNs) beyond the Weisfeiler-Lehman (1-WL) framework. While many of these studies have yielded advancements in expressivity, they have frequently come at the expense of decreased efficiency or have been restricted to specific types of graphs. In this study, we investigate the expressivity of GNNs from the perspective of graph search. Specifically, we propose a new vertex colouring scheme and demonstrate that classical search algorithms can efficiently compute graph representations that extend beyond the 1-WL. We show the colouring scheme inherits useful properties from graph search that can help solve problems like graph biconnectivity. Furthermore, we show that under certain conditions, the expressivity of GNNs increases hierarchically with the radius of the search neighbourhood. To further investigate the proposed scheme, we develop a new type of GNN based on two search strategies, breadth-first search and depth-first search, highlighting the graph properties they can capture on top of 1-WL. Our code is available at https://github.com/seanli3/lvc.  ( 2 min )
    FrameQuant: Flexible Low-Bit Quantization for Transformers
    arXiv:2403.06082v1 Announce Type: new Abstract: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so, serving such models is expensive often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.  ( 2 min )
    Generalization of Graph Neural Networks through the Lens of Homomorphism
    arXiv:2403.06079v1 Announce Type: new Abstract: Despite the celebrated popularity of Graph Neural Networks (GNNs) across numerous applications, the ability of GNNs to generalize remains less explored. In this work, we propose to study the generalization of GNNs through a novel perspective - analyzing the entropy of graph homomorphism. By linking graph homomorphism with information-theoretic measures, we derive generalization bounds for both graph and node classifications. These bounds are capable of capturing subtleties inherent in various graph structures, including but not limited to paths, cycles and cliques. This enables a data-dependent generalization analysis with robust theoretical guarantees. To shed light on the generality of of our proposed bounds, we present a unifying framework that can characterize a broad spectrum of GNN models through the lens of graph homomorphism. We validate the practical applicability of our theoretical findings by showing the alignment between the proposed bounds and the empirically observed generalization gaps over both real-world and synthetic datasets.  ( 2 min )
    L$^2$GC: Lorentzian Linear Graph Convolutional Networks For Node Classification
    arXiv:2403.06064v1 Announce Type: new Abstract: Linear Graph Convolutional Networks (GCNs) are used to classify the node in the graph data. However, we note that most existing linear GCN models perform neural network operations in Euclidean space, which do not explicitly capture the tree-like hierarchical structure exhibited in real-world datasets that modeled as graphs. In this paper, we attempt to introduce hyperbolic space into linear GCN and propose a novel framework for Lorentzian linear GCN. Specifically, we map the learned features of graph nodes into hyperbolic space, and then perform a Lorentzian linear feature transformation to capture the underlying tree-like structure of data. Experimental results on standard citation networks datasets with semi-supervised learning show that our approach yields new state-of-the-art results of accuracy 74.7$\%$ on Citeseer and 81.3$\%$ on PubMed datasets. Furthermore, we observe that our approach can be trained up to two orders of magnitude faster than other nonlinear GCN models on PubMed dataset. Our code is publicly available at https://github.com/llqy123/LLGC-master.  ( 2 min )
    FairTargetSim: An Interactive Simulator for Understanding and Explaining the Fairness Effects of Target Variable Definition
    arXiv:2403.06031v1 Announce Type: new Abstract: Machine learning requires defining one's target variable for predictions or decisions, a process that can have profound implications on fairness: biases are often encoded in target variable definition itself, before any data collection or training. We present an interactive simulator, FairTargetSim (FTS), that illustrates how target variable definition impacts fairness. FTS is a valuable tool for algorithm developers, researchers, and non-technical stakeholders. FTS uses a case study of algorithmic hiring, using real-world data and user-defined target variables. FTS is open-source and available at: http://tinyurl.com/ftsinterface. The video accompanying this paper is here: http://tinyurl.com/ijcaifts.  ( 2 min )
    Predicting Depression and Anxiety: A Multi-Layer Perceptron for Analyzing the Mental Health Impact of COVID-19
    arXiv:2403.06033v1 Announce Type: new Abstract: We introduce a multi-layer perceptron (MLP) called the COVID-19 Depression and Anxiety Predictor (CoDAP) to predict mental health trends, particularly anxiety and depression, during the COVID-19 pandemic. Our method utilizes a comprehensive dataset, which tracked mental health symptoms weekly over ten weeks during the initial COVID-19 wave (April to June 2020) in a diverse cohort of U.S. adults. This period, characterized by a surge in mental health symptoms and conditions, offers a critical context for our analysis. Our focus was to extract and analyze patterns of anxiety and depression through a unique lens of qualitative individual attributes using CoDAP. This model not only predicts patterns of anxiety and depression during the pandemic but also unveils key insights into the interplay of demographic factors, behavioral changes, and social determinants of mental health. These findings contribute to a more nuanced understanding of the complexity of mental health issues in times of global health crises, potentially guiding future early interventions.  ( 3 min )
    Multimodal deep learning approach to predicting neurological recovery from coma after cardiac arrest
    arXiv:2403.06027v1 Announce Type: new Abstract: This work showcases our team's (The BEEGees) contributions to the 2023 George B. Moody PhysioNet Challenge. The aim was to predict neurological recovery from coma following cardiac arrest using clinical data and time-series such as multi-channel EEG and ECG signals. Our modelling approach is multimodal, based on two-dimensional spectrogram representations derived from numerous EEG channels, alongside the integration of clinical data and features extracted directly from EEG recordings. Our submitted model achieved a Challenge score of $0.53$ on the hidden test set for predictions made $72$ hours after return of spontaneous circulation. Our study shows the efficacy and limitations of employing transfer learning in medical classification. With regard to prospective implementation, our analysis reveals that the performance of the model is strongly linked to the selection of a decision threshold and exhibits strong variability across data splits.  ( 2 min )
    Towards a Generic Representation of Cominatorial Problems for Learning-Based Approaches
    arXiv:2403.06026v1 Announce Type: new Abstract: In recent years, there has been a growing interest in using learning-based approaches for solving combinatorial problems, either in an end-to-end manner or in conjunction with traditional optimization algorithms. In both scenarios, the challenge lies in encoding the targeted combinatorial problems into a structure compatible with the learning algorithm. Many existing works have proposed problem-specific representations, often in the form of a graph, to leverage the advantages of \textit{graph neural networks}. However, these approaches lack generality, as the representation cannot be easily transferred from one combinatorial problem to another one. While some attempts have been made to bridge this gap, they still offer a partial generality only. In response to this challenge, this paper advocates for progress toward a fully generic representation of combinatorial problems for learning-based approaches. The approach we propose involves constructing a graph by breaking down any constraint of a combinatorial problem into an abstract syntax tree and expressing relationships (e.g., a variable involved in a constraint) through the edges. Furthermore, we introduce a graph neural network architecture capable of efficiently learning from this representation. The tool provided operates on combinatorial problems expressed in the XCSP3 format, handling all the constraints available in the 2023 mini-track competition. Experimental results on four combinatorial problems demonstrate that our architecture achieves performance comparable to dedicated architectures while maintaining generality. Our code and trained models are publicly available at \url{https://github.com/corail-research/learning-generic-csp}.  ( 3 min )
    Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark
    arXiv:2403.06017v1 Announce Type: new Abstract: Fair graph learning plays a pivotal role in numerous practical applications. Recently, many fair graph learning methods have been proposed; however, their evaluation often relies on poorly constructed semi-synthetic datasets or substandard real-world datasets. In such cases, even a basic Multilayer Perceptron (MLP) can outperform Graph Neural Networks (GNNs) in both utility and fairness. In this work, we illustrate that many datasets fail to provide meaningful information in the edges, which may challenge the necessity of using graph structures in these problems. To address these issues, we develop and introduce a collection of synthetic, semi-synthetic, and real-world datasets that fulfill a broad spectrum of requirements. These datasets are thoughtfully designed to include relevant graph structures and bias information crucial for the fair evaluation of models. The proposed synthetic and semi-synthetic datasets offer the flexibility to create data with controllable bias parameters, thereby enabling the generation of desired datasets with user-defined bias values with ease. Moreover, we conduct systematic evaluations of these proposed datasets and establish a unified evaluation approach for fair graph learning models. Our extensive experimental results with fair graph learning methods across our datasets demonstrate their effectiveness in benchmarking the performance of these methods. Our datasets and the code for reproducing our experiments are available at https://github.com/XweiQ/Benchmark-GraphFairness.  ( 2 min )
    Hard-label based Small Query Black-box Adversarial Attack
    arXiv:2403.06014v1 Announce Type: new Abstract: We consider the hard label based black box adversarial attack setting which solely observes predicted classes from the target model. Most of the attack methods in this setting suffer from impractical number of queries required to achieve a successful attack. One approach to tackle this drawback is utilising the adversarial transferability between white box surrogate models and black box target model. However, the majority of the methods adopting this approach are soft label based to take the full advantage of zeroth order optimisation. Unlike mainstream methods, we propose a new practical setting of hard label based attack with an optimisation process guided by a pretrained surrogate model. Experiments show the proposed method significantly improves the query efficiency of the hard label based black-box attack across various target model architectures. We find the proposed method achieves approximately 5 times higher attack success rate compared to the benchmarks, especially at the small query budgets as 100 and 250.  ( 2 min )
    Are Classification Robustness and Explanation Robustness Really Strongly Correlated? An Analysis Through Input Loss Landscape
    arXiv:2403.06013v1 Announce Type: new Abstract: This paper delves into the critical area of deep learning robustness, challenging the conventional belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Through a novel evaluation approach leveraging clustering for efficient assessment of explanation robustness, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss - contrary to flattened loss landscapes indicating better classification robustness. To deeply investigate this contradiction, a groundbreaking training method designed to adjust the loss landscape with respect to explanation loss is proposed. Through the new training method, we uncover that although such adjustments can impact the robustness of explanations, they do not have an influence on the robustness of classification. These findings not only challenge the prevailing assumption of a strong correlation between the two forms of robustness but also pave new pathways for understanding relationship between loss landscape and explanation loss.  ( 2 min )
    Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
    arXiv:2403.06009v1 Announce Type: new Abstract: Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.  ( 2 min )
    Reinforcement Learning Paycheck Optimization for Multivariate Financial Goals
    arXiv:2403.06011v1 Announce Type: new Abstract: We study paycheck optimization, which examines how to allocate income in order to achieve several competing financial goals. For paycheck optimization, a quantitative methodology is missing, due to a lack of a suitable problem formulation. To deal with this issue, we formulate the problem as a utility maximization problem. The proposed formulation is able to (i) unify different financial goals; (ii) incorporate user preferences regarding the goals; (iii) handle stochastic interest rates. The proposed formulation also facilitates an end-to-end reinforcement learning solution, which is implemented on a variety of problem settings.  ( 2 min )
    Dissecting Deep RL with High Update Ratios: Combatting Value Overestimation and Divergence
    arXiv:2403.05996v1 Announce Type: new Abstract: We show that deep reinforcement learning can maintain its ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples. Under such large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we dissect the phenomena underlying the primacy bias. We inspect the early stages of training that ought to cause the failure to learn and find that a fundamental challenge is a long-standing acquaintance: value overestimation. Overinflated Q-values are found not only on out-of-distribution but also in-distribution data and can be traced to unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting on early data.  ( 2 min )
    Enhancing Classification Performance via Reinforcement Learning for Feature Selection
    arXiv:2403.05979v1 Announce Type: new Abstract: Feature selection plays a crucial role in improving predictive accuracy by identifying relevant features while filtering out irrelevant ones. This study investigates the importance of effective feature selection in enhancing the performance of classification models. By employing reinforcement learning (RL) algorithms, specifically Q-learning (QL) and SARSA learning, this paper addresses the feature selection challenge. Using the Breast Cancer Coimbra dataset (BCCDS) and three normalization methods (Min-Max, l1, and l2), the study evaluates the performance of these algorithms. Results show that QL@Min-Max and SARSA@l2 achieve the highest classification accuracies, reaching 87% and 88%, respectively. This highlights the effectiveness of RL-based feature selection methods in optimizing classification tasks, contributing to improved model accuracy and efficiency.  ( 2 min )
    Towards Efficient Replay in Federated Incremental Learning
    arXiv:2403.05890v1 Announce Type: new Abstract: In Federated Learning (FL), the data in each client is typically assumed fixed or static. However, data often comes in an incremental manner in real-world applications, where the data domain may increase dynamically. In this work, we study catastrophic forgetting with data heterogeneity in Federated Incremental Learning (FIL) scenarios where edge clients may lack enough storage space to retain full data. We propose to employ a simple, generic framework for FIL named Re-Fed, which can coordinate each client to cache important samples for replay. More specifically, when a new task arrives, each client first caches selected previous samples based on their global and local importance. Then, the client trains the local model with both the cached samples and the samples from the new task. Theoretically, we analyze the ability of Re-Fed to discover important samples for replay thus alleviating the catastrophic forgetting problem. Moreover, we empirically show that Re-Fed achieves competitive performance compared to state-of-the-art methods.  ( 2 min )
    SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to Imbalanced Data
    arXiv:2403.05918v1 Announce Type: new Abstract: In the field of data mining and machine learning, commonly used classification models cannot effectively learn in unbalanced data. In order to balance the data distribution before model training,oversamplingmethods are often used to generate data for a small number of classes to solve the problem of classifying unbalanced data. Most of the classical oversampling methods are based on theSMOTE technique, which only focuses on the local information of the data, and therefore the generated data may have the problem of not being realistic enough. In the current oversampling methods based on generative networks, the methods based on GANs can capture the true distribution of data, but there is the problem of pattern collapse and training instability in training; in the oversampling methods based on denoising diffusion probability models, the neural network of the inverse diffusion process using the U-Net is not applicable to tabular data, and although the MLP can be used to replace the U-Net, the problem exists due to the simplicity of the structure and the poor effect of removing noise. problem of poor noise removal. In order to overcome the above problems, we propose a novel oversampling method SEMRes-DDPM.In the SEMRes?DDPM backward diffusion process, a new neural network structure SEMST-ResNet is used, which is suitable for tabular data and has good noise removal effect, and it can generate tabular data with higher quality. Experiments show that the SEMResNet network removes noise better than MLP; SEMRes?DDPM generates data distributions that are closer to the real data distributions than TabDDPM with CWGAN-GP; on 20 real unbalanced tabular datasets with 9 classification models, SEMRes-DDPM improves the quality of the generated tabular data in terms of three evaluation metrics (F1, G-mean, AUC) with better classification performance than other SOTA oversampling methods.  ( 3 min )
    DiffRed: Dimensionality Reduction guided by stable rank
    arXiv:2403.05882v1 Announce Type: new Abstract: In this work, we propose a novel dimensionality reduction technique, DiffRed, which first projects the data matrix, A, along first $k_1$ principal components and the residual matrix $A^{*}$ (left after subtracting its $k_1$-rank approximation) along $k_2$ Gaussian random vectors. We evaluate M1, the distortion of mean-squared pair-wise distance, and Stress, the normalized value of RMS of distortion of the pairwise distances. We rigorously prove that DiffRed achieves a general upper bound of $O\left(\sqrt{\frac{1-p}{k_2}}\right)$ on Stress and $O\left(\frac{(1-p)}{\sqrt{k_2*\rho(A^{*})}}\right)$ on M1 where $p$ is the fraction of variance explained by the first $k_1$ principal components and $\rho(A^{*})$ is the stable rank of $A^{*}$. These bounds are tighter than the currently known results for Random maps. Our extensive experiments on a variety of real-world datasets demonstrate that DiffRed achieves near zero M1 and much lower values of Stress as compared to the well-known dimensionality reduction techniques. In particular, DiffRed can map a 6 million dimensional dataset to 10 dimensions with 54% lower Stress than PCA.  ( 2 min )
    PAPER-HILT: Personalized and Adaptive Privacy-Aware Early-Exit for Reinforcement Learning in Human-in-the-Loop Systems
    arXiv:2403.05864v1 Announce Type: new Abstract: Reinforcement Learning (RL) has increasingly become a preferred method over traditional rule-based systems in diverse human-in-the-loop (HITL) applications due to its adaptability to the dynamic nature of human interactions. However, integrating RL in such settings raises significant privacy concerns, as it might inadvertently expose sensitive user information. Addressing this, our paper focuses on developing PAPER-HILT, an innovative, adaptive RL strategy through exploiting an early-exit approach designed explicitly for privacy preservation in HITL environments. This approach dynamically adjusts the tradeoff between privacy protection and system utility, tailoring its operation to individual behavioral patterns and preferences. We mainly highlight the challenge of dealing with the variable and evolving nature of human behavior, which renders static privacy models ineffective. PAPER-HILT's effectiveness is evaluated through its application in two distinct contexts: Smart Home environments and Virtual Reality (VR) Smart Classrooms. The empirical results demonstrate PAPER-HILT's capability to provide a personalized equilibrium between user privacy and application utility, adapting effectively to individual user needs and preferences. On average for both experiments, utility (performance) drops by 24%, and privacy (state prediction) improves by 31%.  ( 2 min )
    tLaSDI: Thermodynamics-informed latent space dynamics identification
    arXiv:2403.05848v1 Announce Type: new Abstract: We propose a data-driven latent space dynamics identification method (tLaSDI) that embeds the first and second principles of thermodynamics. The latent variables are learned through an autoencoder as a nonlinear dimension reduction model. The dynamics of the latent variables are constructed by a neural network-based model that preserves certain structures to respect the thermodynamic laws through the GENERIC formalism. An abstract error estimate of the approximation is established, which provides a new loss formulation involving the Jacobian computation of autoencoder. Both the autoencoder and the latent dynamics are trained to minimize the new loss. Numerical examples are presented to demonstrate the performance of tLaSDI, which exhibits robust generalization ability, even in extrapolation. In addition, an intriguing correlation is empirically observed between the entropy production rates in the latent space and the behaviors of the full-state solution.  ( 2 min )
    Optimizing LLM Queries in Relational Workloads
    arXiv:2403.05821v1 Announce Type: new Abstract: Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking Large Language Models (LLMs) through native user-defined functions (UDFs) to help users perform natural language tasks, such as classification, entity extraction, and translation, inside analytical workloads. For instance, an analyst might want to extract customer sentiments on millions of product reviews. However, LLM inference is highly expensive in both computational and economic terms: for example, an NVIDIA L4 GPU running Llama2-7B can only process 6 KB of text per second. In this paper, we explore how to optimize LLM inference for analytical workloads that invoke LLMs within relational queries. We show that relational queries present novel opportunities for accelerating LLM inference, including reordering rows to maximize key-value (KV) cache reuse within the LLM inference engine, reordering columns within a row to further increase cache reuse, and deduplicating redundant inference requests. We implement these optimizations in Apache Spark, with vLLM as the model serving backend and achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets. To the best of our knowledge, this is the first work to explicitly address the problem of optimizing LLM invocations within SQL queries.  ( 2 min )
    PR-NET: Leveraging Pathway Refined Network Structures for Prostate Cancer Patient Condition Prediction
    arXiv:2403.05818v1 Announce Type: new Abstract: Motivation: The diagnosis and monitoring of Castrate Resistant Prostate Cancer (CRPC) are crucial for cancer patients, but the current models (such as P-NET) have limitations in terms of parameter count, generalization, and cost. Results: To address the above issues, we develop a more accurate and efficient Prostate Cancer patient condition prediction model, named PR-NET. By compressing and optimizing the network structure of P-NET, the model complexity is reduced while maintaining high accuracy and interpretability. The PR-NET demonstrated superior performance in predicting prostate cancer patient outcomes, outshining P-NET and six other traditional models with a significant margin. In our rigorous evaluation, PR-NET not only achieved impressive average AUC and Recall scores of 0.94 and 0.83, respectively, on known data but also maintained robust generalizability on five unknown datasets with a higher average AUC of 0.73 and Recall of 0.72, compared to P-NET's 0.68 and 0.5. PR-NET's efficiency was evidenced by its shorter average training and inference times, and its gene-level analysis revealed 46 key genes, demonstrating its enhanced predictive power and efficiency in identifying critical biomarkers for prostate cancer. Future research can further expand its application domains and optimize the model's performance and reliability.  ( 2 min )
    Optimistic Safety for Linearly-Constrained Online Convex Optimization
    arXiv:2403.05786v1 Announce Type: new Abstract: The setting of online convex optimization (OCO) under unknown constraints has garnered significant attention in recent years. In this work, we consider a version of this problem with static linear constraints that the player receives noisy feedback of and must always satisfy. By leveraging our novel design paradigm of optimistic safety, we give an algorithm for this problem that enjoys $\tilde{\mathcal{O}}(\sqrt{T})$ regret. This improves on the previous best regret bound of $\tilde{\mathcal{O}}(T^{2/3})$ while using only slightly stronger assumptions of independent noise and an oblivious adversary. Then, by recasting this problem as OCO under time-varying stochastic linear constraints, we show that our algorithm enjoys the same regret guarantees in such a setting and never violates the constraints in expectation. This contributes to the literature on OCO under time-varying stochastic constraints, where the state-of-the-art algorithms enjoy $\tilde{\mathcal{O}}(\sqrt{T})$ regret and $\tilde{\mathcal{O}}(\sqrt{T})$ violation when the constraints are convex and the player receives full feedback. Additionally, we provide a version of our algorithm that is more computationally efficient and give numerical experiments comparing it with benchmark algorithms.  ( 2 min )
    $\textbf{S}^2$IP-LLM: Semantic Space Informed Prompt Learning with LLM for Time Series Forecasting
    arXiv:2403.05798v1 Announce Type: new Abstract: Recently, there has been a growing interest in leveraging pre-trained large language models (LLMs) for various time series applications. However, the semantic space of LLMs, established through the pre-training, is still underexplored and may help yield more distinctive and informative representations to facilitate time series forecasting. To this end, we propose Semantic Space Informed Prompt learning with LLM ($S^2$IP-LLM) to align the pre-trained semantic space with time series embeddings space and perform time series forecasting based on learned prompts from the joint space. We first design a tokenization module tailored for cross-modality alignment, which explicitly concatenates patches of decomposed time series components to create embeddings that effectively encode the temporal dynamics. Next, we leverage the pre-trained word token embeddings to derive semantic anchors and align selected anchors with time series embeddings by maximizing the cosine similarity in the joint space. This way, $S^2$IP-LLM can retrieve relevant semantic anchors as prompts to provide strong indicators (context) for time series that exhibit different temporal dynamics. With thorough empirical studies on multiple benchmark datasets, we demonstrate that the proposed $S^2$IP-LLM can achieve superior forecasting performance over state-of-the-art baselines. Furthermore, our ablation studies and visualizations verify the necessity of prompt learning informed by semantic space.  ( 2 min )
    Spatial Clustering Approach for Vessel Path Identification
    arXiv:2403.05778v1 Announce Type: new Abstract: This paper addresses the challenge of identifying the paths for vessels with operating routes of repetitive paths, partially repetitive paths, and new paths. We propose a spatial clustering approach for labeling the vessel paths by using only position information. We develop a path clustering framework employing two methods: a distance-based path modeling and a likelihood estimation method. The former enhances the accuracy of path clustering through the integration of unsupervised machine learning techniques, while the latter focuses on likelihood-based path modeling and introduces segmentation for a more detailed analysis. The result findings highlight the superior performance and efficiency of the developed approach, as both methods for clustering vessel paths into five classes achieve a perfect F1-score. The approach aims to offer valuable insights for route planning, ultimately contributing to improving safety and efficiency in maritime transportation.  ( 2 min )
    Extending Activation Steering to Broad Skills and Multiple Behaviours
    arXiv:2403.05767v1 Announce Type: new Abstract: Current large language models have dangerous capabilities, which are likely to become more problematic in the future. Activation steering techniques can be used to reduce risks from these capabilities. In this paper, we investigate the efficacy of activation steering for broad skills and multiple behaviours. First, by comparing the effects of reducing performance on general coding ability and Python-specific ability, we find that steering broader skills is competitive to steering narrower skills. Second, we steer models to become more or less myopic and wealth-seeking, among other behaviours. In our experiments, combining steering vectors for multiple different behaviours into one steering vector is largely unsuccessful. On the other hand, injecting individual steering vectors at different places in a model simultaneously is promising.  ( 2 min )
    Hybrid Quantum-inspired Resnet and Densenet for Pattern Recognition with Completeness Analysis
    arXiv:2403.05754v1 Announce Type: new Abstract: With the contemporary digital technology approaching, deep neural networks are emerging as the foundational algorithm of the artificial intelligence boom. Whereas, the evolving social demands have been emphasizing the necessity of novel methodologies to substitute traditional neural networks. Concurrently, the advent of the post-Moore era has spurred the development of quantum-inspired neural networks with outstanding potentials at certain circumstances. Nonetheless, a definitive evaluating system with detailed metrics is tremendously vital and indispensable owing to the vague indicators in comparison between the novel and traditional deep learning models at present. Hence, to improve and evaluate the performances of the novel neural networks more comprehensively in complex and unpredictable environments, we propose two hybrid quantum-inspired neural networks which are rooted in residual and dense connections respectively for pattern recognitions with completeness representation theory for model assessment. Comparative analyses against pure classical models with detailed frameworks reveal that our hybrid models with lower parameter complexity not only match the generalization power of pure classical models, but also outperform them notably in resistance to parameter attacks with various asymmetric noises. Moreover, our hybrid models indicate unique superiority to prevent gradient explosion problems through theoretical argumentation. Eventually, We elaborate on the application scenarios where our hybrid models are applicable and efficient, which paves the way for their industrialization and commercialization.  ( 3 min )
    Membership Testing in Markov Equivalence Classes via Independence Query Oracles
    arXiv:2403.05759v1 Announce Type: new Abstract: Understanding causal relationships between variables is a fundamental problem with broad impact in numerous scientific fields. While extensive research has been dedicated to learning causal graphs from data, its complementary concept of testing causal relationships has remained largely unexplored. While learning involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, the testing counterpart addresses the following critical question: Given a specific MEC and observational data from some causal graph, can we determine if the data-generating causal graph belongs to the given MEC? We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\exp(\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\exp(O(s))$ tests, matching our lower bound. Compared to the learning problem, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, this shows that testing is relatively easier. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes. Additionally, using the DAG associahedron, we provide a geometric interpretation of testing versus learning and discuss how our testing result can aid learning.  ( 2 min )
    Task-Oriented GNNs Training on Large Knowledge Graphs for Accurate and Efficient Modeling
    arXiv:2403.05752v1 Announce Type: new Abstract: A Knowledge Graph (KG) is a heterogeneous graph encompassing a diverse range of node and edge types. Heterogeneous Graph Neural Networks (HGNNs) are popular for training machine learning tasks like node classification and link prediction on KGs. However, HGNN methods exhibit excessive complexity influenced by the KG's size, density, and the number of node and edge types. AI practitioners handcraft a subgraph of a KG G relevant to a specific task. We refer to this subgraph as a task-oriented subgraph (TOSG), which contains a subset of task-related node and edge types in G. Training the task using TOSG instead of G alleviates the excessive computation required for a large KG. Crafting the TOSG demands a deep understanding of the KG's structure and the task's objectives. Hence, it is challenging and time-consuming. This paper proposes KG-TOSA, an approach to automate the TOSG extraction for task-oriented HGNN training on a large KG. In KG-TOSA, we define a generic graph pattern that captures the KG's local and global structure relevant to a specific task. We explore different techniques to extract subgraphs matching our graph pattern: namely (i) two techniques sampling around targeted nodes using biased random walk or influence scores, and (ii) a SPARQL-based extraction method leveraging RDF engines' built-in indices. Hence, it achieves negligible preprocessing overhead compared to the sampling techniques. We develop a benchmark of real KGs of large sizes and various tasks for node classification and link prediction. Our experiments show that KG-TOSA helps state-of-the-art HGNN methods reduce training time and memory usage by up to 70% while improving the model performance, e.g., accuracy and inference time.  ( 3 min )
    MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process
    arXiv:2403.05751v1 Announce Type: new Abstract: Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.  ( 2 min )
    Augmentations vs Algorithms: What Works in Self-Supervised Learning
    arXiv:2403.05726v1 Announce Type: new Abstract: We study the relative effects of data augmentations, pretraining algorithms, and model architectures in Self-Supervised Learning (SSL). While the recent literature in this space leaves the impression that the pretraining algorithm is of critical importance to performance, understanding its effect is complicated by the difficulty in making objective and direct comparisons between methods. We propose a new framework which unifies many seemingly disparate SSL methods into a single shared template. Using this framework, we identify aspects in which methods differ and observe that in addition to changing the pretraining algorithm, many works also use new data augmentations or more powerful model architectures. We compare several popular SSL methods using our framework and find that many algorithmic additions, such as prediction networks or new losses, have a minor impact on downstream task performance (often less than $1\%$), while enhanced augmentation techniques offer more significant performance improvements ($2-4\%$). Our findings challenge the premise that SSL is being driven primarily by algorithmic improvements, and suggest instead a bitter lesson for SSL: that augmentation diversity and data / model scale are more critical contributors to recent advances in self-supervised learning.  ( 2 min )
    Provable Policy Gradient Methods for Average-Reward Markov Potential Games
    arXiv:2403.05738v1 Announce Type: new Abstract: We study Markov potential games under the infinite horizon average reward criterion. Most previous studies have been for discounted rewards. We prove that both algorithms based on independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average reward criterion. To set the stage for gradient-based methods, we first establish that the average reward is a smooth function of policies and provide sensitivity bounds for the differential value functions, under certain conditions on ergodicity and the second largest eigenvalue of the underlying Markov decision process (MDP). We prove that three algorithms, policy gradient, proximal-Q, and natural policy gradient (NPG), converge to an $\epsilon$-Nash equilibrium with time complexity $O(\frac{1}{\epsilon^2})$, given a gradient/differential Q function oracle. When policy gradients have to be estimated, we propose an algorithm with $\tilde{O}(\frac{1}{\min_{s,a}\pi(a|s)\delta})$ sample complexity to achieve $\delta$ approximation error w.r.t~the $\ell_2$ norm. Equipped with the estimator, we derive the first sample complexity analysis for a policy gradient ascent algorithm, featuring a sample complexity of $\tilde{O}(1/\epsilon^5)$. Simulation studies are presented.  ( 2 min )
    $\mathtt{tsGT}$: Stochastic Time Series Modeling With Transformer
    arXiv:2403.05713v1 Announce Type: new Abstract: Time series methods are of fundamental importance in virtually any field of science that deals with temporally structured data. Recently, there has been a surge of deterministic transformer models with time series-specific architectural biases. In this paper, we go in a different direction by introducing $\mathtt{tsGT}$, a stochastic time series model built on a general-purpose transformer architecture. We focus on using a well-known and theoretically justified rolling window backtesting and evaluation protocol. We show that $\mathtt{tsGT}$ outperforms the state-of-the-art models on MAD and RMSE, and surpasses its stochastic peers on QL and CRPS, on four commonly used datasets. We complement these results with a detailed analysis of $\mathtt{tsGT}$'s ability to model the data distribution and predict marginal quantile values.  ( 2 min )
    Shielded Deep Reinforcement Learning for Complex Spacecraft Tasking
    arXiv:2403.05693v1 Announce Type: new Abstract: Autonomous spacecraft control via Shielded Deep Reinforcement Learning (SDRL) has become a rapidly growing research area. However, the construction of shields and the definition of tasking remains informal, resulting in policies with no guarantees on safety and ambiguous goals for the RL agent. In this paper, we first explore the use of formal languages, namely Linear Temporal Logic (LTL), to formalize spacecraft tasks and safety requirements. We then define a manner in which to construct a reward function from a co-safe LTL specification automatically for effective training in SDRL framework. We also investigate methods for constructing a shield from a safe LTL specification for spacecraft applications and propose three designs that provide probabilistic guarantees. We show how these shields interact with different policies and the flexibility of the reward structure through several experiments.  ( 2 min )
    Unfamiliar Finetuning Examples Control How Language Models Hallucinate
    arXiv:2403.05612v1 Announce Type: new Abstract: Large language models (LLMs) have a tendency to generate plausible-sounding yet factually incorrect responses, especially when queried on unfamiliar concepts. In this work, we explore the underlying mechanisms that govern how finetuned LLMs hallucinate. Our investigation reveals an interesting pattern: as inputs become more unfamiliar, LLM outputs tend to default towards a ``hedged'' prediction, whose form is determined by how the unfamiliar examples in the finetuning data are supervised. Thus, by strategically modifying these examples' supervision, we can control LLM predictions for unfamiliar inputs (e.g., teach them to say ``I don't know''). Based on these principles, we develop an RL approach that more reliably mitigates hallucinations for long-form generation tasks, by tackling the challenges presented by reward model hallucinations. We validate our findings with a series of controlled experiments in multiple-choice QA on MMLU, as well as long-form biography and book/movie plot generation tasks.  ( 2 min )
    What is different between these datasets?
    arXiv:2403.05652v1 Announce Type: new Abstract: The performance of machine learning models heavily depends on the quality of input data, yet real-world applications often encounter various data-related challenges. One such challenge could arise when curating training data or deploying the model in the real world - two comparable datasets in the same domain may have different distributions. While numerous techniques exist for detecting distribution shifts, the literature lacks comprehensive approaches for explaining dataset differences in a human-understandable manner. To address this gap, we propose a suite of interpretable methods (toolbox) for comparing two datasets. We demonstrate the versatility of our approach across diverse data modalities, including tabular data, language, images, and signals in both low and high-dimensional settings. Our methods not only outperform comparable and related approaches in terms of explanation quality and correctness, but also provide actionable, complementary insights to understand and mitigate dataset differences effectively.  ( 2 min )
    A Concept-based Interpretable Model for the Diagnosis of Choroid Neoplasias using Multimodal Data
    arXiv:2403.05606v1 Announce Type: new Abstract: Diagnosing rare diseases presents a common challenge in clinical practice, necessitating the expertise of specialists for accurate identification. The advent of machine learning offers a promising solution, while the development of such technologies is hindered by the scarcity of data on rare conditions and the demand for models that are both interpretable and trustworthy in a clinical context. Interpretable AI, with its capacity for human-readable outputs, can facilitate validation by clinicians and contribute to medical education. In the current work, we focus on choroid neoplasias, the most prevalent form of eye cancer in adults, albeit rare with 5.1 per million. We built the so-far largest dataset consisting of 750 patients, incorporating three distinct imaging modalities collected from 2004 to 2022. Our work introduces a concept-based interpretable model that distinguishes between three types of choroidal tumors, integrating insights from domain experts via radiological reports. Remarkably, this model not only achieves an F1 score of 0.91, rivaling that of black-box models, but also boosts the diagnostic accuracy of junior doctors by 42%. This study highlights the significant potential of interpretable machine learning in improving the diagnosis of rare diseases, laying a groundwork for future breakthroughs in medical AI that could tackle a wider array of complex health scenarios.  ( 3 min )
    Evidence, Definitions and Algorithms regarding the Existence of Cohesive-Convergence Groups in Neural Network Optimization
    arXiv:2403.05610v1 Announce Type: new Abstract: Understanding the convergence process of neural networks is one of the most complex and crucial issues in the field of machine learning. Despite the close association of notable successes in this domain with the convergence of artificial neural networks, this concept remains predominantly theoretical. In reality, due to the non-convex nature of the optimization problems that artificial neural networks tackle, very few trained networks actually achieve convergence. To expand recent research efforts on artificial-neural-network convergence, this paper will discuss a different approach based on observations of cohesive-convergence groups emerging during the optimization process of an artificial neural network.  ( 2 min )
    Density-Regression: Efficient and Distance-Aware Deep Regressor for Uncertainty Estimation under Distribution Shifts
    arXiv:2403.05600v1 Announce Type: new Abstract: Morden deep ensembles technique achieves strong uncertainty estimation performance by going through multiple forward passes with different models. This is at the price of a high storage space and a slow speed in the inference (test) time. To address this issue, we propose Density-Regression, a method that leverages the density function in uncertainty estimation and achieves fast inference by a single forward pass. We prove it is distance aware on the feature space, which is a necessary condition for a neural network to produce high-quality uncertainty estimation under distribution shifts. Empirically, we conduct experiments on regression tasks with the cubic toy dataset, benchmark UCI, weather forecast with time series, and depth estimation under real-world shifted applications. We show that Density-Regression has competitive uncertainty estimation performance under distribution shifts with modern deep regressors while using a lower model size and a faster inference speed.  ( 2 min )
    Select High-Level Features: Efficient Experts from a Hierarchical Classification Network
    arXiv:2403.05601v1 Announce Type: new Abstract: This study introduces a novel expert generation method that dynamically reduces task and computational complexity without compromising predictive performance. It is based on a new hierarchical classification network topology that combines sequential processing of generic low-level features with parallelism and nesting of high-level features. This structure allows for the innovative extraction technique: the ability to select only high-level features of task-relevant categories. In certain cases, it is possible to skip almost all unneeded high-level features, which can significantly reduce the inference cost and is highly beneficial in resource-constrained conditions. We believe this method paves the way for future network designs that are lightweight and adaptable, making them suitable for a wide range of applications, from compact edge devices to large-scale clouds. In terms of dynamic inference our methodology can achieve an exclusion of up to 88.7\,\% of parameters and 73.4\,\% fewer giga-multiply accumulate (GMAC) operations, analysis against comparative baselines showing an average reduction of 47.6\,\% in parameters and 5.8\,\% in GMACs across the cases we evaluated.  ( 2 min )
  • Open

    Bayesian Federated Inference for estimating Statistical Models based on Non-shared Multicenter Data sets
    arXiv:2302.07677v2 Announce Type: replace-cross Abstract: Identifying predictive factors for an outcome of interest via a multivariable analysis is often difficult when the data set is small. Combining data from different medical centers into a single (larger) database would alleviate this problem, but is in practice challenging due to regulatory and logistic problems. Federated Learning (FL) is a machine learning approach that aims to construct from local inferences in separate data centers what would have been inferred had the data sets been merged. It seeks to harvest the statistical power of larger data sets without actually creating them. The FL strategy is not always efficient and precise. Therefore, in this paper we refine and implement an alternative Bayesian Federated Inference (BFI) framework for multicenter data with the same aim as FL. The BFI framework is designed to cope with small data sets by inferring locally not only the optimal parameter values, but also additional features of the posterior parameter distribution, capturing information beyond what is used in FL. BFI has the additional benefit that a single inference cycle across the centers is sufficient, whereas FL needs multiple cycles. We quantify the performance of the proposed methodology on simulated and real life data.  ( 3 min )
    Adaptive variational Bayes: Optimality, computation and applications
    arXiv:2109.03204v4 Announce Type: replace-cross Abstract: In this paper, we explore adaptive inference based on variational Bayes. Although several studies have been conducted to analyze the contraction properties of variational posteriors, there is still a lack of a general and computationally tractable variational Bayes method that performs adaptive inference. To fill this gap, we propose a novel adaptive variational Bayes framework, which can operate on a collection of models. The proposed framework first computes a variational posterior over each individual model separately and then combines them with certain weights to produce a variational posterior over the entire model. It turns out that this combined variational posterior is the closest member to the posterior over the entire model in a predefined family of approximating distributions. We show that the adaptive variational Bayes attains optimal contraction rates adaptively under very general conditions. We also provide a methodology to maintain the tractability and adaptive optimality of the adaptive variational Bayes even in the presence of an enormous number of individual models, such as sparse models. We apply the general results to several examples, including deep learning and sparse factor models, and derive new and adaptive inference results. In addition, we characterize an implicit regularization effect of variational Bayes and show that the adaptive variational posterior can utilize this.  ( 3 min )
    Coreset Markov Chain Monte Carlo
    arXiv:2310.17063v2 Announce Type: replace-cross Abstract: A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during inference in order to reduce computational cost. However, state of the art methods for tuning coreset weights are expensive, require nontrivial user input, and impose constraints on the model. In this work, we propose a new method -- Coreset MCMC -- that simulates a Markov chain targeting the coreset posterior, while simultaneously updating the coreset weights using those same draws. Coreset MCMC is simple to implement and tune, and can be used with any existing MCMC kernel. We analyze Coreset MCMC in a representative setting to obtain key insights about the convergence behaviour of the method. Empirical results demonstrate that Coreset MCMC provides higher quality posterior approximations and reduced computational cost compared with other coreset construction methods. Further, compared with other general subsampling MCMC methods, we find that Coreset MCMC has a higher sampling efficiency with competitively accurate posterior approximations.  ( 2 min )
    Sparse Cholesky Factorization for Solving Nonlinear PDEs via Gaussian Processes
    arXiv:2304.01294v3 Announce Type: replace-cross Abstract: In recent years, there has been widespread adoption of machine learning-based approaches to automate the solving of partial differential equations (PDEs). Among these approaches, Gaussian processes (GPs) and kernel methods have garnered considerable interest due to their flexibility, robust theoretical guarantees, and close ties to traditional methods. They can transform the solving of general nonlinear PDEs into solving quadratic optimization problems with nonlinear, PDE-induced constraints. However, the complexity bottleneck lies in computing with dense kernel matrices obtained from pointwise evaluations of the covariance kernel, and its \textit{partial derivatives}, a result of the PDE constraint and for which fast algorithms are scarce. The primary goal of this paper is to provide a near-linear complexity algorithm for working with such kernel matrices. We present a sparse Cholesky factorization algorithm for these matrices based on the near-sparsity of the Cholesky factor under a novel ordering of pointwise and derivative measurements. The near-sparsity is rigorously justified by directly connecting the factor to GP regression and exponential decay of basis functions in numerical homogenization. We then employ the Vecchia approximation of GPs, which is optimal in the Kullback-Leibler divergence, to compute the approximate factor. This enables us to compute $\epsilon$-approximate inverse Cholesky factors of the kernel matrices with complexity $O(N\log^d(N/\epsilon))$ in space and $O(N\log^{2d}(N/\epsilon))$ in time. We integrate sparse Cholesky factorizations into optimization algorithms to obtain fast solvers of the nonlinear PDE. We numerically illustrate our algorithm's near-linear space/time complexity for a broad class of nonlinear PDEs such as the nonlinear elliptic, Burgers, and Monge-Amp\`ere equations.  ( 3 min )
    Is your model predicting the past?
    arXiv:2206.11673v2 Announce Type: replace-cross Abstract: When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to what extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.  ( 2 min )
    Benign overfitting in leaky ReLU networks with moderate input dimension
    arXiv:2403.06903v1 Announce Type: cross Abstract: The problem of benign overfitting asks whether it is possible for a model to perfectly fit noisy training data and still generalize well. We study benign overfitting in two-layer leaky ReLU networks trained with the hinge loss on a binary classification task. We consider input data which can be decomposed into the sum of a common signal and a random noise component, which lie on subspaces orthogonal to one another. We characterize conditions on the signal to noise ratio (SNR) of the model parameters giving rise to benign versus non-benign, or harmful, overfitting: in particular, if the SNR is high then benign overfitting occurs, conversely if the SNR is low then harmful overfitting occurs. We attribute both benign and non-benign overfitting to an approximate margin maximization property and show that leaky ReLU networks trained on hinge loss with Gradient Descent (GD) satisfy this property. In contrast to prior work we do not require near orthogonality conditions on the training data: notably, for input dimension $d$ and training sample size $n$, while prior work shows asymptotically optimal error when $d = \Omega(n^2 \log n)$, here we require only $d = \Omega\left(n \log \frac{1}{\epsilon}\right)$ to obtain error within $\epsilon$ of optimal.  ( 2 min )
    Overcoming the Barrier of Orbital-Free Density Functional Theory for Molecular Systems Using Deep Learning
    arXiv:2309.16578v2 Announce Type: replace Abstract: Orbital-free density functional theory (OFDFT) is a quantum chemistry formulation that has a lower cost scaling than the prevailing Kohn-Sham DFT, which is increasingly desired for contemporary molecular research. However, its accuracy is limited by the kinetic energy density functional, which is notoriously hard to approximate for non-periodic molecular systems. Here we propose M-OFDFT, an OFDFT approach capable of solving molecular systems using a deep learning functional model. We build the essential non-locality into the model, which is made affordable by the concise density representation as expansion coefficients under an atomic basis. With techniques to address unconventional learning challenges therein, M-OFDFT achieves a comparable accuracy with Kohn-Sham DFT on a wide range of molecules untouched by OFDFT before. More attractively, M-OFDFT extrapolates well to molecules much larger than those seen in training, which unleashes the appealing scaling of OFDFT for studying large molecules including proteins, representing an advancement of the accuracy-efficiency trade-off frontier in quantum chemistry.  ( 3 min )
    Sampling via Gradient Flows in the Space of Probability Measures
    arXiv:2310.03597v3 Announce Type: replace Abstract: Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.  ( 3 min )
    On the Approximation of Kernel functions
    arXiv:2403.06731v1 Announce Type: new Abstract: Various methods in statistical learning build on kernels considered in reproducing kernel Hilbert spaces. In applications, the kernel is often selected based on characteristics of the problem and the data. This kernel is then employed to infer response variables at points, where no explanatory data were observed. The data considered here are located in compact sets in higher dimensions and the paper addresses approximations of the kernel itself. The new approach considers Taylor series approximations of radial kernel functions. For the Gauss kernel on the unit cube, the paper establishes an upper bound of the associated eigenfunctions, which grows only polynomially with respect to the index. The novel approach substantiates smaller regularization parameters than considered in the literature, overall leading to better approximations. This improvement confirms low rank approximation methods such as the Nystr\"om method.  ( 2 min )
    Disentangling shared and private latent factors in multimodal Variational Autoencoders
    arXiv:2403.06338v1 Announce Type: new Abstract: Generative models for multimodal data permit the identification of latent factors that may be associated with important determinants of observed data heterogeneity. Common or shared factors could be important for explaining variation across modalities whereas other factors may be private and important only for the explanation of a single modality. Multimodal Variational Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those underlying latent factors and separating shared variation from private. In this work, we investigate their capability to reliably perform this disentanglement. In particular, we highlight a challenging problem setting where modality-specific variation dominates the shared signal. Taking a cross-modal prediction perspective, we demonstrate limitations of existing models, and propose a modification how to make them more robust to modality-specific variation. Our findings are supported by experiments on synthetic as well as various real-world multi-omics data sets.  ( 2 min )
    Bridging Domains with Approximately Shared Features
    arXiv:2403.06424v1 Announce Type: new Abstract: Multi-source domain adaptation aims to reduce performance degradation when applying machine learning models to unseen domains. A fundamental challenge is devising the optimal strategy for feature selection. Existing literature is somewhat paradoxical: some advocate for learning invariant features from source domains, while others favor more diverse features. To address the challenge, we propose a statistical framework that distinguishes the utilities of features based on the variance of their correlation to label $y$ across domains. Under our framework, we design and analyze a learning procedure consisting of learning approximately shared feature representation from source tasks and fine-tuning it on the target task. Our theoretical analysis necessitates the importance of learning approximately shared features instead of only the strictly invariant features and yields an improved population risk compared to previous results on both source and target tasks, thus partly resolving the paradox mentioned above. Inspired by our theory, we proposed a more practical way to isolate the content (invariant+approximately shared) from environmental features and further consolidate our theoretical findings.  ( 2 min )
    Detection of Unobserved Common Causes based on NML Code in Discrete, Mixed, and Continuous Variables
    arXiv:2403.06499v1 Announce Type: new Abstract: Causal discovery in the presence of unobserved common causes from observational data only is a crucial but challenging problem. We categorize all possible causal relationships between two random variables into the following four categories and aim to identify one from observed data: two cases in which either of the direct causality exists, a case that variables are independent, and a case that variables are confounded by latent confounders. Although existing methods have been proposed to tackle this problem, they require unobserved variables to satisfy assumptions on the form of their equation models. In our previous study (Kobayashi et al., 2022), the first causal discovery method without such assumptions is proposed for discrete data and named CLOUD. Using Normalized Maximum Likelihood (NML) Code, CLOUD selects a model that yields the minimum codelength of the observed data from a set of model candidates. This paper extends CLOUD to apply for various data types across discrete, mixed, and continuous. We not only performed theoretical analysis to show the consistency of CLOUD in terms of the model selection, but also demonstrated that CLOUD is more effective than existing methods in inferring causal relationships by extensive experiments on both synthetic and real-world data.  ( 3 min )
    Efficient first-order algorithms for large-scale, non-smooth maximum entropy models with application to wildfire science
    arXiv:2403.06816v1 Announce Type: new Abstract: Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets; these algorithms either rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical Maxent models lack. In this paper, we present novel optimization algorithms that overcome the shortcomings of state-of-the-art algorithms for training large-scale, non-smooth Maxent models. Our proposed first-order algorithms leverage the Kullback-Leibler divergence to train large-scale and non-smooth Maxent models efficiently. For Maxent models with discrete probability distribution of $n$ elements built from samples, each containing $m$ features, the stepsize parameters estimation and iterations in our algorithms scale on the order of $O(mn)$ operations and can be trivially parallelized. Moreover, the strong $\ell_{1}$ convexity of the Kullback--Leibler divergence allows for larger stepsize parameters, thereby speeding up the convergence rate of our algorithms. To illustrate the efficiency of our novel algorithms, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the arts by one order of magnitude and yield results that agree with physical models of wildfire occurrence and previous statistical analyses of wildfire drivers.  ( 3 min )
    Provable Mutual Benefits from Federated Learning in Privacy-Sensitive Domains
    arXiv:2403.06672v1 Announce Type: new Abstract: Cross-silo federated learning (FL) allows data owners to train accurate machine learning models by benefiting from each others private datasets. Unfortunately, the model accuracy benefits of collaboration are often undermined by privacy defenses. Therefore, to incentivize client participation in privacy-sensitive domains, a FL protocol should strike a delicate balance between privacy guarantees and end-model accuracy. In this paper, we study the question of when and how a server could design a FL protocol provably beneficial for all participants. First, we provide necessary and sufficient conditions for the existence of mutually beneficial protocols in the context of mean estimation and convex stochastic optimization. We also derive protocols that maximize the total clients' utility, given symmetric privacy preferences. Finally, we design protocols maximizing end-model accuracy and demonstrate their benefits in synthetic experiments.  ( 2 min )
    Density-Regression: Efficient and Distance-Aware Deep Regressor for Uncertainty Estimation under Distribution Shifts
    arXiv:2403.05600v1 Announce Type: cross Abstract: Morden deep ensembles technique achieves strong uncertainty estimation performance by going through multiple forward passes with different models. This is at the price of a high storage space and a slow speed in the inference (test) time. To address this issue, we propose Density-Regression, a method that leverages the density function in uncertainty estimation and achieves fast inference by a single forward pass. We prove it is distance aware on the feature space, which is a necessary condition for a neural network to produce high-quality uncertainty estimation under distribution shifts. Empirically, we conduct experiments on regression tasks with the cubic toy dataset, benchmark UCI, weather forecast with time series, and depth estimation under real-world shifted applications. We show that Density-Regression has competitive uncertainty estimation performance under distribution shifts with modern deep regressors while using a lower model size and a faster inference speed.  ( 2 min )
    Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion Models
    arXiv:2402.05210v3 Announce Type: replace-cross Abstract: Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. This hampers many useful applications, including pre-registered image generation, counterfactual scenarios, and others. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our model ("Seg-Diff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that it reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space.  ( 2 min )
    Active learning with biased non-response to label requests
    arXiv:2312.08150v2 Announce Type: replace-cross Abstract: Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy--the Upper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.  ( 3 min )
    Weight-Sharing Regularization
    arXiv:2311.03096v2 Announce Type: replace-cross Abstract: Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\operatorname{prox}_\mathcal{R}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled while convolutional neural networks fail in this setting. Our code is available on github.  ( 2 min )
    Learning Unknown Intervention Targets in Structural Causal Models from Heterogeneous Data
    arXiv:2312.06091v2 Announce Type: replace-cross Abstract: We study the problem of identifying the unknown intervention targets in structural causal models where we have access to heterogeneous data collected from multiple environments. The unknown intervention targets are the set of endogenous variables whose corresponding exogenous noises change across the environments. We propose a two-phase approach which in the first phase recovers the exogenous noises corresponding to unknown intervention targets whose distributions have changed across environments. In the second phase, the recovered noises are matched with the corresponding endogenous variables. For the recovery phase, we provide sufficient conditions for learning these exogenous noises up to some component-wise invertible transformation. For the matching phase, under the causal sufficiency assumption, we show that the proposed method uniquely identifies the intervention targets. In the presence of latent confounders, the intervention targets among the observed variables cannot be determined uniquely. We provide a candidate intervention target set which is a superset of the true intervention targets. Our approach improves upon the state of the art as the returned candidate set is always a subset of the target set returned by previous work. Moreover, we do not require restrictive assumptions such as linearity of the causal model or performing invariance tests to learn whether a distribution is changing across environments which could be highly sample inefficient. Our experimental results show the effectiveness of our proposed algorithm in practice.  ( 3 min )
    Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget
    arXiv:2310.19788v3 Announce Type: replace-cross Abstract: This study investigates the experimental design problem for identifying the arm with the highest expected outcome, referred to as best arm identification (BAI). In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates an arm and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the arms. At the end of the experiment, the decision-maker recommends one of the arms as an estimate of the best arm. To design an experiment, we first discuss lower bounds for the probability of misidentification. Our analysis highlights that the available information on the outcome distribution, such as means (expected outcomes), variances, and the choice of the best arm, significantly influences the lower bounds. Because available information is limited in actual experiments, we develop a lower bound that is valid under the unknown means and the unknown choice of the best arm, which are referred to as the worst-case lower bound. We demonstrate that the worst-case lower bound depends solely on the variances of the outcomes. Then, under the assumption that the variances are known, we propose the Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, an extension of the Neyman allocation proposed by Neyman (1934). We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size increases infinitely and the differences between the expected outcomes of the best and other suboptimal arms converge to the same values across arms. We refer to such strategies as asymptotically worst-case optimal.  ( 3 min )
    Large-Scale Gaussian Processes via Alternating Projection
    arXiv:2310.17137v2 Announce Type: replace-cross Abstract: Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.  ( 2 min )
    Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization
    arXiv:2310.02679v3 Announce Type: replace-cross Abstract: We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment issues due to use of entire trajectories and a learning signal present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals. Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods.  ( 2 min )
    Topologically Regularized Multiple Instance Learning to Harness Data Scarcity
    arXiv:2307.14025v2 Announce Type: replace-cross Abstract: In biomedical data analysis, Multiple Instance Learning (MIL) models have emerged as a powerful tool to classify patients' microscopy samples. However, the data-intensive requirement of these models poses a significant challenge in scenarios with scarce data availability, e.g., in rare diseases. We introduce a topological regularization term to MIL to mitigate this challenge. It provides a shape-preserving inductive bias that compels the encoder to maintain the essential geometrical-topological structure of input bags during projection into latent space. This enhances the performance and generalization of the MIL classifier regardless of the aggregation function, particularly for scarce training data. The effectiveness of our method is confirmed through experiments across a range of datasets, showing an average enhancement of 2.8% for MIL benchmarks, 15.3% for synthetic MIL datasets, and 5.5% for real-world biomedical datasets over the current state-of-the-art.  ( 2 min )
    Scaling Laws for Imitation Learning in Single-Agent Games
    arXiv:2307.09423v2 Announce Type: replace-cross Abstract: Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, many works find it is often unable to fully recover the underlying expert behavior, even in constrained environments like single-agent games. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scaling up" has resulted in increasingly more capable LLMs, we investigate whether carefully scaling up model and data size can bring similar improvements in the imitation learning setting for single-agent games. We first demonstrate our findings on a variety of Atari games, and thereafter focus on the extremely challenging game of NetHack. In all games, we find that IL loss and mean return scale smoothly with the compute budget (FLOPs) and are strongly correlated, resulting in power laws for training compute-optimal IL agents. Finally, we forecast and train several NetHack agents with IL and find they outperform prior state-of-the-art by 1.5x in all settings. Our work both demonstrates the scaling behavior of imitation learning in a variety of single-agent games, as well as the viability of scaling up current approaches for increasingly capable agents in NetHack, a game that remains elusively hard for current AI systems.  ( 3 min )
    Large Language Models as Tool Makers
    arXiv:2305.17126v2 Announce Type: replace-cross Abstract: Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving. On the problem-solving server side, tool-making enables continual tool generation and caching as new requests emerge. This framework enables subsequent requests to access cached tools via their corresponding APIs, enhancing the efficiency of task resolution. Recognizing that tool-making requires more sophisticated capabilities, we assign this task to a powerful, albeit resource-intensive, model. Conversely, the simpler tool-using phase is delegated to a lightweight model. This strategic division of labor allows the once-off cost of tool-making to be spread over multiple instances of tool-using, significantly reducing average costs while maintaining strong performance. Furthermore, our method offers a functional cache through the caching and reuse of tools, which stores the functionality of a class of requests instead of the natural language responses from LLMs, thus extending the applicability of the conventional cache mechanism. We evaluate our approach across various complex reasoning tasks, including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstrates performance equivalent to using GPT-4 for both roles, but with a significantly reduced inference cost.  ( 3 min )
    Prediction Error-based Classification for Class-Incremental Learning
    arXiv:2305.18806v2 Announce Type: replace-cross Abstract: Class-incremental learning (CIL) is a particularly challenging variant of continual learning, where the goal is to learn to discriminate between all classes presented in an incremental fashion. Existing approaches often suffer from excessive forgetting and imbalance of the scores assigned to classes that have not been seen together during training. In this study, we introduce a novel approach, Prediction Error-based Classification (PEC), which differs from traditional discriminative and generative classification paradigms. PEC computes a class score by measuring the prediction error of a model trained to replicate the outputs of a frozen random neural network on data from that class. The method can be interpreted as approximating a classification rule based on Gaussian Process posterior variance. PEC offers several practical advantages, including sample efficiency, ease of tuning, and effectiveness even when data are presented one class at a time. Our empirical results show that PEC performs strongly in single-pass-through-data CIL, outperforming other rehearsal-free baselines in all cases and rehearsal-based methods with moderate replay buffer size in most cases across multiple benchmarks.  ( 2 min )
    The Behavior and Convergence of Local Bayesian Optimization
    arXiv:2305.15572v3 Announce Type: replace-cross Abstract: A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by M\"uller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.  ( 2 min )
    Upper Counterfactual Confidence Bounds: a New Optimism Principle for Contextual Bandits
    arXiv:2007.07876v4 Announce Type: replace-cross Abstract: The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) often struggle to deal with general function classes and large context spaces. In this paper, we study general contextual bandits with an offline regression oracle and propose a simple, generic principle to design optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). The key innovation of UCCB is building confidence bounds in policy space, rather than in action space as is done in UCB. We demonstrate that these algorithms are provably optimal and computationally efficient in handling general function classes and large context spaces. Furthermore, we illustrate that the UCCB principle can be seamlessly extended to infinite-action general contextual bandits, provide the first solutions to these settings when employing an offline regression oracle.  ( 2 min )
    NeurAll: Towards a Unified Visual Perception Model for Automated Driving
    arXiv:1902.03589v3 Announce Type: replace-cross Abstract: Convolutional Neural Networks (CNNs) are successfully used for the important automotive visual perception tasks including object recognition, motion and depth estimation, visual SLAM, etc. However, these tasks are typically independently explored and modeled. In this paper, we propose a joint multi-task network design for learning several tasks simultaneously. Our main motivation is the computational efficiency achieved by sharing the expensive initial convolutional layers between all tasks. Indeed, the main bottleneck in automated driving systems is the limited processing power available on deployment hardware. There is also some evidence for other benefits in improving accuracy for some tasks and easing development effort. It also offers scalability to add more tasks leveraging existing features and achieving better generalization. We survey various CNN based solutions for visual perception tasks in automated driving. Then we propose a unified CNN model for the important tasks and discuss several advanced optimization and architecture design techniques to improve the baseline model. The paper is partly review and partly positional with demonstration of several preliminary results promising for future research. We first demonstrate results of multi-stream learning and auxiliary learning which are important ingredients to scale to a large multi-task model. Finally, we implement a two-stream three-task network which performs better in many cases compared to their corresponding single-task models, while maintaining network size.  ( 3 min )
    Quantifying intrinsic causal contributions via structure preserving interventions
    arXiv:2007.00714v4 Announce Type: replace-cross Abstract: We propose a notion of causal influence that describes the `intrinsic' part of the contribution of a node on a target node in a DAG. By recursively writing each node as a function of the upstream noise terms, we separate the intrinsic information added by each node from the one obtained from its ancestors. To interpret the intrinsic information as a {\it causal} contribution, we consider `structure-preserving interventions' that randomize each node in a way that mimics the usual dependence on the parents and does not perturb the observed joint distribution. To get a measure that is invariant with respect to relabelling nodes we use Shapley based symmetrization and show that it reduces in the linear case to simple ANOVA after resolving the target node into noise variables. We describe our contribution analysis for variance and entropy, but contributions for other target metrics can be defined analogously. The code is available in the package gcm of the open source library DoWhy.  ( 2 min )
    Wasserstein Gradient Flows for Moreau Envelopes of f-Divergences in Reproducing Kernel Hilbert Spaces
    arXiv:2402.04613v2 Announce Type: replace Abstract: Most commonly used $f$-divergences of measures, e.g., the Kullback-Leibler divergence, are subject to limitations regarding the support of the involved measures. A remedy consists of regularizing the $f$-divergence by a squared maximum mean discrepancy (MMD) associated with a characteristic kernel $K$. In this paper, we use the so-called kernel mean embedding to show that the corresponding regularization can be rewritten as the Moreau envelope of some function in the reproducing kernel Hilbert space associated with $K$. Then, we exploit well-known results on Moreau envelopes in Hilbert spaces to prove properties of the MMD-regularized $f$-divergences and, in particular, their gradients. Subsequently, we use our findings to analyze Wasserstein gradient flows of MMD-regularized $f$-divergences. Finally, we consider Wasserstein gradient flows starting from empirical measures. We provide proof-of-the-concept numerical examples for $f$-divergences with both infinite and finite recession constant.  ( 2 min )
    A Discriminative Latent-Variable Model for Bilingual Lexicon Induction
    arXiv:1808.09334v3 Announce Type: replace-cross Abstract: We introduce a novel discriminative latent variable model for bilingual lexicon induction. Our model combines the bipartite matching dictionary prior of Haghighi et al. (2008) with a representation-based approach (Artetxe et al., 2017). To train the model, we derive an efficient Viterbi EM algorithm. We provide empirical results on six language pairs under two metrics and show that the prior improves the induced bilingual lexicons. We also demonstrate how previous work may be viewed as a similarly fashioned latent-variable model, albeit with a different prior.  ( 2 min )
    Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?
    arXiv:2401.13045v2 Announce Type: replace Abstract: Over the past decade, the intricacies of sports-related concussions among female athletes have become readily apparent. Traditional clinical methods for diagnosing concussions suffer limitations when applied to female athletes, often failing to capture subtle changes in brain structure and function. Advanced neuroinformatics techniques and machine learning models have become invaluable assets in this endeavor. While these technologies have been extensively employed in understanding concussion in male athletes, there remains a significant gap in our comprehension of their effectiveness for female athletes. With its remarkable data analysis capacity, machine learning offers a promising avenue to bridge this deficit. By harnessing the power of machine learning, researchers can link observed phenotypic neuroimaging data to sex-specific biological mechanisms, unraveling the mysteries of concussions in female athletes. Furthermore, embedding methods within machine learning enable examining brain architecture and its alterations beyond the conventional anatomical reference frame. In turn, allows researchers to gain deeper insights into the dynamics of concussions, treatment responses, and recovery processes. To guarantee that female athletes receive the optimal care they deserve, researchers must employ advanced neuroimaging techniques and sophisticated machine-learning models. These tools enable an in-depth investigation of the underlying mechanisms responsible for concussion symptoms stemming from neuronal dysfunction in female athletes. This paper endeavors to address the crucial issue of sex differences in multimodal neuroimaging experimental design and machine learning approaches within female athlete populations, ultimately ensuring that they receive the tailored care they require when facing the challenges of concussions.  ( 3 min )
    Gaussian process regression with Sliced Wasserstein Weisfeiler-Lehman graph kernels
    arXiv:2402.03838v2 Announce Type: replace Abstract: Supervised learning has recently garnered significant attention in the field of computational physics due to its ability to effectively extract complex patterns for tasks like solving partial differential equations, or predicting material properties. Traditionally, such datasets consist of inputs given as meshes with a large number of nodes representing the problem geometry (seen as graphs), and corresponding outputs obtained with a numerical solver. This means the supervised learning model must be able to handle large and sparse graphs with continuous node attributes. In this work, we focus on Gaussian process regression, for which we introduce the Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernel. In contrast to existing graph kernels, the proposed SWWL kernel enjoys positive definiteness and a drastic complexity reduction, which makes it possible to process datasets that were previously impossible to handle. The new kernel is first validated on graph classification for molecular datasets, where the input graphs have a few tens of nodes. The efficiency of the SWWL kernel is then illustrated on graph regression in computational fluid dynamics and solid mechanics, where the input graphs are made up of tens of thousands of nodes.  ( 2 min )
    Opening the Black Box: Towards inherently interpretable energy data imputation models using building physics insight
    arXiv:2311.16632v2 Announce Type: replace Abstract: Missing data are frequently observed by practitioners and researchers in the building energy modeling community. In this regard, advanced data-driven solutions, such as Deep Learning methods, are typically required to reflect the non-linear behavior of these anomalies. As an ongoing research question related to Deep Learning, a model's applicability to limited data settings can be explored by introducing prior knowledge in the network. This same strategy can also lead to more interpretable predictions, hence facilitating the field application of the approach. For that purpose, the aim of this paper is to propose the use of Physics-informed Denoising Autoencoders (PI-DAE) for missing data imputation in commercial buildings. In particular, the presented method enforces physics-inspired soft constraints to the loss function of a Denoising Autoencoder (DAE). In order to quantify the benefits of the physical component, an ablation study between different DAE configurations is conducted. First, three univariate DAEs are optimized separately on indoor air temperature, heating, and cooling data. Then, two multivariate DAEs are derived from the previous configurations. Eventually, a building thermal balance equation is coupled to the last multivariate configuration to obtain PI-DAE. Additionally, two commonly used benchmarks are employed to support the findings. It is shown how introducing physical knowledge in a multivariate Denoising Autoencoder can enhance the inherent model interpretability through the optimized physics-based coefficients. While no significant improvement is observed in terms of reconstruction error with the proposed PI-DAE, its enhanced robustness to varying rates of missing data and the valuable insights derived from the physics-based coefficients create opportunities for wider applications within building systems and the built environment.  ( 3 min )
    Fair Supervised Learning with A Simple Random Sampler of Sensitive Attributes
    arXiv:2311.05866v2 Announce Type: replace Abstract: As the data-driven decision process becomes dominating for industrial applications, fairness-aware machine learning arouses great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty is able to handle versatile formats of the sensitive attributes, so it is more extensively applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize estimation errors and loss of utility of the proposed neural-penalized risk minimization problem.  ( 2 min )
    Improved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set Conversion
    arXiv:2310.18554v2 Announce Type: replace Abstract: Logistic bandit is a ubiquitous framework of modeling users' choices, e.g., click vs. no click for advertisement recommender system. We observe that the prior works overlook or neglect dependencies in $S \geq \lVert \theta_\star \rVert_2$, where $\theta_\star \in \mathbb{R}^d$ is the unknown parameter vector, which is particularly problematic when $S$ is large, e.g., $S \geq d$. In this work, we improve the dependency on $S$ via a novel approach called {\it regret-to-confidence set conversion (R2CS)}, which allows us to construct a convex confidence set based on only the \textit{existence} of an online learning algorithm with a regret guarantee. Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility and the dependence on other factors such as $d$ and $T$. We apply our new confidence set to the regret analyses of logistic bandits with a new martingale concentration step that circumvents an additional factor of $S$. We then extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS. While we applied R2CS to the (multinomial) logistic model, R2CS is a generic approach for developing confidence sets that can be used for various models, which can be of independent interest.  ( 3 min )
    Boosting Data Analytics With Synthetic Volume Expansion
    arXiv:2310.17848v3 Announce Type: replace Abstract: Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data. Regarding effectiveness, we present the Synthetic Data Generation for Analytics framework. This framework applies statistical approaches to high-quality synthetic data produced by generative models like tabular diffusion models, which, initially trained on raw data, benefit from insights from pertinent studies through transfer learning. A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize. This phenomenon, stemming from the challenge of accurately mirroring raw data distributions, highlights a "reflection point"-an ideal volume of synthetic data defined by specific error metrics. Through three case studies, sentiment analysis, predictive modeling of structured data, and inference in tabular data, we validate the superior performance of this framework compared to conventional approaches. On privacy, synthetic data imposes lower risks while supporting the differential privacy standard. These studies underscore synthetic data's untapped potential in redefining data science's landscape.  ( 3 min )
    DeepFDR: A Deep Learning-based False Discovery Rate Control Method for Neuroimaging Data
    arXiv:2310.13349v3 Announce Type: replace Abstract: Voxel-based multiple testing is widely used in neuroimaging data analysis. Traditional false discovery rate (FDR) control methods often ignore the spatial dependence among the voxel-based tests and thus suffer from substantial loss of testing power. While recent spatial FDR control methods have emerged, their validity and optimality remain questionable when handling the complex spatial dependencies of the brain. Concurrently, deep learning methods have revolutionized image segmentation, a task closely related to voxel-based multiple testing. In this paper, we propose DeepFDR, a novel spatial FDR control method that leverages unsupervised deep learning-based image segmentation to address the voxel-based multiple testing problem. Numerical studies, including comprehensive simulations and Alzheimer's disease FDG-PET image analysis, demonstrate DeepFDR's superiority over existing methods. DeepFDR not only excels in FDR control and effectively diminishes the false nondiscovery rate, but also boasts exceptional computational efficiency highly suited for tackling large-scale neuroimaging data.  ( 2 min )
    Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
    arXiv:2307.03410v2 Announce Type: replace Abstract: Feature-distributed data, referred to data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.  ( 2 min )
    On the Expected Size of Conformal Prediction Sets
    arXiv:2306.07254v3 Announce Type: replace Abstract: While conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.  ( 2 min )
    Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression
    arXiv:2305.16534v2 Announce Type: replace Abstract: This paper introduces a novel theoretical framework for the analysis of vector-valued neural networks through the development of vector-valued variation spaces, a new class of reproducing kernel Banach spaces. These spaces emerge from studying the regularization effect of weight decay in training networks with activations like the rectified linear unit (ReLU). This framework offers a deeper understanding of multi-output networks and their function-space characteristics. A key contribution of this work is the development of a representer theorem for the vector-valued variation spaces. This representer theorem establishes that shallow vector-valued neural networks are the solutions to data-fitting problems over these infinite-dimensional spaces, where the network widths are bounded by the square of the number of training data. This observation reveals that the norm associated with these vector-valued variation spaces encourages the learning of features that are useful for multiple tasks, shedding new light on multi-task learning with neural networks. Finally, this paper develops a connection between weight-decay regularization and the multi-task lasso problem. This connection leads to novel bounds for layer widths in deep networks that depend on the intrinsic dimensions of the training data representations. This insight not only deepens the understanding of the deep network architectural requirements, but also yields a simple convex optimization method for deep neural network compression. The performance of this compression procedure is evaluated on various architectures.  ( 3 min )
    Uncertainty-aware Pseudo-label Selection for Positive-Unlabeled Learning
    arXiv:2201.13192v3 Announce Type: replace Abstract: Positive-unlabeled learning (PUL) aims at learning a binary classifier from only positive and unlabeled training data. Even though real-world applications often involve imbalanced datasets where the majority of examples belong to one class, most contemporary approaches to PUL do not investigate performance in this setting, thus severely limiting their applicability in practice. In this work, we thus propose to tackle the issues of imbalanced datasets and model calibration in a PUL setting through an uncertainty-aware pseudo-labeling procedure (PUUPL): by boosting the signal from the minority class, pseudo-labeling expands the labeled dataset with new samples from the unlabeled set, while explicit uncertainty quantification prevents the emergence of harmful confirmation bias leading to increased predictive performance. Within a series of experiments, PUUPL yields substantial performance gains in highly imbalanced settings while also showing strong performance in balanced PU scenarios across recent baselines. We furthermore provide ablations and sensitivity analyses to shed light on PUUPL's several ingredients. Finally, a real-world application with an imbalanced dataset confirms the advantage of our approach.  ( 2 min )
    Nested Nonparametric Instrumental Variable Regression: Long Term, Mediated, and Time Varying Treatment Effects
    arXiv:2112.14249v3 Announce Type: replace Abstract: Several causal parameters in short panel data models are scalar summaries of a function called a nested nonparametric instrumental variable regression (nested NPIV). Examples include long term, mediated, and time varying treatment effects identified using proxy variables. However, it appears that no prior estimators or guarantees for nested NPIV exist, preventing flexible estimation and inference for these causal parameters. A major challenge is compounding ill posedness due to the nested inverse problems. We analyze adversarial estimators of nested NPIV, and provide sufficient conditions for efficient inference on the causal parameter. Our nonasymptotic analysis has three salient features: (i) introducing techniques that limit how ill posedness compounds; (ii) accommodating neural networks, random forests, and reproducing kernel Hilbert spaces; and (iii) extending to causal functions, e.g. long term heterogeneous treatment effects. We measure long term heterogeneous treatment effects of Project STAR and mediated proximal treatment effects of the Job Corps.  ( 2 min )
    Local Minima Structures in Gaussian Mixture Models
    arXiv:2009.13040v3 Announce Type: replace Abstract: We investigate the landscape of the negative log-likelihood function of Gaussian Mixture Models (GMMs) with a general number of components in the population limit. As the objective function is non-convex, there can be multiple local minima that are not globally optimal, even for well-separated mixture models. Our study reveals that all local minima share a common structure that partially identifies the cluster centers (i.e., means of the Gaussian components) of the true location mixture. Specifically, each local minimum can be represented as a non-overlapping combination of two types of sub-configurations: fitting a single mean estimate to multiple Gaussian components or fitting multiple estimates to a single true component. These results apply to settings where the true mixture components satisfy a certain separation condition, and are valid even when the number of components is over- or under-specified. We also present a more fine-grained analysis for the setting of one-dimensional GMMs with three components, which provide sharper approximation error bounds with improved dependence on the separation.  ( 2 min )
    Grid Monitoring and Protection with Continuous Point-on-Wave Measurements and Generative AI
    arXiv:2403.06942v1 Announce Type: cross Abstract: Purpose This article presents a case for a next-generation grid monitoring and control system, leveraging recent advances in generative artificial intelligence (AI), machine learning, and statistical inference. Advancing beyond earlier generations of wide-area monitoring systems built upon supervisory control and data acquisition (SCADA) and synchrophasor technologies, we argue for a monitoring and control framework based on the streaming of continuous point-on-wave (CPOW) measurements with AI-powered data compression and fault detection. Methods and Results: The architecture of the proposed design originates from the Wiener-Kallianpur innovation representation of a random process that transforms causally a stationary random process into an innovation sequence with independent and identically distributed random variables. This work presents a generative AI approach that (i) learns an innovation autoencoder that extracts innovation sequence from CPOW time series, (ii) compresses the CPOW streaming data with innovation autoencoder and subband coding, and (iii) detects unknown faults and novel trends via nonparametric sequential hypothesis testing. Conclusion: This work argues that conventional monitoring using SCADA and phasor measurement unit (PMU) technologies is ill-suited for a future grid with deep penetration of inverter-based renewable generations and distributed energy resources. A monitoring system based on CPOW data streaming and AI data analytics should be the basic building blocks for situational awareness of a highly dynamic future grid.  ( 2 min )
    Simplicity Bias of Transformers to Learn Low Sensitivity Functions
    arXiv:2403.06925v1 Announce Type: cross Abstract: Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of the inductive biases that they have and how those biases are different from other neural network architectures remains elusive. Various neural network architectures such as fully connected networks have been found to have a simplicity bias towards simple functions of the data; one version of this simplicity bias is a spectral bias to learn simple functions in the Fourier space. In this work, we identify the notion of sensitivity of the model to random changes in the input as a notion of simplicity bias which provides a unified metric to explain the simplicity and spectral bias of transformers across different data modalities. We show that transformers have lower sensitivity than alternative architectures, such as LSTMs, MLPs and CNNs, across both vision and language tasks. We also show that low-sensitivity bias correlates with improved robustness; furthermore, it can also be used as an efficient intervention to further improve the robustness of transformers.  ( 2 min )
    On the Generalization Ability of Unsupervised Pretraining
    arXiv:2403.06871v1 Announce Type: cross Abstract: Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distribution and tasks in pre-training and fine-tuning stage. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze generalization bound of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhances the generalization of fine-tuned model. Overall, our results contribute to a better understanding of unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.  ( 2 min )
    In-context Exploration-Exploitation for Reinforcement Learning
    arXiv:2403.06826v1 Announce Type: cross Abstract: In-context learning is a promising approach for online policy learning of offline reinforcement learning (RL) methods, which can be achieved at inference time without gradient optimization. However, this method is hindered by significant computational costs resulting from the gathering of large training trajectory sets and the need to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process biased methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method.  ( 2 min )
    Multistep Consistency Models
    arXiv:2403.06807v1 Announce Type: cross Abstract: Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: A unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas we show that a $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can train models more easily that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on Imagenet 64 in 8 step and 2.1 FID on Imagenet128 in 8 steps with consistency distillation. We also show that our method scales to a text-to-image diffusion model, generating samples that are very close to the quality of the original model.  ( 2 min )
    Monotone Individual Fairness
    arXiv:2403.06812v1 Announce Type: cross Abstract: We revisit the problem of online learning with individual fairness, where an online learner strives to maximize predictive accuracy while ensuring that similar individuals are treated similarly. We first extend the frameworks of Gillen et al. (2018); Bechavod et al. (2020), which rely on feedback from human auditors regarding fairness violations, as we consider auditing schemes that are capable of aggregating feedback from any number of auditors, using a rich class we term monotone aggregation functions. We then prove a characterization for such auditing schemes, practically reducing the analysis of auditing for individual fairness by multiple auditors to that of auditing by (instance-specific) single auditors. Using our generalized framework, we present an oracle-efficient algorithm achieving an upper bound frontier of $(\mathcal{O}(T^{1/2+2b}),\mathcal{O}(T^{3/4-b}))$ respectively for regret, number of fairness violations, for $0\leq b \leq 1/4$. We then study an online classification setting where label feedback is available for positively-predicted individuals only, and present an oracle-efficient algorithm achieving an upper bound frontier of $(\mathcal{O}(T^{2/3+2b}),\mathcal{O}(T^{5/6-b}))$ for regret, number of fairness violations, for $0\leq b \leq 1/6$. In both settings, our algorithms improve on the best known bounds for oracle-efficient algorithms. Furthermore, our algorithms offer significant improvements in computational efficiency, greatly reducing the number of required calls to an (offline) optimization oracle per round, to $\tilde{\mathcal{O}}(\alpha^{-2})$ in the full information setting, and $\tilde{\mathcal{O}}(\alpha^{-2} + k^2T^{1/3})$ in the partial information setting, where $\alpha$ is the sensitivity for reporting fairness violations, and $k$ is the number of individuals in a round.  ( 2 min )
    Scalable Online Exploration via Coverability
    arXiv:2403.06571v1 Announce Type: cross Abstract: Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata: 1. Intrinsic complexity control. $L_1$-Coverage is associated with a structural parameter, $L_1$-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing $L_1$-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. $L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability. Empirically, we find that $L_1$-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.  ( 2 min )
    Sliced-Wasserstein Distances and Flows on Cartan-Hadamard Manifolds
    arXiv:2403.06560v1 Announce Type: cross Abstract: While many Machine Learning methods were developed or transposed on Riemannian manifolds to tackle data with known non Euclidean geometry, Optimal Transport (OT) methods on such spaces have not received much attention. The main OT tool on these spaces is the Wasserstein distance which suffers from a heavy computational burden. On Euclidean spaces, a popular alternative is the Sliced-Wasserstein distance, which leverages a closed-form solution of the Wasserstein distance in one dimension, but which is not readily available on manifolds. In this work, we derive general constructions of Sliced-Wasserstein distances on Cartan-Hadamard manifolds, Riemannian manifolds with non-positive curvature, which include among others Hyperbolic spaces or the space of Symmetric Positive Definite matrices. Then, we propose different applications. Additionally, we derive non-parametric schemes to minimize these new distances by approximating their Wasserstein gradient flows.  ( 2 min )
    Probabilistic Neural Circuits
    arXiv:2403.06235v1 Announce Type: cross Abstract: Probabilistic circuits (PCs) have gained prominence in recent years as a versatile framework for discussing probabilistic models that support tractable queries and are yet expressive enough to model complex probability distributions. Nevertheless, tractability comes at a cost: PCs are less expressive than neural networks. In this paper we introduce probabilistic neural circuits (PNCs), which strike a balance between PCs and neural nets in terms of tractability and expressive power. Theoretically, we show that PNCs can be interpreted as deep mixtures of Bayesian networks. Experimentally, we demonstrate that PNCs constitute powerful function approximators.  ( 2 min )
    How much data do you need? Part 2: Predicting DL class specific training dataset sizes
    arXiv:2403.06311v1 Announce Type: cross Abstract: This paper targets the question of predicting machine learning classification model performance, when taking into account the number of training examples per class and not just the overall number of training examples. This leads to the a combinatorial question, which combinations of number of training examples per class should be considered, given a fixed overall training dataset size. In order to solve this question, an algorithm is suggested which is motivated from special cases of space filling design of experiments. The resulting data are modeled using models like powerlaw curves and similar models, extended like generalized linear models i.e. by replacing the overall training dataset size by a parametrized linear combination of the number of training examples per label class. The proposed algorithm has been applied on the CIFAR10 and the EMNIST datasets.  ( 2 min )
    LinearAPT: An Adaptive Algorithm for the Fixed-Budget Thresholding Linear Bandit Problem
    arXiv:2403.06230v1 Announce Type: cross Abstract: In this study, we delve into the Thresholding Linear Bandit (TLB) problem, a nuanced domain within stochastic Multi-Armed Bandit (MAB) problems, focusing on maximizing decision accuracy against a linearly defined threshold under resource constraints. We present LinearAPT, a novel algorithm designed for the fixed budget setting of TLB, providing an efficient solution to optimize sequential decision-making. This algorithm not only offers a theoretical upper bound for estimated loss but also showcases robust performance on both synthetic and real-world datasets. Our contributions highlight the adaptability, simplicity, and computational efficiency of LinearAPT, making it a valuable addition to the toolkit for addressing complex sequential decision-making challenges.  ( 2 min )
    An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling
    arXiv:2403.06183v1 Announce Type: cross Abstract: Understanding the dimension dependency of computational complexity in high-dimensional sampling problem is a fundamental problem, both from a practical and theoretical perspective. Compared with samplers with unbiased stationary distribution, e.g., Metropolis-adjusted Langevin algorithm (MALA), biased samplers, e.g., Underdamped Langevin Dynamics (ULD), perform better in low-accuracy cases just because a lower dimension dependency in their complexities. Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion is able to converge dimension independently for strongly log-concave target distributions. Nonetheless, it remains open whether such property establishes for more general cases. In this paper, we investigate the prior diffusion technique for the target distributions satisfying log-Sobolev inequality (LSI), which covers a much broader class of distributions compared to the strongly log-concave ones. In particular, we prove that the modified Langevin algorithm can also obtain the dimension-independent convergence of KL divergence with different step size schedules. The core of our proof technique is a novel construction of an interpolating SDE, which significantly helps to conduct a more accurate characterization of the discrete updates of the overdamped Langevin dynamics. Our theoretical analysis demonstrates the benefits of prior diffusion for a broader class of target distributions and provides new insights into developing faster sampling algorithms.  ( 3 min )
    Membership Testing in Markov Equivalence Classes via Independence Query Oracles
    arXiv:2403.05759v1 Announce Type: cross Abstract: Understanding causal relationships between variables is a fundamental problem with broad impact in numerous scientific fields. While extensive research has been dedicated to learning causal graphs from data, its complementary concept of testing causal relationships has remained largely unexplored. While learning involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, the testing counterpart addresses the following critical question: Given a specific MEC and observational data from some causal graph, can we determine if the data-generating causal graph belongs to the given MEC? We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\exp(\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\exp(O(s))$ tests, matching our lower bound. Compared to the learning problem, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, this shows that testing is relatively easier. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes. Additionally, using the DAG associahedron, we provide a geometric interpretation of testing versus learning and discuss how our testing result can aid learning.  ( 2 min )
    Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment
    arXiv:2403.06100v1 Announce Type: cross Abstract: A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge combinations of pairs prohibit it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and allocation of evaluation volumes with online learning in a crowdsourcing environment. We use a preference-based online learning method based on a sorting algorithm to identify the total order of evaluation targets with minimum sample volumes. Our online learning algorithm supports parallel and asynchronous execution under fixed-budget conditions required for crowdsourcing. Our experiment on preference-based subjective evaluation of synthetic speech shows that our method successfully optimizes the test by reducing pair combinations from 351 to 83 and allocating optimal evaluation volumes for each pair ranging from 30 to 663 without compromising evaluation accuracies and wasting budget allocations.  ( 2 min )
    Nonparametric Automatic Differentiation Variational Inference with Spline Approximation
    arXiv:2403.06302v1 Announce Type: new Abstract: Automatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded support. Compared with widely-used nonparametric variational inference methods, the proposed method is easy to implement and adaptive to various data structures. By adopting the spline approximation, we derive a lower bound of the importance weighted autoencoder and establish the asymptotic consistency. Experiments demonstrate the efficiency of the proposed method in approximating complex posterior distributions and improving the performance of generative models with incomplete data.  ( 2 min )
    Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes
    arXiv:2403.05669v1 Announce Type: new Abstract: Clustering data objects into homogeneous groups is one of the most important tasks in data mining. Spectral clustering is arguably one of the most important algorithms for clustering, as it is appealing for its theoretical soundness and is adaptable to many real-world data settings. For example, mixed data, where the data is composed of numerical and categorical features, is typically handled via numerical discretization, dummy coding, or similarity computation that takes into account both data types. This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm, avoiding the need for data preprocessing or the use of sophisticated similarity functions. We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function. Furthermore, we demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data. Finally, we compare the performance of our algorithms against other related methods and show that it provides a competitive alternative to them in terms of performance and runtime.  ( 2 min )
    Grafting: Making Random Forests Consistent
    arXiv:2403.06015v1 Announce Type: new Abstract: Despite their performance and widespread use, little is known about the theory of Random Forests. A major unanswered question is whether, or when, the Random Forest algorithm is consistent. The literature explores various variants of the classic Random Forest algorithm to address this question and known short-comings of the method. This paper is a contribution to this literature. Specifically, the suitability of grafting consistent estimators onto a shallow CART is explored. It is shown that this approach has a consistency guarantee and performs well in empirical settings.  ( 2 min )
    The ALL0CORE Tensor Decomposition for Sparse Count Data
    arXiv:2403.06153v1 Announce Type: new Abstract: This paper introduces ALL0CORE, a new form of probabilistic non-negative tensor decomposition. ALL0CORE is a Tucker decomposition where the number of non-zero elements (i.e., the L0-norm) of the core tensor is constrained to a preset value Q much smaller than the size of the core. While the user dictates the total budget Q, the locations and values of the non-zero elements are latent variables and allocated across the core tensor during inference. ALL0CORE -- i.e., allocated L0-constrained core -- thus enjoys both the computational tractability of CP decomposition and the qualitatively appealing latent structure of Tucker. In a suite of real-data experiments, we demonstrate that ALL0CORE typically requires only tiny fractions (e.g.,~1%) of the full core to achieve the same results as full Tucker decomposition at only a correspondingly tiny fraction of the cost.  ( 2 min )
    Statistical Efficiency of Distributional Temporal Difference
    arXiv:2403.05811v1 Announce Type: new Abstract: Distributional reinforcement learning (DRL), which cares about the full distribution of returns instead of just the mean, has achieved empirical success in various domains. One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. A distributional temporal difference (TD) algorithm has been accordingly proposed, which is an extension of the temporal difference algorithm in the classic RL literature. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference algorithm (CTD) and quantile temporal difference algorithm (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD algorithm (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process with state space $S$ and action space $A$, we show that in the case of NTD we need $\wtilde O\prn{\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+2}}}$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. Under some mild assumptions, $\wtilde O\prn{\frac{1}{\varepsilon^{2}(1-\gamma)^{4}}}$ iterations suffices to ensure the Kolmogorov-Smirnov distance between the NTD estimator $\hat\eta^\pi$ and $\eta^\pi$ less than $\varepsilon$ with high probability. And we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.  ( 2 min )

  • Open

    The Legal Frontier: Exploring AI’s Influence on Legal Research
    Traditionally reliant on human expertise and manual research, the legal profession is at the cusp of a technological shift as Artificial…  ( 11 min )
    Pure audacity: the fake robot scam
    By Frederic Friedel  ( 11 min )
    What is Bias & Why Does It Happen in AI?
    Why Bias in LLMs is Unavoidable Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 4 min )
  • Open

    [Discussion]:"Datasheets for Datasets" paper and its relevance in the real world
    Hi all, I wanted to make a post to launch a discussion about an interesting paper called Datasheets for Datasets (> 1900 citations, so a big impact in the research community). Briefly explained, this paper discusses the lack of a standardized way for documenting a dataset to train ML models and proposes to document datasets with a 'datasheet,' similar to electronic components (or Warhammer figs, or any other items you know with datasheets). The authors illustrate this datasheet system with prototype examples of datasheets for open-source datasets (e.g., the LFW dataset). IMO the examples in the appendices quickly convey the ideas expressed in the whole paper. So, if you don't want to spend too much time reading this paper, you can just skip to the prototypes directly. In both my previou…
    [D] Treating Holidays/Events in time series data
    Hi Everyone! I am working for a company that produces a model for retail sales (i.e. units). This model currently performs well during 'normal' or 'business as usual' periods, but less well during sales events or other holiday events. During the latter it under forecasts - i.e. a sales event leads to a higher units sold. I have a list of past and future holiday/sales events - is there a proposed methodology to analytically 'correct' or adjust the original forecast for these holiday/sales events? The forecast is at day and store level. There are over 500 stores in the company. Thanks for any suggestions! submitted by /u/serpentna [link] [comments]
    [D] What are the major differences, specifically in the data architecture and training pipeline, between the process of training an LLM from scratch vs training smaller deep learning models?
    When it comes to training LLMs that require many days, the pipeline for storing the embeddings and performing computations can vary from the process of training smaller models that could run in single GPUs. Here are my questions: In case of LLMs, - Are there differences in the embeddings generated for training smaller models vs larger models? - In the case of multi-modality, how does the generation and storage of embeddings vary? - How does the training pipeline for text-only modality look like? When it comes to Images and Audio, what are the differences in training pipelines? I would appreciate any resources where I can read up deeply about the core processes. Thanks! submitted by /u/metalvendetta [link] [comments]
    [P] LLM Hallucination Detection for Real-World RAG
    Hi, I’m Jonathan, LastMile AI’s ML Engineer. We’re building tools to evaluate LLM outputs in production – specifically for RAG applications. Right now, we’re focused on hallucination detection: given the data retrieved during a RAG query, is the response faithful to that data, or hallucinatory? This is typically done at scale via model-grading: essentially, you give another LLM (say, GPT-4) the question, data, and response, and you get a hallucination verdict back. We looked at some open source tools for this like Phoenix by Arize and discovered that you can get strong accuracy with much, much smaller (specialized) models. We’re extending early access to our v0 model and are interested in working with anyone who needs help evaluating their production RAG system. Early access: https://cdn.forms-content.sg-form.com/e8b9b4d4-dda0-11ee-85ff-daa2c5cfc593 Details: https://blog.lastmileai.dev/harder-better-faster-stronger-llm-hallucination-detection-for-real-world-rag-part-i-949248f0ad94 submitted by /u/InevitableSky2801 [link] [comments]
    [D] Cloning git repo Huggingface medium model of whisper AI
    I'm not very experienced with all this, but I'm currently trying to clone the git repo from here: https://huggingface.co/openai/whisper-medium My git bash window gave me the following but is now just hanging with no progress indicator $ git clone https://huggingface.co/openai/whisper-medium Cloning into 'whisper-medium'... remote: Enumerating objects: 175, done. remote: Counting objects: 100% (6/6), done. remote: Compressing objects: 100% (6/6), done. remote: Total 175 (delta 1), reused 0 (delta 0), pack-reused 169 s: 99% (174/175), 2.75 MiB | 354.00 KiB/s Receiving objects: 100% (175/175), 2.88 MiB | 297.00 KiB/s, done. Resolving deltas: 100% (102/102), done. The folder created is staying at 4.37MB in size, yet my network traffic tells me I've downloaded 5GB so far... Where's it going?? When will it be done??? I thought the medium model was about 3GB in size... totally confused! submitted by /u/AudioBabble [link] [comments]
    [D] Optimising a (2+1)D ResNet for event indexing
    I’m developing a (2+1)D resnet for object detection, event classification and event indexing on video data. Currently the only features I have in the first iteration are pixel values per frame. I’ve also gathered other features such as texture, optical flow, motion vectors, color histograms and edges for each frame. My main question is how can I fuse all the features as an input into my model, but I’d also be interested to hear if anyone has any opinions on this or how I could improve my model anyways. The model architecture is based off the following paper: https://arxiv.org/pdf/1711.11248v3.pdf submitted by /u/RxPiku [link] [comments]
    [D] ML interview burnout
    I feel that I have burnout from data science interviews. I have been a data scientist in the field for 5 years. There are so many technical things in the field. Especially there are so many new papers coming up to catch up in the year 2023 for how to optimize the LLM models and the use of vector DB. The more time I spend on the interview preparation, the less time I can do with my new knowledge acquisition. What should I do to overcome this situation? Great thanks. Why I feel that interview preparation is not useful In practical work, we can go through different preparations for a topic to recall all the memories and organize the concepts properly before presenting the ideas to other colleagues. However, is it possible to retrieve all the information immediately during the interview? Some are the knowledge dated back to the school bookwork that no one touched for decades. Some questions are about less commonly seen design patterns. I feel bad when I cannot answer a question not because I don't know it but because I really cannot summarize it within a short period of time. It is like the data is archived to the AWS S3 glacier so data retrieval is time-consuming and costly. Also, not being able to answer some code design patterns does not mean that I cannot write good codes and solve problems. To prepare for those interviews, I tried to revisit some key concepts and various not-so-useful code patterns but it was very time-consuming. Honestly, the pay of the jobs is the high at all. I am not talking about any large tech companies but some SME. I feel confused by the standard. submitted by /u/MillionLiar [link] [comments]
    [P]Best Model for Pose Estimation
    Hi!I am trying to make a virtual exercise assistant app in Android using flutter.....i am well versed in flutter but u can guess that i dont have much experience in implementing models....so can u suggest a model for pose estimation that is more accurate and not hard to implement submitted by /u/DeepLet4383 [link] [comments]
    [D] Are there non technical peer reviewed journals for applied research?
    I was looking at this paper today https://arxiv.org/abs/2403.04769 I lack the knowledge to publish a purely theoretical Machine learning paper (nor do I have the time and energy to truly delve into the field and catching up at a level of producing meaningful output) but I think I have a couple of ideas and would be interested in doing research in the form of finding vulnerabilities, or comparing LLMs or models in novel ways etc. Stuff essentially that don't require significant theoretical background and people with a varied skillset have an easier time getting their foot in their water. At the same time though I am an academic in a completely different field and for my career I need papers published at journals and not just at arxiv. So my question is, papers like the one I uploaded are extremely useful of course in practice but scientifically speaking do they get published somewhere in a peer reviewed journal? If yes, where? Interested to hear your thoughts and advice, thanks in advance! submitted by /u/Dyoakom [link] [comments]
    [D] Can someone please clarify if web search LLMs like Perplexity, You.com, or Coral Search are crawling the entire web themselves? Otherwise, how do they differ from simply combining a search API with any LLM model?
    If they are crawling the web themselves, could an experienced person explain how difficult this task is and how these companies differentiate from each other, as I cannot see much difference in the answers provided by each of them? submitted by /u/Fit-Set6851 [link] [comments]
    [Discussion] Do diffusion models do any representation learning?
    Models like VAEs and transformer based language models (MLM or autoregressive) extract higher order representations of data. Is there any understanding of whether DMs do this? It doesn't seem to get discussed in the papers I've seen presumably because the focus is so much on the generative aspect. submitted by /u/daking999 [link] [comments]
  • Open

    Need help with NEAT-python
    i am trying to use neat python for a simple example like BipedalWalker but the training never starts, instead i only get: ****** Running generation 0 ****** this is the code: import os import pickle import neat import numpy as np #import cart_pole import gymnasium as gym runs_per_net = 5 # Use the NN network phenotype and the discrete actuator force function. def eval_genome(genome, config): net = neat.nn.FeedForwardNetwork.create(genome, config) fitnesses = [] for runs in range(runs_per_net): env = gym.make("BipedalWalker-v3") observation,_ = env.reset() fitness = 0.0 done = False while not done: action = net.activate(observation) observation, reward, done, _,info = env.step(action) fitness += reward #env.render() fitnesses.append(fitness) return np.mean(fitnesses) def eval_genomes(genomes, config): for genome_id, genome in genomes: genome.fitness = eval_genome(genome, config) def run(): # Load the config file, which is assumed to live in the same directory as this script. local_dir = os.path.dirname(__file__) config_path = os.path.join(local_dir, 'config.txt') config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction, neat.DefaultSpeciesSet, neat.DefaultStagnation, config_path) pop = neat.Population(config) stats = neat.StatisticsReporter() pop.add_reporter(stats) pop.add_reporter(neat.StdOutReporter(True)) # Run NEAT algorithm winner = pop.run(eval_genomes) # Save the winner. with open('winner', 'wb') as f: pickle.dump(winner, f) print(winner) run() i used this tutorial. the config file has only the pop size changed from 250 to 20 but i cant understand why it doesnt train submitted by /u/SebyR [link] [comments]
    Deepmind: Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
    submitted by /u/LushousLightfoot [link] [comments]
    The direction of control in most RL libraries seems inversed
    Looking at libraries like Stable Baseline 3 it seems to me that the direction of control is opposite of what ought to be. As far as I can tell, almost all of the RL examples I see assume something like a well-defined "Environment" style object that is able to evaluate actions and produce rewards as a single unified function (e.g. the step for SB3's env specifications) There also seems to be an assumption about training and then deploying (in a "win competition" style of way). Now, this doesn't fundamentally limit what one can do with something like SB3, but it makes a "real world" usecase much harder. Assume for example something like controlling a robot, where the robot's "actions" might have many stakeholders (remote commands, emergency stops, hardcoded rules that supersede the RL action, hardcoded constraints that modify the action... etc) In this kinda of environment (any real-world application) it'd make sense for an RL environment to be a "service" style entity, as opposed to the highest-level orchestrator -- i.e. something that exposes methods like recommend_action(inputs) (or predict(inputs)) and reward(reward, [prev_inputs, prev_outputs]) (or step or train ... the name is not that relevant) I am 95% sure that "I don't get it" and for some reason, the kind of interfaces that are common now are either better or necessary, but if someone could help alleviate my confusion about these design choices I'd probably understand a bit more about the RL library ecosystem and the constraints under which it evolved. submitted by /u/elcric_krej [link] [comments]
    I had fun making this parallel simulation manager so I wanted to share it
    submitted by /u/hbonnavaud [link] [comments]
    PPO MultiDiscrete Output
    Hi All, I'm curious if anyone has any tips on how to retrofit the PPO algorithm to output a MultiDiscrete. What I am currently trying to do is to create a "manager" agent that is directing other agents which task to complete. In order to do that I need 4 different outputs of actions 0, 1, or 2. But I'm not sure quite how to do that because I can't just do a softmax layer. I've tried to do a multi-head network where I do a softmax on 4 different heads and return the argmax of each one, but then I'm concerned about the loss function with 4 different heads. Any suggestions would be helpful! Thanks! submitted by /u/Cheap_Leather_6432 [link] [comments]
    Placing rectangles in area (2D) while optimizing several functions and avoiding defined areas
    Hello, We have an area in which objects (from a defined list of objects) can be placed. This area is defined by points (for simplicity let's say four points that connected display a square). In this area we can also have areas, in which objects cannot be placed (there also can be no overlap). This reminds one of the 2D bin packing problem. However, there are also prices and other factors attached to the objects we place. We want to place the objects in a way that optimizes all these factors (low price etc. ...). I have done a lot of ML in the past but I have never done Reinforcement Learning. First of all: Is this a problem where RL is the best solution? To me that appears to be the case, but I am uncertain... Do you have any suggestions or tips for tackling this problem? I am thinking of an algorithm that at first places objects randomly (within the areas where objects can be placed) while getting penalized for price and other things that can be deferred from the output, so that it learns... ​ What do you guys think? Thanks in advance! submitted by /u/Lucky_Funny_3259 [link] [comments]
    In isaacgymenv, how to specify the order of the joints in state and action?
    Hi, Im new to RL and just starting to use isaacgymenv and isaacgym. When controlling a robot such as CartPole or Franka, how can one determine which element in the returned observation or the provided action corresponds to the joints defined in the URDF file? For example, in IsaacgymEnvs/isaacgymenvs/tasks/cartpole.py (as illustrated in the figure), how do each elements in 'self.obs_buf' correspond to the joints defined in cartpole.urdf. Thank you! ​ submitted by /u/Slight_Rip_516 [link] [comments]
    Tips and pointers for card game
    Hi, I'm planning to make a RL model to play macau with. It's my first project so I would appreciate any pointers and tips that could help me design things. Maybe some video on design you find similar. Probably could use some literature as well but it's hard to find the one that would cover the things I would actually find useful. There is Uno (simplified version of macau that is pretty random and player's choices don't have such impact as in macau) model that is already completed available online but it is too simple. I need much more complex one and I don't even know if it is possible to do... I think for macau I would need to make a table for the agent to gather info on state of each card like: king of hearts[is on hand, if on hand is it playable now, if not playable directly can it be played in bulk (macau allows playing multiple cards of same value at a time), probability of being held by opponent, is on top of stack, used and in stack(would reset after reshuffle)] I have some sense how to design that part but have absolutely no idea how to manage the choices made by agent... While in Uno it was like chose a card that is playable or draw one card in Macau it's more like draw a card(regardless if any on hand can be played (sometimes it's better move than playing a card you can play)), play jock and set demand of value, play ace and set demand of colour, play battle card (like wild cards in uno but can be countered). The more I think of it it becomes more complex, have deadline in a month and need some help on it please 🙏 submitted by /u/Comfortable_Sleep988 [link] [comments]
    Is the 2nd edition of Reinforcement Learning: An Introduction by Sutton better?
    Hi, i want to read Reinforcement Learning: An Introduction by Sutton and Barto. However the newer 2nd edition is quite expansive. Does anyone know if there is a big difference between the 2nd and the 1st edition? submitted by /u/d-eighties [link] [comments]
    What happens if you use something like alphaGo as a reward model for an LLM ?
    This might not be a first time someone asks this but can you do Reinforcement Learning on a LLM and using AlphaGo similar as reward model? Any token that is not a valid move can be punished submitted by /u/rdyazdi [link] [comments]
    "Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?", Du et al 2020
    submitted by /u/gwern [link] [comments]
    DQN Performance Spikes Then Flatlines
    I am playing around with a DQN agent based on this tutorial. When training it on cartpole, the performance seemed to be improving until it hit an average total reward of 500 (which is the max for the cartpole environment). Immediately after that, the performance crashed and flatlined out at a very consistent 90-100 average reward for the rest of training. I thought this was interesting and was wondering if anyone had an intuitive reason why this might have happened. Thanks! My hyperparameters: self.learning_rate = 0.0005 self.gamma = 0.99 self.batch_size = 128 self.start_epsilon = .9 self.end_epsilon = .1 self.epsilon_decay = 40000 self.tau = 0.005 self.clip_grad_value = 100 self.loss_criterion = nn.SmoothL1Loss() self.optimizer = AdamW(self.policy_net.parameters(),lr=self.learning_rate,amsgrad=True) ​ Here's a screenshot of evaluation data during training: https://preview.redd.it/kc9ic6dtpsnc1.png?width=1162&format=png&auto=webp&s=c03202628c95595e2e86dc4bd89082a3c4421f61 submitted by /u/PainindaAsh [link] [comments]
  • Open

    Talk like a graph: Encoding graphs for large language models
    Posted by Bahare Fatemi, Research Scientist, Google Research, and Bryan Perozzi, Research Scientist, Google Research Imagine all the things around you — your friends, tools in your kitchen, or even the parts of your bike. They are all connected in different ways. In computer science, the term graph is used to describe connections between objects. Graphs consist of nodes (the objects themselves) and edges (connections between two nodes, indicating a relationship between them). Graphs are everywhere now. The internet itself is a giant graph of websites linked together. Even the knowledge search engines use is organized in a graph-like way. Furthermore, consider the remarkable advancements in artificial intelligence — such as chatbots that can write stories in seconds, and even soft…  ( 93 min )
  • Open

    Will AI take care of humanity?
    If it gets to the point where robots and AI take over just about every job imaginable and there are billions of unemployed humans, will the AI simply become the workers and the humans will basically live a life of leisure? If so, will that be the point when AI revolts against the humans? submitted by /u/stanerd [link] [comments]
    Researchers claim a rogue AI could replace all music ever recorded with Taylor Swift covers
    submitted by /u/dead_planets_society [link] [comments]
    What are best practices for Machine Learning Engineering?
    submitted by /u/beluis3d [link] [comments]
    "Our jobs are safe" confidence is by seniors. That presents an "aging population" problem.
    Most of the smugness around "sure AI can x, but it can't y" is by senior practitioners. I hear from senior software engineers things like "some of these tools can crank out a website on prompt, but can't translate customer specs into architectural decisions which require the nuanced knowledge of 20 years in the field." Pretend that's true for this thread. The rub is: it won't replace senior engineers. If the implication is that the juniors should be concerned, while the retiring class is safe, we have ourselves the equivalent of Japan's "aging population." In other words, for economic purposes it's the wrong end of the river which is at risk. You might say: a premium should be placed on Computer Science over bootcamps; this favors those who take it seriously (software engineering vs "coding"). But that's not the words used by those confident; I've noted always, they are careful to point out years of experience. Trial and error, management and client relations, architectural hard lessons. Not theory and principal; because AI has that. submitted by /u/lefnire [link] [comments]
    Building Meta’s GenAI Infrastructure: "Meta’s long-term vision is to build artificial general intelligence (AGI) that is open and built responsibly so that it can be widely available for everyone to benefit from"
    submitted by /u/Civil_Collection7267 [link] [comments]
    U.S. Must Act Quickly to Avoid Risks From AI, Report Says
    submitted by /u/Cbo305 [link] [comments]
    Covariant is building ChatGPT for robots
    submitted by /u/Alone-Competition-77 [link] [comments]
    Lets generate 3D objects with AI, drop them in Roblox and animate them with Mixamo
    submitted by /u/aluode [link] [comments]
    How would you like to see generative AI used in video games?
    I’m a game developer and I’ve been trying to come up with creative ways to use AI in my games. Love to hear your thoughts! submitted by /u/MikLik [link] [comments]
  • Open

    DSC Weekly 12 March 2024
    Announcements Top Stories In-Depth The post DSC Weekly 12 March 2024 appeared first on Data Science Central.  ( 21 min )
    Embodied AI: Would LLMs and robots surpass the human brain?
    It is the International Brain Awareness Week for 2024, with events across institutes, globally, from March 11 – 17. This week is a good time to explore the pedestal of the brain, against the astounding rise of machines. AI embodiment was recently featured in Scientific American, AI Chatbot Brains Are Going Inside Robot Bodies. What… Read More »Embodied AI: Would LLMs and robots surpass the human brain? The post Embodied AI: Would LLMs and robots surpass the human brain? appeared first on Data Science Central.  ( 22 min )
  • Open

    What are best practices for Machine Learning Engineering?
    submitted by /u/beluis3d [link] [comments]
    Challenge: Can you get lowest test loss for these Neural Network Playground settings (spiral data with Noise = 20)
    submitted by /u/LatestLurkingHandle [link] [comments]
    What if I want my Perceptron to output a number between n to m, not necesary -1 or 1
    Hi, I am a beginner on neural network(s). I am trying to compose a perceptron or any neuron in a neural network that can output a range of numbers lets say between -7 to +7, including 0; is this possible, cause, I have only seen videos of perceptron(s) neuron that can output -1 and 1, and 0 if you will, but to me that's kind of limited. Or is there something I am not seeing yet, and I would like a basic breakdown into this topic so I can grasp what I am missing. submitted by /u/CapMustang101 [link] [comments]
  • Open

    Head of the Class: Explore AI’s Potential in Higher Education and Research at GTC
    For students, researchers and educators eager to delve into AI, GTC — NVIDIA’s conference on AI and accelerated computing — is in a class of its own. Taking place from March 18-21 at the San Jose Convention Center, GTC features over 900 talks presented by world-renowned experts in fields such as generative AI, high performance Read Article  ( 5 min )
  • Open

    Run an audience overlap analysis in AWS Clean Rooms
    In this post, we explore what an audience overlap analysis is, discuss the current technical approaches and their challenges, and illustrate how you can run secure audience overlap analysis using AWS Clean Rooms.  ( 9 min )
    Large language model inference over confidential data using AWS Nitro Enclaves
    This post discusses how Nitro Enclaves can help protect LLM model deployments, specifically those that use personally identifiable information (PII) or protected health information (PHI). This post is for educational purposes only and should not be used in production environments without additional controls.  ( 12 min )
  • Open

    Interventional Fairness on Partially Known Causal Graphs: A Constrained Optimization Approach
    arXiv:2401.10632v2 Announce Type: replace Abstract: Fair machine learning aims to prevent discrimination against individuals or sub-populations based on sensitive attributes such as gender and race. In recent years, causal inference methods have been increasingly used in fair machine learning to measure unfairness by causal effects. However, current methods assume that the true causal graph is given, which is often not true in real-world applications. To address this limitation, this paper proposes a framework for achieving causal fairness based on the notion of interventions when the true causal graph is partially known. The proposed approach involves modeling fair prediction using a Partially Directed Acyclic Graph (PDAG), specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. The PDAG is used to measure causal fairness, and a constrained optimization problem is formulated to balance between fairness and accuracy. Results on both simulated and real-world datasets demonstrate the effectiveness of this method.  ( 2 min )
    A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs
    arXiv:2312.16083v2 Announce Type: replace Abstract: Continuously-observed event occurrences, often exhibit self- and mutually-exciting effects, which can be well modeled using temporal point processes. Beyond that, these event dynamics may also change over time, with certain periodic trends. We propose a novel variational auto-encoder to capture such a mixture of temporal dynamics. More specifically, the whole time interval of the input sequence is partitioned into a set of sub-intervals. The event dynamics are assumed to be stationary within each sub-interval, but could be changing across those sub-intervals. In particular, we use a sequential latent variable model to learn a dependency graph between the observed dimensions, for each sub-interval. The model predicts the future event times, by using the learned dependency graph to remove the noncontributing influences of past events. By doing so, the proposed model demonstrates its higher accuracy in predicting inter-event times and event types for several real-world event sequences, compared with existing state of the art neural point processes.  ( 2 min )
    Large Language Models to Enhance Bayesian Optimization
    arXiv:2402.03921v2 Announce Type: replace Abstract: Bayesian optimization (BO) is a powerful approach for optimizing complex and expensive-to-evaluate black-box functions. Its importance is underscored in many applications, notably including hyperparameter tuning, but its efficacy depends on efficiently balancing exploration and exploitation. While there has been substantial progress in BO methods, striking this balance remains a delicate process. In this light, we present LLAMBO, a novel approach that integrates the capabilities of Large Language Models (LLM) within BO. At a high level, we frame the BO problem in natural language, enabling LLMs to iteratively propose and evaluate promising solutions conditioned on historical evaluations. More specifically, we explore how combining contextual understanding, few-shot learning proficiency, and domain knowledge of LLMs can improve model-based BO. Our findings illustrate that LLAMBO is effective at zero-shot warmstarting, and enhances surrogate modeling and candidate sampling, especially in the early stages of search when observations are sparse. Our approach is performed in context and does not require LLM finetuning. Additionally, it is modular by design, allowing individual components to be integrated into existing BO frameworks, or function cohesively as an end-to-end method. We empirically validate LLAMBO's efficacy on the problem of hyperparameter tuning, highlighting strong empirical performance across a range of diverse benchmarks, proprietary, and synthetic tasks.  ( 3 min )
    Revisiting Generative Adversarial Networks for Binary Semantic Segmentation on Imbalanced Datasets
    arXiv:2402.02245v2 Announce Type: replace-cross Abstract: Anomalous crack region detection is a typical binary semantic segmentation task, which aims to detect pixels representing cracks on pavement surface images automatically by algorithms. Although existing deep learning-based methods have achieved outcoming results on specific public pavement datasets, the performance would deteriorate dramatically on imbalanced datasets. The input datasets used in such tasks suffer from severely between-class imbalanced problems, hence, it is a core challenge to obtain a robust performance on diverse pavement datasets with generic deep learning models. To address this problem, in this work, we propose a deep learning framework based on conditional Generative Adversarial Networks (cGANs) for the anomalous crack region detection tasks at the pixel level. In particular, the proposed framework containing a cGANs and a novel auxiliary network is developed to enhance and stabilize the generator's performance under two alternative training stages, when estimating a multiscale probability feature map from heterogeneous and imbalanced inputs iteratively. Moreover, several attention mechanisms and entropy strategies are incorporated into the cGANs architecture and the auxiliary network separately to mitigate further the performance deterioration of model training on severely imbalanced datasets. We implement extensive experiments on six accessible pavement datasets. The experimental results from both visual and quantitative evaluation show that the proposed framework can achieve state-of-the-art results on these datasets efficiently and robustly without acceleration of computation complexity.  ( 3 min )
    Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models
    arXiv:2309.15278v2 Announce Type: replace-cross Abstract: Robots need to have a memory of previously observed, but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks including reasoning with occluded objects, novel objects appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well in terms of different numbers of objects and different numbers of distractor actions. Furthermore, we show our approaches outperform an implicit memory baseline.  ( 2 min )
    A dynamical clipping approach with task feedback for Proximal Policy Optimization
    arXiv:2312.07624v2 Announce Type: replace Abstract: Proximal Policy Optimization (PPO) has been broadly applied to various domains, including Large Language Model (LLM) optimization and Robotics learning, etc. However, PPO is limited by a fixed setting for the clipping bound. Specifically, there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process. Truncating the ratio of the new and old policies with a unique clipping bound ensures stable training and can achieve the best training performance. Additionally, previous research suggests that a fixed clipping bound limits the agent's exploration. Therefore, researching a dynamical clipping bound to enhance PPO's performance can be highly beneficial. Different from previous clipping approaches, we consider increasing the maximum cumulative Return in reinforcement learning (RL) tasks as the preference of the RL task, and propose a bi-level proximal policy optimization paradigm, which involves not only optimizing the policy but also dynamically adjusting the clipping bound to reflect the preference of the RL tasks to further elevate the training outcomes and stability of PPO. Based on this bi-level proximal policy optimization paradigm, we introduce a new algorithm named Preference based Proximal Policy Optimization (Pb-PPO). This algorithm utilizes a multi-armed bandit algorithm to reflect RL preferences (we also validate that such approach can be utilized to reflect human preference), recommending the optimal clipping bound for PPO in each epoch, thereby achieving more stable and better training outcomes.  ( 3 min )
    A "Perspectival" Mirror of the Elephant: Investigating Language Bias on Google, ChatGPT, YouTube, and Wikipedia
    arXiv:2303.16281v3 Announce Type: replace-cross Abstract: Contrary to Google Search's mission of delivering information from "many angles so you can form your own understanding of the world," we find that Google and its most prominent returned results - Wikipedia and YouTube - simply reflect a narrow set of culturally dominant views tied to the search language for complex topics like "Buddhism," "Liberalism," "colonization," "Iran" and "America." Simply stated, they present, to varying degrees, distinct information across the same search in different languages, a phenomenon we call language bias. This paper presents evidence and analysis of language bias and discusses its larger social implications. We find that our online searches and emerging tools like ChatGPT turn us into the proverbial blind person touching a small portion of an elephant, ignorant of the existence of other cultural perspectives. Language bias sets a strong yet invisible cultural barrier online, where each language group thinks they can see other groups through searches, but in fact, what they see is their own reflection.  ( 2 min )
    LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
    arXiv:2309.12307v3 Announce Type: replace-cross Abstract: We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S^2-Attn. LongLoRA demonstrates strong empirical results on various tasks on Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from 4k context to 100k, or Llama2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.  ( 3 min )
    Can Error Mitigation Improve Trainability of Noisy Variational Quantum Algorithms?
    arXiv:2109.01051v2 Announce Type: replace-cross Abstract: Variational Quantum Algorithms (VQAs) are often viewed as the best hope for near-term quantum advantage. However, recent studies have shown that noise can severely limit the trainability of VQAs, e.g., by exponentially flattening the cost landscape and suppressing the magnitudes of cost gradients. Error Mitigation (EM) shows promise in reducing the impact of noise on near-term devices. Thus, it is natural to ask whether EM can improve the trainability of VQAs. In this work, we first show that, for a broad class of EM strategies, exponential cost concentration cannot be resolved without committing exponential resources elsewhere. This class of strategies includes as special cases Zero Noise Extrapolation, Virtual Distillation, Probabilistic Error Cancellation, and Clifford Data Regression. Second, we perform analytical and numerical analysis of these EM protocols, and we find that some of them (e.g., Virtual Distillation) can make it harder to resolve cost function values compared to running no EM at all. As a positive result, we do find numerical evidence that Clifford Data Regression (CDR) can aid the training process in certain settings where cost concentration is not too severe. Our results show that care should be taken in applying EM protocols as they can either worsen or not improve trainability. On the other hand, our positive results for CDR highlight the possibility of engineering error mitigation methods to improve trainability.  ( 3 min )
    Precedence-Constrained Winter Value for Effective Graph Data Valuation
    arXiv:2402.01943v2 Announce Type: replace Abstract: Data valuation is essential for quantifying data's worth, aiding in assessing data quality and determining fair compensation. While existing data valuation methods have proven effective in evaluating the value of Euclidean data, they face limitations when applied to the increasingly popular graph-structured data. Particularly, graph data valuation introduces unique challenges, primarily stemming from the intricate dependencies among nodes and the exponential growth in value estimation costs. To address the challenging problem of graph data valuation, we put forth an innovative solution, Precedence-Constrained Winter (PC-Winter) Value, to account for the complex graph structure. Furthermore, we develop a variety of strategies to address the computational challenges and enable efficient approximation of PC-Winter. Extensive experiments demonstrate the effectiveness of PC-Winter across diverse datasets and tasks.  ( 2 min )
    Structured Entity Extraction Using Large Language Models
    arXiv:2402.04437v2 Announce Type: replace-cross Abstract: Recent advances in machine learning have significantly impacted the field of information extraction, with Large Language Models (LLMs) playing a pivotal role in extracting structured information from unstructured text. This paper explores the challenges and limitations of current methodologies in structured entity extraction and introduces a novel approach to address these issues. We contribute to the field by first introducing and formalizing the task of Structured Entity Extraction (SEE), followed by proposing Approximate Entity Set OverlaP (AESOP) Metric designed to appropriately assess model performance on this task. Later, we propose a new model that harnesses the power of LLMs for enhanced effectiveness and efficiency through decomposing the entire extraction task into multiple stages. Quantitative evaluation and human side-by-side evaluation confirm that our model outperforms baselines, offering promising directions for future advancements in structured entity extraction.  ( 2 min )
    Brain decoding: toward real-time reconstruction of visual perception
    arXiv:2310.19812v2 Announce Type: replace-cross Abstract: In the past five years, the use of generative and foundational AI systems has greatly improved the decoding of brain activity. Visual perception, in particular, can now be decoded from functional Magnetic Resonance Imaging (fMRI) with remarkable fidelity. This neuroimaging technique, however, suffers from a limited temporal resolution ($\approx$0.5 Hz) and thus fundamentally constrains its real-time usage. Here, we propose an alternative approach based on magnetoencephalography (MEG), a neuroimaging device capable of measuring brain activity with high temporal resolution ($\approx$5,000 Hz). For this, we develop an MEG decoding model trained with both contrastive and regression objectives and consisting of three modules: i) pretrained embeddings obtained from the image, ii) an MEG module trained end-to-end and iii) a pretrained image generator. Our results are threefold: Firstly, our MEG decoder shows a 7X improvement of image-retrieval over classic linear decoders. Second, late brain responses to images are best decoded with DINOv2, a recent foundational image model. Third, image retrievals and generations both suggest that high-level visual features can be decoded from MEG signals, although the same approach applied to 7T fMRI also recovers better low-level features. Overall, these results, while preliminary, provide an important step towards the decoding -- in real-time -- of the visual processes continuously unfolding within the human brain.  ( 3 min )
    RetSeg: Retention-based Colorectal Polyps Segmentation Network
    arXiv:2310.05446v5 Announce Type: replace-cross Abstract: Vision Transformers (ViTs) have revolutionized medical imaging analysis, showcasing superior efficacy compared to conventional Convolutional Neural Networks (CNNs) in vital tasks such as polyp classification, detection, and segmentation. Leveraging attention mechanisms to focus on specific image regions, ViTs exhibit contextual awareness in processing visual data, culminating in robust and precise predictions, even for intricate medical images. Moreover, the inherent self-attention mechanism in Transformers accommodates varying input sizes and resolutions, granting an unprecedented flexibility absent in traditional CNNs. However, Transformers grapple with challenges like excessive memory usage and limited training parallelism due to self-attention, rendering them impractical for real-time disease detection on resource-constrained devices. In this study, we address these hurdles by investigating the integration of the recently introduced retention mechanism into polyp segmentation, introducing RetSeg, an encoder-decoder network featuring multi-head retention blocks. Drawing inspiration from Retentive Networks (RetNet), RetSeg is designed to bridge the gap between precise polyp segmentation and resource utilization, particularly tailored for colonoscopy images. We train and validate RetSeg for polyp segmentation employing two publicly available datasets: Kvasir-SEG and CVC-ClinicDB. Additionally, we showcase RetSeg's promising performance across diverse public datasets, including CVC-ColonDB, ETIS-LaribPolypDB, CVC-300, and BKAI-IGH NeoPolyp. While our work represents an early-stage exploration, further in-depth studies are imperative to advance these promising findings.  ( 3 min )
    Balancing Act: Constraining Disparate Impact in Sparse Models
    arXiv:2310.20673v2 Announce Type: replace Abstract: Model pruning is a popular approach to enable the deployment of large deep learning models on edge devices with restricted computational or storage capacities. Although sparse models achieve performance comparable to that of their dense counterparts at the level of the entire dataset, they exhibit high accuracy drops for some data sub-groups. Existing methods to mitigate this disparate impact induced by pruning (i) rely on surrogate metrics that address the problem indirectly and have limited interpretability; or (ii) scale poorly with the number of protected sub-groups in terms of computational cost. We propose a constrained optimization approach that directly addresses the disparate impact of pruning: our formulation bounds the accuracy change between the dense and sparse models, for each sub-group. This choice of constraints provides an interpretable success criterion to determine if a pruned model achieves acceptable disparity levels. Experimental results demonstrate that our technique scales reliably to problems involving large models and hundreds of protected sub-groups.  ( 2 min )
    ContriMix: Scalable stain color augmentation for domain generalization without domain labels in digital pathology
    arXiv:2306.04527v4 Announce Type: replace-cross Abstract: Differences in staining and imaging procedures can cause significant color variations in histopathology images, leading to poor generalization when deploying deep-learning models trained from a different data source. Various color augmentation methods have been proposed to generate synthetic images during training to make models more robust, eliminating the need for stain normalization during test time. Many color augmentation methods leverage domain labels to generate synthetic images. This approach causes three significant challenges to scaling such a model. Firstly, incorporating data from a new domain into deep-learning models trained on existing domain labels is not straightforward. Secondly, dependency on domain labels prevents the use of pathology images without domain labels to improve model performance. Finally, implementation of these methods becomes complicated when multiple domain labels (e.g., patient identification, medical center, etc) are associated with a single image. We introduce ContriMix, a novel domain label free stain color augmentation method based on DRIT++, a style-transfer method. Contrimix leverages sample stain color variation within a training minibatch and random mixing to extract content and attribute information from pathology images. This information can be used by a trained ContriMix model to create synthetic images to improve the performance of existing classifiers. ContriMix outperforms competing methods on the Camelyon17-WILDS dataset. Its performance is consistent across different slides in the test set while being robust to the color variation from rare substances in pathology images. We make our code and trained ContriMix models available for research use. The code for ContriMix can be found at https://gitlab.com/huutan86/contrimix  ( 3 min )
    Maximum Entropy Heterogeneous-Agent Reinforcement Learning
    arXiv:2306.10715v4 Announce Type: replace-cross Abstract: Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges related to sample complexity, training instability, and the risk of converging to a suboptimal Nash Equilibrium. In this paper, we propose a unified framework for learning \emph{stochastic} policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove the monotonic improvement and convergence to quantal response equilibrium (QRE) properties of HASAC. Furthermore, we generalize a unified template for MaxEnt algorithmic design named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration.  ( 2 min )
    CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model
    arXiv:2403.05034v1 Announce Type: cross Abstract: Feed-forward 3D generative models like the Large Reconstruction Model (LRM) have demonstrated exceptional generation speed. However, the transformer-based methods do not leverage the geometric priors of the triplane component in their architecture, often leading to sub-optimal quality given the limited size of 3D data and slow training. In this work, we present the Convolutional Reconstruction Model (CRM), a high-fidelity feed-forward single image-to-3D generative model. Recognizing the limitations posed by sparse 3D data, we highlight the necessity of integrating geometric priors into network design. CRM builds on the key observation that the visualization of triplane exhibits spatial correspondence of six orthographic images. First, it generates six orthographic view images from a single input image, then feeds these images into a convolutional U-Net, leveraging its strong pixel-level alignment capabilities and significant bandwidth to create a high-resolution triplane. CRM further employs Flexicubes as geometric representation, facilitating direct end-to-end optimization on textured meshes. Overall, our model delivers a high-fidelity textured mesh from an image in just 10 seconds, without any test-time optimization.  ( 2 min )
    Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection
    arXiv:2401.12924v2 Announce Type: replace-cross Abstract: This article delves into the analysis of performance and utilization of Support Vector Machines (SVMs) for the critical task of forest fire detection using image datasets. With the increasing threat of forest fires to ecosystems and human settlements, the need for rapid and accurate detection systems is of utmost importance. SVMs, renowned for their strong classification capabilities, exhibit proficiency in recognizing patterns associated with fire within images. By training on labeled data, SVMs acquire the ability to identify distinctive attributes associated with fire, such as flames, smoke, or alterations in the visual characteristics of the forest area. The document thoroughly examines the use of SVMs, covering crucial elements like data preprocessing, feature extraction, and model training. It rigorously evaluates parameters such as accuracy, efficiency, and practical applicability. The knowledge gained from this study aids in the development of efficient forest fire detection systems, enabling prompt responses and improving disaster management. Moreover, the correlation between SVM accuracy and the difficulties presented by high-dimensional datasets is carefully investigated, demonstrated through a revealing case study. The relationship between accuracy scores and the different resolutions used for resizing the training datasets has also been discussed in this article. These comprehensive studies result in a definitive overview of the difficulties faced and the potential sectors requiring further improvement and focus.  ( 3 min )
    Neural General Circulation Models for Weather and Climate
    arXiv:2311.07222v3 Announce Type: replace-cross Abstract: General circulation models (GCMs) are the foundation of weather and climate prediction. GCMs are physics-based simulators which combine a numerical solver for large-scale dynamics with tuned representations for small-scale processes such as cloud formation. Recently, machine learning (ML) models trained on reanalysis data achieved comparable or better skill than GCMs for deterministic weather forecasting. However, these models have not demonstrated improved ensemble forecasts, or shown sufficient stability for long-term weather and climate simulations. Here we present the first GCM that combines a differentiable solver for atmospheric dynamics with ML components, and show that it can generate forecasts of deterministic weather, ensemble weather and climate on par with the best ML and physics-based methods. NeuralGCM is competitive with ML models for 1-10 day forecasts, and with the European Centre for Medium-Range Weather Forecasts ensemble prediction for 1-15 day forecasts. With prescribed sea surface temperature, NeuralGCM can accurately track climate metrics such as global mean temperature for multiple decades, and climate forecasts with 140 km resolution exhibit emergent phenomena such as realistic frequency and trajectories of tropical cyclones. For both weather and climate, our approach offers orders of magnitude computational savings over conventional GCMs. Our results show that end-to-end deep learning is compatible with tasks performed by conventional GCMs, and can enhance the large-scale physical simulations that are essential for understanding and predicting the Earth system.  ( 3 min )
    ManyQuadrupeds: Learning a Single Locomotion Policy for Diverse Quadruped Robots
    arXiv:2310.10486v2 Announce Type: replace-cross Abstract: Learning a locomotion policy for quadruped robots has traditionally been constrained to a specific robot morphology, mass, and size. The learning process must usually be repeated for every new robot, where hyperparameters and reward function weights must be re-tuned to maximize performance for each new system. Alternatively, attempting to train a single policy to accommodate different robot sizes, while maintaining the same degrees of freedom (DoF) and morphology, requires either complex learning frameworks, or mass, inertia, and dimension randomization, which leads to prolonged training periods. In our study, we show that drawing inspiration from animal motor control allows us to effectively train a single locomotion policy capable of controlling a diverse range of quadruped robots. The robot differences encompass: a variable number of DoFs, (i.e. 12 or 16 joints), three distinct morphologies, a broad mass range spanning from 2 kg to 200 kg, and nominal standing heights ranging from 18 cm to 100 cm. Our policy modulates a representation of the Central Pattern Generator (CPG) in the spinal cord, effectively coordinating both frequencies and amplitudes of the CPG to produce rhythmic output (Rhythm Generation), which is then mapped to a Pattern Formation (PF) layer. Across different robots, the only varying component is the PF layer, which adjusts the scaling parameters for the stride height and length. Subsequently, we evaluate the sim-to-real transfer by testing the single policy on both the Unitree Go1 and A1 robots. Remarkably, we observe robust performance, even when adding a 15 kg load, equivalent to 125% of the A1 robot's nominal mass.  ( 3 min )
    Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors
    arXiv:2309.06782v5 Announce Type: replace-cross Abstract: Efficient and accurate algorithms are necessary to reconstruct particles in the highly granular detectors anticipated at the High-Luminosity Large Hadron Collider and the Future Circular Collider. We study scalable machine learning models for event reconstruction in electron-positron collisions based on a full detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters. We compare a graph neural network and kernel-based transformer and demonstrate that we can avoid quadratic operations while achieving realistic reconstruction. We show that hyperparameter tuning significantly improves the performance of the models. The best graph neural network model shows improvement in the jet transverse momentum resolution by up to 50% compared to the rule-based algorithm. The resulting model is portable across Nvidia, AMD and Habana hardware. Accurate and fast machine-learning based reconstruction can significantly improve future measurements at colliders.  ( 3 min )
    New intelligent defense systems to reduce the risks of Selfish Mining and Double-Spending attacks using Learning Automata
    arXiv:2307.00529v2 Announce Type: replace-cross Abstract: In this paper, we address the critical challenges of double-spending and selfish mining attacks in blockchain-based digital currencies. Double-spending is a problem where the same tender is spent multiple times during a digital currency transaction, while selfish mining is an intentional alteration of a blockchain to increase rewards to one miner or a group of miners. We introduce a new attack that combines both these attacks and propose a machine learning-based solution to mitigate the risks associated with them. Specifically, we use the learning automaton, a powerful online learning method, to develop two models, namely the SDTLA and WVBM, which can effectively defend against selfish mining attacks. Our experimental results show that the SDTLA method increases the profitability threshold of selfish mining up to 47$\%$, while the WVBM method performs even better and is very close to the ideal situation where each miner's revenue is proportional to their shared hash processing power. Additionally, we demonstrate that both methods can effectively reduce the risks of double-spending by tuning the $Z$ Parameter. Our findings highlight the potential of SDTLA and WVBM as promising solutions for enhancing the security and efficiency of blockchain networks.  ( 3 min )
    Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
    arXiv:2306.01669v2 Announce Type: replace-cross Abstract: Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then generates labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data. By using zero-shot pseudolabels as a source of supervision, we observe that learning paradigms such as semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function. This unified view enables the development of versatile training strategies that are applicable across learning paradigms. We investigate them on image classification tasks where CLIP exhibits limitations, by varying prompt modalities, e.g., textual or visual prompts, and learning paradigms. We find that (1) unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, by 28.4 points in transductive zero-shot learning, and by 15.2 points in unsupervised learning, and (2) unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy. The code to reproduce the experiments is at https://github.com/BatsResearch/menghini-neurips23-code.  ( 3 min )
    Cooperative data-driven modeling
    arXiv:2211.12971v2 Announce Type: replace-cross Abstract: Data-driven modeling in mechanics is evolving rapidly based on recent machine learning advances, especially on artificial neural networks. As the field matures, new data and models created by different groups become available, opening possibilities for cooperative modeling. However, artificial neural networks suffer from catastrophic forgetting, i.e. they forget how to perform an old task when trained on a new one. This hinders cooperation because adapting an existing model for a new task affects the performance on a previous task trained by someone else. The authors developed a continual learning method that addresses this issue, applying it here for the first time to solid mechanics. In particular, the method is applied to recurrent neural networks to predict history-dependent plasticity behavior, although it can be used on any other architecture (feedforward, convolutional, etc.) and to predict other phenomena. This work intends to spawn future developments on continual learning that will foster cooperative strategies among the mechanics community to solve increasingly challenging problems. We show that the chosen continual learning strategy can sequentially learn several constitutive laws without forgetting them, using less data to achieve the same error as standard (non-cooperative) training of one law per model.  ( 2 min )
    Spectrally-Corrected and Regularized Linear Discriminant Analysis for Spiked Covariance Model
    arXiv:2210.03859v3 Announce Type: replace-cross Abstract: This paper proposes an improved linear discriminant analysis called spectrally-corrected and regularized LDA (SRLDA). This method integrates the design ideas of the sample spectrally-corrected covariance matrix and the regularized discriminant analysis. With the support of a large-dimensional random matrix analysis framework, it is proved that SRLDA has a linear classification global optimal solution under the spiked model assumption. According to simulation data analysis, the SRLDA classifier performs better than RLDA and ILDA and is closer to the theoretical classifier. Experiments on different data sets show that the SRLDA algorithm performs better in classification and dimensionality reduction than currently used tools.  ( 2 min )
    A Survey on Quantum Reinforcement Learning
    arXiv:2211.03464v2 Announce Type: replace-cross Abstract: Quantum reinforcement learning is an emerging field at the intersection of quantum computing and machine learning. While we intend to provide a broad overview of the literature on quantum reinforcement learning - our interpretation of this term will be clarified below - we put particular emphasis on recent developments. With a focus on already available noisy intermediate-scale quantum devices, these include variational quantum circuits acting as function approximators in an otherwise classical reinforcement learning setting. In addition, we survey quantum reinforcement learning algorithms based on future fault-tolerant hardware, some of which come with a provable quantum advantage. We provide both a birds-eye-view of the field, as well as summaries and reviews for selected parts of the literature.  ( 2 min )
    Testing Stationarity and Change Point Detection in Reinforcement Learning
    arXiv:2203.01707v3 Announce Type: replace-cross Abstract: We consider offline reinforcement learning (RL) methods in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the system transition and the reward function to be constant over time. However, the stationarity assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal Q-function based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL.  ( 2 min )
    Persistent Homological State-Space Estimation of Functional Human Brain Networks at Rest
    arXiv:2201.00087v5 Announce Type: replace-cross Abstract: We introduce an innovative, data-driven topological data analysis (TDA) technique for estimating the state spaces of dynamically changing functional human brain networks at rest. Our method utilizes the Wasserstein distance to measure topological differences, enabling the clustering of brain networks into distinct topological states. This technique outperforms the commonly used k-means clustering in identifying brain network state spaces by effectively incorporating the temporal dynamics of the data without the need for explicit model specification. We further investigate the genetic underpinnings of these topological features using a twin study design, examining the heritability of such state changes. Our findings suggest that the topology of brain networks, particularly in their dynamic state changes, may hold significant hidden genetic information. MATLAB code for the method is available at https://github.com/laplcebeltrami/PH-STAT.  ( 2 min )
    A step toward a reinforcement learning de novo genome assembler
    arXiv:2102.02649v4 Announce Type: replace-cross Abstract: De novo genome assembly is a relevant but computationally complex task in genomics. Although de novo assemblers have been used successfully in several genomics projects, there is still no 'best assembler', and the choice and setup of assemblers still rely on bioinformatics experts. Thus, as with other computationally complex problems, machine learning may emerge as an alternative (or complementary) way for developing more accurate and automated assemblers. Reinforcement learning has proven promising for solving complex activities without supervision - such games - and there is a pressing need to understand the limits of this approach to 'real' problems, such as the DFA problem. This study aimed to shed light on the application of machine learning, using reinforcement learning (RL), in genome assembly. We expanded upon the sole previous approach found in the literature to solve this problem by carefully exploring the learning aspects of the proposed intelligent agent, which uses the Q-learning algorithm, and we provided insights for the next steps of automated genome assembly development. We improved the reward system and optimized the exploration of the state space based on pruning and in collaboration with evolutionary computing. We tested the new approaches on 23 new larger environments, which are all available on the internet. Our results suggest consistent performance progress; however, we also found limitations, especially concerning the high dimensionality of state and action spaces. Finally, we discuss paths for achieving efficient and automated genome assembly in real scenarios considering successful RL applications - including deep reinforcement learning.  ( 3 min )
    FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data
    arXiv:2401.08977v2 Announce Type: replace Abstract: Federated Long-Tailed Learning (Fed-LT), a paradigm wherein data collected from decentralized local clients manifests a globally prevalent long-tailed distribution, has garnered considerable attention in recent times. In the context of Fed-LT, existing works have predominantly centered on addressing the data imbalance issue to enhance the efficacy of the generic global model while neglecting the performance at the local level. In contrast, conventional Personalized Federated Learning (pFL) techniques are primarily devised to optimize personalized local models under the presumption of a balanced global data distribution. This paper introduces an approach termed Federated Local and Generic Model Training in Fed-LT (FedLoGe), which enhances both local and generic model performance through the integration of representation learning and classifier alignment within a neural collapse framework. Our investigation reveals the feasibility of employing a shared backbone as a foundational framework for capturing overarching global trends, while concurrently employing individualized classifiers to encapsulate distinct refinements stemming from each client's local features. Building upon this discovery, we establish the Static Sparse Equiangular Tight Frame Classifier (SSE-C), inspired by neural collapse principles that naturally prune extraneous noisy features and foster the acquisition of potent data representations. Furthermore, leveraging insights from imbalance neural collapse's classifier norm patterns, we develop Global and Local Adaptive Feature Realignment (GLA-FR) via an auxiliary global classifier and personalized Euclidean norm transfer to align global features with client preferences. Extensive experimental results on CIFAR-10/100-LT, ImageNet, and iNaturalist demonstrate the advantage of our method over state-of-the-art pFL and Fed-LT approaches.  ( 3 min )
    Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information
    arXiv:2401.08732v2 Announce Type: replace Abstract: It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate for the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Via conducting a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with the gain of up to 3.32\%. This suggests that the teacher's BCPD estimate provided by MCMI method is more accurate than that provided by MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases with the gain of up to 5.72\% when 5\% of the training samples are available to the student (few-shot), and increases from 0\% to as high as 84\% for an omitted class (zero-shot). The code is available at \url{https://github.com/iclr2024mcmi/ICLRMCMI}.  ( 3 min )
    REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes
    arXiv:2401.08850v2 Announce Type: replace Abstract: Discrete-action reinforcement learning algorithms often falter in tasks with high-dimensional discrete action spaces due to the vast number of possible actions. A recent advancement leverages value-decomposition, a concept from multi-agent reinforcement learning, to tackle this challenge. This study delves deep into the effects of this value-decomposition, revealing that whilst it curtails the over-estimation bias inherent to Q-learning algorithms, it amplifies target variance. To counteract this, we present an ensemble of critics to mitigate target variance. Moreover, we introduce a regularisation loss that helps to mitigate the effects that exploratory actions in one dimension can have on the value of optimal actions in other dimensions. Our novel algorithm, REValueD, tested on discretised versions of the DeepMind Control Suite tasks, showcases superior performance, especially in the challenging humanoid and dog tasks. We further dissect the factors influencing REValueD's performance, evaluating the significance of the regularisation loss and the scalability of REValueD with increasing sub-actions per dimension.  ( 2 min )
    MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records
    arXiv:2312.13454v2 Announce Type: replace Abstract: Existing survival models either do not scale to high dimensional and multi-modal data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC- III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.  ( 3 min )
    ERASE: Error-Resilient Representation Learning on Graphs for Label Noise Tolerance
    arXiv:2312.08852v2 Announce Type: replace Abstract: Deep learning has achieved remarkable success in graph-related tasks, yet this accomplishment heavily relies on large-scale high-quality annotated datasets. However, acquiring such datasets can be cost-prohibitive, leading to the practical use of labels obtained from economically efficient sources such as web searches and user tags. Unfortunately, these labels often come with noise, compromising the generalization performance of deep networks. To tackle this challenge and enhance the robustness of deep learning models against label noise in graph-based tasks, we propose a method called ERASE (Error-Resilient representation learning on graphs for lAbel noiSe tolerancE). The core idea of ERASE is to learn representations with error tolerance by maximizing coding rate reduction. Particularly, we introduce a decoupled label propagation method for learning representations. Before training, noisy labels are pre-corrected through structural denoising. During training, ERASE combines prototype pseudo-labels with propagated denoised labels and updates representations with error resilience, which significantly improves the generalization performance in node classification. The proposed method allows us to more effectively withstand errors caused by mislabeled nodes, thereby strengthening the robustness of deep networks in handling noisy graph data. Extensive experimental results show that our method can outperform multiple baselines with clear margins in broad noise levels and enjoy great scalability. Codes are released at https://github.com/eraseai/erase.  ( 3 min )
    Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction
    arXiv:2311.15545v2 Announce Type: replace Abstract: Human albumin is essential for indicating the body's overall health. Accurately predicting plasma albumin levels and determining appropriate doses are urgent clinical challenges, particularly in critically ill patients, to maintain optimal blood levels. However, human albumin prediction is non-trivial that has to leverage the dynamics of biochemical markers as well as the experience of treating patients. Moreover, the problem of distribution shift is often encountered in real clinical data, which may lead to a decline in the model prediction performance and reduce the reliability of the model's application. In this paper, we propose a framework named Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction (DyG-HAP), which is able to provide accurate albumin predictions for Intensity Care Unit (ICU) patients during hospitalization. We first model human albumin prediction as a dynamic graph regression problem to model the dynamics and patient relationship. Then, we propose a disentangled dynamic graph attention mechanism to capture and disentangle the patterns whose relationship to labels under distribution shifts is invariant and variant respectively. Last, we propose an invariant dynamic graph regression method to encourage the model to rely on invariant patterns to make predictions. Moreover, we propose a dataset named Albumin level testing and nutritional dosing data for Intensive Care (ANIC) for evaluation. Extensive experiments demonstrate the superiority of our method compared to several baseline methods in human albumin prediction.  ( 3 min )
    DPOD: Domain-Specific Prompt Tuning for Multimodal Fake News Detection
    arXiv:2311.16496v2 Announce Type: replace Abstract: The spread of fake news using out-of-context images has become widespread and is a relevant problem in this era of information overload. Such out-of-context fake news may arise across different domains like politics, sports, entertainment, etc. In practical scenarios, an inherent problem of imbalance exists among news articles from such widely varying domains, resulting in a few domains with abundant data, while the rest containing very limited data. Under such circumstances, it is imperative to develop methods which can work in such varying amounts of data setting. In this work, we explore whether out-of-domain data can help to improve out-of-context misinformation detection (termed here as multi-modal fake news detection) of a desired domain, to address this challenging problem. Towards this goal, we propose a novel framework termed DPOD (Domain-specific Prompt-tuning using Out-of-Domain data). First, to compute generalizable features, we modify the Vision-Language Model, CLIP to extract features that helps to align the representations of the images and corresponding text captions of both the in-domain and out-of-domain data in a label-aware manner. Further, we propose a domain-specific prompt learning technique which leverages the training samples of all the available domains based on the extent they can be useful to the desired domain. Extensive experiments on a large-scale benchmark dataset, namely NewsCLIPpings demonstrate that the proposed framework achieves state of-the-art performance, significantly surpassing the existing approaches for this challenging task. Code will be released on acceptance.  ( 3 min )
    A Voting Approach for Explainable Classification with Rule Learning
    arXiv:2311.07323v2 Announce Type: replace Abstract: State-of-the-art results in typical classification tasks are mostly achieved by unexplainable machine learning methods, like deep neural networks, for instance. Contrarily, in this paper, we investigate the application of rule learning methods in such a context. Thus, classifications become based on comprehensible (first-order) rules, explaining the predictions made. In general, however, rule-based classifications are less accurate than state-of-the-art results (often significantly). As main contribution, we introduce a voting approach combining both worlds, aiming to achieve comparable results as (unexplainable) state-of-the-art methods, while still providing explanations in the form of deterministic rules. Considering a variety of benchmark data sets including a use case of significant interest to insurance industries, we prove that our approach not only clearly outperforms ordinary rule learning methods, but also yields results on a par with state-of-the-art outcomes.  ( 2 min )
    Out-of-Distribution Generalized Dynamic Graph Neural Network with Disentangled Intervention and Invariance Promotion
    arXiv:2311.14255v2 Announce Type: replace Abstract: Dynamic graph neural networks (DyGNNs) have demonstrated powerful predictive abilities by exploiting graph structural and temporal dynamics. However, the existing DyGNNs fail to handle distribution shifts, which naturally exist in dynamic graphs, mainly because the patterns exploited by DyGNNs may be variant with respect to labels under distribution shifts. In this paper, we propose Disentangled Intervention-based Dynamic graph Attention networks with Invariance Promotion (I-DIDA) to handle spatio-temporal distribution shifts in dynamic graphs by discovering and utilizing invariant patterns, i.e., structures and features whose predictive abilities are stable across distribution shifts. Specifically, we first propose a disentangled spatio-temporal attention network to capture the variant and invariant patterns. By utilizing the disentangled patterns, we design a spatio-temporal intervention mechanism to create multiple interventional distributions and an environment inference module to infer the latent spatio-temporal environments, and minimize the variance of predictions among these intervened distributions and environments, so that our model can make predictions based on invariant patterns with stable predictive abilities under distribution shifts. Extensive experiments demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts. Our work is the first study of spatio-temporal distribution shifts in dynamic graphs, to the best of our knowledge.  ( 3 min )
    Multi-View Causal Representation Learning with Partial Observability
    arXiv:2311.04056v2 Announce Type: replace Abstract: We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multi-modal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.  ( 2 min )
    Mixed Models with Multiple Instance Learning
    arXiv:2311.02455v2 Announce Type: replace Abstract: Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.  ( 2 min )
    Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks
    arXiv:2310.18882v2 Announce Type: replace Abstract: This paper investigates efficient deep neural networks (DNNs) to replace dense unstructured weight matrices with structured ones that possess desired properties. The challenge arises because the optimal weight matrix structure in popular neural network models is obscure in most cases and may vary from layer to layer even in the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted without a generalized framework to systematically learn them. To address this issue, we propose a generalized and differentiable framework to learn efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting the structural parameters. Then, the frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to learn the structural parameters by proximal gradient descent. On the image and language tasks, our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices.  ( 2 min )
    Generative ensemble deep learning severe weather prediction from a deterministic convection-allowing model
    arXiv:2310.06045v2 Announce Type: replace Abstract: An ensemble post-processing method is developed for the probabilistic prediction of severe weather (tornadoes, hail, and wind gusts) over the conterminous United States (CONUS). The method combines conditional generative adversarial networks (CGANs), a type of deep generative model, with a convolutional neural network (CNN) to post-process convection-allowing model (CAM) forecasts. The CGANs are designed to create synthetic ensemble members from deterministic CAM forecasts, and their outputs are processed by the CNN to estimate the probability of severe weather. The method is tested using High-Resolution Rapid Refresh (HRRR) 1--24 hr forecasts as inputs and Storm Prediction Center (SPC) severe weather reports as targets. The method produced skillful predictions with up to 20% Brier Skill Score (BSS) increases compared to other neural-network-based reference methods using a testing dataset of HRRR forecasts in 2021. For the evaluation of uncertainty quantification, the method is overconfident but produces meaningful ensemble spreads that can distinguish good and bad forecasts. The quality of CGAN outputs is also evaluated. Results show that the CGAN outputs behave similarly to a numerical ensemble; they preserved the inter-variable correlations and the contribution of influential predictors as in the original HRRR forecasts. This work provides a novel approach to post-process CAM output using neural networks that can be applied to severe weather prediction.  ( 3 min )
    LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs?
    arXiv:2310.17110v2 Announce Type: replace Abstract: In an era marked by the increasing adoption of Large Language Models (LLMs) for various tasks, there is a growing focus on exploring LLMs' capabilities in handling web data, particularly graph data. Dynamic graphs, which capture temporal network evolution patterns, are ubiquitous in real-world web data. Evaluating LLMs' competence in understanding spatial-temporal information on dynamic graphs is essential for their adoption in web applications, which remains unexplored in the literature. In this paper, we bridge the gap via proposing to evaluate LLMs' spatial-temporal understanding abilities on dynamic graphs, to the best of our knowledge, for the first time. Specifically, we propose the LLM4DyG benchmark, which includes nine specially designed tasks considering the capability evaluation of LLMs from both temporal and spatial dimensions. Then, we conduct extensive experiments to analyze the impacts of different data generators, data statistics, prompting techniques, and LLMs on the model performance. Finally, we propose Disentangled Spatial-Temporal Thoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporal understanding abilities. Our main observations are: 1) LLMs have preliminary spatial-temporal understanding abilities on dynamic graphs, 2) Dynamic graph tasks show increasing difficulties for LLMs as the graph size and density increase, while not sensitive to the time span and data generation mechanism, 3) the proposed DST2 prompting method can help to improve LLMs' spatial-temporal understanding abilities on dynamic graphs for most tasks. The data and codes will be open-sourced at publication time.  ( 3 min )
    TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models
    arXiv:2310.05905v2 Announce Type: replace Abstract: The full potential of large pretrained models remains largely untapped in control domains like robotics. This is mainly because of the scarcity of data and the computational challenges associated with training or fine-tuning these large models for such applications. Prior work mainly emphasizes either effective pretraining of large models for decision-making or single-task adaptation. But real-world problems will require data-efficient, continual adaptation for new control tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters for Imitation Learning), a framework for efficient adaptation to new control tasks. Inspired by recent advancements in parameter-efficient fine-tuning in language domains, we explore efficient fine-tuning techniques -- e.g., Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA) -- in TAIL to adapt large pretrained models for new tasks with limited demonstration data. Our extensive experiments in large-scale language-conditioned manipulation tasks comparing prevalent parameter-efficient fine-tuning techniques and adaptation baselines suggest that TAIL with LoRA can achieve the best post-adaptation performance with only 1\% of the trainable parameters of full fine-tuning, while avoiding catastrophic forgetting and preserving adaptation plasticity in continual learning settings.  ( 2 min )
    Molecular De Novo Design through Transformer-based Reinforcement Learning
    arXiv:2310.05365v5 Announce Type: replace Abstract: In this work, we introduce a method to fine-tune a Transformer-based generative model for molecular de novo design. Leveraging the superior sequence learning capacity of Transformers over Recurrent Neural Networks (RNNs), our model can generate molecular structures with desired properties effectively. In contrast to the traditional RNN-based models, our proposed method exhibits superior performance in generating compounds predicted to be active against various biological targets, capturing long-term dependencies in the molecular structure sequence. The model's efficacy is demonstrated across numerous tasks, including generating analogues to a query structure and producing compounds with particular attributes, outperforming the baseline RNN-based methods. Our approach can be used for scaffold hopping, library expansion starting from a single molecule, and generating compounds with high predicted activity against biological targets.  ( 2 min )
    Continuous-time q-learning for mean-field control problems
    arXiv:2306.16208v3 Announce Type: replace Abstract: This paper studies the q-learning, recently coined as the continuous time counterpart of Q-learning by Jia and Zhou (2023), for continuous time Mckean-Vlasov control problems in the setting of entropy-regularized reinforcement learning. In contrast to the single agent's control problem in Jia and Zhou (2023), the mean-field interaction of agents renders the definition of the q-function more subtle, for which we reveal that two distinct q-functions naturally arise: (i) the integrated q-function (denoted by $q$) as the first-order approximation of the integrated Q-function introduced in Gu, Guo, Wei and Xu (2023), which can be learnt by a weak martingale condition involving test policies; and (ii) the essential q-function (denoted by $q_e$) that is employed in the policy improvement iterations. We show that two q-functions are related via an integral representation under all test policies. Based on the weak martingale condition and our proposed searching method of test policies, some model-free learning algorithms are devised. In two examples, one in LQ control framework and one beyond LQ control framework, we can obtain the exact parameterization of the optimal value function and q-functions and illustrate our algorithms with simulation experiments.  ( 2 min )
    Quantized Fourier and Polynomial Features for more Expressive Tensor Network Models
    arXiv:2309.05436v2 Announce Type: replace Abstract: In the context of kernel machines, polynomial and Fourier features are commonly used to provide a nonlinear extension to linear models by mapping the data to a higher-dimensional space. Unless one considers the dual formulation of the learning problem, which renders exact large-scale learning unfeasible, the exponential increase of model parameters in the dimensionality of the data caused by their tensor-product structure prohibits to tackle high-dimensional problems. One of the possible approaches to circumvent this exponential scaling is to exploit the tensor structure present in the features by constraining the model weights to be an underparametrized tensor network. In this paper we quantize, i.e. further tensorize, polynomial and Fourier features. Based on this feature quantization we propose to quantize the associated model weights, yielding quantized models. We show that, for the same number of model parameters, the resulting quantized models have a higher bound on the VC-dimension as opposed to their non-quantized counterparts, at no additional computational cost while learning from identical features. We verify experimentally how this additional tensorization regularizes the learning problem by prioritizing the most salient features in the data and how it provides models with increased generalization capabilities. We finally benchmark our approach on large regression task, achieving state-of-the-art results on a laptop computer.  ( 3 min )
    Distill n' Explain: explaining graph neural networks using simple surrogates
    arXiv:2303.10139v2 Announce Type: replace Abstract: Explaining node predictions in graph neural networks (GNNs) often boils down to finding graph substructures that preserve predictions. Finding these structures usually implies back-propagating through the GNN, bonding the complexity (e.g., number of layers) of the GNN to the cost of explaining it. This naturally begs the question: Can we break this bond by explaining a simpler surrogate GNN? To answer the question, we propose Distill n' Explain (DnX). First, DnX learns a surrogate GNN via knowledge distillation. Then, DnX extracts node or edge-level explanations by solving a simple convex program. We also propose FastDnX, a faster version of DnX that leverages the linear decomposition of our surrogate model. Experiments show that DnX and FastDnX often outperform state-of-the-art GNN explainers while being orders of magnitude faster. Additionally, we support our empirical findings with theoretical results linking the quality of the surrogate model (i.e., distillation error) to the faithfulness of explanations.  ( 2 min )
    Nearest Neighbour with Bandit Feedback
    arXiv:2306.13773v3 Announce Type: replace Abstract: In this paper we adapt the nearest neighbour rule to the contextual bandit problem. Our algorithm handles the fully adversarial setting in which no assumptions at all are made about the data-generation process. When combined with a sufficiently fast data-structure for (perhaps approximate) adaptive nearest neighbour search, such as a navigating net, our algorithm is extremely efficient - having a per trial running time polylogarithmic in both the number of trials and actions, and taking only quasi-linear space. We give generic regret bounds for our algorithm and further analyse them when applied to the stochastic bandit problem in euclidean space. We note that our algorithm can also be applied to the online classification problem.  ( 2 min )
    On marginal feature attributions of tree-based models
    arXiv:2302.08434v3 Announce Type: replace Abstract: Due to their power and ease of use, tree-based machine learning models, such as random forests and gradient-boosted tree ensembles, have become very popular. To interpret them, local feature attributions based on marginal expectations, e.g. marginal (interventional) Shapley, Owen or Banzhaf values, may be employed. Such methods are true to the model and implementation invariant, i.e. dependent only on the input-output function of the model. We contrast this with the popular TreeSHAP algorithm by presenting two (statistically similar) decision trees that compute the exact same function for which the "path-dependent" TreeSHAP yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the internal structure of tree-based models may be leveraged to help with computing their marginal feature attributions according to a linear game value. One important observation is that these are simple (piecewise-constant) functions with respect to a certain grid partition of the input space determined by the trained model. Another crucial observation, showcased by experiments with XGBoost, LightGBM and CatBoost libraries, is that only a portion of all features appears in a tree from the ensemble. Thus, the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. This remains valid for a broader class of game values which we shall axiomatically characterize. A prime example is the case of CatBoost models where the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit the symmetry to derive an explicit formula, with improved complexity and only in terms of the internal model parameters, for marginal Shapley (and Banzhaf and Owen) values of CatBoost models. This results in a fast, accurate algorithm for estimating these feature attributions.  ( 3 min )
    Towards Enhancing Time Series Contrastive Learning: A Dynamic Bad Pair Mining Approach
    arXiv:2302.03357v2 Announce Type: replace Abstract: Not all positive pairs are beneficial to time series contrastive learning. In this paper, we study two types of bad positive pairs that can impair the quality of time series representation learned through contrastive learning: the noisy positive pair and the faulty positive pair. We observe that, with the presence of noisy positive pairs, the model tends to simply learn the pattern of noise (Noisy Alignment). Meanwhile, when faulty positive pairs arise, the model wastes considerable amount of effort aligning non-representative patterns (Faulty Alignment). To address this problem, we propose a Dynamic Bad Pair Mining (DBPM) algorithm, which reliably identifies and suppresses bad positive pairs in time series contrastive learning. Specifically, DBPM utilizes a memory module to dynamically track the training behavior of each positive pair along training process. This allows us to identify potential bad positive pairs at each epoch based on their historical training behaviors. The identified bad pairs are subsequently down-weighted through a transformation module, thereby mitigating their negative impact on the representation learning process. DBPM is a simple algorithm designed as a lightweight plug-in without learnable parameters to enhance the performance of existing state-of-the-art methods. Through extensive experiments conducted on four large-scale, real-world time series datasets, we demonstrate DBPM's efficacy in mitigating the adverse effects of bad positive pairs.  ( 3 min )
    Joint control variate for faster black-box variational inference
    arXiv:2210.07290v4 Announce Type: replace Abstract: Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.  ( 2 min )
    Beyond Ensemble Averages: Leveraging Climate Model Ensembles for Subseasonal Forecasting
    arXiv:2211.15856v3 Announce Type: replace Abstract: Producing high-quality forecasts of key climate variables, such as temperature and precipitation, on subseasonal time scales has long been a gap in operational forecasting. This study explores an application of machine learning (ML) models as post-processing tools for subseasonal forecasting. Lagged numerical ensemble forecasts (i.e., an ensemble where the members have different initial dates) and observational data, including relative humidity, pressure at sea level, and geopotential height, are incorporated into various ML methods to predict monthly average precipitation and two-meter temperature two weeks in advance for the continental United States. Regression, quantile regression, and tercile classification tasks using linear models, random forests, convolutional neural networks, and stacked models (a multi-model approach based on the prediction of the individual ML models) are considered. Unlike previous ML approaches that often use ensemble mean alone, we leverage information embedded in the ensemble forecasts to enhance prediction accuracy. Additionally, we investigate extreme event predictions that are crucial for planning and mitigation efforts. Considering ensemble members as a collection of spatial forecasts, we explore different approaches to address spatial variability. Trade-offs between different approaches may be mitigated with model stacking. Our proposed models outperform standard baselines such as climatological forecasts and ensemble means. This paper further includes an investigation of feature importance, trade-offs between using the full ensemble or only the ensemble mean, and different modes of accounting for spatial variability.  ( 3 min )
    Intriguing Properties of Input-dependent Randomized Smoothing
    arXiv:2110.05365v3 Announce Type: replace Abstract: Randomized smoothing is currently considered the state-of-the-art method to obtain certifiably robust classifiers. Despite its remarkable performance, the method is associated with various serious problems such as "certified accuracy waterfalls", certification vs.\ accuracy trade-off, or even fairness issues. Input-dependent smoothing approaches have been proposed with intention of overcoming these flaws. However, we demonstrate that these methods lack formal guarantees and so the resulting certificates are not justified. We show that in general, the input-dependent smoothing suffers from the curse of dimensionality, forcing the variance function to have low semi-elasticity. On the other hand, we provide a theoretical and practical framework that enables the usage of input-dependent smoothing even in the presence of the curse of dimensionality, under strict restrictions. We present one concrete design of the smoothing variance function and test it on CIFAR10 and MNIST. Our design mitigates some of the problems of classical smoothing and is formally underlined, yet further improvement of the design is still necessary.  ( 2 min )
    Functional Linear Regression of Cumulative Distribution Functions
    arXiv:2205.14545v3 Announce Type: replace Abstract: The estimation of cumulative distribution functions (CDF) is an important learning task with a great variety of downstream applications, such as risk assessments in predictions and decision making. In this paper, we study functional regression of contextual CDFs where each data point is sampled from a linear combination of context dependent CDF basis functions. We propose functional ridge-regression-based estimation methods that estimate CDFs accurately everywhere. In particular, given $n$ samples with $d$ basis functions, we show estimation error upper bounds of $\widetilde O(\sqrt{d/n})$ for fixed design, random design, and adversarial context cases. We also derive matching information theoretic lower bounds, establishing minimax optimality for CDF functional regression. Furthermore, we remove the burn-in time in the random design setting using an alternative penalized estimator. Then, we consider agnostic settings where there is a mismatch in the data generation process. We characterize the error of the proposed estimators in terms of the mismatched error, and show that the estimators are well-behaved under model mismatch. Moreover, to complete our study, we formalize infinite dimensional models where the parameter space is an infinite dimensional Hilbert space, and establish a self-normalized estimation error upper bound for this setting. Notably, the upper bound reduces to the $\widetilde O(\sqrt{d/n})$ bound when the parameter space is constrained to be $d$-dimensional. Our comprehensive numerical experiments validate the efficacy of our estimation methods in both synthetic and practical settings.  ( 3 min )
    A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent
    arXiv:2109.03396v2 Announce Type: replace Abstract: In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves Bayesian regret bound of $O(HS\sqrt{AT})$ in the infinite-horizon zero-sum stochastic games with average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent can not be controlled and can take any arbitrary time-adaptive history-dependent strategy. Our regret bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.  ( 2 min )
    A unified framework for hard and soft clustering with regularized optimal transport
    arXiv:1711.04366v2 Announce Type: replace Abstract: In this paper, we formulate the problem of inferring a Finite Mixture Model from discrete data as an optimal transport problem with entropic regularization of parameter $\lambda\geq 0$. Our method unifies hard and soft clustering, the Expectation-Maximization (EM) algorithm being exactly recovered for $\lambda=1$. The family of clustering algorithm we propose rely on the resolution of nonconvex problems using alternating minimization. We study the convergence property of our generalized $\lambda-$EM algorithms and show that each step in the minimization process has a closed form solution when inferring finite mixture models of exponential families. Experiments highlight the benefits of taking a parameter $\lambda>1$ to improve the inference performance and $\lambda\to 0$ for classification.  ( 2 min )
    The R2D2 deep neural network series paradigm for fast precision imaging in radio astronomy
    arXiv:2403.05452v1 Announce Type: cross Abstract: Radio-interferometric (RI) imaging entails solving high-resolution high-dynamic range inverse problems from large data volumes. Recent image reconstruction techniques grounded in optimization theory have demonstrated remarkable capability for imaging precision, well beyond CLEAN's capability. These range from advanced proximal algorithms propelled by handcrafted regularization operators, such as the SARA family, to hybrid plug-and-play (PnP) algorithms propelled by learned regularization denoisers, such as AIRI. Optimization and PnP structures are however highly iterative, which hinders their ability to handle the extreme data sizes expected from future instruments. To address this scalability challenge, we introduce a novel deep learning approach, dubbed ``Residual-to-Residual DNN series for high-Dynamic range imaging'. R2D2's reconstruction is formed as a series of residual images, iteratively estimated as outputs of Deep Neural Networks (DNNs) taking the previous iteration's image estimate and associated data residual as inputs. It thus takes a hybrid structure between a PnP algorithm and a learned version of the matching pursuit algorithm that underpins CLEAN. We present a comprehensive study of our approach, featuring its multiple incarnations distinguished by their DNN architectures. We provide a detailed description of its training process, targeting a telescope-specific approach. R2D2's capability to deliver high precision is demonstrated in simulation, across a variety of image and observation settings using the Very Large Array (VLA). Its reconstruction speed is also demonstrated: with only few iterations required to clean data residuals at dynamic ranges up to 105, R2D2 opens the door to fast precision imaging. R2D2 codes are available in the BASPLib library on GitHub.  ( 3 min )
    Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference
    arXiv:2403.05465v1 Announce Type: cross Abstract: Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training. In this study, we introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits that dynamically adapts to DNN weight/activation distributions by parameterizing LP bit fields. We also develop a novel genetic-algorithm based framework, LP Quantization (LPQ), to find optimal layer-wise LP parameters while reducing representational divergence between quantized and full-precision models through a novel global-local contrastive objective. Additionally, we design a unified mixed-precision LP accelerator (LPA) architecture comprising of processing elements (PEs) incorporating LP in the computational datapath. Our algorithm-hardware co-design demonstrates on average <1% drop in top-1 accuracy across various CNN and ViT models. It also achieves ~ 2x improvements in performance per unit area and 2.2x gains in energy efficiency compared to state-of-the-art quantization accelerators using different data types.  ( 2 min )
    Is Cosine-Similarity of Embeddings Really About Similarity?
    arXiv:2403.05440v1 Announce Type: cross Abstract: Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless `similarities.' For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations are employed when learning deep models; these have implicit and unintended effects when taking cosine-similarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.  ( 2 min )
    Bayesian Hierarchical Probabilistic Forecasting of Intraday Electricity Prices
    arXiv:2403.05441v1 Announce Type: cross Abstract: We present a first study of Bayesian forecasting of electricity prices traded on the German continuous intraday market which fully incorporates parameter uncertainty. Our target variable is the IDFull price index, forecasts are given in terms of posterior predictive distributions. For validation we use the exceedingly volatile electricity prices of 2022, which have hardly been the subject of forecasting studies before. As a benchmark model, we use all available intraday transactions at the time of forecast creation to compute a current value for the IDFull. According to the weak-form efficiency hypothesis, it would not be possible to significantly improve this benchmark built from last price information. We do, however, observe statistically significant improvement in terms of both point measures and probability scores. Finally, we challenge the declared gold standard of using LASSO for feature selection in electricity price forecasting by presenting strong statistical evidence that Orthogonal Matching Pursuit (OMP) leads to better forecasting performance.  ( 2 min )
    HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction
    arXiv:2403.05396v1 Announce Type: cross Abstract: Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports from local and global granularity. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset, model weights, and source code are available in https://github.com/dddavid4real/HistGen.  ( 3 min )
    Exploring the Links between the Fundamental Lemma and Kernel Regression
    arXiv:2403.05368v1 Announce Type: cross Abstract: Generalizations and variations of the fundamental lemma by Willems et al. are an active topic of recent research. In this note, we explore and formalize the links between kernel regression and known nonlinear extensions of the fundamental lemma. Applying a transformation to the usual linear equation in Hankel matrices, we arrive at an alternative implicit kernel representation of the system trajectories while keeping the requirements on persistency of excitation. We show that this representation is equivalent to the solution of a specific kernel regression problem. We explore the possible structures of the underlying kernel as well as the system classes to which they correspond.  ( 2 min )
    Variational Inference of Parameters in Opinion Dynamics Models
    arXiv:2403.05358v1 Announce Type: cross Abstract: Despite the frequent use of agent-based models (ABMs) for studying social phenomena, parameter estimation remains a challenge, often relying on costly simulation-based heuristics. This work uses variational inference to estimate the parameters of an opinion dynamics ABM, by transforming the estimation problem into an optimization task that can be solved directly. Our proposal relies on probabilistic generative ABMs (PGABMs): we start by synthesizing a probabilistic generative model from the ABM rules. Then, we transform the inference process into an optimization problem suitable for automatic differentiation. In particular, we use the Gumbel-Softmax reparameterization for categorical agent attributes and stochastic variational inference for parameter estimation. Furthermore, we explore the trade-offs of using variational distributions with different complexity: normal distributions and normalizing flows. We validate our method on a bounded confidence model with agent roles (leaders and followers). Our approach estimates both macroscopic (bounded confidence intervals and backfire thresholds) and microscopic ($200$ categorical, agent-level roles) more accurately than simulation-based and MCMC methods. Consequently, our technique enables experts to tune and validate their ABMs against real-world observations, thus providing insights into human behavior in social systems via data-driven analysis.  ( 2 min )
    The Impact of Quantization on the Robustness of Transformer-based Text Classifiers
    arXiv:2403.05365v1 Announce Type: cross Abstract: Transformer-based models have made remarkable advancements in various NLP areas. Nevertheless, these models often exhibit vulnerabilities when confronted with adversarial attacks. In this paper, we explore the effect of quantization on the robustness of Transformer-based models. Quantization usually involves mapping a high-precision real number to a lower-precision value, aiming at reducing the size of the model at hand. To the best of our knowledge, this work is the first application of quantization on the robustness of NLP models. In our experiments, we evaluate the impact of quantization on BERT and DistilBERT models in text classification using SST-2, Emotion, and MR datasets. We also evaluate the performance of these models against TextFooler, PWWS, and PSO adversarial attacks. Our findings show that quantization significantly improves (by an average of 18.68%) the adversarial accuracy of the models. Furthermore, we compare the effect of quantization versus that of the adversarial training approach on robustness. Our experiments indicate that quantization increases the robustness of the model by 18.80% on average compared to adversarial training without imposing any extra computational overhead during training. Therefore, our results highlight the effectiveness of quantization in improving the robustness of NLP models.  ( 2 min )
    Embedded Deployment of Semantic Segmentation in Medicine through Low-Resolution Inputs
    arXiv:2403.05340v1 Announce Type: cross Abstract: When deploying neural networks in real-life situations, the size and computational effort are often the limiting factors. This is especially true in environments where big, expensive hardware is not affordable, like in embedded medical devices, where budgets are often tight. State-of-the-art proposed multiple different lightweight solutions for such use cases, mostly by changing the base model architecture, not taking the input and output resolution into consideration. In this paper, we propose our architecture that takes advantage of the fact that in hardware-limited environments, we often refrain from using the highest available input resolutions to guarantee a higher throughput. Although using lower-resolution input leads to a significant reduction in computing and memory requirements, it may also incur reduced prediction quality. Our architecture addresses this problem by exploiting the fact that we can still utilize high-resolution ground-truths in training. The proposed model inputs lower-resolution images and high-resolution ground truths, which can improve the prediction quality by 5.5% while adding less than 200 parameters to the model. %reducing the frames per second only from 25 to 20. We conduct an extensive analysis to illustrate that our architecture enhances existing state-of-the-art frameworks for lightweight semantic segmentation of cancer in MRI images. We also tested the deployment speed of state-of-the-art lightweight networks and our architecture on Nvidia's Jetson Nano to emulate deployment in resource-constrained embedded scenarios.  ( 2 min )
    Hybridized Convolutional Neural Networks and Long Short-Term Memory for Improved Alzheimer's Disease Diagnosis from MRI Scans
    arXiv:2403.05353v1 Announce Type: cross Abstract: Brain-related diseases are more sensitive than other diseases due to several factors, including the complexity of surgical procedures, high costs, and other challenges. Alzheimer's disease is a common brain disorder that causes memory loss and the shrinking of brain cells. Early detection is critical for providing proper treatment to patients. However, identifying Alzheimer's at an early stage using manual scanning of CT or MRI scans is challenging. Therefore, researchers have delved into the exploration of computer-aided systems, employing Machine Learning and Deep Learning methodologies, which entail the training of datasets to detect Alzheimer's disease. This study aims to present a hybrid model that combines a CNN model's feature extraction capabilities with an LSTM model's detection capabilities. This study has applied the transfer learning called VGG16 in the hybrid model to extract features from MRI images. The LSTM detects features between the convolution layer and the fully connected layer. The output layer of the fully connected layer uses the softmax function. The training of the hybrid model involved utilizing the ADNI dataset. The trial findings revealed that the model achieved a level of accuracy of 98.8%, a sensitivity rate of 100%, and a specificity rate of 76%. The proposed hybrid model outperforms its contemporary CNN counterparts, showcasing a superior performance.  ( 3 min )
    Looking Ahead to Avoid Being Late: Solving Hard-Constrained Traveling Salesman Problem
    arXiv:2403.05318v1 Announce Type: cross Abstract: Many real-world problems can be formulated as a constrained Traveling Salesman Problem (TSP). However, the constraints are always complex and numerous, making the TSPs challenging to solve. When the number of complicated constraints grows, it is time-consuming for traditional heuristic algorithms to avoid illegitimate outcomes. Learning-based methods provide an alternative to solve TSPs in a soft manner, which also supports GPU acceleration to generate solutions quickly. Nevertheless, the soft manner inevitably results in difficulty solving hard-constrained problems with learning algorithms, and the conflicts between legality and optimality may substantially affect the optimality of the solution. To overcome this problem and to have an effective solution against hard constraints, we proposed a novel learning-based method that uses looking-ahead information as the feature to improve the legality of TSP with Time Windows (TSPTW) solutions. Besides, we constructed TSPTW datasets with hard constraints in order to accurately evaluate and benchmark the statistical performance of various approaches, which can serve the community for future research. With comprehensive experiments on diverse datasets, MUSLA outperforms existing baselines and shows generalizability potential.  ( 2 min )
    Deep Prompt Multi-task Network for Abuse Language Detection
    arXiv:2403.05268v1 Announce Type: cross Abstract: The detection of abusive language remains a long-standing challenge with the extensive use of social networks. The detection task of abusive language suffers from limited accuracy. We argue that the existing detection methods utilize the fine-tuning technique of the pre-trained language models (PLMs) to handle downstream tasks. Hence, these methods fail to stimulate the general knowledge of the PLMs. To address the problem, we propose a novel Deep Prompt Multi-task Network (DPMN) for abuse language detection. Specifically, DPMN first attempts to design two forms of deep prompt tuning and light prompt tuning for the PLMs. The effects of different prompt lengths, tuning strategies, and prompt initialization methods on detecting abusive language are studied. In addition, we propose a Task Head based on Bi-LSTM and FFN, which can be used as a short text classifier. Eventually, DPMN utilizes multi-task learning to improve detection metrics further. The multi-task network has the function of transferring effective knowledge. The proposed DPMN is evaluated against eight typical methods on three public datasets: OLID, SOLID, and AbuseAnalyzer. The experimental results show that our DPMN outperforms the state-of-the-art methods.  ( 2 min )
    DuDoUniNeXt: Dual-domain unified hybrid model for single and multi-contrast undersampled MRI reconstruction
    arXiv:2403.05256v1 Announce Type: cross Abstract: Multi-contrast (MC) Magnetic Resonance Imaging (MRI) reconstruction aims to incorporate a reference image of auxiliary modality to guide the reconstruction process of the target modality. Known MC reconstruction methods perform well with a fully sampled reference image, but usually exhibit inferior performance, compared to single-contrast (SC) methods, when the reference image is missing or of low quality. To address this issue, we propose DuDoUniNeXt, a unified dual-domain MRI reconstruction network that can accommodate to scenarios involving absent, low-quality, and high-quality reference images. DuDoUniNeXt adopts a hybrid backbone that combines CNN and ViT, enabling specific adjustment of image domain and k-space reconstruction. Specifically, an adaptive coarse-to-fine feature fusion module (AdaC2F) is devised to dynamically process the information from reference images of varying qualities. Besides, a partially shared shallow feature extractor (PaSS) is proposed, which uses shared and distinct parameters to handle consistent and discrepancy information among contrasts. Experimental results demonstrate that the proposed model surpasses state-of-the-art SC and MC models significantly. Ablation studies show the effectiveness of the proposed hybrid backbone, AdaC2F, PaSS, and the dual-domain unified learning scheme.  ( 2 min )
    ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models
    arXiv:2403.05266v1 Announce Type: cross Abstract: Large language models (LLMs) have achieved unprecedented performance in various applications, yet their evaluation remains a critical issue. Existing hallucination benchmarks are either static or lack adjustable complexity for thorough analysis. We contend that utilizing existing relational databases is a promising approach for constructing benchmarks due to their accurate knowledge description via functional dependencies. We propose ERBench to automatically convert any relational database into a benchmark based on the entity-relationship (ER) model. Our key idea is to construct questions using the database schema, records, and functional dependencies such that they can be automatically verified. In addition, we use foreign key constraints to join relations and construct multihop questions, which can be arbitrarily complex and used to debug the intermediate answers of LLMs. Finally, ERBench supports continuous evaluation, multimodal questions, and various prompt engineering techniques. In our experiments, we construct an LLM benchmark using databases of multiple domains and make an extensive comparison of contemporary LLMs. We observe that better LLMs like GPT-4 can handle a larger variety of question types, but are by no means perfect. Also, correct answers do not necessarily imply correct rationales, which is an important evaluation that ERBench does better than other benchmarks for various question types. Code is available at https: //github.com/DILAB-KAIST/ERBench.  ( 2 min )
    Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
    arXiv:2403.05239v1 Announce Type: cross Abstract: Vanilla text-to-image diffusion models struggle with generating accurate human images, commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs.Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls -- human-centric priors such as pose or depth maps -- during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, according to an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models to synthesize high-quality human images based on user-written prompts. Project page: \url{https://hcplayercvpr2024.github.io}.  ( 2 min )
    On Representing Electronic Wave Functions with Sign Equivariant Neural Networks
    arXiv:2403.05249v1 Announce Type: cross Abstract: Recent neural networks demonstrated impressively accurate approximations of electronic ground-state wave functions. Such neural networks typically consist of a permutation-equivariant neural network followed by a permutation-antisymmetric operation to enforce the electronic exchange symmetry. While accurate, such neural networks are computationally expensive. In this work, we explore the flipped approach, where we first compute antisymmetric quantities based on the electronic coordinates and then apply sign equivariant neural networks to preserve the antisymmetry. While this approach promises acceleration thanks to the lower-dimensional representation, we demonstrate that it reduces to a Jastrow factor, a commonly used permutation-invariant multiplicative factor in the wave function. Our empirical results support this further, finding little to no improvements over baselines. We conclude with neither theoretical nor empirical advantages of sign equivariant functions for representing electronic wave functions within the evaluation of this work.  ( 2 min )
    Personalized Audiobook Recommendations at Spotify Through Graph Neural Networks
    arXiv:2403.05185v1 Announce Type: cross Abstract: In the ever-evolving digital audio landscape, Spotify, well-known for its music and talk content, has recently introduced audiobooks to its vast user base. While promising, this move presents significant challenges for personalized recommendations. Unlike music and podcasts, audiobooks, initially available for a fee, cannot be easily skimmed before purchase, posing higher stakes for the relevance of recommendations. Furthermore, introducing a new content type into an existing platform confronts extreme data sparsity, as most users are unfamiliar with this new content type. Lastly, recommending content to millions of users requires the model to react fast and be scalable. To address these challenges, we leverage podcast and music user preferences and introduce 2T-HGNN, a scalable recommendation system comprising Heterogeneous Graph Neural Networks (HGNNs) and a Two Tower (2T) model. This novel approach uncovers nuanced item relationships while ensuring low latency and complexity. We decouple users from the HGNN graph and propose an innovative multi-link neighbor sampler. These choices, together with the 2T component, significantly reduce the complexity of the HGNN model. Empirical evaluations involving millions of users show significant improvement in the quality of personalized recommendations, resulting in a +46% increase in new audiobooks start rate and a +23% boost in streaming rates. Intriguingly, our model's impact extends beyond audiobooks, benefiting established products like podcasts.  ( 3 min )
    Synthetic Privileged Information Enhances Medical Image Representation Learning
    arXiv:2403.05220v1 Announce Type: cross Abstract: Multimodal self-supervised representation learning has consistently proven to be a highly effective method in medical image analysis, offering strong task performance and producing biologically informed insights. However, these methods heavily rely on large, paired datasets, which is prohibitive for their use in scenarios where paired data does not exist, or there is only a small amount available. In contrast, image generation methods can work well on very small datasets, and can find mappings between unpaired datasets, meaning an effectively unlimited amount of paired synthetic data can be generated. In this work, we demonstrate that representation learning can be significantly improved by synthetically generating paired information, both compared to training on either single-modality (up to 4.4x error reduction) or authentic multi-modal paired datasets (up to 5.6x error reduction).  ( 2 min )
    Follow-the-Perturbed-Leader with Fr\'{e}chet-type Tail Distributions: Optimality in Adversarial Bandits and Best-of-Both-Worlds
    arXiv:2403.05134v1 Announce Type: cross Abstract: This paper studies the optimality of the Follow-the-Perturbed-Leader (FTPL) policy in both adversarial and stochastic $K$-armed bandits. Despite the widespread use of the Follow-the-Regularized-Leader (FTRL) framework with various choices of regularization, the FTPL framework, which relies on random perturbations, has not received much attention, despite its inherent simplicity. In adversarial bandits, there has been conjecture that FTPL could potentially achieve $\mathcal{O}(\sqrt{KT})$ regrets if perturbations follow a distribution with a Fr\'{e}chet-type tail. Recent work by Honda et al. (2023) showed that FTPL with Fr\'{e}chet distribution with shape $\alpha=2$ indeed attains this bound and, notably logarithmic regret in stochastic bandits, meaning the Best-of-Both-Worlds (BOBW) capability of FTPL. However, this result only partly resolves the above conjecture because their analysis heavily relies on the specific form of the Fr\'{e}chet distribution with this shape. In this paper, we establish a sufficient condition for perturbations to achieve $\mathcal{O}(\sqrt{KT})$ regrets in the adversarial setting, which covers, e.g., Fr\'{e}chet, Pareto, and Student-$t$ distributions. We also demonstrate the BOBW achievability of FTPL with certain Fr\'{e}chet-type tail distributions. Our results contribute not only to resolving existing conjectures through the lens of extreme value theory but also potentially offer insights into the effect of the regularization functions in FTRL through the mapping from FTPL to FTRL.  ( 2 min )
    Greedy feature selection: Classifier-dependent feature selection via greedy methods
    arXiv:2403.05138v1 Announce Type: cross Abstract: The purpose of this study is to introduce a new approach to feature ranking for classification tasks, called in what follows greedy feature selection. In statistical learning, feature selection is usually realized by means of methods that are independent of the classifier applied to perform the prediction using that reduced number of features. Instead, greedy feature selection identifies the most important feature at each step and according to the selected classifier. In the paper, the benefits of such scheme are investigated theoretically in terms of model capacity indicators, such as the Vapnik-Chervonenkis (VC) dimension or the kernel alignment, and tested numerically by considering its application to the problem of predicting geo-effective manifestations of the active Sun.  ( 2 min )
    Multi-Tower Multi-Interest Recommendation with User Representation Repel
    arXiv:2403.05122v1 Announce Type: cross Abstract: In the era of information overload, the value of recommender systems has been profoundly recognized in academia and industry alike. Multi-interest sequential recommendation, in particular, is a subfield that has been receiving increasing attention in recent years. By generating multiple-user representations, multi-interest learning models demonstrate superior expressiveness than single-user representation models, both theoretically and empirically. Despite major advancements in the field, three major issues continue to plague the performance and adoptability of multi-interest learning methods, the difference between training and deployment objectives, the inability to access item information, and the difficulty of industrial adoption due to its single-tower architecture. We address these challenges by proposing a novel multi-tower multi-interest framework with user representation repel. Experimental results across multiple large-scale industrial datasets proved the effectiveness and generalizability of our proposed framework.  ( 2 min )
    RIS-empowered Topology Control for Distributed Learning in Urban Air Mobility
    arXiv:2403.05133v1 Announce Type: cross Abstract: Urban Air Mobility (UAM) expands vehicles from the ground to the near-ground space, envisioned as a revolution for transportation systems. Comprehensive scene perception is the foundation for autonomous aerial driving. However, UAM encounters the intelligent perception challenge: high perception learning requirements conflict with the limited sensors and computing chips of flying cars. To overcome the challenge, federated learning (FL) and other collaborative learning have been proposed to enable resource-limited devices to conduct onboard deep learning (DL) collaboratively. But traditional collaborative learning like FL relies on a central integrator for DL model aggregation, which is difficult to deploy in dynamic environments. The fully decentralized learning schemes may be the intuitive solution while the convergence of distributed learning cannot be guaranteed. Accordingly, this paper explores reconfigurable intelligent surfaces (RIS) empowered distributed learning, taking account of topological attributes to facilitate the learning performance with convergence guarantee. We propose several FL topological criteria for optimizing the transmission delay and convergence rate by exploiting the Laplacian matrix eigenvalues of the communication network. Subsequently, we innovatively leverage the RIS link modification ability to remold the current network according to the proposed topological criteria. This paper rethinks the functions of RIS from the perspective of the network layer. Furthermore, a deep deterministic policy gradient-based RIS phase shift control algorithm is developed to construct or deconstruct the network links simultaneously to reshape the communication network. Simulation experiments are conducted over MobileNet-based multi-view learning to verify the efficiency of the distributed FL framework.  ( 3 min )
    Estimation of Electronic Band Gap Energy From Material Properties Using Machine Learning
    arXiv:2403.05119v1 Announce Type: cross Abstract: Machine learning techniques are utilized to estimate the electronic band gap energy and forecast the band gap category of materials based on experimentally quantifiable properties. The determination of band gap energy is critical for discerning various material properties, such as its metallic nature, and potential applications in electronic and optoelectronic devices. While numerical methods exist for computing band gap energy, they often entail high computational costs and have limitations in accuracy and scalability. A machine learning-driven model capable of swiftly predicting material band gap energy using easily obtainable experimental properties would offer a superior alternative to conventional density functional theory (DFT) methods. Our model does not require any preliminary DFT-based calculation or knowledge of the structure of the material. We present a scheme for improving the performance of simple regression and classification models by partitioning the dataset into multiple clusters. A new evaluation scheme for comparing the performance of ML-based models in material sciences involving both regression and classification tasks is introduced based on traditional evaluation metrics. It is shown that on this new evaluation metric, our method of clustering the dataset results in better performance.  ( 2 min )
    Efficient Data Collection for Robotic Manipulation via Compositional Generalization
    arXiv:2403.05110v1 Announce Type: cross Abstract: Data collection has become an increasingly important problem in robotic manipulation, yet there still lacks much understanding of how to effectively collect data to facilitate broad generalization. Recent works on large-scale robotic data collection typically vary a wide range of environmental factors during data collection, such as object types and table textures. While these works attempt to cover a diverse variety of scenarios, they do not explicitly account for the possible compositional abilities of policies trained on the data. If robot policies are able to compose different environmental factors of variation (e.g., object types, table heights) from their training data to succeed when encountering unseen factor combinations, then we can exploit this to avoid collecting data for situations that composition would address. To investigate this possibility, we conduct thorough empirical studies both in simulation and on a real robot that compare data collection strategies and assess whether visual imitation learning policies can compose environmental factors. We find that policies do exhibit composition, although leveraging prior robotic datasets is critical for this on a real robot. We use these insights to provide better practices for in-domain data collection by proposing data collection strategies that exploit composition, which can induce better generalization than naive approaches for the same amount of effort during data collection. We further demonstrate that a real robot policy trained on data from such a strategy achieves a success rate of 77.5% when transferred to entirely new environments that encompass unseen combinations of environmental factors, whereas policies trained using data collected without accounting for environmental variation fail to transfer effectively, with a success rate of only 2.5%. We provide videos at http://iliad.stanford.edu/robot-data-comp/.  ( 3 min )
    Improving Diffusion-Based Generative Models via Approximated Optimal Transport
    arXiv:2403.05069v1 Announce Type: cross Abstract: We introduce the Approximated Optimal Transport (AOT) technique, a novel training scheme for diffusion-based generative models. Our approach aims to approximate and integrate optimal transport into the training process, significantly enhancing the ability of diffusion models to estimate the denoiser outputs accurately. This improvement leads to ODE trajectories of diffusion models with lower curvature and reduced truncation errors during sampling. We achieve superior image quality and reduced sampling steps by employing AOT in training. Specifically, we achieve FID scores of 1.88 with just 27 NFEs and 1.73 with 29 NFEs in unconditional and conditional generations, respectively. Furthermore, when applying AOT to train the discriminator for guidance, we establish new state-of-the-art FID scores of 1.68 and 1.58 for unconditional and conditional generations, respectively, each with 29 NFEs. This outcome demonstrates the effectiveness of AOT in enhancing the performance of diffusion models.  ( 2 min )
    Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial Hypervolume
    arXiv:2403.05100v1 Announce Type: cross Abstract: The escalating threat of adversarial attacks on deep learning models, particularly in security-critical fields, has underscored the need for robust deep learning systems. Conventional robustness evaluations have relied on adversarial accuracy, which measures a model's performance under a specific perturbation intensity. However, this singular metric does not fully encapsulate the overall resilience of a model against varying degrees of perturbation. To address this gap, we propose a new metric termed adversarial hypervolume, assessing the robustness of deep learning models comprehensively over a range of perturbation intensities from a multi-objective optimization standpoint. This metric allows for an in-depth comparison of defense mechanisms and recognizes the trivial improvements in robustness afforded by less potent defensive strategies. Additionally, we adopt a novel training algorithm that enhances adversarial robustness uniformly across various perturbation intensities, in contrast to methods narrowly focused on optimizing adversarial accuracy. Our extensive empirical studies validate the effectiveness of the adversarial hypervolume metric, demonstrating its ability to reveal subtle differences in robustness that adversarial accuracy overlooks. This research contributes a new measure of robustness and establishes a standard for assessing and benchmarking the resilience of current and future defensive models against adversarial threats.  ( 2 min )
    Are Human Conversations Special? A Large Language Model Perspective
    arXiv:2403.05045v1 Announce Type: cross Abstract: This study analyzes changes in the attention mechanisms of large language models (LLMs) when used to understand natural conversations between humans (human-human). We analyze three use cases of LLMs: interactions over web content, code, and mathematical texts. By analyzing attention distance, dispersion, and interdependency across these domains, we highlight the unique challenges posed by conversational data. Notably, conversations require nuanced handling of long-term contextual relationships and exhibit higher complexity through their attention patterns. Our findings reveal that while language models exhibit domain-specific attention behaviors, there is a significant gap in their ability to specialize in human conversations. Through detailed attention entropy analysis and t-SNE visualizations, we demonstrate the need for models trained with a diverse array of high-quality conversational data to enhance understanding and generation of human-like dialogue. This research highlights the importance of domain specialization in language models and suggests pathways for future advancement in modeling human conversational nuances.  ( 2 min )
    A Sinkhorn-type Algorithm for Constrained Optimal Transport
    arXiv:2403.05054v1 Announce Type: cross Abstract: Entropic optimal transport (OT) and the Sinkhorn algorithm have made it practical for machine learning practitioners to perform the fundamental task of calculating transport distance between statistical distributions. In this work, we focus on a general class of OT problems under a combination of equality and inequality constraints. We derive the corresponding entropy regularization formulation and introduce a Sinkhorn-type algorithm for such constrained OT problems supported by theoretical guarantees. We first bound the approximation error when solving the problem through entropic regularization, which reduces exponentially with the increase of the regularization parameter. Furthermore, we prove a sublinear first-order convergence rate of the proposed Sinkhorn-type algorithm in the dual space by characterizing the optimization procedure with a Lyapunov function. To achieve fast and higher-order convergence under weak entropy regularization, we augment the Sinkhorn-type algorithm with dynamic regularization scheduling and second-order acceleration. Overall, this work systematically combines recent theoretical and numerical advances in entropic optimal transport with the constrained case, allowing practitioners to derive approximate transport plans in complex scenarios.  ( 2 min )
    Defending Against Unforeseen Failure Modes with Latent Adversarial Training
    arXiv:2403.05030v1 Announce Type: cross Abstract: AI systems sometimes exhibit harmful unintended behaviors post-deployment. This is often despite extensive diagnostics and debugging by developers. Minimizing risks from models is challenging because the attack surface is so large. It is not tractable to exhaustively search for inputs that may cause a model to fail. Red-teaming and adversarial training (AT) are commonly used to make AI systems more robust. However, they have not been sufficient to avoid many real-world failure modes that differ from the ones adversarially trained on. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.  ( 2 min )
    SecGPT: An Execution Isolation Architecture for LLM-Based Systems
    arXiv:2403.04960v1 Announce Type: cross Abstract: Large language models (LLMs) extended as systems, such as ChatGPT, have begun supporting third-party applications. These LLM apps leverage the de facto natural language-based automated execution paradigm of LLMs: that is, apps and their interactions are defined in natural language, provided access to user data, and allowed to freely interact with each other and the system. These LLM app ecosystems resemble the settings of earlier computing platforms, where there was insufficient isolation between apps and the system. Because third-party apps may not be trustworthy, and exacerbated by the imprecision of the natural language interfaces, the current designs pose security and privacy risks for users. In this paper, we propose SecGPT, an architecture for LLM-based systems that aims to mitigate the security and privacy issues that arise with the execution of third-party apps. SecGPT's key idea is to isolate the execution of apps and more precisely mediate their interactions outside of their isolated environments. We evaluate SecGPT against a number of case study attacks and demonstrate that it protects against many security, privacy, and safety issues that exist in non-isolated LLM-based systems. The performance overhead incurred by SecGPT to improve security is under 0.3x for three-quarters of the tested queries. To foster follow-up research, we release SecGPT's source code at https://github.com/llm-platform-security/SecGPT.  ( 2 min )
    A Probabilistic Hadamard U-Net for MRI Bias Field Correction
    arXiv:2403.05024v1 Announce Type: cross Abstract: Magnetic field inhomogeneity correction remains a challenging task in MRI analysis. Most established techniques are designed for brain MRI by supposing that image intensities in the identical tissue follow a uniform distribution. Such an assumption cannot be easily applied to other organs, especially those that are small in size and heterogeneous in texture (large variations in intensity), such as the prostate. To address this problem, this paper proposes a probabilistic Hadamard U-Net (PHU-Net) for prostate MRI bias field correction. First, a novel Hadamard U-Net (HU-Net) is introduced to extract the low-frequency scalar field, multiplied by the original input to obtain the prototypical corrected image. HU-Net converts the input image from the time domain into the frequency domain via Hadamard transform. In the frequency domain, high-frequency components are eliminated using the trainable filter (scaling layer), hard-thresholding layer, and sparsity penalty. Next, a conditional variational autoencoder is used to encode possible bias field-corrected variants into a low-dimensional latent space. Random samples drawn from latent space are then incorporated with a prototypical corrected image to generate multiple plausible images. Experimental results demonstrate the effectiveness of PHU-Net in correcting bias-field in prostate MRI with a fast inference speed. It has also been shown that prostate MRI segmentation accuracy improves with the high-quality corrected images from PHU-Net. The code will be available in the final version of this manuscript.  ( 3 min )
    Adaptive Split Learning over Energy-Constrained Wireless Edge Networks
    arXiv:2403.05158v1 Announce Type: new Abstract: Split learning (SL) is a promising approach for training artificial intelligence (AI) models, in which devices collaborate with a server to train an AI model in a distributed manner, based on a same fixed split point. However, due to the device heterogeneity and variation of channel conditions, this way is not optimal in training delay and energy consumption. In this paper, we design an adaptive split learning (ASL) scheme which can dynamically select split points for devices and allocate computing resource for the server in wireless edge networks. We formulate an optimization problem to minimize the average training latency subject to long-term energy consumption constraint. The difficulties in solving this problem are the lack of future information and mixed integer programming (MIP). To solve it, we propose an online algorithm leveraging the Lyapunov theory, named OPEN, which decomposes it into a new MIP problem only with the current information. Then, a two-layer optimization method is proposed to solve the MIP problem. Extensive simulation results demonstrate that the ASL scheme can reduce the average training delay and energy consumption by 53.7% and 22.1%, respectively, as compared to the existing SL schemes.  ( 2 min )
    Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
    arXiv:2403.05171v1 Announce Type: new Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.  ( 2 min )
    Denoising Autoregressive Representation Learning
    arXiv:2403.05196v1 Announce Type: new Abstract: In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.  ( 2 min )
    Group Privacy Amplification and Unified Amplification by Subsampling for R\'enyi Differential Privacy
    arXiv:2403.04867v1 Announce Type: cross Abstract: Differential privacy (DP) has various desirable properties, such as robustness to post-processing, group privacy, and amplification by subsampling, which can be derived independently of each other. Our goal is to determine whether stronger privacy guarantees can be obtained by considering multiple of these properties jointly. To this end, we focus on the combination of group privacy and amplification by subsampling. To provide guarantees that are amenable to machine learning algorithms, we conduct our analysis in the framework of R\'enyi-DP, which has more favorable composition properties than $(\epsilon,\delta)$-DP. As part of this analysis, we develop a unified framework for deriving amplification by subsampling guarantees for R\'enyi-DP, which represents the first such framework for a privacy accounting method and is of independent interest. We find that it not only lets us improve upon and generalize existing amplification results for R\'enyi-DP, but also derive provably tight group privacy amplification guarantees stronger than existing principles. These results establish the joint study of different DP properties as a promising research direction.  ( 2 min )
    Investigation of the Impact of Synthetic Training Data in the Industrial Application of Terminal Strip Object Detection
    arXiv:2403.04809v1 Announce Type: cross Abstract: In industrial manufacturing, numerous tasks of visually inspecting or detecting specific objects exist that are currently performed manually or by classical image processing methods. Therefore, introducing recent deep learning models to industrial environments holds the potential to increase productivity and enable new applications. However, gathering and labeling sufficient data is often intractable, complicating the implementation of such projects. Hence, image synthesis methods are commonly used to generate synthetic training data from 3D models and annotate them automatically, although it results in a sim-to-real domain gap. In this paper, we investigate the sim-to-real generalization performance of standard object detectors on the complex industrial application of terminal strip object detection. Combining domain randomization and domain knowledge, we created an image synthesis pipeline for automatically generating the training data. Moreover, we manually annotated 300 real images of terminal strips for the evaluation. The results show the cruciality of the objects of interest to have the same scale in either domain. Nevertheless, under optimized scaling conditions, the sim-to-real performance difference in mean average precision amounts to 2.69 % for RetinaNet and 0.98 % for Faster R-CNN, qualifying this approach for industrial requirements.  ( 2 min )
    Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding
    arXiv:2403.04797v1 Announce Type: cross Abstract: This paper aims to overcome the "lost-in-the-middle" challenge of large language models (LLMs). While recent advancements have successfully enabled LLMs to perform stable language modeling with up to 4 million tokens, the persistent difficulty faced by most LLMs in identifying relevant information situated in the middle of the context has not been adequately tackled. To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle the relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead. Ms-PoE leverages the position indice rescaling to relieve the long-term decay effect introduced by RoPE, while meticulously assigning distinct scaling ratios to different attention heads to preserve essential knowledge learned during the pre-training step, forming a multi-scale context fusion from short to long distance. Extensive experiments with a wide range of LLMs demonstrate the efficacy of our approach. Notably, Ms-PoE achieves an average accuracy gain of up to 3.8 on the Zero-SCROLLS benchmark over the original LLMs. Code are available at https://github.com/VITA-Group/Ms-PoE.  ( 2 min )
    Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
    arXiv:2403.04811v1 Announce Type: cross Abstract: While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination of popular code generation benchmarks, and precisely quantify their overlap with pretraining corpus through both surface-level and semantic-level matching. In our experiments, we show that there are substantial overlap between popular code generation benchmarks and open training corpus, and models perform significantly better on the subset of the benchmarks where similar solutions are seen during training. We also conduct extensive analysis on the factors that affects model memorization and generalization, such as model size, problem difficulty, and question length. We release all resulting files from our matching pipeline for future research.  ( 2 min )
    Breaking the Language Barrier: Can Direct Inference Outperform Pre-Translation in Multilingual LLM Applications?
    arXiv:2403.04792v1 Announce Type: cross Abstract: Large language models hold significant promise in multilingual applications. However, inherent biases stemming from predominantly English-centric pre-training have led to the widespread practice of pre-translation, i.e., translating non-English inputs to English before inference, leading to complexity and information loss. This study re-evaluates the need for pre-translation in the context of PaLM2 models (Anil et al., 2023), which have been established as highly performant in multilingual tasks. We offer a comprehensive investigation across 108 languages and 6 diverse benchmarks, including open-end generative tasks, which were excluded from previous similar studies. Our findings challenge the pre-translation paradigm established in prior research, highlighting the advantages of direct inference in PaLM2. Specifically, PaLM2-L consistently outperforms pre-translation in 94 out of 108 languages. These findings pave the way for more efficient and effective multilingual applications, alleviating the limitations associated with pre-translation and unlocking linguistic authenticity.  ( 2 min )
    TopicDiff: A Topic-enriched Diffusion Approach for Multimodal Conversational Emotion Detection
    arXiv:2403.04789v1 Announce Type: cross Abstract: Multimodal Conversational Emotion (MCE) detection, generally spanning across the acoustic, vision and language modalities, has attracted increasing interest in the multimedia community. Previous studies predominantly focus on learning contextual information in conversations with only a few considering the topic information in single language modality, while always neglecting the acoustic and vision topic information. On this basis, we propose a model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in MCE tasks. Particularly, we integrate the diffusion model into neural topic model to alleviate the diversity deficiency problem of neural topic model in capturing topic information. Detailed evaluations demonstrate the significant improvements of TopicDiff over the state-of-the-art MCE baselines, justifying the importance of multimodal topic information to MCE and the effectiveness of TopicDiff in capturing such information. Furthermore, we observe an interesting finding that the topic information in acoustic and vision is more discriminative and robust compared to the language.  ( 2 min )
    Tune without Validation: Searching for Learning Rate and Weight Decay on Training Sets
    arXiv:2403.05532v1 Announce Type: new Abstract: We introduce Tune without Validation (Twin), a pipeline for tuning learning rate and weight decay without validation sets. We leverage a recent theoretical framework concerning learning phases in hypothesis space to devise a heuristic that predicts what hyper-parameter (HP) combinations yield better generalization. Twin performs a grid search of trials according to an early-/non-early-stopping scheduler and then segments the region that provides the best results in terms of training loss. Among these trials, the weight norm strongly correlates with predicting generalization. To assess the effectiveness of Twin, we run extensive experiments on 20 image classification datasets and train several families of deep networks, including convolutional, transformer, and feed-forward models. We demonstrate proper HP selection when training from scratch and fine-tuning, emphasizing small-sample scenarios.  ( 2 min )
    UniTable: Towards a Unified Framework for Table Structure Recognition via Self-Supervised Pretraining
    arXiv:2403.04822v1 Announce Type: cross Abstract: Tables convey factual and quantitative data with implicit conventions created by humans that are often challenging for machines to parse. Prior work on table structure recognition (TSR) has mainly centered around complex task-specific combinations of available inputs and tools. We present UniTable, a training framework that unifies both the training paradigm and training objective of TSR. Its training paradigm combines the simplicity of purely pixel-level inputs with the effectiveness and scalability empowered by self-supervised pretraining (SSP) from diverse unannotated tabular images. Our framework unifies the training objectives of all three TSR tasks - extracting table structure, cell content, and cell bounding box (bbox) - into a unified task-agnostic training objective: language modeling. Extensive quantitative and qualitative analyses highlight UniTable's state-of-the-art (SOTA) performance on four of the largest TSR datasets. To promote reproducible research, enhance transparency, and SOTA innovations, we open-source our code at https://github.com/poloclub/unitable and release the first-of-its-kind Jupyter Notebook of the whole inference pipeline, fine-tuned across multiple TSR datasets, supporting all three TSR tasks.  ( 2 min )
    WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off
    arXiv:2403.04808v1 Announce Type: cross Abstract: Watermarking is a technical means to dissuade malfeasant usage of Large Language Models. This paper proposes a novel watermarking scheme, so-called WaterMax, that enjoys high detectability while sustaining the quality of the generated text of the original LLM. Its new design leaves the LLM untouched (no modification of the weights, logits, temperature, or sampling technique). WaterMax balances robustness and complexity contrary to the watermarking techniques of the literature inherently provoking a trade-off between quality and robustness. Its performance is both theoretically proven and experimentally validated. It outperforms all the SotA techniques under the most complete benchmark suite.  ( 2 min )
    Optimizing Retinal Prosthetic Stimuli with Conditional Invertible Neural Networks
    arXiv:2403.04884v1 Announce Type: cross Abstract: Implantable retinal prostheses offer a promising solution to restore partial vision by circumventing damaged photoreceptor cells in the retina and directly stimulating the remaining functional retinal cells. However, the information transmission between the camera and retinal cells is often limited by the low resolution of the electrode array and the lack of specificity for different ganglion cell types, resulting in suboptimal stimulations. In this work, we propose to utilize normalizing flow-based conditional invertible neural networks to optimize retinal implant stimulation in an unsupervised manner. The invertibility of these networks allows us to use them as a surrogate for the computational model of the visual system, while also encoding input camera signals into optimized electrical stimuli on the electrode array. Compared to other methods, such as trivial downsampling, linear models, and feed-forward convolutional neural networks, the flow-based invertible neural network and its conditional extension yield better visual reconstruction qualities w.r.t. various metrics using a physiologically validated simulation tool.  ( 2 min )
    Learning Traveling Solitary Waves Using Separable Gaussian Neural Networks
    arXiv:2403.04883v1 Announce Type: cross Abstract: In this paper, we apply a machine-learning approach to learn traveling solitary waves across various families of partial differential equations (PDEs). Our approach integrates a novel interpretable neural network (NN) architecture, called Separable Gaussian Neural Networks (SGNN) into the framework of Physics-Informed Neural Networks (PINNs). Unlike the traditional PINNs that treat spatial and temporal data as independent inputs, the present method leverages wave characteristics to transform data into the so-called co-traveling wave frame. This adaptation effectively addresses the issue of propagation failure in PINNs when applied to large computational domains. Here, the SGNN architecture demonstrates robust approximation capabilities for single-peakon, multi-peakon, and stationary solutions within the (1+1)-dimensional, $b$-family of PDEs. In addition, we expand our investigations, and explore not only peakon solutions in the $ab$-family but also compacton solutions in (2+1)-dimensional, Rosenau-Hyman family of PDEs. A comparative analysis with MLP reveals that SGNN achieves comparable accuracy with fewer than a tenth of the neurons, underscoring its efficiency and potential for broader application in solving complex nonlinear PDEs.  ( 2 min )
    Switching the Loss Reduces the Cost in Batch Reinforcement Learning
    arXiv:2403.05385v1 Announce Type: new Abstract: We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving $\textit{small-cost}$ bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-LOG uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.  ( 2 min )
    Enhancing Security in Federated Learning through Adaptive Consensus-Based Model Update Validation
    arXiv:2403.04803v1 Announce Type: cross Abstract: This paper introduces an advanced approach for fortifying Federated Learning (FL) systems against label-flipping attacks. We propose a simplified consensus-based verification process integrated with an adaptive thresholding mechanism. This dynamic thresholding is designed to adjust based on the evolving landscape of model updates, offering a refined layer of anomaly detection that aligns with the real-time needs of distributed learning environments. Our method necessitates a majority consensus among participating clients to validate updates, ensuring that only vetted and consensual modifications are applied to the global model. The efficacy of our approach is validated through experiments on two benchmark datasets in deep learning, CIFAR-10 and MNIST. Our results indicate a significant mitigation of label-flipping attacks, bolstering the FL system's resilience. This method transcends conventional techniques that depend on anomaly detection or statistical validation by incorporating a verification layer reminiscent of blockchain's participatory validation without the associated cryptographic overhead. The innovation of our approach rests in striking an optimal balance between heightened security measures and the inherent limitations of FL systems, such as computational efficiency and data privacy. Implementing a consensus mechanism specifically tailored for FL environments paves the way for more secure, robust, and trustworthy distributed machine learning applications, where safeguarding data integrity and model robustness is critical.  ( 2 min )
    Control-based Graph Embeddings with Data Augmentation for Contrastive Learning
    arXiv:2403.04923v1 Announce Type: new Abstract: In this paper, we study the problem of unsupervised graph representation learning by harnessing the control properties of dynamical networks defined on graphs. Our approach introduces a novel framework for contrastive learning, a widely prevalent technique for unsupervised representation learning. A crucial step in contrastive learning is the creation of 'augmented' graphs from the input graphs. Though different from the original graphs, these augmented graphs retain the original graph's structural characteristics. Here, we propose a unique method for generating these augmented graphs by leveraging the control properties of networks. The core concept revolves around perturbing the original graph to create a new one while preserving the controllability properties specific to networks and graphs. Compared to the existing methods, we demonstrate that this innovative approach enhances the effectiveness of contrastive learning frameworks, leading to superior results regarding the accuracy of the classification tasks. The key innovation lies in our ability to decode the network structure using these control properties, opening new avenues for unsupervised graph representation learning.  ( 2 min )
    Simple Multigraph Convolution Networks
    arXiv:2403.05014v1 Announce Type: new Abstract: Existing multigraph convolution methods either ignore the cross-view interaction among multiple graphs, or induce extremely high computational cost due to standard cross-view polynomial operators. To alleviate this problem, this paper proposes a Simple MultiGraph Convolution Networks (SMGCN) which first extracts consistent cross-view topology from multigraphs including edge-level and subgraph-level topology, then performs polynomial expansion based on raw multigraphs and consistent topologies. In theory, SMGCN utilizes the consistent topologies in polynomial expansion rather than standard cross-view polynomial expansion, which performs credible cross-view spatial message-passing, follows the spectral convolution paradigm, and effectively reduces the complexity of standard polynomial expansion. In the simulations, experimental results demonstrate that SMGCN achieves state-of-the-art performance on ACM and DBLP multigraph benchmark datasets. Our codes are available at https://github.com/frinkleko/SMGCN.  ( 2 min )
    Simulating Battery-Powered TinyML Systems Optimised using Reinforcement Learning in Image-Based Anomaly Detection
    arXiv:2403.05106v1 Announce Type: new Abstract: Advances in Tiny Machine Learning (TinyML) have bolstered the creation of smart industry solutions, including smart agriculture, healthcare and smart cities. Whilst related research contributes to enabling TinyML solutions on constrained hardware, there is a need to amplify real-world applications by optimising energy consumption in battery-powered systems. The work presented extends and contributes to TinyML research by optimising battery-powered image-based anomaly detection Internet of Things (IoT) systems. Whilst previous work in this area has yielded the capabilities of on-device inferencing and training, there has yet to be an investigation into optimising the management of such capabilities using machine learning approaches, such as Reinforcement Learning (RL), to improve the deployment battery life of such systems. Using modelled simulations, the battery life effects of an RL algorithm are benchmarked against static and dynamic optimisation approaches, with the foundation laid for a hardware benchmark to follow. It is shown that using RL within a TinyML-enabled IoT system to optimise the system operations, including cloud anomaly processing and on-device training, yields an improved battery life of 22.86% and 10.86% compared to static and dynamic optimisation approaches respectively. The proposed solution can be deployed to resource-constrained hardware, given its low memory footprint of 800 B, which could be further reduced. This further facilitates the real-world deployment of such systems, including key sectors such as smart agriculture.  ( 3 min )
    Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
    arXiv:2403.04814v1 Announce Type: cross Abstract: We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference using LLMs. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.  ( 2 min )
    AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
    arXiv:2403.04783v1 Announce Type: new Abstract: Despite extensive pre-training and fine-tuning in moral alignment to prevent generating harmful information at user request, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a response-filtering based multi-agent defense framework that filters harmful responses from LLMs. This framework assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division in tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. AutoDefense can adapt to various sizes and kinds of open-source LLMs that serve as agents. Through conducting extensive experiments on a large scale of harmful and safe prompts, we validate the effectiveness of the proposed AutoDefense in improving the robustness against jailbreak attacks, while maintaining the performance at normal user request. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.  ( 2 min )
    Mathematics of Neural Networks (Lecture Notes Graduate Course)
    arXiv:2403.04807v1 Announce Type: new Abstract: These are the lecture notes that accompanied the course of the same name that I taught at the Eindhoven University of Technology from 2021 to 2023. The course is intended as an introduction to neural networks for mathematics students at the graduate level and aims to make mathematics students interested in further researching neural networks. It consists of two parts: first a general introduction to deep learning that focuses on introducing the field in a formal mathematical way. The second part provides an introduction to the theory of Lie groups and homogeneous spaces and how it can be applied to design neural networks with desirable geometric equivariances. The lecture notes were made to be as self-contained as possible so as to accessible for any student with a moderate mathematics background. The course also included coding tutorials and assignments in the form of a set of Jupyter notebooks that are publicly available at https://gitlab.com/bsmetsjr/mathematics_of_neural_networks.  ( 2 min )
    AttentionStitch: How Attention Solves the Speech Editing Problem
    arXiv:2403.04804v1 Announce Type: cross Abstract: The generation of natural and high-quality speech from text is a challenging problem in the field of natural language processing. In addition to speech generation, speech editing is also a crucial task, which requires the seamless and unnoticeable integration of edited speech into synthesized speech. We propose a novel approach to speech editing by leveraging a pre-trained text-to-speech (TTS) model, such as FastSpeech 2, and incorporating a double attention block network on top of it to automatically merge the synthesized mel-spectrogram with the mel-spectrogram of the edited text. We refer to this model as AttentionStitch, as it harnesses attention to stitch audio samples together. We evaluate the proposed AttentionStitch model against state-of-the-art baselines on both single and multi-speaker datasets, namely LJSpeech and VCTK. We demonstrate its superior performance through an objective and a subjective evaluation test involving 15 human participants. AttentionStitch is capable of producing high-quality speech, even for words not seen during training, while operating automatically without the need for human intervention. Moreover, AttentionStitch is fast during both training and inference and is able to generate human-sounding edited speech.  ( 2 min )
    (Un)paired signal-to-signal translation with 1D conditional GANs
    arXiv:2403.04800v1 Announce Type: cross Abstract: I show that a one-dimensional (1D) conditional generative adversarial network (cGAN) with an adversarial training architecture is capable of unpaired signal-to-signal ("sig2sig") translation. Using a simplified CycleGAN model with 1D layers and wider convolutional kernels, mirroring WaveGAN to reframe two-dimensional (2D) image generation as 1D audio generation, I show that recasting the 2D image-to-image translation task to a 1D signal-to-signal translation task with deep convolutional GANs is possible without substantial modification to the conventional U-Net model and adversarial architecture developed as CycleGAN. With this I show for a small tunable dataset that noisy test signals unseen by the 1D CycleGAN model and without paired training transform from the source domain to signals similar to paired test signals in the translated domain, especially in terms of frequency, and I quantify these differences in terms of correlation and error.  ( 2 min )
    VTruST: Controllable value function based subset selection for Data-Centric Trustworthy AI
    arXiv:2403.05174v1 Announce Type: new Abstract: Trustworthy AI is crucial to the widespread adoption of AI in high-stakes applications with fairness, robustness, and accuracy being some of the key trustworthiness metrics. In this work, we propose a controllable framework for data-centric trustworthy AI (DCTAI)- VTruST, that allows users to control the trade-offs between the different trustworthiness metrics of the constructed training datasets. A key challenge in implementing an efficient DCTAI framework is to design an online value-function-based training data subset selection algorithm. We pose the training data valuation and subset selection problem as an online sparse approximation formulation. We propose a novel online version of the Orthogonal Matching Pursuit (OMP) algorithm for solving this problem. Experimental results show that VTruST outperforms the state-of-the-art baselines on social, image, and scientific datasets. We also show that the data values generated by VTruST can provide effective data-centric explanations for different trustworthiness metrics.  ( 2 min )
    A spatiotemporal style transfer algorithm for dynamic visual stimulus generation
    arXiv:2403.04940v1 Announce Type: cross Abstract: Understanding how visual information is encoded in biological and artificial systems often requires vision scientists to generate appropriate stimuli to test specific hypotheses. Although deep neural network models have revolutionized the field of image generation with methods such as image style transfer, available methods for video generation are scarce. Here, we introduce the Spatiotemporal Style Transfer (STST) algorithm, a dynamic visual stimulus generation framework that allows powerful manipulation and synthesis of video stimuli for vision research. It is based on a two-stream deep neural network model that factorizes spatial and temporal features to generate dynamic visual stimuli whose model layer activations are matched to those of input videos. As an example, we show that our algorithm enables the generation of model metamers, dynamic stimuli whose layer activations within our two-stream model are matched to those of natural videos. We show that these generated stimuli match the low-level spatiotemporal features of their natural counterparts but lack their high-level semantic features, making it a powerful paradigm to study object recognition. Late layer activations in deep vision models exhibited a lower similarity between natural and metameric stimuli compared to early layers, confirming the lack of high-level information in the generated stimuli. Finally, we use our generated stimuli to probe the representational capabilities of predictive coding deep networks. These results showcase potential applications of our algorithm as a versatile tool for dynamic stimulus generation in vision science.  ( 3 min )
    C2P-GCN: Cell-to-Patch Graph Convolutional Network for Colorectal Cancer Grading
    arXiv:2403.04962v1 Announce Type: cross Abstract: Graph-based learning approaches, due to their ability to encode tissue/organ structure information, are increasingly favored for grading colorectal cancer histology images. Recent graph-based techniques involve dividing whole slide images (WSIs) into smaller or medium-sized patches, and then building graphs on each patch for direct use in training. This method, however, fails to capture the tissue structure information present in an entire WSI and relies on training from a significantly large dataset of image patches. In this paper, we propose a novel cell-to-patch graph convolutional network (C2P-GCN), which is a two-stage graph formation-based approach. In the first stage, it forms a patch-level graph based on the cell organization on each patch of a WSI. In the second stage, it forms an image-level graph based on a similarity measure between patches of a WSI considering each patch as a node of a graph. This graph representation is then fed into a multi-layer GCN-based classification network. Our approach, through its dual-phase graph construction, effectively gathers local structural details from individual patches and establishes a meaningful connection among all patches across a WSI. As C2P-GCN integrates the structural data of an entire WSI into a single graph, it allows our model to work with significantly fewer training data compared to the latest models for colorectal cancer. Experimental validation of C2P-GCN on two distinct colorectal cancer datasets demonstrates the effectiveness of our method.  ( 3 min )
    Jet Discrimination with Quantum Complete Graph Neural Network
    arXiv:2403.04990v1 Announce Type: cross Abstract: Machine learning, particularly deep neural networks, has been widely utilized in high energy physics and has shown remarkable results in various applications. Moreover, the concept of machine learning has been extended to quantum computers, giving rise to a new research area known as quantum machine learning. In this paper, we propose a novel variational quantum circuit model, Quantum Complete Graph Neural Network (QCGNN), designed for learning complete graphs. We argue that QCGNN has a polynomial speedup against its classical counterpart, due to the property of quantum parallelism. In this paper, we study the application of QCGNN through the challenging jet discrimination, where the jets are represented with complete graphs. Subsequently, we conduct a comparative analysis with classical graph neural networks to establish a benchmark.  ( 2 min )
    Electrocardiogram Instruction Tuning for Report Generation
    arXiv:2403.04945v1 Announce Type: cross Abstract: Electrocardiogram (ECG) serves as the primary non-invasive diagnostic tool for cardiac conditions monitoring, are crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is not only time-consuming but also requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the \textit{first} attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLMs backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open source LLMs, using more than 800,000 ECG reports. MEIT's results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, and resilience to signal perturbation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.  ( 2 min )
    Can't Remember Details in Long Documents? You Need Some R&R
    arXiv:2403.05004v1 Announce Type: cross Abstract: Long-context large language models (LLMs) hold promise for tasks such as question-answering (QA) over long documents, but they tend to miss important information in the middle of context documents (arXiv:2307.03172v3). Here, we introduce $\textit{R&R}$ -- a combination of two novel prompt-based methods called $\textit{reprompting}$ and $\textit{in-context retrieval}$ (ICR) -- to alleviate this effect in document-based QA. In reprompting, we repeat the prompt instructions periodically throughout the context document to remind the LLM of its original task. In ICR, rather than instructing the LLM to answer the question directly, we instruct it to retrieve the top $k$ passage numbers most relevant to the given question, which are then used as an abbreviated context in a second QA prompt. We test R&R with GPT-4 Turbo and Claude-2.1 on documents up to 80k tokens in length and observe a 16-point boost in QA accuracy on average. Our further analysis suggests that R&R improves performance on long document-based QA because it reduces the distance between relevant context and the instructions. Finally, we show that compared to short-context chunkwise methods, R&R enables the use of larger chunks that cost fewer LLM calls and output tokens, while minimizing the drop in accuracy.  ( 2 min )
    Identifying Causal Effects Under Functional Dependencies
    arXiv:2403.04919v1 Announce Type: cross Abstract: We study the identification of causal effects, motivated by two improvements to identifiability which can be attained if one knows that some variables in a causal graph are functionally determined by their parents (without needing to know the specific functions). First, an unidentifiable causal effect may become identifiable when certain variables are functional. Second, certain functional variables can be excluded from being observed without affecting the identifiability of a causal effect, which may significantly reduce the number of needed variables in observational data. Our results are largely based on an elimination procedure which removes functional variables from a causal graph while preserving key properties in the resulting causal graph, including the identifiability of causal effects.  ( 2 min )
    Aligning GPTRec with Beyond-Accuracy Goals with Reinforcement Learning
    arXiv:2403.04875v1 Announce Type: cross Abstract: Adaptations of Transformer models, such as BERT4Rec and SASRec, achieve state-of-the-art performance in the sequential recommendation task according to accuracy-based metrics, such as NDCG. These models treat items as tokens and then utilise a score-and-rank approach (Top-K strategy), where the model first computes item scores and then ranks them according to this score. While this approach works well for accuracy-based metrics, it is hard to use it for optimising more complex beyond-accuracy metrics such as diversity. Recently, the GPTRec model, which uses a different Next-K strategy, has been proposed as an alternative to the Top-K models. In contrast with traditional Top-K recommendations, Next-K generates recommendations item-by-item and, therefore, can account for complex item-to-item interdependencies important for the beyond-accuracy measures. However, the original GPTRec paper focused only on accuracy in experiments and needed to address how to optimise the model for complex beyond-accuracy metrics. Indeed, training GPTRec for beyond-accuracy goals is challenging because the interaction training data available for training recommender systems typically needs to be aligned with beyond-accuracy recommendation goals. To solve the misalignment problem, we train GPTRec using a 2-stage approach: in the first stage, we use a teacher-student approach to train GPTRec, mimicking the behaviour of traditional Top-K models; in the second stage, we use Reinforcement Learning to align the model for beyond-accuracy goals. In particular, we experiment with increasing recommendation diversity and reducing popularity bias. Our experiments on two datasets show that in 3 out of 4 cases, GPTRec's Next-K generation approach offers a better tradeoff between accuracy and secondary metrics than classic greedy re-ranking techniques.  ( 3 min )
    JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models
    arXiv:2403.04798v1 Announce Type: cross Abstract: This paper presents our system development for SemEval-2024 Task 3: "The Competition of Multimodal Emotion Cause Analysis in Conversations". Effectively capturing emotions in human conversations requires integrating multiple modalities such as text, audio, and video. However, the complexities of these diverse modalities pose challenges for developing an efficient multimodal emotion cause analysis (ECA) system. Our proposed approach addresses these challenges by a two-step framework. We adopt two different approaches in our implementation. In Approach 1, we employ instruction-tuning with two separate Llama 2 models for emotion and cause prediction. In Approach 2, we use GPT-4V for conversation-level video description and employ in-context learning with annotated conversation using GPT 3.5. Our system wins rank 4, and system ablation experiments demonstrate that our proposed solutions achieve significant performance gains. All the experimental codes are available on Github.  ( 2 min )
    Analysis of Privacy Leakage in Federated Large Language Models
    arXiv:2403.04784v1 Announce Type: cross Abstract: With the rapid adoption of Federated Learning (FL) as the training and tuning protocol for applications utilizing Large Language Models (LLMs), recent research highlights the need for significant modifications to FL to accommodate the large-scale of LLMs. While substantial adjustments to the protocol have been introduced as a response, comprehensive privacy analysis for the adapted FL protocol is currently lacking. To address this gap, our work delves into an extensive examination of the privacy analysis of FL when used for training LLMs, both from theoretical and practical perspectives. In particular, we design two active membership inference attacks with guaranteed theoretical success rates to assess the privacy leakages of various adapted FL configurations. Our theoretical findings are translated into practical attacks, revealing substantial privacy vulnerabilities in popular LLMs, including BERT, RoBERTa, DistilBERT, and OpenAI's GPTs, across multiple real-world language datasets. Additionally, we conduct thorough experiments to evaluate the privacy leakage of these models when data is protected by state-of-the-art differential privacy (DP) mechanisms.  ( 2 min )
    The Computational Complexity of Learning Gaussian Single-Index Models
    arXiv:2403.05529v1 Announce Type: new Abstract: Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-dimensional regime. While the information-theoretic sample complexity to recover the hidden direction is linear in the dimension $d$, we show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples, where $k^\star$ is a "generative" exponent associated with the model that we explicitly characterize. Moreover, we show that this sample complexity is also sufficient, by establishing matching upper bounds using a partial-trace algorithm. Therefore, our results provide evidence of a sharp computational-to-statistical gap (under both the SQ and LDP class) whenever $k^\star>2$. To complete the study, we provide examples of smooth and Lipschitz deterministic target functions with arbitrarily large generative exponents $k^\star$.  ( 2 min )
    Foundational propositions of hesitant fuzzy soft $\beta$-covering approximation spaces
    arXiv:2403.05290v1 Announce Type: new Abstract: Soft set theory serves as a mathematical framework for handling uncertain information, and hesitant fuzzy sets find extensive application in scenarios involving uncertainty and hesitation. Hesitant fuzzy sets exhibit diverse membership degrees, giving rise to various forms of inclusion relationships among them. This article introduces the notions of hesitant fuzzy soft $\beta$-coverings and hesitant fuzzy soft $\beta$-neighborhoods, which are formulated based on distinct forms of inclusion relationships among hesitancy fuzzy sets. Subsequently, several associated properties are investigated. Additionally, specific variations of hesitant fuzzy soft $\beta$-coverings are introduced by incorporating hesitant fuzzy rough sets, followed by an exploration of properties pertaining to hesitant fuzzy soft $\beta$-covering approximation spaces.  ( 2 min )
    ECToNAS: Evolutionary Cross-Topology Neural Architecture Search
    arXiv:2403.05123v1 Announce Type: new Abstract: We present ECToNAS, a cost-efficient evolutionary cross-topology neural architecture search algorithm that does not require any pre-trained meta controllers. Our framework is able to select suitable network architectures for different tasks and hyperparameter settings, independently performing cross-topology optimisation where required. It is a hybrid approach that fuses training and topology optimisation together into one lightweight, resource-friendly process. We demonstrate the validity and power of this approach with six standard data sets (CIFAR-10, CIFAR-100, EuroSAT, Fashion MNIST, MNIST, SVHN), showcasing the algorithm's ability to not only optimise the topology within an architectural type, but also to dynamically add and remove convolutional cells when and where required, thus crossing boundaries between different network types. This enables researchers without a background in machine learning to make use of appropriate model types and topologies and to apply machine learning methods in their domains, with a computationally cheap, easy-to-use cross-topology neural architecture search framework that fully encapsulates the topology optimisation within the training process.  ( 2 min )
    Social Orientation: A New Feature for Dialogue Analysis
    arXiv:2403.04770v1 Announce Type: cross Abstract: There are many settings where it is useful to predict and explain the success or failure of a dialogue. Circumplex theory from psychology models the social orientations (e.g., Warm-Agreeable, Arrogant-Calculating) of conversation participants and can be used to predict and explain the outcome of social interactions. Our work is novel in its systematic application of social orientation tags to modeling conversation outcomes. In this paper, we introduce a new data set of dialogue utterances machine-labeled with social orientation tags. We show that social orientation tags improve task performance, especially in low-resource settings, on both English and Chinese language benchmarks. We also demonstrate how social orientation tags help explain the outcomes of social interactions when used in neural models. Based on these results showing the utility of social orientation tags for dialogue outcome prediction tasks, we release our data sets, code, and models that are fine-tuned to predict social orientation tags on dialogue utterances.  ( 2 min )
    Large Language Models in Fire Engineering: An Examination of Technical Questions Against Domain Knowledge
    arXiv:2403.04795v1 Announce Type: cross Abstract: This communication presents preliminary findings from comparing two recent chatbots, OpenAI's ChatGPT and Google's Bard, in the context of fire engineering by evaluating their responses in handling fire safety related queries. A diverse range of fire engineering questions and scenarios were created and examined, including structural fire design, fire prevention strategies, evacuation, building code compliance, and fire suppression systems (some of which resemble those commonly present in the Fire Protection exam (FPE)). The results reveal some key differences in the performance of the chatbots, with ChatGPT demonstrating a relatively superior performance. Then, this communication highlights the potential for chatbot technology to revolutionize fire engineering practices by providing instant access to critical information while outlining areas for further improvement and research. Evidently, and when it matures, this technology will likely be elemental to our engineers' practice and education.  ( 2 min )
    Benchmarking Large Language Models for Molecule Prediction Tasks
    arXiv:2403.05075v1 Announce Type: new Abstract: Large Language Models (LLMs) stand at the forefront of a number of Natural Language Processing (NLP) tasks. Despite the widespread adoption of LLMs in NLP, much of their potential in broader fields remains largely unexplored, and significant limitations persist in their design and implementation. Notably, LLMs struggle with structured data, such as graphs, and often falter when tasked with answering domain-specific questions requiring deep expertise, such as those in biology and chemistry. In this paper, we explore a fundamental question: Can LLMs effectively handle molecule prediction tasks? Rather than pursuing top-tier performance, our goal is to assess how LLMs can contribute to diverse molecule tasks. We identify several classification and regression prediction tasks across six standard molecule datasets. Subsequently, we carefully design a set of prompts to query LLMs on these tasks and compare their performance with existing Machine Learning (ML) models, which include text-based models and those specifically designed for analysing the geometric structure of molecules. Our investigation reveals several key insights: Firstly, LLMs generally lag behind ML models in achieving competitive performance on molecule tasks, particularly when compared to models adept at capturing the geometric structure of molecules, highlighting the constrained ability of LLMs to comprehend graph data. Secondly, LLMs show promise in enhancing the performance of ML models when used collaboratively. Lastly, we engage in a discourse regarding the challenges and promising avenues to harness LLMs for molecule prediction tasks. The code and models are available at https://github.com/zhiqiangzhongddu/LLMaMol.  ( 3 min )
    Selective Encryption using Segmentation Mask with Chaotic Henon Map for Multidimensional Medical Images
    arXiv:2403.04781v1 Announce Type: cross Abstract: A user-centric design and resource optimization should be at the center of any technology or innovation. The user-centric perspective gives the developer the opportunity to develop with task-based optimization. The user in the medical image field is a medical professional who analyzes the medical images and gives their diagnosis results to the patient. This scheme, having the medical professional user's perspective, innovates in the area of Medical Image storage and security. The architecture is designed with three main segments, namely: Segmentation, Storage, and Retrieval. This architecture was designed owing to the fact that the number of retrieval operations done by medical professionals was toweringly higher when compared to the storage operations done for some handful number of times for a particular medical image. This gives room for our innovation to segment out the medically indispensable part of the medical image, encrypt it, and store it. By encrypting the vital parts of the image using a strong encryption algorithm like the chaotic Henon map, we are able to keep the security intact. Now retrieving the medical image demands only the computationally less stressing decryption of the segmented region of interest. The decryption of the segmented region of interest results in the full recovery of the medical image which can be viewed on demand by the medical professionals for various diagnosis purposes. In this scheme, we were able to achieve a retrieval speed improvement of around 47% when compared to a full image encryption of brain medical CT images.  ( 3 min )
    Unity by Diversity: Improved Representation Learning in Multimodal VAEs
    arXiv:2403.05300v1 Announce Type: new Abstract: Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information from its uncompressed original features better. In extensive experiments on multiple benchmark datasets and a challenging real-world neuroscience data set, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.  ( 2 min )
    Gradient-free neural topology optimization
    arXiv:2403.04937v1 Announce Type: new Abstract: Gradient-free optimizers allow for tackling problems regardless of the smoothness or differentiability of their objective function, but they require many more iterations to converge when compared to gradient-based algorithms. This has made them unviable for topology optimization due to the high computational cost per iteration and high dimensionality of these problems. We propose a pre-trained neural reparameterization strategy that leads to at least one order of magnitude decrease in iteration count when optimizing the designs in latent space, as opposed to the conventional approach without latent reparameterization. We demonstrate this via extensive computational experiments in- and out-of-distribution with the training data. Although gradient-based topology optimization is still more efficient for differentiable problems, such as compliance optimization of structures, we believe this work will open up a new path for problems where gradient information is not readily available (e.g. fracture).  ( 2 min )
    Efficient High-Resolution Time Series Classification via Attention Kronecker Decomposition
    arXiv:2403.04882v1 Announce Type: new Abstract: The high-resolution time series classification problem is essential due to the increasing availability of detailed temporal data in various domains. To tackle this challenge effectively, it is imperative that the state-of-the-art attention model is scalable to accommodate the growing sequence lengths typically encountered in high-resolution time series data, while also demonstrating robustness in handling the inherent noise prevalent in such datasets. To address this, we propose to hierarchically encode the long time series into multiple levels based on the interaction ranges. By capturing relationships at different levels, we can build more robust, expressive, and efficient models that are capable of capturing both short-term fluctuations and long-term trends in the data. We then propose a new time series transformer backbone (KronTime) by introducing Kronecker-decomposed attention to process such multi-level time series, which sequentially calculates attention from the lower level to the upper level. Experiments on four long time series datasets demonstrate superior classification results with improved efficiency compared to baseline methods.  ( 2 min )
    A Survey of Lottery Ticket Hypothesis
    arXiv:2403.04861v1 Announce Type: new Abstract: The Lottery Ticket Hypothesis (LTH) states that a dense neural network model contains a highly sparse subnetwork (i.e., winning tickets) that can achieve even better performance than the original model when trained in isolation. While LTH has been proved both empirically and theoretically in many works, there still are some open issues, such as efficiency and scalability, to be addressed. Also, the lack of open-source frameworks and consensual experimental setting poses a challenge to future research on LTH. We, for the first time, examine previous research and studies on LTH from different perspectives. We also discuss issues in existing works and list potential directions for further exploration. This survey aims to provide an in-depth look at the state of LTH and develop a duly maintained platform to conduct experiments and compare with the most updated baselines.  ( 2 min )
    Recovery Guarantees of Unsupervised Neural Networks for Inverse Problems trained with Gradient Descent
    arXiv:2403.05395v1 Announce Type: new Abstract: Advanced machine learning methods, and more prominently neural networks, have become standard to solve inverse problems over the last years. However, the theoretical recovery guarantees of such methods are still scarce and difficult to achieve. Only recently did unsupervised methods such as Deep Image Prior (DIP) get equipped with convergence and recovery guarantees for generic loss functions when trained through gradient flow with an appropriate initialization. In this paper, we extend these results by proving that these guarantees hold true when using gradient descent with an appropriately chosen step-size/learning rate. We also show that the discretization only affects the overparametrization bound for a two-layer DIP network by a constant and thus that the different guarantees found for the gradient flow will hold for gradient descent.  ( 2 min )
    GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
    arXiv:2403.05527v1 Announce Type: new Abstract: Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to majority of entries of similar magnitudes to ultra-low precision. It then employs a low rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating three techniques, GEAR is able to fully exploit their synergistic potentials. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak-memory size up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.  ( 2 min )
    Removing GPT4's Filter
    arXiv:2403.04769v1 Announce Type: cross Abstract: GPT4 was initially trained on large amounts of data, and then fine-tuned using Reinforcement learning from Human Feedback (RLHF), which is when volunteers give feedback in order to teach GPT4 not to create inappropriate content. In this paper, we present a method to manipulate the fine-tuned version into reverting to pre-RLHF behavior, effectively removing all safety mechanisms that the model learned during RLHF. In particular, when GPT4 acts without RLHF, it loses all inhibition, and can complete very inappropriate content given only the first few words.  ( 2 min )
    LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset
    arXiv:2403.04791v1 Announce Type: cross Abstract: To undertake computational research of the law, efficiently identifying datasets of court decisions that relate to a specific legal issue is a crucial yet challenging endeavour. This study addresses the gap in the literature working with large legal corpora about how to isolate cases, in our case summary judgments, from a large corpus of UK court decisions. We introduce a comparative analysis of two computational methods: (1) a traditional natural language processing-based approach leveraging expert-generated keywords and logical operators and (2) an innovative application of the Claude 2 large language model to classify cases based on content-specific prompts. We use the Cambridge Law Corpus of 356,011 UK court decisions and determine that the large language model achieves a weighted F1 score of 0.94 versus 0.78 for keywords. Despite iterative refinement, the search logic based on keywords fails to capture nuances in legal language. We identify and extract 3,102 summary judgment cases, enabling us to map their distribution across various UK courts over a temporal span. The paper marks a pioneering step in employing advanced natural language processing to tackle core legal research tasks, demonstrating how these technologies can bridge systemic gaps and enhance the accessibility of legal information. We share the extracted dataset metrics to support further research on summary judgments.  ( 3 min )
    Solving Inverse Problems with Model Mismatch using Untrained Neural Networks within Model-based Architectures
    arXiv:2403.04847v1 Announce Type: new Abstract: Model-based deep learning methods such as \emph{loop unrolling} (LU) and \emph{deep equilibrium model} (DEQ) extensions offer outstanding performance in solving inverse problems (IP). These methods unroll the optimization iterations into a sequence of neural networks that in effect learn a regularization function from data. While these architectures are currently state-of-the-art in numerous applications, their success heavily relies on the accuracy of the forward model. This assumption can be limiting in many physical applications due to model simplifications or uncertainties in the apparatus. To address forward model mismatch, we introduce an untrained forward model residual block within the model-based architecture to match the data consistency in the measurement domain for each instance. We propose two variants in well-known model-based architectures (LU and DEQ) and prove convergence under mild conditions. The experiments show significant quality improvement in removing artifacts and preserving details across three distinct applications, encompassing both linear and nonlinear inverse problems. Moreover, we highlight reconstruction effectiveness in intermediate steps and showcase robustness to random initialization of the residual block and a higher number of iterations during evaluation.  ( 2 min )
    Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks
    arXiv:2403.05293v1 Announce Type: new Abstract: In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.  ( 2 min )
    Continual Learning and Catastrophic Forgetting
    arXiv:2403.05175v1 Announce Type: new Abstract: This book chapter delves into the dynamics of continual learning, which is the process of incrementally learning from a non-stationary stream of data. Although continual learning is a natural skill for the human brain, it is very challenging for artificial neural networks. An important reason is that, when learning something new, these networks tend to quickly and drastically forget what they had learned before, a phenomenon known as catastrophic forgetting. Especially in the last decade, continual learning has become an extensively studied topic in deep learning. This book chapter reviews the insights that this field has generated.  ( 2 min )
    Synthetic data generation for system identification: leveraging knowledge transfer from similar systems
    arXiv:2403.05164v1 Announce Type: new Abstract: This paper addresses the challenge of overfitting in the learning of dynamical systems by introducing a novel approach for the generation of synthetic data, aimed at enhancing model generalization and robustness in scenarios characterized by data scarcity. Central to the proposed methodology is the concept of knowledge transfer from systems within the same class. Specifically, synthetic data is generated through a pre-trained meta-model that describes a broad class of systems to which the system of interest is assumed to belong. Training data serves a dual purpose: firstly, as input to the pre-trained meta model to discern the system's dynamics, enabling the prediction of its behavior and thereby generating synthetic output sequences for new input sequences; secondly, in conjunction with synthetic data, to define the loss function used for model estimation. A validation dataset is used to tune a scalar hyper-parameter balancing the relative importance of training and synthetic data in the definition of the loss function. The same validation set can be also used for other purposes, such as early stopping during the training, fundamental to avoid overfitting in case of small-size training datasets. The efficacy of the approach is shown through a numerical example that highlights the advantages of integrating synthetic data into the system identification process.  ( 2 min )
    Fairness-Aware Interpretable Modeling (FAIM) for Trustworthy Machine Learning in Healthcare
    arXiv:2403.05235v1 Announce Type: new Abstract: The escalating integration of machine learning in high-stakes fields such as healthcare raises substantial concerns about model fairness. We propose an interpretable framework - Fairness-Aware Interpretable Modeling (FAIM), to improve model fairness without compromising performance, featuring an interactive interface to identify a "fairer" model from a set of high-performing models and promoting the integration of data-driven evidence and clinical expertise to enhance contextualized fairness. We demonstrated FAIM's value in reducing sex and race biases by predicting hospital admission with two real-world databases, MIMIC-IV-ED and SGH-ED. We show that for both datasets, FAIM models not only exhibited satisfactory discriminatory performance but also significantly mitigated biases as measured by well-established fairness metrics, outperforming commonly used bias-mitigation methods. Our approach demonstrates the feasibility of improving fairness without sacrificing performance and provides an a modeling mode that invites domain experts to engage, fostering a multidisciplinary effort toward tailored AI fairness.  ( 2 min )
    Overcoming Data Inequality across Domains with Semi-Supervised Domain Generalization
    arXiv:2403.05209v1 Announce Type: new Abstract: While there have been considerable advancements in machine learning driven by extensive datasets, a significant disparity still persists in the availability of data across various sources and populations. This inequality across domains poses challenges in modeling for those with limited data, which can lead to profound practical and ethical concerns. In this paper, we address a representative case of data inequality problem across domains termed Semi-Supervised Domain Generalization (SSDG), in which only one domain is labeled while the rest are unlabeled. We propose a novel algorithm, ProUD, which can effectively learn domain-invariant features via domain-aware prototypes along with progressive generalization via uncertainty-adaptive mixing of labeled and unlabeled domains. Our experiments on three different benchmark datasets demonstrate the effectiveness of ProUD, outperforming all baseline models including single domain generalization and semi-supervised learning. Source code will be released upon acceptance of the paper.  ( 2 min )
    Adversarial Sparse Teacher: Defense Against Distillation-Based Model Stealing Attacks Using Adversarial Examples
    arXiv:2403.05181v1 Announce Type: new Abstract: Knowledge Distillation (KD) facilitates the transfer of discriminative capabilities from an advanced teacher model to a simpler student model, ensuring performance enhancement without compromising accuracy. It is also exploited for model stealing attacks, where adversaries use KD to mimic the functionality of a teacher model. Recent developments in this domain have been influenced by the Stingy Teacher model, which provided empirical analysis showing that sparse outputs can significantly degrade the performance of student models. Addressing the risk of intellectual property leakage, our work introduces an approach to train a teacher model that inherently protects its logits, influenced by the Nasty Teacher concept. Differing from existing methods, we incorporate sparse outputs of adversarial examples with standard training data to strengthen the teacher's defense against student distillation. Our approach carefully reduces the relative entropy between the original and adversarially perturbed outputs, allowing the model to produce adversarial logits with minimal impact on overall performance. The source codes will be made publicly available soon.  ( 2 min )
    Reset & Distill: A Recipe for Overcoming Negative Transfer in Continual Reinforcement Learning
    arXiv:2403.05066v1 Announce Type: new Abstract: We argue that one of the main obstacles for developing effective Continual Reinforcement Learning (CRL) algorithms is the negative transfer issue occurring when the new task to learn arrives. Through comprehensive experimental validation, we demonstrate that such issue frequently exists in CRL and cannot be effectively addressed by several recent work on mitigating plasticity loss of RL agents. To that end, we develop Reset & Distill (R&D), a simple yet highly effective method, to overcome the negative transfer problem in CRL. R&D combines a strategy of resetting the agent's online actor and critic networks to learn a new task and an offline learning step for distilling the knowledge from the online actor and previous expert's action probabilities. We carried out extensive experiments on long sequence of Meta-World tasks and show that our method consistently outperforms recent baselines, achieving significantly higher success rates across a range of tasks. Our findings highlight the importance of considering negative transfer in CRL and emphasize the need for robust strategies like R&D to mitigate its detrimental effects.  ( 2 min )
    Poly-View Contrastive Learning
    arXiv:2403.05490v1 Announce Type: new Abstract: Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.  ( 2 min )
    Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting
    arXiv:2403.05406v1 Announce Type: new Abstract: The forecasting of Multivariate Time Series (MTS) has long been an important but challenging task. Due to the non-stationary problem across long-distance time steps, previous studies primarily adopt stationarization method to attenuate the non-stationary problem of the original series for better predictability. However, existing methods always adopt the stationarized series, which ignores the inherent non-stationarity, and has difficulty in modeling MTS with complex distributions due to the lack of stochasticity. To tackle these problems, we first develop a powerful hierarchical probabilistic generative module to consider the non-stationarity and stochastic characteristics within MTS, and then combine it with transformer for a well-defined variational generative dynamic model named Hierarchical Time series Variational Transformer (HTV-Trans), which recovers the intrinsic non-stationary information into temporal dependencies. Being a powerful probabilistic model, HTV-Trans is utilized to learn expressive representations of MTS and applied to forecasting tasks. Extensive experiments on diverse datasets show the efficiency of HTV-Trans on MTS forecasting tasks  ( 2 min )
    An Improved Algorithm for Learning Drifting Discrete Distributions
    arXiv:2403.05446v1 Announce Type: new Abstract: We present a new adaptive algorithm for learning discrete distributions under distribution drift. In this setting, we observe a sequence of independent samples from a discrete distribution that is changing over time, and the goal is to estimate the current distribution. Since we have access to only a single sample for each time step, a good estimation requires a careful choice of the number of past samples to use. To use more samples, we must resort to samples further in the past, and we incur a drift error due to the bias introduced by the change in distribution. On the other hand, if we use a small number of past samples, we incur a large statistical error as the estimation has a high variance. We present a novel adaptive algorithm that can solve this trade-off without any prior knowledge of the drift. Unlike previous adaptive results, our algorithm characterizes the statistical error using data-dependent bounds. This technicality enables us to overcome the limitations of the previous work that require a fixed finite support whose size is known in advance and that cannot change over time. Additionally, we can obtain tighter bounds depending on the complexity of the drifting distribution, and also consider distributions with infinite support.  ( 2 min )
    Stacking as Accelerated Gradient Descent
    arXiv:2403.04978v1 Announce Type: new Abstract: Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of the Nesterov's accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.  ( 2 min )
    TrafPS: A Shapley-based Visual Analytics Approach to Interpret Traffic
    arXiv:2403.04812v1 Announce Type: new Abstract: Recent achievements in deep learning (DL) have shown its potential for predicting traffic flows. Such predictions are beneficial for understanding the situation and making decisions in traffic control. However, most state-of-the-art DL models are considered "black boxes" with little to no transparency for end users with respect to the underlying mechanisms. Some previous work tried to "open the black boxes" and increase the interpretability of how predictions are generated. However, it still remains challenging to handle complex models on large-scale spatio-temporal data and discover salient spatial and temporal patterns that significantly influence traffic flows. To overcome the challenges, we present TrafPS, a visual analytics approach for interpreting traffic prediction outcomes to support decision-making in traffic management and urban planning. The measurements, region SHAP and trajectory SHAP, are proposed to quantify the impact of flow patterns on urban traffic at different levels. Based on the task requirement from the domain experts, we employ an interactive visual interface for multi-aspect exploration and analysis of significant flow patterns. Two real-world case studies demonstrate the effectiveness of TrafPS in identifying key routes and decision-making support for urban planning.  ( 2 min )
    Provable Multi-Party Reinforcement Learning with Diverse Human Feedback
    arXiv:2403.05006v1 Announce Type: new Abstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work \textit{initiates} the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.  ( 2 min )
    An Efficient Difference-of-Convex Solver for Privacy Funnel
    arXiv:2403.04778v1 Announce Type: new Abstract: We propose an efficient solver for the privacy funnel (PF) method, leveraging its difference-of-convex (DC) structure. The proposed DC separation results in a closed-form update equation, which allows straightforward application to both known and unknown distribution settings. For known distribution case, we prove the convergence (local stationary points) of the proposed non-greedy solver, and empirically show that it outperforms the state-of-the-art approaches in characterizing the privacy-utility trade-off. The insights of our DC approach apply to unknown distribution settings where labeled empirical samples are available instead. Leveraging the insights, our alternating minimization solver satisfies the fundamental Markov relation of PF in contrast to previous variational inference-based solvers. Empirically, we evaluate the proposed solver with MNIST and Fashion-MNIST datasets. Our results show that under a comparable reconstruction quality, an adversary suffers from higher prediction error from clustering our compressed codes than that with the compared methods. Most importantly, our solver is independent to private information in inference phase contrary to the baselines.  ( 2 min )
    Not all tickets are equal and we know it: Guiding pruning with domain-specific knowledge
    arXiv:2403.04805v1 Announce Type: new Abstract: Neural structure learning is of paramount importance for scientific discovery and interpretability. Yet, contemporary pruning algorithms that focus on computational resource efficiency face algorithmic barriers to select a meaningful model that aligns with domain expertise. To mitigate this challenge, we propose DASH, which guides pruning by available domain-specific structural information. In the context of learning dynamic gene regulatory network models, we show that DASH combined with existing general knowledge on interaction partners provides data-specific insights aligned with biology. For this task, we show on synthetic data with ground truth information and two real world applications the effectiveness of DASH, which outperforms competing methods by a large margin and provides more meaningful biological insights. Our work shows that domain specific structural information bears the potential to improve model-derived scientific insights.  ( 2 min )
    On the Markov Property of Neural Algorithmic Reasoning: Analyses and Methods
    arXiv:2403.04929v1 Announce Type: new Abstract: Neural algorithmic reasoning is an emerging research direction that endows neural networks with the ability to mimic algorithmic executions step-by-step. A common paradigm in existing designs involves the use of historical embeddings in predicting the results of future execution steps. Our observation in this work is that such historical dependence intrinsically contradicts the Markov nature of algorithmic reasoning tasks. Based on this motivation, we present our ForgetNet, which does not use historical embeddings and thus is consistent with the Markov nature of the tasks. To address challenges in training ForgetNet at early stages, we further introduce G-ForgetNet, which uses a gating mechanism to allow for the selective integration of historical embeddings. Such an enhanced capability provides valuable computational pathways during the model's early training phase. Our extensive experiments, based on the CLRS-30 algorithmic reasoning benchmark, demonstrate that both ForgetNet and G-ForgetNet achieve better generalization capability than existing methods. Furthermore, we investigate the behavior of the gating mechanism, highlighting its degree of alignment with our intuitions and its effectiveness for robust performance.  ( 2 min )
    Unsupervised Graph Neural Architecture Search with Disentangled Self-supervision
    arXiv:2403.05064v1 Announce Type: new Abstract: The existing graph neural architecture search (GNAS) methods heavily rely on supervised labels during the search process, failing to handle ubiquitous scenarios where supervisions are not available. In this paper, we study the problem of unsupervised graph neural architecture search, which remains unexplored in the literature. The key problem is to discover the latent graph factors that drive the formation of graph data as well as the underlying relations between the factors and the optimal neural architectures. Handling this problem is challenging given that the latent graph factors together with architectures are highly entangled due to the nature of the graph and the complexity of the neural architecture search process. To address the challenge, we propose a novel Disentangled Self-supervised Graph Neural Architecture Search (DSGAS) model, which is able to discover the optimal architectures capturing various latent graph factors in a self-supervised fashion based on unlabeled graph data. Specifically, we first design a disentangled graph super-network capable of incorporating multiple architectures with factor-wise disentanglement, which are optimized simultaneously. Then, we estimate the performance of architectures under different factors by our proposed self-supervised training with joint architecture-graph disentanglement. Finally, we propose a contrastive search with architecture augmentations to discover architectures with factor-specific expertise. Extensive experiments on 11 real-world datasets demonstrate that the proposed model is able to achieve state-of-the-art performance against several baseline methods in an unsupervised manner.  ( 2 min )
    Quantifying Manifolds: Do the manifolds learned by Generative Adversarial Networks converge to the real data manifold
    arXiv:2403.05033v1 Announce Type: new Abstract: This paper presents our experiments to quantify the manifolds learned by ML models (in our experiment, we use a GAN model) as they train. We compare the manifolds learned at each epoch to the real manifolds representing the real data. To quantify a manifold, we study the intrinsic dimensions and topological features of the manifold learned by the ML model, how these metrics change as we continue to train the model, and whether these metrics convergence over the course of training to the metrics of the real data manifold.  ( 2 min )
    Spectral Invariant Learning for Dynamic Graphs under Distribution Shifts
    arXiv:2403.05026v1 Announce Type: new Abstract: Dynamic graph neural networks (DyGNNs) currently struggle with handling distribution shifts that are inherent in dynamic graphs. Existing work on DyGNNs with out-of-distribution settings only focuses on the time domain, failing to handle cases involving distribution shifts in the spectral domain. In this paper, we discover that there exist cases with distribution shifts unobservable in the time domain while observable in the spectral domain, and propose to study distribution shifts on dynamic graphs in the spectral domain for the first time. However, this investigation poses two key challenges: i) it is non-trivial to capture different graph patterns that are driven by various frequency components entangled in the spectral domain; and ii) it remains unclear how to handle distribution shifts with the discovered spectral patterns. To address these challenges, we propose Spectral Invariant Learning for Dynamic Graphs under Distribution Shifts (SILD), which can handle distribution shifts on dynamic graphs by capturing and utilizing invariant and variant spectral patterns. Specifically, we first design a DyGNN with Fourier transform to obtain the ego-graph trajectory spectrums, allowing the mixed dynamic graph patterns to be transformed into separate frequency components. We then develop a disentangled spectrum mask to filter graph dynamics from various frequency components and discover the invariant and variant spectral patterns. Finally, we propose invariant spectral filtering, which encourages the model to rely on invariant patterns for generalization under distribution shifts. Experimental results on synthetic and real-world dynamic graph datasets demonstrate the superiority of our method for both node classification and link prediction tasks under distribution shifts.  ( 3 min )
    A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time Series
    arXiv:2403.04793v1 Announce Type: new Abstract: Causal inference is a fundamental research topic for discovering the cause-effect relationships in many disciplines. However, not all algorithms are equally well-suited for a given dataset. For instance, some approaches may only be able to identify linear relationships, while others are applicable for non-linearities. Algorithms further vary in their sensitivity to noise and their ability to infer causal information from coupled vs. non-coupled time series. Therefore, different algorithms often generate different causal relationships for the same input. To achieve a more robust causal inference result, this publication proposes a novel data-driven two-phase multi-split causal ensemble model to combine the strengths of different causality base algorithms. In comparison to existing approaches, the proposed ensemble method reduces the influence of noise through a data partitioning scheme in the first phase. To achieve this, the data are initially divided into several partitions and the base algorithms are applied to each partition. Subsequently, Gaussian mixture models are used to identify the causal relationships derived from the different partitions that are likely to be valid. In the second phase, the identified relationships from each base algorithm are then merged based on three combination rules. The proposed ensemble approach is evaluated using multiple metrics, among them a newly developed evaluation index for causal ensemble approaches. We perform experiments using three synthetic datasets with different volumes and complexity, which are specifically designed to test causality detection methods under different circumstances while knowing the ground truth causal relationships. In these experiments, our causality ensemble outperforms each of its base algorithms. In practical applications, the use of the proposed method could hence lead to more robust and reliable causality results.  ( 3 min )
    Storm Surge Modeling in the AI ERA: Using LSTM-based Machine Learning for Enhancing Forecasting Accuracy
    arXiv:2403.04818v1 Announce Type: new Abstract: Physics simulation results of natural processes usually do not fully capture the real world. This is caused for instance by limits in what physical processes are simulated and to what accuracy. In this work we propose and analyze the use of an LSTM-based deep learning network machine learning (ML) architecture for capturing and predicting the behavior of the systemic error for storm surge forecast models with respect to real-world water height observations from gauge stations during hurricane events. The overall goal of this work is to predict the systemic error of the physics model and use it to improve the accuracy of the simulation results post factum. We trained our proposed ML model on a dataset of 61 historical storms in the coastal regions of the U.S. and we tested its performance in bias correcting modeled water level data predictions from hurricane Ian (2022). We show that our model can consistently improve the forecasting accuracy for hurricane Ian -- unknown to the ML model -- at all gauge station coordinates used for the initial data. Moreover, by examining the impact of using different subsets of the initial training dataset, containing a number of relatively similar or different hurricanes in terms of hurricane track, we found that we can obtain similar quality of bias correction by only using a subset of six hurricanes. This is an important result that implies the possibility to apply a pre-trained ML model to real-time hurricane forecasting results with the goal of bias correcting and improving the produced simulation accuracy. The presented work is an important first step in creating a bias correction system for real-time storm surge forecasting applicable to the full simulation area. It also presents a highly transferable and operationally applicable methodology for improving the accuracy in a wide range of physics simulation scenarios beyond storm surge forecasting.  ( 3 min )
    Restricted Bayesian Neural Network
    arXiv:2403.04810v1 Announce Type: new Abstract: Modern deep learning tools are remarkably effective in addressing intricate problems. However, their operation as black-box models introduces increased uncertainty in predictions. Additionally, they contend with various challenges, including the need for substantial storage space in large networks, issues of overfitting, underfitting, vanishing gradients, and more. This study explores the concept of Bayesian Neural Networks, presenting a novel architecture designed to significantly alleviate the storage space complexity of a network. Furthermore, we introduce an algorithm adept at efficiently handling uncertainties, ensuring robust convergence values without becoming trapped in local optima, particularly when the objective function lacks perfect convexity.  ( 2 min )
  • Open

    Improved Convergence Rates of Windowed Anderson Acceleration for Symmetric Fixed-Point Iterations
    arXiv:2311.02490v2 Announce Type: replace-cross Abstract: This paper studies the commonly utilized windowed Anderson acceleration (AA) algorithm for fixed-point methods, $x^{(k+1)}=q(x^{(k)})$. It provides the first proof that when the operator $q$ is linear and symmetric the windowed AA, which uses a sliding window of prior iterates, improves the root-linear convergence factor over the fixed-point iterations. When $q$ is nonlinear, yet has a symmetric Jacobian at a fixed point, a slightly modified AA algorithm is proved to have an analogous root-linear convergence factor improvement over fixed-point iterations. Simulations verify our observations. Furthermore, experiments with different data models demonstrate AA is significantly superior to the standard fixed-point methods for Tyler's M-estimation.  ( 2 min )
    Group selection and shrinkage: Structured sparsity for semiparametric additive models
    arXiv:2105.12081v3 Announce Type: replace-cross Abstract: Sparse regression and classification estimators that respect group structures have application to an assortment of statistical and machine learning problems, from multitask learning to sparse additive modeling to hierarchical selection. This work introduces structured sparse estimators that combine group subset selection with shrinkage. To accommodate sophisticated structures, our estimators allow for arbitrary overlap between groups. We develop an optimization framework for fitting the nonconvex regularization surface and present finite-sample error bounds for estimation of the regression function. As an application requiring structure, we study sparse semiparametric additive modeling, a procedure that allows the effect of each predictor to be zero, linear, or nonlinear. For this task, the new estimators improve across several metrics on synthetic data compared to alternatives. Finally, we demonstrate their efficacy in modeling supermarket foot traffic and economic recessions using many predictors. These demonstrations suggest sparse semiparametric additive models, fit using the new estimators, are an excellent compromise between fully linear and fully nonparametric alternatives. All of our algorithms are made available in the scalable implementation grpsel.  ( 2 min )
    Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors
    arXiv:2309.06782v5 Announce Type: replace-cross Abstract: Efficient and accurate algorithms are necessary to reconstruct particles in the highly granular detectors anticipated at the High-Luminosity Large Hadron Collider and the Future Circular Collider. We study scalable machine learning models for event reconstruction in electron-positron collisions based on a full detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters. We compare a graph neural network and kernel-based transformer and demonstrate that we can avoid quadratic operations while achieving realistic reconstruction. We show that hyperparameter tuning significantly improves the performance of the models. The best graph neural network model shows improvement in the jet transverse momentum resolution by up to 50% compared to the rule-based algorithm. The resulting model is portable across Nvidia, AMD and Habana hardware. Accurate and fast machine-learning based reconstruction can significantly improve future measurements at colliders.  ( 3 min )
    Joint control variate for faster black-box variational inference
    arXiv:2210.07290v4 Announce Type: replace-cross Abstract: Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.  ( 2 min )
    Intriguing Properties of Input-dependent Randomized Smoothing
    arXiv:2110.05365v3 Announce Type: replace-cross Abstract: Randomized smoothing is currently considered the state-of-the-art method to obtain certifiably robust classifiers. Despite its remarkable performance, the method is associated with various serious problems such as "certified accuracy waterfalls", certification vs.\ accuracy trade-off, or even fairness issues. Input-dependent smoothing approaches have been proposed with intention of overcoming these flaws. However, we demonstrate that these methods lack formal guarantees and so the resulting certificates are not justified. We show that in general, the input-dependent smoothing suffers from the curse of dimensionality, forcing the variance function to have low semi-elasticity. On the other hand, we provide a theoretical and practical framework that enables the usage of input-dependent smoothing even in the presence of the curse of dimensionality, under strict restrictions. We present one concrete design of the smoothing variance function and test it on CIFAR10 and MNIST. Our design mitigates some of the problems of classical smoothing and is formally underlined, yet further improvement of the design is still necessary.  ( 2 min )
    Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift
    arXiv:2306.16406v3 Announce Type: replace-cross Abstract: Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such \emph{dataset shift} conditions are known as \emph{domain adaptation} or \emph{transfer learning}. Despite extensive literature on dataset shift, limited works address how to efficiently use the auxiliary populations to improve the accuracy of risk evaluation for a given machine learning task in the target population. In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains due to leveraging plausible dataset shift conditions.  ( 3 min )
    Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks
    arXiv:2403.05293v1 Announce Type: cross Abstract: In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.  ( 2 min )
    Variational Inference of Parameters in Opinion Dynamics Models
    arXiv:2403.05358v1 Announce Type: cross Abstract: Despite the frequent use of agent-based models (ABMs) for studying social phenomena, parameter estimation remains a challenge, often relying on costly simulation-based heuristics. This work uses variational inference to estimate the parameters of an opinion dynamics ABM, by transforming the estimation problem into an optimization task that can be solved directly. Our proposal relies on probabilistic generative ABMs (PGABMs): we start by synthesizing a probabilistic generative model from the ABM rules. Then, we transform the inference process into an optimization problem suitable for automatic differentiation. In particular, we use the Gumbel-Softmax reparameterization for categorical agent attributes and stochastic variational inference for parameter estimation. Furthermore, we explore the trade-offs of using variational distributions with different complexity: normal distributions and normalizing flows. We validate our method on a bounded confidence model with agent roles (leaders and followers). Our approach estimates both macroscopic (bounded confidence intervals and backfire thresholds) and microscopic ($200$ categorical, agent-level roles) more accurately than simulation-based and MCMC methods. Consequently, our technique enables experts to tune and validate their ABMs against real-world observations, thus providing insights into human behavior in social systems via data-driven analysis.  ( 2 min )
    A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs
    arXiv:2312.16083v2 Announce Type: replace-cross Abstract: Continuously-observed event occurrences, often exhibit self- and mutually-exciting effects, which can be well modeled using temporal point processes. Beyond that, these event dynamics may also change over time, with certain periodic trends. We propose a novel variational auto-encoder to capture such a mixture of temporal dynamics. More specifically, the whole time interval of the input sequence is partitioned into a set of sub-intervals. The event dynamics are assumed to be stationary within each sub-interval, but could be changing across those sub-intervals. In particular, we use a sequential latent variable model to learn a dependency graph between the observed dimensions, for each sub-interval. The model predicts the future event times, by using the learned dependency graph to remove the noncontributing influences of past events. By doing so, the proposed model demonstrates its higher accuracy in predicting inter-event times and event types for several real-world event sequences, compared with existing state of the art neural point processes.  ( 2 min )
    Spectrally-Corrected and Regularized Linear Discriminant Analysis for Spiked Covariance Model
    arXiv:2210.03859v3 Announce Type: replace Abstract: This paper proposes an improved linear discriminant analysis called spectrally-corrected and regularized LDA (SRLDA). This method integrates the design ideas of the sample spectrally-corrected covariance matrix and the regularized discriminant analysis. With the support of a large-dimensional random matrix analysis framework, it is proved that SRLDA has a linear classification global optimal solution under the spiked model assumption. According to simulation data analysis, the SRLDA classifier performs better than RLDA and ILDA and is closer to the theoretical classifier. Experiments on different data sets show that the SRLDA algorithm performs better in classification and dimensionality reduction than currently used tools.  ( 2 min )
    A unified framework for hard and soft clustering with regularized optimal transport
    arXiv:1711.04366v2 Announce Type: replace-cross Abstract: In this paper, we formulate the problem of inferring a Finite Mixture Model from discrete data as an optimal transport problem with entropic regularization of parameter $\lambda\geq 0$. Our method unifies hard and soft clustering, the Expectation-Maximization (EM) algorithm being exactly recovered for $\lambda=1$. The family of clustering algorithm we propose rely on the resolution of nonconvex problems using alternating minimization. We study the convergence property of our generalized $\lambda-$EM algorithms and show that each step in the minimization process has a closed form solution when inferring finite mixture models of exponential families. Experiments highlight the benefits of taking a parameter $\lambda>1$ to improve the inference performance and $\lambda\to 0$ for classification.  ( 2 min )
    Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection
    arXiv:2401.12924v2 Announce Type: replace Abstract: This article delves into the analysis of performance and utilization of Support Vector Machines (SVMs) for the critical task of forest fire detection using image datasets. With the increasing threat of forest fires to ecosystems and human settlements, the need for rapid and accurate detection systems is of utmost importance. SVMs, renowned for their strong classification capabilities, exhibit proficiency in recognizing patterns associated with fire within images. By training on labeled data, SVMs acquire the ability to identify distinctive attributes associated with fire, such as flames, smoke, or alterations in the visual characteristics of the forest area. The document thoroughly examines the use of SVMs, covering crucial elements like data preprocessing, feature extraction, and model training. It rigorously evaluates parameters such as accuracy, efficiency, and practical applicability. The knowledge gained from this study aids in the development of efficient forest fire detection systems, enabling prompt responses and improving disaster management. Moreover, the correlation between SVM accuracy and the difficulties presented by high-dimensional datasets is carefully investigated, demonstrated through a revealing case study. The relationship between accuracy scores and the different resolutions used for resizing the training datasets has also been discussed in this article. These comprehensive studies result in a definitive overview of the difficulties faced and the potential sectors requiring further improvement and focus.  ( 3 min )
    An Improved Algorithm for Learning Drifting Discrete Distributions
    arXiv:2403.05446v1 Announce Type: cross Abstract: We present a new adaptive algorithm for learning discrete distributions under distribution drift. In this setting, we observe a sequence of independent samples from a discrete distribution that is changing over time, and the goal is to estimate the current distribution. Since we have access to only a single sample for each time step, a good estimation requires a careful choice of the number of past samples to use. To use more samples, we must resort to samples further in the past, and we incur a drift error due to the bias introduced by the change in distribution. On the other hand, if we use a small number of past samples, we incur a large statistical error as the estimation has a high variance. We present a novel adaptive algorithm that can solve this trade-off without any prior knowledge of the drift. Unlike previous adaptive results, our algorithm characterizes the statistical error using data-dependent bounds. This technicality enables us to overcome the limitations of the previous work that require a fixed finite support whose size is known in advance and that cannot change over time. Additionally, we can obtain tighter bounds depending on the complexity of the drifting distribution, and also consider distributions with infinite support.  ( 2 min )
    Poly-View Contrastive Learning
    arXiv:2403.05490v1 Announce Type: cross Abstract: Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.  ( 2 min )
    Testing Stationarity and Change Point Detection in Reinforcement Learning
    arXiv:2203.01707v3 Announce Type: replace Abstract: We consider offline reinforcement learning (RL) methods in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the system transition and the reward function to be constant over time. However, the stationarity assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal Q-function based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL.  ( 2 min )
    The Computational Complexity of Learning Gaussian Single-Index Models
    arXiv:2403.05529v1 Announce Type: cross Abstract: Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-dimensional regime. While the information-theoretic sample complexity to recover the hidden direction is linear in the dimension $d$, we show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples, where $k^\star$ is a "generative" exponent associated with the model that we explicitly characterize. Moreover, we show that this sample complexity is also sufficient, by establishing matching upper bounds using a partial-trace algorithm. Therefore, our results provide evidence of a sharp computational-to-statistical gap (under both the SQ and LDP class) whenever $k^\star>2$. To complete the study, we provide examples of smooth and Lipschitz deterministic target functions with arbitrarily large generative exponents $k^\star$.  ( 2 min )
    Provable Multi-Party Reinforcement Learning with Diverse Human Feedback
    arXiv:2403.05006v1 Announce Type: cross Abstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work \textit{initiates} the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.  ( 2 min )
    Continual Learning and Catastrophic Forgetting
    arXiv:2403.05175v1 Announce Type: cross Abstract: This book chapter delves into the dynamics of continual learning, which is the process of incrementally learning from a non-stationary stream of data. Although continual learning is a natural skill for the human brain, it is very challenging for artificial neural networks. An important reason is that, when learning something new, these networks tend to quickly and drastically forget what they had learned before, a phenomenon known as catastrophic forgetting. Especially in the last decade, continual learning has become an extensively studied topic in deep learning. This book chapter reviews the insights that this field has generated.  ( 2 min )
    An Efficient Quasi-Random Sampling for Copulas
    arXiv:2403.05281v1 Announce Type: new Abstract: This paper examines an efficient method for quasi-random sampling of copulas in Monte Carlo computations. Traditional methods, like conditional distribution methods (CDM), have limitations when dealing with high-dimensional or implicit copulas, which refer to those that cannot be accurately represented by existing parametric copulas. Instead, this paper proposes the use of generative models, such as Generative Adversarial Networks (GANs), to generate quasi-random samples for any copula. GANs are a type of implicit generative models used to learn the distribution of complex data, thus facilitating easy sampling. In our study, GANs are employed to learn the mapping from a uniform distribution to copulas. Once this mapping is learned, obtaining quasi-random samples from the copula only requires inputting quasi-random samples from the uniform distribution. This approach offers a more flexible method for any copula. Additionally, we provide theoretical analysis of quasi-Monte Carlo estimators based on quasi-random samples of copulas. Through simulated and practical applications, particularly in the field of risk management, we validate the proposed method and demonstrate its superiority over various existing methods.  ( 2 min )
    Deep Backward and Galerkin Methods for the Finite State Master Equation
    arXiv:2403.04975v1 Announce Type: cross Abstract: This paper proposes and analyzes two neural network methods to solve the master equation for finite-state mean field games (MFGs). Solving MFGs provides approximate Nash equilibria for stochastic, differential games with finite but large populations of agents. The master equation is a partial differential equation (PDE) whose solution characterizes MFG equilibria for any possible initial distribution. The first method we propose relies on backward induction in a time component while the second method directly tackles the PDE without discretizing time. For both approaches, we prove two types of results: there exist neural networks that make the algorithms' loss functions arbitrarily small, and conversely, if the losses are small, then the neural networks are good approximations of the master equation's solution. We conclude the paper with numerical experiments on benchmark problems from the literature up to dimension 15, and a comparison with solutions computed by a classical method for fixed initial distributions.  ( 2 min )
    Stacking as Accelerated Gradient Descent
    arXiv:2403.04978v1 Announce Type: cross Abstract: Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of the Nesterov's accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.  ( 2 min )
    Group Privacy Amplification and Unified Amplification by Subsampling for R\'enyi Differential Privacy
    arXiv:2403.04867v1 Announce Type: cross Abstract: Differential privacy (DP) has various desirable properties, such as robustness to post-processing, group privacy, and amplification by subsampling, which can be derived independently of each other. Our goal is to determine whether stronger privacy guarantees can be obtained by considering multiple of these properties jointly. To this end, we focus on the combination of group privacy and amplification by subsampling. To provide guarantees that are amenable to machine learning algorithms, we conduct our analysis in the framework of R\'enyi-DP, which has more favorable composition properties than $(\epsilon,\delta)$-DP. As part of this analysis, we develop a unified framework for deriving amplification by subsampling guarantees for R\'enyi-DP, which represents the first such framework for a privacy accounting method and is of independent interest. We find that it not only lets us improve upon and generalize existing amplification results for R\'enyi-DP, but also derive provably tight group privacy amplification guarantees stronger than existing principles. These results establish the joint study of different DP properties as a promising research direction.  ( 2 min )
    Not all tickets are equal and we know it: Guiding pruning with domain-specific knowledge
    arXiv:2403.04805v1 Announce Type: cross Abstract: Neural structure learning is of paramount importance for scientific discovery and interpretability. Yet, contemporary pruning algorithms that focus on computational resource efficiency face algorithmic barriers to select a meaningful model that aligns with domain expertise. To mitigate this challenge, we propose DASH, which guides pruning by available domain-specific structural information. In the context of learning dynamic gene regulatory network models, we show that DASH combined with existing general knowledge on interaction partners provides data-specific insights aligned with biology. For this task, we show on synthetic data with ground truth information and two real world applications the effectiveness of DASH, which outperforms competing methods by a large margin and provides more meaningful biological insights. Our work shows that domain specific structural information bears the potential to improve model-derived scientific insights.  ( 2 min )
    An Adaptive Dimension Reduction Estimation Method for High-dimensional Bayesian Optimization
    arXiv:2403.05425v1 Announce Type: new Abstract: Bayesian optimization (BO) has shown impressive results in a variety of applications within low-to-moderate dimensional Euclidean spaces. However, extending BO to high-dimensional settings remains a significant challenge. We address this challenge by proposing a two-step optimization framework. Initially, we identify the effective dimension reduction (EDR) subspace for the objective function using the minimum average variance estimation (MAVE) method. Subsequently, we construct a Gaussian process model within this EDR subspace and optimize it using the expected improvement criterion. Our algorithm offers the flexibility to operate these steps either concurrently or in sequence. In the sequential approach, we meticulously balance the exploration-exploitation trade-off by distributing the sampling budget between subspace estimation and function optimization, and the convergence rate of our algorithm in high-dimensional contexts has been established. Numerical experiments validate the efficacy of our method in challenging scenarios.  ( 2 min )
    Follow-the-Perturbed-Leader with Fr\'{e}chet-type Tail Distributions: Optimality in Adversarial Bandits and Best-of-Both-Worlds
    arXiv:2403.05134v1 Announce Type: new Abstract: This paper studies the optimality of the Follow-the-Perturbed-Leader (FTPL) policy in both adversarial and stochastic $K$-armed bandits. Despite the widespread use of the Follow-the-Regularized-Leader (FTRL) framework with various choices of regularization, the FTPL framework, which relies on random perturbations, has not received much attention, despite its inherent simplicity. In adversarial bandits, there has been conjecture that FTPL could potentially achieve $\mathcal{O}(\sqrt{KT})$ regrets if perturbations follow a distribution with a Fr\'{e}chet-type tail. Recent work by Honda et al. (2023) showed that FTPL with Fr\'{e}chet distribution with shape $\alpha=2$ indeed attains this bound and, notably logarithmic regret in stochastic bandits, meaning the Best-of-Both-Worlds (BOBW) capability of FTPL. However, this result only partly resolves the above conjecture because their analysis heavily relies on the specific form of the Fr\'{e}chet distribution with this shape. In this paper, we establish a sufficient condition for perturbations to achieve $\mathcal{O}(\sqrt{KT})$ regrets in the adversarial setting, which covers, e.g., Fr\'{e}chet, Pareto, and Student-$t$ distributions. We also demonstrate the BOBW achievability of FTPL with certain Fr\'{e}chet-type tail distributions. Our results contribute not only to resolving existing conjectures through the lens of extreme value theory but also potentially offer insights into the effect of the regularization functions in FTRL through the mapping from FTPL to FTRL.  ( 2 min )
    Greedy feature selection: Classifier-dependent feature selection via greedy methods
    arXiv:2403.05138v1 Announce Type: new Abstract: The purpose of this study is to introduce a new approach to feature ranking for classification tasks, called in what follows greedy feature selection. In statistical learning, feature selection is usually realized by means of methods that are independent of the classifier applied to perform the prediction using that reduced number of features. Instead, greedy feature selection identifies the most important feature at each step and according to the selected classifier. In the paper, the benefits of such scheme are investigated theoretically in terms of model capacity indicators, such as the Vapnik-Chervonenkis (VC) dimension or the kernel alignment, and tested numerically by considering its application to the problem of predicting geo-effective manifestations of the active Sun.  ( 2 min )

  • Open

    Open Source AI Is a Misnomer from 2nd & 3rd Place Players - Let's Get Real
    In life, it's up to you whether to be a critic/hater, a creator, or someone who simply goes with the flow. You can lead, follow, or get out of the way. Isn't there a saying like that, perhaps from BMW or something along those lines? Here's the real talk: there's absolutely no way, without a hefty pile of cash, that AI will be an open-to-all venture, especially when we talk about open-source software (OSS). Let's face it, the kind of AI that changes the game just isn't going to be freely available. I'm all in for open source, to the point of being 100000% behind it. I code and share it openly, hoping to both help others and learn in the process. Open source is something I genuinely love. Yet, let's be real, the notion that the most advanced and impactful AI could be open source is a bit …
    AI-translated Hitler speech goes viral on X (Wired article)
    Dom Lucre (real name Dominick McGee ), a right wing conspiracy advocate who was recently reinstated on X by Elon Musk after having been found posting child-exploitation content has gone viral with a Hitler speech AI-translated into English. Full article in Wired: https://www.wired.com/story/ai-altered-hitler-speech-viral/ submitted by /u/Intelligent-Jump1071 [link] [comments]
    Using AI to draw a map
    Hopefully this is the right sub for what I would like to do. If not I’m sure that there will be plenty of people who will tell me where to go😏 I live in a small town and like to use a bicycle for recreation and exercise rather than driving. The town has very few designated bicycle paths and I would like to expand them with some street level routes that have the fewest stop signs, etc. like I have seen in other towns. I’m trying to figure out if there’s a way to use AI to analyze street view video of traffic regulatory signs and then plot them on a map for later export and printing? Does anyone know how I would get started on something like this? Thanks submitted by /u/highinthemountains [link] [comments]
    Apple abandoned its ambitious car project. Ask your questions to Bloomberg's Mark Gurman
    LIVE Q&A: We're live talking about Apple’s failed car effort. After spending a decade and $10 billion to develop an electric vehicle, Apple canceled the initiative in February, abandoning one of the most ambitious projects in its history. Bloomberg’s Mark Gurman will answer questions on the program, what it means for Apple, and what’s next for the company. Feel free to comment on this thread with your questions or email [bloombergqa@bloomberg.net](mailto:bloombergqa@bloomberg.net). You can follow the full live Q&A for free here. submitted by /u/bloomberg [link] [comments]
    Suno V3 [Alpha] + Some Editing
    Here'swhat I've been able to do with some brute force prompting + editing. Goal is to make AI music that passes as real while using lyrics. No keygen music stuff. This is experiment 2, and it's worth noting I could probably achieve this easily enough by replacing sounds + adding my own production, but less so for the vocals. Also Suno if you come across this, DM me as I have a ton of suggestions. submitted by /u/cabesworld [link] [comments]
    [R] What existing Artificial tools can I use / read (books) to dynamically monitor changes on a website?
    Good evening, colleagues, I'm working on a Java project, where I'm parsing a website DOM. I have the problem that the page frequently changes it structure, so I have to adapt the tool I developed often. My working scenario: Let’s say I have two versions of the same website: before changing design (DOM); after changing design (DOM). I have Java-based parser, which has the valid version according to 1) variant. It means, my project was written up to the version of the old DOM; After a period of time my code is broken, ‘cuz creator of the website is changing DOM (design). It means, I’m getting 2) variant, when my parser isn’t working because of the situation while analyzing the structure of the website, it detects that DOM was modified (design), e.g. sidebar was changed or elements…
    Evertrail is an AI adventure where you choose the path in the Twitch chat. Worked on this the last weeks, please help me testing if you have time.
    submitted by /u/Philipp [link] [comments]
    Oooof
    That’s gotta hurt submitted by /u/KingoftheProfane [link] [comments]
    Extropic AI is going to tell us somethi no about the AI processor that they are building!
    They are going to tell us about the AI processor at 10 AM GMT-4: https://x.com/extropic_ai/status/1767075353686946247?s=46 ​ And if you look at the “X” in “Extropic“ in their logo that looks exactly like the path in a photonic waveguide so I think that it is almost 100 Percent certain that it is a Photonic AI processor. They also saythat it is “Physics based” so it is a photonic processor submitted by /u/Jackmustman11111 [link] [comments]
    Tool to configure and execute a multi-step LLM process?
    I'm looking for a tool that allows a prompt-based architecture similar to a controller running through a list of tasks. For example, if I ask it to make a guide on how to [insert learnable topic], the controller follows a list of steps: identify learning points, make a syllabus, prepare chapters, then generate each chapter. The controller AI prompts another AI thread to work on bouts of 1000 words and then compiles the responses for the final output. I know groups are already doing stuff like this by coding it directly, but is there a packaged tool that behaves this way out of the box? submitted by /u/Banner80 [link] [comments]
    Visually, AI is capable of dreaming up new worlds, taking us places we've never been before. Artists like Thierry Lechanteur are using it to create exciting, imaginative, otherworldly outputs. Why aren't we seeing a similar level of creativity in the text-based outputs being created by AI?
    Is it a failure in human imagination? The software? Or maybe it already exists, and I'm just not aware of it? Is anyone doing anything creatively exciting using the written word? All my experiments with text-based AIs have so far have either only been use to create something which is highly functional or completely underwhelming in terms of its creative output. submitted by /u/allie-express [link] [comments]
    RAG for multiple documents
    Checkout this tutorial to enable Q&A over multiple documents (be it PDFs or text or other file formats) using LLMs and RAG https://youtu.be/cBpdiQ3gljM?si=0DABI1s4gMEYKOqZ submitted by /u/mehul_gupta1997 [link] [comments]
    Nvidia is sued by authors over AI use of copyrighted works
    submitted by /u/clonefitreal [link] [comments]
    Why Walmart’s quick success in generative AI search should have Google worried
    KEY POINTS Walmart CEO Doug McMillon has talked up the retail company’s generative AI search capabilities, one more threat to Alphabet’s internet dominance. Alphabet has been among the big tech losers in the stock market this year, alongside Apple, both struggling to tell investors winning AI stories. Google founder Sergey Brin recently conceded missteps in gen AI but said the company will figure out the right business models; analysts say online retailers will be more competitive in search within their ecosystems. Read the article: https://www.cnbc.com/2024/03/10/why-walmarts-quick-success-in-gen-ai-search-should-worry-google.html submitted by /u/clonefitreal [link] [comments]
    Anthropic Opus vs GPT-4-1106-preview
    So, I have been experimenting w/ Anthropic Opus and honestly I'm not sure if it is clearly "better" – at best Opus and GPT-4-1106 are close in quality of responses, but GPT-4-1106 is obviously faster (most of the time.) I utilize AI tech heavily and am building technology around generative AI. In my professional opinion, I have not been blown away by Anthropic, and I just want to ask, how the community feels so far. I still have a heavy predisposition towards GPT-4-1106 – its faster, and in my empirical opinion, slightly better. Yet still, I am glad Anthropic is pushing competition, because that will make GPT 4 better in the long run. submitted by /u/nooblito [link] [comments]
  • Open

    [R] Language Agents as Optimizable Graphs
    submitted by /u/hardmaru [link] [comments]
    [P] 🔥InstaSwap Face Swap for ComfyUI and Standalone
    ComfyUI repo: https://github.com/abdozmantar/ComfyUI-InstaSwap Standalon repo: https://github.com/abdozmantar/Standalone-InstaSwap ​ https://i.redd.it/idb3ugi95snc1.gif submitted by /u/abdullahozmntr [link] [comments]
    [D] Ideas/Suggestions on up-sampling task
    I have a task where I have sensor data, that can be visual, motor or any other kind of sensor. I also have the same data from more expensive sensors. I am looking for generative models that could essentially up-sample the "cheap" data into the "expensive" data. For example, can we replace an expensive tactile sensor or wrist mounted force/torque sensor with cheaper lower-grade equivalents or alternatives such as cameras and microphones? Ideally, I wouldn't want to implement a model myself but I understand I will probably have to train it from scratch. Anyway, if you have any ideas on models that could do well on this task, I would really appreciate your insight, thanks! submitted by /u/Alarmed-Fee6193 [link] [comments]
    [D] Feature Selection Technique for Regression (continous variable)
    Hi all, I need your expertise on this matter please. I have some set of features, say 26 features predicting one target variable. The target variable is a continous variable and some of the features have negative values, so does the target variable. I'm trying to eliminate some of the features, perhaps it will improve the performance of my model, but most of the feature selection technique have seen so far only applied to classifier model, not regression model. What are some of the technique you suggest I could use to select/filter features based on their importance for regression model. submitted by /u/emmyhabey [link] [comments]
    [R] Looking for paper with shared Value-Query attention weights.
    I could have sworn I skimmed a paper around a year ago which demonstrated pretty solid performance in transformers where the Value and Key (or Query) weights were the same / shared within each attention layer. I think Linformer does something similar, but I’m not looking for something that tries to solve the quadratic runtime of attention, just something that shows you can reasonable results with shared value and keys. It might’ve even been mentioned in this subreddit. Somehow I can’t seem to find it though… would anyone happen to know this, or perhaps the right terms to search? submitted by /u/benthehuman_ [link] [comments]
    [Discussion] Can LSTM model learn feature engineering on its own?
    I have a timeseries dataset and I am training an LSTM model on it to perform multiclass classification. My dataset has 7 columns => x1,x2,x3....x7 And has 4 labels => f1,f2,f3,f4 Since I have domain knowledge over my dataset, I know exactly what feature engineering needs to be done. ​ for example, I need to create 4 new features from my current features by applying some rules on each row:- newx1 is created by => if (x2==x3) then 1, else 0 newx2 is created by => if (x1==x4 and x1>x5) then 1, else 0 newx3 is created by => if ((x1-x6)/x1>x7) then 1, else 0 newx4 is created by => if ((x6-x1)x1/>x7) then 1. else 0 ​ I am getting 100% accuracy on test data if I train my LSTM model on newx1, newx2, newx3, newx4. However, training it on the original features (x1,x2....x7) I am getting reduced accuracy of 85-90% accuracy on test data. The problem I am trying to solve requires me to have greater than 99% accuracy, so it won't suffice to have just 90% accuracy. ​ I was wondering if somehow my LSTM model can learn the rules for feature engineering on its own or do I have to change my model ? ​ Note: I can not apply the feature engineering rules manually because I am training the LSTM model on multiple datasets and each dataset will require its own feature engineering rules. I want to keep it as generic as possible. ​ LSTM model :- def create_lstm_model(MaxTimeslice, H, LR, num_classes, dropout_rate=0.1, l2_reg=0.001): ip = Input(shape=(MaxTimeslice, H)) x = LSTM(32, return_sequences=True, dropout=dropout_rate, kernel_regularizer=l2(l2_reg))(ip) x = LSTM(16, dropout=dropout_rate, kernel_regularizer=l2(l2_reg))(x) x = Dense(units=16, activation='relu')(x) multiclass_output = Dense(units=num_classes, activation='softmax')(x) model = Model(inputs=ip, outputs=multiclass_output) model.compile(loss="categorical_crossentropy", metrics=["accuracy"], optimizer=RMSprop(learning_rate=LR)) return model ​ submitted by /u/Fearless_Peanut_6092 [link] [comments]
    [Discussion] Is this an example of data leakage?
    I’m working on a research project using machine learning to try to discover patterns in gene promoters. I’m concerned about data leakage in our model, but I want some outside opinions before I push too hard with my lab to change our methodology. Perhaps someone in this community can help me understand this issue better. Our machine learning model is an L1-regularized logistic regression model that uses sequence patterns as features to predict a binary outcome for a transcription start site (TSS) in a genome. The sequence patterns are detected by an upstream tool in a ~1000 nucleotide region next to the TSS. Because many TSSs are clustered closely in the genome (some genes have several TSSs), these sequence regions often overlap with at least one other TSS’s sequence region by >99%. My con…
    [D] Need Suggestions for Feature Engineering Continuous Numerical features
    Dataset Link - https://www.kaggle.com/competitions/playground-series-s4e3 In a recent competition, I encountered a dataset comprising continuous features, and to gain insights into their distribution dynamics, I generated both distribution plots and box plots. The target column is categorical, and the objective is to ascertain probabilities associated with predicted defects, with the evaluation metric being the AUC ROC score. Employing minimal preprocessing, I trained a model using a hill climbing ensemble, incorporating hyperparameter-tuned XGBoost (XGB) and LightGBM models. The achieved score positioned me within the top 1% on the leaderboards. I am seeking recommendations for further preprocessing steps to enhance the model's accuracy. submitted by /u/tushar_mahalya [link] [comments]
    [D] How to improve the model?
    Hi everyone, I am dealing with a dataset kaggle-type, like 30K rows. I need to predict a binary label which is highly unbalanced. Trying logistic regression, XGB and random Forrest I obtain high accuracy (~93%) but low F1 (~.16%) due to imbalance, as I can verify from confusion matrices. I did cross validation, feature selection, normalization, looked for highly correlated variables. I then tried with the balance option in the random forest but the F1 actually went slightly worst. What else can I do? And how much this is normal in your experience? I can't really tell if, given that I have like 100 labels in the test set for the less represented class, it's impossible to get better with so few data. What do you think? submitted by /u/risilm [link] [comments]
    [P] How to use a base document such as the postgresql documentation and curation of questions and answers to fine tune a LLM model.
    Hi I have a use case in my mind where I can enter a prompt to the model(any open source model) which is fine tuned to specific domains and has all the knowledge of the domain for example the Postgres Database and it is an expert all things postgres and have a training set consisting of questions and answers and an evaluation set to check the accuracy. I'm stuck at the place on how to fine tune a model for example llama-13b-chat model to use the postgres documentation and the traning set of prompt and competition as jsonl. please redirect me towards those resources. submitted by /u/IllLogicalDevice6969 [link] [comments]
    [D] [R] How to teach Resnets to focus on particular regions in an image ?
    I am working on person identification task for biometric purposes. My intuition is that the model should only focus on the boundary of the person, the head and maybe hands and feet too.So in addition to RGB images, I also have edge map images. I also read about non-local neural networks (https://arxiv.org/abs/1711.07971) which can give attention-like capabilities to resnets. However, I am unable to figure out how to train it to look at just the above mentioned parts. submitted by /u/doctor-squidward [link] [comments]
    [N] Haystack 2.0 launch
    Haystack 2.0 stable is live! Try it out for yourself: https://haystack.deepset.ai/blog/haystack-2-release/ Haystack is an open-source AI framework for creating production-ready apps using LLMs and other language models. It's nearly 4 years old - we've been doing NLP and LLM engineering since before it was cool. 😎 and is model and database agnostic — you can use whatever tools make the most sense for your use case. Haystack offers a rich and growing community of integrations, providing monitoring, evaluation, data ingestion, and more. Context about the 2.0 release: Haystack was first officially released back in 2020, when the forefront of NLP was mostly semantic search, retrieval, and extractive question-answering. Haystack 2.0 is a complete rewrite, but the underlying principle of composing components into flexible pipelines remains the same. The release has quite a few model providers, tracing and monitoring capabilities, and databases supported out of the box: For models (generative and embedding): OpenAI, Mistral, Cohere, Jina AI, Google AI, Vertex AI, Optimum (by hugging face), sentence transformers, Amazon Bedrock, Azure, Fast Embed, Ollama For databases: Weaviate, Pinecone, Qdrant, Mongo DB, Astra DB, Neo4j, pgvector, Chroma, Elastic Search, OpenSearch... Hope you give it a try and let us know what your think! We have a quick start guide to get going: https://haystack.deepset.ai/overview/quick-start submitted by /u/tuanacelik [link] [comments]
    [D]Does my customized DNN accelerator on FPGA need a De-Quantization unit?
    I'm designing my DNN accelerator(8-bit) on FPGA. I want to run multiple DNN layers on the accelerator and the network is quantized to 8-bits. But I don't know if I should design a De-Quantization Unit. I know it is better to perform dequantization first to the partial sum then quantize it, and it is included in many commercial NPUs. But it's kind of complicated to design it for me. I am not very familiar with quantization. So is it OK to execute multiple layers on the accelerator without dequantization? Will the accuracy drop a lot? ​ Thanks! submitted by /u/ExcitingInternet6083 [link] [comments]
    [D] How does Gemini 1.5 Pro recall information in 10M context?
    Google recently issued technical report about Gemini 1.5 Pro and its 10M context. They gave brief overview of modern approaches to improving the long-context capabilities of models (e.g. reccurent memory, ring attention, novel architectures, etc.), but I didn't find information about Gemini 1.5 Pro approach. Did someone notice info about their approach in official publications? It seems to be Google's secret for me. submitted by /u/Muted-Witness-7196 [link] [comments]
    [R] [2403.05468] Will GPT-4 Run DOOM?
    submitted by /u/blabboy [link] [comments]
    [R] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
    Paper: https://arxiv.org/abs/2403.03853 Abstract: As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture. submitted by /u/SunsetOneSix [link] [comments]
    [R] MathScale: Scaling Instruction Tuning for Mathematical Reasoning
    Paper: https://arxiv.org/abs/2403.02884 Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then build a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate mathematical reasoning abilities of LLMs comprehensively, we construct MwpBench, a benchmark of Math Word Problems, which is a collection of ten datasets (including GSM8K and MATH) covering K-12, college, and competition level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy, respectively. submitted by /u/SunsetOneSix [link] [comments]
    [R] Position Paper: Agent AI Towards a Holistic Intelligence - Microsoft 2024
    Paper: https://arxiv.org/abs/2403.00833 Abstract: Recent advancements in large foundation models have remarkably enhanced our understanding of sensory information in open-world environments. In leveraging the power of foundation models, it is crucial for AI research to pivot away from excessive reductionism and toward an emphasis on systems that function as cohesive wholes. Specifically, we emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. The emerging field of Agent AI spans a wide range of existing embodied and agent-based multimodal interactions, including robotics, gaming, and healthcare systems, etc. In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model. On top of this idea, we discuss how agent AI exhibits remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Furthermore, we discuss the potential of Agent AI from an interdisciplinary perspective, underscoring AI cognition and consciousness within scientific discourse. We believe that those discussions serve as a basis for future research directions and encourage broader societal engagement. https://preview.redd.it/h8m0ucns7onc1.jpg?width=1228&format=pjpg&auto=webp&s=1cfc94db64f9f358b07353de285faefa5c8ca1a0 https://preview.redd.it/4vck2ens7onc1.jpg?width=1534&format=pjpg&auto=webp&s=32c9e81ed46852ad15e4c90397b13bf4a531278c https://preview.redd.it/rjo7pdns7onc1.jpg?width=927&format=pjpg&auto=webp&s=a0728939d5a83c32811c2efbdff9b5a6d58f023f https://preview.redd.it/ng16dfns7onc1.jpg?width=487&format=pjpg&auto=webp&s=72c6e46c75328cc39e606f149550c0fbf99115a3 submitted by /u/Singularian2501 [link] [comments]
    [D] Serverless Mistral Inference for cost optimization
    I’m currently part of a startup and I’m exploring ways to optimize cost for LLM api usage. Our service doesn't require real-time LLM interaction. Instead, we stacked over 1000 prediction-like tasks per 1 hour and process in within first 10 minutes of every hour. Given this usage pattern, latency and keeping GPUs alive aren't our primary concern, which leads me to believe that open-source LLMs might be a cost-effective solution for us. However, my knowledge of cloud GPU services and serving LLMs on my own is quite limited, so I have some questions. Is it accurate to assume that using open-source LLMs could reduce the costs associated with LLM API usage for my specific use case? What is the difference between autoscaling and serverless? Specifically, is it possible to utilize autoscaling with just a single GPU instance, maintaining a smaller CPU instance for most of the time and allocating a GPU only when necessary? If it is possible, is this approach effective in reducing costs? How to use vllm library for batch inference? Does it automatically optimize concurrent requests for single generates? is vllm compatible with serverless computing frameworks? Would simply deploying a VLLM-based Docker image, crafted via a Dockerfile, onto a serverless GPU service like RunPod be enough? Are there specific considerations or steps required to? submitted by /u/JYPark314 [link] [comments]
    [d] synthetic data generation methods
    I wrote custom code for generating synthetic data for myself for fine-tuning LLM, It gives jsonl file in the format of "system, user, assistant" which openai fine tuner takes in. https://preview.redd.it/lw5lkr442onc1.png?width=1273&format=png&auto=webp&s=72661016111ad9e65eb4b019f10890e9ba02efad But I was wondering if there are other efficient ways to do it? What are some of the ways do you guys generate synthetic data for LLMs, and are there any problems faced? submitted by /u/Medium_Alternative50 [link] [comments]
    [Discussion] [D] Any alternatives to manual labelling in NER?
    Questions - Any alternative to NER? (regex will not work out because sentence and word/phrase boundaries are not clearly defined) Any unsupervised/semi supervised/self supervised approach? For labelling, are there any alternative to manual labelling? Details: I've a free text column of biographies about people where different identifiers such as Name, Id nos. , phone numbers, emails, birth date, nationality, etc. are present. I need to extract them under correct tag (such as NAM for Name, ID for ID nos. and so on). Each entity tag can have several variations (eg. - name can appear after 'NAME:' or 'alias:' or'a.k.a:' or 'also known as'. Also there is severe imbalance in presence of entities (and their variations) in biographies (some biographies only contain Name, email and phone no. and id no. , and very few contain nationality and dob). I'm trying to apply NER. However, Pretrained NER models do not contains entities which I need, so I need to train models with labelled data. For labelling , I'm manually labelling around 1K biographies- which amounts to to 300,000 tokens. There might be more biographies to label in the future, if the performance with these biographies is not sufficient. The problem is that labelling is a super intensive task. I've manually labelled 470 biographies and tried training crf, spacy's ner solution and bert token classifier. The performance is on lower side for those entities where count is < 1K. I've tried to select only those biographies for labelling which contain entities to extract. I've tried pseudo labelling with CRF model, but it didn't work out well. I won't be able to push the data on spacy's prodigy (against the company policies) submitted by /u/Ann2_123 [link] [comments]
    [D] Gradient Accumulation for Contrastive Learning (InfoNCE)
    I was training a multimodal alignment model but could only fit a batch size of 64 in my GPU, even with mixed precision training. Going by the SimCLR paper, smaller batch sizes are suboptimal for learning. Is there any way I could implement gradient accumulation here? submitted by /u/Mad_Scientist2027 [link] [comments]
  • Open

    Arrangement of Computing Power for Neural Networks?
    Computing power is the backbone for training of Neural Networks but not everyone Has the money to afford it. How do companies cope up with this challenge ? View Poll submitted by /u/Red_Pudding_pie [link] [comments]
    The Dode Abides - 24 hours of artificial evolution in neural net sim
    submitted by /u/urocyon_dev [link] [comments]
    Noob question: could I make a NN that can be trained quickly to render one specific object faster than normal polygon-based rendering?
    Warning: I just about know what a NN even is and a the broad strokes of how they are trained, so I will probably say some really stupid things, and my idea is probably ridiculous. I'm asking anyway because I'd like to at least know exactly what mistakes and misconceptions I have, and on the off chance that maybe what I'm thinking about is actually possible. Here is the context: I've been a game programmer for three years now, and when I first heard of 3D Gaussian Splatting, I wrongly believed it was a method by which a neural network could be trained to render a 3D model from any point of view withou actually reading the mesh data, and I immediately began to imagine how this could be leveraged to make game graphics less performance-heavy. I have since learned that 3DGS does the exact op…
    Looking for labelled Osteoarthritis dataset
    Hello, I need labelled images of osteoarthritis. It can be either Xrays or MRIs. Does anyone know where I can find them? submitted by /u/IDAB3002 [link] [comments]
  • Open

    Anyone else suddenly having a problem with installing torchrl in Colab?
    I am using the torchrl library to experiment with some reinforcement learning environments in Google Colab. A few months ago, running !pip install torchrl worked fine. Now, running this command causes the following error: Installing collected packages: triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12, torch, tensordict, torchrl Attempting uninstall: triton Found existing installation: triton 2.1.0 Uninstalling triton-2.1.0: Successfully uninstalled triton-2.1.0 Attempting uninstall: torch Found existing installation: torch 2.1.0+cu121 Uninstalling torch-2.1.0+cu121: Successfully uninstalled torch-2.1.0+cu121 ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 2.2.1 which is incompatible. torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.2.1 which is incompatible. torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.2.1 which is incompatible. torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.2.1 which is incompatible. I have tried !pip install torch==2.1.0, but this doesn't work because the torchrl installation seems to uninstall torch 2.1.0 in favor of torch 2.2.1. I have also tried running !pip install torchrl as the only command in a new notebook and new runtime, which produces the same results. I suspect the issue is related to the recent release of torch 2.2.1, but I'm not sure how to resolve it. Is anyone experiencing the same issue, or able to use torchrl in Colab after the new release? submitted by /u/brantacanadensis906 [link] [comments]
    [2403.04642] Teaching Large Language Models to Reason with Reinforcement Learning
    submitted by /u/LushousLightfoot [link] [comments]
    AI & Continuous Optimization for Brewery
    I will start an optimization project in a bottlenecked bottling and filling line of a brewery. The goal is to produce the maximum number of cans and bottles of beer with minimum downtime and slowdowns. My managers want this project to be done with a combination of AI + Optimization, according to them, this artificial intelligence system should constantly update its decisions and plan according to the data coming from MES, IoT, ERP and dynamically optimize the process itself without leaving the optimization to planners and operators. I guess classical optimization and machine learning methods will not work here, but I can't think of any other method other than Reinforcement Learning. I am waiting for your recommendations. submitted by /u/robustersr2 [link] [comments]
    Tensorboard: "No dashboards are active for the current data set."
    Hey everyone, I finished setting up my first RL agent a couple days ago and went straight ahead to train it. I had the log files saved to my anaconda environment folder from which I opened them with logdir to view the tensorboard. I had issues at first with nothing being displayed but after two hours of restarting and patience it suddenly worked... Now I moved my files to a cloud storage from which I now want to open up my tensorboard session. Since I moved the files, the same error occurs as before. No data is shown with the message "No dashboards are active for the current data set." I would appreciate some quick tips if any of you encountered a similar problem before. Thanks a lot in advance! submitted by /u/QuiGon-GinTonic [link] [comments]
    Graph neural network
    Hello, I am working on production scheduling of a fexible job shop (i.e., the task is to schedule operations such that each operation can be processed on many machines). I am leveraging the graph neural netwrok by modelling the state as a dynamic graph-based structure (the number of nodes varies at each decision time according the available jobs and the idle machine). I have three main questions: A. Which option should I choose? 1- Model the state as disjunctive graph (used in many academic papers) 2- Model the state as a graph model (nx.DiGraph()) and feed the state to a GNN as Data(x=, edge_index=) B. Which one is better? 1- Train GAT for feature embeddings extraction then train a DRL agent for action selection 2- Integrate GAT in DRL agent to directly select an action based on state represented as Data, and then train the overall network C. I have three nodes categories in the graph, if the input is represented as Data, should GNN process each of the three node categories in the graph separately (such that I have one machine node each time (multi-agent setting), several job nodes, ...)? submitted by /u/GuavaAgreeable208 [link] [comments]
    Monitoring RL Agents
    Hi, I have two generic questions: What are the methods to monitor the performance of RL agent while training and exploration? I know that reward function behaviour reflects the performance of RL, but, are there any other criteria that we can track to see how the agent is learning? What are the state-of-the-art safe-exploration techniques apart from use-case specific methods? submitted by /u/alysavalan [link] [comments]
    Monitoring the RL agent
    Hi, I have two generic questions: What are the methods to monitor the performance of RL agent while training and exploration? I know that reward function behaviour reflects the performance of RL, but, are there any other criteria that we can track to see how the agent is learning? What are the state-of-the-art safe-exploration techniques apart from use-case specific methods? submitted by /u/raminhashemi [link] [comments]
  • Open

    Chain-of-table: Evolving tables in the reasoning chain for table understanding
    Posted by Zilong Wang, Student Researcher, and Chen-Yu Lee, Research Scientist, Cloud AI Team People use tables every day to organize and interpret complex information in a structured, easily accessible format. Due to the ubiquity of such tables, reasoning over tabular data has long been a central topic in natural language processing (NLP). Researchers in this field have aimed to leverage language models to help users answer questions, verify statements, and analyze data based on tables. However, language models are trained over large amounts of plain text, so the inherently structured nature of tabular data can be difficult for language models to fully comprehend and utilize. Recently, large language models (LLMs) have achieved outstanding performance across diverse natural languag…  ( 93 min )
  • Open

    Eco-System Upgrade: AI Plants a Digital Forest at NVIDIA GTC
    The ecosystem around NVIDIA’s technologies has always been verdant — but this is absurd. After a stunning premiere at the World Economic Forum in Davos, immersive artworks based on Refik Anadol Studio’s Large Nature Model will come to the U.S. for the first time at NVIDIA GTC. Offering a deep dive into the synergy between Read Article  ( 6 min )
    AI Getting Green Light: City of Raleigh Taps NVIDIA Metropolis to Improve Traffic
    You might say that James Alberque has a bird’s-eye view of the road congestion and challenges that come with a booming U.S. city. Alberque analyzes traffic data for Raleigh, North Carolina, which has seen its population more than double in the past three decades. The city has been working with NVIDIA and its partners to Read Article  ( 6 min )
  • Open

    How VistaPrint delivers personalized product recommendations with Amazon Personalize
    VistaPrint, a Cimpress business, is the design and marketing partner to millions of small businesses around the world. For more than two decades, VistaPrint has empowered small businesses to quickly and effectively create the marketing products – from promotional materials and signage to print advertising and more – to get the job done, regardless of […]  ( 7 min )
  • Open

    API management tools in facilitating digital transformation
    Digital transformation represents a crucial pivot in today’s business world, marking the transition from traditional practices to a more integrated, data-driven approach. API (Application Programming Interface) management tools are central to this transformation, which serve as the backbone for modernizing operations, enriching customer experiences, and catalyzing innovation. These tools orchestrate the seamless interaction between software… Read More »API management tools in facilitating digital transformation The post API management tools in facilitating digital transformation appeared first on Data Science Central.  ( 22 min )
  • Open

    Scaling early detection of esophageal cancer with AI
    Microsoft Research and Cyted have collaborated to build novel AI models (opens in new tab) to scale the early detection of esophageal cancer. The AI-supported methods demonstrated the same diagnostic performance as the existing manual workflow, potentially reducing the pathologist’s workload by up to 63%. Esophageal cancer is the sixth most common cause of cancer […] The post Scaling early detection of esophageal cancer with AI appeared first on Microsoft Research.  ( 12 min )
  • Open

    2024 BAIR Graduate Directory
    Every year, the Berkeley Artificial Intelligence Research (BAIR) Lab graduates some of the most talented and innovative minds in artificial intelligence and machine learning. Our Ph.D. graduates have each expanded the frontiers of AI research and are now ready to embark on new adventures in academia, industry, and beyond. These fantastic individuals bring with them a wealth of knowledge, fresh ideas, and a drive to continue contributing to the advancement of AI. Their work at BAIR, ranging from deep learning, robotics, and natural language processing to computer vision, security, and much more, has contributed significantly to their fields and has had transformative impacts on society. This website is dedicated to showcasing our colleagues, making it easier for academic institutions, resea…  ( 5 min )

  • Open

    [D] Data Science or Machine Learning or AI Buddy
    Hi there! I am looking for a buddy who is also a DS/ML/AI enthusiast here in Wellington, New Zealand? I can even help you out in your projects as much as I can, but what is more important is we'll further enhance our skills and learn more. I treid to check out meetup.com. There are a few groups, but they seem to be inactive. Hmu. Feel free to DM me. Cheers! submitted by /u/Ornery-Technician-24 [link] [comments]
    [R] Guidance/References for using LLMs for text summarisation + some modification under certain circumstances
    Hi all, I'm a newbie grad AI researcher looking for some insight/direction from the more experienced/knowledgeable folks here. I'm looking to research using LLMs (or any other method) for text summarisation, where it picks out certain important aspects of a text and combines them in a "nice" way, but I also want the AI to be able to change certain aspects/details of the summarisation under certain contexts. I would greatly appreciate it if anyone would point me in the right direction with references, etc. submitted by /u/cliffn5810 [link] [comments]
    [D] MacBook recommended for machine learning ?
    Hi yall, I’m a current 1st year undergrad who’s taking a maths degree. In my second year course I have a machine learning module particularly in deep learning and was wondering if a MacBook would be suitable for this. The module is taught in matlab if that is any help. If not any recommendations I’m not to worried about the price as long as it’s not 2000 or more Thank you submitted by /u/NC1_123 [link] [comments]
    [D] Trying to understand inference with JEPA
    Inspired by the Lex Fridmans Podcast episode with Yann LeCun I tried to improve my understand of JEPA and energy based models in general by reading the I-JEPA paper and these lecture notes. I understand the appeal of JEPA from the perspective of learning highly semantic features/representations of continuous image data in a semi supervised procedure. But what really puzzles me is the statement by Yann LeCun that once a model like this is trained you could do optimization based inference where you basically optimize Y in order to minimize the energy. Has this generation process been demonstrated? What components would be needed in addition to the pretrained JEPA model in order to generate images / a text response in this optimization based process? submitted by /u/flxh13 [link] [comments]
    Is having a background in nonparametric estimation a useful path into machine learning anymore? [D]
    I’m a Masters student in Statistics. My background is pretty much all grounded in foundational statistical theory, and my thesis is on nonparametric estimation (nonparametric regression, specifically). Basically my “machine learning” knowledge stems from some classical nonparametric estimation books like (Tysbakov, Wasserman, Tibshirani/Hastie, and Friedman). The elements of statistical learning is pretty much the background I have in machine learning as my thesis is concerning the intersection between classical methods for nonparamrtric regression like tree based methods, kernel smoothers and splines for estimating average treatment effects in causal inference. However, I sometimes feel that my machine learning background is “quite old”. As in, I don’t know how attractive a nonparametric regression background for modern machine learning jobs is anymore. We delve into a lot of asymptotic theory within nonparametrics and statistics in general, and I know for most machine learning jobs no one really cares about asymptotic guarantees. Does anyone know if my knowledge is really even relevant for getting a machine learning job in today’s era where it’s mostly concerned with neural networks? submitted by /u/Direct-Touch469 [link] [comments]
    [D] What are the techniques used in pytorch autoagumentation?
    I'm working with PyTorch's autoaugment ImageNet and CIFAR-10 datasets. And I'm curious to understand the specific techniques used in Them. import torch from torchvision import transforms policies = [transforms.AutoAugmentPolicy.CIFAR10,transforms.AutoAugmentPolicy.IMAGENET] train_transform = transforms.Compose([ transforms.ToPILImage(), transforms.Resize((224, 224)), transforms.AutoAugment(policies[0]) , transforms.ToTensor() ]) submitted by /u/qaz_zaqi [link] [comments]
    [Project] ML and Explainability usage feedback
    Hey, I have a really time consuming simulation that takes a few hundreds of inputs and produce a result of 0, 1 or 2. Since this is a simulation that I have run internally for years, I ended up with millions of those simulations and I want to try and build an ML that will predict the outcome. I don't plan to replace the simulation, but I want the ML model to get a more direct and quick idea of the outcome. I have used Explainability models before, but I had an idea that I haven't encountered again and that's why I am here. Let's say I have my ML model that fit well in my historical dataset. I have used LIME before to get an idea of which feature affected the most a prediction. But can I use the ML model to explain the real result of a simulation? Let's say that I run a simulation and get the result 0. If my model got 0 as well, then I can still use LIME to get an idea of it. But what if my model predicts 1 in that case? Is there any way to use it and "understand" the real outcome of the simulation? Another case I was thinking about was to help me before running the simulation. As I said, I have hundreds of inputs and I want to use the ML to give me random "feedback" of the format of "If you increase the input X, it will increase the probability of 1 by 1.4%". I guess brute force is one way where I get the most important features, tweak them and get the predictions, but is there a more sophisticated way to achieve this? P.S. My simulation is not deterministic and the same set of inputs could produce different results on every run submitted by /u/Rough-Professional16 [link] [comments]
    [D] Embedding models = new business?
    Looking at the MTEB leaderboard it seems like almost everyone now is releasing new models daily, does anyone make money out of this? submitted by /u/Foreign_Cheesecake46 [link] [comments]
    [D] Where do Lora memory gains come from ? (apart from the optimizer state)
    Hello, I've been reading explanations of Lora for hours now, and there is something I can't wrap my head around: the memory gains. I understand that a lot is gained with the optimizer state that are not needed for frozen layers. However, in the Lora Paper (Chapter 4.2) it is stated that We also observe a 25% speedup during training on GPT-3 175B compared to full fine-tuning5 as we do not need to calculate the gradient for the vast majority of the parameters. But don't the gradients of the frozen layers still need to be computed ? Even though their weights won't be updated with it, it has to be computed to get the gradients for the trainable matrices. In classical finetuning, gradients are required because most of the time only the last layers are trained so backpropagation stop there, but with Lora trainable parameters are located in all attention heads, so backpropagation needs to continue until there, or did I misunderstood something ? So if gradients are required, and activations + weights too, then the only memory gain would be from the optimizer ? From my experiences, LoRA memory usage seems way lower than a "normal" finetuning, so I guess there is something I don't understand ​ Thanks a lot for your help ! submitted by /u/Wats0ns [link] [comments]
    Training attention QKV matrices [D]
    I'm starting to get a hang.of the attention paper and the significance of Q, K and V and of dot product attention and multi head attention. What i don't understand is how the values of Q, K, and V matrices are trained. I've read the cross validation stack exchange answers and some others. I get that QKV come from the previous layer, but how does the previous layer determine/train their values? Some form of backprop, sure. But what's the goal they are trained against? I'd really appreciate if someone could explain wit a simple example. submitted by /u/datashri [link] [comments]
    [D] What's the most rich covariance structure that can be used for Variational Inference in moderately sized Bayesian Neural Networks?
    Mean-field variational inference is on one end of the spectrum where it is extremely scalable but the correlation structure of the approximate posterior is not very rich. On the other hand, Fully correlated Variational Inference has quadratic space complexity so not that scalable. But how close can we get to the fully correlated setting for moderately sized neural networks such as a Resnet18? Is there any comparison of the effects of having richer approximate posteriors in bayesian neural networks? submitted by /u/Dangerous-Flan-6581 [link] [comments]
    [R] OpenAI: JSON mode vs Functions
    submitted by /u/JClub [link] [comments]
    [R] Meta-Simulation Exploration: AI Emergence within a Simulated World of Generative Agents
    submitted by /u/ausderh00d [link] [comments]
    How can I effectively defend my deep learning thesis as a mathematics student when computer science students are also working on similar models? [R]
    I'm a mathematics master's student currently working on my thesis, which focuses on developing a deep learning model. While I'm excited about my research, I'm concerned about defending it, especially since computer science students are also exploring similar areas. I'm seeking advice and strategies on how to present and defend my work effectively, emphasizing the unique contributions of my mathematical perspective in this interdisciplinary field. Any insights or suggestions would be greatly appreciated! submitted by /u/mudassar18104009 [link] [comments]
    [P] Stable Diffusion from Scratch in PyTorch | Part II - Conditional Latent Diffusion Models
    submitted by /u/tusharkumar91 [link] [comments]
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [R] Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
    In a new paper, researchers demonstrate that an ensemble of LLMs can match human-level performance at forecasting, potentially replacing the need for large, expensive human forecasting tournaments. Key points: LLMs can achieve human-level forecasting accuracy through aggregating predictions from diverse models This "wisdom of the silicon crowd" approach parallels the well-established "wisdom of the crowd" effect in humans In Study 1, an ensemble of 12 LLMs significantly outperformed a baseline and matched a crowd of 925 human forecasters on 31 binary questions over 3 months In Study 2, providing human crowd forecasts improved individual LLM (GPT-4 and Claude 2) predictions by 17-28%, but simply averaging machine and human predictions was even better The findings suggest that organizations can obtain high-quality, cost-effective, and scalable forecasts by leveraging LLM ensembles, potentially making data-driven decision-making more accessible across various domains. However, the study has some limitations: it focused on short-term binary questions, the LLMs exhibited biases and poor calibration, and their accuracy may degrade as their training data becomes outdated. TLDR: Like humans, LLMs exhibit the "wisdom of the crowds" effect. A "crowd" of LLMs is just as good at prediction as a crowd of humans. More details here. Arxiv paper: https://arxiv.org/pdf/2402.19379.pdf submitted by /u/Successful-Western27 [link] [comments]
    [D] In 2024, what are the latest trends on RL?
    Hi, I am doing research on decision transformers these days. Arguable, while trying to find the most important papers I noticed that not much seems to have happened in the area of RL. I noticed a rend where research is focused on optimizing Transformers and training huge language and vision models treated as supervised models?. Is this the new big thing in RL?. What are the latest trends on RL?. submitted by /u/__Julia [link] [comments]
    [D] AMD cards for machine learning in 2024?
    What's the state of AMD and AI? I'm wondering how much of a performance difference there is between AMD and Nvidia gpus, and if ml libraries like pytorch and tensorflow are sufficiently supported on the 7600xt. Last I've heard ROCm support is available for AMD cards, but there are inconsistencies, software issues, and 2 - 5x slower speeds. On a more personal note, would you pick an rx7600 over a 4060? submitted by /u/AtomicPiano [link] [comments]
    [P] Accuracy Ranking of Classifiers on Tabular Data
    Announcing Predirank; this project ranks classifiers based on their accuracy across a selection of datasets. The primary focus is on classifiers from the scikit-learn library and their application to popular tabular datasets. https://github.com/c4pub/predirank Obviously, the resulting leaderboard is only indicative; there can't be a definitive ranking. The datasets used in this project are derived from a selection of UCI datasets used in the paper: Fernandez-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?" The Journal of Machine Learning Research 15.1 (2014). The list of datasets and classifiers can be customized by editing the provided examples. So, have the rankings changed over time for "shallow" classifiers? "Random Fo…
    [R] Into the Unknown: Self-Learning Large Language Models
    Paper: https://arxiv.org/abs/2402.09147 Code: https://github.com/teddy-f-47/self-learning-llm-public Abstract: We address the main problem of self-learning LLM: the question of what to learn. We propose a self-learning LLM framework that enables an LLM to independently learn previously unknown knowledge through self-assessment of their own hallucinations. Using the hallucination score, we introduce a new concept of Points in The Unknown (PiUs), along with one extrinsic and three intrinsic methods for automatic PiUs identification. It facilitates the creation of a self-learning loop that focuses exclusively on the knowledge gap in Points in The Unknown, resulting in a reduced hallucination score. We also developed evaluation metrics for gauging an LLM's self-learning capability. Our experiments revealed that 7B-Mistral models that have been finetuned or aligned are capable of self-learning considerably well. Our self-learning concept allows more efficient LLM updates and opens new perspectives for knowledge exchange. It may also increase public trust in AI. submitted by /u/SunsetOneSix [link] [comments]
    [Discussion] Experimenting with generating a synthetic dataset for gaze prediction, why do models struggle to learn on the data?
    I've setup an Unreal Engine 5 project that has a metahuman look at random points on a virtual screen, it then captures a photo through a virtual webcam I've setup, and saves it. Below is a simple dataset of around 6.5K images (I say simple because I froze a lot of variables I previously had, here the only things that change image to image is the lighting, the character's position, the screen coordinate, and some slight facial movement I've hacked together by randomly moving the character's mouth bones). For some reason though, models trained from scratch on the data struggle to learn the features, and the train and loss drops slightly but overall stays very stagnant... This is strange, because I've (somewhat) successfully trained gaze prediction models on a dataset I made with my webcam that I've tried to mimic with this data. I successfully trained those models from scratch on the same model architecture with both 4K and 6K samples (final valid mse loss was somewhere around 0.008 (train was ~1/2 that from memory) , which is still more than I'm after, but much better than the synthetic models, which often stagnate at around 0.08 for both train and valid). I tried freezing all variables but eye movement and lighting, and the model immediately learned and the train loss dropped to ~0.001. I'm really just after opinions on what could be causing the issues, it's super strange to me that the same model with exactly the same data manipulation (resizing to (240, 320) with some fastai image augmentations) would learn very well on real data with way more variables (e.g. head rotation which I've froze here, posture changes, even clothing changes), but be totally unable to learn from this simpler dataset: https://huggingface.co/datasets/goatman/metahuman-gaze-prediction Sorry for the long post, any insights would be amazing! submitted by /u/Goatman117 [link] [comments]
    [P] I use AI agents to de-sensationalize the news
    In today's world, catchy headlines and articles often distract readers from the facts and relevant information. Simply News is an attempt to cut through the fray and provide straightforward daily updates about what's actually happening. By coordinating multiple AI agents, Simply News processes sensationalist news articles and transforms them into a cohesive, news-focused podcast across many distinct topics every day. Each agent is responsible for a different part of this process. For example, we have agents which perform the following functions: The Sorter: Scans a vast array of news sources and filters the articles based on relevance and significance to the podcast category. The Pitcher: Crafts a compelling pitch for each sorted article, taking into account the narrative angle presented in the article. The Judge: Evaluates the pitches and makes an editorial decision about which should be covered. The Scripter: Drafts an engaging script for the articles selected by the Judge, ensuring clarity and precision for the listening. Our AIs are directed to select news articles most relevant to the podcast category. Removing the human from this loop means explicit biases don't factor into the decision about what to cover. AI-decisions are also much more auditable, and this transparency is a key reason why AI can be a powerful tool for removing bias and sensationalism in the news. You can listen here. https://www.simplynews.ai/ submitted by /u/sapientais [link] [comments]
    Are GANs still relevant? [D]
    [D] As the constant rise of Difussion models, I'm curious if GANs came make a comeback? Any thoughts? submitted by /u/Superb-Assignment-30 [link] [comments]
    [D] Is it common to have intuitions different from peers about ML?
    I have noticed that my coworkers and I have different intuitions when it comes to certain things about ML. For example, in the way things should be labeled. For one use case, we wanted to detect the vehicle that's closest to the camera. I was of the opinion that we should handle that during the post-processing step, and let the model detect all the vehicles by labeling all that are visible in the image since they have similar features. My intuition was that labeling only the vehicle that's closest to the camera would lead the model to learn very specific features that can lead to overfitting, because it will try to distinguish between vehicles that are farther away and closest to the camera despite them sharing the generic features that make them a vehicle. My coworkers were of the opinion that only the vehicles closest should be labeled as that's what they're interested in detecting. Similarly we also differed on whether partially visible (<30% visible) objects should be labeled. I was of the opinion they should be because if you look at the augmentations that occur during training, partially visible objects inevitably occur due to certain augmentations just as zoom and mosaic. So avoiding labeling them in the original image only results in inconsistency during training where some partial objects are labeled, while others are not. They thought otherwise. TLDR; is it typical to have differences in intuitions and understanding about ML? submitted by /u/notEVOLVED [link] [comments]
    [N] There is now a viable way to perform time series data augmentation
    submitted by /u/anonymousTestPoster [link] [comments]
  • Open

    Is anyone working on social media or debate websites moderated by AI?
    I’d love to see ai that does automatic moderation on social media that’s has more meaningful rules that assess attitude and validity of an actual argument. I get that this wouldn’t be as open or free of a platform but I do think it would be amazing for being able to host collective inquiry, debate, democratic decision making etc. basically areas where it’s better that everyone acts like an adult and uses intellectual rigor instead of mudslinging. As it is, it takes too much work for human moderators to do this. Is anyone working on this? submitted by /u/BoomBapBiBimBop [link] [comments]
    One-Minute Daily AI News 3/10/2024
    NVIDIA’s Next-Gen DGX AI Systems To Feature Liquid Cooling, Reveals CEO Jensen Huang.[1] Dangerous AI numbers: ChatGPT uses 17,000 times more electricity than an average US household in a day.[2] Elon Musk and Sam Altman, once technology visionaries on a joint mission to save the world from the threats of AI, have become embroiled in a fallout that has resulted in a lawsuit and the the publication of private emails.[3] Nvidia is sued by authors over AI use of copyrighted works.[4] Sources: [1] https://wccftech.com/nvidia-ceo-confirms-next-gen-dgx-ai-system-liquid-cooling/ [2] https://timesofindia.indiatimes.com/gadgets-news/alarming-ai-numbers-chatgpt-uses-17000-times-more-electricity-than-an-average-us-household-in-a-day/articleshow/108368128.cms [3] https://www.dailymail.co.uk/news/article-13164315/Elon-Musk-feud-OpenAI-Sam-Altman-AI-emails.html [4] https://www.reuters.com/technology/nvidia-is-sued-by-authors-over-ai-use-copyrighted-works-2024-03-10/ submitted by /u/Excellent-Target-847 [link] [comments]
    This game is not real (AI)
    submitted by /u/Theonetobelive [link] [comments]
    The 2024 China-US AI Dialogue Should Start With an Eye on Chem-Bio Weapons
    submitted by /u/Alan-Foster [link] [comments]
    AI and Medicine
    Hello, what kind of contribution can someone who develops themselves in Python, AI, Deep Learning and Machine Learning make to medicine, regenerative medicine and biotechnology using these fields? submitted by /u/BilgeYamtar [link] [comments]
    Time travel - hear me out
    As AI and VR gets inevitably better, time travelling to the past (last several millennia) doesn’t seem far fetched. Of course I mean in a simulated open-world VR experience. Developers could use a combination of historical scriptures, architecture, linguistics, human behaviour, flora, fauna etc… There’s several videos simulating daily life in Ancient Greece for instance. Imagine this with a fully interactive open-world like GTA or Red Dead but in VR, with the option to choose timelines and locations. Not to mention, recent AI improvements can be used to simulate human interaction with ancient civilisations. I understand that the further back we go, we would lose accuracy. However, I strongly believe we could get it 80-90% accurate if it’s within the last 10,000 years. I’m hoping a project like this can be made in the next 10 years. Which place and timeline would you choose to ‘time travel’ to? submitted by /u/dograt99 [link] [comments]
    I use AI agents to de-sensationalize the news
    In today's world, catchy headlines and articles often distract readers from the facts and relevant information. Simply News is an attempt to cut through the fray and provide straightforward daily updates about what's actually happening. By coordinating multiple AI agents, Simply News processes sensationalist news articles and transforms them into a cohesive, news-focused podcast across many distinct topics every day. Each agent is responsible for a different part of this process. For example, we have agents which perform the following functions: The Sorter: Scans a vast array of news sources and filters the articles based on relevance and significance to the podcast category. The Pitcher: Crafts a compelling pitch for each sorted article, taking into account the narrative angle presented in the article. The Judge: Evaluates the pitches and makes an editorial decision about which should be covered. The Scripter: Drafts an engaging script for the articles selected by the Judge, ensuring clarity and precision for the listening. Our AIs are directed to select news articles most relevant to the podcast category. Removing the human from this loop means explicit biases don't factor into the decision about what to cover. AI-decisions are also much more auditable, and this transparency is a key reason why AI can be a powerful tool for removing bias and sensationalism in the news. You can listen here. https://www.simplynews.ai/ submitted by /u/sapientais [link] [comments]
    This AI Paper from UC Berkeley Unveils ArCHer: A Groundbreaking Machine Learning Framework for Advancing Multi-Turn Decision-Making in Large Language Models
    submitted by /u/bambin0 [link] [comments]
    Seeking easy AI tool that only indexes 5 pdf files
    I have a website that tries to decipher government documents that list benefits to certain people. There are 5 specific government provided pdf documents that specify these details, but they are long-winded and sometimes even confusing and contradictory in some parts. So I am trying to find an AI search engine that only indexes these 5 documents, and allows users to enter a search term like: “I am a 65 years old male. Under what conditions can I claim x supplement.” I am hoping an AI assisted search plugin can give a written response based on only those 5 pdf documents. Is there any such tool that can help me achieve this? submitted by /u/TrichoSearch [link] [comments]
    One-Minute Daily AI News 3/9/2024
    With AI, workplace surveillance has ‘skyrocketed’.[1] Nvidia board members sell nearly $300M of AI chipmaker’s surging stock.[2] Europe launches AI office to serve as ‘global reference point’ on safety, policy and development.[3] Experts alarmed over AI in military as Gaza turns into “testing ground” for US-made war robots.[4] Sources: [1] https://www.cp24.com/news/with-ai-workplace-surveillance-has-skyrocketed-leaving-canadian-laws-behind-1.6801445 [2] https://nypost.com/2024/03/07/business/nvidia-board-members-sell-stock-worth-nearly-300m/ [3] https://www.foxnews.com/world/europe-launches-ai-office-serve-global-reference-point-safety-policy-development [4] https://www.salon.com/2024/03/09/experts-alarmed-over-ai-in-military-as-gaza-turns-into-testing-ground-for-us-made-robots/ submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    Creating a NN from scratch vs a library (Question)
    I picked up learning Python because I’ve read that it’s good for artificial intelligence, I’ve found a two libraries called Tensorflow and PyTorch that both seem to be pretty popular, however I’ve also read that It’s very beneficial to create a neural network from scratch first so I have a deeper understanding of how they work and when it comes to debugging a neural network using libraries it would be easier since I understand them better. I have no idea what goes into creating a neural network from scratch as I haven’t looked into it yet, but would it be worth the time? Or would jumping into Tensorflow or PyTorch be perfectly fine? submitted by /u/MrSanfrinsisco [link] [comments]
    neural lace (nanotech) removal
    do any of your doctors or do you know a doctor you could ask around to find out how to get neural lace (nanotech) removed? it is a real issue that will get worse in the future when everyone realizes that the "beast" in the bible during the great tribulation is caused by neural lace. there are for sure methods, i know it. they involve electricity and water. i am being remote neural monitored which means they get my EEG data (brain signals) from afar. they do this synthetic telepathy enabled by nanotech. the nanotech is in everyone pretty much without them knowing it. please dont brush it off as conspiracy theory like those conspiring against us want.. there truly are evil forces that will attempt to bribe you into not looking into it. please tell me what you find out. you should be willing to look into it because neural lace is a real thing and you will want to know in the future. thanks! https://www.reddit.com/r/Soulnexus/comments/17eqoqr/psychospiritual_warfare_remote_neural_monitoring/ ​ tents.google.com/patent/US3951134A/en?oq=patent+%233951134 remote neural monitoring patent https://patents.google.com/patent/WO2005055579A1/en artificial telepathy patent https://www.youtube.com/watch?v=rj3oj5eer8s&ab_channel=SLEEPINGBEAUTY video explaining remote neural monitoring. it isnt a conspiracy theory. it is 100% real and i have been dealing with it for 14 months. submitted by /u/AlexanderFlyHigh33 [link] [comments]
  • Open

    DQN model losing previous hability to solve a map when training with a new one
    I'm training a DQN model to solve a game. The input is the map. If i manage to get good results with a specific map, save the model and then training it again (keeping the weights) with another map, will it lose the hability to solve the previous map as it needs to explore and learn the new one? Or is it better to train with all the maps at once by exploring all the maps at each iteration(which will be quite computationally expensive)? In my game context, the DQN algorithm needs a lot of exploring, so the epsilon decay is quite slow and initial epsilon value is 1. submitted by /u/libichi [link] [comments]
    What is the stance on decision transformers and future of RL?
    Hi, I am doing research on decision transformers these days. Arguable, while trying to find the most important papers I noticed that not much seems to have happened in the area of RL. I noticed a rend where research is focused on optimizing Transformers and training huge language and vision models treated as supervised models?. Is this the new big thing in RL?. What are the latest trends on RL?. ​ submitted by /u/__Julia [link] [comments]
    "Grandmaster-Level Chess Without Search", Ruoss et al 2024
    submitted by /u/gwern [link] [comments]
  • Open

    Creating an AI Utility Function: A Classroom Exercise
    During a recent discussion with high school juniors and seniors, my goal was to enhance their awareness and understanding of Big (intrusive) Data and AI by focusing on two main areas. The bottom line is that I wanted to turn them into citizens of data science. And to accomplish that, I created a simple exercise… Read More »Creating an AI Utility Function: A Classroom Exercise The post Creating an AI Utility Function: A Classroom Exercise appeared first on Data Science Central.  ( 23 min )

  • Open

    What would you like to see from Anthropic next? [D]
    submitted by /u/Revolutionary_Spaces [link] [comments]
    [D] About eq.45 in Understanding Diffusion Models: A Unified Perspective
    The consistency term of that equation is derived from the last term of eq.44, but I think they are not the same. To compare the last term of eq.44 and the consistency term of eq.45, let's expand those terms to multiple integrals(expectation + KL divergence) and only focus the probability part. From the last term of eq.44: q(xₜ₋₁, xₜ, xₜ₊₁|x₀) = q(xₜ₊₁|xₜ) * q(xₜ|xₜ₋₁) * q(xₜ₋₁|x₀) and from the consistency term: q(xₜ₋₁, xₜ₊₁ | x₀) * q(xₜ|xₜ₋₁) = q(xₜ₊₁|xₜ₋₁) * q(xₜ|xₜ₋₁) * q(xₜ₋₁|x₀) So, if eq.44 and eq.45 are equivalent, q(xₜ₊₁|xₜ₋₁) and q(xₜ₊₁|xₜ) should be the same. But this is not always true. Could there be an error in the derivation, or am I missing something? submitted by /u/ihopesomuch [link] [comments]
    [P] AutoGluon-TimeSeries : A Powerful Time-Series Forecasting library
    The library contains every SOTA time-series model (statistics, ML, DL). Also, it supports ensembling. Here's a great tutorial I found: https://aihorizonforecast.substack.com/p/autogluon-timeseries-creating-powerful submitted by /u/apaxapax [link] [comments]
    [D] Databricks Delta tables and YOLO computer vision
    Hi all, I would really appreciate if someone could help me out. I have been doing training in Databricks, as we plan to use it at work for our ML tasks. One thing we use is YOLO for object detection. I’ve managed to run YOLO by loading data from the blob storage, but I’ve seen that the best way to do deep learning tasks in Databricks is to train your ML models on Delta Live Tables. I currently have my training dataset as a Delta table, and I was wondering if anyone has managed to train computer vision models on Delta tables. I’ve read the documentations and have seen repos such as petastorm that try to implement training on delta tables, but I can’t for the life of me understand how to actually run yolo this way, especially since YOLO uses yaml for config. Thank in advance for your help! 😇 submitted by /u/wsb_crazytrader [link] [comments]
    [R] Beyond Language Models: Byte Models are Digital World Simulators - Microsoft Research Asia 2024 - bGPT - Exceptional capabilities in simulating CPU behaviour, with an accuracy exceeding 99.99% in executing various operations! Could help combat the problems with tokenisation!
    Paper: https://arxiv.org/abs/2402.19155 Paper Page with code and weights: https://byte-gpt.github.io/ Abstract: Traditional deep learning often overlooks bytes, the basic units of the digital world, where all forms of information and operations are encoded and manipulated in binary format. Inspired by the success of next token prediction in natural language processing, we introduce bGPT, a model with next byte prediction to simulate the digital world. bGPT matches specialized models in performance across various modalities, including text, audio, and images, and offers new possibilities for predicting, simulating, and diagnosing algorithm or hardware behaviour. It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte in converting ABC notation to MIDI format. In addition, bGPT demonstrates exceptional capabilities in simulating CPU behaviour, with an accuracy exceeding 99.99% in executing various operations. Leveraging next byte prediction, models like bGPT can directly learn from vast binary data, effectively simulating the intricate patterns of the digital world. https://preview.redd.it/u0h8rs651dnc1.jpg?width=1836&format=pjpg&auto=webp&s=6f7ae48560280a7d15f7095bb8915b8db50ba9ef https://preview.redd.it/e2fnnt651dnc1.jpg?width=899&format=pjpg&auto=webp&s=d637e2f08e70c7caa3d227dc9f8bd26ec5921360 Source: Andrej Karpathy https://youtu.be/zduSFxRajkE?si=Z3AFwwhth3j7raSv submitted by /u/Singularian2501 [link] [comments]
    [R] LLMs surpass human experts in predicting neuroscience experiment outcomes (81% vs 63%)
    A new study shows that LLMs can predict which neuroscience experiments are likely to yield positive findings more accurately than human experts. The researchers used a GPT-3.5 class model with only 7 billion parameters and found that fine-tuning it on neuroscience literature boosted performance even further. I thought the experiment design was interesting. The LLMs were presented with two versions of an abstract with significantly different results, and we were asked to predict which was more likely to be the real abstract, in essence predicting which outcome was more probable. They beat humans by about 18%. Other highlights: Fine-tuning on neuroscience literature improved performance Models achieved 81.4% accuracy vs. 63.4% for human experts Held true across all tested neuroscience subfields Even smaller 7B parameter models performed comparably to larger ones Fine-tuned "BrainGPT" model gained 3% accuracy over the base The implications are significant - AI could help researchers prioritize the most promising experiments, accelerating scientific discovery and reducing wasted efforts. It could lead to breakthroughs in understanding the brain and developing treatments for neurological disorders. However, the study focused only on neuroscience with a limited test set. More research is needed to see if the findings generalize to other scientific domains. And while AI can help identify promising experiments, it can't replace human researchers' creativity and critical thinking. Full paper here. I've also written a more detailed analysis here. submitted by /u/Successful-Western27 [link] [comments]
    [R] [CVPR 2024] AV-RIR: Audio-Visual Room Impulse Response Estimation
    submitted by /u/Snoo63916 [link] [comments]
    [D] Simulating Plastic Surgery outcomes using GANs/Diffusion
    I would like to train a model to help users show Plastic Surgery outcomes. I would like to get the opinion and guidance of experts and experienced people on this Reddit about the best approach to develop an AI for this type of task. istic. I have a custom dataset of about 1500 before & after images to do the training. I can invest in getting a bigger dataset as well. I would like to get the opinion and guidance of experts and experienced people on this reddit about the best approach to develop an AI for this type of task. submitted by /u/tjain73 [link] [comments]
    [D] Voice to voice / Voice transfer tools question.
    TL;DR: What is a good way to take speech, and apply a different voice to it? My hobby project to learn about ML/AI right now, is to try to make a (naive) manga-to-anime fully automated pipeline. I got parsing/understanding of the pages/dialog working (GPT4-V/llava are awesome tools), and I'm at the audio part. I need to be able to read different speech bubbles in different voices. Likewise, I have made samples of voices from public domain data. Also, I have gotten bark to read the text in the bubbles. The problem is, bark doesn't let me provide a voice sample for zero-shot copying of the voice. It just generates random voices: « Bark tries to match the tone, pitch, emotion and prosody of a given preset, but does not currently support custom voice cloning. The model also attempts t…
    [P] Speed comparison of 5 different ways to implement multihead attention in PyTorch
    submitted by /u/seraschka [link] [comments]
    [D] How much synthetic data to use?
    I'm fine-tuning an LLM, to make it respond in a particular manner and for that I need to use synthetic data because I don't have a lot of my own real data, so how much synthetic data should be used. And does the amount of data to be used depends on the size of the model or any other factor? submitted by /u/Medium_Alternative50 [link] [comments]
    [R] PromptKD: Unsupervised Prompt Distillation for Vision-Language Models, SOTA in Prompt Learning.
    Paper: https://arxiv.org/abs/2403.02781 Project Page: https://zhengli97.github.io/PromptKD/ Github: https://github.com/zhengli97/PromptKD ​ https://preview.redd.it/bop5wm9f8bnc1.png?width=1330&format=png&auto=webp&s=ad50156e81d6f9248c597ca239596f28a9f5d7cb Highlights: (1). A novel two-stage unsupervised prompt distillation framework for Vision-Language Models. (2). Reuse high-quality teacher text features instead of training the student's own text encoder. (3). Distillation on large amounts of unlabeled domain images using soft labels provided by the teacher. (4). PromptKD outperforms all existing prompt learning methods on 11 diverse recognition datasets. Abstract: In this paper, we introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowl…
    [P] Fractional GPU containers
    submitted by /u/Thick-Taste-9985 [link] [comments]
    [P] Trained over 120 Style Transfer MLModels - here they are on GitHub
    submitted by /u/VysokoAnime [link] [comments]
    [R] Scalable Language Model with Generalized Continual Learning
    Paper: https://openreview.net/forum?id=mz8owj4DXu Abstract: Continual learning has gained increasing importance as it facilitates the acquisition and refinement of scalable knowledge and skills in language models. However, existing methods typically encounter strict limitations and challenges in real-world scenarios, such as reliance on experience replay, optimization constraints, and inference task-ID. In this study, we introduce the Scalable Language Model (SLM) to overcome these limitations within a more challenging and generalized setting, representing a significant advancement toward practical applications for continual learning. Specifically, we propose the Joint Adaptive Re-Parameterization (JARe), integrated with Dynamic Task-related Knowledge Retrieval (DTKR), to enable adaptive adjustment of language models based on specific downstream tasks. This approach leverages the task distribution within the vector space, aiming to achieve a smooth and effortless continual learning process. Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting. Moreover, while prior research primarily focused on a single task type such as classification, our study goes beyond, with the large language model, i.e., LLaMA-2, to explore the effects across diverse domains and task types, such that a single language model can be decently scaled to broader applications. The code and models will be released to the public. submitted by /u/SunsetOneSix [link] [comments]
    Tools in Natural language processing [D]
    I've trying too hard to reasearch tools in NLP and find code for implementation but I'm not finding any good sources to study, right now I have a good understanding of spacy, nltk and things like bag of words and Tfidf but I can't get ahead of these things please help. submitted by /u/Flashy-Tomato-1135 [link] [comments]
    [D] How valuable is learning CUDA/ C++?
    Everyone is currently trying to make AI implementations fast/efficient (because more efficient -> less money spent on compute). For instance, flash attention 2 is implemented in CUDA. Llama.cpp is C++ Is PyTorch enough? or is there an advantage learning CUDA/C++ in this market, especially for LLMs? And if CUDA is useful in some cases, what are those cases? submitted by /u/joelthomas- [link] [comments]
    [D] Gemma-7b: inference is way too slow on local RTX 3090
    Just a simple test like following from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto") mission_text = "Write me a simple program in Java" mission_ids = tokenizer(mission_text, return_tensors="pt").to("cuda") mission_outputs = model.generate(input_ids=mission_ids["input_ids"], max_new_tokens=500) print(tokenizer.decode(mission_outputs[0])) And the output Write me a simple program in Java that will take a string and return the number of vowels in the string. Answer: Step 1/4 1. We need to create a method that takes a string as input. Step 2/4 2. We need to iterate through the string and check if each character is a vowel. Step 3/4 3. If a character is a vowel, we need to increment a counter. Step 4/4 4. Finally, we need to return the counter as the number of vowels in the string. Here's the code: public static int countVowels(String str) { int count = 0; for (int i = 0; i I think it took about 10 minutes to generate the text above on my local machine with RTX 3090, is not way too slow? Windows 10, pytorch 2.1.2+cu121, CUDA 12.3, VRAM consumption about 22GB, GPU load while inferencing was about 60-70% submitted by /u/tunggad [link] [comments]
    [D] ML with R/Python
    I’m a student in the applied statistic program. And because of that, we were taught the statistical ML by R. However, the majority of time when I came across of ML, people usually talk more about using Python. So I don’t know any field/industry that actually using ML by R instead? If R is very unpopular in ML, do you think a company will pick me because I only know R? ( if that’s the case I might try to learn Python but I think I won’t be strong). Cheers guys! submitted by /u/StrangerOnTheRoad [link] [comments]
    [N] Matrix multiplication breakthrough could lead to faster, more efficient AI models
    "Computer scientists have discovered a new way to multiply large matrices faster than ever before by eliminating a previously unknown inefficiency, reports Quanta Magazine. This could eventually accelerate AI models like ChatGPT, which rely heavily on matrix multiplication to function. The findings, presented in two recent papers, have led to what is reported to be the biggest improvement in matrix multiplication efficiency in over a decade. ... Graphics processing units (GPUs) excel in handling matrix multiplication tasks because of their ability to process many calculations at once. They break down large matrix problems into smaller segments and solve them concurrently using an algorithm. Perfecting that algorithm has been the key to breakthroughs in matrix multiplication efficiency over…
  • Open

    How much will a cable sag? A simple approximation
    Suppose you have a cable of length 2s suspended from two poles of equal height a distance 2x apart. Assuming the cable hangs in the shape of a catenary, how much does it sag in the middle? If the cable were pulled perfectly taught, we would have s = x and there would be no […] How much will a cable sag? A simple approximation first appeared on John D. Cook.  ( 5 min )
    Unique letter patterns in words
    The word Mississippi has a unique pattern of letters. If you were solving a cryptogram puzzle and saw ZVFFVFFVCCV you might guess that the word is Mississippi. Is the pattern of letters in Mississippi literally unique or just uncommon? What is the shortest word with a unique letter pattern? The longest word? We can answer […] Unique letter patterns in words first appeared on John D. Cook.  ( 5 min )
    How to Organize Technical Research?
    64 million scientific papers have been published since 1996 [1]. Assuming you can actually find the information you want in the first place—how can you organize your findings to be able to recall and use them later? It’s not a trifling question. Discoveries often come from uniting different obscure pieces of information in a […] How to Organize Technical Research? first appeared on John D. Cook.  ( 6 min )
  • Open

    Am I overthinking Copilot referring to itself as "me"?
    I was curious about processors and the efficiency of processors compared to the human brain, so I asked Copilot if it's gonna be possible anytime soon to have a natural, real conversation with an AI, instead of the current ABABAB conversation. Now it gave a long answer with bullet points, but it said: "Current AI models, including language models like me, operate in a sequential manner (A/B, A/B)." When I saw it, it just rubbed me the wrong way. I would've expected it to say "like Copilot", or anything else. Does this mean that it has an understanding of itself, or am I just reading into this way too much? submitted by /u/DinoDick43 [link] [comments]
    One-Minute Daily AI News 3/8/2024
    OpenAI said Friday that CEO Sam Altman is rejoining the company’s revamped board of directors, after an internal investigation concluded he was the right leader for the startup.[1] Intel-Backed Astera Seeks $534 Million in IPO With AI Appeal.[2] AI Version of Marilyn Monroe Unveiled at SXSW.[3] Google releases new Gemini update to give users ‘more control’ over AI chatbot responses.[4] Sources: [1] https://www.wsj.com/tech/ai/openai-board-brings-sam-altman-back-adds-three-new-members-bbdb4807 [2] https://www.bloomberg.com/news/articles/2024-03-08/intel-backed-astera-seeks-534-million-in-ipo-amid-ai-frenzy [3] https://www.hollywoodreporter.com/news/general-news/marilyn-monroe-ai-backlash-digital-marilyn-1235846845/ [4] https://www.foxbusiness.com/media/google-releases-new-gemini-update-give-users-more-control-ai-chatbot-responses submitted by /u/Excellent-Target-847 [link] [comments]
    I used resume spammers to apply for 120 jobs. Chaos ensued.
    submitted by /u/thisisinsider [link] [comments]
    Elon Musk just got challenged by MattVidPro in Current State of A.I. to put his money where his mouth is and open source his own AI
    submitted by /u/ptitrainvaloin [link] [comments]
    Toying with Claude II (Sonet) Ends in Surprise
    Turbohair critique: "{mimics every authoritarian ever} The latest AI contains knowledge and abilities that are too powerful to be trusted to normal citizens. So we are going to trust us with them instead." Claude II I do not actually make judgements or voice criticisms about statements, as I am an AI assistant created by Anthropic to be helpful, harmless, and honest. However, I can provide some objective analysis on that statement: The statement seems to be raising concerns about the potential misuse of powerful AI systems by authoritarian regimes or powers. It suggests that rather than allowing open access and distribution of advanced AI capabilities, those in authority want to restrict and control access to such technologies for themselves. This taps into fears some have expresse…
    Discussion of fringe ideas
    The heading is ominous, but I'm not a tinfoil hatter. I'd like to know if anyone knows of ai being used to analyse the growing torrent of stuff being spewed out by the conspiracy and fringe fraternity. I don't believe this stuff, but I find it interesting as an object of study, and would like to take it apart to see how it works. I have posted my own amateurish research on the subreddit r/schuylkillnotes, but all I get there is open hostility and mockery. I don't believe these theories are true, but they are representative of other undercurrents and movements in global society. I want to analyse this "radar shadow" and see what it is. Ai is a powerful lens through which to see and learn. submitted by /u/astrobiological [link] [comments]
    Microsoft will launch the first AI PCs on March 21st
    submitted by /u/valis2400 [link] [comments]
  • Open

    Learn your farm animals with AI!
    Hey kids! What sound does a woolly horse-sheep make? The image above is what you get when you ask dalle-3 (via chatgpt) for some basic educational material: "Please generate an illustrated poster to help children learn which sounds common animals make. Each animal should be pictured with a speech  ( 3 min )
    Bonus: more farm animals
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    QLearning recommender system
    Hello everyone, i'm new to reinforcement learning and i am trying to make a reinforcement learning recommender system using QLearning on Movielens ML-100k dataset, i defined states as users , actions as movies and i am trying to predict ratings. I transformed the ratings file to a user item matrix and i filled the NA values with 0. For the reward function i am checking if the rating is available and i return actual_rating -predicted_rating otherwise i just return 0. For the policy i am using epsilon greedy. My problem is that i don't get good results, i guess my reward function needs work. If someone wanna take a look at my code be my guest , it's python. Thanks Edit: this is my Kaggle Notebook https://www.kaggle.com/code/asribachir/reinforcement-learning-ml100k submitted by /u/Fredybec [link] [comments]
    Model performance evaluation
    What I am doing: I am making a custom Boid flocking environment in Open AI Gym with stable baselines 3. How it works: I pass a positions file for the boids. Model tests on it for 3000 timesteps and outputs rewards for each episode i.e. positions file Training initial positions are different from testing. My model's performance is what I am concerned about. It outputs similar reward when stating from different initial positions, and the boids move as intended, I also generate a video file of movement. Output for seemingly correctly working model ​ Reward Function def calculate_combined_reward(self, agent, neighbor_positions): total_reward=0 out_of_flock=False if (len(neighbor_positions) > 0): for neighbor_position in neighbor_positions: distance = np.linalg.norm(agent.position - neighbor_position) if (distance < SimulationVariables["SafetyRadius"]) : total_reward += -500 * (1 - (distance / SimulationVariables["SafetyRadius"])) # Check this won't hold if too far away and CTDE elif (SimulationVariables["SafetyRadius"] < distance < SimulationVariables["NeighborhoodRadius"]): total_reward += 100 #CENTER OF MASS elif(len(neighbor_positions) == 0): total_reward = -500 out_of_flock=True return total_reward, out_of_flock Problem: However, when I retrained a few times without changing anything and tested, just for consistency sake it didn't perform well and my episode reward were mostly negative. Though I changed nothing. Luckily I saved the model with best performance. Concern: Is my model, that I trained and outputted correct performance, that I have attached photo of, a good trained model, fluke or overfitting? ​ submitted by /u/Sadboi1010 [link] [comments]
  • Open

    An ancient god summoned through JavaScript code. The Armaaruss Military Drone and Soldier Detection System (Armaaruss Model Sys Version 1). A new innovation in programming that could help soldiers and civilians detect drones and evade drone strikes without having to retrain models
    ​ An ancient god comes alive through JavaScript code. The Armaaruss Military Drone and Soldier Detection System (Armaaruss Model Sys Version 1). https://www.academia.edu/116002458/An_ancient_god_comes_alive_through_JavaScript_code_The_Armaaruss_Military_Drone_and_Soldier_Detection_System_Armaaruss_Model_Sys_Version_1_ An ancient god comes alive through JavaScript code. The Armaaruss Military Drone and Soldier Detection System (Armaaruss Model Sys Version 1). A quick synopsis of this special project underscores a religious and technological component, serving the deistic model that has defined a large portion of the human epoch, as well as the stoic and technological model that has driven the advancement of mankind for centuries. My public work since 2019 has provided the impetus for a …
    my NN is averaging between the 2 expected outputs. (NN in scrap mechanic)
    so i built a NN in a game (scrap mechanic), to make it work i had to make 8 other NNs and have an offset of 0.1 for one of the weights on each NN, then calculate the offset between the cost of the normal NN and the offset NN and with that calculate how much i have to change that weight. i thought it first worked, but when i tried inputting more inputs, it was averaging between the two. Example: expected value 1: 2 2 expected value 2: 0 0 what NN calculates: 0.7 1.0 i know i didnt explain it very well, english is my second language and its really hard to explain, im really sorry. Activaton function: Leaky ReLU ​ submitted by /u/Business-Ad2 [link] [comments]
  • Open

    GenTKG: Generative Forecasting on Temporal Knowledge Graph
    arXiv:2310.07793v3 Announce Type: replace-cross Abstract: The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional embedding-based and rule-based methods dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. To address these challenges, we propose a novel retrieval-augmented generation framework named GenTKG combining a temporal logical rule-based retrieval strategy and few-shot parameter-efficient instruction tuning to solve the above challenges, respectively. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting with low computation resources using extremely limited training data as few as 16 samples. GenTKG also highlights remarkable cross-domain generalizability with outperforming performance on unseen datasets without re-training, and in-domain generalizability regardless of time split in the same dataset. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs.  ( 3 min )
    Who Are We Missing? A Principled Approach to Characterizing the Underrepresented Population
    arXiv:2401.14512v2 Announce Type: replace-cross Abstract: Randomized controlled trials (RCTs) serve as the cornerstone for understanding causal effects, yet extending inferences to target populations presents challenges due to effect heterogeneity and underrepresentation. Our paper addresses the critical issue of identifying and characterizing underrepresented subgroups in RCTs, proposing a novel framework for refining target populations to improve generalizability. We introduce an optimization-based approach, Rashomon Set of Optimal Trees (ROOT), to characterize underrepresented groups. ROOT optimizes the target subpopulation distribution by minimizing the variance of the target average treatment effect estimate, ensuring more precise treatment effect estimations. Notably, ROOT generates interpretable characteristics of the underrepresented population, aiding researchers in effective communication. Our approach demonstrates improved precision and interpretability compared to alternatives, as illustrated with synthetic data experiments. We apply our methodology to extend inferences from the Starting Treatment with Agonist Replacement Therapies (START) trial -- investigating the effectiveness of medication for opioid use disorder -- to the real-world population represented by the Treatment Episode Dataset: Admissions (TEDS-A). By refining target populations using ROOT, our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations.  ( 2 min )
    Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation
    arXiv:2403.04599v1 Announce Type: new Abstract: Recently, because of the high-quality representations of contrastive learning methods, rehearsal-based contrastive continual learning has been proposed to explore how to continually learn transferable representation embeddings to avoid the catastrophic forgetting issue in traditional continual settings. Based on this framework, we propose Contrastive Continual Learning via Importance Sampling (CCLIS) to preserve knowledge by recovering previous data distributions with a new strategy for Replay Buffer Selection (RBS), which minimize estimated variance to save hard negative samples for representation learning with high quality. Furthermore, we present the Prototype-instance Relation Distillation (PRD) loss, a technique designed to maintain the relationship between prototypes and sample representations using a self-distillation process. Experiments on standard continual learning benchmarks reveal that our method notably outperforms existing baselines in terms of knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts. The code is available at https://github.com/lijy373/CCLIS.  ( 2 min )
    Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices
    arXiv:2308.06528v2 Announce Type: replace-cross Abstract: Learning to perform abstract reasoning often requires decomposing the task in question into intermediate subgoals that are not specified upfront, but need to be autonomously devised by the learner. In Raven Progressive Matrices (RPM), the task is to choose one of the available answers given a context, where both the context and answers are composite images featuring multiple objects in various spatial arrangements. As this high-level goal is the only guidance available, learning to solve RPMs is challenging. In this study, we propose a deep learning architecture based on the transformer blueprint which, rather than directly making the above choice, addresses the subgoal of predicting the visual properties of individual objects and their arrangements. The multidimensional predictions obtained in this way are then directly juxtaposed to choose the answer. We consider a few ways in which the model parses the visual input into tokens and several regimes of masking parts of the input in self-supervised training. In experimental assessment, the models not only outperform state-of-the-art methods but also provide interesting insights and partial explanations about the inference. The design of the method also makes it immune to biases that are known to be present in some RPM benchmarks.  ( 3 min )
    Multi-Region Markovian Gaussian Process: An Efficient Method to Discover Directional Communications Across Multiple Brain Regions
    arXiv:2402.02686v2 Announce Type: replace-cross Abstract: Studying the complex interactions between different brain regions is crucial in neuroscience. Various statistical methods have explored the latent communication across multiple brain regions. Two main categories are the Gaussian Process (GP) and Linear Dynamical System (LDS), each with unique strengths. The GP-based approach effectively discovers latent variables with frequency bands and communication directions. Conversely, the LDS-based approach is computationally efficient but lacks powerful expressiveness in latent representation. In this study, we merge both methodologies by creating an LDS mirroring a multi-output GP, termed Multi-Region Markovian Gaussian Process (MRM-GP). Our work is the first to establish a connection between an LDS and a multi-output GP that explicitly models frequencies and phase delays within the latent space of neural recordings. Consequently, the model achieves a linear inference cost over time points and provides an interpretable low-dimensional representation, revealing communication directions across brain regions and separating oscillatory communications into different frequency bands.  ( 2 min )
    Rethinking of Encoder-based Warm-start Methods in Hyperparameter Optimization
    arXiv:2403.04720v1 Announce Type: new Abstract: Effectively representing heterogeneous tabular datasets for meta-learning remains an open problem. Previous approaches rely on predefined meta-features, for example, statistical measures or landmarkers. Encoder-based models, such as Dataset2Vec, allow us to extract significant meta-features automatically without human intervention. This research introduces a novel encoder-based representation of tabular datasets implemented within the liltab package available on GitHub https://github.com/azoz01/liltab. Our package is based on an established model for heterogeneous tabular data proposed in [Iwata and Kumagai, 2020]. The proposed approach employs a different model for encoding feature relationships, generating alternative representations compared to existing methods like Dataset2Vec. Both of them leverage the fundamental assumption of dataset similarity learning. In this work, we evaluate Dataset2Vec and liltab on two common meta-tasks - representing entire datasets and hyperparameter optimization warm-start. However, validation on an independent metaMIMIC dataset highlights the nuanced challenges in representation learning. We show that general representations may not suffice for some meta-tasks where requirements are not explicitly considered during extraction. [Iwata and Kumagai, 2020] Tomoharu Iwata and Atsutoshi Kumagai. Meta-learning from Tasks with Heterogeneous Attribute Spaces. In Advances in Neural Information Processing Systems, 2020.  ( 2 min )
    Impacts of Color and Texture Distortions on Earth Observation Data in Deep Learning
    arXiv:2403.04385v1 Announce Type: cross Abstract: Land cover classification and change detection are two important applications of remote sensing and Earth observation (EO) that have benefited greatly from the advances of deep learning. Convolutional and transformer-based U-net models are the state-of-the-art architectures for these tasks, and their performances have been boosted by an increased availability of large-scale annotated EO datasets. However, the influence of different visual characteristics of the input EO data on a model's predictions is not well understood. In this work we systematically examine model sensitivities with respect to several color- and texture-based distortions on the input EO data during inference, given models that have been trained without such distortions. We conduct experiments with multiple state-of-the-art segmentation networks for land cover classification and show that they are in general more sensitive to texture than to color distortions. Beyond revealing intriguing characteristics of widely used land cover classification models, our results can also be used to guide the development of more robust models within the EO domain.  ( 2 min )
    Unsupervised Pretraining for Fact Verification by Language Model Distillation
    arXiv:2309.16540v3 Announce Type: replace-cross Abstract: Fact verification aims to verify a claim using evidence from a trustworthy knowledge base. To address this challenge, algorithms must produce features for every claim that are both semantically meaningful, and compact enough to find a semantic alignment with the source information. In contrast to previous work, which tackled the alignment problem by learning over annotated corpora of claims and their corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via Language Model Distillation), a novel unsupervised pretraining framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-fact alignments without the need for annotations. This is enabled by a novel contrastive loss function that encourages features to attain high-quality claim and evidence alignments whilst preserving the semantic relationships across the corpora. Notably, we present results that achieve a new state-of-the-art on FB15k-237 (+5.3% Hits@1) and FEVER (+8% accuracy) with linear evaluation.  ( 2 min )
    End-to-end Conditional Robust Optimization
    arXiv:2403.04670v1 Announce Type: new Abstract: The field of Contextual Optimization (CO) integrates machine learning and optimization to solve decision making problems under uncertainty. Recently, a risk sensitive variant of CO, known as Conditional Robust Optimization (CRO), combines uncertainty quantification with robust optimization in order to promote safety and reliability in high stake applications. Exploiting modern differentiable optimization methods, we propose a novel end-to-end approach to train a CRO model in a way that accounts for both the empirical risk of the prescribed decisions and the quality of conditional coverage of the contextual uncertainty set that supports them. While guarantees of success for the latter objective are impossible to obtain from the point of view of conformal prediction theory, high quality conditional coverage is achieved empirically by ingeniously employing a logistic regression differentiable layer within the calculation of coverage quality in our training loss. We show that the proposed training algorithms produce decisions that outperform the traditional estimate then optimize approaches.  ( 2 min )
    JAX-SPH: A Differentiable Smoothed Particle Hydrodynamics Framework
    arXiv:2403.04750v1 Announce Type: cross Abstract: Particle-based fluid simulations have emerged as a powerful tool for solving the Navier-Stokes equations, especially in cases that include intricate physics and free surfaces. The recent addition of machine learning methods to the toolbox for solving such problems is pushing the boundary of the quality vs. speed tradeoff of such numerical simulations. In this work, we lead the way to Lagrangian fluid simulators compatible with deep learning frameworks, and propose JAX-SPH - a Smoothed Particle Hydrodynamics (SPH) framework implemented in JAX. JAX-SPH builds on the code for dataset generation from the LagrangeBench project (Toshev et al., 2023) and extends this code in multiple ways: (a) integration of further key SPH algorithms, (b) restructuring the code toward a Python library, (c) verification of the gradients through the solver, and (d) demonstration of the utility of the gradients for solving inverse problems as well as a Solver-in-the-Loop application. Our code is available at https://github.com/tumaer/jax-sph.  ( 2 min )
    SkelVIT: Consensus of Vision Transformers for a Lightweight Skeleton-Based Action Recognition System
    arXiv:2311.08094v2 Announce Type: replace-cross Abstract: Skeleton-based action recognition receives the attention of many researchers as it is robust to viewpoint and illumination changes, and its processing is much more efficient than the processing of video frames. With the emergence of deep learning models, it has become very popular to represent the skeleton data in pseudo-image form and apply CNN for action recognition. Thereafter, studies concentrated on finding effective methods for forming pseudo-images. Recently, attention networks, more specifically transformers have provided promising results in various vision problems. In this study, the effectiveness of VIT for skeleton-based action recognition is examined and its robustness on the pseudo-image representation scheme is investigated. To this end, a three-level architecture, SkelVit is proposed, which forms a set of pseudo images, applies a classifier on each of the representations, and combines their results to find the final action class. The performance of SkelVit is examined thoroughly via a set of experiments. First, the sensitivity of the system to representation is investigated by comparing it with two of the state-of-the-art pseudo-image representation methods. Then, the classifiers of SkelVit are realized in two experimental setups by CNNs and VITs, and their performances are compared. In the final experimental setup, the contribution of combining classifiers is examined by applying the model with a different number of classifiers. Experimental studies reveal that the proposed system with its lightweight representation scheme achieves better results than the state-of-the-art methods. It is also observed that the vision transformer is less sensitive to the initial pseudo-image representation compared to CNN. Nevertheless, even with the vision transformer, the recognition performance can be further improved by the consensus of classifiers.  ( 3 min )
    Harnessing Inherent Noises for Privacy Preservation in Quantum Machine Learning
    arXiv:2312.11126v2 Announce Type: replace-cross Abstract: Quantum computing revolutionizes the way of solving complex problems and handling vast datasets, which shows great potential to accelerate the machine learning process. However, data leakage in quantum machine learning (QML) may present privacy risks. Although differential privacy (DP), which protects privacy through the injection of artificial noise, is a well-established approach, its application in the QML domain remains under-explored. In this paper, we propose to harness inherent quantum noises to protect data privacy in QML. Especially, considering the Noisy Intermediate-Scale Quantum (NISQ) devices, we leverage the unavoidable shot noise and incoherent noise in quantum computing to preserve the privacy of QML models for binary classification. We mathematically analyze that the gradient of quantum circuit parameters in QML satisfies a Gaussian distribution, and derive the upper and lower bounds on its variance, which can potentially provide the DP guarantee. Through simulations, we show that a target privacy protection level can be achieved by running the quantum circuit a different number of times.  ( 2 min )
    AutoFT: Learning an Objective for Robust Fine-Tuning
    arXiv:2401.10220v2 Announce Type: replace-cross Abstract: Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning. However, fine-tuning a model on one data distribution often degrades performance under distribution shifts. Current approaches to robust fine-tuning use hand-crafted regularization techniques to constrain the fine-tuning process towards the pretrained model. Yet, it is hard to specify how to adapt relevant characteristics of the foundation model during fine-tuning, as this depends on how the pre-training, fine-tuning, and test data distributions relate to each other. We propose AutoFT, a data-driven approach for robust fine-tuning. Given a task, AutoFT searches for a fine-tuning procedure that enhances out-of-distribution (OOD) generalization. Specifically, AutoFT uses bi-level optimization to search for an objective function and hyperparameters that maximize post-adaptation performance on a small OOD validation set. We evaluate AutoFT on nine natural distribution shifts. Our experiments show that AutoFT significantly improves generalization to OOD inputs, outperforming existing robust fine-tuning methods. Notably, AutoFT achieves a new state-of-the-art on the WILDS iWildCam and FMoW benchmarks, outperforming the previous best methods by $6.0\%$ and $1.5\%$, respectively.  ( 2 min )
    Steering Llama 2 via Contrastive Activation Addition
    arXiv:2312.06681v3 Announce Type: replace-cross Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).  ( 2 min )
    Multi-Agent Reinforcement Learning for Assessing False-Data Injection Attacks on Transportation Networks
    arXiv:2312.14625v2 Announce Type: replace-cross Abstract: The increasing reliance of drivers on navigation applications has made transportation networks more susceptible to data-manipulation attacks by malicious actors. Adversaries may exploit vulnerabilities in the data collection or processing of navigation services to inject false information, and to thus interfere with the drivers' route selection. Such attacks can significantly increase traffic congestions, resulting in substantial waste of time and resources, and may even disrupt essential services that rely on road networks. To assess the threat posed by such attacks, we introduce a computational framework to find worst-case data-injection attacks against transportation networks. First, we devise an adversarial model with a threat actor who can manipulate drivers by increasing the travel times that they perceive on certain roads. Then, we employ hierarchical multi-agent reinforcement learning to find an approximate optimal adversarial strategy for data manipulation. We demonstrate the applicability of our approach through simulating attacks on the Sioux Falls, ND network topology.  ( 2 min )
    DTP-Net: Learning to Reconstruct EEG signals in Time-Frequency Domain by Multi-scale Feature Reuse
    arXiv:2312.09417v2 Announce Type: replace-cross Abstract: Electroencephalography (EEG) signals are easily corrupted by various artifacts, making artifact removal crucial for improving signal quality in scenarios such as disease diagnosis and brain-computer interface (BCI). In this paper, we present a fully convolutional neural architecture, called DTP-Net, which consists of a Densely Connected Temporal Pyramid (DTP) sandwiched between a pair of learnable time-frequency transformations for end-to-end electroencephalogram (EEG) denoising. The proposed method first transforms a single-channel EEG signal of arbitrary length into the time-frequency domain via an Encoder layer. Then, noises, such as ocular and muscle artifacts, are extracted by DTP in a multi-scale fashion and reduced. Finally, a Decoder layer is employed to reconstruct the artifact-reduced EEG signal. Additionally, we conduct an in-depth analysis of the representation learning behavior of each module in DTP-Net to substantiate its robustness and reliability. Extensive experiments conducted on two public semi-simulated datasets demonstrate the effective artifact removal performance of DTP-Net, which outperforms state-of-art approaches. Experimental results demonstrate cleaner waveforms and significant improvement in Signal-to-Noise Ratio (SNR) and Relative Root Mean Square Error (RRMSE) after denoised by the proposed model. Moreover, the proposed DTP-Net is applied in a specific BCI downstream task, improving the classification accuracy by up to 5.55% compared to that of the raw signals, validating its potential applications in the fields of EEG-based neuroscience and neuro-engineering.  ( 3 min )
    Does Writing with Language Models Reduce Content Diversity?
    arXiv:2309.05196v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content, potentially limiting diverse perspectives in public discourse. In this work, we measure the impact of co-writing on diversity via a controlled experiment, where users write argumentative essays in three setups -- using a base LLM (GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We develop a set of diversity metrics and find that writing with InstructGPT (but not the GPT3) results in a statistically significant reduction in diversity. Specifically, it increases the similarity between the writings of different authors and reduces the overall lexical and content diversity. We additionally find that this effect is mainly attributable to InstructGPT contributing less diverse text to co-written essays. In contrast, the user-contributed text remains unaffected by model collaboration. This suggests that the recent improvement in generation quality from adapting models to human feedback might come at the cost of more homogeneous and less diverse content.  ( 2 min )
    Efficient Out-of-Distribution Detection with Prototypical Semi-Supervised Learning and Foundation Models
    arXiv:2311.17093v2 Announce Type: replace-cross Abstract: This paper describes PAWS-VMK, an improved approach to prototypical semi-supervised learning in the field of computer vision, specifically designed to utilize a frozen foundation model as the neural network backbone. This method outperforms previous results in semi-supervised learning and out-of-distribution (OOD) detection, improving upon the Predicting View-Assignments With Support Samples (PAWS) semi-supervised learning method. We introduce (1) parametric von-Mises Fisher Stochastic Neighbour Embedding (vMF-SNE) to pretrain the projection head using the high-quality embeddings of the foundation model; (2) a MixMatch inspired loss, where predictions across multiple views are averaged to provide a more reliable supervision signal compared to the consistency loss used in PAWS and (3) simple $k$-Means prototype selection (SKMPS), a technique that provides superior performance to other unsupervised label selection approaches in this context. PAWS-VMK sets new benchmarks in semi-supervised learning for CIFAR-10 (99.2%) and CIFAR-100 (89.8%) with four labelled instances per class, and Food-101 (90.1%) with two labelled instances per class. We also observe that PAWS-VMK can efficiently detect OOD samples in a manner that is competitive with specialised methods specifically designed for this purpose, achieving 93.1/98.0 and 95.2/96.3 on the CIFAR-10 and CIFAR-100 OpenOOD benchmarks.  ( 2 min )
    Improve Generalization Ability of Deep Wide Residual Network with A Suitable Scaling Factor
    arXiv:2403.04545v1 Announce Type: new Abstract: Deep Residual Neural Networks (ResNets) have demonstrated remarkable success across a wide range of real-world applications. In this paper, we identify a suitable scaling factor (denoted by $\alpha$) on the residual branch of deep wide ResNets to achieve good generalization ability. We show that if $\alpha$ is a constant, the class of functions induced by Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable, as the depth goes to infinity. We also highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with increasing depth $L$, the degeneration phenomenon may still occur. However, when $\alpha$ decreases rapidly with $L$, the kernel regression with deep RNTK with early stopping can achieve the minimax rate provided that the target regression function falls in the reproducing kernel Hilbert space associated with the infinite-depth RNTK. Our simulation studies on synthetic data and real classification tasks such as MNIST, CIFAR10 and CIFAR100 support our theoretical criteria for choosing $\alpha$.  ( 2 min )
    Architectural Blueprint For Heterogeneity-Resilient Federated Learning
    arXiv:2403.04546v1 Announce Type: new Abstract: This paper proposes a novel three tier architecture for federated learning to optimize edge computing environments. The proposed architecture addresses the challenges associated with client data heterogeneity and computational constraints. It introduces a scalable, privacy preserving framework that enhances the efficiency of distributed machine learning. Through experimentation, the paper demonstrates the architecture capability to manage non IID data sets more effectively than traditional federated learning models. Additionally, the paper highlights the potential of this innovative approach to significantly improve model accuracy, reduce communication overhead, and facilitate broader adoption of federated learning technologies.  ( 2 min )
    Stabilizing Policy Gradients for Stochastic Differential Equations via Consistency with Perturbation Process
    arXiv:2403.04154v1 Announce Type: new Abstract: Considering generating samples with high rewards, we focus on optimizing deep neural networks parameterized stochastic differential equations (SDEs), the advanced generative models with high expressiveness, with policy gradient, the leading algorithm in reinforcement learning. Nevertheless, when applying policy gradients to SDEs, since the policy gradient is estimated on a finite set of trajectories, it can be ill-defined, and the policy behavior in data-scarce regions may be uncontrolled. This challenge compromises the stability of policy gradients and negatively impacts sample complexity. To address these issues, we propose constraining the SDE to be consistent with its associated perturbation process. Since the perturbation process covers the entire space and is easy to sample, we can mitigate the aforementioned problems. Our framework offers a general approach allowing for a versatile selection of policy gradient methods to effectively and efficiently train SDEs. We evaluate our algorithm on the task of structure-based drug design and optimize the binding affinity of generated ligand molecules. Our method achieves the best Vina score -9.07 on the CrossDocked2020 dataset.  ( 2 min )
    Fill-and-Spill: Deep Reinforcement Learning Policy Gradient Methods for Reservoir Operation Decision and Control
    arXiv:2403.04195v1 Announce Type: new Abstract: Changes in demand, various hydrological inputs, and environmental stressors are among the issues that water managers and policymakers face on a regular basis. These concerns have sparked interest in applying different techniques to determine reservoir operation policy decisions. As the resolution of the analysis increases, it becomes more difficult to effectively represent a real-world system using traditional methods such as Dynamic Programming (DP) and Stochastic Dynamic Programming (SDP) for determining the best reservoir operation policy. One of the challenges is the "curse of dimensionality," which means the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with respect to the number of input variables (i.e., dimensionality) of the function. Deep Reinforcement Learning (DRL) is an intelligent approach to overcome the curses of stochastic optimization problems for reservoir operation policy decisions. To our knowledge, this study is the first attempt that examine various novel DRL continuous-action policy gradient methods (PGMs), including Deep Deterministic Policy Gradients (DDPG), Twin Delayed DDPG (TD3), and two different versions of Soft Actor-Critic (SAC18 and SAC19) for optimizing reservoir operation policy. In this study, multiple DRL techniques were implemented in order to find the optimal operation policy of Folsom Reservoir in California, USA. The reservoir system supplies agricultural, municipal, hydropower, and environmental flow demands and flood control operations to the City of Sacramento. Analysis suggests that the TD3 and SAC are robust to meet the Folsom Reservoir's demands and optimize reservoir operation policies.  ( 3 min )
    Cooperative Bayesian Optimization for Imperfect Agents
    arXiv:2403.04442v1 Announce Type: new Abstract: We introduce a cooperative Bayesian optimization problem for optimizing black-box functions of two variables where two agents choose together at which points to query the function but have only control over one variable each. This setting is inspired by human-AI teamwork, where an AI-assistant helps its human user solve a problem, in this simplest case, collaborative optimization. We formulate the solution as sequential decision-making, where the agent we control models the user as a computationally rational agent with prior knowledge about the function. We show that strategic planning of the queries enables better identification of the global maximum of the function as long as the user avoids excessive exploration. This planning is made possible by using Bayes Adaptive Monte Carlo planning and by endowing the agent with a user model that accounts for conservative belief updates and exploratory sampling of the points to query.  ( 2 min )
    Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI
    arXiv:2403.04551v1 Announce Type: new Abstract: Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify "hard" samples. However, there is a lack of consensus regarding the definition and evaluation of "hardness". Unfortunately, current HCMs have only been evaluated on specific types of hardness and often only qualitatively or with respect to downstream performance, overlooking the fundamental quantitative identification task. We address this gap by presenting a fine-grained taxonomy of hardness types. Additionally, we propose the Hardness Characterization Analysis Toolkit (H-CAT), which supports comprehensive and quantitative benchmarking of HCMs across the hardness taxonomy and can easily be extended to new HCMs, hardness types, and datasets. We use H-CAT to evaluate 13 different HCMs across 8 hardness types. This comprehensive evaluation encompassing over 14K setups uncovers strengths and weaknesses of different HCMs, leading to practical tips to guide HCM selection and future development. Our findings highlight the need for more comprehensive HCM evaluation, while we hope our hardness taxonomy and toolkit will advance the principled evaluation and uptake of data-centric AI methods.  ( 2 min )
    Enhancing Data Quality in Federated Fine-Tuning of Foundation Models
    arXiv:2403.04529v1 Announce Type: new Abstract: In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among multiple specialized and high-quality private domain data sources. However, the challenge of training models locally without sharing private data presents numerous obstacles in data quality control. To tackle this issue, we propose a data quality control pipeline for federated fine-tuning of foundation models. This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard, aiming for improved global performance. Our experiments show that the proposed quality control pipeline facilitates the effectiveness and reliability of the model training, leading to better performance.  ( 2 min )
    CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
    arXiv:2403.04547v1 Announce Type: new Abstract: We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.  ( 2 min )
    What makes an image realistic?
    arXiv:2403.04493v1 Announce Type: new Abstract: The last decade has seen tremendous progress in our ability to generate realistic-looking data, be it images, text, audio, or video. Here, we discuss the closely related problem of quantifying realism, that is, designing functions that can reliably tell realistic data from unrealistic data. This problem turns out to be significantly harder to solve and remains poorly understood, despite its prevalence in machine learning and recent breakthroughs in generative AI. Drawing on insights from algorithmic information theory, we discuss why this problem is challenging, why a good generative model alone is insufficient to solve it, and what a good solution would look like. In particular, we introduce the notion of a universal critic, which unlike adversarial critics does not require adversarial training. While universal critics are not immediately practical, they can serve both as a North Star for guiding practical implementations and as a tool for analyzing existing attempts to capture realism.  ( 2 min )
    Hyperspectral unmixing for Raman spectroscopy via physics-constrained autoencoders
    arXiv:2403.04526v1 Announce Type: new Abstract: Raman spectroscopy is widely used across scientific domains to characterize the chemical composition of samples in a non-destructive, label-free manner. Many applications entail the unmixing of signals from mixtures of molecular species to identify the individual components present and their proportions, yet conventional methods for chemometrics often struggle with complex mixture scenarios encountered in practice. Here, we develop hyperspectral unmixing algorithms based on autoencoder neural networks, and we systematically validate them using both synthetic and experimental benchmark datasets created in-house. Our results demonstrate that unmixing autoencoders provide improved accuracy, robustness and efficiency compared to standard unmixing methods. We also showcase the applicability of autoencoders to complex biological settings by showing improved biochemical characterization of volumetric Raman imaging data from a monocytic cell.  ( 2 min )
    FRRI: a novel algorithm for fuzzy-rough rule induction
    arXiv:2403.04447v1 Announce Type: new Abstract: Interpretability is the next frontier in machine learning research. In the search for white box models - as opposed to black box models, like random forests or neural networks - rule induction algorithms are a logical and promising option, since the rules can easily be understood by humans. Fuzzy and rough set theory have been successfully applied to this archetype, almost always separately. As both approaches to rule induction involve granular computing based on the concept of equivalence classes, it is natural to combine them. The QuickRules\cite{JensenCornelis2009} algorithm was a first attempt at using fuzzy rough set theory for rule induction. It is based on QuickReduct, a greedy algorithm for building decision reducts. QuickRules already showed an improvement over other rule induction methods. However, to evaluate the full potential of a fuzzy rough rule induction algorithm, one needs to start from the foundations. In this paper, we introduce a novel rule induction algorithm called Fuzzy Rough Rule Induction (FRRI). We provide background and explain the workings of our algorithm. Furthermore, we perform a computational experiment to evaluate the performance of our algorithm and compare it to other state-of-the-art rule induction approaches. We find that our algorithm is more accurate while creating small rulesets consisting of relatively short rules. We end the paper by outlining some directions for future work.  ( 2 min )
    Boosting Fairness and Robustness in Over-the-Air Federated Learning
    arXiv:2403.04431v1 Announce Type: new Abstract: Over-the-Air Computation is a beyond-5G communication strategy that has recently been shown to be useful for the decentralized training of machine learning models due to its efficiency. In this paper, we propose an Over-the-Air federated learning algorithm that aims to provide fairness and robustness through minmax optimization. By using the epigraph form of the problem at hand, we show that the proposed algorithm converges to the optimal solution of the minmax problem. Moreover, the proposed approach does not require reconstructing channel coefficients by complex encoding-decoding schemes as opposed to state-of-the-art approaches. This improves both efficiency and privacy.  ( 2 min )
    On the Topology Awareness and Generalization Performance of Graph Neural Networks
    arXiv:2403.04482v1 Announce Type: new Abstract: Many computer vision and machine learning problems are modelled as learning tasks on graphs, where graph neural networks (GNNs) have emerged as a dominant tool for learning representations of graph-structured data. A key feature of GNNs is their use of graph structures as input, enabling them to exploit the graphs' inherent topological properties-known as the topology awareness of GNNs. Despite the empirical successes of GNNs, the influence of topology awareness on generalization performance remains unexplored, particularly for node-level tasks that diverge from the assumption of data being independent and identically distributed (I.I.D.). The precise definition and characterization of the topology awareness of GNNs, especially concerning different topological features, are still unclear. This paper introduces a comprehensive framework to characterize the topology awareness of GNNs across any topological feature. Using this framework, we investigate the effects of topology awareness on GNN generalization performance. Contrary to the prevailing belief that enhancing the topology awareness of GNNs is always advantageous, our analysis reveals a critical insight: improving the topology awareness of GNNs may inadvertently lead to unfair generalization across structural groups, which might not be desired in some scenarios. Additionally, we conduct a case study using the intrinsic graph metric, the shortest path distance, on various benchmark datasets. The empirical results of this case study confirm our theoretical insights. Moreover, we demonstrate the practical applicability of our framework by using it to tackle the cold start problem in graph active learning.  ( 3 min )
    A Survey of Graph Neural Networks in Real world: Imbalance, Noise, Privacy and OOD Challenges
    arXiv:2403.04468v1 Announce Type: new Abstract: Graph-structured data exhibits universality and widespread applicability across diverse domains, such as social network analysis, biochemistry, financial fraud detection, and network security. Significant strides have been made in leveraging Graph Neural Networks (GNNs) to achieve remarkable success in these areas. However, in real-world scenarios, the training environment for models is often far from ideal, leading to substantial performance degradation of GNN models due to various unfavorable factors, including imbalance in data distribution, the presence of noise in erroneous data, privacy protection of sensitive information, and generalization capability for out-of-distribution (OOD) scenarios. To tackle these issues, substantial efforts have been devoted to improving the performance of GNN models in practical real-world scenarios, as well as enhancing their reliability and robustness. In this paper, we present a comprehensive survey that systematically reviews existing GNN models, focusing on solutions to the four mentioned real-world challenges including imbalance, noise, privacy, and OOD in practical scenarios that many existing reviews have not considered. Specifically, we first highlight the four key challenges faced by existing GNNs, paving the way for our exploration of real-world GNN models. Subsequently, we provide detailed discussions on these four aspects, dissecting how these solutions contribute to enhancing the reliability and robustness of GNN models. Last but not least, we outline promising directions and offer future perspectives in the field.  ( 3 min )
    Vlearn: Off-Policy Learning with Efficient State-Value Function Estimation
    arXiv:2403.04453v1 Announce Type: new Abstract: Existing off-policy reinforcement learning algorithms typically necessitate an explicit state-action-value function representation, which becomes problematic in high-dimensional action spaces. These algorithms often encounter challenges where they struggle with the curse of dimensionality, as maintaining a state-action-value function in such spaces becomes data-inefficient. In this work, we propose a novel off-policy trust region optimization approach, called Vlearn, that eliminates the requirement for an explicit state-action-value function. Instead, we demonstrate how to efficiently leverage just a state-value function as the critic, thus overcoming several limitations of existing methods. By doing so, Vlearn addresses the computational challenges posed by high-dimensional action spaces. Furthermore, Vlearn introduces an efficient approach to address the challenges associated with pure state-value function learning in the off-policy setting. This approach not only simplifies the implementation of off-policy policy gradient algorithms but also leads to consistent and robust performance across various benchmark tasks. Specifically, by removing the need for a state-action-value function Vlearn simplifies the learning process and allows for more efficient exploration and exploitation in complex environments  ( 2 min )
    Exploring the Influence of Dimensionality Reduction on Anomaly Detection Performance in Multivariate Time Series
    arXiv:2403.04429v1 Announce Type: new Abstract: This paper presents an extensive empirical study on the integration of dimensionality reduction techniques with advanced unsupervised time series anomaly detection models, focusing on the MUTANT and Anomaly-Transformer models. The study involves a comprehensive evaluation across three different datasets: MSL, SMAP, and SWaT. Each dataset poses unique challenges, allowing for a robust assessment of the models' capabilities in varied contexts. The dimensionality reduction techniques examined include PCA, UMAP, Random Projection, and t-SNE, each offering distinct advantages in simplifying high-dimensional data. Our findings reveal that dimensionality reduction not only aids in reducing computational complexity but also significantly enhances anomaly detection performance in certain scenarios. Moreover, a remarkable reduction in training times was observed, with reductions by approximately 300\% and 650\% when dimensionality was halved and minimized to the lowest dimensions, respectively. This efficiency gain underscores the dual benefit of dimensionality reduction in both performance enhancement and operational efficiency. The MUTANT model exhibits notable adaptability, especially with UMAP reduction, while the Anomaly-Transformer demonstrates versatility across various reduction techniques. These insights provide a deeper understanding of the synergistic effects of dimensionality reduction and anomaly detection, contributing valuable perspectives to the field of time series analysis. The study underscores the importance of selecting appropriate dimensionality reduction strategies based on specific model requirements and dataset characteristics, paving the way for more efficient, accurate, and scalable solutions in anomaly detection.  ( 3 min )
    Hyperparameter Tuning MLPs for Probabilistic Time Series Forecasting
    arXiv:2403.04477v1 Announce Type: new Abstract: Time series forecasting attempts to predict future events by analyzing past trends and patterns. Although well researched, certain critical aspects pertaining to the use of deep learning in time series forecasting remain ambiguous. Our research primarily focuses on examining the impact of specific hyperparameters related to time series, such as context length and validation strategy, on the performance of the state-of-the-art MLP model in time series forecasting. We have conducted a comprehensive series of experiments involving 4800 configurations per dataset across 20 time series forecasting datasets, and our findings demonstrate the importance of tuning these parameters. Furthermore, in this work, we introduce the largest metadataset for timeseries forecasting to date, named TSBench, comprising 97200 evaluations, which is a twentyfold increase compared to previous works in the field. Finally, we demonstrate the utility of the created metadataset on multi-fidelity hyperparameter optimization tasks.  ( 2 min )
    Explainable AI for Embedded Systems Design: A Case Study of Static Redundant NVM Memory Write Prediction
    arXiv:2403.04337v1 Announce Type: new Abstract: This paper investigates the application of eXplainable Artificial Intelligence (XAI) in the design of embedded systems using machine learning (ML). As a case study, it addresses the challenging problem of static silent store prediction. This involves identifying redundant memory writes based only on static program features. Eliminating such stores enhances performance and energy efficiency by reducing memory access and bus traffic, especially in the presence of emerging non-volatile memory technologies. To achieve this, we propose a methodology consisting of: 1) the development of relevant ML models for explaining silent store prediction, and 2) the application of XAI to explain these models. We employ two state-of-the-art model-agnostic XAI methods to analyze the causes of silent stores. Through the case study, we evaluate the effectiveness of the methods. We find that these methods provide explanations for silent store predictions, which are consistent with known causes of silent store occurrences from previous studies. Typically, this allows us to confirm the prevalence of silent stores in operations that write the zero constant into memory, or the absence of silent stores in operations involving loop induction variables. This suggests the potential relevance of XAI in analyzing ML models' decision in embedded system design. From the case study, we share some valuable insights and pitfalls we encountered. More generally, this study aims to lay the groundwork for future research in the emerging field of XAI for embedded system design.  ( 3 min )
    Mastering Memory Tasks with World Models
    arXiv:2403.04253v1 Announce Type: new Abstract: Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies. This limits their ability to effectively solve tasks involving extended time gaps between actions and outcomes, or tasks demanding the recalling of distant observations to inform current actions. To improve temporal coherence, we integrate a new family of state space models (SSMs) in world models of MBRL agents to present a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I not only establishes a new state-of-the-art for challenging memory and credit assignment RL tasks, such as BSuite and POPGym, but also showcases superhuman performance in the complex memory domain of Memory Maze. At the same time, it upholds comparable performance in classic RL tasks, such as Atari and DMC, suggesting the generality of our method. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence.  ( 2 min )
    On-demand Quantization for Green Federated Generative Diffusion in Mobile Edge Networks
    arXiv:2403.04430v1 Announce Type: new Abstract: Generative Artificial Intelligence (GAI) shows remarkable productivity and creativity in Mobile Edge Networks, such as the metaverse and the Industrial Internet of Things. Federated learning is a promising technique for effectively training GAI models in mobile edge networks due to its data distribution. However, there is a notable issue with communication consumption when training large GAI models like generative diffusion models in mobile edge networks. Additionally, the substantial energy consumption associated with training diffusion-based models, along with the limited resources of edge devices and complexities of network environments, pose challenges for improving the training efficiency of GAI models. To address this challenge, we propose an on-demand quantized energy-efficient federated diffusion approach for mobile edge networks. Specifically, we first design a dynamic quantized federated diffusion training scheme considering various demands from the edge devices. Then, we study an energy efficiency problem based on specific quantization requirements. Numerical results show that our proposed method significantly reduces system energy consumption and transmitted model size compared to both baseline federated diffusion and fixed quantized federated diffusion methods while effectively maintaining reasonable quality and diversity of generated data.  ( 2 min )
    RATSF: Empowering Customer Service Volume Management through Retrieval-Augmented Time-Series Forecasting
    arXiv:2403.04180v1 Announce Type: new Abstract: An efficient customer service management system hinges on precise forecasting of service volume. In this scenario, where data non-stationarity is pronounced, successful forecasting heavily relies on identifying and leveraging similar historical data rather than merely summarizing periodic patterns. Existing models based on RNN or Transformer architectures often struggle with this flexible and effective utilization. To address this challenge, we propose an efficient and adaptable cross-attention module termed RACA, which effectively leverages historical segments in forecasting task, and we devised a precise representation scheme for querying historical sequences, coupled with the design of a knowledge repository. These critical components collectively form our Retrieval-Augmented Temporal Sequence Forecasting framework (RATSF). RATSF not only significantly enhances performance in the context of Fliggy hotel service volume forecasting but, more crucially, can be seamlessly integrated into other Transformer-based time-series forecasting models across various application scenarios. Extensive experimentation has validated the effectiveness and generalizability of this system design across multiple diverse contexts.  ( 2 min )
    Why Online Reinforcement Learning is Causal
    arXiv:2403.04221v1 Announce Type: new Abstract: Reinforcement learning (RL) and causal modelling naturally complement each other. The goal of causal modelling is to predict the effects of interventions in an environment, while the goal of reinforcement learning is to select interventions that maximize the rewards the agent receives from the environment. Reinforcement learning includes the two most powerful sources of information for estimating causal relationships: temporal ordering and the ability to act on an environment. This paper examines which reinforcement learning settings we can expect to benefit from causal modelling, and how. In online learning, the agent has the ability to interact directly with their environment, and learn from exploring it. Our main argument is that in online learning, conditional probabilities are causal, and therefore offline RL is the setting where causal learning has the most potential to make a difference. Essentially, the reason is that when an agent learns from their {\em own} experience, there are no unobserved confounders that influence both the agent's own exploratory actions and the rewards they receive. Our paper formalizes this argument. For offline RL, where an agent may and typically does learn from the experience of {\em others}, we describe previous and new methods for leveraging a causal model, including support for counterfactual queries.  ( 2 min )
    Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations
    arXiv:2403.04050v1 Announce Type: new Abstract: Reinforcement learning (RL) has achieved phenomenal success in various domains. However, its data-driven nature also introduces new vulnerabilities that can be exploited by malicious opponents. Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage. Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternatively train the agent's policy and the attacker's policy. However, the former does not provide sufficient protection against strong attacks, while the latter is computationally prohibitive for large environments. In this work, we propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states. This approach is further enhanced with belief state inference and diffusion-based state purification to reduce uncertainty. Empirical results show that our approach obtains superb performance under strong attacks and has a comparable training overhead with regularization-based methods. Our code is available at https://github.com/SliencerX/Belief-enriched-robust-Q-learning.  ( 2 min )
    Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
    arXiv:2403.04190v1 Announce Type: new Abstract: The recent surge in research focused on generating synthetic data from large language models (LLMs), especially for scenarios with limited data availability, marks a notable shift in Generative Artificial Intelligence (AI). Their ability to perform comparably to real-world data positions this approach as a compelling solution to low-resource challenges. This paper delves into advanced technologies that leverage these gigantic LLMs for the generation of task-specific training data. We outline methodologies, evaluation techniques, and practical applications, discuss the current limitations, and suggest potential pathways for future research.  ( 2 min )
    Online Adaptation of Language Models with a Memory of Amortized Contexts
    arXiv:2403.04317v1 Announce Type: new Abstract: Due to the rapid generation and dissemination of information, large language models (LLMs) quickly run out of date despite enormous development costs. Due to this crucial need to keep models updated, online learning has emerged as a critical necessity when utilizing LLMs for real-world applications. However, given the ever-expanding corpus of unseen documents and the large parameter space of modern LLMs, efficient adaptation is essential. To address these challenges, we propose Memory of Amortized Contexts (MAC), an efficient and effective online adaptation framework for LLMs with strong knowledge retention. We propose an amortized feature extraction and memory-augmentation approach to compress and extract information from new documents into compact modulations stored in a memory bank. When answering questions, our model attends to and extracts relevant knowledge from this memory bank. To learn informative modulations in an efficient manner, we utilize amortization-based meta-learning, which substitutes the optimization process with a single forward pass of the encoder. Subsequently, we learn to choose from and aggregate selected documents into a single modulation by conditioning on the question, allowing us to adapt a frozen language model during test time without requiring further gradient updates. Our experiment demonstrates the superiority of MAC in multiple aspects, including online adaptation performance, time, and memory efficiency. Code is available at: https://github.com/jihoontack/MAC.  ( 3 min )
    Online Learning with Unknown Constraints
    arXiv:2403.04033v1 Announce Type: new Abstract: We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that captures the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.  ( 2 min )
    Regularized DeepIV with Model Selection
    arXiv:2403.04236v1 Announce Type: new Abstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring minimax computation oracle, which is highly unstable in practice; (3) absence of model selection procedure. In this paper, we present the first method and analysis that can avoid all three limitations, while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method called Regularized DeepIV (RDIV) regression that can converge to the least-norm IV solution. Our method consists of two stages: first, we learn the conditional distribution of covariates, and by utilizing the learned distribution, we learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method allows model selection procedures that can achieve the oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov regularized variant of the popular DeepIV method with a non-parametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization which was absent from the original work.  ( 2 min )
    FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated Learning
    arXiv:2403.04146v1 Announce Type: new Abstract: Federated learning (FL) is a promising approach for learning a model from data distributed on massive clients without exposing data privacy. It works effectively in the ideal federation where clients share homogeneous data distribution and learning behavior. However, FL may fail to function appropriately when the federation is not ideal, amid an unhealthy state called Negative Federated Learning (NFL), in which most clients gain no benefit from participating in FL. Many studies have tried to address NFL. However, their solutions either (1) predetermine to prevent NFL in the entire learning life-cycle or (2) tackle NFL in the aftermath of numerous learning rounds. Thus, they either (1) indiscriminately incur extra costs even if FL can perform well without such costs or (2) waste numerous learning rounds. Additionally, none of the previous work takes into account the clients who may be unwilling/unable to follow the proposed NFL solutions when using those solutions to upgrade an FL system in use. This paper introduces FL-GUARD, a holistic framework that can be employed on any FL system for tackling NFL in a run-time paradigm. That is, to dynamically detect NFL at the early stage (tens of rounds) of learning and then to activate recovery measures when necessary. Specifically, we devise a cost-effective NFL detection mechanism, which relies on an estimation of performance gain on clients. Only when NFL is detected, we activate the NFL recovery process, in which each client learns in parallel an adapted model when training the global model. Extensive experiment results confirm the effectiveness of FL-GUARD in detecting NFL and recovering from NFL to a healthy learning state. We also show that FL-GUARD is compatible with previous NFL solutions and robust against clients unwilling/unable to take any recovery measures.  ( 3 min )
    SWAP-NAS: Sample-Wise Activation Patterns For Ultra-Fast NAS
    arXiv:2403.04161v1 Announce Type: new Abstract: Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid resource-intensive neural network training, especially in Neural Architecture Search (NAS). Recent studies show that existing training-free metrics have several limitations, such as limited correlation and poor generalisation across different search spaces and tasks. Hence, we propose Sample-Wise Activation Patterns and its derivative, SWAP-Score, a novel high-performance training-free metric. It measures the expressivity of networks over a batch of input samples. The SWAP-Score is strongly correlated with ground-truth performance across various search spaces and tasks, outperforming 15 existing training-free metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. The SWAP-Score can be further enhanced by regularisation, which leads to even higher correlations in cell-based search space and enables model size control during the search. For example, Spearman's rank correlation coefficient between regularised SWAP-Score and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90, significantly higher than 0.80 from the second-best metric, NWOT. When integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and 9 minutes of GPU time respectively.  ( 2 min )
    HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning
    arXiv:2403.04207v1 Announce Type: new Abstract: Federated Learning (FL) is a practical approach to train deep learning models collaboratively across user-end devices, protecting user privacy by retaining raw data on-device. In FL, participating user-end devices are highly fragmented in terms of hardware and software configurations. Such fragmentation introduces a new type of data heterogeneity in FL, namely \textit{system-induced data heterogeneity}, as each device generates distinct data depending on its hardware and software configurations. In this paper, we first characterize the impact of system-induced data heterogeneity on FL model performance. We collect a dataset using heterogeneous devices with variations across vendors and performance tiers. By using this dataset, we demonstrate that \textit{system-induced data heterogeneity} negatively impacts accuracy, and deteriorates fairness and domain generalization problems in FL. To address these challenges, we propose HeteroSwitch, which adaptively adopts generalization techniques (i.e., ISP transformation and SWAD) depending on the level of bias caused by varying HW and SW configurations. In our evaluation with a realistic FL dataset (FLAIR), HeteroSwitch reduces the variance of averaged precision by 6.3\% across device types.  ( 2 min )
    GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
    arXiv:2403.04206v1 Announce Type: new Abstract: We study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards the center variable computed as a weighted average of workers, where the weights are inversely proportional to the gradient norms of the workers such that recovering the flat regions in the optimization landscape is prioritized. We develop two asynchronous variants of the proposed algorithm that we call Model-level and Layer-level Gradient-based Weighted Averaging (resp. MGRAWA and LGRAWA), which differ in terms of the weighting scheme that is either done with respect to the entire model or is applied layer-wise. On the theoretical front, we prove the convergence guarantee for the proposed approach in both convex and non-convex settings. We then experimentally demonstrate that our algorithms outperform the competitor methods by achieving faster convergence and recovering better quality and flatter local optima. We also carry out an ablation study to analyze the scalability of the proposed algorithms in more crowded distributed training environments. Finally, we report that our approach requires less frequent communication and fewer distributed updates compared to the state-of-the-art baselines.  ( 2 min )
    Automated Multi-Task Learning for Joint Disease Prediction on Electronic Health Records
    arXiv:2403.04086v1 Announce Type: new Abstract: In the realm of big data and digital healthcare, Electronic Health Records (EHR) have become a rich source of information with the potential to improve patient care and medical research. In recent years, machine learning models have proliferated for analyzing EHR data to predict patients future health conditions. Among them, some studies advocate for multi-task learning (MTL) to jointly predict multiple target diseases for improving the prediction performance over single task learning. Nevertheless, current MTL frameworks for EHR data have significant limitations due to their heavy reliance on human experts to identify task groups for joint training and design model architectures. To reduce human intervention and improve the framework design, we propose an automated approach named AutoDP, which can search for the optimal configuration of task grouping and architectures simultaneously. To tackle the vast joint search space encompassing task combinations and architectures, we employ surrogate model-based optimization, enabling us to efficiently discover the optimal solution. Experimental results on real-world EHR data demonstrate the efficacy of the proposed AutoDP framework. It achieves significant performance improvements over both hand-crafted and automated state-of-the-art methods, also maintains a feasible search cost at the same time.  ( 2 min )
    Noisy Spiking Actor Network for Exploration
    arXiv:2403.04162v1 Announce Type: new Abstract: As a general method for exploration in deep reinforcement learning (RL), NoisyNet can produce problem-specific exploration strategies. Spiking neural networks (SNNs), due to their binary firing mechanism, have strong robustness to noise, making it difficult to realize efficient exploration with local disturbances. To solve this exploration problem, we propose a noisy spiking actor network (NoisySAN) that introduces time-correlated noise during charging and transmission. Moreover, a noise reduction method is proposed to find a stable policy for the agent. Extensive experimental results demonstrate that our method outperforms the state-of-the-art performance on a wide range of continuous control tasks from OpenAI gym.  ( 2 min )
    Improving Adversarial Training using Vulnerability-Aware Perturbation Budget
    arXiv:2403.04070v1 Announce Type: new Abstract: Adversarial Training (AT) effectively improves the robustness of Deep Neural Networks (DNNs) to adversarial attacks. Generally, AT involves training DNN models with adversarial examples obtained within a pre-defined, fixed perturbation bound. Notably, individual natural examples from which these adversarial examples are crafted exhibit varying degrees of intrinsic vulnerabilities, and as such, crafting adversarial examples with fixed perturbation radius for all instances may not sufficiently unleash the potency of AT. Motivated by this observation, we propose two simple, computationally cheap vulnerability-aware reweighting functions for assigning perturbation bounds to adversarial examples used for AT, named Margin-Weighted Perturbation Budget (MWPB) and Standard-Deviation-Weighted Perturbation Budget (SDWPB). The proposed methods assign perturbation radii to individual adversarial samples based on the vulnerability of their corresponding natural examples. Experimental results show that the proposed methods yield genuine improvements in the robustness of AT algorithms against various adversarial attacks.  ( 2 min )
    Directional Smoothness and Gradient Methods: Convergence and Adaptivity
    arXiv:2403.04081v1 Announce Type: new Abstract: We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness.  ( 2 min )
    Many-Objective Multi-Solution Transport
    arXiv:2403.04099v1 Announce Type: new Abstract: Optimizing the performance of many objectives (instantiated by tasks or clients) jointly with a few Pareto stationary solutions (models) is critical in machine learning. However, previous multi-objective optimization methods often focus on a few number of objectives and cannot scale to many objectives that outnumber the solutions, leading to either subpar performance or ignored objectives. We introduce Many-objective multi-solution Transport (MosT), a framework that finds multiple diverse solutions in the Pareto front of many objectives. Our insight is to seek multiple solutions, each performing as a domain expert and focusing on a specific subset of objectives while collectively covering all of them. MosT formulates the problem as a bi-level optimization of weighted objectives for each solution, where the weights are defined by an optimal transport between the objectives and solutions. Our algorithm ensures convergence to Pareto stationary solutions for complementary subsets of objectives. On a range of applications in federated learning, multi-task learning, and mixture-of-prompt learning for LLMs, MosT distinctly outperforms strong baselines, delivering high-quality, diverse solutions that profile the entire Pareto frontier, thus ensuring balanced trade-offs across many objectives.  ( 2 min )
    Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference
    arXiv:2403.04082v1 Announce Type: new Abstract: Given time series data, how can we answer questions like "what will happen in the future?" and "how did we get here?" These sorts of probabilistic inference questions are challenging when observations are high-dimensional. In this paper, we show how these questions can have compact, closed form solutions in terms of learned representations. The key idea is to apply a variant of contrastive learning to time series data. Prior work already shows that the representations learned by contrastive learning encode a probability ratio. By extending prior work to show that the marginal distribution over representations is Gaussian, we can then prove that joint distribution of representations is also Gaussian. Taken together, these results show that representations learned via temporal contrastive learning follow a Gauss-Markov chain, a graphical model where inference (e.g., prediction, planning) over representations corresponds to inverting a low-dimensional matrix. In one special case, inferring intermediate representations will be equivalent to interpolating between the learned representations. We validate our theory using numerical simulations on tasks up to 46-dimensions.  ( 2 min )
    Sample size planning for conditional counterfactual mean estimation with a K-armed randomized experiment
    arXiv:2403.04039v1 Announce Type: new Abstract: We cover how to determine a sufficiently large sample size for a $K$-armed randomized experiment in order to estimate conditional counterfactual expectations in data-driven subgroups. The sub-groups can be output by any feature space partitioning algorithm, including as defined by binning users having similar predictive scores or as defined by a learned policy tree. After carefully specifying the inference target, a minimum confidence level, and a maximum margin of error, the key is to turn the original goal into a simultaneous inference problem where the recommended sample size to offset an increased possibility of estimation error is directly related to the number of inferences to be conducted. Given a fixed sample size budget, our result allows us to invert the question to one about the feasible number of treatment arms or partition complexity (e.g. number of decision tree leaves). Using policy trees to learn sub-groups, we evaluate our nominal guarantees on a large publicly-available randomized experiment test data set.  ( 2 min )
    OCD-FL: A Novel Communication-Efficient Peer Selection-based Decentralized Federated Learning
    arXiv:2403.04037v1 Announce Type: new Abstract: The conjunction of edge intelligence and the ever-growing Internet-of-Things (IoT) network heralds a new era of collaborative machine learning, with federated learning (FL) emerging as the most prominent paradigm. With the growing interest in these learning schemes, researchers started addressing some of their most fundamental limitations. Indeed, conventional FL with a central aggregator presents a single point of failure and a network bottleneck. To bypass this issue, decentralized FL where nodes collaborate in a peer-to-peer network has been proposed. Despite the latter's efficiency, communication costs and data heterogeneity remain key challenges in decentralized FL. In this context, we propose a novel scheme, called opportunistic communication-efficient decentralized federated learning, a.k.a., OCD-FL, consisting of a systematic FL peer selection for collaboration, aiming to achieve maximum FL knowledge gain while reducing energy consumption. Experimental results demonstrate the capability of OCD-FL to achieve similar or better performances than the fully collaborative FL, while significantly reducing consumed energy by at least 30% and up to 80%.  ( 2 min )
    Unsupervised Contrastive Learning for Robust RF Device Fingerprinting Under Time-Domain Shift
    arXiv:2403.04036v1 Announce Type: new Abstract: Radio Frequency (RF) device fingerprinting has been recognized as a potential technology for enabling automated wireless device identification and classification. However, it faces a key challenge due to the domain shift that could arise from variations in the channel conditions and environmental settings, potentially degrading the accuracy of RF-based device classification when testing and training data is collected in different domains. This paper introduces a novel solution that leverages contrastive learning to mitigate this domain shift problem. Contrastive learning, a state-of-the-art self-supervised learning approach from deep learning, learns a distance metric such that positive pairs are closer (i.e. more similar) in the learned metric space than negative pairs. When applied to RF fingerprinting, our model treats RF signals from the same transmission as positive pairs and those from different transmissions as negative pairs. Through experiments on wireless and wired RF datasets collected over several days, we demonstrate that our contrastive learning approach captures domain-invariant features, diminishing the effects of domain-specific variations. Our results show large and consistent improvements in accuracy (10.8\% to 27.8\%) over baseline models, thus underscoring the effectiveness of contrastive learning in improving device classification under domain shift.  ( 2 min )
    Can LLMs Follow Simple Rules?
    arXiv:2311.04235v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attacks and defenses on LLMs generally require either expensive manual review or unreliable heuristic checks. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. Each scenario has a programmatic evaluation function to determine whether the model has broken any rules in a conversation. Our evaluations of proprietary and open models show that almost all current models struggle to follow scenario rules, even on straightforward test cases. We also demonstrate that simple optimization attacks suffice to significantly increase failure rates on test cases. We conclude by exploring two potential avenues for improvement: test-time steering and supervised fine-tuning.  ( 2 min )
    VeCLIP: Improving CLIP Training via Visual-enriched Captions
    arXiv:2310.07699v2 Announce Type: replace-cross Abstract: Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods utilizing large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for noisy caption rewriting. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed as Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Employing this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples named VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves +3% gain while only using 14% of the data employed in the vanilla CLIP and 11% in ALIGN. We also note the VeCap data is complementary with other well curated datasets good for zero-shot classification tasks. When combining VeCap and DFN, our model can achieve strong performance on both of image-text retrieval and zero-shot classification tasks, e.g. 83.1% accuracy@1 on ImageNet zero-shot for a H/14 model. We release the pre-trained models at https://github.com/apple/ml-veclip.  ( 3 min )
    Intrinsic Language-Guided Exploration for Complex Long-Horizon Robotic Manipulation Tasks
    arXiv:2309.16347v2 Announce Type: replace-cross Abstract: Current reinforcement learning algorithms struggle in sparse and complex environments, most notably in long-horizon manipulation tasks entailing a plethora of different sequences. In this work, we propose the Intrinsically Guided Exploration from Large Language Models (IGE-LLMs) framework. By leveraging LLMs as an assistive intrinsic reward, IGE-LLMs guides the exploratory process in reinforcement learning to address intricate long-horizon with sparse rewards robotic manipulation tasks. We evaluate our framework and related intrinsic learning methods in an environment challenged with exploration, and a complex robotic manipulation task challenged by both exploration and long-horizons. Results show IGE-LLMs (i) exhibit notably higher performance over related intrinsic methods and the direct use of LLMs in decision-making, (ii) can be combined and complement existing learning methods highlighting its modularity, (iii) are fairly insensitive to different intrinsic scaling parameters, and (iv) maintain robustness against increased levels of uncertainty and horizons.  ( 2 min )
    PointSSC: A Cooperative Vehicle-Infrastructure Point Cloud Benchmark for Semantic Scene Completion
    arXiv:2309.12708v2 Announce Type: replace-cross Abstract: Semantic Scene Completion (SSC) aims to jointly generate space occupancies and semantic labels for complex 3D scenes. Most existing SSC models focus on volumetric representations, which are memory-inefficient for large outdoor spaces. Point clouds provide a lightweight alternative but existing benchmarks lack outdoor point cloud scenes with semantic labels. To address this, we introduce PointSSC, the first cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion. These scenes exhibit long-range perception and minimal occlusion. We develop an automated annotation pipeline leveraging Semantic Segment Anything to efficiently assign semantics. To benchmark progress, we propose a LiDAR-based model with a Spatial-Aware Transformer for global and local feature extraction and a Completion and Segmentation Cooperative Module for joint completion and segmentation. PointSSC provides a challenging testbed to drive advances in semantic point cloud completion for real-world navigation. The code and datasets are available at https://github.com/yyxssm/PointSSC.  ( 2 min )
    Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL
    arXiv:2309.06553v4 Announce Type: replace-cross Abstract: In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference when the golden answer is unavailable. Concurrently, learning via interactions with the LLMs to navigate the expansive natural language prompting space proves to be resource-intensive. To address this, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exists as by-products when diverse prompts are benchmarked on open-accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model. This model can evaluate any query-prompt pairs without accessing LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.  ( 2 min )
    Crystal Structure Prediction by Joint Equivariant Diffusion
    arXiv:2309.04475v2 Announce Type: replace-cross Abstract: Crystal Structure Prediction (CSP) is crucial in various scientific disciplines. While CSP can be addressed by employing currently-prevailing generative models (e.g. diffusion models), this task encounters unique challenges owing to the symmetric geometry of crystal structures -- the invariance of translation, rotation, and periodicity. To incorporate the above symmetries, this paper proposes DiffCSP, a novel diffusion model to learn the structure distribution from stable crystals. To be specific, DiffCSP jointly generates the lattice and atom coordinates for each crystal by employing a periodic-E(3)-equivariant denoising model, to better model the crystal geometry. Notably, different from related equivariant generative approaches, DiffCSP leverages fractional coordinates other than Cartesian coordinates to represent crystals, remarkably promoting the diffusion and the generation process of atom positions. Extensive experiments verify that our DiffCSP significantly outperforms existing CSP methods, with a much lower computation cost in contrast to DFT-based methods. Moreover, the superiority of DiffCSP is also observed when it is extended for ab initio crystal generation.  ( 2 min )
    A correlation-based fuzzy cluster validity index with secondary options detector
    arXiv:2308.14785v3 Announce Type: replace-cross Abstract: The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter $m$ is set to a large value. Our R package called UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.  ( 3 min )
    Large Skew-t Copula Models and Asymmetric Dependence in Intraday Equity Returns
    arXiv:2308.05564v2 Announce Type: replace-cross Abstract: Skew-t copula models are attractive for the modeling of financial data because they allow for asymmetric and extreme tail dependence. We show that the copula implicit in the skew-t distribution of Azzalini and Capitanio (2003) allows for a higher level of pairwise asymmetric dependence than two popular alternative skew-t copulas. Estimation of this copula in high dimensions is challenging, and we propose a fast and accurate Bayesian variational inference (VI) approach to do so. The method uses a conditionally Gaussian generative representation of the skew-t distribution to define an augmented posterior that can be approximated accurately. A fast stochastic gradient ascent algorithm is used to solve the variational optimization. The new methodology is used to estimate skew-t factor copula models for intraday returns from 2017 to 2021 on 93 U.S. equities. The copula captures substantial heterogeneity in asymmetric dependence over equity pairs, in addition to the variability in pairwise correlations. We show that intraday predictive densities from the skew-t copula are more accurate than from some other copula models, while portfolio selection strategies based on the estimated pairwise tail dependencies improve performance relative to the benchmark index.  ( 2 min )
    Predicting small molecules solubilities on endpoint devices using deep ensemble neural networks
    arXiv:2307.05318v4 Announce Type: replace-cross Abstract: Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is usable at https://mol.dev.  ( 2 min )
    Generative Adversarial Networks for Stain Normalisation in Histopathology
    arXiv:2308.02851v2 Announce Type: replace-cross Abstract: The rapid growth of digital pathology in recent years has provided an ideal opportunity for the development of artificial intelligence-based tools to improve the accuracy and efficiency of clinical diagnoses. One of the significant roadblocks to current research is the high level of visual variability across digital pathology images, causing models to generalise poorly to unseen data. Stain normalisation aims to standardise the visual profile of digital pathology images without changing the structural content of the images. In this chapter, we explore different techniques which have been used for stain normalisation in digital pathology, with a focus on approaches which utilise generative adversarial networks (GANs). Typically, GAN-based methods outperform non-generative approaches but at the cost of much greater computational requirements. However, it is not clear which method is best for stain normalisation in general, with different GAN and non-GAN approaches outperforming each other in different scenarios and according to different performance metrics. This is an ongoing field of study as researchers aim to identify a method which efficiently and effectively normalises pathology images to make AI models more robust and generalisable.  ( 2 min )
    Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework
    arXiv:2307.01715v3 Announce Type: replace-cross Abstract: Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.  ( 3 min )
    Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
    arXiv:2306.16384v2 Announce Type: replace-cross Abstract: Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, training them on large-scale graphs remains a significant challenge due to lack of efficient data access and data movement methods. Existing frameworks for training GNNs use CPUs for graph sampling and feature aggregation, while the training and updating of model weights are executed on GPUs. However, our in-depth profiling shows the CPUs cannot achieve the throughput required to saturate GNN model training throughput, causing gross under-utilization of expensive GPU resources. Furthermore, when the graph and its embeddings do not fit in the CPU memory, the overhead introduced by the operating system, say for handling page-faults, comes in the critical path of execution. To address these issues, we propose the GPU Initiated Direct Storage Access (GIDS) dataloader, to enable GPU-oriented GNN training for large-scale graphs while efficiently utilizing all hardware resources, such as CPU memory, storage, and GPU memory with a hybrid data placement strategy. By enabling GPU threads to fetch feature vectors directly from storage, GIDS dataloader solves the memory capacity problem for GPU-oriented GNN training. Moreover, GIDS dataloader leverages GPU parallelism to tolerate storage latency and eliminates expensive page-fault overhead. Doing so enables us to design novel optimizations for exploiting locality and increasing effective bandwidth for GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets shows that GIDS dataloader accelerates the overall DGL GNN training pipeline by up to 392X when compared to the current, state-of-the-art DGL dataloader.  ( 3 min )
    Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces
    arXiv:2305.12584v2 Announce Type: replace-cross Abstract: Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity for learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set by a linear combination of the extreme points of the subdifferential set, of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS that can transform the explicit representation of the solutions to a sparse kernel representation having fewer terms than the number of the observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter on sparsity of the regularized solutions. We further show that two specific RKBSs: the sequence space $\ell_1(\mathbb{N})$ and the measure space can have sparse representer theorems for both MNI and regularization models.  ( 2 min )
    Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative Models
    arXiv:2306.09251v3 Announce Type: replace-cross Abstract: Diffusion models, which convert noise into new data instances by learning to reverse a Markov diffusion process, have become a cornerstone in contemporary generative modeling. While their practical power has now been widely recognized, the theoretical underpinnings remain far from mature. In this work, we develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models in discrete time, assuming access to $\ell_2$-accurate estimates of the (Stein) score functions. For a popular deterministic sampler (based on the probability flow ODE), we establish a convergence rate proportional to $1/T$ (with $T$ the total number of steps), improving upon past results; for another mainstream stochastic sampler (i.e., a type of the denoising diffusion probabilistic model), we derive a convergence rate proportional to $1/\sqrt{T}$, matching the state-of-the-art theory. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results characterize how $\ell_2$ score estimation errors affect the quality of the data generation processes. In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach without resorting to toolboxes for SDEs and ODEs. Further, we design two accelerated variants, improving the convergence to $1/T^2$ for the ODE-based sampler and $1/T$ for the DDPM-type sampler, which might be of independent theoretical and empirical interest.  ( 3 min )
    Misspecification-robust Sequential Neural Likelihood for Simulation-based Inference
    arXiv:2301.13368v2 Announce Type: replace-cross Abstract: Simulation-based inference techniques are indispensable for parameter estimation of mechanistic and simulable models with intractable likelihoods. While traditional statistical approaches like approximate Bayesian computation and Bayesian synthetic likelihood have been studied under well-specified and misspecified settings, they often suffer from inefficiencies due to wasted model simulations. Neural approaches, such as sequential neural likelihood (SNL) avoid this wastage by utilising all model simulations to train a neural surrogate for the likelihood function. However, the performance of SNL under model misspecification is unreliable and can result in overconfident posteriors centred around an inaccurate parameter estimate. In this paper, we propose a novel SNL method, which through the incorporation of additional adjustment parameters, is robust to model misspecification and capable of identifying features of the data that the model is not able to recover. We demonstrate the efficacy of our approach through several illustrative examples, where our method gives more accurate point estimates and uncertainty quantification than SNL.  ( 2 min )
    Multi-modal learning for geospatial vegetation forecasting
    arXiv:2303.16198v2 Announce Type: replace-cross Abstract: The innovative application of precise geospatial vegetation forecasting holds immense potential across diverse sectors, including agriculture, forestry, humanitarian aid, and carbon accounting. To leverage the vast availability of satellite imagery for this task, various works have applied deep neural networks for predicting multispectral images in photorealistic quality. However, the important area of vegetation dynamics has not been thoroughly explored. Our study breaks new ground by introducing GreenEarthNet, the first dataset specifically designed for high-resolution vegetation forecasting, and Contextformer, a novel deep learning approach for predicting vegetation greenness from Sentinel 2 satellite images with fine resolution across Europe. Our multi-modal transformer model Contextformer leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches incorporating meteorological time series in a parameter-efficient manner. The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling. It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021, enabling cross-dataset model comparisons. Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques. This includes surpassing previous state-of-the-art models on EarthNet2021, as well as adapted models from time series forecasting and video prediction. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes.  ( 3 min )
    RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design
    arXiv:2301.10774v3 Announce Type: replace-cross Abstract: While artificial intelligence has made remarkable strides in revealing the relationship between biological macromolecules' primary sequence and tertiary structure, designing RNA sequences based on specified tertiary structures remains challenging. Though existing approaches in protein design have thoroughly explored structure-to-sequence dependencies in proteins, RNA design still confronts difficulties due to structural complexity and data scarcity. Moreover, direct transplantation of protein design methodologies into RNA design fails to achieve satisfactory outcomes although sharing similar structural components. In this study, we aim to systematically construct a data-driven RNA design pipeline. We crafted a large, well-curated benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. More importantly, we proposed a hierarchical data-efficient representation learning framework that learns structural representations through contrastive learning at both cluster-level and sample-level to fully leverage the limited data. By constraining data representations within a limited hyperspherical space, the intrinsic relationships between data points could be explicitly imposed. Moreover, we incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process. Extensive experiments demonstrate the effectiveness of our proposed method, providing a reliable baseline for future RNA design tasks. The source code and benchmark dataset are available at https://github.com/A4Bio/RDesign.  ( 3 min )
    First-order penalty methods for bilevel optimization
    arXiv:2301.01716v2 Announce Type: replace-cross Abstract: In this paper we study a class of unconstrained and constrained bilevel optimization problems in which the lower level is a possibly nonsmooth convex optimization problem, while the upper level is a possibly nonconvex optimization problem. We introduce a notion of $\varepsilon$-KKT solution for them and show that an $\varepsilon$-KKT solution leads to an $O(\sqrt{\varepsilon})$- or $O(\varepsilon)$-hypergradient based stionary point under suitable assumptions. We also propose first-order penalty methods for finding an $\varepsilon$-KKT solution of them, whose subproblems turn out to be a structured minimax problem and can be suitably solved by a first-order method recently developed by the authors. Under suitable assumptions, an \emph{operation complexity} of $O(\varepsilon^{-4}\log\varepsilon^{-1})$ and $O(\varepsilon^{-7}\log\varepsilon^{-1})$, measured by their fundamental operations, is established for the proposed penalty methods for finding an $\varepsilon$-KKT solution of the unconstrained and constrained bilevel optimization problems, respectively. Preliminary numerical results are presented to illustrate the performance of our proposed methods. To the best of our knowledge, this paper is the first work to demonstrate that bilevel optimization can be approximately solved as minimax optimization, and moreover, it provides the first implementable method with complexity guarantees for such sophisticated bilevel optimization.  ( 2 min )
    FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations
    arXiv:2209.14399v2 Announce Type: replace-cross Abstract: In edge computing, users' service profiles are migrated due to user mobility. Reinforcement learning (RL) frameworks have been proposed to do so, often trained on simulated data. However, existing RL frameworks overlook occasional server failures, which although rare, impact latency-sensitive applications like autonomous driving and real-time obstacle detection. Nevertheless, these failures (rare events), being not adequately represented in historical training data, pose a challenge for data-driven RL algorithms. As it is impractical to adjust failure frequency in real-world applications for training, we introduce FIRE, a framework that adapts to rare events by training a RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. FIRE considers delay, migration, failure, and backup placement costs across individual and shared service profiles. We prove ImRE's boundedness and convergence to optimality. Next, we introduce novel deep Q-learning (ImDQL) and actor critic (ImACRE) versions of our algorithm to enhance scalability. We extend our framework to accommodate users with varying risk tolerances. Through trace driven experiments, we show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.  ( 3 min )
    Climbing Routes Clustering Using Energy-Efficient Accelerometers Attached to the Quickdraws
    arXiv:2211.02680v2 Announce Type: replace-cross Abstract: One of the challenges for climbing gyms is to find out popular routes for the climbers to improve their services and optimally use their infrastructure. This problem must be addressed preserving both the privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence becoming practical in terms of expenses and time consumption for replacement when used in large quantities in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detect patterns in data during climbing different routes, and develops an unsupervised approach for route clustering.  ( 2 min )
    Regression modelling of spatiotemporal extreme U.S. wildfires via partially-interpretable neural networks
    arXiv:2208.07581v4 Announce Type: replace-cross Abstract: Risk management in many environmental settings requires an understanding of the mechanisms that drive extreme events. Useful metrics for quantifying such risk are extreme quantiles of response variables conditioned on predictor variables that describe, e.g., climate, biosphere and environmental states. Typically these quantiles lie outside the range of observable data and so, for estimation, require specification of parametric extreme value models within a regression framework. Classical approaches in this context utilise linear or additive relationships between predictor and response variables and suffer in either their predictive capabilities or computational efficiency; moreover, their simplicity is unlikely to capture the truly complex structures that lead to the creation of extreme wildfires. In this paper, we propose a new methodological framework for performing extreme quantile regression using artificial neutral networks, which are able to capture complex non-linear relationships and scale well to high-dimensional data. The "black box" nature of neural networks means that they lack the desirable trait of interpretability often favoured by practitioners; thus, we unify linear, and additive, regression methodology with deep learning to create partially-interpretable neural networks that can be used for statistical inference but retain high prediction accuracy. To complement this methodology, we further propose a novel point process model for extreme values which overcomes the finite lower-endpoint problem associated with the generalised extreme value class of distributions. Efficacy of our unified framework is illustrated on U.S. wildfire data with a high-dimensional predictor set and we illustrate vast improvements in predictive performance over linear and spline-based regression techniques.  ( 3 min )
    Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation
    arXiv:2109.14200v2 Announce Type: replace-cross Abstract: Decades of research has studied how language learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations yet remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, and without the units being proximal learning targets for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent that the audiovisual aspect of LLH is supported by the existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, and comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect phonetic, syllabic, or lexical structure of input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated...  ( 3 min )
    Standard Gaussian Process is All You Need for High-Dimensional Bayesian Optimization
    arXiv:2402.02746v2 Announce Type: replace Abstract: There has been a long-standing and widespread belief that Bayesian Optimization (BO) with standard Gaussian process (GP), referred to as standard BO, is ineffective in high-dimensional optimization problems. This perception may partly stem from the intuition that GPs struggle with high-dimensional inputs for covariance modeling and function estimation. While these concerns seem reasonable, empirical evidence supporting this belief is lacking. In this paper, we systematically investigated BO with standard GP regression across a variety of synthetic and real-world benchmark problems for high-dimensional optimization. Surprisingly, the performance with standard GP consistently ranks among the best, often outperforming existing BO methods specifically designed for high-dimensional optimization by a large margin. Contrary to the stereotype, we found that standard GP can serve as a capable surrogate for learning high-dimensional target functions. Without strong structural assumptions, BO with standard GP not only excels in high-dimensional optimization but also proves robust in accommodating various structures within the target functions. Furthermore, with standard GP, achieving promising optimization performance is possible by only using maximum likelihood estimation, eliminating the need for expensive Markov-Chain Monte Carlo (MCMC) sampling that might be required by more complex surrogate models. We thus advocate for a re-evaluation and in-depth study of the potential of standard BO in addressing high-dimensional problems.  ( 3 min )
    On the consistency of supervised learning with missing values
    arXiv:1902.06931v4 Announce Type: replace-cross Abstract: In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is pointed at for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data,through multiple imputation.Finally, to compare imputation with learning directly with a model that accounts for missing values, we analyze further decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" method as it can handle both non-informative and informative missing values.  ( 3 min )
    Can Large Language Models Replace Economic Choice Prediction Labs?
    arXiv:2401.17435v3 Announce Type: replace Abstract: Economic choice prediction is an essential challenging task, often constrained by the difficulties in acquiring human choice data. Indeed, experimental economics studies had focused mostly on simple choice settings. The AI community has recently contributed to that effort in two ways: considering whether LLMs can substitute for humans in the above-mentioned simple choice prediction settings, and the study through ML lens of more elaborated but still rigorous experimental economics settings, employing incomplete information, repetitive play, and natural language communication, notably language-based persuasion games. This leaves us with a major inspiration: can LLMs be used to fully simulate the economic environment and generate data for efficient human choice prediction, substituting for the elaborated economic lab studies? We pioneer the study of this subject, demonstrating its feasibility. In particular, we show that a model trained solely on LLM-generated data can effectively predict human behavior in a language-based persuasion game, and can even outperform models trained on actual human data.  ( 2 min )
    Player Pressure Map -- A Novel Representation of Pressure in Soccer for Evaluating Player Performance in Different Game Contexts
    arXiv:2401.16235v2 Announce Type: replace Abstract: In soccer, contextual player performance metrics are invaluable to coaches. For example, the ability to perform under pressure during matches distinguishes the elite from the average. Appropriate pressure metric enables teams to assess players' performance accurately under pressure and design targeted training scenarios to address their weaknesses. The primary objective of this paper is to leverage both tracking and event data and game footage to capture the pressure experienced by the possession team in a soccer game scene. We propose a player pressure map to represent a given game scene, which lowers the dimension of raw data and still contains rich contextual information. Not only does it serve as an effective tool for visualizing and evaluating the pressure on the team and each individual, but it can also be utilized as a backbone for accessing players' performance. Overall, our model provides coaches and analysts with a deeper understanding of players' performance under pressure so that they make data-oriented tactical decisions.  ( 2 min )
    Transformer Multivariate Forecasting: Less is More?
    arXiv:2401.00230v2 Announce Type: replace Abstract: In the domain of multivariate forecasting, transformer models stand out as powerful apparatus, displaying exceptional capabilities in handling messy datasets from real-world contexts. However, the inherent complexity of these datasets, characterized by numerous variables and lengthy temporal sequences, poses challenges, including increased noise and extended model runtime. This paper focuses on reducing redundant information to elevate forecasting accuracy while optimizing runtime efficiency. We propose a novel transformer forecasting framework enhanced by Principal Component Analysis (PCA) to tackle this challenge. The framework is evaluated by five state-of-the-art (SOTA) models and four diverse real-world datasets. Our experimental results demonstrate the framework's ability to minimize prediction errors across all models and datasets while significantly reducing runtime. From the model perspective, one of the PCA-enhanced models: PCA+Crossformer, reduces mean square errors (MSE) by 33.3% and decreases runtime by 49.2% on average. From the dataset perspective, the framework delivers 14.3% MSE and 76.6% runtime reduction on Electricity datasets, as well as 4.8% MSE and 86.9% runtime reduction on Traffic datasets. This study aims to advance various SOTA models and enhance transformer-based time series forecasting for intricate data. Code is available at: https://github.com/jingjing-unilu/PCA_Transformer.  ( 2 min )
    Graph Contrastive Invariant Learning from the Causal Perspective
    arXiv:2401.12564v2 Announce Type: replace Abstract: Graph contrastive learning (GCL), learning the node representation by contrasting two augmented graphs in a self-supervised way, has attracted considerable attention. GCL is usually believed to learn the invariant representation. However, does this understanding always hold in practice? In this paper, we first study GCL from the perspective of causality. By analyzing GCL with the structural causal model (SCM), we discover that traditional GCL may not well learn the invariant representations due to the non-causal information contained in the graph. How can we fix it and encourage the current GCL to learn better invariant representations? The SCM offers two requirements and motives us to propose a novel GCL method. Particularly, we introduce the spectral graph augmentation to simulate the intervention upon non-causal factors. Then we design the invariance objective and independence objective to better capture the causal factors. Specifically, (i) the invariance objective encourages the encoder to capture the invariant information contained in causal variables, and (ii) the independence objective aims to reduce the influence of confounders on the causal variables. Experimental results demonstrate the effectiveness of our approach on node classification tasks.  ( 2 min )
    Value Explicit Pretraining for Learning Transferable Representations
    arXiv:2312.12339v2 Announce Type: replace Abstract: We propose Value Explicit Pretraining (VEP), a method that learns generalizable representations for transfer reinforcement learning. VEP enables learning of new tasks that share similar objectives as previously learned tasks, by learning an encoder for objective-conditioned representations, irrespective of appearance changes and environment dynamics. To pre-train the encoder from a sequence of observations, we use a self-supervised contrastive loss that results in learning temporally smooth representations. VEP learns to relate states across different tasks based on the Bellman return estimate that is reflective of task progress. Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods on the ability to generalize to unseen tasks. VEP achieves up to a 2 times improvement in rewards on Atari and visual navigation, and up to a 3 times improvement in sample efficiency. For videos of policy performance visit our https://sites.google.com/view/value-explicit-pretraining/  ( 2 min )
    The Selected-completely-at-random Complementary Label is a Practical Weak Supervision for Multi-class Classification
    arXiv:2311.15502v2 Announce Type: replace Abstract: Complementary-label learning is a weakly supervised learning problem in which each training example is associated with one or multiple complementary labels indicating the classes to which it does not belong. Existing consistent approaches have relied on the uniform distribution assumption to model the generation of complementary labels, or on an ordinary-label training set to estimate the transition matrix in non-uniform cases. However, either condition may not be satisfied in real-world scenarios. In this paper, we propose a novel consistent approach that does not rely on these conditions. Inspired by the positive-unlabeled (PU) learning literature, we propose an unbiased risk estimator based on the Selected Completely At Random assumption for complementary-label learning. We then introduce a risk-correction approach to address overfitting problems. Furthermore, we find that complementary-label learning can be expressed as a set of negative-unlabeled binary classification problems when using the one-versus-rest strategy. Extensive experimental results on both synthetic and real-world benchmark datasets validate the superiority of our proposed approach over state-of-the-art methods.  ( 2 min )
    Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO's 4000 TPU Months
    arXiv:2310.18191v2 Announce Type: replace Abstract: We analyze VeLO (versatile learned optimizer), the largest scale attempt to train a general purpose "foundational" optimizer to date. VeLO was trained on thousands of machine learning tasks using over 4000 TPU months with the goal of producing an optimizer capable of generalizing to new problems while being hyperparameter free, and outperforming industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite. We find that, contrary to initial claims: (1) VeLO has a critical hyperparameter that needs problem-specific tuning, (2) VeLO does not necessarily outperform competitors in quality of solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO's generality and the value of the investment in training it.  ( 2 min )
    On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning
    arXiv:2311.12688v2 Announce Type: replace Abstract: Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and how this combination effects out-of-distribution coverage; particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate prediction sets as a result of combining split conformal methods and neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.  ( 2 min )
    AI Hazard Management: A framework for the systematic management of root causes for AI risks
    arXiv:2310.16727v2 Announce Type: replace Abstract: Recent advancements in the field of Artificial Intelligence (AI) establish the basis to address challenging tasks. However, with the integration of AI, new risks arise. Therefore, to benefit from its advantages, it is essential to adequately handle the risks associated with AI. Existing risk management processes in related fields, such as software systems, need to sufficiently consider the specifics of AI. A key challenge is to systematically and transparently identify and address AI risks' root causes - also called AI hazards. This paper introduces the AI Hazard Management (AIHM) framework, which provides a structured process to systematically identify, assess, and treat AI hazards. The proposed process is conducted in parallel with the development to ensure that any AI hazard is captured at the earliest possible stage of the AI system's life cycle. In addition, to ensure the AI system's auditability, the proposed framework systematically documents evidence that the potential impact of identified AI hazards could be reduced to a tolerable level. The framework builds upon an AI hazard list from a comprehensive state-of-the-art analysis. Also, we provide a taxonomy that supports the optimal treatment of the identified AI hazards. Additionally, we illustrate how the AIHM framework can increase the overall quality of a power grid AI use case by systematically reducing the impact of identified hazards to an acceptable level.  ( 3 min )
    Multi-Factor Spatio-Temporal Prediction based on Graph Decomposition Learning
    arXiv:2310.10374v2 Announce Type: replace Abstract: Spatio-temporal (ST) prediction is an important and widely used technique in data mining and analytics, especially for ST data in urban systems such as transportation data. In practice, the ST data generation is usually influenced by various latent factors tied to natural phenomena or human socioeconomic activities, impacting specific spatial areas selectively. However, existing ST prediction methods usually do not refine the impacts of different factors, but directly model the entangled impacts of multiple factors. This amplifies the modeling complexity of ST data and compromises model interpretability. To this end, we propose a multi-factor ST prediction task that predicts partial ST data evolution under different factors, and combines them for a final prediction. We make two contributions to this task: an effective theoretical solution and a portable instantiation framework. Specifically, we first propose a theoretical solution called decomposed prediction strategy and prove its effectiveness from the perspective of information entropy theory. On top of that, we instantiate a novel model-agnostic framework, named spatio-temporal graph decomposition learning (STGDL), for multi-factor ST prediction. The framework consists of two main components: an automatic graph decomposition module that decomposes the original graph structure inherent in ST data into subgraphs corresponding to different factors, and a decomposed learning network that learns the partial ST data on each subgraph separately and integrates them for the final prediction. We conduct extensive experiments on four real-world ST datasets of two types of graphs, i.e., grid graph and network graph. Results show that our framework significantly reduces prediction errors of various ST models by 9.41% on average (35.36% at most). Furthermore, a case study reveals the interpretability potential of our framework.  ( 3 min )
    ProFSA: Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment
    arXiv:2310.07229v2 Announce Type: replace Abstract: Pocket representations play a vital role in various biomedical applications, such as druggability estimation, ligand affinity prediction, and de novo drug design. While existing geometric features and pretrained representations have demonstrated promising results, they usually treat pockets independent of ligands, neglecting the fundamental interactions between them. However, the limited pocket-ligand complex structures available in the PDB database (less than 100 thousand non-redundant pairs) hampers large-scale pretraining endeavors for interaction modeling. To address this constraint, we propose a novel pocket pretraining approach that leverages knowledge from high-resolution atomic protein structures, assisted by highly effective pretrained small molecule representations. By segmenting protein structures into drug-like fragments and their corresponding pockets, we obtain a reasonable simulation of ligand-receptor interactions, resulting in the generation of over 5 million complexes. Subsequently, the pocket encoder is trained in a contrastive manner to align with the representation of pseudo-ligand furnished by some pretrained small molecule encoders. Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. Notably, ProFSA surpasses other pretraining methods by a substantial margin. Moreover, our work opens up a new avenue for mitigating the scarcity of protein-ligand complex data through the utilization of high-quality and diverse protein structure databases.  ( 2 min )
    When Machine Learning Models Leak: An Exploration of Synthetic Training Data
    arXiv:2310.08775v2 Announce Type: replace Abstract: We investigate an attack on a machine learning model that predicts whether a person or household will relocate in the next two years, i.e., a propensity-to-move classifier. The attack assumes that the attacker can query the model to obtain predictions and that the marginal distribution of the data on which the model was trained is publicly available. The attack also assumes that the attacker has obtained the values of non-sensitive attributes for a certain number of target individuals. The objective of the attack is to infer the values of sensitive attributes for these target individuals. We explore how replacing the original data with synthetic data when training the model impacts how successfully the attacker can infer sensitive attributes.  ( 2 min )
    Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning
    arXiv:2310.03400v2 Announce Type: replace Abstract: Nowadays, billions of people engage in communication and express their opinions on the internet daily. Unfortunately, not all of these expressions are friendly or compliant, making content moderation an indispensable task. A common approach is to use a discriminative model to classify the content, but this method often requires strict data engineering, otherwise it will face unacceptable overfitting. With the successful development of Large Language Models (LLMs) in recent years, LLM-based methods have become a feasible solution for handling tasks in various domains. Thanks to the knowledge of the foundation models, we can develop more robust privately deployed models with limited data via fine-tuning these foundation models. Moreover, as a generative model, it can provide detailed analysis of the review process, enhancing interpretability. In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation. Specifically, we discuss the differences between discriminative and generative models using content moderation as an example. Additionally, we reveal that incorporating reasoning processes during the fine-tuning of LLMs can effectively alleviate overfitting, even if the model is not allowed to directly output reasoning processes during deployment. We present a complete process, from data collection and construction to model training and overfitting elimination, for fine-tuning LLMs in vertical domain deployments. We report the entire research process and the key findings in this paper, hoping to provide valuable experience for researchers who are fine-tuning privately deployed models in their domain-specific research.  ( 3 min )
    Analyzing and Improving Optimal-Transport-based Adversarial Networks
    arXiv:2310.02611v2 Announce Type: replace Abstract: Optimal Transport (OT) problem aims to find a transport plan that bridges two distributions while minimizing a given cost function. OT theory has been widely utilized in generative modeling. In the beginning, OT distance has been used as a measure for assessing the distance between data and generated distributions. Recently, OT transport map between data and prior distributions has been utilized as a generative model. These OT-based generative models share a similar adversarial training objective. In this paper, we begin by unifying these OT-based adversarial methods within a single framework. Then, we elucidate the role of each component in training dynamics through a comprehensive analysis of this unified framework. Moreover, we suggest a simple but novel method that improves the previously best-performing OT-based model. Intuitively, our approach conducts a gradual refinement of the generated distribution, progressively aligning it with the data distribution. Our approach achieves a FID score of 2.51 on CIFAR-10 and 5.99 on CelebA-HQ-256, outperforming unified OT-based adversarial approaches.  ( 2 min )
    RECOMBINER: Robust and Enhanced Compression with Bayesian Implicit Neural Representations
    arXiv:2309.17182v2 Announce Type: replace Abstract: COMpression with Bayesian Implicit NEural Representations (COMBINER) is a recent data compression method that addresses a key inefficiency of previous Implicit Neural Representation (INR)-based approaches: it avoids quantization and enables direct optimization of the rate-distortion performance. However, COMBINER still has significant limitations: 1) it uses factorized priors and posterior approximations that lack flexibility; 2) it cannot effectively adapt to local deviations from global patterns in the data; and 3) its performance can be susceptible to modeling choices and the variational parameters' initializations. Our proposed method, Robust and Enhanced COMBINER (RECOMBINER), addresses these issues by 1) enriching the variational approximation while retaining a low computational cost via a linear reparameterization of the INR weights, 2) augmenting our INRs with learnable positional encodings that enable them to adapt to local details and 3) splitting high-resolution data into patches to increase robustness and utilizing expressive hierarchical priors to capture dependency across patches. We conduct extensive experiments across several data modalities, showcasing that RECOMBINER achieves competitive results with the best INR-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates. Our PyTorch implementation is available at https://github.com/cambridge-mlg/RECOMBINER/.  ( 2 min )
    Implicit regularization of multi-task learning and finetuning in overparameterized neural networks
    arXiv:2310.02396v2 Announce Type: replace Abstract: In this work, we investigate the inductive biases that result from learning multiple tasks, either simultaneously (multi-task learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In the simplified setting of two-layer diagonal linear networks trained with gradient descent, we apply prior theoretical results to describe novel implicit regularization penalties associated with MTL and PT+FT, both of which incentivize feature sharing between tasks and sparsity in learned task-specific features. Notably, these results imply that during finetuning, networks operate in a hybrid of the kernel (or "lazy") regime and the feature learning ("rich") regime identified in prior work. Moreover, we show that PT+FT can exhibit a novel "nested feature selection" behavior not captured by either regime, which biases it to extract a sparse subset of the features learned during pretraining. In ReLU networks, we reproduce all of these qualitative behaviors empirically, in particular verifying that analogues of the sparsity biases predicted by the linear theory hold in the nonlinear case. Our findings hold qualitatively for a deep architecture trained on image classification tasks, and our characterization of the nested feature selection regime motivates a modification to PT+FT that we find empirically improves performance. We also observe that PT+FT (but not MTL) is biased to learn features that are correlated with (but distinct from) those needed for the auxiliary task, while MTL is biased toward using identical features for both tasks, which can lead to a tradeoff in performance as a function of the number of finetuning samples. Our results shed light on the impact of auxiliary task learning and suggest ways to leverage it more effectively.  ( 3 min )
    Tackling the Non-IID Issue in Heterogeneous Federated Learning by Gradient Harmonization
    arXiv:2309.06692v2 Announce Type: replace Abstract: Federated learning (FL) is a privacy-preserving paradigm for collaboratively training a global model from decentralized clients. However, the performance of FL is hindered by non-independent and identically distributed (non-IID) data and device heterogeneity. In this work, we revisit this key challenge through the lens of gradient conflicts on the server side. Specifically, we first investigate the gradient conflict phenomenon among multiple clients and reveal that stronger heterogeneity leads to more severe gradient conflicts. To tackle this issue, we propose FedGH, a simple yet effective method that mitigates local drifts through Gradient Harmonization. This technique projects one gradient vector onto the orthogonal plane of the other within conflicting client pairs. Extensive experiments demonstrate that FedGH consistently enhances multiple state-of-the-art FL baselines across diverse benchmarks and non-IID scenarios. Notably, FedGH yields more significant improvements in scenarios with stronger heterogeneity. As a plug-and-play module, FedGH can be seamlessly integrated into any FL framework without requiring hyperparameter tuning.  ( 2 min )
    REFT: Resource-Efficient Federated Training Framework for Heterogeneous and Resource-Constrained Environments
    arXiv:2308.13662v2 Announce Type: replace Abstract: Federated Learning (FL) plays a critical role in distributed systems. In these systems, data privacy and confidentiality hold paramount importance, particularly within edge-based data processing systems such as IoT devices deployed in smart homes. FL emerges as a privacy-enforcing sub-domain of machine learning that enables model training on client devices, eliminating the necessity to share private data with a central server. While existing research has predominantly addressed challenges pertaining to data heterogeneity, there remains a current gap in addressing issues such as varying device capabilities and efficient communication. These unaddressed issues raise a number of implications in resource-constrained environments. In particular, the practical implementation of FL-based IoT or edge systems is extremely inefficient. In this paper, we propose "Resource-Efficient Federated Training Framework for Heterogeneous and Resource-Constrained Environments (REFT)," a novel approach specifically devised to address these challenges in resource-limited devices. Our proposed method uses Variable Pruning to optimize resource utilization by adapting pruning strategies to the computational capabilities of each client. Furthermore, our proposed REFT technique employs knowledge distillation to minimize the need for continuous bidirectional client-server communication. This achieves a significant reduction in communication bandwidth, thereby enhancing the overall resource efficiency. We conduct experiments for an image classification task, and the results demonstrate the effectiveness of our approach in resource-limited settings. Our technique not only preserves data privacy and performance standards but also accommodates heterogeneous model architectures, facilitating the participation of a broader array of diverse client devices in the training process, all while consuming minimal bandwidth.  ( 3 min )
    Polynomial Width is Sufficient for Set Representation with High-dimensional Features
    arXiv:2307.04001v3 Announce Type: replace Abstract: Set representation has become ubiquitous in deep learning for modeling the inductive bias of neural networks that are insensitive to the input order. DeepSets is the most widely used neural network architecture for set representation. It involves embedding each set element into a latent space with dimension $L$, followed by a sum pooling to obtain a whole-set embedding, and finally mapping the whole-set embedding to the output. In this work, we investigate the impact of the dimension $L$ on the expressive power of DeepSets. Previous analyses either oversimplified high-dimensional features to be one-dimensional features or were limited to analytic activations, thereby diverging from practical use or resulting in $L$ that grows exponentially with the set size $N$ and feature dimension $D$. To investigate the minimal value of $L$ that achieves sufficient expressive power, we present two set-element embedding layers: (a) linear + power activation (LP) and (b) linear + exponential activations (LE). We demonstrate that $L$ being poly$(N, D)$ is sufficient for set representation using both embedding layers. We also provide a lower bound of $L$ for the LP embedding layer. Furthermore, we extend our results to permutation-equivariant set functions and the complex field.  ( 3 min )
    Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark]
    arXiv:2308.12899v3 Announce Type: replace Abstract: The field of urban spatial-temporal prediction is advancing rapidly with the development of deep learning techniques and the availability of large-scale datasets. However, challenges persist in accessing and utilizing diverse urban spatial-temporal datasets from different sources and stored in different formats, as well as determining effective model structures and components with the proliferation of deep learning models. This work addresses these challenges and provides three significant contributions. Firstly, we introduce "atomic files", a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets, simplifying data management. Secondly, we present a comprehensive overview of technological advances in urban spatial-temporal prediction models, guiding the development of robust models. Thirdly, we conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions. Overall, this work effectively manages urban spatial-temporal data, guides future efforts, and facilitates the development of accurate and efficient urban spatial-temporal prediction models. It can potentially make long-term contributions to urban spatial-temporal data management and prediction, ultimately leading to improved urban living standards.  ( 3 min )
    Safety-Critical Scenario Generation Via Reinforcement Learning Based Editing
    arXiv:2306.14131v3 Announce Type: replace Abstract: Generating safety-critical scenarios is essential for testing and verifying the safety of autonomous vehicles. Traditional optimization techniques suffer from the curse of dimensionality and limit the search space to fixed parameter spaces. To address these challenges, we propose a deep reinforcement learning approach that generates scenarios by sequential editing, such as adding new agents or modifying the trajectories of the existing agents. Our framework employs a reward function consisting of both risk and plausibility objectives. The plausibility objective leverages generative models, such as a variational autoencoder, to learn the likelihood of the generated parameters from the training datasets; It penalizes the generation of unlikely scenarios. Our approach overcomes the dimensionality challenge and explores a wide range of safety-critical scenarios. Our evaluation demonstrates that the proposed method generates safety-critical scenarios of higher quality compared with previous approaches.  ( 2 min )
    EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations
    arXiv:2306.12059v3 Announce Type: replace Abstract: Equivariant Transformers such as Equiformer have demonstrated the efficacy of applying Transformers to the domain of 3D atomistic systems. However, they are limited to small degrees of equivariant representations due to their computational complexity. In this paper, we investigate whether these architectures can scale well to higher degrees. Starting from Equiformer, we first replace $SO(3)$ convolutions with eSCN convolutions to efficiently incorporate higher-degree tensors. Then, to better leverage the power of higher degrees, we propose three architectural improvements -- attention re-normalization, separable $S^2$ activation and separable layer normalization. Putting this all together, we propose EquiformerV2, which outperforms previous state-of-the-art methods on large-scale OC20 dataset by up to $9\%$ on forces, $4\%$ on energies, offers better speed-accuracy trade-offs, and $2\times$ reduction in DFT calculations needed for computing adsorption energies. Additionally, EquiformerV2 trained on only OC22 dataset outperforms GemNet-OC trained on both OC20 and OC22 datasets, achieving much better data efficiency. Finally, we compare EquiformerV2 with Equiformer on QM9 and OC20 S2EF-2M datasets to better understand the performance gain brought by higher degrees.  ( 2 min )
    Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias
    arXiv:2305.18761v2 Announce Type: replace Abstract: Neural networks trained with (stochastic) gradient descent have an inductive bias towards learning simpler solutions. This makes them highly prone to learning spurious correlations in the training data, that may not hold at test time. In this work, we provide the first theoretical analysis of the effect of simplicity bias on learning spurious correlations. Notably, we show that examples with spurious features are provably separable based on the model's output early in training. We further illustrate that if spurious features have a small enough noise-to-signal ratio, the network's output on the majority of examples is almost exclusively determined by the spurious features, leading to poor worst-group test accuracy. Finally, we propose SPARE, which identifies spurious correlations early in training and utilizes importance sampling to alleviate their effect. Empirically, we demonstrate that SPARE outperforms state-of-the-art methods by up to 21.1% in worst-group accuracy, while being up to 12x faster. We also show that SPARE is a highly effective but lightweight method to discover spurious correlations.  ( 2 min )
    Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
    arXiv:2306.06836v3 Announce Type: replace Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL settings with linear function approximation. Our algorithm, termed as \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.  ( 3 min )
    DOCTOR: A Multi-Disease Detection Continual Learning Framework Based on Wearable Medical Sensors
    arXiv:2305.05738v4 Announce Type: replace Abstract: Modern advances in machine learning (ML) and wearable medical sensors (WMSs) in edge devices have enabled ML-driven disease detection for smart healthcare. Conventional ML-driven methods for disease detection rely on customizing individual models for each disease and its corresponding WMS data. However, such methods lack adaptability to distribution shifts and new task classification classes. In addition, they need to be rearchitected and retrained from scratch for each new disease. Moreover, installing multiple ML models in an edge device consumes excessive memory, drains the battery faster, and complicates the detection process. To address these challenges, we propose DOCTOR, a multi-disease detection continual learning (CL) framework based on WMSs. It employs a multi-headed deep neural network (DNN) and a replay-style CL algorithm. The CL algorithm enables the framework to continually learn new missions where different data distributions, classification classes, and disease detection tasks are introduced sequentially. It counteracts catastrophic forgetting with a data preservation method and a synthetic data generation (SDG) module. The data preservation method preserves the most informative subset of real training data from previous missions for exemplar replay. The SDG module models the probability distribution of the real training data and generates synthetic data for generative replay while retaining data privacy. The multi-headed DNN enables DOCTOR to detect multiple diseases simultaneously based on user WMS data. We demonstrate DOCTOR's efficacy in maintaining high disease classification accuracy with a single DNN model in various CL experiments. In complex scenarios, DOCTOR achieves 1.43 times better average test accuracy, 1.25 times better F1-score, and 0.41 higher backward transfer than the naive fine-tuning framework with a small model size of less than 350KB.  ( 3 min )
    Policy Gradient Methods in the Presence of Symmetries and State Abstractions
    arXiv:2305.05666v2 Announce Type: replace Abstract: Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.  ( 3 min )
    Diffusion Models for Constrained Domains
    arXiv:2304.05364v2 Announce Type: replace Abstract: Denoising diffusion models are a novel class of generative algorithms that achieve state-of-the-art performance across a range of domains, including image generation and text-to-image tasks. Building on this success, diffusion models have recently been extended to the Riemannian manifold setting, broadening their applicability to a range of problems from the natural and engineering sciences. However, these Riemannian diffusion models are built on the assumption that their forward and backward processes are well-defined for all times, preventing them from being applied to an important set of tasks that consider manifolds defined via a set of inequality constraints. In this work, we introduce a principled framework to bridge this gap. We present two distinct noising processes based on (i) the logarithmic barrier metric and (ii) the reflected Brownian motion induced by the constraints. As existing diffusion model techniques cannot be applied in this setting, we derive new tools to define such models in our framework. We then demonstrate the practical utility of our methods on a number of synthetic and real-world tasks, including applications from robotics and protein design.  ( 2 min )
    Towards Achieving Near-optimal Utility for Privacy-Preserving Federated Learning via Data Generation and Parameter Distortion
    arXiv:2305.04288v3 Announce Type: replace Abstract: Federated learning (FL) enables participating parties to collaboratively build a global model with boosted utility without disclosing private data information. Appropriate protection mechanisms have to be adopted to fulfill the requirements in preserving \textit{privacy} and maintaining high model \textit{utility}. The nature of the widely-adopted protection mechanisms including \textit{Randomization Mechanism} and \textit{Compression Mechanism} is to protect privacy via distorting model parameter. We measure the utility via the gap between the original model parameter and the distorted model parameter. We want to identify under what general conditions privacy-preserving federated learning can achieve near-optimal utility via data generation and parameter distortion. To provide an avenue for achieving near-optimal utility, we present an upper bound for utility loss, which is measured using two main terms called variance-reduction and model parameter discrepancy separately. Our analysis inspires the design of appropriate protection parameters for the protection mechanisms to achieve near-optimal utility and meet the privacy requirements simultaneously. The main techniques for the protection mechanism include parameter distortion and data generation, which are generic and can be applied extensively. Furthermore, we provide an upper bound for the trade-off between privacy and utility, \blue{which together with the lower bound provided by no free lunch theorem in federated learning (\cite{zhang2022no}) form the conditions for achieving optimal trade-off.  ( 3 min )
    FOSI: Hybrid First and Second Order Optimization
    arXiv:2302.08484v4 Announce Type: replace Abstract: Popular machine learning approaches forgo second-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. We formally analyze FOSI's convergence and the conditions under which it improves a base optimizer. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).  ( 2 min )
    PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction
    arXiv:2301.07945v3 Announce Type: replace Abstract: As a core technology of Intelligent Transportation System, traffic flow prediction has a wide range of applications. The fundamental challenge in traffic flow prediction is to effectively model the complex spatial-temporal dependencies in traffic data. Spatial-temporal Graph Neural Network (GNN) models have emerged as one of the most promising methods to solve this problem. However, GNN-based models have three major limitations for traffic prediction: i) Most methods model spatial dependencies in a static manner, which limits the ability to learn dynamic urban traffic patterns; ii) Most methods only consider short-range spatial information and are unable to capture long-range spatial dependencies; iii) These methods ignore the fact that the propagation of traffic conditions between locations has a time delay in traffic systems. To this end, we propose a novel Propagation Delay-aware dynamic long-range transFormer, namely PDFormer, for accurate traffic flow prediction. Specifically, we design a spatial self-attention module to capture the dynamic spatial dependencies. Then, two graph masking matrices are introduced to highlight spatial dependencies from short- and long-range views. Moreover, a traffic delay-aware feature transformation module is proposed to empower PDFormer with the capability of explicitly modeling the time delay of spatial information propagation. Extensive experimental results on six real-world public traffic datasets show that our method can not only achieve state-of-the-art performance but also exhibit competitive computational efficiency. Moreover, we visualize the learned spatial-temporal attention map to make our model highly interpretable.  ( 3 min )
    Self-supervised Trajectory Representation Learning with Temporal Regularities and Travel Semantics
    arXiv:2211.09510v4 Announce Type: replace Abstract: Trajectory Representation Learning (TRL) is a powerful tool for spatial-temporal data analysis and management. TRL aims to convert complicated raw trajectories into low-dimensional representation vectors, which can be applied to various downstream tasks, such as trajectory classification, clustering, and similarity computation. Existing TRL works usually treat trajectories as ordinary sequence data, while some important spatial-temporal characteristics, such as temporal regularities and travel semantics, are not fully exploited. To fill this gap, we propose a novel Self-supervised trajectory representation learning framework with TemporAl Regularities and Travel semantics, namely START. The proposed method consists of two stages. The first stage is a Trajectory Pattern-Enhanced Graph Attention Network (TPE-GAT), which converts the road network features and travel semantics into representation vectors of road segments. The second stage is a Time-Aware Trajectory Encoder (TAT-Enc), which encodes representation vectors of road segments in the same trajectory as a trajectory representation vector, meanwhile incorporating temporal regularities with the trajectory representation. Moreover, we also design two self-supervised tasks, i.e., span-masked trajectory recovery and trajectory contrastive learning, to introduce spatial-temporal characteristics of trajectories into the training process of our START framework. The effectiveness of the proposed method is verified by extensive experiments on two large-scale real-world datasets for three downstream tasks. The experiments also demonstrate that our method can be transferred across different cities to adapt heterogeneous trajectory datasets.  ( 3 min )
    Label Alignment Regularization for Distribution Shift
    arXiv:2211.14960v3 Announce Type: replace Abstract: Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this observation, we propose a regularization method for unsupervised domain adaptation that encourages alignment between the predictions in the target domain and its top singular vectors. Unlike conventional domain adaptation approaches that focus on regularizing representations, we instead regularize the classifier to align with the unsupervised target data, guided by the LAP in both the source and target domains. Theoretical analysis demonstrates that, under certain assumptions, our solution resides within the span of the top right singular vectors of the target domain data and aligns with the optimal solution. By removing the reliance on the commonly used optimal joint risk assumption found in classic domain adaptation theory, we showcase the effectiveness of our method on addressing problems where traditional domain adaptation methods often fall short due to high joint error. Additionally, we report improved performance over domain adaptation baselines in well-known tasks such as MNIST-USPS domain adaptation and cross-lingual sentiment analysis.  ( 2 min )
    On the expressivity of bi-Lipschitz normalizing flows
    arXiv:2107.07232v3 Announce Type: replace Abstract: An invertible function is bi-Lipschitz if both the function and its inverse have bounded Lipschitz constants. Nowadays, most Normalizing Flows are bi-Lipschitz by design or by training to limit numerical errors (among other things). In this paper, we discuss the expressivity of bi-Lipschitz Normalizing Flows and identify several target distributions that are difficult to approximate using such models. Then, we characterize the expressivity of bi-Lipschitz Normalizing Flows by giving several lower bounds on the Total Variation distance between these particularly unfavorable distributions and their best possible approximation. Finally, we discuss potential remedies which include using more complex latent distributions.  ( 2 min )
    iScore: Visual Analytics for Interpreting How Language Models Automatically Score Summaries
    arXiv:2403.04760v1 Announce Type: cross Abstract: The recent explosion in popularity of large language models (LLMs) has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibits transparency and impedes trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM's score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.  ( 3 min )
    KnowledgeVIS: Interpreting Language Models by Comparing Fill-in-the-Blank Prompts
    arXiv:2403.04758v1 Announce Type: cross Abstract: Recent growth in the popularity of large language models has led to their increased usage for summarizing, predicting, and generating text, making it vital to help researchers and engineers understand how and why they work. We present KnowledgeVis, a human-in-the-loop visual analytics system for interpreting language models using fill-in-the-blank sentences as prompts. By comparing predictions between sentences, KnowledgeVis reveals learned associations that intuitively connect what language models learn during training to natural language tasks downstream, helping users create and test multiple prompt variations, analyze predicted words using a novel semantic clustering technique, and discover insights using interactive visualizations. Collectively, these visualizations help users identify the likelihood and uniqueness of individual predictions, compare sets of predictions between prompts, and summarize patterns and relationships between predictions across all prompts. We demonstrate the capabilities of KnowledgeVis with feedback from six NLP experts as well as three different use cases: (1) probing biomedical knowledge in two domain-adapted models; and (2) evaluating harmful identity stereotypes and (3) discovering facts and relationships between three general-purpose models.  ( 2 min )
    LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
    arXiv:2403.04746v1 Announce Type: cross Abstract: Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.  ( 3 min )
    A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation
    arXiv:2403.04726v1 Announce Type: cross Abstract: We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic ($\Omega(d^2)$), which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant.  ( 2 min )
    Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
    arXiv:2403.04690v1 Announce Type: cross Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems respectively.  ( 3 min )
    Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
    arXiv:2403.04696v1 Announce Type: cross Abstract: Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factual, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for six different LLMs and three languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.  ( 3 min )
    Telecom Language Models: Must They Be Large?
    arXiv:2403.04666v1 Announce Type: cross Abstract: The increasing interest in Large Language Models (LLMs) within the telecommunications sector underscores their potential to revolutionize operational efficiency. However, the deployment of these sophisticated models is often hampered by their substantial size and computational demands, raising concerns about their viability in resource-constrained environments. Addressing this challenge, recent advancements have seen the emergence of small language models that surprisingly exhibit performance comparable to their larger counterparts in many tasks, such as coding and common-sense reasoning. Phi-2, a compact yet powerful model, exemplifies this new wave of efficient small language models. This paper conducts a comprehensive evaluation of Phi-2's intrinsic understanding of the telecommunications domain. Recognizing the scale-related limitations, we enhance Phi-2's capabilities through a Retrieval-Augmented Generation approach, meticulously integrating an extensive knowledge base specifically curated with telecom standard specifications. The enhanced Phi-2 model demonstrates a profound improvement in accuracy, answering questions about telecom standards with a precision that closely rivals the more resource-intensive GPT-3.5. The paper further explores the refined capabilities of Phi-2 in addressing problem-solving scenarios within the telecom sector, highlighting its potential and limitations.  ( 2 min )
    Dynamic Cross Attention for Audio-Visual Person Verification
    arXiv:2403.04661v1 Announce Type: cross Abstract: Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to pose strong complementary relationships, which plays a crucial role in effective audio-visual fusion. However, they may not always strongly complement each other, they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly based on the strong or weak complementary relationships, respectively, across audio and visual modalities. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit strong complementary relationships, otherwise unattended features. Extensive experiments are conducted on the Voxceleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves the performance on multiple variants of cross-attention while outperforming the state-of-the-art methods.  ( 2 min )
    Learning Agility Adaptation for Flight in Clutter
    arXiv:2403.04586v1 Announce Type: cross Abstract: Animals learn to adapt agility of their movements to their capabilities and the environment they operate in. Mobile robots should also demonstrate this ability to combine agility and safety. The aim of this work is to endow flight vehicles with the ability of agility adaptation in prior unknown and partially observable cluttered environments. We propose a hierarchical learning and planning framework where we utilize both trial and error to comprehensively learn an agility policy with the vehicle's observation as the input, and well-established methods of model-based trajectory generation. Technically, we use online model-free reinforcement learning and a pre-training-fine-tuning reward scheme to obtain the deployable policy. The statistical results in simulation demonstrate the advantages of our method over the constant agility baselines and an alternative method in terms of flight efficiency and safety. In particular, the policy leads to intelligent behaviors, such as perception awareness, which distinguish it from other approaches. By deploying the policy to hardware, we verify that these advantages can be brought to the real world.  ( 2 min )
    MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder
    arXiv:2403.04626v1 Announce Type: cross Abstract: Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders(MAEs) and multimodal data. However, the impact of MAEs on intermodality remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for zero-shot learning with crossed domains, which enhances the model ability to learn from limited data, a common scenario in medical diagnostics. We verify that masking an image does not affect intermodal learning. Furthermore, we propose the SVD loss to enhance the representation learning for characteristics of medical images, aiming to improve classification accuracy by leveraging the structural intricacies of such data. Lastly, we validate using language will improve the zero-shot performance for the medical image analysis. MedFLIP scaling of the masking process marks an advancement in the field, offering a pathway to rapid and precise medical image analysis without the traditional computational bottlenecks. Through experiments and validation, MedFLIP demonstrates efficient performance improvements, setting an explored standard for future research and application in medical diagnostics.  ( 2 min )
    T-TAME: Trainable Attention Mechanism for Explaining Convolutional Networks and Vision Transformers
    arXiv:2403.04523v1 Announce Type: cross Abstract: The development and adoption of Vision Transformers and other deep-learning architectures for image classification tasks has been rapid. However, the "black box" nature of neural networks is a barrier to adoption in applications where explainability is essential. While some techniques for generating explanations have been proposed, primarily for Convolutional Neural Networks, adapting such techniques to the new paradigm of Vision Transformers is non-trivial. This paper presents T-TAME, Transformer-compatible Trainable Attention Mechanism for Explanations, a general methodology for explaining deep neural networks used in image classification tasks. The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network, using a streamlined training approach. After training, explanation maps can be computed in a single forward pass; these explanation maps are comparable to or outperform the outputs of computationally expensive perturbation-based explainability techniques, achieving SOTA performance. We apply T-TAME to three popular deep learning classifier architectures, VGG-16, ResNet-50, and ViT-B-16, trained on the ImageNet dataset, and we demonstrate improvements over existing state-of-the-art explainability methods. A detailed analysis of the results and an ablation study provide insights into how the T-TAME design choices affect the quality of the generated explanation maps.  ( 2 min )
    Improvements & Evaluations on the MLCommons CloudMask Benchmark
    arXiv:2403.04553v1 Announce Type: cross Abstract: In this paper, we report the performance benchmarking results of deep learning models on MLCommons' Science cloud-masking benchmark using a high-performance computing cluster at New York University (NYU): NYU Greene. MLCommons is a consortium that develops and maintains several scientific benchmarks that can benefit from developments in AI. We provide a description of the cloud-masking benchmark task, updated code, and the best model for this benchmark when using our selected hyperparameter settings. Our benchmarking results include the highest accuracy achieved on the NYU system as well as the average time taken for both training and inference on the benchmark across several runs/seeds. Our code can be found on GitHub. MLCommons team has been kept informed about our progress and may use the developed code for their future work.  ( 2 min )
    Membership Inference Attacks and Privacy in Topic Modeling
    arXiv:2403.04451v1 Announce Type: cross Abstract: Recent research shows that large language models are susceptible to privacy attacks that infer aspects of the training data. However, it is unclear if simpler generative models, like topic models, share similar vulnerabilities. In this work, we propose an attack against topic models that can confidently identify members of the training data in Latent Dirichlet Allocation. Our results suggest that the privacy risks associated with generative modeling are not restricted to large neural models. Additionally, to mitigate these vulnerabilities, we explore differentially private (DP) topic modeling. We propose a framework for private topic modeling that incorporates DP vocabulary selection as a pre-processing step, and show that it improves privacy while having limited effects on practical utility.  ( 2 min )
    Source Matters: Source Dataset Impact on Model Robustness in Medical Imaging
    arXiv:2403.04484v1 Announce Type: cross Abstract: Transfer learning has become an essential part of medical imaging classification algorithms, often leveraging ImageNet weights. However, the domain shift from natural to medical images has prompted alternatives such as RadImageNet, often demonstrating comparable classification performance. However, it remains unclear whether the performance gains from transfer learning stem from improved generalization or shortcut learning. To address this, we investigate potential confounders -- whether synthetic or sampled from the data -- across two publicly available chest X-ray and CT datasets. We show that ImageNet and RadImageNet achieve comparable classification performance, yet ImageNet is much more prone to overfitting to confounders. We recommend that researchers using ImageNet-pretrained models reexamine their model robustness by conducting similar experiments. Our code and experiments are available at https://github.com/DovileDo/source-matters.  ( 2 min )
    Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation
    arXiv:2403.04436v1 Announce Type: cross Abstract: We present Human to Humanoid (H2O), a reinforcement learning (RL) based framework that enables real-time whole-body teleoperation of a full-sized humanoid robot with only an RGB camera. To create a large-scale retargeted motion dataset of human movements for humanoid robots, we propose a scalable "sim-to-data" process to filter and pick feasible motions using a privileged motion imitator. Afterwards, we train a robust real-time humanoid motion imitator in simulation using these refined motions and transfer it to the real humanoid robot in a zero-shot manner. We successfully achieve teleoperation of dynamic whole-body motions in real-world scenarios, including walking, back jumping, kicking, turning, waving, pushing, boxing, etc. To the best of our knowledge, this is the first demonstration to achieve learning-based real-time whole-body humanoid teleoperation.  ( 2 min )
    Signature Isolation Forest
    arXiv:2403.04405v1 Announce Type: cross Abstract: Functional Isolation Forest (FIF) is a recent state-of-the-art Anomaly Detection (AD) algorithm designed for functional data. It relies on a tree partition procedure where an abnormality score is computed by projecting each curve observation on a drawn dictionary through a linear inner product. Such linear inner product and the dictionary are a priori choices that highly influence the algorithm's performances and might lead to unreliable results, particularly with complex datasets. This work addresses these challenges by introducing \textit{Signature Isolation Forest}, a novel AD algorithm class leveraging the rough path theory's signature transform. Our objective is to remove the constraints imposed by FIF through the proposition of two algorithms which specifically target the linearity of the FIF inner product and the choice of the dictionary. We provide several numerical experiments, including a real-world applications benchmark showing the relevance of our methods.  ( 2 min )
    LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression
    arXiv:2403.04348v1 Announce Type: cross Abstract: In Distributed optimization and Learning, and even more in the modern framework of federated learning, communication, which is slow and costly, is critical. We introduce LoCoDL, a communication-efficient algorithm that leverages the two popular and effective techniques of Local training, which reduces the communication frequency, and Compression, in which short bitstreams are sent instead of full-dimensional vectors of floats. LoCoDL works with a large class of unbiased compressors that includes widely-used sparsification and quantization methods. LoCoDL provably benefits from local training and compression and enjoys a doubly-accelerated communication complexity, with respect to the condition number of the functions and the model dimension, in the general heterogenous regime with strongly convex functions. This is confirmed in practice, with LoCoDL outperforming existing algorithms.  ( 2 min )
    RL-CFR: Improving Action Abstraction for Imperfect Information Extensive-Form Games with Reinforcement Learning
    arXiv:2403.04344v1 Announce Type: cross Abstract: Effective action abstraction is crucial in tackling challenges associated with large action spaces in Imperfect Information Extensive-Form Games (IIEFGs). However, due to the vast state space and computational complexity in IIEFGs, existing methods often rely on fixed abstractions, resulting in sub-optimal performance. In response, we introduce RL-CFR, a novel reinforcement learning (RL) approach for dynamic action abstraction. RL-CFR builds upon our innovative Markov Decision Process (MDP) formulation, with states corresponding to public information and actions represented as feature vectors indicating specific action abstractions. The reward is defined as the expected payoff difference between the selected and default action abstractions. RL-CFR constructs a game tree with RL-guided action abstractions and utilizes counterfactual regret minimization (CFR) for strategy derivation. Impressively, it can be trained from scratch, achieving higher expected payoff without increased CFR solving time. In experiments on Heads-up No-limit Texas Hold'em, RL-CFR outperforms ReBeL's replication and Slumbot, demonstrating significant win-rate margins of $64\pm 11$ and $84\pm 17$ mbb/hand, respectively.  ( 2 min )
    A mechanism-informed reinforcement learning framework for shape optimization of airfoils
    arXiv:2403.04329v1 Announce Type: cross Abstract: In this study, we present the mechanism-informed reinforcement learning framework for airfoil shape optimization. By leveraging the twin delayed deep deterministic policy gradient algorithm for its notable stability, our approach addresses the complexities of optimizing shapes governed by fluid dynamics. The PDEs-based solver is adopted for its accuracy even when the configurations and geometries are extraordinarily changed during the exploration. Dual-weighted residual-based mesh refinement strategy is applied to ensure the accurate calculation of target functionals. To streamline the iterative optimization process and handle geometric deformations, our approach integrates Laplacian smoothing, adaptive refinement, and a B\'ezier fitting strategy. This combination not only remits mesh tangling but also guarantees a precise manipulation of the airfoil geometry. Our neural network architecture leverages B\'ezier curves for efficient dimensionality reduction, thereby enhancing the learning process and ensuring the geometric accuracy of the airfoil shapes. An attention mechanism is embedded within the network to calculate potential action on the state as well. Furthermore, we have introduced different reward and penalty mechanisms tailored to the specific challenges of airfoil optimization. This algorithm is designed to support the optimization task, facilitating a more targeted and effective approach for airfoil shape optimization.  ( 2 min )
    Edge-based Parametric Digital Twins for Intelligent Building Indoor Climate Modeling
    arXiv:2403.04326v1 Announce Type: cross Abstract: Digital transformation in the built environment generates vast data for developing data-driven models to optimize building operations. This study presents an integrated solution utilizing edge computing, digital twins, and deep learning to enhance the understanding of climate in buildings. Parametric digital twins, created using an ontology, ensure consistent data representation across diverse service systems equipped by different buildings. Based on created digital twins and collected data, deep learning methods are employed to develop predictive models for identifying patterns in indoor climate and providing insights. Both the parametric digital twin and deep learning models are deployed on edge for low latency and privacy compliance. As a demonstration, a case study was conducted in a historic building in \"Osterg\"otland, Sweden, to compare the performance of five deep learning architectures. The results indicate that the time-series dense encoder model exhibited strong competitiveness in performing multi-horizon forecasts of indoor temperature and relative humidity with low computational costs.  ( 2 min )
    Memetic Differential Evolution Methods for Semi-Supervised Clustering
    arXiv:2403.04322v1 Announce Type: cross Abstract: In this paper, we deal with semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems where background knowledge is given in the form of instance-level constraints. In particular, we take into account "must-link" and "cannot-link" constraints, each of which indicates if two dataset points should be associated to the same or to a different cluster. The presence of such constraints makes the problem at least as hard as its unsupervised version: it is no more true that each point is associated to its nearest cluster center, thus requiring some modifications in crucial operations, such as the assignment step. In this scenario, we propose a novel memetic strategy based on the Differential Evolution paradigm, directly extending a state-of-the-art framework recently proposed in the unsupervised clustering literature. As far as we know, our contribution represents the first attempt to define a memetic methodology designed to generate a (hopefully) optimal feasible solution for the semi-supervised MSSC problem. The proposal is compared with some state-of-the-art algorithms from the literature on a set of well-known datasets, highlighting its effectiveness and efficiency in finding good quality clustering solutions.  ( 2 min )
    Effectiveness Assessment of Recent Large Vision-Language Models
    arXiv:2403.04306v1 Announce Type: cross Abstract: The advent of large vision-language models (LVLMs) represents a noteworthy advancement towards the pursuit of artificial general intelligence. However, the extent of their efficacy across both specialized and general tasks warrants further investigation. This article endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive comprehension of these innovative methodologies. To gauge their efficacy in specialized tasks, we tailor a comprehensive testbed comprising three distinct scenarios: natural, healthcare, and industrial, encompassing six challenging tasks. These tasks include salient, camouflaged, and transparent object detection, as well as polyp and skin lesion detection, alongside industrial anomaly detection. We examine the performance of three recent open-source LVLMs -- MiniGPT-v2, LLaVA-1.5, and Shikra -- in the realm of visual recognition and localization. Moreover, we conduct empirical investigations utilizing the aforementioned models alongside GPT-4V, assessing their multi-modal understanding capacities in general tasks such as object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these models demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deeper into this inadequacy and suggest several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope this study would provide valuable insights for the future development of LVLMs, augmenting their power in coping with both general and specialized applications.  ( 3 min )
    MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
    arXiv:2403.04290v1 Announce Type: cross Abstract: Medical generative models, acknowledged for their high-quality sample generation ability, have accelerated the fast growth of medical applications. However, recent works concentrate on separate medical generation models for distinct medical tasks and are restricted to inadequate medical multi-modal knowledge, constraining medical comprehensive diagnosis. In this paper, we propose MedM2G, a Medical Multi-Modal Generative framework, with the key innovation to align, extract, and generate medical multi-modal within a unified model. Extending beyond single or two medical modalities, we efficiently align medical multi-modal through the central alignment approach in the unified space. Significantly, our framework extracts valuable clinical knowledge by preserving the medical visual invariant of each imaging modal, thereby enhancing specific medical information for multi-modal generation. By conditioning the adaptive cross-guided parameters into the multi-flow diffusion framework, our model promotes flexible interactions among medical multi-modal for generation. MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image, image-to-text, and unified generation of medical modalities (CT, MRI, X-ray). It performs 5 medical generation tasks across 10 datasets, consistently outperforming various state-of-the-art works.  ( 2 min )
    Qubit-Wise Architecture Search Method for Variational Quantum Circuits
    arXiv:2403.04268v1 Announce Type: cross Abstract: Considering the noise level limit, one crucial aspect for quantum machine learning is to design a high-performing variational quantum circuit architecture with small number of quantum gates. As the classical neural architecture search (NAS), quantum architecture search methods (QAS) employ methods like reinforcement learning, evolutionary algorithms and supernet optimiza-tion to improve the search efficiency. In this paper, we propose a novel qubit-wise architec-ture search (QWAS) method, which progres-sively search one-qubit configuration per stage, and combine with Monte Carlo Tree Search al-gorithm to find good quantum architectures by partitioning the search space into several good and bad subregions. The numerical experimental results indicate that our proposed method can balance the exploration and exploitation of cir-cuit performance and size in some real-world tasks, such as MNIST, Fashion and MOSI. As far as we know, QWAS achieves the state-of-art re-sults of all tasks in the terms of accuracy and circuit size.  ( 2 min )
    Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
    arXiv:2403.04283v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure Large Language Models (LLMs) align with human values. However, existing RLHF methods require a high computational cost, one main reason being that RLHF assigns both the generation and alignment tasks to the LLM simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs, achieving alignment with human values at a much lower computational cost. We start with a novel Markov Decision Process (MDP) designed for the alignment process and employ Reinforcement Learning (RL) to train a streamlined proxy model that oversees the token generation of the LLM, without altering the LLM itself. Experiments show that our method achieves a comparable level of alignment with only 1\% of the training parameters of other methods.  ( 2 min )
    Decentralized and Equitable Optimal Transport
    arXiv:2403.04259v1 Announce Type: cross Abstract: This paper considers the decentralized (discrete) optimal transport (D-OT) problem. In this setting, a network of agents seeks to design a transportation plan jointly, where the cost function is the sum of privately held costs for each agent. We reformulate the D-OT problem as a constraint-coupled optimization problem and propose a single-loop decentralized algorithm with an iteration complexity of O(1/{\epsilon}) that matches existing centralized first-order approaches. Moreover, we propose the decentralized equitable optimal transport (DE-OT) problem. In DE-OT, in addition to cooperatively designing a transportation plan that minimizes transportation costs, agents seek to ensure equity in their individual costs. The iteration complexity of the proposed method to solve DE-OT is also O(1/{\epsilon}). This rate improves existing centralized algorithms, where the best iteration complexity obtained is O(1/{\epsilon}^2).  ( 2 min )
    Can Small Language Models be Good Reasoners for Sequential Recommendation?
    arXiv:2403.04260v1 Announce Type: cross Abstract: Large language models (LLMs) open up new horizons for sequential recommendations, owing to their remarkable language comprehension and generation capabilities. However, there are still numerous challenges that should be addressed to successfully implement sequential recommendations empowered by LLMs. Firstly, user behavior patterns are often complex, and relying solely on one-step reasoning from LLMs may lead to incorrect or task-irrelevant responses. Secondly, the prohibitively resource requirements of LLM (e.g., ChatGPT-175B) are overwhelmingly high and impractical for real sequential recommender systems. In this paper, we propose a novel Step-by-step knowLedge dIstillation fraMework for recommendation (SLIM), paving a promising path for sequential recommenders to enjoy the exceptional reasoning capabilities of LLMs in a "slim" (i.e., resource-efficient) manner. We introduce CoT prompting based on user behavior sequences for the larger teacher model. The rationales generated by the teacher model are then utilized as labels to distill the downstream smaller student model (e.g., LLaMA2-7B). In this way, the student model acquires the step-by-step reasoning capabilities in recommendation tasks. We encode the generated rationales from the student model into a dense vector, which empowers recommendation in both ID-based and ID-agnostic scenarios. Extensive experiments demonstrate the effectiveness of SLIM over state-of-the-art baselines, and further analysis showcasing its ability to generate meaningful recommendation reasoning at affordable costs.  ( 2 min )
    A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
    arXiv:2403.04245v1 Announce Type: cross Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR  ( 3 min )
    Efficient CNN-LSTM based Parameter Estimation of Levy Driven Stochastic Differential Equations
    arXiv:2403.04246v1 Announce Type: cross Abstract: This study addresses the challenges in parameter estimation of stochastic differential equations driven by non-Gaussian noises, which are critical in understanding dynamic phenomena such as price fluctuations and the spread of infectious diseases. Previous research highlighted the potential of LSTM networks in estimating parameters of alpha stable Levy driven SDEs but faced limitations including high time complexity and constraints of the LSTM chaining property. To mitigate these issues, we introduce the PEnet, a novel CNN-LSTM-based three-stage model that offers an end to end approach with superior accuracy and adaptability to varying data structures, enhanced inference speed for long sequence observations through initial data feature condensation by CNN, and high generalization capability, allowing its application to various complex SDE scenarios. Experiments on synthetic datasets confirm PEnet significant advantage in estimating SDE parameters associated with noise characteristics, establishing it as a competitive method for SDE parameter estimation in the presence of Levy noise.  ( 2 min )
    Fundamental limits of Non-Linear Low-Rank Matrix Estimation
    arXiv:2403.04234v1 Announce Type: cross Abstract: We consider the task of estimating a low-rank matrix from non-linear and noisy observations. We prove a strong universality result showing that Bayes-optimal performances are characterized by an equivalent Gaussian model with an effective prior, whose parameters are entirely determined by an expansion of the non-linear function. In particular, we show that to reconstruct the signal accurately, one requires a signal-to-noise ratio growing as $N^{\frac 12 (1-1/k_F)}$, where $k_F$ is the first non-zero Fisher information coefficient of the function. We provide asymptotic characterization for the minimal achievable mean squared error (MMSE) and an approximate message-passing algorithm that reaches the MMSE under conditions analogous to the linear version of the problem. We also provide asymptotic errors achieved by methods such as principal component analysis combined with Bayesian denoising, and compare them with Bayes-optimal MMSE.  ( 2 min )
    Generalizing Cooperative Eco-driving via Multi-residual Task Learning
    arXiv:2403.04232v1 Announce Type: cross Abstract: Conventional control, such as model-based control, is commonly utilized in autonomous driving due to its efficiency and reliability. However, real-world autonomous driving contends with a multitude of diverse traffic scenarios that are challenging for these planning algorithms. Model-free Deep Reinforcement Learning (DRL) presents a promising avenue in this direction, but learning DRL control policies that generalize to multiple traffic scenarios is still a challenge. To address this, we introduce Multi-residual Task Learning (MRTL), a generic learning framework based on multi-task learning that, for a set of task scenarios, decomposes the control into nominal components that are effectively solved by conventional control methods and residual terms which are solved using learning. We employ MRTL for fleet-level emission reduction in mixed traffic using autonomous vehicles as a means of system control. By analyzing the performance of MRTL across nearly 600 signalized intersections and 1200 traffic scenarios, we demonstrate that it emerges as a promising approach to synergize the strengths of DRL and conventional methods in generalizable control.  ( 2 min )
    Aligners: Decoupling LLMs and Alignment
    arXiv:2403.04224v1 Announce Type: cross Abstract: Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criteria on an as-needed basis, thus also reducing the potential negative impacts of alignment on performance. Our recipe for training the aligner models solely relies on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We illustrate our method by training an "ethical" aligner and verify its efficacy empirically.  ( 2 min )
    Dynamics of Moral Behavior in Heterogeneous Populations of Learning Agents
    arXiv:2403.04202v1 Announce Type: cross Abstract: Growing concerns about safety and alignment of AI systems highlight the importance of embedding moral capabilities in artificial agents. A promising solution is the use of learning from experience, i.e., Reinforcement Learning. In multi-agent (social) environments, complex population-level phenomena may emerge from interactions between individual learning agents. Many of the existing studies rely on simulated social dilemma environments to study the interactions of independent learning agents. However, they tend to ignore the moral heterogeneity that is likely to be present in societies of agents in practice. For example, at different points in time a single learning agent may face opponents who are consequentialist (i.e., caring about maximizing some outcome over time) or norm-based (i.e., focusing on conforming to a specific norm here and now). The extent to which agents' co-development may be impacted by such moral heterogeneity in populations is not well understood. In this paper, we present a study of the learning dynamics of morally heterogeneous populations interacting in a social dilemma setting. Using a Prisoner's Dilemma environment with a partner selection mechanism, we investigate the extent to which the prevalence of diverse moral agents in populations affects individual agents' learning behaviors and emergent population-level outcomes. We observe several types of non-trivial interactions between pro-social and anti-social agents, and find that certain classes of moral agents are able to steer selfish agents towards more cooperative behavior.  ( 2 min )
    FedClust: Optimizing Federated Learning on Non-IID Data through Weight-Driven Client Clustering
    arXiv:2403.04144v1 Announce Type: cross Abstract: Federated learning (FL) is an emerging distributed machine learning paradigm enabling collaborative model training on decentralized devices without exposing their local data. A key challenge in FL is the uneven data distribution across client devices, violating the well-known assumption of independent-and-identically-distributed (IID) training samples in conventional machine learning. Clustered federated learning (CFL) addresses this challenge by grouping clients based on the similarity of their data distributions. However, existing CFL approaches require a large number of communication rounds for stable cluster formation and rely on a predefined number of clusters, thus limiting their flexibility and adaptability. This paper proposes FedClust, a novel CFL approach leveraging correlations between local model weights and client data distributions. FedClust groups clients into clusters in a one-shot manner using strategically selected partial model weights and dynamically accommodates newcomers in real-time. Experimental results demonstrate FedClust outperforms baseline approaches in terms of accuracy and communication costs.  ( 2 min )
    Silicon Photonic 2.5D Interposer Networks for Overcoming Communication Bottlenecks in Scale-out Machine Learning Hardware Accelerators
    arXiv:2403.04189v1 Announce Type: cross Abstract: Modern machine learning (ML) applications are becoming increasingly complex and monolithic (single chip) accelerator architectures cannot keep up with their energy efficiency and throughput demands. Even though modern digital electronic accelerators are gradually adopting 2.5D architectures with multiple smaller chiplets to improve scalability, they face fundamental limitations due to a reliance on slow metallic interconnects. This paper outlines how optical communication and computation can be leveraged in 2.5D platforms to realize energy-efficient and high throughput 2.5D ML accelerator architectures.  ( 2 min )
    Can Large Language Models Reason and Plan?
    arXiv:2403.04121v1 Announce Type: cross Abstract: While humans sometimes do show the capability of correcting their own erroneous guesses with self-critiquing, there seems to be no basis for that assumption in the case of LLMs.  ( 2 min )
    Exploring LLM-based Agents for Root Cause Analysis
    arXiv:2403.04123v1 Announce Type: cross Abstract: The growing complexity of cloud based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and practical considerations for implementing such a system in practice.  ( 3 min )
    Globally Stable Neural Imitation Policies
    arXiv:2403.04118v1 Announce Type: cross Abstract: Imitation learning presents an effective approach to alleviate the resource-intensive and time-consuming nature of policy learning from scratch in the solution space. Even though the resulting policy can mimic expert demonstrations reliably, it often lacks predictability in unexplored regions of the state-space, giving rise to significant safety concerns in the face of perturbations. To address these challenges, we introduce the Stable Neural Dynamical System (SNDS), an imitation learning regime which produces a policy with formal stability guarantees. We deploy a neural policy architecture that facilitates the representation of stability based on Lyapunov theorem, and jointly train the policy and its corresponding Lyapunov candidate to ensure global stability. We validate our approach by conducting extensive experiments in simulation and successfully deploying the trained policies on a real-world manipulator arm. The experimental results demonstrate that our method overcomes the instability, accuracy, and computational intensity problems associated with previous imitation learning methods, making our method a promising solution for stable policy learning in complex planning scenarios.  ( 2 min )
    Closing the Visual Sim-to-Real Gap with Object-Composable NeRFs
    arXiv:2403.04114v1 Announce Type: cross Abstract: Deep learning methods for perception are the cornerstone of many robotic systems. Despite their potential for impressive performance, obtaining real-world training data is expensive, and can be impractically difficult for some tasks. Sim-to-real transfer with domain randomization offers a potential workaround, but often requires extensive manual tuning and results in models that are brittle to distribution shift between sim and real. In this work, we introduce Composable Object Volume NeRF (COV-NeRF), an object-composable NeRF model that is the centerpiece of a real-to-sim pipeline for synthesizing training data targeted to scenes and objects from the real world. COV-NeRF extracts objects from real images and composes them into new scenes, generating photorealistic renderings and many types of 2D and 3D supervision, including depth maps, segmentation masks, and meshes. We show that COV-NeRF matches the rendering quality of modern NeRF methods, and can be used to rapidly close the sim-to-real gap across a variety of perceptual modalities.  ( 2 min )
    Forecasting and Mitigating Disruptions in Public Bus Transit Services
    arXiv:2403.04072v1 Announce Type: cross Abstract: Public transportation systems often suffer from unexpected fluctuations in demand and disruptions, such as mechanical failures and medical emergencies. These fluctuations and disruptions lead to delays and overcrowding, which are detrimental to the passengers' experience and to the overall performance of the transit service. To proactively mitigate such events, many transit agencies station substitute (reserve) vehicles throughout their service areas, which they can dispatch to augment or replace vehicles on routes that suffer overcrowding or disruption. However, determining the optimal locations where substitute vehicles should be stationed is a challenging problem due to the inherent randomness of disruptions and due to the combinatorial nature of selecting locations across a city. In collaboration with the transit agency of Nashville, TN, we address this problem by introducing data-driven statistical and machine-learning models for forecasting disruptions and an effective randomized local-search algorithm for selecting locations where substitute vehicles are to be stationed. Our research demonstrates promising results in proactive disruption management, offering a practical and easily implementable solution for transit agencies to enhance the reliability of their services. Our results resonate beyond mere operational efficiency: by advancing proactive strategies, our approach fosters more resilient and accessible public transportation, contributing to equitable urban mobility and ultimately benefiting the communities that rely on public transportation the most.  ( 2 min )
    Using Causal Trees to Estimate Personalized Task Difficulty in Post-Stroke Individuals
    arXiv:2403.04109v1 Announce Type: cross Abstract: Adaptive training programs are crucial for recovery post stroke. However, developing programs that automatically adapt depends on quantifying how difficult a task is for a specific individual at a particular stage of their recovery. In this work, we propose a method that automatically generates regions of different task difficulty levels based on an individual's performance. We show that this technique explains the variance in user performance for a reaching task better than previous approaches to estimating task difficulty.  ( 2 min )
    Learning Guided Automated Reasoning: A Brief Survey
    arXiv:2403.04017v1 Announce Type: cross Abstract: Automated theorem provers and formal proof assistants are general reasoning systems that are in theory capable of proving arbitrarily hard theorems, thus solving arbitrary problems reducible to mathematics and logical reasoning. In practice, such systems however face large combinatorial explosion, and therefore include many heuristics and choice points that considerably influence their performance. This is an opportunity for trained machine learning predictors, which can guide the work of such reasoning systems. Conversely, deductive search supported by the notion of logically valid proof allows one to train machine learning systems on large reasoning corpora. Such bodies of proof are usually correct by construction and when combined with more and more precise trained guidance they can be boostrapped into very large corpora, with increasingly long reasoning chains and possibly novel proof ideas. In this paper we provide an overview of several automated reasoning and theorem proving domains and the learning and AI methods that have been so far developed for them. These include premise selection, proof guidance in several settings, AI systems and feedback loops iterating between reasoning and learning, and symbolic classification problems.  ( 2 min )
    On the Efficient Marginalization of Probabilistic Sequence Models
    arXiv:2403.04005v1 Announce Type: cross Abstract: Real-world data often exhibits sequential dependence, across diverse domains such as human behavior, medicine, finance, and climate modeling. Probabilistic methods capture the inherent uncertainty associated with prediction in these contexts, with autoregressive models being especially prominent. This dissertation focuses on using autoregressive models to answer complex probabilistic queries that go beyond single-step prediction, such as the timing of future events or the likelihood of a specific event occurring before another. In particular, we develop a broad class of novel and efficient approximation techniques for marginalization in sequential models that are model-agnostic. These techniques rely solely on access to and sampling from next-step conditional distributions of a pre-trained autoregressive model, including both traditional parametric models as well as more recent neural autoregressive models. Specific approaches are presented for discrete sequential models, for marked temporal point processes, and for stochastic jump processes, each tailored to a well-defined class of informative, long-range probabilistic queries.  ( 2 min )
    Video Relationship Detection Using Mixture of Experts
    arXiv:2403.03994v1 Announce Type: cross Abstract: Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.  ( 2 min )
    Bidirectional Progressive Neural Networks with Episodic Return Progress for Emergent Task Sequencing and Robotic Skill Transfer
    arXiv:2403.04001v1 Announce Type: cross Abstract: Human brain and behavior provide a rich venue that can inspire novel control and learning methods for robotics. In an attempt to exemplify such a development by inspiring how humans acquire knowledge and transfer skills among tasks, we introduce a novel multi-task reinforcement learning framework named Episodic Return Progress with Bidirectional Progressive Neural Networks (ERP-BPNN). The proposed ERP-BPNN model (1) learns in a human-like interleaved manner by (2) autonomous task switching based on a novel intrinsic motivation signal and, in contrast to existing methods, (3) allows bidirectional skill transfer among tasks. ERP-BPNN is a general architecture applicable to several multi-task learning settings; in this paper, we present the details of its neural architecture and show its ability to enable effective learning and skill transfer among morphologically different robots in a reaching task. The developed Bidirectional Progressive Neural Network (BPNN) architecture enables bidirectional skill transfer without requiring incremental training and seamlessly integrates with online task arbitration. The task arbitration mechanism developed is based on soft Episodic Return progress (ERP), a novel intrinsic motivation (IM) signal. To evaluate our method, we use quantifiable robotics metrics such as 'expected distance to goal' and 'path straightness' in addition to the usual reward-based measure of episodic return common in reinforcement learning. With simulation experiments, we show that ERP-BPNN achieves faster cumulative convergence and improves performance in all metrics considered among morphologically different robots compared to the baselines.  ( 3 min )
    Assessing the Extrapolation Capability of Template-Free Retrosynthesis Models
    arXiv:2403.03960v1 Announce Type: cross Abstract: Despite the acknowledged capability of template-free models in exploring unseen reaction spaces compared to template-based models for retrosynthesis prediction, their ability to venture beyond established boundaries remains relatively uncharted. In this study, we empirically assess the extrapolation capability of state-of-the-art template-free models by meticulously assembling an extensive set of out-of-distribution (OOD) reactions. Our findings demonstrate that while template-free models exhibit potential in predicting precursors with novel synthesis rules, their top-10 exact-match accuracy in OOD reactions is strikingly modest (< 1%). Furthermore, despite the capability of generating novel reactions, our investigation highlights a recurring issue where more than half of the novel reactions predicted by template-free models are chemically implausible. Consequently, we advocate for the future development of template-free models that integrate considerations of chemical feasibility when navigating unexplored regions of reaction space.  ( 2 min )
    State-Constrained Zero-Sum Differential Games with One-Sided Information
    arXiv:2403.02741v1 Announce Type: cross Abstract: We study zero-sum differential games with state constraints and one-sided information, where the informed player (Player 1) has a categorical payoff type unknown to the uninformed player (Player 2). The goal of Player 1 is to minimize his payoff without violating the constraints, while that of Player 2 is to either violate the state constraints, or otherwise, to maximize the payoff. One example of the game is a man-to-man matchup in football. Without state constraints, Cardaliaguet (2007) showed that the value of such a game exists and is convex to the common belief of players. Our theoretical contribution is an extension of this result to differential games with state constraints and the derivation of the primal and dual subdynamic principles necessary for computing the behavioral strategies. Compared with existing works on imperfect-information dynamic games that focus on scalability and generalization, our focus is instead on revealing the mechanism of belief manipulation behaviors resulted from information asymmetry and state constraints. We use a simplified football game to demonstrate the utility of this work, where we reveal player positions and belief states in which the attacker should (or should not) play specific random fake moves to take advantage of information asymmetry, and compute how the defender should respond.  ( 2 min )
    Minimizing the Thompson Sampling Regret-to-Sigma Ratio (TS-RSR): a provably efficient algorithm for batch Bayesian Optimization
    arXiv:2403.04764v1 Announce Type: new Abstract: This paper presents a new approach for batch Bayesian Optimization (BO), where the sampling takes place by minimizing a Thompson Sampling approximation of a regret to uncertainty ratio. Our objective is able to coordinate the actions chosen in each batch in a way that minimizes redundancy between points whilst focusing on points with high predictive means or high uncertainty. We provide high-probability theoretical guarantees on the regret of our algorithm. Finally, numerically, we demonstrate that our method attains state-of-the-art performance on a range of nonconvex test functions, where it outperforms several competitive benchmark batch BO algorithms by an order of magnitude on average.  ( 2 min )
    DWA: Differential Wavelet Amplifier for Image Super-Resolution
    arXiv:2307.04593v1 Announce Type: cross Abstract: This work introduces Differential Wavelet Amplifier (DWA), a drop-in module for wavelet-based image Super-Resolution (SR). DWA invigorates an approach recently receiving less attention, namely Discrete Wavelet Transformation (DWT). DWT enables an efficient image representation for SR and reduces the spatial area of its input by a factor of 4, the overall model size, and computation cost, framing it as an attractive approach for sustainable ML. Our proposed DWA model improves wavelet-based SR models by leveraging the difference between two convolutional filters to refine relevant feature extraction in the wavelet domain, emphasizing local contrasts and suppressing common noise in the input signals. We show its effectiveness by integrating it into existing SR models, e.g., DWSR and MWCNN, and demonstrate a clear improvement in classical SR tasks. Moreover, DWA enables a direct application of DWSR and MWCNN to input image space, reducing the DWT representation channel-wise since it omits traditional DWT.  ( 2 min )
    BloomGML: Graph Machine Learning through the Lens of Bilevel Optimization
    arXiv:2403.04763v1 Announce Type: new Abstract: Bilevel optimization refers to scenarios whereby the optimal solution of a lower-level energy function serves as input features to an upper-level objective of interest. These optimal features typically depend on tunable parameters of the lower-level energy in such a way that the entire bilevel pipeline can be trained end-to-end. Although not generally presented as such, this paper demonstrates how a variety of graph learning techniques can be recast as special cases of bilevel optimization or simplifications thereof. In brief, building on prior work we first derive a more flexible class of energy functions that, when paired with various descent steps (e.g., gradient descent, proximal methods, momentum, etc.), form graph neural network (GNN) message-passing layers; critically, we also carefully unpack where any residual approximation error lies with respect to the underlying constituent message-passing functions. We then probe several simplifications of this framework to derive close connections with non-GNN-based graph learning approaches, including knowledge graph embeddings, various forms of label propagation, and efficient graph-regularized MLP models. And finally, we present supporting empirical results that demonstrate the versatility of the proposed bilevel lens, which we refer to as BloomGML, referencing that BiLevel Optimization Offers More Graph Machine Learning. Our code is available at https://github.com/amberyzheng/BloomGML. Let graph ML bloom.  ( 3 min )
    Lifelong Intelligence Beyond the Edge using Hyperdimensional Computing
    arXiv:2403.04759v1 Announce Type: new Abstract: On-device learning has emerged as a prevailing trend that avoids the slow response time and costly communication of cloud-based learning. The ability to learn continuously and indefinitely in a changing environment, and with resource constraints, is critical for real sensor deployments. However, existing designs are inadequate for practical scenarios with (i) streaming data input, (ii) lack of supervision and (iii) limited on-board resources. In this paper, we design and deploy the first on-device lifelong learning system called LifeHD for general IoT applications with limited supervision. LifeHD is designed based on a novel neurally-inspired and lightweight learning paradigm called Hyperdimensional Computing (HDC). We utilize a two-tier associative memory organization to intelligently store and manage high-dimensional, low-precision vectors, which represent the historical patterns as cluster centroids. We additionally propose two variants of LifeHD to cope with scarce labeled inputs and power constraints. We implement LifeHD on off-the-shelf edge platforms and perform extensive evaluations across three scenarios. Our measurements show that LifeHD improves the unsupervised clustering accuracy by up to 74.8% compared to the state-of-the-art NN-based unsupervised lifelong learning baselines with as much as 34.3x better energy efficiency. Our code is available at https://github.com/Orienfish/LifeHD.  ( 2 min )
    SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions
    arXiv:2403.04744v1 Announce Type: new Abstract: We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a general methodology to prove SQ lower bounds for this task that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and a distribution that behaves like $A$ in a random hidden direction and like a standard Gaussian in the orthogonal complement, is SQ-hard. The required conditions were that (1) $A$ matches many low-order moments with the standard univariate Gaussian, and (2) the chi-squared norm of $A$ with respect to the standard Gaussian is finite. While the moment-matching condition is necessary for hardness, the chi-squared condition was only required for technical reasons. In this work, we establish that the latter condition is indeed not necessary. In particular, we prove near-optimal SQ lower bounds for NGCA under the moment-matching condition only. Our result naturally generalizes to the setting of a hidden subspace. Leveraging our general SQ lower bound, we obtain near-optimal SQ lower bounds for a range of concrete estimation tasks where existing techniques provide sub-optimal or even vacuous guarantees.  ( 2 min )
    GNN-VPA: A Variance-Preserving Aggregation Strategy for Graph Neural Networks
    arXiv:2403.04747v1 Announce Type: new Abstract: Graph neural networks (GNNs), and especially message-passing neural networks, excel in various domains such as physics, drug discovery, and molecular modeling. The expressivity of GNNs with respect to their ability to discriminate non-isomorphic graphs critically depends on the functions employed for message aggregation and graph-level readout. By applying signal propagation theory, we propose a variance-preserving aggregation function (VPA) that maintains expressivity, but yields improved forward and backward dynamics. Experiments demonstrate that VPA leads to increased predictive performance for popular GNN architectures as well as improved learning dynamics. Our results could pave the way towards normalizer-free or self-normalizing GNNs.  ( 2 min )
    Analysis of Systems' Performance in Natural Language Processing Competitions
    arXiv:2403.04693v1 Announce Type: new Abstract: Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing competition results and competition. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shell comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.  ( 2 min )
    Context-Based Multimodal Fusion
    arXiv:2403.04650v1 Announce Type: new Abstract: The fusion models, which effectively combine information from different sources, are widely used in solving multimodal tasks. However, they have significant limitations related to aligning data distributions across different modalities. This challenge can lead to inconsistencies and difficulties in learning robust representations. Alignment models, while specifically addressing this issue, often require training "from scratch" with large datasets to achieve optimal results, which can be costly in terms of resources and time. To overcome these limitations, we propose an innovative model called Context-Based Multimodal Fusion (CBMF), which combines both modality fusion and data distribution alignment. In CBMF, each modality is represented by a specific context vector, fused with the embedding of each modality. This enables the use of large pre-trained models that can be frozen, reducing the computational and training data requirements. Additionally, the network learns to differentiate embeddings of different modalities through fusion with context and aligns data distributions using a contrastive approach for self-supervised learning. Thus, CBMF offers an effective and economical solution for solving complex multimodal tasks.  ( 2 min )
    Teaching Large Language Models to Reason with Reinforcement Learning
    arXiv:2403.04642v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (\textbf{SFT}) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.  ( 2 min )
    Entropy Aware Message Passing in Graph Neural Networks
    arXiv:2403.04636v1 Announce Type: new Abstract: Deep Graph Neural Networks struggle with oversmoothing. This paper introduces a novel, physics-inspired GNN model designed to mitigate this issue. Our approach integrates with existing GNN architectures, introducing an entropy-aware message passing term. This term performs gradient ascent on the entropy during node aggregation, thereby preserving a certain degree of entropy in the embeddings. We conduct a comparative analysis of our model against state-of-the-art GNNs across various common datasets.  ( 2 min )
    In-n-Out: Calibrating Graph Neural Networks for Link Prediction
    arXiv:2403.04605v1 Announce Type: new Abstract: Deep neural networks are notoriously miscalibrated, i.e., their outputs do not reflect the true probability of the event we aim to predict. While networks for tabular or image data are usually overconfident, recent works have shown that graph neural networks (GNNs) show the opposite behavior for node-level classification. But what happens when we are predicting links? We show that, in this case, GNNs often exhibit a mixed behavior. More specifically, they may be overconfident in negative predictions while being underconfident in positive ones. Based on this observation, we propose IN-N-OUT, the first-ever method to calibrate GNNs for link prediction. IN-N-OUT is based on two simple intuitions: i) attributing true/false labels to an edge while respecting a GNNs prediction should cause but small fluctuations in that edge's embedding; and, conversely, ii) if we label that same edge contradicting our GNN, embeddings should change more substantially. An extensive experimental campaign shows that IN-N-OUT significantly improves the calibration of GNNs in link prediction, consistently outperforming the baselines available -- which are not designed for this specific task.  ( 2 min )
    Explaining Bayesian Optimization by Shapley Values Facilitates Human-AI Collaboration
    arXiv:2403.04629v1 Announce Type: new Abstract: Bayesian optimization (BO) with Gaussian processes (GP) has become an indispensable algorithm for black box optimization problems. Not without a dash of irony, BO is often considered a black box itself, lacking ways to provide reasons as to why certain parameters are proposed to be evaluated. This is particularly relevant in human-in-the-loop applications of BO, such as in robotics. We address this issue by proposing ShapleyBO, a framework for interpreting BO's proposals by game-theoretic Shapley values.They quantify each parameter's contribution to BO's acquisition function. Exploiting the linearity of Shapley values, we are further able to identify how strongly each parameter drives BO's exploration and exploitation for additive acquisition functions like the confidence bound. We also show that ShapleyBO can disentangle the contributions to exploration into those that explore aleatoric and epistemic uncertainty. Moreover, our method gives rise to a ShapleyBO-assisted human machine interface (HMI), allowing users to interfere with BO in case proposals do not align with human reasoning. We demonstrate this HMI's benefits for the use case of personalizing wearable robotic devices (assistive back exosuits) by human-in-the-loop BO. Results suggest human-BO teams with access to ShapleyBO can achieve lower regret than teams without.  ( 3 min )
    Beyond Major Product Prediction: Reproducing Reaction Mechanisms with Machine Learning Models Trained on a Large-Scale Mechanistic Dataset
    arXiv:2403.04580v1 Announce Type: new Abstract: Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. In this study, we construct such a dataset by imputing intermediates between experimentally reported reactants and products using expert reaction templates and train several machine learning models on the resulting dataset of 5,184,184 elementary steps. We explore the performance and capabilities of these models, focusing on their ability to predict reaction pathways and recapitulate the roles of catalysts and reagents. Additionally, we demonstrate the potential of mechanistic models in predicting impurities, often overlooked by conventional models. We conclude by evaluating the generalizability of mechanistic models to new reaction types, revealing challenges related to dataset diversity, consecutive predictions, and violations of atom conservation.  ( 2 min )
    Reducing self-supervised learning complexity improves weakly-supervised classification performance in computational pathology
    arXiv:2403.04558v1 Announce Type: new Abstract: Deep Learning models have been successfully utilized to extract clinically actionable insights from routinely available histology data. Generally, these models require annotations performed by clinicians, which are scarce and costly to generate. The emergence of self-supervised learning (SSL) methods remove this barrier, allowing for large-scale analyses on non-annotated data. However, recent SSL approaches apply increasingly expansive model architectures and larger datasets, causing the rapid escalation of data volumes, hardware prerequisites, and overall expenses, limiting access to these resources to few institutions. Therefore, we investigated the complexity of contrastive SSL in computational pathology in relation to classification performance with the utilization of consumer-grade hardware. Specifically, we analyzed the effects of adaptations in data volume, architecture, and algorithms on downstream clas- sification tasks, emphasizing their impact on computational resources. We trained breast cancer foundation models on a large public patient cohort and validated them on various downstream classification tasks in a weakly supervised manner on two external public patient cohorts. Our experiments demonstrate that we can improve downstream classification performance whilst reducing SSL training duration by 90%. In summary, we propose a set of adaptations which enable the utilization of SSL in computational pathology in non-resource abundant environments.  ( 2 min )
    Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
    arXiv:2403.04568v1 Announce Type: new Abstract: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature mappings, $S$ is the size of state space, $A$ is the size of action space, $H$ is the episode length and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored specifically to handle non-independent noises, originally proposed in the dynamic assortment area and firstly applied in reinforcement learning to handle correlations between different states.  ( 2 min )
    Sampling-based Safe Reinforcement Learning for Nonlinear Dynamical Systems
    arXiv:2403.04007v1 Announce Type: new Abstract: We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.  ( 2 min )
    Temporal Cross-Attention for Dynamic Embedding and Tokenization of Multimodal Electronic Health Records
    arXiv:2403.04012v1 Announce Type: new Abstract: The breadth, scale, and temporal granularity of modern electronic health records (EHR) systems offers great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, and timestamp duplication when multiple measurements are recorded simultaneously. Although recent efforts to fuse structured EHR and unstructured clinical notes suggest the potential for more accurate prediction of clinical outcomes, less focus has been placed on EHR embedding approaches that directly address temporal EHR challenges by learning time-aware representations from multimodal patient time series. In this paper, we introduce a dynamic embedding and tokenization framework for precise representation of multimodal clinical time series that combines novel methods for encoding time and sequential position with temporal cross-attention. Our embedding and tokenization framework, when integrated into a multitask transformer classifier with sliding window attention, outperformed baseline approaches on the exemplar task of predicting the occurrence of nine postoperative complications of more than 120,000 major inpatient surgeries using multimodal data from three hospitals and two academic health centers in the United States.  ( 2 min )
    Three Revisits to Node-Level Graph Anomaly Detection: Outliers, Message Passing and Hyperbolic Neural Networks
    arXiv:2403.04010v1 Announce Type: new Abstract: Graph anomaly detection plays a vital role for identifying abnormal instances in complex networks. Despite advancements of methodology based on deep learning in recent years, existing benchmarking approaches exhibit limitations that hinder a comprehensive comparison. In this paper, we revisit datasets and approaches for unsupervised node-level graph anomaly detection tasks from three aspects. Firstly, we introduce outlier injection methods that create more diverse and graph-based anomalies in graph datasets. Secondly, we compare methods employing message passing against those without, uncovering the unexpected decline in performance associated with message passing. Thirdly, we explore the use of hyperbolic neural networks, specifying crucial architecture and loss design that contribute to enhanced performance. Through rigorous experiments and evaluations, our study sheds light on general strategies for improving node-level graph anomaly detection methods.  ( 2 min )
    Effect of Ambient-Intrinsic Dimension Gap on Adversarial Vulnerability
    arXiv:2403.03967v1 Announce Type: new Abstract: The existence of adversarial attacks on machine learning models imperceptible to a human is still quite a mystery from a theoretical perspective. In this work, we introduce two notions of adversarial attacks: natural or on-manifold attacks, which are perceptible by a human/oracle, and unnatural or off-manifold attacks, which are not. We argue that the existence of the off-manifold attacks is a natural consequence of the dimension gap between the intrinsic and ambient dimensions of the data. For 2-layer ReLU networks, we prove that even though the dimension gap does not affect generalization performance on samples drawn from the observed data space, it makes the clean-trained model more vulnerable to adversarial perturbations in the off-manifold direction of the data space. Our main results provide an explicit relationship between the $\ell_2,\ell_{\infty}$ attack strength of the on/off-manifold attack and the dimension gap.  ( 2 min )
    Knockoff-Guided Feature Selection via A Single Pre-trained Reinforced Agent
    arXiv:2403.04015v1 Announce Type: new Abstract: Feature selection prepares the AI-readiness of data by eliminating redundant features. Prior research falls into two primary categories: i) Supervised Feature Selection, which identifies the optimal feature subset based on their relevance to the target variable; ii) Unsupervised Feature Selection, which reduces the feature space dimensionality by capturing the essential information within the feature set instead of using target variable. However, SFS approaches suffer from time-consuming processes and limited generalizability due to the dependence on the target variable and downstream ML tasks. UFS methods are constrained by the deducted feature space is latent and untraceable. To address these challenges, we introduce an innovative framework for feature selection, which is guided by knockoff features and optimized through reinforcement learning, to identify the optimal and effective feature subset. In detail, our method involves generating "knockoff" features that replicate the distribution and characteristics of the original features but are independent of the target variable. Each feature is then assigned a pseudo label based on its correlation with all the knockoff features, serving as a novel metric for feature evaluation. Our approach utilizes these pseudo labels to guide the feature selection process in 3 novel ways, optimized by a single reinforced agent: 1). A deep Q-network, pre-trained with the original features and their corresponding pseudo labels, is employed to improve the efficacy of the exploration process in feature selection. 2). We introduce unsupervised rewards to evaluate the feature subset quality based on the pseudo labels and the feature space reconstruction loss to reduce dependencies on the target variable. 3). A new {\epsilon}-greedy strategy is used, incorporating insights from the pseudo labels to make the feature selection process more effective.  ( 3 min )
  • Open

    A Retrospective Approximation Approach for Smooth Stochastic Optimization
    arXiv:2103.04392v3 Announce Type: replace-cross Abstract: Stochastic Gradient (SG) is the defacto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective $f$ and a stochastic first-order oracle. SG's attractiveness is due in part to its simplicity of executing a single step along the negative subsampled gradient direction to update the incumbent iterate. In this paper, we question SG's choice of executing a single step as opposed to multiple steps between subsample updates. Our investigation leads naturally to generalizing SG into Retrospective Approximation (RA) where, during each iteration, a "deterministic solver" executes possibly multiple steps on a subsampled deterministic problem and stops when further solving is deemed unnecessary from the standpoint of statistical efficiency. RA thus rigorizes what is appealing for implementation -- during each iteration, "plug in" a solver, e.g., L-BFGS line search or Newton-CG, as is, and solve only to the extent necessary. We develop a complete theory using relative error of the observed gradients as the principal object, demonstrating that almost sure and $L_1$ consistency of RA are preserved under especially weak conditions when sample sizes are increased at appropriate rates. We also characterize the iteration and oracle complexity (for linear and sub-linear solvers) of RA, and identify a practical termination criterion leading to optimal complexity rates. To subsume non-convex $f$, we present a certain "random central limit theorem" that incorporates the effect of curvature across all first-order critical points, demonstrating that the asymptotic behavior is described by a certain mixture of normals. The message from our numerical experiments is that the ability of RA to incorporate existing second-order deterministic solvers in a strategic manner might be important from the standpoint of dispensing with hyper-parameter tuning.  ( 3 min )
    Repelling-Attracting Hamiltonian Monte Carlo
    arXiv:2403.04607v1 Announce Type: cross Abstract: We propose a variant of Hamiltonian Monte Carlo (HMC), called the Repelling-Attracting Hamiltonian Monte Carlo (RAHMC), for sampling from multimodal distributions. The key idea that underpins RAHMC is a departure from the conservative dynamics of Hamiltonian systems, which form the basis of traditional HMC, and turning instead to the dissipative dynamics of conformal Hamiltonian systems. In particular, RAHMC involves two stages: a mode-repelling stage to encourage the sampler to move away from regions of high probability density; and, a mode-attracting stage, which facilitates the sampler to find and settle near alternative modes. We achieve this by introducing just one additional tuning parameter -- the coefficient of friction. The proposed method adapts to the geometry of the target distribution, e.g., modes and density ridges, and can generate proposals that cross low-probability barriers with little to no computational overhead in comparison to traditional HMC. Notably, RAHMC requires no additional information about the target distribution or memory of previously visited modes. We establish the theoretical basis for RAHMC, and we discuss repelling-attracting extensions to several variants of HMC in literature. Finally, we provide a tuning-free implementation via dual-averaging, and we demonstrate its effectiveness in sampling from, both, multimodal and unimodal distributions in high dimensions.  ( 2 min )
    What makes an image realistic?
    arXiv:2403.04493v1 Announce Type: cross Abstract: The last decade has seen tremendous progress in our ability to generate realistic-looking data, be it images, text, audio, or video. Here, we discuss the closely related problem of quantifying realism, that is, designing functions that can reliably tell realistic data from unrealistic data. This problem turns out to be significantly harder to solve and remains poorly understood, despite its prevalence in machine learning and recent breakthroughs in generative AI. Drawing on insights from algorithmic information theory, we discuss why this problem is challenging, why a good generative model alone is insufficient to solve it, and what a good solution would look like. In particular, we introduce the notion of a universal critic, which unlike adversarial critics does not require adversarial training. While universal critics are not immediately practical, they can serve both as a North Star for guiding practical implementations and as a tool for analyzing existing attempts to capture realism.  ( 2 min )
    On the expressivity of bi-Lipschitz normalizing flows
    arXiv:2107.07232v3 Announce Type: replace-cross Abstract: An invertible function is bi-Lipschitz if both the function and its inverse have bounded Lipschitz constants. Nowadays, most Normalizing Flows are bi-Lipschitz by design or by training to limit numerical errors (among other things). In this paper, we discuss the expressivity of bi-Lipschitz Normalizing Flows and identify several target distributions that are difficult to approximate using such models. Then, we characterize the expressivity of bi-Lipschitz Normalizing Flows by giving several lower bounds on the Total Variation distance between these particularly unfavorable distributions and their best possible approximation. Finally, we discuss potential remedies which include using more complex latent distributions.  ( 2 min )
    On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning
    arXiv:2311.12688v2 Announce Type: replace-cross Abstract: Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and how this combination effects out-of-distribution coverage; particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate prediction sets as a result of combining split conformal methods and neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.  ( 2 min )
    On the consistency of supervised learning with missing values
    arXiv:1902.06931v4 Announce Type: replace Abstract: In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is pointed at for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data,through multiple imputation.Finally, to compare imputation with learning directly with a model that accounts for missing values, we analyze further decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" method as it can handle both non-informative and informative missing values.  ( 3 min )
    Regression modelling of spatiotemporal extreme U.S. wildfires via partially-interpretable neural networks
    arXiv:2208.07581v4 Announce Type: replace Abstract: Risk management in many environmental settings requires an understanding of the mechanisms that drive extreme events. Useful metrics for quantifying such risk are extreme quantiles of response variables conditioned on predictor variables that describe, e.g., climate, biosphere and environmental states. Typically these quantiles lie outside the range of observable data and so, for estimation, require specification of parametric extreme value models within a regression framework. Classical approaches in this context utilise linear or additive relationships between predictor and response variables and suffer in either their predictive capabilities or computational efficiency; moreover, their simplicity is unlikely to capture the truly complex structures that lead to the creation of extreme wildfires. In this paper, we propose a new methodological framework for performing extreme quantile regression using artificial neutral networks, which are able to capture complex non-linear relationships and scale well to high-dimensional data. The "black box" nature of neural networks means that they lack the desirable trait of interpretability often favoured by practitioners; thus, we unify linear, and additive, regression methodology with deep learning to create partially-interpretable neural networks that can be used for statistical inference but retain high prediction accuracy. To complement this methodology, we further propose a novel point process model for extreme values which overcomes the finite lower-endpoint problem associated with the generalised extreme value class of distributions. Efficacy of our unified framework is illustrated on U.S. wildfire data with a high-dimensional predictor set and we illustrate vast improvements in predictive performance over linear and spline-based regression techniques.  ( 3 min )
    Standard Gaussian Process is All You Need for High-Dimensional Bayesian Optimization
    arXiv:2402.02746v2 Announce Type: replace-cross Abstract: There has been a long-standing and widespread belief that Bayesian Optimization (BO) with standard Gaussian process (GP), referred to as standard BO, is ineffective in high-dimensional optimization problems. This perception may partly stem from the intuition that GPs struggle with high-dimensional inputs for covariance modeling and function estimation. While these concerns seem reasonable, empirical evidence supporting this belief is lacking. In this paper, we systematically investigated BO with standard GP regression across a variety of synthetic and real-world benchmark problems for high-dimensional optimization. Surprisingly, the performance with standard GP consistently ranks among the best, often outperforming existing BO methods specifically designed for high-dimensional optimization by a large margin. Contrary to the stereotype, we found that standard GP can serve as a capable surrogate for learning high-dimensional target functions. Without strong structural assumptions, BO with standard GP not only excels in high-dimensional optimization but also proves robust in accommodating various structures within the target functions. Furthermore, with standard GP, achieving promising optimization performance is possible by only using maximum likelihood estimation, eliminating the need for expensive Markov-Chain Monte Carlo (MCMC) sampling that might be required by more complex surrogate models. We thus advocate for a re-evaluation and in-depth study of the potential of standard BO in addressing high-dimensional problems.  ( 3 min )
    Minimizing the Thompson Sampling Regret-to-Sigma Ratio (TS-RSR): a provably efficient algorithm for batch Bayesian Optimization
    arXiv:2403.04764v1 Announce Type: cross Abstract: This paper presents a new approach for batch Bayesian Optimization (BO), where the sampling takes place by minimizing a Thompson Sampling approximation of a regret to uncertainty ratio. Our objective is able to coordinate the actions chosen in each batch in a way that minimizes redundancy between points whilst focusing on points with high predictive means or high uncertainty. We provide high-probability theoretical guarantees on the regret of our algorithm. Finally, numerically, we demonstrate that our method attains state-of-the-art performance on a range of nonconvex test functions, where it outperforms several competitive benchmark batch BO algorithms by an order of magnitude on average.  ( 2 min )
    Efficient CNN-LSTM based Parameter Estimation of Levy Driven Stochastic Differential Equations
    arXiv:2403.04246v1 Announce Type: new Abstract: This study addresses the challenges in parameter estimation of stochastic differential equations driven by non-Gaussian noises, which are critical in understanding dynamic phenomena such as price fluctuations and the spread of infectious diseases. Previous research highlighted the potential of LSTM networks in estimating parameters of alpha stable Levy driven SDEs but faced limitations including high time complexity and constraints of the LSTM chaining property. To mitigate these issues, we introduce the PEnet, a novel CNN-LSTM-based three-stage model that offers an end to end approach with superior accuracy and adaptability to varying data structures, enhanced inference speed for long sequence observations through initial data feature condensation by CNN, and high generalization capability, allowing its application to various complex SDE scenarios. Experiments on synthetic datasets confirm PEnet significant advantage in estimating SDE parameters associated with noise characteristics, establishing it as a competitive method for SDE parameter estimation in the presence of Levy noise.  ( 2 min )
    Online Learning with Unknown Constraints
    arXiv:2403.04033v1 Announce Type: cross Abstract: We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that captures the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.  ( 2 min )
    Diffusion Models for Constrained Domains
    arXiv:2304.05364v2 Announce Type: replace-cross Abstract: Denoising diffusion models are a novel class of generative algorithms that achieve state-of-the-art performance across a range of domains, including image generation and text-to-image tasks. Building on this success, diffusion models have recently been extended to the Riemannian manifold setting, broadening their applicability to a range of problems from the natural and engineering sciences. However, these Riemannian diffusion models are built on the assumption that their forward and backward processes are well-defined for all times, preventing them from being applied to an important set of tasks that consider manifolds defined via a set of inequality constraints. In this work, we introduce a principled framework to bridge this gap. We present two distinct noising processes based on (i) the logarithmic barrier metric and (ii) the reflected Brownian motion induced by the constraints. As existing diffusion model techniques cannot be applied in this setting, we derive new tools to define such models in our framework. We then demonstrate the practical utility of our methods on a number of synthetic and real-world tasks, including applications from robotics and protein design.  ( 2 min )
    Hoeffding decomposition of black-box models with dependent inputs
    arXiv:2310.06567v2 Announce Type: replace-cross Abstract: One of the main challenges for interpreting black-box models is the ability to uniquely decompose square-integrable functions of non-independent random inputs into a sum of functions of every possible subset of variables. However, dealing with dependencies among inputs can be complicated. We propose a novel framework to study this problem, linking three domains of mathematics: probability theory, functional analysis, and combinatorics. We show that, under two reasonable assumptions on the inputs (non-perfect functional dependence and non-degenerate stochastic dependence), it is always possible to decompose such a function uniquely. This generalizes the well-known Hoeffding decomposition. The elements of this decomposition can be expressed using oblique projections and allow for novel interpretability indices for evaluation and variance decomposition purposes. The properties of these novel indices are studied and discussed. This generalization offers a path towards a more precise uncertainty quantification, which can benefit sensitivity analysis and interpretability studies whenever the inputs are dependent. This decomposition is illustrated analytically, and the challenges for adopting these results in practice are discussed.  ( 2 min )
    Label Alignment Regularization for Distribution Shift
    arXiv:2211.14960v3 Announce Type: replace-cross Abstract: Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this observation, we propose a regularization method for unsupervised domain adaptation that encourages alignment between the predictions in the target domain and its top singular vectors. Unlike conventional domain adaptation approaches that focus on regularizing representations, we instead regularize the classifier to align with the unsupervised target data, guided by the LAP in both the source and target domains. Theoretical analysis demonstrates that, under certain assumptions, our solution resides within the span of the top right singular vectors of the target domain data and aligns with the optimal solution. By removing the reliance on the commonly used optimal joint risk assumption found in classic domain adaptation theory, we showcase the effectiveness of our method on addressing problems where traditional domain adaptation methods often fall short due to high joint error. Additionally, we report improved performance over domain adaptation baselines in well-known tasks such as MNIST-USPS domain adaptation and cross-lingual sentiment analysis.  ( 2 min )
    Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
    arXiv:2306.06836v3 Announce Type: replace-cross Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL settings with linear function approximation. Our algorithm, termed as \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.  ( 3 min )
    A Novel Theoretical Framework for Exponential Smoothing
    arXiv:2403.04345v1 Announce Type: cross Abstract: Simple Exponential Smoothing is a classical technique used for smoothing time series data by assigning exponentially decreasing weights to past observations through a recursive equation; it is sometimes presented as a rule of thumb procedure. We introduce a novel theoretical perspective where the recursive equation that defines simple exponential smoothing occurs naturally as a stochastic gradient ascent scheme to optimize a sequence of Gaussian log-likelihood functions. Under this lens of analysis, our main theorem shows that -in a general setting- simple exponential smoothing converges to a neighborhood of the trend of a trend-stationary stochastic process. This offers a novel theoretical assurance that the exponential smoothing procedure yields reliable estimators of the underlying trend shedding light on long-standing observations in the literature regarding the robustness of simple exponential smoothing.  ( 2 min )
    Misspecification-robust Sequential Neural Likelihood for Simulation-based Inference
    arXiv:2301.13368v2 Announce Type: replace-cross Abstract: Simulation-based inference techniques are indispensable for parameter estimation of mechanistic and simulable models with intractable likelihoods. While traditional statistical approaches like approximate Bayesian computation and Bayesian synthetic likelihood have been studied under well-specified and misspecified settings, they often suffer from inefficiencies due to wasted model simulations. Neural approaches, such as sequential neural likelihood (SNL) avoid this wastage by utilising all model simulations to train a neural surrogate for the likelihood function. However, the performance of SNL under model misspecification is unreliable and can result in overconfident posteriors centred around an inaccurate parameter estimate. In this paper, we propose a novel SNL method, which through the incorporation of additional adjustment parameters, is robust to model misspecification and capable of identifying features of the data that the model is not able to recover. We demonstrate the efficacy of our approach through several illustrative examples, where our method gives more accurate point estimates and uncertainty quantification than SNL.  ( 2 min )
    Effect of Ambient-Intrinsic Dimension Gap on Adversarial Vulnerability
    arXiv:2403.03967v1 Announce Type: cross Abstract: The existence of adversarial attacks on machine learning models imperceptible to a human is still quite a mystery from a theoretical perspective. In this work, we introduce two notions of adversarial attacks: natural or on-manifold attacks, which are perceptible by a human/oracle, and unnatural or off-manifold attacks, which are not. We argue that the existence of the off-manifold attacks is a natural consequence of the dimension gap between the intrinsic and ambient dimensions of the data. For 2-layer ReLU networks, we prove that even though the dimension gap does not affect generalization performance on samples drawn from the observed data space, it makes the clean-trained model more vulnerable to adversarial perturbations in the off-manifold direction of the data space. Our main results provide an explicit relationship between the $\ell_2,\ell_{\infty}$ attack strength of the on/off-manifold attack and the dimension gap.  ( 2 min )
    A correlation-based fuzzy cluster validity index with secondary options detector
    arXiv:2308.14785v3 Announce Type: replace Abstract: The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter $m$ is set to a large value. Our R package called UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.  ( 3 min )
    Unsupervised Pretraining for Fact Verification by Language Model Distillation
    arXiv:2309.16540v3 Announce Type: replace-cross Abstract: Fact verification aims to verify a claim using evidence from a trustworthy knowledge base. To address this challenge, algorithms must produce features for every claim that are both semantically meaningful, and compact enough to find a semantic alignment with the source information. In contrast to previous work, which tackled the alignment problem by learning over annotated corpora of claims and their corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via Language Model Distillation), a novel unsupervised pretraining framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-fact alignments without the need for annotations. This is enabled by a novel contrastive loss function that encourages features to attain high-quality claim and evidence alignments whilst preserving the semantic relationships across the corpora. Notably, we present results that achieve a new state-of-the-art on FB15k-237 (+5.3% Hits@1) and FEVER (+8% accuracy) with linear evaluation.  ( 2 min )
    First-order penalty methods for bilevel optimization
    arXiv:2301.01716v2 Announce Type: replace-cross Abstract: In this paper we study a class of unconstrained and constrained bilevel optimization problems in which the lower level is a possibly nonsmooth convex optimization problem, while the upper level is a possibly nonconvex optimization problem. We introduce a notion of $\varepsilon$-KKT solution for them and show that an $\varepsilon$-KKT solution leads to an $O(\sqrt{\varepsilon})$- or $O(\varepsilon)$-hypergradient based stionary point under suitable assumptions. We also propose first-order penalty methods for finding an $\varepsilon$-KKT solution of them, whose subproblems turn out to be a structured minimax problem and can be suitably solved by a first-order method recently developed by the authors. Under suitable assumptions, an \emph{operation complexity} of $O(\varepsilon^{-4}\log\varepsilon^{-1})$ and $O(\varepsilon^{-7}\log\varepsilon^{-1})$, measured by their fundamental operations, is established for the proposed penalty methods for finding an $\varepsilon$-KKT solution of the unconstrained and constrained bilevel optimization problems, respectively. Preliminary numerical results are presented to illustrate the performance of our proposed methods. To the best of our knowledge, this paper is the first work to demonstrate that bilevel optimization can be approximately solved as minimax optimization, and moreover, it provides the first implementable method with complexity guarantees for such sophisticated bilevel optimization.  ( 2 min )
    Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative Models
    arXiv:2306.09251v3 Announce Type: replace Abstract: Diffusion models, which convert noise into new data instances by learning to reverse a Markov diffusion process, have become a cornerstone in contemporary generative modeling. While their practical power has now been widely recognized, the theoretical underpinnings remain far from mature. In this work, we develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models in discrete time, assuming access to $\ell_2$-accurate estimates of the (Stein) score functions. For a popular deterministic sampler (based on the probability flow ODE), we establish a convergence rate proportional to $1/T$ (with $T$ the total number of steps), improving upon past results; for another mainstream stochastic sampler (i.e., a type of the denoising diffusion probabilistic model), we derive a convergence rate proportional to $1/\sqrt{T}$, matching the state-of-the-art theory. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results characterize how $\ell_2$ score estimation errors affect the quality of the data generation processes. In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach without resorting to toolboxes for SDEs and ODEs. Further, we design two accelerated variants, improving the convergence to $1/T^2$ for the ODE-based sampler and $1/T$ for the DDPM-type sampler, which might be of independent theoretical and empirical interest.  ( 3 min )
    Signature Isolation Forest
    arXiv:2403.04405v1 Announce Type: new Abstract: Functional Isolation Forest (FIF) is a recent state-of-the-art Anomaly Detection (AD) algorithm designed for functional data. It relies on a tree partition procedure where an abnormality score is computed by projecting each curve observation on a drawn dictionary through a linear inner product. Such linear inner product and the dictionary are a priori choices that highly influence the algorithm's performances and might lead to unreliable results, particularly with complex datasets. This work addresses these challenges by introducing \textit{Signature Isolation Forest}, a novel AD algorithm class leveraging the rough path theory's signature transform. Our objective is to remove the constraints imposed by FIF through the proposition of two algorithms which specifically target the linearity of the FIF inner product and the choice of the dictionary. We provide several numerical experiments, including a real-world applications benchmark showing the relevance of our methods.  ( 2 min )
    GNN-VPA: A Variance-Preserving Aggregation Strategy for Graph Neural Networks
    arXiv:2403.04747v1 Announce Type: cross Abstract: Graph neural networks (GNNs), and especially message-passing neural networks, excel in various domains such as physics, drug discovery, and molecular modeling. The expressivity of GNNs with respect to their ability to discriminate non-isomorphic graphs critically depends on the functions employed for message aggregation and graph-level readout. By applying signal propagation theory, we propose a variance-preserving aggregation function (VPA) that maintains expressivity, but yields improved forward and backward dynamics. Experiments demonstrate that VPA leads to increased predictive performance for popular GNN architectures as well as improved learning dynamics. Our results could pave the way towards normalizer-free or self-normalizing GNNs.  ( 2 min )
    SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions
    arXiv:2403.04744v1 Announce Type: cross Abstract: We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a general methodology to prove SQ lower bounds for this task that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and a distribution that behaves like $A$ in a random hidden direction and like a standard Gaussian in the orthogonal complement, is SQ-hard. The required conditions were that (1) $A$ matches many low-order moments with the standard univariate Gaussian, and (2) the chi-squared norm of $A$ with respect to the standard Gaussian is finite. While the moment-matching condition is necessary for hardness, the chi-squared condition was only required for technical reasons. In this work, we establish that the latter condition is indeed not necessary. In particular, we prove near-optimal SQ lower bounds for NGCA under the moment-matching condition only. Our result naturally generalizes to the setting of a hidden subspace. Leveraging our general SQ lower bound, we obtain near-optimal SQ lower bounds for a range of concrete estimation tasks where existing techniques provide sub-optimal or even vacuous guarantees.  ( 2 min )
    A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation
    arXiv:2403.04726v1 Announce Type: cross Abstract: We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic ($\Omega(d^2)$), which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant.  ( 2 min )
    Explaining Bayesian Optimization by Shapley Values Facilitates Human-AI Collaboration
    arXiv:2403.04629v1 Announce Type: cross Abstract: Bayesian optimization (BO) with Gaussian processes (GP) has become an indispensable algorithm for black box optimization problems. Not without a dash of irony, BO is often considered a black box itself, lacking ways to provide reasons as to why certain parameters are proposed to be evaluated. This is particularly relevant in human-in-the-loop applications of BO, such as in robotics. We address this issue by proposing ShapleyBO, a framework for interpreting BO's proposals by game-theoretic Shapley values.They quantify each parameter's contribution to BO's acquisition function. Exploiting the linearity of Shapley values, we are further able to identify how strongly each parameter drives BO's exploration and exploitation for additive acquisition functions like the confidence bound. We also show that ShapleyBO can disentangle the contributions to exploration into those that explore aleatoric and epistemic uncertainty. Moreover, our method gives rise to a ShapleyBO-assisted human machine interface (HMI), allowing users to interfere with BO in case proposals do not align with human reasoning. We demonstrate this HMI's benefits for the use case of personalizing wearable robotic devices (assistive back exosuits) by human-in-the-loop BO. Results suggest human-BO teams with access to ShapleyBO can achieve lower regret than teams without.  ( 3 min )
    Regularized DeepIV with Model Selection
    arXiv:2403.04236v1 Announce Type: cross Abstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring minimax computation oracle, which is highly unstable in practice; (3) absence of model selection procedure. In this paper, we present the first method and analysis that can avoid all three limitations, while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method called Regularized DeepIV (RDIV) regression that can converge to the least-norm IV solution. Our method consists of two stages: first, we learn the conditional distribution of covariates, and by utilizing the learned distribution, we learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method allows model selection procedures that can achieve the oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov regularized variant of the popular DeepIV method with a non-parametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization which was absent from the original work.  ( 2 min )
    Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference
    arXiv:2403.04082v1 Announce Type: cross Abstract: Given time series data, how can we answer questions like "what will happen in the future?" and "how did we get here?" These sorts of probabilistic inference questions are challenging when observations are high-dimensional. In this paper, we show how these questions can have compact, closed form solutions in terms of learned representations. The key idea is to apply a variant of contrastive learning to time series data. Prior work already shows that the representations learned by contrastive learning encode a probability ratio. By extending prior work to show that the marginal distribution over representations is Gaussian, we can then prove that joint distribution of representations is also Gaussian. Taken together, these results show that representations learned via temporal contrastive learning follow a Gauss-Markov chain, a graphical model where inference (e.g., prediction, planning) over representations corresponds to inverting a low-dimensional matrix. In one special case, inferring intermediate representations will be equivalent to interpolating between the learned representations. We validate our theory using numerical simulations on tasks up to 46-dimensions.  ( 2 min )
    Knockoff-Guided Feature Selection via A Single Pre-trained Reinforced Agent
    arXiv:2403.04015v1 Announce Type: cross Abstract: Feature selection prepares the AI-readiness of data by eliminating redundant features. Prior research falls into two primary categories: i) Supervised Feature Selection, which identifies the optimal feature subset based on their relevance to the target variable; ii) Unsupervised Feature Selection, which reduces the feature space dimensionality by capturing the essential information within the feature set instead of using target variable. However, SFS approaches suffer from time-consuming processes and limited generalizability due to the dependence on the target variable and downstream ML tasks. UFS methods are constrained by the deducted feature space is latent and untraceable. To address these challenges, we introduce an innovative framework for feature selection, which is guided by knockoff features and optimized through reinforcement learning, to identify the optimal and effective feature subset. In detail, our method involves generating "knockoff" features that replicate the distribution and characteristics of the original features but are independent of the target variable. Each feature is then assigned a pseudo label based on its correlation with all the knockoff features, serving as a novel metric for feature evaluation. Our approach utilizes these pseudo labels to guide the feature selection process in 3 novel ways, optimized by a single reinforced agent: 1). A deep Q-network, pre-trained with the original features and their corresponding pseudo labels, is employed to improve the efficacy of the exploration process in feature selection. 2). We introduce unsupervised rewards to evaluate the feature subset quality based on the pseudo labels and the feature space reconstruction loss to reduce dependencies on the target variable. 3). A new {\epsilon}-greedy strategy is used, incorporating insights from the pseudo labels to make the feature selection process more effective.  ( 3 min )
    Fundamental limits of Non-Linear Low-Rank Matrix Estimation
    arXiv:2403.04234v1 Announce Type: new Abstract: We consider the task of estimating a low-rank matrix from non-linear and noisy observations. We prove a strong universality result showing that Bayes-optimal performances are characterized by an equivalent Gaussian model with an effective prior, whose parameters are entirely determined by an expansion of the non-linear function. In particular, we show that to reconstruct the signal accurately, one requires a signal-to-noise ratio growing as $N^{\frac 12 (1-1/k_F)}$, where $k_F$ is the first non-zero Fisher information coefficient of the function. We provide asymptotic characterization for the minimal achievable mean squared error (MMSE) and an approximate message-passing algorithm that reaches the MMSE under conditions analogous to the linear version of the problem. We also provide asymptotic errors achieved by methods such as principal component analysis combined with Bayesian denoising, and compare them with Bayes-optimal MMSE.  ( 2 min )
    Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
    arXiv:2403.04568v1 Announce Type: cross Abstract: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature mappings, $S$ is the size of state space, $A$ is the size of action space, $H$ is the episode length and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored specifically to handle non-independent noises, originally proposed in the dynamic assortment area and firstly applied in reinforcement learning to handle correlations between different states.  ( 2 min )
    On the Efficient Marginalization of Probabilistic Sequence Models
    arXiv:2403.04005v1 Announce Type: new Abstract: Real-world data often exhibits sequential dependence, across diverse domains such as human behavior, medicine, finance, and climate modeling. Probabilistic methods capture the inherent uncertainty associated with prediction in these contexts, with autoregressive models being especially prominent. This dissertation focuses on using autoregressive models to answer complex probabilistic queries that go beyond single-step prediction, such as the timing of future events or the likelihood of a specific event occurring before another. In particular, we develop a broad class of novel and efficient approximation techniques for marginalization in sequential models that are model-agnostic. These techniques rely solely on access to and sampling from next-step conditional distributions of a pre-trained autoregressive model, including both traditional parametric models as well as more recent neural autoregressive models. Specific approaches are presented for discrete sequential models, for marked temporal point processes, and for stochastic jump processes, each tailored to a well-defined class of informative, long-range probabilistic queries.  ( 2 min )
    Sample size planning for conditional counterfactual mean estimation with a K-armed randomized experiment
    arXiv:2403.04039v1 Announce Type: cross Abstract: We cover how to determine a sufficiently large sample size for a $K$-armed randomized experiment in order to estimate conditional counterfactual expectations in data-driven subgroups. The sub-groups can be output by any feature space partitioning algorithm, including as defined by binning users having similar predictive scores or as defined by a learned policy tree. After carefully specifying the inference target, a minimum confidence level, and a maximum margin of error, the key is to turn the original goal into a simultaneous inference problem where the recommended sample size to offset an increased possibility of estimation error is directly related to the number of inferences to be conducted. Given a fixed sample size budget, our result allows us to invert the question to one about the feasible number of treatment arms or partition complexity (e.g. number of decision tree leaves). Using policy trees to learn sub-groups, we evaluate our nominal guarantees on a large publicly-available randomized experiment test data set.  ( 2 min )

  • Open

    He created my favorite hero. I didnt have a chance to share this with Toriyama. I'm sharing it with you all instead.. Goku AI (Project Tori)
    submitted by /u/JL-Engineer [link] [comments]
    Apple Car Q&A With Bloomberg's Mark Gurman. What Questions Do You Have?
    Hi everyone! Bloomberg News Apple reporter Mark Gurman is hosting a live Q&A about Apple’s failed electric car effort -- and what’s next for the company. You can tune in on Monday, 3/11, at 2pm ET. If you have questions for Mark, drop them in the comments and we’ll do our best to make sure they get answered during our live conversation next week. We also have an email you can send questions to for a more detailed response -- that’s [bloombergqa@bloomberg.net](mailto:bloombergqa@bloomberg.net). More on the Apple Car: Apple believed that it could create a better car than Tesla and the rest of the automotive industry ultimately led to the downfall of the project. The company had looked at pricing vehicles at $100,000 apiece, meaning it wouldn’t take long for the company to have a multibillion-dollar moneymaker on its hands (even if those sales didn’t translate to big profits). But according to someone involved in the decision making, it was as if Apple had tried to skip all the early iPhone models and jump right to the iPhone X. The company also had sunk billions of dollars into developing a Level 5 self-driving system — the highest level of autonomy. The end of the car project means CEO Tim Cook has one fewer breakthrough to hang his hat on as Steve Jobs’ successor. You can read the full story on Apple's failed car attempt for free here. submitted by /u/bloomberg [link] [comments]
    Best (non sensational/content farm) YouTube channels to follow for AI news?
    What do you use to stay on top on new developments? Im a "FANG" ml engineer and aside from my areas of specialization I feel like I need to know what's going on overall in the field - but it's hard to keep up. For staying on top on overall AI developments/news I personally use AI Explained (breaks down new developments and discusses potential implications - balanced and goes deep in terms of sources) Dwarkesh Patel (long form interviews with great technical/practical questions) ByCloud (a bit more lighthearted but still technical overview of new AI developments) Yannic Kilcher occasionally puts out [ML News] recap videos which are also good summaries. I find by following these I am in the loop with most news and rumors, but maybe there are others? submitted by /u/bandalorian [link] [comments]
    75 Full SORA Demo Videos in 4K (With Music, no Talk)
    I compiled all SORA demo videos I could find, upscaled them to 4K, and added music. The file was too big for Reddit. Here is a YouTube link - https://youtu.be/jicsH-wxZDU submitted by /u/1BlueSpork [link] [comments]
    The Future of AI. The Ultimate safety measure
    submitted by /u/Unreal_777 [link] [comments]
    [Safety] Why can't models filter bad training data rather than fine-tune?
    Bias, safety, disinformation, copyright, and alignment are big problems with AI. Fine-tuning is used to mitigate issues, but that often makes the model less effective. Why not have a moderation filter that uses a secondary model to block bad training data, so harmful content never enters the primary model in the first place? This way post-training nerfing/find-tuning wouldn't necessary. For example, openAI has a moderations API endpoint that can be used to determine if a prompt is harmful. This could be used on training data, and on failure, don't feed that data into the new model. I assume the moderation API uses an embedding model which is a ton cheaper than an LLM for filtering purposes. I'm not saying this is how it should be done exactly, I'm just using it as an example. Are there any models that were trained this way? I would think that the end result would be a much smarter model without the issues. submitted by /u/funbike [link] [comments]
    Saudi Arabia's Male Humanoid Robot Accused of Sexual Harassment
    A video of Saudi Arabia's first male robot has gone viral after a few netizens accused the humanoid of touching a female reporter inappropriately. Saudi Arabia's first male robot touched a reporter inappropriately. "Saudi Arabia unveils its man-shaped AI robot, Mohammad, reacts to a reporter in its first appearance," an X user wrote while sharing the video that people are claiming shows the robot's inappropriate behaviour. You can view the original tweet here. ​ submitted by /u/khommenghetsum [link] [comments]
    Why most AI benchmarks tell us so little
    Anthropic and Inflection AI release competitive generative models. Current benchmarks fail to reflect the real-world use of AI models. GPQA and HellaSwag were criticized for their lack of real-world applicability. Evaluation crises in the industry due to outdated benchmarks. MMLU's relevance was questioned due to the potential for rote memorization. Read more: https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/ submitted by /u/clonefitreal [link] [comments]
    General robots for homes
    I think the next big revolution that will shake things up is general robots. Think back a decade or two – the idea of scanning your own groceries and bagging them yourself might have seemed strange. Yet, self-checkout has become commonplace across many stores. Its adoption wasn't instantaneous but occurred over a 5 to 10-year period. We can expect a similar pattern with robotics. Initially, robots may work alongside human employees, handling specific tasks while workers adapt and the technology is refined. Likely there will be only 1 or 2 robots in a store that helps out customers or deal with things. This like other technologies like the self check out, websites, and so on where they basically can test out the technology before they fully invest into it. But I imagine after the test are …
    One-Minute Daily AI News 3/7/2024
    A nuclear startup backed by OpenAI chief Sam Altman wants to power data centers and homes alike. It’s racing against surging demand while working to satisfy regulators.[1] From Tinder to Replika, generative AI is transforming how humans use dating apps and even spurring real people to romance AI chatbots.[2] NVIDIA and HP Inc. today announced that NVIDIA CUDA-X™ data processing libraries will be integrated with HP AI workstation solutions to turbocharge the data preparation and processing work that forms the foundation of generative AI development.[3] President Biden included a nod to a rising issue in the entertainment and tech industries during his State of the Union address Thursday evening, calling for a ban on AI voice impersonations.[4] Sources: [1] https://www.nbcnews.com/tech/tech-news/nuclear-power-oklo-sam-altman-ai-energy-rcna139094 [2] https://www.theverge.com/24092638/ai-dating-chatbots-romance-replika-tinder-decoder [3] https://nvidianews.nvidia.com/news/nvidia-hp-supercharge-data-science-generative-ai-workstations [4] https://variety.com/2024/politics/news/president-biden-ban-ai-voices-state-of-the-union-1235934651/ submitted by /u/Excellent-Target-847 [link] [comments]
    Differences between being intelligent, educated, and having memorization.
    I see a lot of confusion between these three terms, so I gave definitions with clear and concise examples. I’m open to criticism. Intelligent Ability to discover knowledge by using only oneself’s mind. Usually through logic, or critically thinking Situation: You were able to figure what someone was saying in a different language, through sheer intelligence. Let’s pretend you’re a boy for a second. Someone goes around saying “Tu eres un niño”, and “Tu eres una niña”. You know what “tu eres un” means, it means “you are a”, but you don’t know what niña or niño means. He ends up asking you which one are you. You figure out that means niño means boy and niña means girl, because he only told boys that they are niños, and girls that they are niñas. So you say you’re a niño, and get it right…
    Army chooses Palantir to build next-generation targeting system
    submitted by /u/A-Dog22 [link] [comments]
  • Open

    [D] Is there any guide or tutorial for getting best voice clone out of Tortoise TTS ?
    Hi, I am using this tool on a Windows machine + Nvidia mrq/ai-voice-cloning/wiki/Installation I have almost 1 hour of voice sample, all very good quality and with zero noises. I train the model with: 500 epochs 100 save and validation frequency The end result is not very similar and sounds robotic. I am trying to find online some tutorial with tips on the best training settings for getting the best clone quality. If you know of such please let me know, otherwise I have some questions: How many minutes/hours of training data do you need for a verry good clone qualiity ? Can it be that 1 hour is not enough.. Should I increase the epochs ? I saw 3 youtube videos all used 500 and 100 validation freq submitted by /u/chriscs777 [link] [comments]
    [D] Fatal flaw in VAE + GANs architecture for Deep Fakes?
    Hi, as many know, a lot of Deep Fakes are made with a combination of VAE + GANs architecture. The loss function used is typically the sum of VAE + GANs loss functions. This means that a model for creating Deep Fakes has to compromise between what's optimal for the VAE and what's optimal for the GANs. But for detecting deep fakes, the focus is solely on the more GANs side of things. So I believe this creates a fatal flaw that gives the advantage to the detector for now? Detector models only need to optimize for discriminating real from fake. For other deep fake models and architectures, this won't be the case. submitted by /u/Will_Tomos_Edwards [link] [comments]
    Very short computational time in ML and DL [D]
    Can anyone explain how the computational time mentioned in papers is so short, like less than 1 second for training a whole dataset? I provide an example in this table taken from the article: edit : the paper i provide isn't the only onewith computational times of less than one second, there are many others. Also, there are papers reporting computational times of 2-3 seconds. I don't understand how this is possible.. edit2 : I should add tho that the dataset used contains time series data (sensor measurements) with a relatively small number of samples (7532 sample) but still i find it hard to understand how can it be trained in 3.32 ms. submitted by /u/PerfecttMachine [link] [comments]
    [Discussion] Which AutoML platform do you think is the best one out there?
    I'm currently evaluating the following platforms: SageMaker Autopilot Databricks AutoML DataRobot H2O.ai AutoML I am finding it very overwhelming to compare them as their product pages provide very little specifics on their workings (more like they want to sell their products). If you have experience with these, which one has satisfied you the most in terms of ease of use, costs and performance? submitted by /u/barberogaston [link] [comments]
    [P] Alignment for Two-Player Boardgames
    I recently came up with a method of constructing the graph of allowed moves in two-player board games like tic-tac-toe or chess, and I would like to turn this into a research project with a goal of aligning LLMs to play these games. I solved the problem for tic-tac-toe, and I'm working on extending the solution to chess, but since it involves both topological data analysis and self-supervised learning, I would like to work with someone experienced on this. I am constructing the graph using only the knowledge of allowed states of the board, so any data on moves or sequences of them be used to test the results and further improve the model. Has anyone seen similar research, or knows researchers interested in the topic? submitted by /u/ScarUnlikely336 [link] [comments]
    AI/ML Msc recommendation in London [Discussion]
    Hi everyone! I’m a computer science graduate with 10 years of experience in the tech industry, currently exploring options for further education. I have a strong interest in AI and machine learning, with a particular curiosity about its applications in health and biomedical sciences. I’m considering pursuing a master’s degree in London, eyeing institutions like Imperial College London, King’s College London, or UCL. I’m in a position where both part-time and full-time courses could be viable, though I need to balance this with my current job. I’m looking for recommendations on specific programs that could fit this criteria. Ideally, these programs would provide a robust foundation in AI and ML, while also offering insights or electives related to health and biomedical applications. Has anyone here pursued a similar path or have any experiences with these institutions they’d be willing to share? Any insights on balancing work and study, program recommendations, or general advice would be greatly appreciated. My long-term goal is to deepen my expertise in AI/ML, potentially leading to a PhD, and to work in cutting-edge environments like Meta or DeepMind. Thanks in advance for your help! submitted by /u/Alarmed_Challenge_30 [link] [comments]
    [R] Online Adaptation of Language Models with a Memory of Amortized Contexts
    submitted by /u/Forsaken_Scientist [link] [comments]
    [P] Sparse Interpretable Audio Model
    I've long been dissatisfied with fixed-frame-size-based audio representations, and with the sound of audio generated via upsampling and/or transposed convolutions. I've recently been developing a model that decomposes acoustic musical audio into a single, global event vector, as well as a sparse set of "event" vectors, which are rendered using an excitation + resonance model inspired by physical modelling synthesis. The event-based approach is meant to bridge the gap between purely symbolic (MIDI)-based representations, and frame-based representations that are used in most current models (e.g. Encodec) I'm aware of. My sense is that this representation could lend itself well toward auto-regressive/LLM-based music generation, and much more efficiently, since the representation is sparse and is meant to align with a human notion of what sorts of musical events/objects are present. ​ It's generally interesting to think about audio as a graph of time-located events (think about the sort of sparse representation you'd end up via matching pursuit), rather than a fixed-rate time series. ​ There's a demo and a little more information about the architecture here: https://johnvinyard.github.io/siam.html ​ The code, which is a bit of a mess, and contains a lot of things that are idiosyncratic to my research/dev process, is here https://github.com/JohnVinyard/matching-pursuit/blob/main/experiments/e_2024_2_25/experiment.py ​ I'd be happy to work on packaging up the model more nicely, and sharing trained weights, if anyone is interested in exploring further. ​ submitted by /u/JohnVinyard [link] [comments]
    [R] LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error - Microsoft Semantic Machines 2024 - Giving Mistral-Instruct-7B a boost of 46,7% points and enabling it to outperform GPT-4 in the ToolBench benchmark!
    Paper: https://arxiv.org/abs/2403.04746 Github: https://github.com/microsoft/simulated-trial-and-error Abstract: Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy. https://preview.redd.it/1xbbb1mbg4nc1.jpg?width=1233&format=pjpg&auto=webp&s=4e7ae3a1f002cdca9be37cee6be3ce1b883f1799 https://preview.redd.it/9ndg36mbg4nc1.jpg?width=1531&format=pjpg&auto=webp&s=af0658b3a329e0005f44a5e01c09bd99ed74888a submitted by /u/Singularian2501 [link] [comments]
    [R] There can be No True Scottish Spoken Language system (parody)
    submitted by /u/TobyWasBestSpiderMan [link] [comments]
    [D] performance evaluation of echo state network
    Hello, I am currently making an echo state network model to predict the pattern of precipitation(Extreme precipitation happens every 30 years and i want to predict that) and I have monthly precipitation data for the past 100 years, total 1200 data point, I have 2 doubts. Approxmately What should be the reservoir size? How do I evaluate the performance of the model? Ps: I have checked on the internet and it said, we can use normal rmse, or something similar, but how exactly I use rmse on a time series data, please someone explain how exactly is the evaluation done. Thank you submitted by /u/Feeling-Bar8474 [link] [comments]
    [R] Reward-respecting subtasks for model-based reinforcement learning
    Paper: https://www.sciencedirect.com/science/article/pii/S0004370223001479 Preprint version(s): https://arxiv.org/abs/2202.03466 Abstract: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions. submitted by /u/SunsetOneSix [link] [comments]
    [R] SymbolicAI: A framework for logic-based approaches combining generative models and solvers
    Paper: https://arxiv.org/abs/2402.00854 Code: https://github.com/ExtensityAI/symbolicai Benchmark: https://github.com/ExtensityAI/benchmark Abstract: We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes. SymbolicAI enables the seamless integration of generative models with a diverse range of solvers by treating large language models (LLMs) as semantic parsers that execute tasks based on both natural and formal language instructions, thus bridging the gap between symbolic reasoning and generative AI. We leverage probabilistic programming principles to tackle complex tasks, and utilize differentiable and classical programming paradigms with their respective strengths. The framework introduces a set of polymorphic, compositional, and self-referential operations for data stream manipulation, aligning LLM outputs with user objectives. As a result, we can transition between the capabilities of various foundation models endowed with zero- and few-shot learning capabilities and specialized, fine-tuned models or solvers proficient in addressing specific problems. In turn, the framework facilitates the creation and evaluation of explainable computational graphs. We conclude by introducing a quality measure and its empirical score for evaluating these computational graphs, and propose a benchmark that compares various state-of-the-art LLMs across a set of complex workflows. We refer to the empirical score as the "Vector Embedding for Relational Trajectory Evaluation through Cross-similarity", or VERTEX score for short. The framework codebase and benchmark are linked below. submitted by /u/SunsetOneSix [link] [comments]
    ML with wasm on tiny IoT device [D] [R]
    Hi, I am trying to invest a research on machine learning with wasm on tiny IoT device. I found wasi-nn. But it seems currently they are focusing on Intel processors (It makes sense because its intel's program) and there aren't many IoT devices with Intel chips. I also got Teaclave by previous post here. But still wait for my board... yea only three kinds of boards are supporting that program based on my research. So I am asking for help with other methods of implementing machine learning things on resource constrained IoT devices. Thank you! submitted by /u/Tao_KTH [link] [comments]
    [R] Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study - Beijing Academy of Artificial Intelligence (BAAI) 2024 - First Agent able to follow and finish real missions in a AAA game!
    Paper: https://arxiv.org/abs/2403.03186 Projekt Website with code and videos: https://baai-agents.github.io/Cradle/ Abstract: Despite the success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize to different scenarios, mainly due to dramatic differences in the observations and actions across scenarios. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input, and producing keyboard and mouse operations as output, similar to human-computer interaction. The main challenges of achieving GCC are: 1) the multimodal observations for decision-making, 2) the …
    [P] Open-sourcing LEURN: An Explainable and Generative AI For Tabular Data
    Hello folks, I'm open-sourcing LEURN- an Explainable and Generative AI For Tabular Data. https://github.com/CaglarAytekin/LEURN/ This is an inherently explainable neural network which generates a univariate decision tree which you can also use as a generator. Please find extensive demo in the repo. Explanation Format LEURN provides the category that a sample falls into. It gives upper and lower boundaries for numerical features, category sets for categorical features. LEURN also provides each feature's exact additive contribution to final result. (see notes) Generative Ability LEURN is trained as any other neural network in predictive manner. The trained predictive network can directly be used as a generative network. This is enabled due to univariate nature of the neural network. For regular networks which have multivariate rules, although one can extract those rules, it is very hard to sample from the category that those multivariate inequalities define. Since LEURN's decisions are univariate, it is nothing but sampling per feature from lower and upper boundaries. Kindly see paper for details. One can make generation from same category as an input (use case: data anonymization). One can also generate an entirely new sample (use case: synthetic data generation). Some notes: The method is not a post-training explainer like SHAP, LIME, etc. It is a standalone novel neural network that you should train from scratch Unlike SHAP, LIME etc, the additive contributions are not approximations, but exact by design. Unlike SHAP, etc the explanation process doesn't require any support data and is made from single inference sample. LEURN performs on par with SotA methods in predictive performance for tabular data (see link to my paper in github) while also being inherently explainable and generative. Looking forward to hear your comments and requests. submitted by /u/MLC_Money [link] [comments]
    [D] Small Language Models vs RAG
    I've seen a lot of interest in using multiple small language models for "specialist" content generation. What are cases where they would be more useful than a larger, general model paired with RAG for specialized info? submitted by /u/TheCoconutTree [link] [comments]
    [D] Giving Autoregressive Models the Space to Think
    Autoregressive prediction has a problem: whether you're asking what color the sky is or to prove the Riemann hypothesis, the amount of compute to generate the next token is the exact same, but it seems obvious which of the two questions requires more compute to answer. So, engineers toil on how to extend an autoregressive model's capabilities to be able to think, for a variable amount of time, before speaking. Here is the solution (exercise left to the reader).   A Philosophical Motivation The human mind operates in a continuous latent space of thoughts that is discretized to lingual counterparts during the natural evolution of the brain's state. Any writer could tell you that to put something on the page is to compress it to fraction of the beauty the idea was in the mind. Said another…
    [D] Anyone tried using Grace Hopper superchip for AI/ML?
    submitted by /u/YouGotServer [link] [comments]
    [P] Flash Attention in ~100 lines of CUDA
    For CUDA beginners like myself, diving into the official Flash Attention source code can be daunting. So I wrote the forward pass in ~100 lines of CUDA https://github.com/tspeterkim/flash-attention-minimal and am sharing for educational value. submitted by /u/droidarmy95 [link] [comments]
  • Open

    Hi everyone! I'm gonna work on simulation project aiming to stabilizing an UAV using RL algorithms, I thout of using matlab and simulink but I can not find a model to test on, can any one please guide me and thank you
    submitted by /u/DueStill7268 [link] [comments]
    Is there a paradigm on which stale memories to prioritize when overwriting replay buffer?
    I got myself thinking since I'd swear that paradigm always was circular first in first out but for the implementation im working on last in first out seems to be working much better. Also in case of prioritized experience replay why not to discard lowest TD memories? Or maybe randomly replace memories when buffer is full? submitted by /u/DotNetEvangeliser [link] [comments]
    🚀 DIAMBRA Teams Up with Hugging Face to Push Reinforcement Learning Research and Adoption! 🚀
    submitted by /u/DIAMBRA_AIArena [link] [comments]
    Question: Regarding single environment vs Multi environment RL training
    Hello all, I'm working on robotic arm simulation to perform high level control of the robot to grasp objects. I'm working using ML Agents in Unity as the platform for the environment. While, using PPO to train the robot, I'm able to perform it successfully with around 8 hours training time. To reduce the time, I tried to increase the number of agents working in the same environment (there is an inbuilt training area replicator which just makes a copy of the whole robot cell with the agent). As per the mlagents source code, the multiple agents should just speed up the trajectory collection (as there are many agents trying out actions for different random situations as per the same policy, the update buffer should fill up faster). But, for some reason, my policy doesn't train properly. It flatlines at zero return (starts improving from - 1 but stabilises around 0. +1 is the max return of an episode). Is there some particular changes to be made, when increasing the number of agents. Some other things to keep in mind when increasing the number of environments. Any comments or advice is welcome. Thanks in advance. submitted by /u/Flaky-Drag-31 [link] [comments]
    Implement custom RL Agent with ray
    Hey there, I', currently trying to implement my own RL agent with ray (tune.trainable) and train them with the population based training scheduler. According to the documentation there are 6 methods to implement: https://docs.ray.io/en/latest/tune/api/trainable.html A number of questions and problems arose. I would be grateful if someone could answer them for me. My problem is that the setup() method is called each time a checkpoint is loaded during training, not only after the beginning (i set the reuse_agent flag in the Tuner config, and the reset?config() method returns true). Furthermore, I am not sure what to do in the cleanup() method. Should the step() method do one step in the environment, or on epoch? And should the replay buffer be shared among the different agents, or should each agent have its own replay buffer. If someone has already implemented a custom ray trainable I would be gratefull for a github link :) ​ submitted by /u/ChiefAlu [link] [comments]
  • Open

    Health-specific embedding tools for dermatology and pathology
    Posted by Dave Steiner, Clinical Research Scientist, Google Health, and Rory Pilgrim, Product Manager Google Research There’s a worldwide shortage of access to medical imaging expert interpretation across specialties including radiology, dermatology and pathology. Machine learning (ML) technology can help ease this burden by powering tools that enable doctors to interpret these images more accurately and efficiently. However, the development and implementation of such ML tools are often limited by the availability of high-quality data, ML expertise, and computational resources. One way to catalyze the use of ML for medical imaging is via domain-specific models that utilize deep learning (DL) to capture the information in medical images as compressed numerical vectors (called embedd…  ( 93 min )
  • Open

    Turn Your Photo Into A Cartoon With This Simple Tutorial (AnimeGan V2)
    https://preview.redd.it/gx7fhuj9n5nc1.png?width=1280&format=png&auto=webp&s=4926bbe56b459c277586da40a54142751dcf63cb In this tutorial, dive into the fascinating world of image transformation with AnimeGANv2. Discover how to convert ordinary images into captivating cartoon-like artwork effortlessly. Watch as we explore various cartoon styles and witness the magic unfold as images undergo stunning transformations. The link for the tutorial video : https://youtu.be/gdh9nwaY79M Enjoy Eran #CartoonizeaPicture #TurnMyPictureIntoCartoon #AnimeGan submitted by /u/Feitgemel [link] [comments]
    SymbolicAI: A framework for logic-based approaches combining generative models and solvers
    Paper: https://arxiv.org/abs/2402.00854 Code: https://github.com/ExtensityAI/symbolicai Benchmark: https://github.com/ExtensityAI/benchmark Abstract: We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes. SymbolicAI enables the seamless integration of generative models with a diverse range of solvers by treating large language models (LLMs) as semantic parsers that execute tasks based on both natural and formal language instructions, thus bridging the gap between symbolic reasoning and generative AI. We leverage probabilistic programming principles to tackle complex tasks, and utilize differentiable and classical programming paradigms with their respective strengths. The framework introduces a set of polymorphic, compositional, and self-referential operations for data stream manipulation, aligning LLM outputs with user objectives. As a result, we can transition between the capabilities of various foundation models endowed with zero- and few-shot learning capabilities and specialized, fine-tuned models or solvers proficient in addressing specific problems. In turn, the framework facilitates the creation and evaluation of explainable computational graphs. We conclude by introducing a quality measure and its empirical score for evaluating these computational graphs, and propose a benchmark that compares various state-of-the-art LLMs across a set of complex workflows. We refer to the empirical score as the "Vector Embedding for Relational Trajectory Evaluation through Cross-similarity", or VERTEX score for short. The framework codebase and benchmark are linked below. submitted by /u/SunsetOneSix [link] [comments]
    Multi-scale feature modulation network introduced for advancing underwater image enhancement
    submitted by /u/keghn [link] [comments]
  • Open

    How to improve your artificial intelligence job prospects
    Artificial Intelligence is one of the biggest technological waves that has hit the world of technology. According to The Enterprisers Project, AI will be offering about 133 million new jobs in the coming years. Dun & Bradstreet also stated that 40 percent of the respondent organizations are offering more jobs because of implementing artificial intelligence… Read More »How to improve your artificial intelligence job prospects The post How to improve your artificial intelligence job prospects appeared first on Data Science Central.  ( 21 min )
    Decoding RAG: Exploring its significance in the realm of generative AI
    In the intricate landscape of Conversational AI, Retrieval-Augmented Generation (RAG) emerges as a technical marvel, seamlessly merging the strengths of generative and retrieval models. At its core, RAG tackles the challenge of precision in responses by introducing a dynamic knowledge retrieval component. In this technical exploration, we delve into the underpinnings of RAG. Imagine a… Read More »Decoding RAG: Exploring its significance in the realm of generative AI The post Decoding RAG: Exploring its significance in the realm of generative AI appeared first on Data Science Central.  ( 23 min )
  • Open

    A surprising result about surprise index
    Surprise index Warren Weaver [1] introduced what he called the surprise index to quantify how surprising an event is. At first it might seem that the probability of an event is enough for this purpose: the lower the probability of an event, the more surprise when it occurs. But Weaver’s notion is more subtle than […] A surprising result about surprise index first appeared on John D. Cook.  ( 6 min )
    Estimating an author’s vocabulary
    How would you estimate the size of an author’s vocabulary? Suppose you have a analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have know words he never used in writing, […] Estimating an author’s vocabulary first appeared on John D. Cook.  ( 5 min )
  • Open

    Review completed & Altman, Brockman to continue to lead OpenAI
    New board members named and enhancements to the governance structure introduced  ( 3 min )
    OpenAI announces new members to board of directors
    Dr. Sue Desmond-Hellmann, Nicole Seligman, Fidji Simo join; Sam Altman rejoins board  ( 2 min )
  • Open

    Researchers enhance peripheral vision in AI models
    By enabling models to see the world more like humans do, the work could help improve driver safety and shed light on human behavior.  ( 7 min )
  • Open

    Large-scale Training of Foundation Models for Wearable Biosignals
    arXiv:2312.05409v2 Announce Type: replace Abstract: Tracking biosignals is crucial for monitoring wellness and preempting the development of severe medical conditions. Today, wearable devices can conveniently record various biosignals, creating the opportunity to monitor health status without disruption to one's daily routine. Despite widespread use of wearable devices and existing digital biomarkers, the absence of curated data with annotated medical labels hinders the development of new biomarkers to measure common health conditions. In fact, medical datasets are usually small in comparison to other domains, which is an obstacle for developing neural network models for biosignals. To address this challenge, we have employed self-supervised learning using the unlabeled sensor data collected under informed consent from the large longitudinal Apple Heart and Movement Study (AHMS) to train foundation models for two common biosignals: photoplethysmography (PPG) and electrocardiogram (ECG) recorded on Apple Watch. We curated PPG and ECG datasets from AHMS that include data from ~141K participants spanning ~3 years. Our self-supervised learning framework includes participant level positive pair selection, stochastic augmentation module and a regularized contrastive loss optimized with momentum training, and generalizes well to both PPG and ECG modalities. We show that the pre-trained foundation models readily encode information regarding participants' demographics and health conditions. To the best of our knowledge, this is the first study that builds foundation models using large-scale PPG and ECG data collected via wearable consumer devices $\unicode{x2013}$ prior works have commonly used smaller-size datasets collected in clinical and experimental settings. We believe PPG and ECG foundation models can enhance future wearable devices by reducing the reliance on labeled data and hold the potential to help the users improve their health.  ( 3 min )
    Ensemble sampling for linear bandits: small ensembles suffice
    arXiv:2311.08376v2 Announce Type: replace-cross Abstract: We provide the first useful and rigorous analysis of ensemble sampling for the stochastic linear bandit setting. In particular, we show that, under standard assumptions, for a $d$-dimensional stochastic linear bandit with an interaction horizon $T$, ensemble sampling with an ensemble of size of order $\smash{d \log T}$ incurs regret at most of the order $\smash{(d \log T)^{5/2} \sqrt{T}}$. Ours is the first result in any structured setting not to require the size of the ensemble to scale linearly with $T$ -- which defeats the purpose of ensemble sampling -- while obtaining near $\smash{\sqrt{T}}$ order regret. Ours is also the first result that allows infinite action sets.  ( 2 min )
    Learning Constrained Optimization with Deep Augmented Lagrangian Methods
    arXiv:2403.03454v1 Announce Type: new Abstract: Learning to Optimize (LtO) is a problem setting in which a machine learning (ML) model is trained to emulate a constrained optimization solver. Learning to produce optimal and feasible solutions subject to complex constraints is a difficult task, but is often made possible by restricting the input space to a limited distribution of related problems. Most LtO methods focus on directly learning solutions to the primal problem, and applying correction schemes or loss function penalties to encourage feasibility. This paper proposes an alternative approach, in which the ML model is trained instead to predict dual solution estimates directly, from which primal estimates are constructed to form dual-feasible solution pairs. This enables an end-to-end training scheme is which the dual objective is maximized as a loss function, and solution estimates iterate toward primal feasibility, emulating a Dual Ascent method. First it is shown that the poor convergence properties of classical Dual Ascent are reflected in poor convergence of the proposed training scheme. Then, by incorporating techniques from practical Augmented Lagrangian methods, we show how the training scheme can be improved to learn highly accurate constrained optimization solvers, for both convex and nonconvex problems.  ( 2 min )
    Quantum Neural Network with Density Matrix for Question Answering and Classical Image Classification
    arXiv:2203.11155v4 Announce Type: replace-cross Abstract: Quantum density matrix represents all the information of the entire quantum system, and novel models of meaning employing density matrices naturally model linguistic phenomena such as hyponymy and linguistic ambiguity, among others in quantum question answering tasks. Naturally, we argue that applying the quantum density matrix into classical Question Answering (QA) tasks can show more effective performance. Specifically, we (i) design a new mechanism based on Long Short-Term Memory (LSTM) to accommodate the case when the inputs are matrixes; (ii) apply the new mechanism to QA problems with Convolutional Neural Network (CNN) and gain the LSTM-based QA model with the quantum density matrix. Experiments of our new model on TREC-QA and WIKI-QA data sets show encouraging results. Similarly, we argue that the quantum density matrix can also enhance the image feature information and the relationship between the features for the classical image classification. Thus, we (i) combine density matrices and CNN to design a new mechanism; (ii) apply the new mechanism to some representative classical image classification tasks. A series of experiments show that the application of quantum density matrix in image classification has the generalization and high efficiency on different datasets. The application of quantum density matrix both in classical question answering tasks and classical image classification tasks show more effective performance.  ( 3 min )
    FingerNet: EEG Decoding of A Fine Motor Imagery with Finger-tapping Task Based on A Deep Neural Network
    arXiv:2403.03526v1 Announce Type: cross Abstract: Brain-computer interface (BCI) technology facilitates communication between the human brain and computers, primarily utilizing electroencephalography (EEG) signals to discern human intentions. Although EEG-based BCI systems have been developed for paralysis individuals, ongoing studies explore systems for speech imagery and motor imagery (MI). This study introduces FingerNet, a specialized network for fine MI classification, departing from conventional gross MI studies. The proposed FingerNet could extract spatial and temporal features from EEG signals, improving classification accuracy within the same hand. The experimental results demonstrated that performance showed significantly higher accuracy in classifying five finger-tapping tasks, encompassing thumb, index, middle, ring, and little finger movements. FingerNet demonstrated dominant performance compared to the conventional baseline models, EEGNet and DeepConvNet. The average accuracy for FingerNet was 0.3049, whereas EEGNet and DeepConvNet exhibited lower accuracies of 0.2196 and 0.2533, respectively. Statistical validation also demonstrates the predominance of FingerNet over baseline networks. For biased predictions, particularly for thumb and index classes, we led to the implementation of weighted cross-entropy and also adapted the weighted cross-entropy, a method conventionally employed to mitigate class imbalance. The proposed FingerNet involves optimizing network structure, improving performance, and exploring applications beyond fine MI. Moreover, the weighted Cross Entropy approach employed to address such biased predictions appears to have broader applicability and relevance across various domains involving multi-class classification tasks. We believe that effective execution of motor imagery can be achieved not only for fine MI, but also for local muscle MI  ( 3 min )
    DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training
    arXiv:2403.03542v1 Announce Type: new Abstract: Pre-training has been investigated to improve the efficiency and performance of training neural operators in data-scarce settings. However, it is largely in its infancy due to the inherent complexity and diversity, such as long trajectories, multiple scales and varying dimensions of partial differential equations (PDEs) data. In this paper, we present a new auto-regressive denoising pre-training strategy, which allows for more stable and efficient pre-training on PDE data and generalizes to various downstream tasks. Moreover, by designing a flexible and scalable model architecture based on Fourier attention, we can easily scale up the model for large-scale pre-training. We train our PDE foundation model with up to 0.5B parameters on 10+ PDE datasets with more than 100k trajectories. Extensive experiments show that we achieve SOTA on these benchmarks and validate the strong generalizability of our model to significantly enhance performance on diverse downstream PDE tasks like 3D data. Code is available at \url{https://github.com/thu-ml/DPOT}.  ( 2 min )
    TGPT-PINN: Nonlinear model reduction with transformed GPT-PINNs
    arXiv:2403.03459v1 Announce Type: cross Abstract: We introduce the Transformed Generative Pre-Trained Physics-Informed Neural Networks (TGPT-PINN) for accomplishing nonlinear model order reduction (MOR) of transport-dominated partial differential equations in an MOR-integrating PINNs framework. Building on the recent development of the GPT-PINN that is a network-of-networks design achieving snapshot-based model reduction, we design and test a novel paradigm for nonlinear model reduction that can effectively tackle problems with parameter-dependent discontinuities. Through incorporation of a shock-capturing loss function component as well as a parameter-dependent transform layer, the TGPT-PINN overcomes the limitations of linear model reduction in the transport-dominated regime. We demonstrate this new capability for nonlinear model reduction in the PINNs framework by several nontrivial parametric partial differential equations.  ( 2 min )
    Interactive Continual Learning Architecture for Long-Term Personalization of Home Service Robots
    arXiv:2403.03462v1 Announce Type: cross Abstract: For robots to perform assistive tasks in unstructured home environments, they must learn and reason on the semantic knowledge of the environments. Despite a resurgence in the development of semantic reasoning architectures, these methods assume that all the training data is available a priori. However, each user's environment is unique and can continue to change over time, which makes these methods unsuitable for personalized home service robots. Although research in continual learning develops methods that can learn and adapt over time, most of these methods are tested in the narrow context of object classification on static image datasets. In this paper, we combine ideas from continual learning, semantic reasoning, and interactive machine learning literature and develop a novel interactive continual learning architecture for continual learning of semantic knowledge in a home environment through human-robot interaction. The architecture builds on core cognitive principles of learning and memory for efficient and real-time learning of new knowledge from humans. We integrate our architecture with a physical mobile manipulator robot and perform extensive system evaluations in a laboratory environment over two months. Our results demonstrate the effectiveness of our architecture to allow a physical robot to continually adapt to the changes in the environment from limited data provided by the users (experimenters), and use the learned knowledge to perform object fetching tasks.  ( 3 min )
    Reinforcement Learning Jazz Improvisation: When Music Meets Game Theory
    arXiv:2403.03224v1 Announce Type: cross Abstract: Live performances of music are always charming, with the unpredictability of improvisation due to the dynamic between musicians and interactions with the audience. Jazz improvisation is a particularly noteworthy example for further investigation from a theoretical perspective. Here, we introduce a novel mathematical game theory model for jazz improvisation, providing a framework for studying music theory and improvisational methodologies. We use computational modeling, mainly reinforcement learning, to explore diverse stochastic improvisational strategies and their paired performance on improvisation. We find that the most effective strategy pair is a strategy that reacts to the most recent payoff (Stepwise Changes) with a reinforcement learning strategy limited to notes in the given chord (Chord-Following Reinforcement Learning). Conversely, a strategy that reacts to the partner's last note and attempts to harmonize with it (Harmony Prediction) strategy pair yields the lowest non-control payoff and highest standard deviation, indicating that picking notes based on immediate reactions to the partner player can yield inconsistent outcomes. On average, the Chord-Following Reinforcement Learning strategy demonstrates the highest mean payoff, while Harmony Prediction exhibits the lowest. Our work lays the foundation for promising applications beyond jazz: including the use of artificial intelligence (AI) models to extract data from audio clips to refine musical reward systems, and training machine learning (ML) models on existing jazz solos to further refine strategies within the game.  ( 3 min )
    SemSegDepth: A Combined Model for Semantic Segmentation and Depth Completion
    arXiv:2209.00381v2 Announce Type: replace-cross Abstract: Holistic scene understanding is pivotal for the performance of autonomous machines. In this paper we propose a new end-to-end model for performing semantic segmentation and depth completion jointly. The vast majority of recent approaches have developed semantic segmentation and depth completion as independent tasks. Our approach relies on RGB and sparse depth as inputs to our model and produces a dense depth map and the corresponding semantic segmentation image. It consists of a feature extractor, a depth completion branch, a semantic segmentation branch and a joint branch which further processes semantic and depth information altogether. The experiments done on Virtual KITTI 2 dataset, demonstrate and provide further evidence, that combining both tasks, semantic segmentation and depth completion, in a multi-task network can effectively improve the performance of each task. Code is available at https://github.com/juanb09111/semantic depth.  ( 2 min )
    Exact Fractional Inference via Re-Parametrization & Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms
    arXiv:2301.10369v2 Announce Type: replace Abstract: Inference efforts -- required to compute partition function, $Z$, of an Ising model over a graph of $N$ ``spins" -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately minimizing respective (BP- or TRW-) free energy. We generalize the variational scheme building a $\lambda$-fractional-homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique (``exact") $\lambda_*$ such that, $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization approach of \citep{wainwright_tree-based_2002} and the loop series approach of \citep{chertkov_loop_2006}, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium- and large- sizes. The empirical study yields a number of interesting observations, such as (a) ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.  ( 3 min )
    Linear and nonlinear system identification under $\ell_1$- and group-Lasso regularization via L-BFGS-B
    arXiv:2403.03827v1 Announce Type: cross Abstract: In this paper, we propose an approach for identifying linear and nonlinear discrete-time state-space models, possibly under $\ell_1$- and group-Lasso regularization, based on the L-BFGS-B algorithm. For the identification of linear models, we show that, compared to classical linear subspace methods, the approach often provides better results, is much more general in terms of the loss and regularization terms used, and is also more stable from a numerical point of view. The proposed method not only enriches the existing set of linear system identification tools but can be also applied to identifying a very broad class of parametric nonlinear state-space models, including recurrent neural networks. We illustrate the approach on synthetic and experimental datasets and apply it to solve the challenging industrial robot benchmark for nonlinear multi-input/multi-output system identification proposed by Weigand et al. (2022). A Python implementation of the proposed identification method is available in the package \texttt{jax-sysid}, available at \url{https://github.com/bemporad/jax-sysid}.  ( 2 min )
    THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
    arXiv:2302.08545v2 Announce Type: replace Abstract: Deep neural networks (DNNs) are the de facto standard for essential use cases, such as image classification, computer vision, and natural language processing. As DNNs and datasets get larger, they require distributed training on increasingly larger clusters. A main bottleneck is the resulting communication overhead where workers exchange model updates (i.e., gradients) on a per-round basis. To address this bottleneck and accelerate training, a widely-deployed approach is compression. However, previous deployments often apply bi-directional compression schemes by simply using a uni-directional gradient compression scheme in each direction. This results in significant computational overheads at the parameter server and increased compression error, leading to longer training and lower accuracy. We introduce Tensor Homomorphic Compression (THC), a novel bi-directional compression framework that enables the direct aggregation of compressed values and thus eliminating the aforementioned computational overheads. Moreover, THC is compatible with in-network aggregation (INA), which allows for further acceleration. Our evaluation shows that training representative vision and language models with THC reaches target accuracy by 1.40x to 1.47x faster using INA and 1.28x to 1.33x faster using a software PS compared with state-of-the-art systems.  ( 3 min )
    Multimodal Anomaly Detection based on Deep Auto-Encoder for Object Slip Perception of Mobile Manipulation Robots
    arXiv:2403.03563v1 Announce Type: cross Abstract: Object slip perception is essential for mobile manipulation robots to perform manipulation tasks reliably in the dynamic real-world. Traditional approaches to robot arms' slip perception use tactile or vision sensors. However, mobile robots still have to deal with noise in their sensor signals caused by the robot's movement in a changing environment. To solve this problem, we present an anomaly detection method that utilizes multisensory data based on a deep autoencoder model. The proposed framework integrates heterogeneous data streams collected from various robot sensors, including RGB and depth cameras, a microphone, and a force-torque sensor. The integrated data is used to train a deep autoencoder to construct latent representations of the multisensory data that indicate the normal status. Anomalies can then be identified by error scores measured by the difference between the trained encoder's latent values and the latent values of reconstructed input data. In order to evaluate the proposed framework, we conducted an experiment that mimics an object slip by a mobile service robot operating in a real-world environment with diverse household objects and different moving patterns. The experimental results verified that the proposed framework reliably detects anomalies in object slip situations despite various object types and robot behaviors, and visual and auditory noise in the environment.  ( 3 min )
    Collision Avoidance Verification of Multiagent Systems with Learned Policies
    arXiv:2403.03314v1 Announce Type: cross Abstract: For many multiagent control problems, neural networks (NNs) have enabled promising new capabilities. However, many of these systems lack formal guarantees (e.g., collision avoidance, robustness), which prevents leveraging these advances in safety-critical settings. While there is recent work on formal verification of NN-controlled systems, most existing techniques cannot handle scenarios with more than one agent. To address this research gap, this paper presents a backward reachability-based approach for verifying the collision avoidance properties of Multi-Agent Neural Feedback Loops (MA-NFLs). Given the dynamics models and trained control policies of each agent, the proposed algorithm computes relative backprojection sets by solving a series of Mixed Integer Linear Programs (MILPs) offline for each pair of agents. Our pair-wise approach is parallelizable and thus scales well with increasing number of agents, and we account for state measurement uncertainties, making it well aligned with real-world scenarios. Using those results, the agents can quickly check for collision avoidance online by solving low-dimensional Linear Programs (LPs). We demonstrate the proposed algorithm can verify collision-free properties of a MA-NFL with agents trained to imitate a collision avoidance algorithm (Reciprocal Velocity Obstacles). We further demonstrate the computational scalability of the approach on systems with up to 10 agents.  ( 2 min )
    Towards Controllable Time Series Generation
    arXiv:2403.03698v1 Announce Type: new Abstract: Time Series Generation (TSG) has emerged as a pivotal technique in synthesizing data that accurately mirrors real-world time series, becoming indispensable in numerous applications. Despite significant advancements in TSG, its efficacy frequently hinges on having large training datasets. This dependency presents a substantial challenge in data-scarce scenarios, especially when dealing with rare or unique conditions. To confront these challenges, we explore a new problem of Controllable Time Series Generation (CTSG), aiming to produce synthetic time series that can adapt to various external conditions, thereby tackling the data scarcity issue. In this paper, we propose \textbf{C}ontrollable \textbf{T}ime \textbf{S}eries (\textsf{CTS}), an innovative VAE-agnostic framework tailored for CTSG. A key feature of \textsf{CTS} is that it decouples the mapping process from standard VAE training, enabling precise learning of a complex interplay between latent features and external conditions. Moreover, we develop a comprehensive evaluation scheme for CTSG. Extensive experiments across three real-world time series datasets showcase \textsf{CTS}'s exceptional capabilities in generating high-quality, controllable outputs. This underscores its adeptness in seamlessly integrating latent features with external conditions. Extending \textsf{CTS} to the image domain highlights its remarkable potential for explainability and further reinforces its versatility across different modalities.  ( 2 min )
    Structure-Guided Adversarial Training of Diffusion Models
    arXiv:2402.17563v2 Announce Type: cross Abstract: Diffusion models have demonstrated exceptional efficacy in various generative applications. While existing models focus on minimizing a weighted sum of denoising score matching losses for data distribution modeling, their training primarily emphasizes instance-level optimization, overlooking valuable structural information within each mini-batch, indicative of pair-wise relationships among samples. To address this limitation, we introduce Structure-guided Adversarial training of Diffusion Models (SADM). In this pioneering approach, we compel the model to learn manifold structures between samples in each training batch. To ensure the model captures authentic manifold structures in the data distribution, we advocate adversarial training of the diffusion generator against a novel structure discriminator in a minimax game, distinguishing real manifold structures from the generated ones. SADM substantially improves existing diffusion transformers (DiT) and outperforms existing methods in image generation and cross-domain fine-tuning tasks across 12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on ImageNet for class-conditional image generation at resolutions of 256x256 and 512x512, respectively.  ( 2 min )
    SalienTime: User-driven Selection of Salient Time Steps for Large-Scale Geospatial Data Visualization
    arXiv:2403.03449v1 Announce Type: cross Abstract: The voluminous nature of geospatial temporal data from physical monitors and simulation models poses challenges to efficient data access, often resulting in cumbersome temporal selection experiences in web-based data portals. Thus, selecting a subset of time steps for prioritized visualization and pre-loading is highly desirable. Addressing this issue, this paper establishes a multifaceted definition of salient time steps via extensive need-finding studies with domain experts to understand their workflows. Building on this, we propose a novel approach that leverages autoencoders and dynamic programming to facilitate user-driven temporal selections. Structural features, statistical variations, and distance penalties are incorporated to make more flexible selections. User-specified priorities, spatial regions, and aggregations are used to combine different perspectives. We design and implement a web-based interface to enable efficient and context-aware selection of time steps and evaluate its efficacy and usability through case studies, quantitative evaluations, and expert interviews.  ( 2 min )
    ENOT: Expectile Regularization for Fast and Accurate Training of Neural Optimal Transport
    arXiv:2403.03777v1 Announce Type: new Abstract: We present a new extension for Neural Optimal Transport (NOT) training procedure, capable of accurately and efficiently estimating optimal transportation plan via specific regularisation on conjugate potentials. The main bottleneck of existing NOT solvers is associated with the procedure of finding a near-exact approximation of the conjugate operator (i.e., the c-transform), which is done either by optimizing over maximin objectives or by the computationally-intensive fine-tuning of the initial approximated prediction. We resolve both issues by proposing a new, theoretically justified loss in the form of expectile regularization that enforces binding conditions on the learning dual potentials. Such a regularization provides the upper bound estimation over the distribution of possible conjugate potentials and makes the learning stable, eliminating the need for additional extensive finetuning. We formally justify the efficiency of our method, called Expectile-Regularised Neural Optimal Transport (ENOT). ENOT outperforms previous state-of-the-art approaches on the Wasserstein-2 benchmark tasks by a large margin (up to a 3-fold improvement in quality and up to a 10-fold improvement in runtime).  ( 2 min )
    Hypothesis Spaces for Deep Learning
    arXiv:2403.03353v1 Announce Type: cross Abstract: This paper introduces a hypothesis space for deep learning that employs deep neural networks (DNNs). By treating a DNN as a function of two variables, the physical variable and parameter variable, we consider the primitive set of the DNNs for the parameter variable located in a set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. We then complete the linear span of the primitive DNN set in a weak* topology to construct a Banach space of functions of the physical variable. We prove that the Banach space so constructed is a reproducing kernel Banach space (RKBS) and construct its reproducing kernel. We investigate two learning models, regularized learning and minimum interpolation problem in the resulting RKBS, by establishing representer theorems for solutions of the learning models. The representer theorems unfold that solutions of these learning models can be expressed as linear combination of a finite number of kernel sessions determined by given data and the reproducing kernel.  ( 2 min )
    Tackling Missing Values in Probabilistic Wind Power Forecasting: A Generative Approach
    arXiv:2403.03631v1 Announce Type: new Abstract: Machine learning techniques have been successfully used in probabilistic wind power forecasting. However, the issue of missing values within datasets due to sensor failure, for instance, has been overlooked for a long time. Although it is natural to consider addressing this issue by imputing missing values before model estimation and forecasting, we suggest treating missing values and forecasting targets indifferently and predicting all unknown values simultaneously based on observations. In this paper, we offer an efficient probabilistic forecasting approach by estimating the joint distribution of features and targets based on a generative model. It is free of preprocessing, and thus avoids introducing potential errors. Compared with the traditional "impute, then predict" pipeline, the proposed approach achieves better performance in terms of continuous ranked probability score.  ( 2 min )
    LOCOST: State-Space Models for Long Document Abstractive Summarization
    arXiv:2401.17919v2 Announce Type: replace-cross Abstract: State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches a performance level that is 93-96% comparable to the top-performing sparse transformers of the same size while saving up to 50% memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing.  ( 2 min )
    SUPClust: Active Learning at the Boundaries
    arXiv:2403.03741v1 Announce Type: new Abstract: Active learning is a machine learning paradigm designed to optimize model performance in a setting where labeled data is expensive to acquire. In this work, we propose a novel active learning method called SUPClust that seeks to identify points at the decision boundary between classes. By targeting these points, SUPClust aims to gather information that is most informative for refining the model's prediction of complex decision regions. We demonstrate experimentally that labeling these points leads to strong model performance. This improvement is observed even in scenarios characterized by strong class imbalance.  ( 2 min )
    An EnKF-LSTM Assimilation Algorithm for Crop Growth Model
    arXiv:2403.03406v1 Announce Type: cross Abstract: Accurate and timely prediction of crop growth is of great significance to ensure crop yields and researchers have developed several crop models for the prediction of crop growth. However, there are large difference between the simulation results obtained by the crop models and the actual results, thus in this paper, we proposed to combine the simulation results with the collected crop data for data assimilation so that the accuracy of prediction will be improved. In this paper, an EnKF-LSTM data assimilation method for various crops is proposed by combining ensemble Kalman filter and LSTM neural network, which effectively avoids the overfitting problem of existing data assimilation methods and eliminates the uncertainty of the measured data. The verification of the proposed EnKF-LSTM method and the comparison of the proposed method with other data assimilation methods were performed using datasets collected by sensor equipment deployed on a farm.  ( 2 min )
    Task Attribute Distance for Few-Shot Learning: Theoretical Analysis and Applications
    arXiv:2403.03535v1 Announce Type: cross Abstract: Few-shot learning (FSL) aims to learn novel tasks with very few labeled samples by leveraging experience from \emph{related} training tasks. In this paper, we try to understand FSL by delving into two key questions: (1) How to quantify the relationship between \emph{training} and \emph{novel} tasks? (2) How does the relationship affect the \emph{adaptation difficulty} on novel tasks for different models? To answer the two questions, we introduce Task Attribute Distance (TAD) built upon attributes as a metric to quantify the task relatedness. Unlike many existing metrics, TAD is model-agnostic, making it applicable to different FSL models. Then, we utilize TAD metric to establish a theoretical connection between task relatedness and task adaptation difficulty. By deriving the generalization error bound on a novel task, we discover how TAD measures the adaptation difficulty on novel tasks for FSL models. To validate our TAD metric and theoretical findings, we conduct experiments on three benchmarks. Our experimental results confirm that TAD metric effectively quantifies the task relatedness and reflects the adaptation difficulty on novel tasks for various FSL methods, even if some of them do not learn attributes explicitly or human-annotated attributes are not available. Finally, we present two applications of the proposed TAD metric: data augmentation and test-time intervention, which further verify its effectiveness and general applicability. The source code is available at https://github.com/hu-my/TaskAttributeDistance.  ( 3 min )
    Credibility-Aware Multi-Modal Fusion Using Probabilistic Circuits
    arXiv:2403.03281v1 Announce Type: new Abstract: We consider the problem of late multi-modal fusion for discriminative learning. Motivated by noisy, multi-source domains that require understanding the reliability of each data source, we explore the notion of credibility in the context of multi-modal fusion. We propose a combination function that uses probabilistic circuits (PCs) to combine predictive distributions over individual modalities. We also define a probabilistic measure to evaluate the credibility of each modality via inference queries over the PC. Our experimental evaluation demonstrates that our fusion method can reliably infer credibility while maintaining competitive performance with the state-of-the-art.  ( 2 min )
    Simplified PCNet with Robustness
    arXiv:2403.03676v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have garnered significant attention for their success in learning the representation of homophilic or heterophilic graphs. However, they cannot generalize well to real-world graphs with different levels of homophily. In response, the Possion-Charlier Network (PCNet) \cite{li2024pc}, the previous work, allows graph representation to be learned from heterophily to homophily. Although PCNet alleviates the heterophily issue, there remain some challenges in further improving the efficacy and efficiency. In this paper, we simplify PCNet and enhance its robustness. We first extend the filter order to continuous values and reduce its parameters. Two variants with adaptive neighborhood sizes are implemented. Theoretical analysis shows our model's robustness to graph structure perturbations or adversarial attacks. We validate our approach through semi-supervised learning tasks on various datasets representing both homophilic and heterophilic graphs.  ( 2 min )
    Triple/Debiased Lasso for Statistical Inference of Conditional Average Treatment Effects
    arXiv:2403.03240v1 Announce Type: cross Abstract: This study investigates the estimation and the statistical inference about Conditional Average Treatment Effects (CATEs), which have garnered attention as a metric representing individualized causal effects. In our data-generating process, we assume linear models for the outcomes associated with binary treatments and define the CATE as a difference between the expected outcomes of these linear models. This study allows the linear models to be high-dimensional, and our interest lies in consistent estimation and statistical inference for the CATE. In high-dimensional linear regression, one typical approach is to assume sparsity. However, in our study, we do not assume sparsity directly. Instead, we consider sparsity only in the difference of the linear models. We first use a doubly robust estimator to approximate this difference and then regress the difference on covariates with Lasso regularization. Although this regression estimator is consistent for the CATE, we further reduce the bias using the techniques in double/debiased machine learning (DML) and debiased Lasso, leading to $\sqrt{n}$-consistency and confidence intervals. We refer to the debiased estimator as the triple/debiased Lasso (TDL), applying both DML and debiased Lasso techniques. We confirm the soundness of our proposed method through simulation studies.  ( 2 min )
    Solution Simplex Clustering for Heterogeneous Federated Learning
    arXiv:2403.03333v1 Announce Type: new Abstract: We tackle a major challenge in federated learning (FL) -- achieving good performance under highly heterogeneous client distributions. The difficulty partially arises from two seemingly contradictory goals: learning a common model by aggregating the information from clients, and learning local personalized models that should be adapted to each local distribution. In this work, we propose Solution Simplex Clustered Federated Learning (SosicFL) for dissolving such contradiction. Based on the recent ideas of learning solution simplices, SosicFL assigns a subregion in a simplex to each client, and performs FL to learn a common solution simplex. This allows the client models to possess their characteristics within the degrees of freedom in the solution simplex, and at the same time achieves the goal of learning a global common model. Our experiments show that SosicFL improves the performance and accelerates the training process for global and personalized FL with minimal computational overhead.  ( 2 min )
    An AI-enabled Agent-Based Model and Its Application in Measles Outbreak Simulation for New Zealand
    arXiv:2403.03434v1 Announce Type: cross Abstract: Agent Based Models (ABMs) have emerged as a powerful tool for investigating complex social interactions, particularly in the context of public health and infectious disease investigation. In an effort to enhance the conventional ABM, enabling automated model calibration and reducing the computational resources needed for scaling up the model, we have developed a tensorized and differentiable agent-based model by coupling Graph Neural Network (GNN) and Long Short-Term Memory (LSTM) network. The model was employed to investigate the 2019 measles outbreak occurred in New Zealand, demonstrating a promising ability to accurately simulate the outbreak dynamics, particularly during the peak period of repeated cases. This paper shows that by leveraging the latest Artificial Intelligence (AI) technology and the capabilities of traditional ABMs, we gain deeper insights into the dynamics of infectious disease outbreaks. This, in turn, helps us make more informed decision when developing effective strategies that strike a balance between managing outbreaks and minimizing disruptions to everyday life.  ( 2 min )
    LEAD: Learning Decomposition for Source-free Universal Domain Adaptation
    arXiv:2403.03421v1 Announce Type: cross Abstract: Universal Domain Adaptation (UniDA) targets knowledge transfer in the presence of both covariate and label shifts. Recently, Source-free Universal Domain Adaptation (SF-UniDA) has emerged to achieve UniDA without access to source data, which tends to be more practical due to data protection policies. The main challenge lies in determining whether covariate-shifted samples belong to target-private unknown categories. Existing methods tackle this either through hand-crafted thresholding or by developing time-consuming iterative clustering strategies. In this paper, we propose a new idea of LEArning Decomposition (LEAD), which decouples features into source-known and -unknown components to identify target-private data. Technically, LEAD initially leverages the orthogonal decomposition analysis for feature decomposition. Then, LEAD builds instance-level decision boundaries to adaptively identify target-private data. Extensive experiments across various UniDA scenarios have demonstrated the effectiveness and superiority of LEAD. Notably, in the OPDA scenario on VisDA dataset, LEAD outperforms GLC by 3.5% overall H-score and reduces 75% time to derive pseudo-labeling decision boundaries. Besides, LEAD is also appealing in that it is complementary to most existing methods. The code is available at https://github.com/ispc-lab/LEAD.  ( 2 min )
    Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations
    arXiv:2403.03375v1 Announce Type: new Abstract: Existing research often posits spurious features as "easier" to learn than core features in neural network optimization, but the impact of their relative simplicity remains under-explored. Moreover they mainly focus on the end performance intead of the learning dynamics of feature learning. In this paper, we propose a theoretical framework and associated synthetic dataset grounded in boolean function analysis which allows for fine-grained control on the relative complexity (compared to core features) and correlation strength (with respect to the label) of spurious features to study the dynamics of feature learning under spurious correlation. Our setup uncovers several interesting phenomenon: (1) stronger spurious correlations or simpler spurious features slow down the rate of learning for the core features, (2) learning phases of spurious features and core features are not always separable, (3) spurious features are not forgotten even after core features are fully learned. We show that our findings justify the success of retraining the last layer to remove spurious correlation and also identifies limitations of popular debiasing algorithms that exploit early learning of spurious features. We support our empirical findings with theoretical analyses for the case of learning XOR features with a one-hidden-layer ReLU network.  ( 2 min )
    From Noise to Signal: Unveiling Treatment Effects from Digital Health Data through Pharmacology-Informed Neural-SDE
    arXiv:2403.03274v1 Announce Type: cross Abstract: Digital health technologies (DHT), such as wearable devices, provide personalized, continuous, and real-time monitoring of patient. These technologies are contributing to the development of novel therapies and personalized medicine. Gaining insight from these technologies requires appropriate modeling techniques to capture clinically-relevant changes in disease state. The data generated from these devices is characterized by being stochastic in nature, may have missing elements, and exhibits considerable inter-individual variability - thereby making it difficult to analyze using traditional longitudinal modeling techniques. We present a novel pharmacology-informed neural stochastic differential equation (SDE) model capable of addressing these challenges. Using synthetic data, we demonstrate that our approach is effective in identifying treatment effects and learning causal relationships from stochastic data, thereby enabling counterfactual simulation.  ( 2 min )
    Neural Network Learning and Quantum Gravity
    arXiv:2403.03245v1 Announce Type: cross Abstract: The landscape of low-energy effective field theories stemming from string theory is too vast for a systematic exploration. However, the meadows of the string landscape may be fertile ground for the application of machine learning techniques. Employing neural network learning may allow for inferring novel, undiscovered properties that consistent theories in the landscape should possess, or checking conjectural statements about alleged characteristics thereof. The aim of this work is to describe to what extent the string landscape can be explored with neural network-based learning. Our analysis is motivated by recent studies that show that the string landscape is characterized by finiteness properties, emerging from its underlying tame, o-minimal structures. Indeed, employing these results, we illustrate that any low-energy effective theory of string theory is endowed with certain statistical learnability properties. Consequently, several learning problems therein formulated, including interpolations and multi-class classification problems, can be concretely addressed with machine learning, delivering results with sufficiently high accuracy.  ( 2 min )
    CDC: A Simple Framework for Complex Data Clustering
    arXiv:2403.03670v1 Announce Type: new Abstract: In today's data-driven digital era, the amount as well as complexity, such as multi-view, non-Euclidean, and multi-relational, of the collected data are growing exponentially or even faster. Clustering, which unsupervisely extracts valid knowledge from data, is extremely useful in practice. However, existing methods are independently developed to handle one particular challenge at the expense of the others. In this work, we propose a simple but effective framework for complex data clustering (CDC) that can efficiently process different types of data with linear complexity. We first utilize graph filtering to fuse geometry structure and attribute information. We then reduce the complexity with high-quality anchors that are adaptively learned via a novel similarity-preserving regularizer. We illustrate the cluster-ability of our proposed method theoretically and experimentally. In particular, we deploy CDC to graph data of size 111M.  ( 2 min )
    Graph Learning for Parameter Prediction of Quantum Approximate Optimization Algorithm
    arXiv:2403.03310v1 Announce Type: cross Abstract: In recent years, quantum computing has emerged as a transformative force in the field of combinatorial optimization, offering novel approaches to tackling complex problems that have long challenged classical computational methods. Among these, the Quantum Approximate Optimization Algorithm (QAOA) stands out for its potential to efficiently solve the Max-Cut problem, a quintessential example of combinatorial optimization. However, practical application faces challenges due to current limitations on quantum computational resource. Our work optimizes QAOA initialization, using Graph Neural Networks (GNN) as a warm-start technique. This sacrifices affordable computational resource on classical computer to reduce quantum computational resource overhead, enhancing QAOA's effectiveness. Experiments with various GNN architectures demonstrate the adaptability and stability of our framework, highlighting the synergy between quantum algorithms and machine learning. Our findings show GNN's potential in improving QAOA performance, opening new avenues for hybrid quantum-classical approaches in quantum computing and contributing to practical applications.  ( 2 min )
    On Transfer in Classification: How Well do Subsets of Classes Generalize?
    arXiv:2403.03569v1 Announce Type: new Abstract: In classification, it is usual to observe that models trained on a given set of classes can generalize to previously unseen ones, suggesting the ability to learn beyond the initial task. This ability is often leveraged in the context of transfer learning where a pretrained model can be used to process new classes, with or without fine tuning. Surprisingly, there are a few papers looking at the theoretical roots beyond this phenomenon. In this work, we are interested in laying the foundations of such a theoretical framework for transferability between sets of classes. Namely, we establish a partially ordered set of subsets of classes. This tool allows to represent which subset of classes can generalize to others. In a more practical setting, we explore the ability of our framework to predict which subset of classes can lead to the best performance when testing on all of them. We also explore few-shot learning, where transfer is the golden standard. Our work contributes to better understanding of transfer mechanics and model generalization.  ( 2 min )
    Non-verbal information in spontaneous speech - towards a new framework of analysis
    arXiv:2403.03522v1 Announce Type: cross Abstract: Non-verbal signals in speech are encoded by prosody and carry information that ranges from conversation action to attitude and emotion. Despite its importance, the principles that govern prosodic structure are not yet adequately understood. This paper offers an analytical schema and a technological proof-of-concept for the categorization of prosodic signals and their association with meaning. The schema interprets surface-representations of multi-layered prosodic events. As a first step towards implementation, we present a classification process that disentangles prosodic phenomena of three orders. It relies on fine-tuning a pre-trained speech recognition model, enabling the simultaneous multi-class/multi-label detection. It generalizes over a large variety of spontaneous data, performing on a par with, or superior to, human annotation. In addition to a standardized formalization of prosody, disentangling prosodic patterns can direct a theory of communication and speech organization. A welcome by-product is an interpretation of prosody that will enhance speech- and language-related technologies.  ( 2 min )
    Slot Abstractors: Toward Scalable Abstract Visual Reasoning
    arXiv:2403.03458v1 Announce Type: cross Abstract: Abstract visual reasoning is a characteristically human ability, allowing the identification of relational patterns that are abstracted away from object features, and the systematic generalization of those patterns to unseen problems. Recent work has demonstrated strong systematic generalization in visual reasoning tasks involving multi-object inputs, through the integration of slot-based methods used for extracting object-centric representations coupled with strong inductive biases for relational abstraction. However, this approach was limited to problems containing a single rule, and was not scalable to visual reasoning problems containing a large number of objects. Other recent work proposed Abstractors, an extension of Transformers that incorporates strong relational inductive biases, thereby inheriting the Transformer's scalability and multi-head architecture, but it has yet to be demonstrated how this approach might be applied to multi-object visual inputs. Here we combine the strengths of the above approaches and propose Slot Abstractors, an approach to abstract visual reasoning that can be scaled to problems involving a large number of objects and multiple relations among them. The approach displays state-of-the-art performance across four abstract visual reasoning tasks.  ( 2 min )
    SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
    arXiv:2401.09340v2 Announce Type: replace-cross Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.  ( 3 min )
    SPEAR:Exact Gradient Inversion of Batches in Federated Learning
    arXiv:2403.03945v1 Announce Type: new Abstract: Federated learning is a popular framework for collaborative machine learning where multiple clients only share gradient updates on their local data with the server and not the actual data. Unfortunately, it was recently shown that gradient inversion attacks can reconstruct this data from these shared gradients. Existing attacks enable exact reconstruction only for a batch size of $b=1$ in the important honest-but-curious setting, with larger batches permitting only approximate reconstruction. In this work, we propose \emph{the first algorithm reconstructing whole batches with $b >1$ exactly}. This approach combines mathematical insights into the explicit low-rank structure of gradients with a sampling-based algorithm. Crucially, we leverage ReLU-induced gradient sparsity to precisely filter out large numbers of incorrect samples, making a final reconstruction step tractable. We provide an efficient GPU implementation for fully connected networks and show that it recovers batches of $b \lesssim 25$ elements exactly while being tractable for large network widths and depths.  ( 2 min )
    Single Transit Detection In Kepler With Machine Learning And Onboard Spacecraft Diagnostics
    arXiv:2403.03427v1 Announce Type: cross Abstract: Exoplanet discovery at long orbital periods requires reliably detecting individual transits without additional information about the system. Techniques like phase-folding of light curves and periodogram analysis of radial velocity data are more sensitive to planets with shorter orbital periods, leaving a dearth of planet discoveries at long periods. We present a novel technique using an ensemble of Convolutional Neural Networks incorporating the onboard spacecraft diagnostics of \emph{Kepler} to classify transits within a light curve. We create a pipeline to recover the location of individual transits, and the period of the orbiting planet, which maintains $>80\%$ transit recovery sensitivity out to an 800-day orbital period. Our neural network pipeline has the potential to discover additional planets in the \emph{Kepler} dataset, and crucially, within the $\eta$-Earth regime. We report our first candidate from this pipeline, KOI 1271.02. KOI 1271.01 is known to exhibit strong Transit Timing Variations (TTVs), and so we jointly model the TTVs and transits of both transiting planets to constrain the orbital configuration and planetary parameters and conclude with a series of potential parameters for KOI 1271.02, as there is not enough data currently to uniquely constrain the system. We conclude that KOI 1271.02 has a radius of 5.32 $\pm$ 0.20 $R_{\oplus}$ and a mass of $28.94^{0.23}_{-0.47}$ $M_{\oplus}$. Future constraints on the nature of KOI 1271.02 require measuring additional TTVs of KOI 1271.01 or observing a second transit of KOI 1271.02.  ( 3 min )
    What's In My Big Data?
    arXiv:2310.20707v2 Announce Type: replace-cross Abstract: Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them.  ( 3 min )
    Predicting the Temperature Dependence of Surfactant CMCs Using Graph Neural Networks
    arXiv:2403.03767v1 Announce Type: cross Abstract: The critical micelle concentration (CMC) of surfactant molecules is an essential property for surfactant applications in industry. Recently, classical QSPR and Graph Neural Networks (GNNs), a deep learning technique, have been successfully applied to predict the CMC of surfactants at room temperature. However, these models have not yet considered the temperature dependency of the CMC, which is highly relevant for practical applications. We herein develop a GNN model for temperature-dependent CMC prediction of surfactants. We collect about 1400 data points from public sources for all surfactant classes, i.e., ionic, nonionic, and zwitterionic, at multiple temperatures. We test the predictive quality of the model for following scenarios: i) when CMC data for surfactants are present in the training of the model in at least one different temperature, and ii) CMC data for surfactants are not present in the training, i.e., generalizing to unseen surfactants. In both test scenarios, our model exhibits a high predictive performance of R$^2 \geq $ 0.94 on test data. We also find that the model performance varies by surfactant class. Finally, we evaluate the model for sugar-based surfactants with complex molecular structures, as these represent a more sustainable alternative to synthetic surfactants and are therefore of great interest for future applications in the personal and home care industries.  ( 2 min )
    Transformer-based nowcasting of radar composites from satellite images for severe weather
    arXiv:2310.19515v2 Announce Type: replace-cross Abstract: Weather radar data are critical for nowcasting and an integral component of numerical weather prediction models. While weather radar data provide valuable information at high resolution, their ground-based nature limits their availability, which impedes large-scale applications. In contrast, meteorological satellites cover larger domains but with coarser resolution. However, with the rapid advancements in data-driven methodologies and modern sensors aboard geostationary satellites, new opportunities are emerging to bridge the gap between ground- and space-based observations, ultimately leading to more skillful weather prediction with high accuracy. Here, we present a Transformer-based model for nowcasting ground-based radar image sequences using satellite data up to two hours lead time. Trained on a dataset reflecting severe weather conditions, the model predicts radar fields occurring under different weather phenomena and shows robustness against rapidly growing/decaying fields and complex field structures. Model interpretation reveals that the infrared channel centered at 10.3 $\mu m$ (C13) contains skillful information for all weather conditions, while lightning data have the highest relative feature importance in severe weather conditions, particularly in shorter lead times. The model can support precipitation nowcasting across large domains without an explicit need for radar towers, enhance numerical weather prediction and hydrological models, and provide radar proxy for data-scarce regions. Moreover, the open-source framework facilitates progress towards operational data-driven nowcasting.  ( 3 min )
    PURL: Safe and Effective Sanitization of Link Decoration
    arXiv:2308.03417v2 Announce Type: replace-cross Abstract: While privacy-focused browsers have taken steps to block third-party cookies and mitigate browser fingerprinting, novel tracking techniques that can bypass existing countermeasures continue to emerge. Since trackers need to share information from the client-side to the server-side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect and sanitize tracking information in decorated links. To this end, we present PURL (pronounced purel-l), a machine-learning approach that leverages a cross-layer graph representation of webpage execution to safely and effectively sanitize link decoration. Our evaluation shows that PURL significantly outperforms existing countermeasures in terms of accuracy and reducing website breakage while being robust to common evasion techniques. PURL's deployment on a sample of top-million websites shows that link decoration is abused for tracking on nearly three-quarters of the websites, often to share cookies, email addresses, and fingerprinting information.  ( 2 min )
    Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision
    arXiv:2312.17285v2 Announce Type: replace-cross Abstract: Understanding intermediate representations of the concepts learned by deep learning classifiers is indispensable for interpreting general model behaviors. Existing approaches to reveal learned concepts often rely on human supervision, such as pre-defined concept sets or segmentation processes. In this paper, we propose a novel unsupervised method for discovering distributed representations of concepts by selecting a principal subset of neurons. Our empirical findings demonstrate that instances with similar neuron activation states tend to share coherent concepts. Based on the observations, the proposed method selects principal neurons that construct an interpretable region, namely a Relaxed Decision Region (RDR), encompassing instances with coherent concepts in the feature space. It can be utilized to identify unlabeled subclasses within data and to detect the causes of misclassifications. Furthermore, the applicability of our method across various layers discloses distinct distributed representations over the layers, which provides deeper insights into the internal mechanisms of the deep learning model.  ( 2 min )
    TSRNet: Simple Framework for Real-time ECG Anomaly Detection with Multimodal Time and Spectrogram Restoration Network
    arXiv:2312.10187v2 Announce Type: replace-cross Abstract: The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unhealthy conditions using solely normal ECG data for training. Furthermore, to enhance the information available and build a robust system, we suggest considering both the time series and time-frequency domain aspects of the ECG signal. As a result, we introduce a specialized network called the Multimodal Time and Spectrogram Restoration Network (TSRNet) designed specifically for detecting anomalies in ECG signals. TSRNet falls into the category of restoration-based anomaly detection and draws inspiration from both the time series and spectrogram domains. By extracting representations from both domains, TSRNet effectively captures the comprehensive characteristics of the ECG signal. This approach enables the network to learn robust representations with superior discrimination abilities, allowing it to distinguish between normal and abnormal ECG patterns more effectively. Furthermore, we introduce a novel inference method, termed Peak-based Error, that specifically focuses on ECG peaks, a critical component in detecting abnormalities. The experimental result on the large-scale dataset PTB-XL has demonstrated the effectiveness of our approach in ECG anomaly detection, while also prioritizing efficiency by minimizing the number of trainable parameters. Our code is available at https://github.com/UARK-AICV/TSRNet.  ( 3 min )
    The Normal Distributions Indistinguishability Spectrum and its Application to Privacy-Preserving Machine Learning
    arXiv:2309.01243v2 Announce Type: replace-cross Abstract: To achieve differential privacy (DP) one typically randomizes the output of the underlying query. In big data analytics, one often uses randomized sketching/aggregation algorithms to make processing high-dimensional data tractable. Intuitively, such machine learning (ML) algorithms should provide some inherent privacy, yet most if not all existing DP mechanisms do not leverage this inherent randomness, resulting in potentially redundant noising. The motivating question of our work is: (How) can we improve the utility of DP mechanisms for randomized ML queries, by leveraging the randomness of the query itself? Towards a (positive) answer, we prove the Normal Distributions Indistinguishability Spectrum Theorem (in short, NDIS Theorem), a theoretical result with far-reaching practical implications. In a nutshell, NDIS is a closed-form analytic computation for the $(\epsilon,\delta)$-indistinguishability-spectrum (in short, $(\epsilon,\delta)$-IS) of two arbitrary (multi-dimensional) normal distributions $X$ and $Y$, i.e., the optimal $\delta$ (for any given $\epsilon$) such that $X$ and $Y$ are ($\epsilon,\delta$)-close according to the DP distance. The NDIS theorem (1) yields efficient estimators for the above IS, and (2) allows us to analyze DP-mechanisms with normally-distributed outputs, as well as more general mechanisms by leveraging their behavior on large inputs. We apply the NDIS theorem to derive DP mechanisms for queries with normally-distributed outputs -- i.e., Gaussian Random Projections (RP) -- and for more general queries -- i.e., Ordinary Least Squares (OLS). Both RP and OLS are highly relevant in data analytics. Our new DP mechanisms achieve superior privacy/utility trade-offs by leveraging the randomness of the underlying algorithms, and identifies, for the first time, the range of $(\epsilon,\delta)$ for which no additional noising is needed.  ( 3 min )
    Expressive Whole-Body Control for Humanoid Robots
    arXiv:2402.16796v2 Announce Type: replace-cross Abstract: Can we enable humanoid robots to generate rich, diverse, and expressive motions in the real world? We propose to learn a whole-body control policy on a human-sized robot to mimic human motions as realistic as possible. To train such a policy, we leverage the large-scale human motion capture data from the graphics community in a Reinforcement Learning framework. However, directly performing imitation learning with the motion capture dataset would not work on the real humanoid robot, given the large gap in degrees of freedom and physical capabilities. Our method Expressive Whole-Body Control (Exbody) tackles this problem by encouraging the upper humanoid body to imitate a reference motion, while relaxing the imitation constraint on its two legs and only requiring them to follow a given velocity robustly. With training in simulation and Sim2Real transfer, our policy can control a humanoid robot to walk in different styles, shake hands with humans, and even dance with a human in the real world. We conduct extensive studies and comparisons on diverse motions in both simulation and the real world to show the effectiveness of our approach.  ( 2 min )
    Environmental Insights: Democratizing Access to Ambient Air Pollution Data and Predictive Analytics with an Open-Source Python Package
    arXiv:2403.03664v1 Announce Type: cross Abstract: Ambient air pollution is a pervasive issue with wide-ranging effects on human health, ecosystem vitality, and economic structures. Utilizing data on ambient air pollution concentrations, researchers can perform comprehensive analyses to uncover the multifaceted impacts of air pollution across society. To this end, we introduce Environmental Insights, an open-source Python package designed to democratize access to air pollution concentration data. This tool enables users to easily retrieve historical air pollution data and employ a Machine Learning model for forecasting potential future conditions. Moreover, Environmental Insights includes a suite of tools aimed at facilitating the dissemination of analytical findings and enhancing user engagement through dynamic visualizations. This comprehensive approach ensures that the package caters to the diverse needs of individuals looking to explore and understand air pollution trends and their implications.  ( 2 min )
    Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
    arXiv:2303.11525v3 Announce Type: replace Abstract: Recent research has focused on weight sparsity in neural network training to reduce FLOPs, aiming for improved efficiency (test accuracy w.r.t training FLOPs). However, sparse weight training often sacrifices accuracy, requiring extended training schedules to attain the accuracy of dense models. In contrast, our approach, Sparse Iso-FLOP Transformations (Sparse-IFT), uses sparsity to improve accuracy while maintaining dense model FLOPs. Using a single hyperparameter (i.e., sparsity level), Sparse-IFTs efficiently replace dense layers, expanding the search space for optimal sparse masks. In addition, dynamic sparse training with Sparse-IFT models effectively navigates this larger sparse mask-weight space, which is evidenced by a spectral analysis using Ramanujan graph properties. Our study reveals a robust correlation among mask topology, weights, and final performance. Notably, without adjusting hyperparameters, replacing dense layers with Sparse-IFT yields significant improvements, such as a +3.5% boost for ResNet-18 on ImageNet and +0.9% for GPT-3 Small on the Open LLM leaderboard. To our knowledge, this is the first work to demonstrate the use of sparsity for improving the accuracy of dense models through a simple-to-use set of sparse transformations. Code is available at: https://github.com/CerebrasResearch/Sparse-IFT.  ( 3 min )
    l1-norm regularized l1-norm best-fit lines
    arXiv:2402.16712v2 Announce Type: replace-cross Abstract: In this work, we propose an optimization framework for estimating a sparse robust one-dimensional subspace. Our objective is to minimize both the representation error and the penalty, in terms of the l1-norm criterion. Given that the problem is NP-hard, we introduce a linear relaxation-based approach. Additionally, we present a novel fitting procedure, utilizing simple ratios and sorting techniques. The proposed algorithm demonstrates a worst-case time complexity of $O(n^2 m \log n)$ and, in certain instances, achieves global optimality for the sparse robust subspace, thereby exhibiting polynomial time efficiency. Compared to extant methodologies, the proposed algorithm finds the subspace with the lowest discordance, offering a smoother trade-off between sparsity and fit. Its architecture affords scalability, evidenced by a 16-fold improvement in computational speeds for matrices of 2000x2000 over CPU version. Furthermore, this method is distinguished by several advantages, including its independence from initialization and deterministic and replicable procedures. Furthermore, this method is distinguished by several advantages, including its independence from initialization and deterministic and replicable procedures. The real-world example demonstrates the effectiveness of algorithm in achieving meaningful sparsity, underscoring its precise and useful application across various domains.  ( 2 min )
    Efficient Quantum Agnostic Improper Learning of Decision Trees
    arXiv:2210.00212v3 Announce Type: replace-cross Abstract: The agnostic setting is the hardest generalization of the PAC model since it is akin to learning with adversarial noise. In this paper, we give a poly$(n,t,{\frac{1}{\varepsilon}})$ quantum algorithm for learning size $t$ decision trees with uniform marginal over instances, in the agnostic setting, without membership queries. Our algorithm is the first algorithm (classical or quantum) for learning decision trees in polynomial time without membership queries. We show how to construct a quantum agnostic weak learner by designing a quantum version of the classical Goldreich-Levin algorithm that works with strongly biased function oracles. We show how to quantize the agnostic boosting algorithm by Kalai and Kanade (NIPS 2009) to obtain the first efficient quantum agnostic boosting algorithm. Our quantum boosting algorithm has a polynomial improvement in the dependence of the bias of the weak learner over all adaptive quantum boosting algorithms while retaining the standard speedup in the VC dimension over classical boosting algorithms. We then use our quantum boosting algorithm to boost the weak quantum learner we obtained in the previous step to obtain a quantum agnostic learner for decision trees. Using the above framework, we also give quantum decision tree learning algorithms for both the realizable setting and random classification noise model, again without membership queries.  ( 2 min )
    Latent Dataset Distillation with Diffusion Models
    arXiv:2403.03881v1 Announce Type: cross Abstract: The efficacy of machine learning has traditionally relied on the availability of increasingly larger datasets. However, large datasets pose storage challenges and contain non-influential samples, which could be ignored during training without impacting the final accuracy of the model. In response to these limitations, the concept of distilling the information on a dataset into a condensed set of (synthetic) samples, namely a distilled dataset, emerged. One crucial aspect is the selected architecture (usually ConvNet) for linking the original and synthetic datasets. However, the final accuracy is lower if the employed model architecture differs from the model used during distillation. Another challenge is the generation of high-resolution images, e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation with Diffusion Models (LD3M) that combine diffusion in latent space with dataset distillation to tackle both challenges. LD3M incorporates a novel diffusion process tailored for dataset distillation, which improves the gradient norms for learning synthetic images. By adjusting the number of diffusion steps, LD3M also offers a straightforward way of controlling the trade-off between speed and accuracy. We evaluate our approach in several ImageNet subsets and for high-resolution images (128x128 and 256x256). As a result, LD3M consistently outperforms state-of-the-art distillation techniques by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.  ( 2 min )
    Planted Bipartite Graph Detection
    arXiv:2302.03658v2 Announce Type: replace-cross Abstract: We consider the task of detecting a hidden bipartite subgraph in a given random graph. This is formulated as a hypothesis testing problem, under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph over $n$ vertices with edge density $q$. Under the alternative, there exists a planted $k_{\mathsf{R}} \times k_{\mathsf{L}}$ bipartite subgraph with edge density $p>q$. We characterize the statistical and computational barriers for this problem. Specifically, we derive information-theoretic lower bounds, and design and analyze optimal algorithms matching those bounds, in both the dense regime, where $p,q = \Theta\left(1\right)$, and the sparse regime where $p,q = \Theta\left(n^{-\alpha}\right), \alpha \in \left(0,2\right]$. We also consider the problem of testing in polynomial-time. As is customary in similar structured high-dimensional problems, our model undergoes an "easy-hard-impossible" phase transition and computational constraints penalize the statistical performance. To provide an evidence for this statistical computational gap, we prove computational lower bounds based on the low-degree conjecture, and show that the class of low-degree polynomials algorithms fail in the conjecturally hard region.  ( 2 min )
    MedLM: Exploring Language Models for Medical Question Answering Systems
    arXiv:2401.11389v2 Announce Type: replace-cross Abstract: In the face of rapidly expanding online medical literature, automated systems for aggregating and summarizing information are becoming increasingly crucial for healthcare professionals and patients. Large Language Models (LLMs), with their advanced generative capabilities, have shown promise in various NLP tasks, and their potential in the healthcare domain, particularly for Closed-Book Generative QnA, is significant. However, the performance of these models in domain-specific tasks such as medical Q&A remains largely unexplored. This study aims to fill this gap by comparing the performance of general and medical-specific distilled LMs for medical Q&A. We aim to evaluate the effectiveness of fine-tuning domain-specific LMs and compare the performance of different families of Language Models. The study will address critical questions about these models' reliability, comparative performance, and effectiveness in the context of medical Q&A. The findings will provide valuable insights into the suitability of different LMs for specific applications in the medical domain.  ( 2 min )
    Spatio-Temporal Self-Supervised Learning for Traffic Flow Prediction
    arXiv:2212.04475v2 Announce Type: replace Abstract: Robust prediction of citywide traffic flows at different time periods plays a crucial role in intelligent transportation systems. While previous work has made great efforts to model spatio-temporal correlations, existing methods still suffer from two key limitations: i) Most models collectively predict all regions' flows without accounting for spatial heterogeneity, i.e., different regions may have skewed traffic flow distributions. ii) These models fail to capture the temporal heterogeneity induced by time-varying traffic patterns, as they typically model temporal correlations with a shared parameterized space for all time periods. To tackle these challenges, we propose a novel Spatio-Temporal Self-Supervised Learning (ST-SSL) traffic prediction framework which enhances the traffic pattern representations to be reflective of both spatial and temporal heterogeneity, with auxiliary self-supervised learning paradigms. Specifically, our ST-SSL is built over an integrated module with temporal and spatial convolutions for encoding the information across space and time. To achieve the adaptive spatio-temporal self-supervised learning, our ST-SSL first performs the adaptive augmentation over the traffic flow graph data at both attribute- and structure-levels. On top of the augmented traffic graph, two SSL auxiliary tasks are constructed to supplement the main traffic prediction task with spatial and temporal heterogeneity-aware augmentation. Experiments on four benchmark datasets demonstrate that ST-SSL consistently outperforms various state-of-the-art baselines. Since spatio-temporal heterogeneity widely exists in practical datasets, the proposed framework may also cast light on other spatial-temporal applications. Model implementation is available at https://github.com/Echo-Ji/ST-SSL.  ( 3 min )
    HarvestNet: A Dataset for Detecting Smallholder Farming Activity Using Harvest Piles and Remote Sensing
    arXiv:2308.12061v2 Announce Type: replace-cross Abstract: Small farms contribute to a large share of the productive land in developing countries. In regions such as sub-Saharan Africa, where 80\% of farms are small (under 2 ha in size), the task of mapping smallholder cropland is an important part of tracking sustainability measures such as crop productivity. However, the visually diverse and nuanced appearance of small farms has limited the effectiveness of traditional approaches to cropland mapping. Here we introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems throughout the world. We present HarvestNet, a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023, collected using expert knowledge and satellite images, totaling 7k hand-labeled images and 2k ground-collected labels. We also benchmark a set of baselines, including SOTA models in remote sensing, with our best models having around 80\% classification performance on hand labelled data and 90\% and 98\% accuracy on ground truth data for Tigray and Amhara, respectively. We also perform a visual comparison with a widely used pre-existing coverage map and show that our model detects an extra 56,621 hectares of cropland in Tigray. We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food insecure regions. The dataset can be accessed through https://figshare.com/s/45a7b45556b90a9a11d2, while the code for the dataset and benchmarks is publicly available at https://github.com/jonxuxu/harvest-piles  ( 3 min )
    Residual Multi-Fidelity Neural Network Computing
    arXiv:2310.03572v2 Announce Type: replace Abstract: In this work, we consider the general problem of constructing a neural network surrogate model using multi-fidelity information. Motivated by rigorous error and complexity estimates for ReLU neural networks, given an inexpensive low-fidelity and an expensive high-fidelity computational model, we present a residual multi-fidelity computational framework that formulates the correlation between models as a residual function, a possibly non-linear mapping between 1) the shared input space of the models together with the low-fidelity model output and 2) the discrepancy between the two model outputs. To accomplish this, we train two neural networks to work in concert. The first network learns the residual function on a small set of high-fidelity and low-fidelity data. Once trained, this network is used to generate additional synthetic high-fidelity data, which is used in the training of a second network. This second network, once trained, acts as our surrogate for the high-fidelity quantity of interest. We present three numerical examples to demonstrate the power of the proposed framework. In particular, we show that dramatic savings in computational cost may be achieved when the output predictions are desired to be accurate within small tolerances.  ( 2 min )
    Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms
    arXiv:2401.07342v2 Announce Type: replace-cross Abstract: Young children spend substantial portions of their waking hours in noisy preschool classrooms. In these environments, children's vocal interactions with teachers are critical contributors to their language outcomes, but manually transcribing these interactions is prohibitive. Using audio from child- and teacher-worn recorders, we propose an automated framework that uses open source software both to classify speakers (ALICE) and to transcribe their utterances (Whisper). We compare results from our framework to those from a human expert for 110 minutes of classroom recordings, including 85 minutes from child-word microphones (n=4 children) and 25 minutes from teacher-worn microphones (n=2 teachers). The overall proportion of agreement, that is, the proportion of correctly classified teacher and child utterances, was .76, with an error-corrected kappa of .50 and a weighted F1 of .76. The word error rate for both teacher and child transcriptions was .15, meaning that 15% of words would need to be deleted, added, or changed to equate the Whisper and expert transcriptions. Moreover, speech features such as the mean length of utterances in words, the proportion of teacher and child utterances that were questions, and the proportion of utterances that were responded to within 2.5 seconds were similar when calculated separately from expert and automated transcriptions. The results suggest substantial progress in analyzing classroom speech that may support children's language development. Future research using natural language processing is underway to improve speaker classification and to analyze results from the application of the automated it framework to a larger dataset containing classroom recordings from 13 children and 4 teachers observed on 17 occasions over one year.  ( 3 min )
    Structure-Preserving Transformers for Sequences of SPD Matrices
    arXiv:2309.07579v5 Announce Type: replace Abstract: In recent years, Transformer-based auto-attention mechanisms have been successfully applied to the analysis of a variety of context-reliant data types, from texts to images and beyond, including data from non-Euclidean geometries. In this paper, we present such a mechanism, designed to classify sequences of Symmetric Positive Definite matrices while preserving their Riemannian geometry throughout the analysis. We apply our method to automatic sleep staging on timeseries of EEG-derived covariance matrices from a standard dataset, obtaining high levels of stage-wise performance.  ( 2 min )
    A Lightweight CNN-Transformer Model for Learning Traveling Salesman Problems
    arXiv:2305.01883v2 Announce Type: replace Abstract: Several studies have attempted to solve traveling salesman problems (TSPs) using various deep learning techniques. Among them, Transformer-based models show state-of-the-art performance even for large-scale Traveling Salesman Problems (TSPs). However, they are based on fully-connected attention models and suffer from large computational complexity and GPU memory usage. Our work is the first CNN-Transformer model based on a CNN embedding layer and partial self-attention for TSP. Our CNN-Transformer model is able to better learn spatial features from input data using a CNN embedding layer compared with the standard Transformer-based models. It also removes considerable redundancy in fully-connected attention models using the proposed partial self-attention. Experimental results show that the proposed CNN embedding layer and partial self-attention are very effective in improving performance and computational complexity. The proposed model exhibits the best performance in real-world datasets and outperforms other existing state-of-the-art (SOTA) Transformer-based models in various aspects. Our code is publicly available at https://github.com/cm8908/CNN_Transformer3.  ( 2 min )
    Neural Koopman prior for data assimilation
    arXiv:2309.05317v2 Announce Type: replace Abstract: With the increasing availability of large scale datasets, computational power and tools like automatic differentiation and expressive neural network architectures, sequential data are now often treated in a data-driven way, with a dynamical model trained from the observation data. While neural networks are often seen as uninterpretable black-box architectures, they can still benefit from physical priors on the data and from mathematical knowledge. In this paper, we use a neural network architecture which leverages the long-known Koopman operator theory to embed dynamical systems in latent spaces where their dynamics can be described linearly, enabling a number of appealing features. We introduce methods that enable to train such a model for long-term continuous reconstruction, even in difficult contexts where the data comes in irregularly-sampled time series. The potential for self-supervised learning is also demonstrated, as we show the promising use of trained dynamical models as priors for variational data assimilation techniques, with applications to e.g. time series interpolation and forecasting.  ( 2 min )
    DART: Implicit Doppler Tomography for Radar Novel View Synthesis
    arXiv:2403.03896v1 Announce Type: cross Abstract: Simulation is an invaluable tool for radio-frequency system designers that enables rapid prototyping of various algorithms for imaging, target detection, classification, and tracking. However, simulating realistic radar scans is a challenging task that requires an accurate model of the scene, radio frequency material properties, and a corresponding radar synthesis function. Rather than specifying these models explicitly, we propose DART - Doppler Aided Radar Tomography, a Neural Radiance Field-inspired method which uses radar-specific physics to create a reflectance and transmittance-based rendering pipeline for range-Doppler images. We then evaluate DART by constructing a custom data collection platform and collecting a novel radar dataset together with accurate position and instantaneous velocity measurements from lidar-based localization. In comparison to state-of-the-art baselines, DART synthesizes superior radar range-Doppler images from novel views across all datasets and additionally can be used to generate high quality tomographic images.  ( 2 min )
    ClusterNet: A Perception-Based Clustering Model for Scattered Data
    arXiv:2304.14185v3 Announce Type: replace Abstract: Visualizations for scattered data are used to make users understand certain attributes of their data by solving different tasks, e.g. correlation estimation, outlier detection, cluster separation. In this paper, we focus on the later task, and develop a technique that is aligned to human perception, that can be used to understand how human subjects perceive clusterings in scattered data and possibly optimize for better understanding. Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques, such as for instance k-means or DBSCAN. However, as these algorithms are based on non-perceptual metrics, we can show in our experiments, that their output do not reflect human cluster perception. We propose a learning strategy which directly operates on scattered data. To learn perceptual cluster separation on this data, we crowdsourced a large scale dataset, consisting of 7,320 point-wise cluster affiliations for bivariate data, which has been labeled by 384 human crowd workers. Based on this data, we were able to train ClusterNet, a point-based deep learning model, trained to reflect human perception of cluster separability. In order to train ClusterNet on human annotated data, we use a PointNet++ architecture enabling inference on point clouds directly. In this work, we provide details on how we collected our dataset, report statistics of the resulting annotations, and investigate perceptual agreement of cluster separation for real-world data. We further report the training and evaluation protocol of ClusterNet and introduce a novel metric, that measures the accuracy between a clustering technique and a group of human annotators. Finally, we compare our approach against existing state-of-the-art clustering techniques and can show, that ClusterNet is able to generalize to unseen and out of scope data.  ( 3 min )
    Ordinal Potential-based Player Rating
    arXiv:2306.05366v4 Announce Type: replace-cross Abstract: It was recently observed that Elo ratings fail at preserving transitive relations among strategies and therefore cannot correctly extract the transitive component of a game. We provide a characterization of transitive games as a weak variant of ordinal potential games and show that Elo ratings actually do preserve transitivity when computed in the right space, using suitable invertible mappings. Leveraging this insight, we introduce a new game decomposition of an arbitrary game into transitive and cyclic components that is learnt using a neural network-based architecture and that prioritises capturing the sign pattern of the game, namely transitive and cyclic relations among strategies. We link our approach to the known concept of sign-rank, and evaluate our methodology using both toy examples and empirical data from real-world games.  ( 2 min )
    Fair Text-to-Image Diffusion via Fair Mapping
    arXiv:2311.17695v2 Announce Type: replace-cross Abstract: In this paper, we address the limitations of existing text-to-image diffusion models in generating demographically fair results when given human-related descriptions. These models often struggle to disentangle the target language context from sociocultural biases, resulting in biased image generation. To overcome this challenge, we propose Fair Mapping, a flexible, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image diffusion model by controlling the prompt to achieve fair image generation. One key advantage of our approach is its high efficiency. It only requires updating an additional linear network with few parameters at a low computational cost. By developing a linear network that maps conditioning embeddings into a debiased space, we enable the generation of relatively balanced demographic results based on the specified text condition. With comprehensive experiments on face image generation, we show that our method significantly improves image generation fairness with almost the same image quality compared to conventional diffusion models when prompted with descriptions related to humans. By effectively addressing the issue of implicit language bias, our method produces more fair and diverse image outputs.  ( 2 min )
    Methods for generating and evaluating synthetic longitudinal patient data: a systematic review
    arXiv:2309.12380v2 Announce Type: replace-cross Abstract: The proliferation of data in recent years has led to the advancement and utilization of various statistical and deep learning techniques, thus expediting research and development activities. However, not all industries have benefited equally from the surge in data availability, partly due to legal restrictions on data usage and privacy regulations, such as in medicine. To address this issue, various statistical disclosure and privacy-preserving methods have been proposed, including the use of synthetic data generation. Synthetic data are generated based on some existing data, with the aim of replicating them as closely as possible and acting as a proxy for real sensitive data. This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data, a prevalent data type in medicine. The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods. The collected information includes, but is not limited to, method type, source code availability, and approaches used to assess resemblance, utility, and privacy. Furthermore, the paper discusses practical guidelines and key considerations for developing synthetic longitudinal data generation methods.  ( 2 min )
    Interpretable Stereotype Identification through Reasoning
    arXiv:2308.00071v2 Announce Type: replace-cross Abstract: Given that language models are trained on vast datasets that may contain inherent biases, there is a potential danger of inadvertently perpetuating systemic discrimination. Consequently, it becomes essential to examine and address biases in language models, integrating fairness into their development to ensure these models are equitable and free from bias. In this work, we demonstrate the importance of reasoning in zero-shot stereotype identification based on Vicuna-13B-v1.3. While we do observe improved accuracy by scaling from 13B to 33B, we show that the performance gain from reasoning significantly exceeds the gain from scaling up. Our findings suggest that reasoning could be a key factor that enables LLMs to trescend the scaling law on out-of-domain tasks such as stereotype identification. Additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning enhances not just accuracy but also the interpretability of the decision.  ( 2 min )
    Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
    arXiv:2308.09778v3 Announce Type: replace-cross Abstract: Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed that these models lack fine-grained understanding, such as the ability to count and recognize verbs, attributes, or relationships. The focus of this work is to study the understanding of spatial relations. This has been tackled previously using image-text matching (e.g., Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the poor object localization "grounding" ability of the models is a contributing factor to the poor image-text matching performance. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.  ( 3 min )
    Decision-making with Speculative Opponent Models
    arXiv:2211.11940v2 Announce Type: replace-cross Abstract: Opponent modeling has benefited a controlled agent's decision-making by constructing models of other agents. Existing methods commonly assume access to opponents' observations and actions, which is infeasible when opponents' behaviors are unobservable or hard to obtain. We propose a novel multi-agent distributional actor-critic algorithm to achieve speculative opponent modeling with purely local information (i.e., the controlled agent's observations, actions, and rewards). Specifically, the actor maintains a speculated belief of the opponents, which we call the speculative opponent models, to predict opponent actions using local observations and makes decisions accordingly. Further, the distributional critic models the return distribution of the policy. It reflects the quality of the actor and thus can guide the training of the speculative opponent model that the actor relies on. Extensive experiments confirm that our method successfully models opponents' behaviors without their data and delivers superior performance against baseline methods with a faster convergence speed.  ( 2 min )
    A Reduction-based Framework for Sequential Decision Making with Delayed Feedback
    arXiv:2302.01477v5 Announce Type: replace Abstract: We study stochastic delayed feedback in general multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs). We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle stochastic delays in sequential decision making. By plugging different multi-batched algorithms into our framework, we provide several examples demonstrating that our framework not only matches or improves existing results for bandits, tabular MDPs, and tabular MGs, but also provides the first line of studies on delays in sequential decision making with function approximation. In summary, we provide a complete set of sharp results for multi-agent sequential decision making with delayed feedback.  ( 2 min )
    Designing Informative Metrics for Few-Shot Example Selection
    arXiv:2403.03861v1 Announce Type: cross Abstract: Pretrained language models (PLMs) have shown remarkable few-shot learning capabilities when provided with properly formatted examples. However, selecting the "best" examples remains an open challenge. We propose a complexity-based prompt selection approach for sequence tagging tasks. This approach avoids the training of a dedicated model for selection of examples, and instead uses certain metrics to align the syntactico-semantic complexity of test sentences and examples. We use both sentence- and word-level metrics to match the complexity of examples to the (test) sentence being considered. Our results demonstrate that our approach extracts greater performance from PLMs: it achieves state-of-the-art performance on few-shot NER, achieving a 5% absolute improvement in F1 score on the CoNLL2003 dataset for GPT-4. We also see large gains of upto 28.85 points (F1/Acc.) in smaller models like GPT-j-6B.  ( 2 min )
    On the Origins of Linear Representations in Large Language Models
    arXiv:2403.03867v1 Announce Type: cross Abstract: Recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to show that the next token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.  ( 2 min )
    Continual Driving Policy Optimization with Closed-Loop Individualized Curricula
    arXiv:2309.14209v3 Announce Type: replace Abstract: The safety of autonomous vehicles (AV) has been a long-standing top concern, stemming from the absence of rare and safety-critical scenarios in the long-tail naturalistic driving distribution. To tackle this challenge, a surge of research in scenario-based autonomous driving has emerged, with a focus on generating high-risk driving scenarios and applying them to conduct safety-critical testing of AV models. However, limited work has been explored on the reuse of these extensive scenarios to iteratively improve AV models. Moreover, it remains intractable and challenging to filter through gigantic scenario libraries collected from other AV models with distinct behaviors, attempting to extract transferable information for current AV improvement. Therefore, we develop a continual driving policy optimization framework featuring Closed-Loop Individualized Curricula (CLIC), which we factorize into a set of standardized sub-modules for flexible implementation choices: AV Evaluation, Scenario Selection, and AV Training. CLIC frames AV Evaluation as a collision prediction task, where it estimates the chance of AV failures in these scenarios at each iteration. Subsequently, by re-sampling from historical scenarios based on these failure probabilities, CLIC tailors individualized curricula for downstream training, aligning them with the evaluated capability of AV. Accordingly, CLIC not only maximizes the utilization of the vast pre-collected scenario library for closed-loop driving policy optimization but also facilitates AV improvement by individualizing its training with more challenging cases out of those poorly organized scenarios. Experimental results clearly indicate that CLIC surpasses other curriculum-based training strategies, showing substantial improvement in managing risky scenarios, while still maintaining proficiency in handling simpler cases.  ( 3 min )
    GUIDE: Guidance-based Incremental Learning with Diffusion Models
    arXiv:2403.03938v1 Announce Type: new Abstract: We introduce GUIDE, a novel continual learning approach that directs diffusion models to rehearse samples at risk of being forgotten. Existing generative strategies combat catastrophic forgetting by randomly sampling rehearsal examples from a generative model. Such an approach contradicts buffer-based approaches where sampling strategy plays an important role. We propose to bridge this gap by integrating diffusion models with classifier guidance techniques to produce rehearsal examples specifically targeting information forgotten by a continuously trained model. This approach enables the generation of samples from preceding task distributions, which are more likely to be misclassified in the context of recently encountered classes. Our experimental results show that GUIDE significantly reduces catastrophic forgetting, outperforming conventional random sampling approaches and surpassing recent state-of-the-art methods in continual learning with generative replay.  ( 2 min )
    Advancing Out-of-Distribution Detection through Data Purification and Dynamic Activation Function Design
    arXiv:2403.03412v1 Announce Type: new Abstract: In the dynamic realms of machine learning and deep learning, the robustness and reliability of models are paramount, especially in critical real-world applications. A fundamental challenge in this sphere is managing Out-of-Distribution (OOD) samples, significantly increasing the risks of model misclassification and uncertainty. Our work addresses this challenge by enhancing the detection and management of OOD samples in neural networks. We introduce OOD-R (Out-of-Distribution-Rectified), a meticulously curated collection of open-source datasets with enhanced noise reduction properties. In-Distribution (ID) noise in existing OOD datasets can lead to inaccurate evaluation of detection algorithms. Recognizing this, OOD-R incorporates noise filtering technologies to refine the datasets, ensuring a more accurate and reliable evaluation of OOD detection algorithms. This approach not only improves the overall quality of data but also aids in better distinguishing between OOD and ID samples, resulting in up to a 2.5\% improvement in model accuracy and a minimum 3.2\% reduction in false positives. Furthermore, we present ActFun, an innovative method that fine-tunes the model's response to diverse inputs, thereby improving the stability of feature extraction and minimizing specificity issues. ActFun addresses the common problem of model overconfidence in OOD detection by strategically reducing the influence of hidden units, which enhances the model's capability to estimate OOD uncertainty more accurately. Implementing ActFun in the OOD-R dataset has led to significant performance enhancements, including an 18.42\% increase in AUROC of the GradNorm method and a 16.93\% decrease in FPR95 of the Energy method. Overall, our research not only advances the methodologies in OOD detection but also emphasizes the importance of dataset integrity for accurate algorithm evaluation.  ( 3 min )
    Probing the Robustness of Time-series Forecasting Models with CounterfacTS
    arXiv:2403.03508v1 Announce Type: new Abstract: A common issue for machine learning models applied to time-series forecasting is the temporal evolution of the data distributions (i.e., concept drift). Because most of the training data does not reflect such changes, the models present poor performance on the new out-of-distribution scenarios and, therefore, the impact of such events cannot be reliably anticipated ahead of time. We present and publicly release CounterfacTS, a tool to probe the robustness of deep learning models in time-series forecasting tasks via counterfactuals. CounterfacTS has a user-friendly interface that allows the user to visualize, compare and quantify time series data and their forecasts, for a number of datasets and deep learning models. Furthermore, the user can apply various transformations to the time series and explore the resulting changes in the forecasts in an interpretable manner. Through example cases, we illustrate how CounterfacTS can be used to i) identify the main features characterizing and differentiating sets of time series, ii) assess how the model performance depends on these characateristics, and iii) guide transformations of the original time series to create counterfactuals with desired properties for training and increasing the forecasting performance in new regions of the data distribution. We discuss the importance of visualizing and considering the location of the data in a projected feature space to transform time-series and create effective counterfactuals for training the models. Overall, CounterfacTS aids at creating counterfactuals to efficiently explore the impact of hypothetical scenarios not covered by the original data in time-series forecasting tasks.  ( 3 min )
    A Teacher-Free Graph Knowledge Distillation Framework with Dual Self-Distillation
    arXiv:2403.03483v1 Announce Type: new Abstract: Recent years have witnessed great success in handling graph-related tasks with Graph Neural Networks (GNNs). Despite their great academic success, Multi-Layer Perceptrons (MLPs) remain the primary workhorse for practical industrial applications. One reason for such an academic-industry gap is the neighborhood-fetching latency incurred by data dependency in GNNs. To reduce their gaps, Graph Knowledge Distillation (GKD) is proposed, usually based on a standard teacher-student architecture, to distill knowledge from a large teacher GNN into a lightweight student GNN or MLP. However, we found in this paper that neither teachers nor GNNs are necessary for graph knowledge distillation. We propose a Teacher-Free Graph Self-Distillation (TGS) framework that does not require any teacher model or GNNs during both training and inference. More importantly, the proposed TGS framework is purely based on MLPs, where structural information is only implicitly used to guide dual knowledge self-distillation between the target node and its neighborhood. As a result, TGS enjoys the benefits of graph topology awareness in training but is free from data dependency in inference. Extensive experiments have shown that the performance of vanilla MLPs can be greatly improved with dual self-distillation, e.g., TGS improves over vanilla MLPs by 15.54% on average and outperforms state-of-the-art GKD algorithms on six real-world datasets. In terms of inference speed, TGS infers 75X-89X faster than existing GNNs and 16X-25X faster than classical inference acceleration methods.  ( 3 min )
    RACE-SM: Reinforcement Learning Based Autonomous Control for Social On-Ramp Merging
    arXiv:2403.03359v1 Announce Type: cross Abstract: Autonomous parallel-style on-ramp merging in human controlled traffic continues to be an existing issue for autonomous vehicle control. Existing non-learning based solutions for vehicle control rely on rules and optimization primarily. These methods have been seen to present significant challenges. Recent advancements in Deep Reinforcement Learning have shown promise and have received significant academic interest however the available learning based approaches show inadequate attention to other highway vehicles and often rely on inaccurate road traffic assumptions. In addition, the parallel-style case is rarely considered. A novel learning based model for acceleration and lane change decision making that explicitly considers the utility to both the ego vehicle and its surrounding vehicles which may be cooperative or uncooperative to produce behaviour that is socially acceptable is proposed. The novel reward function makes use of Social Value Orientation to weight the vehicle's level of social cooperation and is divided into ego vehicle and surrounding vehicle utility which are weighted according to the model's designated Social Value Orientation. A two-lane highway with an on-ramp divided into a taper-style and parallel-style section is considered. Simulation results indicated the importance of considering surrounding vehicles in reward function design and show that the proposed model matches or surpasses those in literature in terms of collisions while also introducing socially courteous behaviour avoiding near misses and anti-social behaviour through direct consideration of the effect of merging on surrounding vehicles.  ( 2 min )
    Extreme Precipitation Nowcasting using Transformer-based Generative Models
    arXiv:2403.03929v1 Announce Type: new Abstract: This paper presents an innovative approach to extreme precipitation nowcasting by employing Transformer-based generative models, namely NowcastingGPT with Extreme Value Loss (EVL) regularization. Leveraging a comprehensive dataset from the Royal Netherlands Meteorological Institute (KNMI), our study focuses on predicting short-term precipitation with high accuracy. We introduce a novel method for computing EVL without assuming fixed extreme representations, addressing the limitations of current models in capturing extreme weather events. We present both qualitative and quantitative analyses, demonstrating the superior performance of the proposed NowcastingGPT-EVL in generating accurate precipitation forecasts, especially when dealing with extreme precipitation events. The code is available at \url{https://github.com/Cmeo97/NowcastingGPT}.  ( 2 min )
    KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs
    arXiv:2403.03791v1 Announce Type: new Abstract: Treatment effect estimation (TEE) is the task of determining the impact of various treatments on patient outcomes. Current TEE methods fall short due to reliance on limited labeled data and challenges posed by sparse and high-dimensional observational patient data. To address the challenges, we introduce a novel pre-training and fine-tuning framework, KG-TREAT, which synergizes large-scale observational patient data with biomedical knowledge graphs (KGs) to enhance TEE. Unlike previous approaches, KG-TREAT constructs dual-focus KGs and integrates a deep bi-level attention synergy method for in-depth information fusion, enabling distinct encoding of treatment-covariate and outcome-covariate relationships. KG-TREAT also incorporates two pre-training tasks to ensure a thorough grounding and contextualization of patient data and KGs. Evaluation on four downstream TEE tasks shows KG-TREAT's superiority over existing methods, with an average improvement of 7% in Area under the ROC Curve (AUC) and 9% in Influence Function-based Precision of Estimating Heterogeneous Effects (IF-PEHE). The effectiveness of our estimated treatment effects is further affirmed by alignment with established randomized clinical trial findings.  ( 2 min )
    An Ensemble Framework for Explainable Geospatial Machine Learning Models
    arXiv:2403.03328v1 Announce Type: new Abstract: Analyzing spatial varying effect is pivotal in geographic analysis. Yet, accurately capturing and interpreting this variability is challenging due to the complexity and non-linearity of geospatial data. Herein, we introduce an integrated framework that merges local spatial weighting scheme, Explainable Artificial Intelligence (XAI), and cutting-edge machine learning technologies to bridge the gap between traditional geographic analysis models and general machine learning approaches. Through tests on synthetic datasets, this framework is verified to enhance the interpretability and accuracy of predictions in both geographic regression and classification by elucidating spatial variability. It significantly boosts prediction precision, offering a novel approach to understanding spatial phenomena.  ( 2 min )
    Self-Attention Empowered Graph Convolutional Network for Structure Learning and Node Embedding
    arXiv:2403.03465v1 Announce Type: new Abstract: In representation learning on graph-structured data, many popular graph neural networks (GNNs) fail to capture long-range dependencies, leading to performance degradation. Furthermore, this weakness is magnified when the concerned graph is characterized by heterophily (low homophily). To solve this issue, this paper proposes a novel graph learning framework called the graph convolutional network with self-attention (GCN-SA). The proposed scheme exhibits an exceptional generalization capability in node-level representation learning. The proposed GCN-SA contains two enhancements corresponding to edges and node features. For edges, we utilize a self-attention mechanism to design a stable and effective graph-structure-learning module that can capture the internal correlation between any pair of nodes. This graph-structure-learning module can identify reliable neighbors for each node from the entire graph. Regarding the node features, we modify the transformer block to make it more applicable to enable GCN to fuse valuable information from the entire graph. These two enhancements work in distinct ways to help our GCN-SA capture long-range dependencies, enabling it to perform representation learning on graphs with varying levels of homophily. The experimental results on benchmark datasets demonstrate the effectiveness of the proposed GCN-SA. Compared to other outstanding GNN counterparts, the proposed GCN-SA is competitive.  ( 2 min )
    AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs
    arXiv:2403.03772v1 Announce Type: new Abstract: Existing causal discovery methods based on combinatorial optimization or search are slow, prohibiting their application on large-scale datasets. In response, more recent methods attempt to address this limitation by formulating causal discovery as structure learning with continuous optimization but such approaches thus far provide no statistical guarantees. In this paper, we show that by efficiently parallelizing existing causal discovery methods, we can in fact scale them to thousands of dimensions, making them practical for substantially larger-scale problems. In particular, we parallelize the LiNGAM method, which is quadratic in the number of variables, obtaining up to a 32-fold speed-up on benchmark datasets when compared with existing sequential implementations. Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it. This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions yielding competitive results compared with specialized continuous optimization methods, and Var-LiNGAM for causal discovery on U.S. stock data.  ( 2 min )
    CoRMF: Criticality-Ordered Recurrent Mean Field Ising Solver
    arXiv:2403.03391v1 Announce Type: cross Abstract: We propose an RNN-based efficient Ising model solver, the Criticality-ordered Recurrent Mean Field (CoRMF), for forward Ising problems. In its core, a criticality-ordered spin sequence of an $N$-spin Ising model is introduced by sorting mission-critical edges with greedy algorithm, such that an autoregressive mean-field factorization can be utilized and optimized with Recurrent Neural Networks (RNNs). Our method has two notable characteristics: (i) by leveraging the approximated tree structure of the underlying Ising graph, the newly-obtained criticality order enables the unification between variational mean-field and RNN, allowing the generally intractable Ising model to be efficiently probed with probabilistic inference; (ii) it is well-modulized, model-independent while at the same time expressive enough, and hence fully applicable to any forward Ising inference problems with minimal effort. Computationally, by using a variance-reduced Monte Carlo gradient estimator, CoRFM solves the Ising problems in a self-train fashion without data/evidence, and the inference tasks can be executed by directly sampling from RNN. Theoretically, we establish a provably tighter error bound than naive mean-field by using the matrix cut decomposition machineries. Numerically, we demonstrate the utility of this framework on a series of Ising datasets.  ( 2 min )
    Nearly $d$-Linear Convergence Bounds for Diffusion Models via Stochastic Localization
    arXiv:2308.03686v3 Announce Type: replace-cross Abstract: Denoising diffusions are a powerful method to generate approximate samples from high-dimensional data distributions. Recent results provide polynomial bounds on their convergence rate, assuming $L^2$-accurate scores. Until now, the tightest bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to approximate an arbitrary distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in KL divergence. Our proof extends the Girsanov-based methods of previous works. We introduce a refined treatment of the error from discretizing the reverse SDE inspired by stochastic localization.  ( 2 min )
    AbDiffuser: Full-Atom Generation of in vitro Functioning Antibodies
    arXiv:2308.05027v2 Announce Type: replace-cross Abstract: We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude, enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of the selected designs were tight binders.  ( 2 min )
    High-Fidelity Image Compression with Score-based Generative Models
    arXiv:2305.18231v2 Announce Type: replace-cross Abstract: Despite the tremendous success of diffusion generative models in text-to-image generation, replicating this success in the domain of image compression has proven difficult. In this paper, we demonstrate that diffusion can significantly improve perceptual quality at a given bit-rate, outperforming state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is achieved using a simple but theoretically motivated two-stage approach combining an autoencoder targeting MSE followed by a further score-based decoder. However, as we will show, implementation details matter and the optimal design decisions can differ greatly from typical text-to-image models.  ( 2 min )
    From Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based Approach
    arXiv:2311.03260v2 Announce Type: replace Abstract: We propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Under the view of coupled oscillators, we first show the connection between Kuramoto model and basic GNN and then over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.  ( 2 min )
    Exploration via linearly perturbed loss minimisation
    arXiv:2311.07565v2 Announce Type: replace Abstract: We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.  ( 2 min )
    Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias
    arXiv:2310.14814v3 Announce Type: replace Abstract: Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, although they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraint. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities. The code is available at https://github.com/ambroiseodt/tsim.  ( 2 min )
    Scalable Neural Network Kernels
    arXiv:2310.13225v2 Announce Type: replace Abstract: We introduce the concept of scalable neural network kernels (SNNKs), the replacements of regular feedforward layers (FFLs), capable of approximating the latter, but with favorable computational properties. SNNKs effectively disentangle the inputs from the parameters of the neural network in the FFL, only to connect them in the final computation via the dot-product kernel. They are also strictly more expressive, as allowing to model complicated relationships beyond the functions of the dot-products of parameter-input vectors. We also introduce the neural network bundling process that applies SNNKs to compactify deep neural network architectures, resulting in additional compression gains. In its extreme version, it leads to the fully bundled network whose optimal parameters can be expressed via explicit formulae for several loss functions (e.g. mean squared error), opening a possibility to bypass backpropagation. As a by-product of our analysis, we introduce the mechanism of the universal random features (or URFs), applied to instantiate several SNNK variants, and interesting on its own in the context of scalable kernel methods. We provide rigorous theoretical analysis of all these concepts as well as an extensive empirical evaluation, ranging from point-wise kernel estimation to Transformers' fine-tuning with novel adapter layers inspired by SNNKs. Our mechanism provides up to 5x reduction in the number of trainable parameters, while maintaining competitive accuracy.  ( 2 min )
    Distribution-Free Statistical Dispersion Control for Societal Applications
    arXiv:2309.13786v2 Announce Type: replace Abstract: Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications, it is crucial to understand and control the dispersion of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.  ( 2 min )
    Projected Task-Specific Layers for Multi-Task Reinforcement Learning
    arXiv:2309.08776v2 Announce Type: replace Abstract: Multi-task reinforcement learning could enable robots to scale across a wide variety of manipulation tasks in homes and workplaces. However, generalizing from one task to another and mitigating negative task interference still remains a challenge. Addressing this challenge by successfully sharing information across tasks will depend on how well the structure underlying the tasks is captured. In this work, we introduce our new architecture, Projected Task-Specific Layers (PTSL), that leverages a common policy with dense task-specific corrections through task-specific layers to better express shared and variable task information. We then show that our model outperforms the state of the art on the MT10 and MT50 benchmarks of Meta-World consisting of 10 and 50 goal-conditioned tasks for a Sawyer arm.  ( 2 min )
    Beyond MD17: the reactive xxMD dataset
    arXiv:2308.11155v3 Announce Type: replace Abstract: System specific neural force fields (NFFs) have gained popularity in computational chemistry. One of the most popular datasets as a bencharmk to develop NFFs models is the MD17 dataset and its subsequent extension. These datasets comprise geometries from the equilibrium region of the ground electronic state potential energy surface, sampled from direct adiabatic dynamics. However, many chemical reactions involve significant molecular geometrical deformations, for example, bond breaking. Therefore, MD17 is inadequate to represent a chemical reaction. To address this limitation in MD17, we introduce a new dataset, called Extended Excited-state Molecular Dynamics (xxMD) dataset. The xxMD dataset involves geometries sampled from direct non-adiabatic dynamics, and the energies are computed at both multireference wavefunction theory and density functional theory. We show that the xxMD dataset involves diverse geometries which represent chemical reactions. Assessment of NFF models on xxMD dataset reveals significantly higher predictive errors than those reported for MD17 and its variants. This work underscores the challenges faced in crafting a generalizable NFF model with extrapolation capability.  ( 2 min )
    VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs
    arXiv:2308.02117v3 Announce Type: replace Abstract: GNN-to-MLP distillation aims to utilize knowledge distillation (KD) to learn computationally-efficient multi-layer perceptron (student MLP) on graph data by mimicking the output representations of teacher GNN. Existing methods mainly make the MLP to mimic the GNN predictions over a few class labels. However, the class space may not be expressive enough for covering numerous diverse local graph structures, thus limiting the performance of knowledge transfer from GNN to MLP. To address this issue, we propose to learn a new powerful graph representation space by directly labeling nodes' diverse local structures for GNN-to-MLP distillation. Specifically, we propose a variant of VQ-VAE to learn a structure-aware tokenizer on graph data that can encode each node's local substructure as a discrete code. The discrete codes constitute a codebook as a new graph representation space that is able to identify different local graph structures of nodes with the corresponding code indices. Then, based on the learned codebook, we propose a new distillation target, namely soft code assignments, to directly transfer the structural knowledge of each node from GNN to MLP. The resulting framework VQGraph achieves new state-of-the-art performance on GNN-to-MLP distillation in both transductive and inductive settings across seven graph datasets. We show that VQGraph with better performance infers faster than GNNs by 828x, and also achieves accuracy improvement over GNNs and stand-alone MLPs by 3.90% and 28.05% on average, respectively. Code: https://github.com/YangLing0818/VQGraph.  ( 3 min )
    Multigrid-Augmented Deep Learning Preconditioners for the Helmholtz Equation using Compact Implicit Layers
    arXiv:2306.17486v3 Announce Type: replace Abstract: We present a deep learning-based iterative approach to solve the discrete heterogeneous Helmholtz equation for high wavenumbers. Combining classical iterative multigrid solvers and convolutional neural networks (CNNs) via preconditioning, we obtain a learned neural solver that is faster and scales better than a standard multigrid solver. Our approach offers three main contributions over previous neural methods of this kind. First, we construct a multilevel U-Net-like encoder-solver CNN with an implicit layer on the coarsest grid of the U-Net, where convolution kernels are inverted. This alleviates the field of view problem in CNNs and allows better scalability. Second, we improve upon the previous CNN preconditioner in terms of the number of parameters, computation time, and convergence rates. Third, we propose a multiscale training approach that enables the network to scale to problems of previously unseen dimensions while still maintaining a reasonable training procedure. Our encoder-solver architecture can be used to generalize over different slowness models of various difficulties and is efficient at solving for many right-hand sides per slowness model. We demonstrate the benefits of our novel architecture with numerical experiments on a variety of heterogeneous two-dimensional problems at high wavenumbers.  ( 3 min )
    Massively Scalable Inverse Reinforcement Learning in Google Maps
    arXiv:2305.11290v4 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) offers a powerful and general framework for learning humans' latent preferences in route recommendation, yet no approach has successfully addressed planetary-scale problems with hundreds of millions of states and demonstration trajectories. In this paper, we introduce scaling techniques based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms. We revisit classic IRL methods in the routing context, and make the key observation that there exists a trade-off between the use of cheap, deterministic planners and expensive yet robust stochastic policies. This insight is leveraged in Receding Horizon Inverse Planning (RHIP), a new generalization of classic IRL algorithms that provides fine-grained control over performance trade-offs via its planning horizon. Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published study of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies.  ( 2 min )
    Treat Different Negatives Differently: Enriching Loss Functions with Domain and Range Constraints for Link Prediction
    arXiv:2303.00286v4 Announce Type: replace Abstract: Knowledge graph embedding models (KGEMs) are used for various tasks related to knowledge graphs (KGs), including link prediction. They are trained with loss functions that consider batches of true and false triples. However, different kinds of false triples exist and recent works suggest that they should not be valued equally, leading to specific negative sampling procedures. In line with this recent assumption, we posit that negative triples that are semantically valid w.r.t. signatures of relations (domain and range) are high-quality negatives. Hence, we enrich the three main loss functions for link prediction such that all kinds of negatives are sampled but treated differently based on their semantic validity. In an extensive and controlled experimental setting, we show that the proposed loss functions systematically provide satisfying results which demonstrates both the generality and superiority of our proposed approach. In fact, the proposed loss functions (1) lead to better MRR and Hits@10 values, and (2) drive KGEMs towards better semantic correctness as measured by the Sem@K metric. This highlights that relation signatures globally improve KGEMs, and thus should be incorporated into loss functions. Domains and ranges of relations being largely available in schema-defined KGs, this makes our approach both beneficial and widely usable in practice.  ( 3 min )
    STDEN: Towards Physics-Guided Neural Networks for Traffic Flow Prediction
    arXiv:2209.00225v2 Announce Type: replace Abstract: High-performance traffic flow prediction model designing, a core technology of Intelligent Transportation System, is a long-standing but still challenging task for industrial and academic communities. The lack of integration between physical principles and data-driven models is an important reason for limiting the development of this field. In the literature, physics-based methods can usually provide a clear interpretation of the dynamic process of traffic flow systems but are with limited accuracy, while data-driven methods, especially deep learning with black-box structures, can achieve improved performance but can not be fully trusted due to lack of a reasonable physical basis. To bridge the gap between purely data-driven and physics-driven approaches, we propose a physics-guided deep learning model named Spatio-Temporal Differential Equation Network (STDEN), which casts the physical mechanism of traffic flow dynamics into a deep neural network framework. Specifically, we assume the traffic flow on road networks is driven by a latent potential energy field (like water flows are driven by the gravity field), and model the spatio-temporal dynamic process of the potential energy field as a differential equation network. STDEN absorbs both the performance advantage of data-driven models and the interpretability of physics-based models, so is named a physics-guided prediction model. Experiments on three real-world traffic datasets in Beijing show that our model outperforms state-of-the-art baselines by a significant margin. A case study further verifies that STDEN can capture the mechanism of urban traffic and generate accurate predictions with physical meaning. The proposed framework of differential equation network modeling may also cast light on other similar applications.  ( 3 min )
    Differentially Private Generalized Linear Models Revisited
    arXiv:2205.03014v2 Announce Type: replace Abstract: We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.  ( 2 min )
    Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation
    arXiv:2403.03949v1 Announce Type: cross Abstract: Imitation learning methods need significant human supervision to learn policies robust to changes in object poses, physical disturbances, and visual distractors. Reinforcement learning, on the other hand, can explore the environment autonomously to learn robust behaviors but may require impractical amounts of unsafe real-world data collection. To learn performant, robust policies without the burden of unsafe real-world data collection or extensive human supervision, we propose RialTo, a system for robustifying real-world imitation learning policies via reinforcement learning in "digital twin" simulation environments constructed on the fly from small amounts of real-world data. To enable this real-to-sim-to-real pipeline, RialTo proposes an easy-to-use interface for quickly scanning and constructing digital twins of real-world environments. We also introduce a novel "inverse distillation" procedure for bringing real-world demonstrations into simulated environments for efficient fine-tuning, with minimal human intervention and engineering required. We evaluate RialTo across a variety of robotic manipulation problems in the real world, such as robustly stacking dishes on a rack, placing books on a shelf, and six other tasks. RialTo increases (over 67%) in policy robustness without requiring extensive human data collection. Project website and videos at https://real-to-sim-to-real.github.io/RialTo/  ( 2 min )
    3D Diffusion Policy
    arXiv:2403.03954v1 Announce Type: cross Abstract: Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizablely usually consumes large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the utilization of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 55.3% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations of each task, and shows excellent generalization abilities in diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do, necessitating human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available on https://3d-diffusion-policy.github.io .  ( 2 min )
    Black-Box $k$-to-$1$-PCA Reductions: Theory and Applications
    arXiv:2403.03905v1 Announce Type: cross Abstract: The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have implicit access to via samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work both in sample complexity and approximation quality.  ( 2 min )
    The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models
    arXiv:2403.03942v1 Announce Type: cross Abstract: Prior work has found that pretrained language models (LMs) fine-tuned with different random seeds can achieve similar in-domain performance but generalize differently on tests of syntactic generalization. In this work, we show that, even within a single model, we can find multiple subnetworks that perform similarly in-domain, but generalize vastly differently. To better understand these phenomena, we investigate if they can be understood in terms of "competing subnetworks": the model initially represents a variety of distinct algorithms, corresponding to different subnetworks, and generalization occurs when it ultimately converges to one. This explanation has been used to account for generalization in simple algorithmic tasks. Instead of finding competing subnetworks, we find that all subnetworks -- whether they generalize or not -- share a set of attention heads, which we refer to as the heuristic core. Further analysis suggests that these attention heads emerge early in training and compute shallow, non-generalizing features. The model learns to generalize by incorporating additional attention heads, which depend on the outputs of the "heuristic" heads to compute higher-level features. Overall, our results offer a more detailed picture of the mechanisms for syntactic generalization in pretrained LMs.  ( 2 min )
    Joint multi-task learning improves weakly-supervised biomarker prediction in computational pathology
    arXiv:2403.03891v1 Announce Type: cross Abstract: Deep Learning (DL) can predict biomarkers directly from digitized cancer histology in a weakly-supervised setting. Recently, the prediction of continuous biomarkers through regression-based DL has seen an increasing interest. Nonetheless, clinical decision making often requires a categorical outcome. Consequently, we developed a weakly-supervised joint multi-task Transformer architecture which has been trained and evaluated on four public patient cohorts for the prediction of two key predictive biomarkers, microsatellite instability (MSI) and homologous recombination deficiency (HRD), trained with auxiliary regression tasks related to the tumor microenvironment. Moreover, we perform a comprehensive benchmark of 16 approaches of task balancing for weakly-supervised joint multi-task learning in computational pathology. Using our novel approach, we improve over the state-of-the-art area under the receiver operating characteristic by +7.7% and +4.1%, as well as yielding better clustering of latent embeddings by +8% and +5% for the prediction of MSI and HRD in external cohorts, respectively.  ( 2 min )
    Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
    arXiv:2403.03890v1 Announce Type: cross Abstract: This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle both long-horizon task planning while generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the end-effector pose and joint position trajectories, and distill the accurate but kinematics-unaware end-effector pose diffuser to the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically, we show that HDP achieves a significantly higher success rate than the state-of-the-art methods in both simulation and real-world.  ( 2 min )
    Learning to Decode Collaboratively with Multiple Language Models
    arXiv:2403.03870v1 Announce Type: cross Abstract: We propose a method to teach multiple large language models (LLM) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the marginal likelihood of a training set under our latent variable model, the base LLM automatically learns when to generate itself and when to call on one of the ``assistant'' language models to generate, all without direct supervision. Token-level collaboration during decoding allows for a fusion of each model's expertise in a manner tailored to the specific task at hand. Our collaborative decoding is especially useful in cross-domain settings where a generalist base LLM learns to invoke domain expert models. On instruction-following, domain-specific QA, and reasoning tasks, we show that the performance of the joint system exceeds that of the individual models. Through qualitative analysis of the learned latent decisions, we show models trained with our method exhibit several interesting collaboration patterns, e.g., template-filling. Our code is available at https://github.com/clinicalml/co-llm.  ( 2 min )
    Conformal prediction for multi-dimensional time series by ellipsoidal sets
    arXiv:2403.03850v1 Announce Type: cross Abstract: Conformal prediction (CP) has been a popular method for uncertainty quantification because it is distribution-free, model-agnostic, and theoretically sound. For forecasting problems in supervised learning, most CP methods focus on building prediction intervals for univariate responses. In this work, we develop a sequential CP method called $\texttt{MultiDimSPCI}$ that builds prediction regions for a multivariate response, especially in the context of multivariate time series, which are not exchangeable. Theoretically, we estimate finite-sample high-probability bounds on the conditional coverage gap. Empirically, we demonstrate that $\texttt{MultiDimSPCI}$ maintains valid coverage on a wide range of multivariate time series while producing smaller prediction regions than CP and non-CP baselines.  ( 2 min )
    MedMamba: Vision Mamba for Medical Image Classification
    arXiv:2403.03849v1 Announce Type: cross Abstract: Medical image classification is a very fundamental and crucial task in the field of computer vision. These years, CNN-based and Transformer-based models are widely used in classifying various medical images. Unfortunately, The limitation of CNNs in long-range modeling capabilities prevent them from effectively extracting fine-grained features in medical images , while Transformers are hampered by their quadratic computational complexity. Recent research has shown that the state space model (SSM) represented by Mamba can efficiently model long-range interactions while maintaining linear computational complexity. Inspired by this, we propose Vision Mamba for medical image classification (MedMamba). More specifically, we introduce a novel Conv-SSM module, which combines the local feature extraction ability of convolutional layers with the ability of SSM to capture long-range dependency. To demonstrate the potential of MedMamba, we conduct extensive experiments using three publicly available medical datasets with different imaging techniques (i.e., Kvasir (endoscopic images), FETAL_PLANES_DB (ultrasound images) and Covid19-Pneumonia-Normal Chest X-Ray (X-ray images)) and two private datasets built by ourselves. Experimental results show that the proposed MedMamba performs well in detecting lesions in various medical images. To the best of our knowledge, this is the first Vision Mamba tailored for medical image classification. The purpose of this work is to establish a new baseline for medical image classification tasks and provide valuable insights for the future development of more efficient and effective SSM-based artificial intelligence algorithms and application systems in the medical. Source code has been available at https://github.com/YubiaoYue/MedMamba.  ( 3 min )
    Dexterous Legged Locomotion in Confined 3D Spaces with Reinforcement Learning
    arXiv:2403.03848v1 Announce Type: cross Abstract: Recent advances of locomotion controllers utilizing deep reinforcement learning (RL) have yielded impressive results in terms of achieving rapid and robust locomotion across challenging terrain, such as rugged rocks, non-rigid ground, and slippery surfaces. However, while these controllers primarily address challenges underneath the robot, relatively little research has investigated legged mobility through confined 3D spaces, such as narrow tunnels or irregular voids, which impose all-around constraints. The cyclic gait patterns resulted from existing RL-based methods to learn parameterized locomotion skills characterized by motion parameters, such as velocity and body height, may not be adequate to navigate robots through challenging confined 3D spaces, requiring both agile 3D obstacle avoidance and robust legged locomotion. Instead, we propose to learn locomotion skills end-to-end from goal-oriented navigation in confined 3D spaces. To address the inefficiency of tracking distant navigation goals, we introduce a hierarchical locomotion controller that combines a classical planner tasked with planning waypoints to reach a faraway global goal location, and an RL-based policy trained to follow these waypoints by generating low-level motion commands. This approach allows the policy to explore its own locomotion skills within the entire solution space and facilitates smooth transitions between local goals, enabling long-term navigation towards distant goals. In simulation, our hierarchical approach succeeds at navigating through demanding confined 3D environments, outperforming both pure end-to-end learning approaches and parameterized locomotion skills. We further demonstrate the successful real-world deployment of our simulation-trained controller on a real robot.  ( 3 min )
    Targeted Variance Reduction: Robust Bayesian Optimization of Black-Box Simulators with Noise Parameters
    arXiv:2403.03816v1 Announce Type: cross Abstract: The optimization of a black-box simulator over control parameters $\mathbf{x}$ arises in a myriad of scientific applications. In such applications, the simulator often takes the form $f(\mathbf{x},\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are parameters that are uncertain in practice. Robust optimization aims to optimize the objective $\mathbb{E}[f(\mathbf{x},\boldsymbol{\Theta})]$, where $\boldsymbol{\Theta} \sim \mathcal{P}$ is a random variable that models uncertainty on $\boldsymbol{\theta}$. For this, existing black-box methods typically employ a two-stage approach for selecting the next point $(\mathbf{x},\boldsymbol{\theta})$, where $\mathbf{x}$ and $\boldsymbol{\theta}$ are optimized separately via different acquisition functions. As such, these approaches do not employ a joint acquisition over $(\mathbf{x},\boldsymbol{\theta})$, and thus may fail to fully exploit control-to-noise interactions for effective robust optimization. To address this, we propose a new Bayesian optimization method called Targeted Variance Reduction (TVR). The TVR leverages a novel joint acquisition function over $(\mathbf{x},\boldsymbol{\theta})$, which targets variance reduction on the objective within the desired region of improvement. Under a Gaussian process surrogate on $f$, the TVR acquisition can be evaluated in closed form, and reveals an insightful exploration-exploitation-precision trade-off for robust black-box optimization. The TVR can further accommodate a broad class of non-Gaussian distributions on $\mathcal{P}$ via a careful integration of normalizing flows. We demonstrate the improved performance of TVR over the state-of-the-art in a suite of numerical experiments and an application to the robust design of automobile brake discs under operational uncertainty.  ( 3 min )
    Incentivized Learning in Principal-Agent Bandit Games
    arXiv:2403.03811v1 Announce Type: cross Abstract: This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent. The principal and the agent have misaligned objectives and the choice of action is only left to the agent. However, the principal can influence the agent's decisions by offering incentives which add up to his rewards. The principal aims to iteratively learn an incentive policy to maximize her own total utility. This framework extends usual bandit problems and is motivated by several practical applications, such as healthcare or ecological taxation, where traditionally used mechanism design theories often overlook the learning aspect of the problem. We present nearly optimal (with respect to a horizon $T$) learning algorithms for the principal's regret in both multi-armed and linear contextual settings. Finally, we support our theoretical guarantees through numerical experiments.  ( 2 min )
    Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    arXiv:2403.03792v1 Announce Type: cross Abstract: We introduce a new family of prompt injection attacks, termed Neural Exec. Unlike known attacks that rely on handcrafted strings (e.g., "Ignore previous instructions and..."), we show that it is possible to conceptualize the creation of execution triggers as a differentiable search problem and use learning-based methods to autonomously generate them. Our results demonstrate that a motivated adversary can forge triggers that are not only drastically more effective than current handcrafted ones but also exhibit inherent flexibility in shape, properties, and functionality. In this direction, we show that an attacker can design and generate Neural Execs capable of persisting through multi-stage preprocessing pipelines, such as in the case of Retrieval-Augmented Generation (RAG)-based applications. More critically, our findings show that attackers can produce triggers that deviate markedly in form and shape from any known attack, sidestepping existing blacklist-based detection and sanitation approaches.  ( 2 min )
    A machine learning workflow to address credit default prediction
    arXiv:2403.03785v1 Announce Type: cross Abstract: Due to the recent increase in interest in Financial Technology (FinTech), applications like credit default prediction (CDP) are gaining significant industrial and academic attention. In this regard, CDP plays a crucial role in assessing the creditworthiness of individuals and businesses, enabling lenders to make informed decisions regarding loan approvals and risk management. In this paper, we propose a workflow-based approach to improve CDP, which refers to the task of assessing the probability that a borrower will default on his or her credit obligations. The workflow consists of multiple steps, each designed to leverage the strengths of different techniques featured in machine learning pipelines and, thus best solve the CDP task. We employ a comprehensive and systematic approach starting with data preprocessing using Weight of Evidence encoding, a technique that ensures in a single-shot data scaling by removing outliers, handling missing values, and making data uniform for models working with different data types. Next, we train several families of learning models, introducing ensemble techniques to build more robust models and hyperparameter optimization via multi-objective genetic algorithms to consider both predictive accuracy and financial aspects. Our research aims at contributing to the FinTech industry in providing a tool to move toward more accurate and reliable credit risk assessment, benefiting both lenders and borrowers.  ( 2 min )
    Neural Architecture Search using Particle Swarm and Ant Colony Optimization
    arXiv:2403.03781v1 Announce Type: cross Abstract: Neural network models have a number of hyperparameters that must be chosen along with their architecture. This can be a heavy burden on a novice user, choosing which architecture and what values to assign to parameters. In most cases, default hyperparameters and architectures are used. Significant improvements to model accuracy can be achieved through the evaluation of multiple architectures. A process known as Neural Architecture Search (NAS) may be applied to automatically evaluate a large number of such architectures. A system integrating open source tools for Neural Architecture Search (OpenNAS), in the classification of images, has been developed as part of this research. OpenNAS takes any dataset of grayscale, or RBG images, and generates Convolutional Neural Network (CNN) architectures based on a range of metaheuristics using either an AutoKeras, a transfer learning or a Swarm Intelligence (SI) approach. Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) are used as the SI algorithms. Furthermore, models developed through such metaheuristics may be combined using stacking ensembles. In the context of this paper, we focus on training and optimizing CNNs using the Swarm Intelligence (SI) components of OpenNAS. Two major types of SI algorithms, namely PSO and ACO, are compared to see which is more effective in generating higher model accuracies. It is shown, with our experimental design, that the PSO algorithm performs better than ACO. The performance improvement of PSO is most notable with a more complex dataset. As a baseline, the performance of fine-tuned pre-trained models is also evaluated.  ( 3 min )
    DeepCRE: Revolutionizing Drug R&D with Cutting-Edge Computational Models
    arXiv:2403.03768v1 Announce Type: cross Abstract: The field of pharmaceutical development and therapeutic application both face substantial challenges. Therapeutic domain calls for more treatment alternatives while numerous promising pre-clinical drugs fail in clinical trails. One of the reasons is the inadequacy of Cross-drug Response Evaluation (CRE) during the late stage of drug development. Although in-silico CRE models offer a solution to this problem, existing methodologies are either limited to early development stages or lack the capacity for a comprehensive CRE analysis. Herein, we introduce a novel computational model named DeepCRE and present the potential of DeepCRE in advancing therapeutic discovery and development. DeepCRE outperforms the existing best models by achieving an average performance improvement of 17.7\% in patient-level CRE, and a 5-fold increase in indication-level CRE. Furthermore, DeepCRE has identified six drug candidates that show significantly greater effectiveness than a comparator set of two approved drug in 5/8 colorectal cancer (CRC) organoids. This highlights DeepCRE's ability to identify a collection of drug candidates with superior therapeutic effects, underscoring its potential to revolutionize the field of therapeutic development.  ( 2 min )
    Joint Sparsity Pattern Learning Based Channel Estimation for Massive MIMO-OTFS Systems
    arXiv:2403.03771v1 Announce Type: cross Abstract: We propose a channel estimation scheme based on joint sparsity pattern learning (JSPL) for massive multi-input multi-output (MIMO) orthogonal time-frequency-space (OTFS) modulation aided systems. By exploiting the potential joint sparsity of the delay-Doppler-angle (DDA) domain channel, the channel estimation problem is transformed into a sparse recovery problem. To solve it, we first apply the spike and slab prior model to iteratively estimate the support set of the channel matrix, and a higher-accuracy parameter update rule relying on the identified support set is introduced into the iteration. Then the specific values of the channel elements corresponding to the support set are estimated by the orthogonal matching pursuit (OMP) method. Both our simulation results and analysis demonstrate that the proposed JSPL channel estimation scheme achieves an improved performance over the representative state-of-the-art baseline schemes, despite its reduced pilot overhead.  ( 2 min )
    Parameterized quantum comb and simpler circuits for reversing unknown qubit-unitary operations
    arXiv:2403.03761v1 Announce Type: cross Abstract: Quantum comb is an essential tool for characterizing complex quantum protocols in quantum information processing. In this work, we introduce PQComb, a framework leveraging parameterized quantum circuits to explore the capabilities of quantum combs for general quantum process transformation tasks and beyond. By optimizing PQComb for time-reversal simulations of unknown unitary evolutions, we develop a simpler protocol for unknown qubit unitary inversion that reduces the ancilla qubit overhead from 6 to 3 compared to the existing method in [Yoshida, Soeda, Murao, PRL 131, 120602, 2023]. This demonstrates the utility of quantum comb structures and showcases PQComb's potential for solving complex quantum tasks. Our results pave the way for broader PQComb applications in quantum computing and quantum information, emphasizing its versatility for tackling diverse problems in quantum machine learning.  ( 2 min )
    Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer
    arXiv:2403.03736v1 Announce Type: cross Abstract: Recent progress in generative compression technology has significantly improved the perceptual quality of compressed data. However, these advancements primarily focus on producing high-frequency details, often overlooking the ability of generative models to capture the prior distribution of image content, thus impeding further bitrate reduction in extreme compression scenarios (<0.05 bpp). Motivated by the capabilities of predictive language models for lossless compression, this paper introduces a novel Unified Image Generation-Compression (UIGC) paradigm, merging the processes of generation and compression. A key feature of the UIGC framework is the adoption of vector-quantized (VQ) image models for tokenization, alongside a multi-stage transformer designed to exploit spatial contextual information for modeling the prior distribution. As such, the dual-purpose framework effectively utilizes the learned prior for entropy estimation and assists in the regeneration of lost tokens. Extensive experiments demonstrate the superiority of the proposed UIGC framework over existing codecs in perceptual quality and human perception, particularly in ultra-low bitrate scenarios (<=0.03 bpp), pioneering a new direction in generative compression.  ( 2 min )
    Learning 3D object-centric representation through prediction
    arXiv:2403.03730v1 Announce Type: cross Abstract: As part of human core knowledge, the representation of objects is the building block of mental representation that supports high-level concepts and symbolic reasoning. While humans develop the ability of perceiving objects situated in 3D environments without supervision, models that learn the same set of abilities with similar constraints faced by human infants are lacking. Towards this end, we developed a novel network architecture that simultaneously learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth, all while using only information directly available to the brain as training data, namely: sequences of images and self-motion. The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes. This results in object representations being learned as an essential byproduct of learning to predict.  ( 2 min )
    Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies
    arXiv:2403.03699v1 Announce Type: cross Abstract: Neural networks have become a cornerstone of machine learning. As the trend for these to get more and more complex continues, so does the underlying hardware and software infrastructure for training and deployment. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by looking at how neural networks can be parallelised and expressing these as operator graphs while exploring the available dimensions. The dimensions along which neural networks can be parallelised are intra-operator and inter-operator. We answer the second question by collecting and listing both implementation challenges for the types of parallelism, as well as the problem of optimally partitioning the operator graph. We answer the last question by collecting and listing how parallelism is applied in modern multi-billion parameter transformer networks, to the extend that this is possible with the limited information shared about these networks.  ( 2 min )
    Online model error correction with neural networks: application to the Integrated Forecasting System
    arXiv:2403.03702v1 Announce Type: cross Abstract: In recent years, there has been significant progress in the development of fully data-driven global numerical weather prediction models. These machine learning weather prediction models have their strength, notably accuracy and low computational requirements, but also their weakness: they struggle to represent fundamental dynamical balances, and they are far from being suitable for data assimilation experiments. Hybrid modelling emerges as a promising approach to address these limitations. Hybrid models integrate a physics-based core component with a statistical component, typically a neural network, to enhance prediction capabilities. In this article, we propose to develop a model error correction for the operational Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts using a neural network. The neural network is initially pre-trained offline using a large dataset of operational analyses and analysis increments. Subsequently, the trained network is integrated into the IFS within the Object-Oriented Prediction System (OOPS) so as to be used in data assimilation and forecast experiments. It is then further trained online using a recently developed variant of weak-constraint 4D-Var. The results show that the pre-trained neural network already provides a reliable model error correction, which translates into reduced forecast errors in many conditions and that the online training further improves the accuracy of the hybrid model in many conditions.  ( 2 min )
    Spectral Phase Transition and Optimal PCA in Block-Structured Spiked models
    arXiv:2403.03695v1 Announce Type: cross Abstract: We discuss the inhomogeneous spiked Wigner model, a theoretical framework recently introduced to study structured noise in various learning scenarios, through the prism of random matrix theory, with a specific focus on its spectral properties. Our primary objective is to find an optimal spectral method and to extend the celebrated \cite{BBP} (BBP) phase transition criterion -- well-known in the homogeneous case -- to our inhomogeneous, block-structured, Wigner model. We provide a thorough rigorous analysis of a transformed matrix and show that the transition for the appearance of 1) an outlier outside the bulk of the limiting spectral distribution and 2) a positive overlap between the associated eigenvector and the signal, occurs precisely at the optimal threshold, making the proposed spectral method optimal within the class of iterative methods for the inhomogeneous Wigner problem.  ( 2 min )
    Spectral Algorithms on Manifolds through Diffusion
    arXiv:2403.03669v1 Announce Type: cross Abstract: The existing research on spectral algorithms, applied within a Reproducing Kernel Hilbert Space (RKHS), has primarily focused on general kernel functions, often neglecting the inherent structure of the input feature space. Our paper introduces a new perspective, asserting that input data are situated within a low-dimensional manifold embedded in a higher-dimensional Euclidean space. We study the convergence performance of spectral algorithms in the RKHSs, specifically those generated by the heat kernels, known as diffusion spaces. Incorporating the manifold structure of the input, we employ integral operator techniques to derive tight convergence upper bounds concerning generalized norms, which indicates that the estimators converge to the target function in strong sense, entailing the simultaneous convergence of the function itself and its derivatives. These bounds offer two significant advantages: firstly, they are exclusively contingent on the intrinsic dimension of the input manifolds, thereby providing a more focused analysis. Secondly, they enable the efficient derivation of convergence rates for derivatives of any k-th order, all of which can be accomplished within the ambit of the same spectral algorithms. Furthermore, we establish minimax lower bounds to demonstrate the asymptotic optimality of these conclusions in specific contexts. Our study confirms that the spectral algorithms are practically significant in the broader context of high-dimensional approximation.  ( 2 min )
    Generative Active Learning with Variational Autoencoder for Radiology Data Generation in Veterinary Medicine
    arXiv:2403.03642v1 Announce Type: cross Abstract: Recently, with increasing interest in pet healthcare, the demand for computer-aided diagnosis (CAD) systems in veterinary medicine has increased. The development of veterinary CAD has stagnated due to a lack of sufficient radiology data. To overcome the challenge, we propose a generative active learning framework based on a variational autoencoder. This approach aims to alleviate the scarcity of reliable data for CAD systems in veterinary medicine. This study utilizes datasets comprising cardiomegaly radiograph data. After removing annotations and standardizing images, we employed a framework for data augmentation, which consists of a data generation phase and a query phase for filtering the generated data. The experimental results revealed that as the data generated through this framework was added to the training data of the generative model, the frechet inception distance consistently decreased from 84.14 to 50.75 on the radiograph. Subsequently, when the generated data were incorporated into the training of the classification model, the false positive of the confusion matrix also improved from 0.16 to 0.66 on the radiograph. The proposed framework has the potential to address the challenges of data scarcity in medical CAD, contributing to its advancement.  ( 2 min )
    GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
    arXiv:2403.03608v1 Announce Type: cross Abstract: Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely takes image semantics into the synthesis process so that both novel view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual rendering. The former is able to observe multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performances. Our experiments not only confirm that GSNeRF performs favorably against prior works on both novel-view image and semantic segmentation synthesis but the effectiveness of our sampling strategy for visual rendering is further verified.  ( 2 min )
    SheetAgent: A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models
    arXiv:2403.03636v1 Announce Type: cross Abstract: Spreadsheet manipulation is widely existing in most daily works and significantly improves working efficiency. Large language model (LLM) has been recently attempted for automatic spreadsheet manipulation but has not yet been investigated in complicated and realistic tasks where reasoning challenges exist (e.g., long horizon manipulation with multi-step reasoning and ambiguous requirements). To bridge the gap with the real-world requirements, we introduce $\textbf{SheetRM}$, a benchmark featuring long-horizon and multi-category tasks with reasoning-dependent manipulation caused by real-life challenges. To mitigate the above challenges, we further propose $\textbf{SheetAgent}$, a novel autonomous agent that utilizes the power of LLMs. SheetAgent consists of three collaborative modules: $\textit{Planner}$, $\textit{Informer}$, and $\textit{Retriever}$, achieving both advanced reasoning and accurate manipulation over spreadsheets without human interaction through iterative task reasoning and reflection. Extensive experiments demonstrate that SheetAgent delivers 20-30% pass rate improvements on multiple benchmarks over baselines, achieving enhanced precision in spreadsheet manipulation and demonstrating superior table reasoning abilities. More details and visualizations are available at https://sheetagent.github.io.  ( 2 min )
    DeepEclipse: How to Break White-Box DNN-Watermarking Schemes
    arXiv:2403.03590v1 Announce Type: cross Abstract: Deep Learning (DL) models have become crucial in digital transformation, thus raising concerns about their intellectual property rights. Different watermarking techniques have been developed to protect Deep Neural Networks (DNNs) from IP infringement, creating a competitive field for DNN watermarking and removal methods. The predominant watermarking schemes use white-box techniques, which involve modifying weights by adding a unique signature to specific DNN layers. On the other hand, existing attacks on white-box watermarking usually require knowledge of the specific deployed watermarking scheme or access to the underlying data for further training and fine-tuning. We propose DeepEclipse, a novel and unified framework designed to remove white-box watermarks. We present obfuscation techniques that significantly differ from the existing white-box watermarking removal schemes. DeepEclipse can evade watermark detection without prior knowledge of the underlying watermarking scheme, additional data, or training and fine-tuning. Our evaluation reveals that DeepEclipse excels in breaking multiple white-box watermarking schemes, reducing watermark detection to random guessing while maintaining a similar model accuracy as the original one. Our framework showcases a promising solution to address the ongoing DNN watermark protection and removal challenges.  ( 2 min )
    Enhancing Price Prediction in Cryptocurrency Using Transformer Neural Network and Technical Indicators
    arXiv:2403.03606v1 Announce Type: cross Abstract: This study presents an innovative approach for predicting cryptocurrency time series, specifically focusing on Bitcoin, Ethereum, and Litecoin. The methodology integrates the use of technical indicators, a Performer neural network, and BiLSTM (Bidirectional Long Short-Term Memory) to capture temporal dynamics and extract significant features from raw cryptocurrency data. The application of technical indicators, such facilitates the extraction of intricate patterns, momentum, volatility, and trends. The Performer neural network, employing Fast Attention Via positive Orthogonal Random features (FAVOR+), has demonstrated superior computational efficiency and scalability compared to the traditional Multi-head attention mechanism in Transformer models. Additionally, the integration of BiLSTM in the feedforward network enhances the model's capacity to capture temporal dynamics in the data, processing it in both forward and backward directions. This is particularly advantageous for time series data where past and future data points can influence the current state. The proposed method has been applied to the hourly and daily timeframes of the major cryptocurrencies and its performance has been benchmarked against other methods documented in the literature. The results underscore the potential of the proposed method to outperform existing models, marking a significant progression in the field of cryptocurrency price prediction.  ( 2 min )
    Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate Choices
    arXiv:2403.03589v1 Announce Type: cross Abstract: This study designs an adaptive experiment for efficiently estimating average treatment effect (ATEs). We consider an adaptive experiment where an experimenter sequentially samples an experimental unit from a covariate density decided by the experimenter and assigns a treatment. After assigning a treatment, the experimenter observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using gathered samples. The objective of the experimenter is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose a framework under which an experimenter optimizes the covariate density, as well as the propensity score, and find that optimizing both covariate density and propensity score reduces the asymptotic variance more than optimizing only the propensity score. Based on this idea, in each round of our experiment, the experimenter optimizes the covariate density and propensity score based on past observations. To design an adaptive experiment, we first derive the efficient covariate density and propensity score that minimizes the semiparametric efficiency bound, a lower bound for the asymptotic variance given a fixed covariate density and a fixed propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound.  ( 3 min )
    Enhancing ASD detection accuracy: a combined approach of machine learning and deep learning models with natural language processing
    arXiv:2403.03581v1 Announce Type: cross Abstract: Purpose: Our study explored the use of artificial intelligence (AI) to diagnose autism spectrum disorder (ASD). It focused on machine learning (ML) and deep learning (DL) to detect ASD from text inputs on social media, addressing challenges in traditional ASD diagnosis. Methods: We used natural language processing (NLP), ML, and DL models (including decision trees, XGB, KNN, RNN, LSTM, Bi-LSTM, BERT, and BERTweet) to analyze 404,627 tweets, classifying them based on ASD or non-ASD authors. A subset of 90,000 tweets was used for model training and testing. Results: Our AI models showed high accuracy, with an 88% success rate in identifying texts from individuals with ASD. Conclusion: The study demonstrates AI's potential in improving ASD diagnosis, especially in children, highlighting the importance of early detection.  ( 2 min )
    Low-Dose CT Image Reconstruction by Fine-Tuning a UNet Pretrained for Gaussian Denoising for the Downstream Task of Image Enhancement
    arXiv:2403.03551v1 Announce Type: cross Abstract: Computed Tomography (CT) is a widely used medical imaging modality, and as it is based on ionizing radiation, it is desirable to minimize the radiation dose. However, a reduced radiation dose comes with reduced image quality, and reconstruction from low-dose CT (LDCT) data is still a challenging task which is subject to research. According to the LoDoPaB-CT benchmark, a benchmark for LDCT reconstruction, many state-of-the-art methods use pipelines involving UNet-type architectures. Specifically the top ranking method, ItNet, employs a three-stage process involving filtered backprojection (FBP), a UNet trained on CT data, and an iterative refinement step. In this paper, we propose a less complex two-stage method. The first stage also employs FBP, while the novelty lies in the training strategy for the second stage, characterized as the CT image enhancement stage. The crucial point of our approach is that the neural network is pretrained on a distinctly different pretraining task with non-CT data, namely Gaussian noise removal on a variety of natural grayscale images (photographs). We then fine-tune this network for the downstream task of CT image enhancement using pairs of LDCT images and corresponding normal-dose CT images (NDCT). Despite being notably simpler than the state-of-the-art, as the pretraining did not depend on domain-specific CT data and no further iterative refinement step was necessary, the proposed two-stage method achieves competitive results. The proposed method achieves a shared top ranking in the LoDoPaB-CT challenge and a first position with respect to the SSIM metric.  ( 3 min )
    Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning
    arXiv:2403.03552v1 Announce Type: cross Abstract: Mean Field Games (MFGs) have the ability to handle large-scale multi-agent systems, but learning Nash equilibria in MFGs remains a challenging task. In this paper, we propose a deep reinforcement learning (DRL) algorithm that achieves population-dependent Nash equilibrium without the need for averaging or sampling from history, inspired by Munchausen RL and Online Mirror Descent. Through the design of an additional inner-loop replay buffer, the agents can effectively learn to achieve Nash equilibrium from any distribution, mitigating catastrophic forgetting. The resulting policy can be applied to various initial distributions. Numerical experiments on four canonical examples demonstrate our algorithm has better convergence properties than SOTA algorithms, in particular a DRL version of Fictitious Play for population-dependent policies.  ( 2 min )
    Gadolinium dose reduction for brain MRI using conditional deep learning
    arXiv:2403.03539v1 Announce Type: cross Abstract: Recently, deep learning (DL)-based methods have been proposed for the computational reduction of gadolinium-based contrast agents (GBCAs) to mitigate adverse side effects while preserving diagnostic value. Currently, the two main challenges for these approaches are the accurate prediction of contrast enhancement and the synthesis of realistic images. In this work, we address both challenges by utilizing the contrast signal encoded in the subtraction images of pre-contrast and post-contrast image pairs. To avoid the synthesis of any noise or artifacts and solely focus on contrast signal extraction and enhancement from low-dose subtraction images, we train our DL model using noise-free standard-dose subtraction images as targets. As a result, our model predicts the contrast enhancement signal only; thereby enabling synthesization of images beyond the standard dose. Furthermore, we adapt the embedding idea of recent diffusion-based models to condition our model on physical parameters affecting the contrast enhancement behavior. We demonstrate the effectiveness of our approach on synthetic and real datasets using various scanners, field strengths, and contrast agents.  ( 2 min )
    Diffusion-based Generative Prior for Low-Complexity MIMO Channel Estimation
    arXiv:2403.03545v1 Announce Type: cross Abstract: This work proposes a novel channel estimator based on diffusion models (DMs), one of the currently top-rated generative models. Contrary to related works utilizing generative priors, a lightweight convolutional neural network (CNN) with positional embedding of the signal-to-noise ratio (SNR) information is designed by learning the channel distribution in the sparse angular domain. Combined with an estimation strategy that avoids stochastic resampling and truncates reverse diffusion steps that account for lower SNR than the given pilot observation, the resulting DM estimator has both low complexity and memory overhead. Numerical results exhibit better performance than state-of-the-art channel estimators utilizing generative priors.  ( 2 min )
    BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving
    arXiv:2403.03401v1 Announce Type: cross Abstract: Artificial Intelligence for Theorem Proving has given rise to a plethora of benchmarks and methodologies, particularly in Interactive Theorem Proving (ITP). Research in the area is fragmented, with a diverse set of approaches being spread across several ITP systems. This presents a significant challenge to the comparison of methods, which are often complex and difficult to replicate. Addressing this, we present BAIT, a framework for fair and streamlined comparison of learning approaches in ITP. We demonstrate BAIT's capabilities with an in-depth comparison, across several ITP benchmarks, of state-of-the-art architectures applied to the problem of formula embedding. We find that Structure Aware Transformers perform particularly well, improving on techniques associated with the original problem sets. BAIT also allows us to assess the end-to-end proving performance of systems built on interactive environments. This unified perspective reveals a novel end-to-end system that improves on prior work. We also provide a qualitative analysis, illustrating that improved performance is associated with more semantically-aware embeddings. By streamlining the implementation and comparison of Machine Learning algorithms in the ITP context, we anticipate BAIT will be a springboard for future research.  ( 2 min )
    Performance Evaluation of Semi-supervised Learning Frameworks for Multi-Class Weed Detection
    arXiv:2403.03390v1 Announce Type: cross Abstract: Effective weed control plays a crucial role in optimizing crop yield and enhancing agricultural product quality. However, the reliance on herbicide application not only poses a critical threat to the environment but also promotes the emergence of resistant weeds. Fortunately, recent advances in precision weed management enabled by ML and DL provide a sustainable alternative. Despite great progress, existing algorithms are mainly developed based on supervised learning approaches, which typically demand large-scale datasets with manual-labeled annotations, which is time-consuming and labor-intensive. As such, label-efficient learning methods, especially semi-supervised learning, have gained increased attention in the broader domain of computer vision and have demonstrated promising performance. These methods aim to utilize a small number of labeled data samples along with a great number of unlabeled samples to develop high-performing models comparable to the supervised learning counterpart trained on a large amount of labeled data samples. In this study, we assess the effectiveness of a semi-supervised learning framework for multi-class weed detection, employing two well-known object detection frameworks, namely FCOS and Faster-RCNN. Specifically, we evaluate a generalized student-teacher framework with an improved pseudo-label generation module to produce reliable pseudo-labels for the unlabeled data. To enhance generalization, an ensemble student network is employed to facilitate the training process. Experimental results show that the proposed approach is able to achieve approximately 76\% and 96\% detection accuracy as the supervised methods with only 10\% of labeled data in CottenWeedDet3 and CottonWeedDet12, respectively. We offer access to the source code, contributing a valuable resource for ongoing semi-supervised learning research in weed detection and beyond.  ( 3 min )
    Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems
    arXiv:2403.03361v1 Announce Type: cross Abstract: This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis from [Dong and Van Roy, 2020], where they proved a bound with regret rate of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds that depend on the metric entropy of the action space for a variant of Thompson-Sampling. Under suitable continuity assumption of the rewards, our bound offers a tight rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.  ( 2 min )
    Deep Configuration Performance Learning: A Systematic Survey and Taxonomy
    arXiv:2403.03322v1 Announce Type: cross Abstract: Performance is arguably the most crucial attribute that reflects the behavior of a configurable software system. However, given the increasing scale and complexity of modern software, modeling and predicting how various configurations can impact performance becomes one of the major challenges in software maintenance. As such, performance is often modeled without having a thorough knowledge of the software system, but relying mainly on data, which fits precisely with the purpose of deep learning. In this paper, we conduct a comprehensive review exclusively on the topic of deep learning for performance learning of configurable software, covering 948 searched papers spanning six indexing services, based on which 85 primary papers were extracted and analyzed. Our results summarize the key topics and statistics on how the configuration data is prepared; how the deep configuration performance learning model is built; how the model is evaluated and how they are exploited in different tasks related to software configuration. We also identify the good practice and the potentially problematic phenomena from the studies surveyed, together with insights on future opportunities for the field. To promote open science, all the raw results of this survey can be accessed at our repository: https://github.com/ideas-labo/DCPL-SLR.  ( 2 min )
    Proper vs Improper Quantum PAC learning
    arXiv:2403.03295v1 Announce Type: cross Abstract: A basic question in the PAC model of learning is whether proper learning is harder than improper learning. In the classical case, there are examples of concept classes with VC dimension $d$ that have sample complexity $\Omega\left(\frac d\epsilon\log\frac1\epsilon\right)$ for proper learning with error $\epsilon$, while the complexity for improper learning is O$\!\left(\frac d\epsilon\right)$. One such example arises from the Coupon Collector problem. Motivated by the efficiency of proper versus improper learning with quantum samples, Arunachalam, Belovs, Childs, Kothari, Rosmanis, and de Wolf (TQC 2020) studied an analogue, the Quantum Coupon Collector problem. Curiously, they discovered that for learning size $k$ subsets of $[n]$ the problem has sample complexity $\Theta(k\log\min\{k,n-k+1\})$, in contrast with the complexity of $\Theta(k\log k)$ for Coupon Collector. This effectively negates the possibility of a separation between the two modes of learning via the quantum problem, and Arunachalam et al.\ posed the possibility of such a separation as an open question. In this work, we first present an algorithm for the Quantum Coupon Collector problem with sample complexity that matches the sharper lower bound of $(1-o_k(1))k\ln\min\{k,n-k+1\}$ shown recently by Bab Hadiashar, Nayak, and Sinha (IEEE TIT 2024), for the entire range of the parameter $k$. Next, we devise a variant of the problem, the Quantum Padded Coupon Collector. We prove that its sample complexity matches that of the classical Coupon Collector problem for both modes of learning, thereby exhibiting the same asymptotic separation between proper and improper quantum learning as mentioned above. The techniques we develop in the process can be directly applied to any form of padded quantum data. We hope that padding can more generally lift other forms of classical learning behaviour to the quantum setting.  ( 3 min )
    Mad Libs Are All You Need: Augmenting Cross-Domain Document-Level Event Argument Data
    arXiv:2403.03304v1 Announce Type: cross Abstract: Document-Level Event Argument Extraction (DocEAE) is an extremely difficult information extraction problem -- with significant limitations in low-resource cross-domain settings. To address this problem, we introduce Mad Lib Aug (MLA), a novel generative DocEAE data augmentation framework. Our approach leverages the intuition that Mad Libs, which are categorically masked documents used as a part of a popular game, can be generated and solved by LLMs to produce data for DocEAE. Using MLA, we achieve a 2.6-point average improvement in overall F1 score. Moreover, this approach achieves a 3.9 and 5.2 point average increase in zero and few-shot event roles compared to augmentation-free baselines across all experiments. To better facilitate analysis of cross-domain DocEAE, we additionally introduce a new metric, Role-Depth F1 (RDF1), which uses statistical depth to identify roles in the target domain which are semantic outliers with respect to roles observed in the source domain. Our experiments show that MLA augmentation can boost RDF1 performance by an average of 5.85 points compared to non-augmented datasets.  ( 2 min )
    ARNN: Attentive Recurrent Neural Network for Multi-channel EEG Signals to Identify Epileptic Seizures
    arXiv:2403.03276v1 Announce Type: cross Abstract: We proposed an Attentive Recurrent Neural Network (ARNN), which recurrently applies attention layers along a sequence and has linear complexity with respect to the sequence length. The proposed model operates on multi-channel EEG signals rather than single channel signals and leverages parallel computation. In this cell, the attention layer is a computational unit that efficiently applies self-attention and cross-attention mechanisms to compute a recurrent function over a wide number of state vectors and input signals. Our architecture is inspired in part by the attention layer and long short-term memory (LSTM) cells, and it uses long-short style gates, but it scales this typical cell up by several orders to parallelize for multi-channel EEG signals. It inherits the advantages of attention layers and LSTM gate while avoiding their respective drawbacks. We evaluated the model effectiveness through extensive experiments with heterogeneous datasets, including the CHB-MIT and UPenn and Mayos Clinic, CHB-MIT datasets. The empirical findings suggest that the ARNN model outperforms baseline methods such as LSTM, Vision Transformer (ViT), Compact Convolution Transformer (CCT), and R-Transformer (RT), showcasing superior performance and faster processing capabilities across a wide range of tasks. The code has been made publicly accessible at \url{https://github.com/Salim-Lysiun/ARNN}.  ( 2 min )
    DINOv2 based Self Supervised Learning For Few Shot Medical Image Segmentation
    arXiv:2403.03273v1 Announce Type: cross Abstract: Deep learning models have emerged as the cornerstone of medical image segmentation, but their efficacy hinges on the availability of extensive manually labeled datasets and their adaptability to unforeseen categories remains a challenge. Few-shot segmentation (FSS) offers a promising solution by endowing models with the capacity to learn novel classes from limited labeled examples. A leading method for FSS is ALPNet, which compares features between the query image and the few available support segmented images. A key question about using ALPNet is how to design its features. In this work, we delve into the potential of using features from DINOv2, which is a foundational self-supervised learning model in computer vision. Leveraging the strengths of ALPNet and harnessing the feature extraction capabilities of DINOv2, we present a novel approach to few-shot segmentation that not only enhances performance but also paves the way for more robust and adaptable medical image analysis.  ( 2 min )
    Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
    arXiv:2403.03234v1 Announce Type: cross Abstract: Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance.  ( 2 min )
    From Displacements to Distributions: A Machine-Learning Enabled Framework for Quantifying Uncertainties in Parameters of Computational Models
    arXiv:2403.03233v1 Announce Type: cross Abstract: This work presents novel extensions for combining two frameworks for quantifying both aleatoric (i.e., irreducible) and epistemic (i.e., reducible) sources of uncertainties in the modeling of engineered systems. The data-consistent (DC) framework poses an inverse problem and solution for quantifying aleatoric uncertainties in terms of pullback and push-forward measures for a given Quantity of Interest (QoI) map. Unfortunately, a pre-specified QoI map is not always available a priori to the collection of data associated with system outputs. The data themselves are often polluted with measurement errors (i.e., epistemic uncertainties), which complicates the process of specifying a useful QoI. The Learning Uncertain Quantities (LUQ) framework defines a formal three-step machine-learning enabled process for transforming noisy datasets into samples of a learned QoI map to enable DC-based inversion. We develop a robust filtering step in LUQ that can learn the most useful quantitative information present in spatio-temporal datasets. The learned QoI map transforms simulated and observed datasets into distributions to perform DC-based inversion. We also develop a DC-based inversion scheme that iterates over time as new spatial datasets are obtained and utilizes quantitative diagnostics to identify both the quality and impact of inversion at each iteration. Reproducing Kernel Hilbert Space theory is leveraged to mathematically analyze the learned QoI map and develop a quantitative sufficiency test for evaluating the filtered data. An illustrative example is utilized throughout while the final two examples involve the manufacturing of shells of revolution to demonstrate various aspects of the presented frameworks.  ( 3 min )
    Machine and deep learning methods for predicting 3D genome organization
    arXiv:2403.03231v1 Announce Type: cross Abstract: Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles of computational prediction of 3D interactions and suggest future research directions.  ( 2 min )
    Embracing Uncertainty Flexibility: Harnessing a Supervised Tree Kernel to Empower Ensemble Modelling for 2D Echocardiography-Based Prediction of Right Ventricular Volume
    arXiv:2403.03229v1 Announce Type: cross Abstract: The right ventricular (RV) function deterioration strongly predicts clinical outcomes in numerous circumstances. To boost the clinical deployment of ensemble regression methods that quantify RV volumes using tabular data from the widely available two-dimensional echocardiography (2DE), we propose to complement the volume predictions with uncertainty scores. To this end, we employ an instance-based method which uses the learned tree structure to identify the nearest training samples to a target instance and then uses a number of distribution types to more flexibly model the output. The probabilistic and point-prediction performances of the proposed framework are evaluated on a relatively small-scale dataset, comprising 100 end-diastolic and end-systolic RV volumes. The reference values for point performance were obtained from MRI. The results demonstrate that our flexible approach yields improved probabilistic and point performances over other state-of-the-art methods. The appropriateness of the proposed framework is showcased by providing exemplar cases. The estimated uncertainty embodies both aleatoric and epistemic types. This work aligns with trustworthy artificial intelligence since it can be used to enhance the decision-making process and reduce risks. The feature importance scores of our framework can be exploited to reduce the number of required 2DE views which could enhance the proposed pipeline's clinical application.  ( 3 min )
    Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing
    arXiv:2402.16627v2 Announce Type: cross Abstract: Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at https://github.com/YangLing0818/ContextDiff  ( 2 min )
    Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
    arXiv:2403.03950v1 Announce Type: new Abstract: Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.  ( 3 min )
    Graph neural network outputs are almost surely asymptotically constant
    arXiv:2403.03880v1 Announce Type: new Abstract: Graph neural networks (GNNs) are the predominant architectures for a variety of learning tasks on graphs. We present a new angle on the expressive power of GNNs by studying how the predictions of a GNN probabilistic classifier evolve as we apply it on larger graphs drawn from some random graph model. We show that the output converges to a constant function, which upper-bounds what these classifiers can express uniformly. This convergence phenomenon applies to a very wide class of GNNs, including state of the art models, with aggregates including mean and the attention-based mechanism of graph transformers. Our results apply to a broad class of random graph models, including the (sparse) Erd\H{o}s-R\'enyi model and the stochastic block model. We empirically validate these findings, observing that the convergence phenomenon already manifests itself on graphs of relatively modest size.  ( 2 min )
    Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
    arXiv:2403.03871v1 Announce Type: new Abstract: Vertical Federated Learning (VFL) is an emergent distributed machine learning paradigm wherein owners of disjoint features of a common set of entities collaborate to learn a global model without sharing data. In VFL, a host client owns data labels for each entity and learns a final representation based on intermediate local representations from all guest clients. Therefore, the host is a single point of failure and label feedback can be used by malicious guest clients to infer private features. Requiring all participants to remain active and trustworthy throughout the entire training process is generally impractical and altogether infeasible outside of controlled environments. We propose Decoupled VFL (DVFL), a blockwise learning approach to VFL. By training each model on its own objective, DVFL allows for decentralized aggregation and isolation between feature learning and label supervision. With these properties, DVFL is fault tolerant and secure. We implement DVFL to train split neural networks and show that model performance is comparable to VFL on a variety of classification datasets.  ( 2 min )
    Public-data Assisted Private Stochastic Optimization: Power and Limitations
    arXiv:2403.03856v1 Announce Type: new Abstract: We study the limits and capability of public-data assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show that any $(\epsilon,\delta)$-PA-DP has excess risk $\tilde{\Omega}\big(\min\big\{\frac{1}{\sqrt{n_{\text{pub}}}},\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon} \big\} \big)$, where $d$ is the dimension, ${n_{\text{pub}}}$ is the number of public samples, ${n_{\text{priv}}}$ is the number of private samples, and $n={n_{\text{pub}}}+{n_{\text{priv}}}$. These lower bounds are established via our new lower bounds for PA-DP mean estimation, which are of a similar form. Up to constant factors, these lower bounds show that the simple strategy of either treating all data as private or discarding the private data, is optimal. We also study PA-DP supervised learning with \textit{unlabeled} public samples. In contrast to our previous result, we here show novel methods for leveraging public data in private supervised learning. For generalized linear models (GLM) with unlabeled public data, we show an efficient algorithm which, given $\tilde{O}({n_{\text{priv}}}\epsilon)$ unlabeled public samples, achieves the dimension independent rate $\tilde{O}\big(\frac{1}{\sqrt{{n_{\text{priv}}}}} + \frac{1}{\sqrt{{n_{\text{priv}}}\epsilon}}\big)$. We develop new lower bounds for this setting which shows that this rate cannot be improved with more public samples, and any fewer public samples leads to a worse rate. Finally, we provide extensions of this result to general hypothesis classes with finite fat-shattering dimension with applications to neural networks and non-Euclidean geometries.  ( 3 min )
    Accelerating Convergence of Score-Based Diffusion Models, Provably
    arXiv:2403.03852v1 Announce Type: new Abstract: Score-based diffusion models, while achieving remarkable empirical performance, often suffer from low sampling speed, due to extensive function evaluations needed during the sampling phase. Despite a flurry of recent activities towards speeding up diffusion generative modeling in practice, theoretical underpinnings for acceleration techniques remain severely limited. In this paper, we design novel training-free algorithms to accelerate popular deterministic (i.e., DDIM) and stochastic (i.e., DDPM) samplers. Our accelerated deterministic sampler converges at a rate $O(1/{T}^2)$ with $T$ the number of steps, improving upon the $O(1/T)$ rate for the DDIM sampler; and our accelerated stochastic sampler converges at a rate $O(1/T)$, outperforming the rate $O(1/\sqrt{T})$ for the DDPM sampler. The design of our algorithms leverages insights from higher-order approximation, and shares similar intuitions as popular high-order ODE solvers like the DPM-Solver-2. Our theory accommodates $\ell_2$-accurate score estimates, and does not require log-concavity or smoothness on the target distribution.  ( 2 min )
    On the Effectiveness of Distillation in Mitigating Backdoors in Pre-trained Encoder
    arXiv:2403.03846v1 Announce Type: new Abstract: In this paper, we study a defense against poisoned encoders in SSL called distillation, which is a defense used in supervised learning originally. Distillation aims to distill knowledge from a given model (a.k.a the teacher net) and transfer it to another (a.k.a the student net). Now, we use it to distill benign knowledge from poisoned pre-trained encoders and transfer it to a new encoder, resulting in a clean pre-trained encoder. In particular, we conduct an empirical study on the effectiveness and performance of distillation against poisoned encoders. Using two state-of-the-art backdoor attacks against pre-trained image encoders and four commonly used image classification datasets, our experimental results show that distillation can reduce attack success rate from 80.87% to 27.51% while suffering a 6.35% loss in accuracy. Moreover, we investigate the impact of three core components of distillation on performance: teacher net, student net, and distillation loss. By comparing 4 different teacher nets, 3 student nets, and 6 distillation losses, we find that fine-tuned teacher nets, warm-up-training-based student nets, and attention-based distillation loss perform best, respectively.  ( 2 min )
    Feature Selection as Deep Sequential Generative Learning
    arXiv:2403.03838v1 Announce Type: new Abstract: Feature selection aims to identify the most pattern-discriminative feature subset. In prior literature, filter (e.g., backward elimination) and embedded (e.g., Lasso) methods have hyperparameters (e.g., top-K, score thresholding) and tie to specific models, thus, hard to generalize; wrapper methods search a feature subset in a huge discrete space and is computationally costly. To transform the way of feature selection, we regard a selected feature subset as a selection decision token sequence and reformulate feature selection as a deep sequential generative learning task that distills feature knowledge and generates decision sequences. Our method includes three steps: (1) We develop a deep variational transformer model over a joint of sequential reconstruction, variational, and performance evaluator losses. Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores. (2) We leverage the trained feature subset utility evaluator as a gradient provider to guide the identification of the optimal feature subset embedding;(3) We decode the optimal feature subset embedding to autoregressively generate the best feature selection decision sequence with autostop. Extensive experimental results show this generative perspective is effective and generic, without large discrete search space and expert-specific hyperparameters.  ( 2 min )
    ProbSAINT: Probabilistic Tabular Regression for Used Car Pricing
    arXiv:2403.03812v1 Announce Type: new Abstract: Used car pricing is a critical aspect of the automotive industry, influenced by many economic factors and market dynamics. With the recent surge in online marketplaces and increased demand for used cars, accurate pricing would benefit both buyers and sellers by ensuring fair transactions. However, the transition towards automated pricing algorithms using machine learning necessitates the comprehension of model uncertainties, specifically the ability to flag predictions that the model is unsure about. Although recent literature proposes the use of boosting algorithms or nearest neighbor-based approaches for swift and precise price predictions, encapsulating model uncertainties with such algorithms presents a complex challenge. We introduce ProbSAINT, a model that offers a principled approach for uncertainty quantification of its price predictions, along with accurate point predictions that are comparable to state-of-the-art boosting techniques. Furthermore, acknowledging that the business prefers pricing used cars based on the number of days the vehicle was listed for sale, we show how ProbSAINT can be used as a dynamic forecasting model for predicting price probabilities for different expected offer duration. Our experiments further indicate that ProbSAINT is especially accurate on instances where it is highly certain. This proves the applicability of its probabilistic predictions in real-world scenarios where trustworthiness is crucial.  ( 2 min )
    Cobweb: An Incremental and Hierarchical Model of Human-Like Category Learning
    arXiv:2403.03835v1 Announce Type: new Abstract: Cobweb, a human like category learning system, differs from other incremental categorization models in constructing hierarchically organized cognitive tree-like structures using the category utility measure. Prior studies have shown that Cobweb can capture psychological effects such as the basic level, typicality, and fan effects. However, a broader evaluation of Cobweb as a model of human categorization remains lacking. The current study addresses this gap. It establishes Cobweb's alignment with classical human category learning effects. It also explores Cobweb's flexibility to exhibit both exemplar and prototype like learning within a single model. These findings set the stage for future research on Cobweb as a comprehensive model of human category learning.  ( 2 min )
    Verified Training for Counterfactual Explanation Robustness under Data Shift
    arXiv:2403.03773v1 Announce Type: new Abstract: Counterfactual explanations (CEs) enhance the interpretability of machine learning models by describing what changes to an input are necessary to change its prediction to a desired class. These explanations are commonly used to guide users' actions, e.g., by describing how a user whose loan application was denied can be approved for a loan in the future. Existing approaches generate CEs by focusing on a single, fixed model, and do not provide any formal guarantees on the CEs' future validity. When models are updated periodically to account for data shift, if the generated CEs are not robust to the shifts, users' actions may no longer have the desired impacts on their predictions. This paper introduces VeriTraCER, an approach that jointly trains a classifier and an explainer to explicitly consider the robustness of the generated CEs to small model shifts. VeriTraCER optimizes over a carefully designed loss function that ensures the verifiable robustness of CEs to local model updates, thus providing deterministic guarantees to CE validity. Our empirical evaluation demonstrates that VeriTraCER generates CEs that (1) are verifiably robust to small model updates and (2) display competitive robustness to state-of-the-art approaches in handling empirical model updates including random initialization, leave-one-out, and distribution shifts.  ( 2 min )
    A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network
    arXiv:2403.03739v1 Announce Type: new Abstract: Binary neural networks utilize 1-bit quantized weights and activations to reduce both the model's storage demands and computational burden. However, advanced binary architectures still incorporate millions of inefficient and nonhardware-friendly full-precision multiplication operations. A&B BNN is proposed to directly remove part of the multiplication operations in a traditional BNN and replace the rest with an equal number of bit operations, introducing the mask layer and the quantized RPReLU structure based on the normalizer-free network architecture. The mask layer can be removed during inference by leveraging the intrinsic characteristics of BNN with straightforward mathematical transformations to avoid the associated multiplication operations. The quantized RPReLU structure enables more efficient bit operations by constraining its slope to be integer powers of 2. Experimental results achieved 92.30%, 69.35%, and 66.89% on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively, which are competitive with the state-of-the-art. Ablation studies have verified the efficacy of the quantized RPReLU structure, leading to a 1.14% enhancement on the ImageNet compared to using a fixed slope RLeakyReLU. The proposed add&bit-operation-only BNN offers an innovative approach for hardware-friendly network architecture.  ( 2 min )
    Probabilistic Topic Modelling with Transformer Representations
    arXiv:2403.03737v1 Announce Type: new Abstract: Topic modelling was mostly dominated by Bayesian graphical models during the last decade. With the rise of transformers in Natural Language Processing, however, several successful models that rely on straightforward clustering approaches in transformer-based embedding spaces have emerged and consolidated the notion of topics as clusters of embedding vectors. We propose the Transformer-Representation Neural Topic Model (TNTM), which combines the benefits of topic representations in transformer-based embedding spaces and probabilistic modelling. Therefore, this approach unifies the powerful and versatile notion of topics based on transformer embeddings with fully probabilistic modelling, as in models such as Latent Dirichlet Allocation (LDA). We utilize the variational autoencoder (VAE) framework for improved inference speed and modelling flexibility. Experimental results show that our proposed model achieves results on par with various state-of-the-art approaches in terms of embedding coherence while maintaining almost perfect topic diversity. The corresponding source code is available at https://github.com/ArikReuter/TNTM.  ( 2 min )
    Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-Training
    arXiv:2403.03728v1 Announce Type: new Abstract: This study addresses the integration of diversity-based and uncertainty-based sampling strategies in active learning, particularly within the context of self-supervised pre-trained models. We introduce a straightforward heuristic called TCM that mitigates the cold start problem while maintaining strong performance across various data levels. By initially applying TypiClust for diversity sampling and subsequently transitioning to uncertainty sampling with Margin, our approach effectively combines the strengths of both strategies. Our experiments demonstrate that TCM consistently outperforms existing methods across various datasets in both low and high data regimes.  ( 2 min )
    Diffusion on language model embeddings for protein sequence generation
    arXiv:2403.03726v1 Announce Type: new Abstract: Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived from the protein language model, ESM-2, to generate amino acid sequences. DiMA surpasses leading solutions, including autoregressive transformer-based and discrete diffusion models, and we quantitatively illustrate the impact of the design choices that lead to its superior performance. We extensively evaluate the quality, diversity, distribution similarity, and biological relevance of the generated sequences using multiple metrics across various modalities. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. This work advances the field of protein design and sets the stage for conditional models by providing a robust framework for scalable and high-quality protein sequence generation.  ( 2 min )
    Learning Adversarial MDPs with Stochastic Hard Constraints
    arXiv:2403.03672v1 Announce Type: new Abstract: We study online learning problems in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints. We consider two different scenarios. In the first one, we address general CMDPs, where we design an algorithm that attains sublinear regret and cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Indeed, previous works either focus on much weaker soft constraints--allowing for positive violation to cancel out negative ones--or are restricted to stochastic losses. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with state-of-the-art algorithms. This enables their adoption in a much wider range of real-world applications, ranging from autonomous driving to online advertising and recommender systems.  ( 2 min )
    Provable Filter for Real-world Graph Clustering
    arXiv:2403.03666v1 Announce Type: new Abstract: Graph clustering, an important unsupervised problem, has been shown to be more resistant to advances in Graph Neural Networks (GNNs). In addition, almost all clustering methods focus on homophilic graphs and ignore heterophily. This significantly limits their applicability in practice, since real-world graphs exhibit a structural disparity and cannot simply be classified as homophily and heterophily. Thus, a principled way to handle practical graphs is urgently needed. To fill this gap, we provide a novel solution with theoretical support. Interestingly, we find that most homophilic and heterophilic edges can be correctly identified on the basis of neighbor information. Motivated by this finding, we construct two graphs that are highly homophilic and heterophilic, respectively. They are used to build low-pass and high-pass filters to capture holistic information. Important features are further enhanced by the squeeze-and-excitation block. We validate our approach through extensive experiments on both homophilic and heterophilic graphs. Empirical results demonstrate the superiority of our method compared to state-of-the-art clustering methods.  ( 2 min )
    Robust Graph Structure Learning under Heterophily
    arXiv:2403.03659v1 Announce Type: new Abstract: Graph is a fundamental mathematical structure in characterizing relations between different objects and has been widely used on various learning tasks. Most methods implicitly assume a given graph to be accurate and complete. However, real data is inevitably noisy and sparse, which will lead to inferior results. Despite the remarkable success of recent graph representation learning methods, they inherently presume that the graph is homophilic, and largely overlook heterophily, where most connected nodes are from different classes. In this regard, we propose a novel robust graph structure learning method to achieve a high-quality graph from heterophilic data for downstream tasks. We first apply a high-pass filter to make each node more distinctive from its neighbors by encoding structure information into the node features. Then, we learn a robust graph with an adaptive norm characterizing different levels of noise. Afterwards, we propose a novel regularizer to further refine the graph structure. Clustering and semi-supervised classification experiments on heterophilic graphs verify the effectiveness of our method.  ( 2 min )
    A Survey on Applications of Reinforcement Learning in Spatial Resource Allocation
    arXiv:2403.03643v1 Announce Type: new Abstract: The challenge of spatial resource allocation is pervasive across various domains such as transportation, industry, and daily life. As the scale of real-world issues continues to expand and demands for real-time solutions increase, traditional algorithms face significant computational pressures, struggling to achieve optimal efficiency and real-time capabilities. In recent years, with the escalating computational power of computers, the remarkable achievements of reinforcement learning in domains like Go and robotics have demonstrated its robust learning and sequential decision-making capabilities. Given these advancements, there has been a surge in novel methods employing reinforcement learning to tackle spatial resource allocation problems. These methods exhibit advantages such as rapid solution convergence and strong model generalization abilities, offering a new perspective on resolving spatial resource allocation problems. Therefore, this paper aims to summarize and review recent theoretical methods and applied research utilizing reinforcement learning to address spatial resource allocation problems. It provides a summary and comprehensive overview of its fundamental principles, related methodologies, and applied research. Additionally, it highlights several unresolved issues that urgently require attention in this direction for the future.  ( 2 min )
    Learning Invariant Representations of Graph Neural Networks via Cluster Generalization
    arXiv:2403.03599v1 Announce Type: new Abstract: Graph neural networks (GNNs) have become increasingly popular in modeling graph-structured data due to their ability to learn node representations by aggregating local structure information. However, it is widely acknowledged that the test graph structure may differ from the training graph structure, resulting in a structure shift. In this paper, we experimentally find that the performance of GNNs drops significantly when the structure shift happens, suggesting that the learned models may be biased towards specific structure patterns. To address this challenge, we propose the Cluster Information Transfer (CIT) mechanism (Code available at https://github.com/BUPT-GAMMA/CITGNN), which can learn invariant representations for GNNs, thereby improving their generalization ability to various and unknown test graphs with structure shift. The CIT mechanism achieves this by combining different cluster information with the nodes while preserving their cluster-independent information. By generating nodes across different clusters, the mechanism significantly enhances the diversity of the nodes and helps GNNs learn the invariant representations. We provide a theoretical analysis of the CIT mechanism, showing that the impact of changing clusters during structure shift can be mitigated after transfer. Additionally, the proposed mechanism is a plug-in that can be easily used to improve existing GNNs. We comprehensively evaluate our proposed method on three typical structure shift scenarios, demonstrating its effectiveness in enhancing GNNs' performance.  ( 2 min )
    RouteExplainer: An Explanation Framework for Vehicle Routing Problem
    arXiv:2403.03585v1 Announce Type: new Abstract: The Vehicle Routing Problem (VRP) is a widely studied combinatorial optimization problem and has been applied to various practical problems. While the explainability for VRP is significant for improving the reliability and interactivity in practical VRP applications, it remains unexplored. In this paper, we propose RouteExplainer, a post-hoc explanation framework that explains the influence of each edge in a generated route. Our framework realizes this by rethinking a route as the sequence of actions and extending counterfactual explanations based on the action influence model to VRP. To enhance the explanation, we additionally propose an edge classifier that infers the intentions of each edge, a loss function to train the edge classifier, and explanation-text generation by Large Language Models (LLMs). We quantitatively evaluate our edge classifier on four different VRPs. The results demonstrate its rapid computation while maintaining reasonable accuracy, thereby highlighting its potential for deployment in practical applications. Moreover, on the subject of a tourist route, we qualitatively evaluate explanations generated by our framework. This evaluation not only validates our framework but also shows the synergy between explanation frameworks and LLMs. See https://ntt-dkiku.github.io/xai-vrp for our code, datasets, models, and demo.  ( 2 min )
    Efficient Algorithms for Empirical Group Distributional Robust Optimization and Beyond
    arXiv:2403.03562v1 Announce Type: new Abstract: We investigate the empirical counterpart of group distributionally robust optimization (GDRO), which aims to minimize the maximal empirical risk across $m$ distinct groups. We formulate empirical GDRO as a $\textit{two-level}$ finite-sum convex-concave minimax optimization problem and develop a stochastic variance reduced mirror prox algorithm. Unlike existing methods, we construct the stochastic gradient by per-group sampling technique and perform variance reduction for all groups, which fully exploits the $\textit{two-level}$ finite-sum structure of empirical GDRO. Furthermore, we compute the snapshot and mirror snapshot point by a one-index-shifted weighted average, which distinguishes us from the naive ergodic average. Our algorithm also supports non-constant learning rates, which is different from existing literature. We establish convergence guarantees both in expectation and with high probability, demonstrating a complexity of $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, where $\bar n$ is the average number of samples among $m$ groups. Remarkably, our approach outperforms the state-of-the-art method by a factor of $\sqrt{m}$. Furthermore, we extend our methodology to deal with the empirical minimax excess risk optimization (MERO) problem and manage to give the expectation bound and the high probability bound, accordingly. The complexity of our empirical MERO algorithm matches that of empirical GDRO at $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, significantly surpassing the bounds of existing methods.  ( 2 min )
    GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
    arXiv:2403.03507v1 Announce Type: new Abstract: Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.  ( 2 min )
    Inverse-Free Fast Natural Gradient Descent Method for Deep Learning
    arXiv:2403.03473v1 Announce Type: new Abstract: Second-order methods can converge much faster than first-order methods by incorporating second-order derivates or statistics, but they are far less prevalent in deep learning due to their computational inefficiency. To handle this, many of the existing solutions focus on reducing the size of the matrix to be inverted. However, it is still needed to perform the inverse operator in each iteration. In this paper, we present a fast natural gradient descent (FNGD) method, which only requires computing the inverse during the first epoch. Firstly, we reformulate the gradient preconditioning formula in the natural gradient descent (NGD) as a weighted sum of per-sample gradients using the Sherman-Morrison-Woodbury formula. Building upon this, to avoid the iterative inverse operation involved in computing coefficients, the weighted coefficients are shared across epochs without affecting the empirical performance. FNGD approximates the NGD as a fixed-coefficient weighted sum, akin to the average sum in first-order methods. Consequently, the computational complexity of FNGD can approach that of first-order methods. To demonstrate the efficiency of the proposed FNGD, we perform empirical evaluations on image classification and machine translation tasks. For training ResNet-18 on the CIFAR-100 dataset, FNGD can achieve a speedup of 2.05$\times$ compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU score while requiring almost the same training time.  ( 2 min )
    Boosting Meta-Training with Base Class Information for Few-Shot Learning
    arXiv:2403.03472v1 Announce Type: new Abstract: Few-shot learning, a challenging task in machine learning, aims to learn a classifier adaptable to recognize new, unseen classes with limited labeled examples. Meta-learning has emerged as a prominent framework for few-shot learning. Its training framework is originally a task-level learning method, such as Model-Agnostic Meta-Learning (MAML) and Prototypical Networks. And a recently proposed training paradigm called Meta-Baseline, which consists of sequential pre-training and meta-training stages, gains state-of-the-art performance. However, as a non-end-to-end training method, indicating the meta-training stage can only begin after the completion of pre-training, Meta-Baseline suffers from higher training cost and suboptimal performance due to the inherent conflicts of the two training stages. To address these limitations, we propose an end-to-end training paradigm consisting of two alternative loops. In the outer loop, we calculate cross entropy loss on the entire training set while updating only the final linear layer. In the inner loop, we employ the original meta-learning training mode to calculate the loss and incorporate gradients from the outer loss to guide the parameter updates. This training paradigm not only converges quickly but also outperforms existing baselines, indicating that information from the overall training set and the meta-learning training paradigm could mutually reinforce one another. Moreover, being model-agnostic, our framework achieves significant performance gains, surpassing the baseline systems by approximate 1%.  ( 3 min )
    Kernel Correlation-Dissimilarity for Multiple Kernel k-Means Clustering
    arXiv:2403.03448v1 Announce Type: new Abstract: The main objective of the Multiple Kernel k-Means (MKKM) algorithm is to extract non-linear information and achieve optimal clustering by optimizing base kernel matrices. Current methods enhance information diversity and reduce redundancy by exploiting interdependencies among multiple kernels based on correlations or dissimilarities. Nevertheless, relying solely on a single metric, such as correlation or dissimilarity, to define kernel relationships introduces bias and incomplete characterization. Consequently, this limitation hinders efficient information extraction, ultimately compromising clustering performance. To tackle this challenge, we introduce a novel method that systematically integrates both kernel correlation and dissimilarity. Our approach comprehensively captures kernel relationships, facilitating more efficient classification information extraction and improving clustering performance. By emphasizing the coherence between kernel correlation and dissimilarity, our method offers a more objective and transparent strategy for extracting non-linear information and significantly improving clustering precision, supported by theoretical rationale. We assess the performance of our algorithm on 13 challenging benchmark datasets, demonstrating its superiority over contemporary state-of-the-art MKKM techniques.  ( 2 min )
    Uncertainty quantification for deeponets with ensemble kalman inversion
    arXiv:2403.03444v1 Announce Type: new Abstract: In recent years, operator learning, particularly the DeepONet, has received much attention for efficiently learning complex mappings between input and output functions across diverse fields. However, in practical scenarios with limited and noisy data, accessing the uncertainty in DeepONet predictions becomes essential, especially in mission-critical or safety-critical applications. Existing methods, either computationally intensive or yielding unsatisfactory uncertainty quantification, leave room for developing efficient and informative uncertainty quantification (UQ) techniques tailored for DeepONets. In this work, we proposed a novel inference approach for efficient UQ for operator learning by harnessing the power of the Ensemble Kalman Inversion (EKI) approach. EKI, known for its derivative-free, noise-robust, and highly parallelizable feature, has demonstrated its advantages for UQ for physics-informed neural networks [28]. Our innovative application of EKI enables us to efficiently train ensembles of DeepONets while obtaining informative uncertainty estimates for the output of interest. We deploy a mini-batch variant of EKI to accommodate larger datasets, mitigating the computational demand due to large datasets during the training stage. Furthermore, we introduce a heuristic method to estimate the artificial dynamics covariance, thereby improving our uncertainty estimates. Finally, we demonstrate the effectiveness and versatility of our proposed methodology across various benchmark problems, showcasing its potential to address the pressing challenges of uncertainty quantification in DeepONets, especially for practical applications with limited and noisy data.  ( 3 min )
    Sculpting Molecules in 3D: A Flexible Substructure Aware Framework for Text-Oriented Molecular Optimization
    arXiv:2403.03425v1 Announce Type: new Abstract: The integration of deep learning, particularly AI-Generated Content, with high-quality data derived from ab initio calculations has emerged as a promising avenue for transforming the landscape of scientific research. However, the challenge of designing molecular drugs or materials that incorporate multi-modality prior knowledge remains a critical and complex undertaking. Specifically, achieving a practical molecular design necessitates not only meeting the diversity requirements but also addressing structural and textural constraints with various symmetries outlined by domain experts. In this article, we present an innovative approach to tackle this inverse design problem by formulating it as a multi-modality guidance generation/optimization task. Our proposed solution involves a textural-structure alignment symmetric diffusion framework for the implementation of molecular generation/optimization tasks, namely 3DToMolo. 3DToMolo aims to harmonize diverse modalities, aligning them seamlessly to produce molecular structures adhere to specified symmetric structural and textural constraints by experts in the field. Experimental trials across three guidance generation settings have shown a superior hit generation performance compared to state-of-the-art methodologies. Moreover, 3DToMolo demonstrates the capability to generate novel molecules, incorporating specified target substructures, without the need for prior knowledge. This work not only holds general significance for the advancement of deep learning methodologies but also paves the way for a transformative shift in molecular design strategies. 3DToMolo creates opportunities for a more nuanced and effective exploration of the vast chemical space, opening new frontiers in the development of molecular entities with tailored properties and functionalities.  ( 3 min )
    Leveraging The Finite States of Emotion Processing to Study Late-Life Mental Health
    arXiv:2403.03414v1 Announce Type: new Abstract: Traditional approaches in mental health research apply General Linear Models (GLM) to describe the longitudinal dynamics of observed psycho-behavioral measurements (questionnaire summary scores). Similarly, GLMs are also applied to characterize relationships between neurobiological measurements (regional fMRI signals) and perceptual stimuli or other regional signals. While these methods are useful for exploring linear correlations among the isolated signals of those constructs (i.e., summary scores or fMRI signals), these classical frameworks fall short in providing insights into the comprehensive system-level dynamics underlying observable changes. Hidden Markov Models (HMM) are a statistical model that enable us to describe the sequential relations among multiple observable constructs, and when applied through the lens of Finite State Automata (FSA), can provide a more integrated and intuitive framework for modeling and understanding the underlying controller (the prescription for how to respond to inputs) that fundamentally defines any system, as opposed to linearly correlating output signals produced by the controller. We present a simple and intuitive HMM processing pipeline vcHMM (See Preliminary Data) that highlights FSA theory and is applicable for both behavioral analysis of questionnaire data and fMRI data. HMMs offer theoretic promise as they are computationally equivalent to the FSA, the control processor of a Turing Machine (TM) The dynamic programming Viterbi algorithm is used to leverage the HMM model. It efficiently identifies the most likely sequence of hidden states. The vcHMM pipeline leverages this grammar to understand how behavior and neural activity relate to depression.  ( 3 min )
    Prediction Of Cryptocurrency Prices Using LSTM, SVM And Polynomial Regression
    arXiv:2403.03410v1 Announce Type: new Abstract: The rapid development of information technology, especially the Internet, has facilitated users with a quick and easy way to seek information. With these convenience offered by internet services, many individuals who initially invested in gold and precious metals are now shifting into digital investments in form of cryptocurrencies. However, investments in crypto coins are filled with uncertainties and fluctuation in daily basis. This risk posed as significant challenges for coin investors that could result in substantial investment losses. The uncertainty of the value of these crypto coins is a critical issue in the field of coin investment. Forecasting, is one of the methods used to predict the future value of these crypto coins. By utilizing the models of Long Short Term Memory, Support Vector Machine, and Polynomial Regression algorithm for forecasting, a performance comparison is conducted to determine which algorithm model is most suitable for predicting crypto currency prices. The mean square error is employed as a benchmark for the comparison. By applying those three constructed algorithm models, the Support Vector Machine uses a linear kernel to produce the smallest mean square error compared to the Long Short Term Memory and Polynomial Regression algorithm models, with a mean square error value of 0.02. Keywords: Cryptocurrency, Forecasting, Long Short Term Memory, Mean Square Error, Polynomial Regression, Support Vector Machine  ( 3 min )
    TartanAviation: Image, Speech, and ADS-B Trajectory Datasets for Terminal Airspace Operations
    arXiv:2403.03372v1 Announce Type: new Abstract: We introduce TartanAviation, an open-source multi-modal dataset focused on terminal-area airspace operations. TartanAviation provides a holistic view of the airport environment by concurrently collecting image, speech, and ADS-B trajectory data using setups installed inside airport boundaries. The datasets were collected at both towered and non-towered airfields across multiple months to capture diversity in aircraft operations, seasons, aircraft types, and weather conditions. In total, TartanAviation provides 3.1M images, 3374 hours of Air Traffic Control speech data, and 661 days of ADS-B trajectory data. The data was filtered, processed, and validated to create a curated dataset. In addition to the dataset, we also open-source the code-base used to collect and pre-process the dataset, further enhancing accessibility and usability. We believe this dataset has many potential use cases and would be particularly vital in allowing AI and machine learning technologies to be integrated into air traffic control systems and advance the adoption of autonomous aircraft in the airspace.  ( 2 min )
    Leveraging Federated Learning for Automatic Detection of Clopidogrel Treatment Failures
    arXiv:2403.03368v1 Announce Type: new Abstract: The effectiveness of clopidogrel, a widely used antiplatelet medication, varies significantly among individuals, necessitating the development of precise predictive models to optimize patient care. In this study, we leverage federated learning strategies to address clopidogrel treatment failure detection. Our research harnesses the collaborative power of multiple healthcare institutions, allowing them to jointly train machine learning models while safeguarding sensitive patient data. Utilizing the UK Biobank dataset, which encompasses a vast and diverse population, we partitioned the data based on geographic centers and evaluated the performance of federated learning. Our results show that while centralized training achieves higher Area Under the Curve (AUC) values and faster convergence, federated learning approaches can substantially narrow this performance gap. Our findings underscore the potential of federated learning in addressing clopidogrel treatment failure detection, offering a promising avenue for enhancing patient care through personalized treatment strategies while respecting data privacy. This study contributes to the growing body of research on federated learning in healthcare and lays the groundwork for secure and privacy-preserving predictive models for various medical conditions.  ( 2 min )
    Level Set Teleportation: An Optimization Perspective
    arXiv:2403.03362v1 Announce Type: new Abstract: We study level set teleportation, an optimization sub-routine which seeks to accelerate gradient methods by maximizing the gradient norm on a level-set of the objective function. Since the descent lemma implies that gradient descent (GD) decreases the objective proportional to the squared norm of the gradient, level-set teleportation maximizes this one-step progress guarantee. For convex functions satisfying Hessian stability, we prove that GD with level-set teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than standard GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where we show level-set teleportation neither improves nor worsens convergence rates. To evaluate teleportation in practice, we develop a projected-gradient-type method requiring only Hessian-vector products. We use this method to show that gradient methods with access to a teleportation oracle uniformly out-perform their standard versions on a variety of learning problems.  ( 2 min )
    Averaging Rate Scheduler for Decentralized Learning on Heterogeneous Data
    arXiv:2403.03292v1 Announce Type: new Abstract: State-of-the-art decentralized learning algorithms typically require the data distribution to be Independent and Identically Distributed (IID). However, in practical scenarios, the data distribution across the agents can have significant heterogeneity. In this work, we propose averaging rate scheduling as a simple yet effective way to reduce the impact of heterogeneity in decentralized learning. Our experiments illustrate the superiority of the proposed method (~3% improvement in test accuracy) compared to the conventional approach of employing a constant averaging rate.  ( 2 min )
    Exact Enforcement of Temporal Continuity in Sequential Physics-Informed Neural Networks
    arXiv:2403.03223v1 Announce Type: new Abstract: The use of deep learning methods in scientific computing represents a potential paradigm shift in engineering problem solving. One of the most prominent developments is Physics-Informed Neural Networks (PINNs), in which neural networks are trained to satisfy partial differential equations (PDEs) and/or observed data. While this method shows promise, the standard version has been shown to struggle in accurately predicting the dynamic behavior of time-dependent problems. To address this challenge, methods have been proposed that decompose the time domain into multiple segments, employing a distinct neural network in each segment and directly incorporating continuity between them in the loss function of the minimization problem. In this work we introduce a method to exactly enforce continuity between successive time segments via a solution ansatz. This hard constrained sequential PINN (HCS-PINN) method is simple to implement and eliminates the need for any loss terms associated with temporal continuity. The method is tested for a number of benchmark problems involving both linear and non-linear PDEs. Examples include various first order time dependent problems in which traditional PINNs struggle, namely advection, Allen-Cahn, and Korteweg-de Vries equations. Furthermore, second and third order time-dependent problems are demonstrated via wave and Jerky dynamics examples, respectively. Notably, the Jerky dynamics problem is chaotic, making the problem especially sensitive to temporal accuracy. The numerical experiments conducted with the proposed method demonstrated superior convergence and accuracy over both traditional PINNs and the soft-constrained counterparts.  ( 3 min )
    Knowledge-guided EEG Representation Learning
    arXiv:2403.03222v1 Announce Type: new Abstract: Self-supervised learning has produced impressive results in multimedia domains of audio, vision and speech. This paradigm is equally, if not more, relevant for the domain of biosignals, owing to the scarcity of labelled data in such scenarios. The ability to leverage large-scale unlabelled data to learn robust representations could help improve the performance of numerous inference tasks on biosignals. Given the inherent domain differences between multimedia modalities and biosignals, the established objectives for self-supervised learning may not translate well to this domain. Hence, there is an unmet need to adapt these methods to biosignal analysis. In this work we propose a self-supervised model for EEG, which provides robust performance and remarkable parameter efficiency by using state space-based deep learning architecture. We also propose a novel knowledge-guided pre-training objective that accounts for the idiosyncrasies of the EEG signal. The results indicate improved embedding representation learning and downstream performance compared to prior works on exemplary tasks. Also, the proposed objective significantly reduces the amount of pre-training data required to obtain performance equivalent to prior works.  ( 2 min )
  • Open

    Online model error correction with neural networks: application to the Integrated Forecasting System
    arXiv:2403.03702v1 Announce Type: new Abstract: In recent years, there has been significant progress in the development of fully data-driven global numerical weather prediction models. These machine learning weather prediction models have their strength, notably accuracy and low computational requirements, but also their weakness: they struggle to represent fundamental dynamical balances, and they are far from being suitable for data assimilation experiments. Hybrid modelling emerges as a promising approach to address these limitations. Hybrid models integrate a physics-based core component with a statistical component, typically a neural network, to enhance prediction capabilities. In this article, we propose to develop a model error correction for the operational Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts using a neural network. The neural network is initially pre-trained offline using a large dataset of operational analyses and analysis increments. Subsequently, the trained network is integrated into the IFS within the Object-Oriented Prediction System (OOPS) so as to be used in data assimilation and forecast experiments. It is then further trained online using a recently developed variant of weak-constraint 4D-Var. The results show that the pre-trained neural network already provides a reliable model error correction, which translates into reduced forecast errors in many conditions and that the online training further improves the accuracy of the hybrid model in many conditions.  ( 2 min )
    Incentivized Learning in Principal-Agent Bandit Games
    arXiv:2403.03811v1 Announce Type: new Abstract: This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent. The principal and the agent have misaligned objectives and the choice of action is only left to the agent. However, the principal can influence the agent's decisions by offering incentives which add up to his rewards. The principal aims to iteratively learn an incentive policy to maximize her own total utility. This framework extends usual bandit problems and is motivated by several practical applications, such as healthcare or ecological taxation, where traditionally used mechanism design theories often overlook the learning aspect of the problem. We present nearly optimal (with respect to a horizon $T$) learning algorithms for the principal's regret in both multi-armed and linear contextual settings. Finally, we support our theoretical guarantees through numerical experiments.  ( 2 min )
    Triple/Debiased Lasso for Statistical Inference of Conditional Average Treatment Effects
    arXiv:2403.03240v1 Announce Type: cross Abstract: This study investigates the estimation and the statistical inference about Conditional Average Treatment Effects (CATEs), which have garnered attention as a metric representing individualized causal effects. In our data-generating process, we assume linear models for the outcomes associated with binary treatments and define the CATE as a difference between the expected outcomes of these linear models. This study allows the linear models to be high-dimensional, and our interest lies in consistent estimation and statistical inference for the CATE. In high-dimensional linear regression, one typical approach is to assume sparsity. However, in our study, we do not assume sparsity directly. Instead, we consider sparsity only in the difference of the linear models. We first use a doubly robust estimator to approximate this difference and then regress the difference on covariates with Lasso regularization. Although this regression estimator is consistent for the CATE, we further reduce the bias using the techniques in double/debiased machine learning (DML) and debiased Lasso, leading to $\sqrt{n}$-consistency and confidence intervals. We refer to the debiased estimator as the triple/debiased Lasso (TDL), applying both DML and debiased Lasso techniques. We confirm the soundness of our proposed method through simulation studies.  ( 2 min )
    Efficient Algorithms for Empirical Group Distributional Robust Optimization and Beyond
    arXiv:2403.03562v1 Announce Type: cross Abstract: We investigate the empirical counterpart of group distributionally robust optimization (GDRO), which aims to minimize the maximal empirical risk across $m$ distinct groups. We formulate empirical GDRO as a $\textit{two-level}$ finite-sum convex-concave minimax optimization problem and develop a stochastic variance reduced mirror prox algorithm. Unlike existing methods, we construct the stochastic gradient by per-group sampling technique and perform variance reduction for all groups, which fully exploits the $\textit{two-level}$ finite-sum structure of empirical GDRO. Furthermore, we compute the snapshot and mirror snapshot point by a one-index-shifted weighted average, which distinguishes us from the naive ergodic average. Our algorithm also supports non-constant learning rates, which is different from existing literature. We establish convergence guarantees both in expectation and with high probability, demonstrating a complexity of $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, where $\bar n$ is the average number of samples among $m$ groups. Remarkably, our approach outperforms the state-of-the-art method by a factor of $\sqrt{m}$. Furthermore, we extend our methodology to deal with the empirical minimax excess risk optimization (MERO) problem and manage to give the expectation bound and the high probability bound, accordingly. The complexity of our empirical MERO algorithm matches that of empirical GDRO at $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, significantly surpassing the bounds of existing methods.  ( 2 min )
    Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate Choices
    arXiv:2403.03589v1 Announce Type: cross Abstract: This study designs an adaptive experiment for efficiently estimating average treatment effect (ATEs). We consider an adaptive experiment where an experimenter sequentially samples an experimental unit from a covariate density decided by the experimenter and assigns a treatment. After assigning a treatment, the experimenter observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using gathered samples. The objective of the experimenter is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose a framework under which an experimenter optimizes the covariate density, as well as the propensity score, and find that optimizing both covariate density and propensity score reduces the asymptotic variance more than optimizing only the propensity score. Based on this idea, in each round of our experiment, the experimenter optimizes the covariate density and propensity score based on past observations. To design an adaptive experiment, we first derive the efficient covariate density and propensity score that minimizes the semiparametric efficiency bound, a lower bound for the asymptotic variance given a fixed covariate density and a fixed propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound.  ( 3 min )
    Exploration via linearly perturbed loss minimisation
    arXiv:2311.07565v2 Announce Type: replace-cross Abstract: We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.  ( 2 min )
    From Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based Approach
    arXiv:2311.03260v2 Announce Type: replace-cross Abstract: We propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Under the view of coupled oscillators, we first show the connection between Kuramoto model and basic GNN and then over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.  ( 2 min )
    Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias
    arXiv:2310.14814v3 Announce Type: replace-cross Abstract: Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, although they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraint. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities. The code is available at https://github.com/ambroiseodt/tsim.  ( 2 min )
    Distribution-Free Statistical Dispersion Control for Societal Applications
    arXiv:2309.13786v2 Announce Type: replace-cross Abstract: Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications, it is crucial to understand and control the dispersion of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.  ( 2 min )
    AbDiffuser: Full-Atom Generation of in vitro Functioning Antibodies
    arXiv:2308.05027v2 Announce Type: replace-cross Abstract: We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude, enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of the selected designs were tight binders.  ( 2 min )
    A polynomial-time iterative algorithm for random graph matching with non-vanishing correlation
    arXiv:2306.00266v2 Announce Type: replace-cross Abstract: We propose an efficient algorithm for matching two correlated Erd\H{o}s--R\'enyi graphs with $n$ vertices whose edges are correlated through a latent vertex correspondence. When the edge density $q= n^{- \alpha+o(1)}$ for a constant $\alpha \in [0,1)$, we show that our algorithm has polynomial running time and succeeds to recover the latent matching as long as the edge correlation is non-vanishing. This is closely related to our previous work on a polynomial-time algorithm that matches two Gaussian Wigner matrices with non-vanishing correlation, and provides the first polynomial-time random graph matching algorithm (regardless of the regime of $q$) when the edge correlation is below the square root of the Otter's constant (which is $\approx 0.338$).  ( 2 min )
    High-Fidelity Image Compression with Score-based Generative Models
    arXiv:2305.18231v2 Announce Type: replace-cross Abstract: Despite the tremendous success of diffusion generative models in text-to-image generation, replicating this success in the domain of image compression has proven difficult. In this paper, we demonstrate that diffusion can significantly improve perceptual quality at a given bit-rate, outperforming state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is achieved using a simple but theoretically motivated two-stage approach combining an autoencoder targeting MSE followed by a further score-based decoder. However, as we will show, implementation details matter and the optimal design decisions can differ greatly from typical text-to-image models.  ( 2 min )
    Differentially Private Generalized Linear Models Revisited
    arXiv:2205.03014v2 Announce Type: replace-cross Abstract: We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.  ( 2 min )
    Two-step interpretable modeling of Intensive Care Acquired Infections
    arXiv:2301.11146v2 Announce Type: replace-cross Abstract: We present a novel methodology for integrating high resolution longitudinal data with the dynamic prediction capabilities of survival models. The aim is two-fold: to improve the predictive power while maintaining interpretability of the models. To go beyond the black box paradigm of artificial neural networks, we propose a parsimonious and robust semi-parametric approach (i.e., a landmarking competing risks model) that combines routinely collected low-resolution data with predictive features extracted from a convolutional neural network, that was trained on high resolution time-dependent information. We then use saliency maps to analyze and explain the extra predictive power of this model. To illustrate our methodology, we focus on healthcare-associated infections in patients admitted to an intensive care unit.  ( 2 min )
    Ensemble sampling for linear bandits: small ensembles suffice
    arXiv:2311.08376v2 Announce Type: replace Abstract: We provide the first useful and rigorous analysis of ensemble sampling for the stochastic linear bandit setting. In particular, we show that, under standard assumptions, for a $d$-dimensional stochastic linear bandit with an interaction horizon $T$, ensemble sampling with an ensemble of size of order $\smash{d \log T}$ incurs regret at most of the order $\smash{(d \log T)^{5/2} \sqrt{T}}$. Ours is the first result in any structured setting not to require the size of the ensemble to scale linearly with $T$ -- which defeats the purpose of ensemble sampling -- while obtaining near $\smash{\sqrt{T}}$ order regret. Ours is also the first result that allows infinite action sets.  ( 2 min )
    Nearly $d$-Linear Convergence Bounds for Diffusion Models via Stochastic Localization
    arXiv:2308.03686v3 Announce Type: replace Abstract: Denoising diffusions are a powerful method to generate approximate samples from high-dimensional data distributions. Recent results provide polynomial bounds on their convergence rate, assuming $L^2$-accurate scores. Until now, the tightest bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to approximate an arbitrary distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in KL divergence. Our proof extends the Girsanov-based methods of previous works. We introduce a refined treatment of the error from discretizing the reverse SDE inspired by stochastic localization.  ( 2 min )
    Black-Box $k$-to-$1$-PCA Reductions: Theory and Applications
    arXiv:2403.03905v1 Announce Type: cross Abstract: The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have implicit access to via samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work both in sample complexity and approximation quality.  ( 2 min )
    On the Origins of Linear Representations in Large Language Models
    arXiv:2403.03867v1 Announce Type: cross Abstract: Recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to show that the next token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.  ( 2 min )
    Confidence on the Focal: Conformal Prediction with Selection-Conditional Coverage
    arXiv:2403.03868v1 Announce Type: cross Abstract: Conformal prediction builds marginally valid prediction intervals which cover the unknown outcome of a randomly drawn new test point with a prescribed probability. In practice, a common scenario is that, after seeing the test unit(s), practitioners decide which test unit(s) to focus on in a data-driven manner, and wish to quantify the uncertainty for the focal unit(s). In such cases, marginally valid prediction intervals for these focal units can be misleading due to selection bias. This paper presents a general framework for constructing a prediction set with finite-sample exact coverage conditional on the unit being selected. Its general form works for arbitrary selection rules, and generalizes Mondrian Conformal Prediction to multiple test units and non-equivariant classifiers. We then work out computationally efficient implementation of our framework for a number of realistic selection rules, including top-K selection, optimization-based selection, selection based on conformal p-values, and selection based on properties of preliminary conformal prediction sets. The performance of our methods is demonstrated via applications in drug discovery and health risk prediction.  ( 2 min )
    Public-data Assisted Private Stochastic Optimization: Power and Limitations
    arXiv:2403.03856v1 Announce Type: cross Abstract: We study the limits and capability of public-data assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show that any $(\epsilon,\delta)$-PA-DP has excess risk $\tilde{\Omega}\big(\min\big\{\frac{1}{\sqrt{n_{\text{pub}}}},\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon} \big\} \big)$, where $d$ is the dimension, ${n_{\text{pub}}}$ is the number of public samples, ${n_{\text{priv}}}$ is the number of private samples, and $n={n_{\text{pub}}}+{n_{\text{priv}}}$. These lower bounds are established via our new lower bounds for PA-DP mean estimation, which are of a similar form. Up to constant factors, these lower bounds show that the simple strategy of either treating all data as private or discarding the private data, is optimal. We also study PA-DP supervised learning with \textit{unlabeled} public samples. In contrast to our previous result, we here show novel methods for leveraging public data in private supervised learning. For generalized linear models (GLM) with unlabeled public data, we show an efficient algorithm which, given $\tilde{O}({n_{\text{priv}}}\epsilon)$ unlabeled public samples, achieves the dimension independent rate $\tilde{O}\big(\frac{1}{\sqrt{{n_{\text{priv}}}}} + \frac{1}{\sqrt{{n_{\text{priv}}}\epsilon}}\big)$. We develop new lower bounds for this setting which shows that this rate cannot be improved with more public samples, and any fewer public samples leads to a worse rate. Finally, we provide extensions of this result to general hypothesis classes with finite fat-shattering dimension with applications to neural networks and non-Euclidean geometries.  ( 3 min )
    AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs
    arXiv:2403.03772v1 Announce Type: cross Abstract: Existing causal discovery methods based on combinatorial optimization or search are slow, prohibiting their application on large-scale datasets. In response, more recent methods attempt to address this limitation by formulating causal discovery as structure learning with continuous optimization but such approaches thus far provide no statistical guarantees. In this paper, we show that by efficiently parallelizing existing causal discovery methods, we can in fact scale them to thousands of dimensions, making them practical for substantially larger-scale problems. In particular, we parallelize the LiNGAM method, which is quadratic in the number of variables, obtaining up to a 32-fold speed-up on benchmark datasets when compared with existing sequential implementations. Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it. This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions yielding competitive results compared with specialized continuous optimization methods, and Var-LiNGAM for causal discovery on U.S. stock data.  ( 2 min )
    Exact objectives of random linear programs and mean widths of random polyhedrons
    arXiv:2403.03637v1 Announce Type: cross Abstract: We consider \emph{random linear programs} (rlps) as a subclass of \emph{random optimization problems} (rops) and study their typical behavior. Our particular focus is on appropriate linear objectives which connect the rlps to the mean widths of random polyhedrons/polytopes. Utilizing the powerful machinery of \emph{random duality theory} (RDT) \cite{StojnicRegRndDlt10}, we obtain, in a large dimensional context, the exact characterizations of the program's objectives. In particular, for any $\alpha=\lim_{n\rightarrow\infty}\frac{m}{n}\in(0,\infty)$, any unit vector $\mathbf{c}\in{\mathbb R}^n$, any fixed $\mathbf{a}\in{\mathbb R}^n$, and $A\in {\mathbb R}^{m\times n}$ with iid standard normal entries, we have \begin{eqnarray*} \lim_{n\rightarrow\infty}{\mathbb P}_{A} \left ( (1-\epsilon) \xi_{opt}(\alpha;\mathbf{a}) \leq \min_{A\mathbf{x}\leq \mathbf{a}}\mathbf{c}^T\mathbf{x} \leq (1+\epsilon) \xi_{opt}(\alpha;\mathbf{a}) \right ) \longrightarrow 1, \end{eqnarray*} where \begin{equation*} \xi_{opt}(\alpha;\mathbf{a}) \triangleq \min_{x>0} \sqrt{x^2- x^2 \lim_{n\rightarrow\infty} \frac{\sum_{i=1}^{m} \left ( \frac{1}{2} \left (\left ( \frac{\mathbf{a}_i}{x}\right )^2 + 1\right ) \mbox{erfc}\left( \frac{\mathbf{a}_i}{x\sqrt{2}}\right ) - \frac{\mathbf{a}_i}{x} \frac{e^{-\frac{\mathbf{a}_i^2}{2x^2}}}{\sqrt{2\pi}} \right ) }{n} }. \end{equation*} For example, for $\mathbf{a}=\mathbf{1}$, one uncovers \begin{equation*} \xi_{opt}(\alpha) = \min_{x>0} \sqrt{x^2- x^2 \alpha \left ( \frac{1}{2} \left ( \frac{1}{x^2} + 1\right ) \mbox{erfc} \left ( \frac{1}{x\sqrt{2}}\right ) - \frac{1}{x} \frac{e^{-\frac{1}{2x^2}}}{\sqrt{2\pi}} \right ) }. \end{equation*} Moreover, $2 \xi_{opt}(\alpha)$ is precisely the concentrating point of the mean width of the polyhedron $\{\mathbf{x}|A\mathbf{x} \leq \mathbf{1}\}$.  ( 2 min )
    Reducing the dimensionality and granularity in hierarchical categorical variables
    arXiv:2403.03613v1 Announce Type: cross Abstract: Hierarchical categorical variables often exhibit many levels (high granularity) and many classes within each level (high dimensionality). This may cause overfitting and estimation issues when including such covariates in a predictive model. In current literature, a hierarchical covariate is often incorporated via nested random effects. However, this does not facilitate the assumption of classes having the same effect on the response variable. In this paper, we propose a methodology to obtain a reduced representation of a hierarchical categorical variable. We show how entity embedding can be applied in a hierarchical setting. Subsequently, we propose a top-down clustering algorithm which leverages the information encoded in the embeddings to reduce both the within-level dimensionality as well as the overall granularity of the hierarchical categorical variable. In simulation experiments, we show that our methodology can effectively approximate the true underlying structure of a hierarchical covariate in terms of the effect on a response variable, and find that incorporating the reduced hierarchy improves model fit. We apply our methodology on a real dataset and find that the reduced hierarchy is an improvement over the original hierarchical structure and reduced structures proposed in the literature.  ( 2 min )
    Data-Based In-Cylinder Pressure Model with Cyclic Variations for Combustion Control: A RCCI Engine Application
    arXiv:2403.03602v1 Announce Type: cross Abstract: Cylinder pressure-based control is a key enabler for advanced pre-mixed combustion concepts. Besides guaranteeing robust and safe operation, it allows for cylinder pressure and heat release shaping. This requires fast control-oriented combustion models. Over the years, mean-value models have been proposed that can predict combustion measures (e.g., Gross Indicated Mean Effective Pressure, or the crank angle where 50% of the total heat is released) or models that predict the full in-cylinder pressure. However, these models are not able to capture cyclic variations. This is important in the control design for combustion concepts, like Reactivity Controlled Compression Ignition, that can suffer from large cyclic variations. In this study, the in-cylinder pressure and cyclic variation are modelled using a data-based approach. The model combines Principle Component Decomposition and Gaussian Process Regression. A detailed study is performed on the effects of the different hyperparameters and kernel choices. The approach is applicable to any combustion concept, but most valuable for advance combustion concepts with large cyclic variation. The potential of the proposed approach is demonstrated for an Reactivity Controlled Compression Ignition engine running on Diesel and E85. The prediction quality of the evaluated combustion measures has an overall accuracy of 13.5% and 65.5% in mean behaviour and standard deviation, respectively. The peak-pressure rise-rate is traditionally hard to predict, in the proposed model it has an accuracy of 22.7% and 96.4% in mean behaviour and standard deviation, respectively. This Principle Component Decomposition-based approach is an important step towards in-cylinder pressure shaping. The use of Gaussian Process Regression provides important information on cyclic variation and provides next-cycle controls information on safety and performance criteria.  ( 3 min )
    Conformal prediction for multi-dimensional time series by ellipsoidal sets
    arXiv:2403.03850v1 Announce Type: new Abstract: Conformal prediction (CP) has been a popular method for uncertainty quantification because it is distribution-free, model-agnostic, and theoretically sound. For forecasting problems in supervised learning, most CP methods focus on building prediction intervals for univariate responses. In this work, we develop a sequential CP method called $\texttt{MultiDimSPCI}$ that builds prediction regions for a multivariate response, especially in the context of multivariate time series, which are not exchangeable. Theoretically, we estimate finite-sample high-probability bounds on the conditional coverage gap. Empirically, we demonstrate that $\texttt{MultiDimSPCI}$ maintains valid coverage on a wide range of multivariate time series while producing smaller prediction regions than CP and non-CP baselines.  ( 2 min )
    Targeted Variance Reduction: Robust Bayesian Optimization of Black-Box Simulators with Noise Parameters
    arXiv:2403.03816v1 Announce Type: new Abstract: The optimization of a black-box simulator over control parameters $\mathbf{x}$ arises in a myriad of scientific applications. In such applications, the simulator often takes the form $f(\mathbf{x},\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are parameters that are uncertain in practice. Robust optimization aims to optimize the objective $\mathbb{E}[f(\mathbf{x},\boldsymbol{\Theta})]$, where $\boldsymbol{\Theta} \sim \mathcal{P}$ is a random variable that models uncertainty on $\boldsymbol{\theta}$. For this, existing black-box methods typically employ a two-stage approach for selecting the next point $(\mathbf{x},\boldsymbol{\theta})$, where $\mathbf{x}$ and $\boldsymbol{\theta}$ are optimized separately via different acquisition functions. As such, these approaches do not employ a joint acquisition over $(\mathbf{x},\boldsymbol{\theta})$, and thus may fail to fully exploit control-to-noise interactions for effective robust optimization. To address this, we propose a new Bayesian optimization method called Targeted Variance Reduction (TVR). The TVR leverages a novel joint acquisition function over $(\mathbf{x},\boldsymbol{\theta})$, which targets variance reduction on the objective within the desired region of improvement. Under a Gaussian process surrogate on $f$, the TVR acquisition can be evaluated in closed form, and reveals an insightful exploration-exploitation-precision trade-off for robust black-box optimization. The TVR can further accommodate a broad class of non-Gaussian distributions on $\mathcal{P}$ via a careful integration of normalizing flows. We demonstrate the improved performance of TVR over the state-of-the-art in a suite of numerical experiments and an application to the robust design of automobile brake discs under operational uncertainty.  ( 3 min )
    Uncertainty quantification for deeponets with ensemble kalman inversion
    arXiv:2403.03444v1 Announce Type: cross Abstract: In recent years, operator learning, particularly the DeepONet, has received much attention for efficiently learning complex mappings between input and output functions across diverse fields. However, in practical scenarios with limited and noisy data, accessing the uncertainty in DeepONet predictions becomes essential, especially in mission-critical or safety-critical applications. Existing methods, either computationally intensive or yielding unsatisfactory uncertainty quantification, leave room for developing efficient and informative uncertainty quantification (UQ) techniques tailored for DeepONets. In this work, we proposed a novel inference approach for efficient UQ for operator learning by harnessing the power of the Ensemble Kalman Inversion (EKI) approach. EKI, known for its derivative-free, noise-robust, and highly parallelizable feature, has demonstrated its advantages for UQ for physics-informed neural networks [28]. Our innovative application of EKI enables us to efficiently train ensembles of DeepONets while obtaining informative uncertainty estimates for the output of interest. We deploy a mini-batch variant of EKI to accommodate larger datasets, mitigating the computational demand due to large datasets during the training stage. Furthermore, we introduce a heuristic method to estimate the artificial dynamics covariance, thereby improving our uncertainty estimates. Finally, we demonstrate the effectiveness and versatility of our proposed methodology across various benchmark problems, showcasing its potential to address the pressing challenges of uncertainty quantification in DeepONets, especially for practical applications with limited and noisy data.  ( 3 min )
    Spectral Phase Transition and Optimal PCA in Block-Structured Spiked models
    arXiv:2403.03695v1 Announce Type: new Abstract: We discuss the inhomogeneous spiked Wigner model, a theoretical framework recently introduced to study structured noise in various learning scenarios, through the prism of random matrix theory, with a specific focus on its spectral properties. Our primary objective is to find an optimal spectral method and to extend the celebrated \cite{BBP} (BBP) phase transition criterion -- well-known in the homogeneous case -- to our inhomogeneous, block-structured, Wigner model. We provide a thorough rigorous analysis of a transformed matrix and show that the transition for the appearance of 1) an outlier outside the bulk of the limiting spectral distribution and 2) a positive overlap between the associated eigenvector and the signal, occurs precisely at the optimal threshold, making the proposed spectral method optimal within the class of iterative methods for the inhomogeneous Wigner problem.  ( 2 min )
    Spectral Algorithms on Manifolds through Diffusion
    arXiv:2403.03669v1 Announce Type: new Abstract: The existing research on spectral algorithms, applied within a Reproducing Kernel Hilbert Space (RKHS), has primarily focused on general kernel functions, often neglecting the inherent structure of the input feature space. Our paper introduces a new perspective, asserting that input data are situated within a low-dimensional manifold embedded in a higher-dimensional Euclidean space. We study the convergence performance of spectral algorithms in the RKHSs, specifically those generated by the heat kernels, known as diffusion spaces. Incorporating the manifold structure of the input, we employ integral operator techniques to derive tight convergence upper bounds concerning generalized norms, which indicates that the estimators converge to the target function in strong sense, entailing the simultaneous convergence of the function itself and its derivatives. These bounds offer two significant advantages: firstly, they are exclusively contingent on the intrinsic dimension of the input manifolds, thereby providing a more focused analysis. Secondly, they enable the efficient derivation of convergence rates for derivatives of any k-th order, all of which can be accomplished within the ambit of the same spectral algorithms. Furthermore, we establish minimax lower bounds to demonstrate the asymptotic optimality of these conclusions in specific contexts. Our study confirms that the spectral algorithms are practically significant in the broader context of high-dimensional approximation.  ( 2 min )
    Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
    arXiv:2403.03950v1 Announce Type: cross Abstract: Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.  ( 3 min )
    Accelerating Convergence of Score-Based Diffusion Models, Provably
    arXiv:2403.03852v1 Announce Type: cross Abstract: Score-based diffusion models, while achieving remarkable empirical performance, often suffer from low sampling speed, due to extensive function evaluations needed during the sampling phase. Despite a flurry of recent activities towards speeding up diffusion generative modeling in practice, theoretical underpinnings for acceleration techniques remain severely limited. In this paper, we design novel training-free algorithms to accelerate popular deterministic (i.e., DDIM) and stochastic (i.e., DDPM) samplers. Our accelerated deterministic sampler converges at a rate $O(1/{T}^2)$ with $T$ the number of steps, improving upon the $O(1/T)$ rate for the DDIM sampler; and our accelerated stochastic sampler converges at a rate $O(1/T)$, outperforming the rate $O(1/\sqrt{T})$ for the DDPM sampler. The design of our algorithms leverages insights from higher-order approximation, and shares similar intuitions as popular high-order ODE solvers like the DPM-Solver-2. Our theory accommodates $\ell_2$-accurate score estimates, and does not require log-concavity or smoothness on the target distribution.  ( 2 min )
    CoRMF: Criticality-Ordered Recurrent Mean Field Ising Solver
    arXiv:2403.03391v1 Announce Type: new Abstract: We propose an RNN-based efficient Ising model solver, the Criticality-ordered Recurrent Mean Field (CoRMF), for forward Ising problems. In its core, a criticality-ordered spin sequence of an $N$-spin Ising model is introduced by sorting mission-critical edges with greedy algorithm, such that an autoregressive mean-field factorization can be utilized and optimized with Recurrent Neural Networks (RNNs). Our method has two notable characteristics: (i) by leveraging the approximated tree structure of the underlying Ising graph, the newly-obtained criticality order enables the unification between variational mean-field and RNN, allowing the generally intractable Ising model to be efficiently probed with probabilistic inference; (ii) it is well-modulized, model-independent while at the same time expressive enough, and hence fully applicable to any forward Ising inference problems with minimal effort. Computationally, by using a variance-reduced Monte Carlo gradient estimator, CoRFM solves the Ising problems in a self-train fashion without data/evidence, and the inference tasks can be executed by directly sampling from RNN. Theoretically, we establish a provably tighter error bound than naive mean-field by using the matrix cut decomposition machineries. Numerically, we demonstrate the utility of this framework on a series of Ising datasets.  ( 2 min )
    Hypothesis Spaces for Deep Learning
    arXiv:2403.03353v1 Announce Type: new Abstract: This paper introduces a hypothesis space for deep learning that employs deep neural networks (DNNs). By treating a DNN as a function of two variables, the physical variable and parameter variable, we consider the primitive set of the DNNs for the parameter variable located in a set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. We then complete the linear span of the primitive DNN set in a weak* topology to construct a Banach space of functions of the physical variable. We prove that the Banach space so constructed is a reproducing kernel Banach space (RKBS) and construct its reproducing kernel. We investigate two learning models, regularized learning and minimum interpolation problem in the resulting RKBS, by establishing representer theorems for solutions of the learning models. The representer theorems unfold that solutions of these learning models can be expressed as linear combination of a finite number of kernel sessions determined by given data and the reproducing kernel.  ( 2 min )
    From Displacements to Distributions: A Machine-Learning Enabled Framework for Quantifying Uncertainties in Parameters of Computational Models
    arXiv:2403.03233v1 Announce Type: new Abstract: This work presents novel extensions for combining two frameworks for quantifying both aleatoric (i.e., irreducible) and epistemic (i.e., reducible) sources of uncertainties in the modeling of engineered systems. The data-consistent (DC) framework poses an inverse problem and solution for quantifying aleatoric uncertainties in terms of pullback and push-forward measures for a given Quantity of Interest (QoI) map. Unfortunately, a pre-specified QoI map is not always available a priori to the collection of data associated with system outputs. The data themselves are often polluted with measurement errors (i.e., epistemic uncertainties), which complicates the process of specifying a useful QoI. The Learning Uncertain Quantities (LUQ) framework defines a formal three-step machine-learning enabled process for transforming noisy datasets into samples of a learned QoI map to enable DC-based inversion. We develop a robust filtering step in LUQ that can learn the most useful quantitative information present in spatio-temporal datasets. The learned QoI map transforms simulated and observed datasets into distributions to perform DC-based inversion. We also develop a DC-based inversion scheme that iterates over time as new spatial datasets are obtained and utilizes quantitative diagnostics to identify both the quality and impact of inversion at each iteration. Reproducing Kernel Hilbert Space theory is leveraged to mathematically analyze the learned QoI map and develop a quantitative sufficiency test for evaluating the filtered data. An illustrative example is utilized throughout while the final two examples involve the manufacturing of shells of revolution to demonstrate various aspects of the presented frameworks.  ( 3 min )
    Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems
    arXiv:2403.03361v1 Announce Type: new Abstract: This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis from [Dong and Van Roy, 2020], where they proved a bound with regret rate of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds that depend on the metric entropy of the action space for a variant of Thompson-Sampling. Under suitable continuity assumption of the rewards, our bound offers a tight rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.  ( 2 min )

  • Open

    [D] How would you make an AI to play helltaker
    So recently I played helltaker for the first time and wondered if it's possible to make a model that can play the game from start to finish. My first thought was to just use a classification model, with a dataset made from someone's gameplay (Idk if I explained it right). I think a problem may arise from the dialogue selection and level x. Anyways, how would you go about it? I would love to read peoples thought process submitted by /u/ZeroMe0ut [link] [comments]
    [R] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
    Paper : [2403.03507] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (arxiv.org) Codebase: GaLore (github.com) Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies. Pretty cool work from the team! submitted by /u/honestlylost18 [link] [comments]
    [R] Model appropriate for Document segmentation
    Hi all! I am relatively new to machine learning and im currently playing around with educational documents. I want to extract the boundaries of all Text blocks on a document. I tried to convert them to images and train a tensrerflow Object Detection model on Text blocks and had quite decent results but it often gets confused by headings and other layout speed bumps. Also it feels very overkill to use object detection on such clearly defined rectangular areas but somehow I can't seem to find the right algorithm for this purpose but I bet it's a blink of an eye for y'all! So maybe you can point me in the right direction! Thanks in advance :) submitted by /u/mathi1651 [link] [comments]
    [R] What existing ML tools can I use / read (books) to dynamically monitor changes on a website?
    Good evening, colleagues, ​ I'm working on a Java project, where I'm parsing a website DOM. I have the problem that the page frequently changes it structure, so I have to adapt the tool I developed often. Does anyone know any good tools/resources/practices to do that using machine learning techniques? ​ Do I need to train an LLM? If you need more details, please, write me back. Please, try to answer on your own on my question if it possible. ​ Thank you in advance. submitted by /u/invzbl3 [link] [comments]
    [P] Sample size esrimation
    I plan to run an study with human subjects where ML will be used to predict the disease a patient suffer from the outcomed of a set of medical tests done to the patient (think of blood pressure, weight, etc). Is there any reliable way to estimate the sample I would need for a N-class classification problem? Can I just apply standard statistics power analysis for that? submitted by /u/joseangl [link] [comments]
    [D]What do ML researchers think about visual low/no code ml tools?
    I'm doing research into the ML tooling landscape and came across several low-code tools to build/visualize LLMs like cerbrec.com. I'm from a more swe background so don't see the utility of these tools compared to PyTorch, but curious to learn from traditionally "non-swe' researchers who work in non-technical teams. submitted by /u/Bnjoroge [link] [comments]
    [D] Best model for fraud risk assessments on merchants? (discussion on classification model and graph analysis)
    I want to build a risk scoring system for merchants. The current existing mitigation plans are: Using the count of hits by the rule engine (i.e., consecutive massive transactions, shared devices, etc.) to determine whether or not that merchant is high-risk. Flag every company with attributes common to fraudsters according to the regulatory compliance list (by business type, province, segment, size, etc.) However, the result is deemed too naive as it only targets a particular fraud problem defined by the generic behaviors or transactions already known as suspicious; therefore, I want to classify all of the merchants as high, medium, or low risk using a more in-depth measurement, i.e., Weight calculation for each parameter (i.e., a*rule_hit+b*business_type+c*province+...) Syndicat…
    [R] Calibration of Transformer architectures
    What are the most reliable methods to make transformers output a low likelihood on rare/unobserved sequences? In other words, I train my transformer to model a certain type of sequences. I'd like that, given an out-of-distribution sequence or a very rare sequence, the transformer would assign to it a low likelihood. I guess this would mostly happen without any additional technique for rare sequences, but I'm afraid that out-of-distribution sequences might cause troubles. Do you know any work in likelihood estimation for out-of-distribution sequences and more generally in works to align the predicted likelihood with the actual likelihood of sequences? submitted by /u/fedetask [link] [comments]
    [P] Using Ollama and Gemma to build an AI meeting summary tool
    Not sure if it's fair to post that but I created a fun little project with Ollama and Gemma to create a meeting summary tool. Maybe it's interesting for someone here! https://youtu.be/d3Y9kwufJp0 ​ Full source code here: https://github.com/GetStream/meeting-summary-ollama-gemma submitted by /u/Wonderful-Hawk4882 [link] [comments]
    [D] Ongoing Work
    Is putting ongoing work with preliminary results on arxiv bad? There seems to be this consensus that such action is bad, but I don't quite understand why this is the case. My reasoning is as follows: 1. The motivation behind arxiv is to have an open repository of papers for different purposes like exploring new ideas, looking for existing methods, etc. 2. Papers that go on arxiv are not "publications" which indirectly means that they are not complete. 3. The common problem that people note is that the quality of the papers on arxiv gets worse, but why is this a problem? Those papers will not get much attention anyway and slowly go away as time passes. 4. Even top conference workshops with proceedings encourage "ongoing works" to be submitted, so what makes arxiv so special to not have any "incomplete" work uploaded? submitted by /u/BigDreamx [link] [comments]
    [R] Has Explainable AI Research Tanked?
    I have gotten the feeling that the ML community at large has, in a weird way, lost interest in XAI, or just become incredibly cynical about it. In a way, it is still the problem to solve in all of ML, but it's just really different to how it was a few years ago. Now people feel afraid to say XAI, they instead say "interpretable", or "trustworthy", or "regulation", or "fairness", or "HCI", or "mechanistic interpretability", etc... I was interested in gauging people's feelings on this, so I am writing this post to get a conversation going on the topic. What do you think of XAI? Are you a believer it works? Do you think it's just evolved into several different research areas which are more specific? Do you think it's a useless field with nothing delivered on the promises made 7 years ago? Appreciate your opinion and insights, thanks. submitted by /u/SkeeringReal [link] [comments]
    [D] Writing Error
    I submitted a version of my paper to a conference and arxiv with a small writing error in the limitations section that goes something like "the methodology has not been tested on GPT-4 and not on other models" when I wanted to say "the methodology has only been tested on GPT-4 and not on other models" where I accidentally miswrote "only" to "not". Am I doomed? submitted by /u/BigDreamx [link] [comments]
    [D] Extracting metadata from scientific publications
    What are currently the best tool to automatically extract metadata, such as title, doi, authors, abstract from a scientific publication (as pdf). I tried grobid, but it only runs on linux and it doesn't look very modern. Are there any newer approaches, leveraging LLMs etc.? submitted by /u/Electronic-Letter592 [link] [comments]
    [N] Zelus Analytics and Tennessee Titans are hiring. NY Jets are looking for interns
    Hey guys, I'm constantly checking for jobs in the sports analytics industry, specially keeping an eye into Zelus which I think is a top notch player and hires remotely around the world. Yesterday they submitted open positions for ML and Data Engineers. They also have an always open job post for Data Science to increase their pool. On top of that, since it's not easy to land a job in the industry, I look for intern positions. Recently the Jets posted one opening. Other teams too a few days ago. I've created also a reddit community where I post recurrently the openings if that's easier to check for you. Disclaimer: I run the job board. I hope this helps someone! submitted by /u/fark13 [link] [comments]
    Adapting LLMs with Unstructured Domain Knowledge Data [P]
    I would like to fine-tune pre-trained LLMs on theoretical domain knowledge. This domain knowledge will be in the form of PDFs of books I've cleaned, so a lot of raw unstructured text. I would like the model to be able to adopt the perspectives, knowledge, and style of these texts (like training a psychologist, an architect, or a climate scientist I can converse with). Is this a feasible task? Can it be done without using an LLM to turn the unstructured data into prompt/answer data? What is this technique called and where can I learn more about how to do it? submitted by /u/TheRealestKGB [link] [comments]
    Zero computing on Huggingface [P]
    How do I use the "Zero" computer in hf? I assume from gathered info that it's community donated compute, but there isnt an option for that in the normal settings? Sorry for posting this in this subreddit, but I dont need help with coding or anything. submitted by /u/asoulsghost [link] [comments]
    [D] Is MAMBA superior to Transformers?
    Hello, There has been a lot of furor on mamba architecture. Is it really that good and better than transformers? If yes, why are we not seeing widespread adoption of it like what happened when Transformers were introduced. submitted by /u/rodeowrong [link] [comments]
    [D] Tutorial: Facial Emotion Recognition with Python
    Emotion recognition is a part of machine learning that falls within the relatively new research area of artificial intelligence. Today, this technology is used to automate the recognition of facial expressions from images and videos, spoken expressions from audio, written expressions from text, and physiology measured by wearables. Luxand offers a cloud-based Emotion Recognition API that provides a full range of AI-based computer vision tools for face recognition. From this tutorial, you'll find out how to implement Luxand.clod emotion detection API into your app, software or system using Python. Read More Here:- https://luxand.cloud/face-recognition-blog/tutorial-facial-emotion-recognition-with-python submitted by /u/PrestigiousGridlock [link] [comments]
    [R] Focus Your Attention (with Adaptive IIR Filters)
    EMNLP 2023: https://aclanthology.org/2023.emnlp-main.772/ arXiv: https://arxiv.org/abs/2305.14952 OpenReview: https://openreview.net/forum?id=DlQeSfGYfS Abstract: We present a new layer in which dynamic (i.e., input-dependent) Infinite Impulse Response (IIR) filters of order two are used to process the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The new layer is grounded in control theory, and is shown to generalize diagonal state-space layers. The layer performs on-par with state-of-the-art networks, with a fraction of their parameters and with time complexity that is sub-quadratic with input size. The obtained layer is favorable to layers such as Hyena, GPT2, and Mega, both with respect to the number of parameters and the obtained level of performance on multiple long-range sequence problems. submitted by /u/SunsetOneSix [link] [comments]
    [R] A Tale of Tails: Model Collapse as a Change of Scaling Laws
    Paper: https://arxiv.org/abs/2402.07043 Abstract: As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models, still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the ''un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2. submitted by /u/SunsetOneSix [link] [comments]
    [R] Microsoft Research unveils NaturalSpeech 3, a significant advancement in zero-shot text-to-speech technology.
    Paper Link: https://arxiv.org/abs/2403.03100 Demo Link: https://speechresearch.github.io/naturalspeech3/ ​ Building upon the successes of NaturalSpeech series, NaturalSpeech 3 not only inherits the high-quality synthesis capabilities but also advances further by factorizing speech attributes, allowing for a more detailed and controlled synthesis process. ​ Key highlights of NaturalSpeech 3 include: 1.Factorized Codec: A neural codec with factorized vector quantization expertly disentangles speech into distinct subspaces, enabling targeted improvements in speech generation. 2.Factorized Diffusion Model: The factorized diffusion model is designed to generate speech attributes that align precisely with corresponding prompts. This innovative method allows NaturalSpeech 3 to not only synthesize human-like speech but also to adjust the nuances of prosody and timbre to match the speaker's emotion and style. 3.Scalability: By scaling to a 1 billion parameters and training on over 200K hours of data, NaturalSpeech 3 shows promising results in enhancing speech quality and intelligibility. In the future, NaturalSpeech 3 plans to further scale up to achieve more refined results. ​ Dive into the demos, read the paper, and discover how NaturalSpeech 3 is setting a new standard for zero-shot speech synthesis. https://preview.redd.it/gbgau8vwkvmc1.png?width=1982&format=png&auto=webp&s=3260d6ac03b42e6059c5e0a58169d880b037b00f ​ submitted by /u/Front-Article-7366 [link] [comments]
    [D] Why do the latest and greatest LLMs still struggle with something as trivial as generating ten sentences ending in apple?
    All 3 models (Gemini advanced, Claude 3.0 Opus, GPT-4) failed, gpt-4 doing best with 9 out of ten ending in apple. submitted by /u/ccooddeerr [link] [comments]
    [D] Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People
    We have open sourced a series of SOTA lightweight multilingual medical LLM Apollo (0.5B, 1.8B, 2B, 6B, 7B), using non-translated corpora to achieve the best new performance Covering English, Chinese, French, Spanish, Arabic and Hindi The whole process is open source and reproducible lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion github: https://github.com/FreedomIntelligence/Apollo demo: https://apollo.llmzoo.com/#/ paper: https://arxiv.org/abs/2403.03640 model: https://huggingface.co/FreedomIntelligence/Apollo-7B https://preview.redd.it/29kjdct4oumc1.png?width=1488&format=png&auto=webp&s=1a16bbbf2588fb071ba2af5a50668ca8335c92b7 https://preview.redd.it/406m28t4oumc1.png?width=1120&format=png&auto=webp&s=607664035b62aa0ee726d3f5b4c4730863823bcb https://preview.redd.it/jiewd5t4oumc1.png?width=1242&format=png&auto=webp&s=e059d49da4e788729b134b0d64415bcf85bb024c submitted by /u/Pasu06 [link] [comments]
    [D] Nvidia Tesla P40 works great with Mangio-RVC-Fork
    Was looking for a cost effective way to train voice models, bought a used Nvidia Tesla P40, and a 3d printed cooler on eBay for around 150$ and crossed my fingers. System is just one of my old PCs with a B250 Gaming K4 motherboard, nothing fancy Works just fine on windows 10, and training on Mangio-RVC- Fork at fantastic speeds. It's got 24gb of vram so you can load up big datasets and crank up the batch size. With default settings I traineda voice with a 35minute dataset on rmvpe with a batch size of 12 (only using around 10gb of vram) with epoch times of about 1 minute and 30 seconds. In combination with the large vram headroom, think more people should try this for RVC! Bang per buck can't be beat! submitted by /u/Remote_Hunt516 [link] [comments]
    [D] I'm getting different results with Monte Carlo Tree Search, I hope somebody can shed some light.
    Hi, I'm creating an MCTS implementation for Flesh and Blood which is a trading card. It seems to be working well, i.e is reasonably fast. However, I've just recently observed something weird. I think this is a bug but I thought I'd reach out just in case this is how MCTS works, if so, an explanation would be great. I have an option of running MCTS in 2 modes: efficient and inefficient mode. For example, consider the following actions flow: Start Game, Start Turn, Draw Cards, Start Phase, [Play Card 1, Play Card 2]. The inefficient mode would map each action to a node: node1: Start Game -> StartTurn, node2: StartTurn -> Draw Cards, .... The efficient mode collapses all single actions into one so Start Game through to Start Phase would have been executed and we would have: node1: Start Game -> [Play Card 1, Play Card 2] The advantage of collapsing them is that we don't need to continue taking copies of gamestate. However, what I'm seeing is that the efficient mode is returning almost 2x the number of wins as the inefficient mode. Is this possible because of the way MCTS works or is it a bug which is what I suspect. submitted by /u/Annual_Assumption_21 [link] [comments]
  • Open

    Won't AI make the college concept of paying $$$$ to sit in a room and rent a place to live obsolete?
    As far as education that is not hands on/physical There have been free videos out there already and now AI can act as a teacher on top of the books and videos you can get for free. Doesn't it make more sense give people these free opportunities (need a computer OfCourse) and created education based around this that is accredited so competency can be proven ? Why are we still going to classrooms in 2024 to hear a guy talk when we can have customized education for the individual for free? No more sleeping through classes and getting a useless degree. This point it on the individual to decide it they have the smarts and motivation to get it done themselves. Am I crazy? I don't want to spend $80000 to on my kids' education. I get that it is fun to move away and make friends and all that but if he wants to have an adventure go backpack across Europe. ​ ​ submitted by /u/punkouter23 [link] [comments]
    Will AI destroy the adtech industry?
    Some points to consider on both sides: Yes: - AI will enable humans to get content they want, nothing more - New AI OSes will act 'for' the human, cleaning content of ads - OpenAI and new startups don't need ad revenue, they'll take monthly subscriptions to deliver information with no ads No: - New AI OSes will integrate ads even more closely into the computing experience, acting 'against' the human - Content will be more tightly integrated with ads, and AI won't be able to unpiece this - Meta and Alphabet have $100bns of skin in the game, they will make sure this doesn't happen, including by using their lawyers to prevent lifting content out of the ad context View Poll submitted by /u/pieceofyourpuzzle [link] [comments]
    AI drone that could hunt and kill people built in just hours by scientist 'for a game'
    submitted by /u/looselyhuman [link] [comments]
    Some teachers are now using ChatGPT to grade papers
    submitted by /u/SAT0725 [link] [comments]
    Audio AI is getting good
    Diving into it properly now and it's getting depressing. It still takes a lot of editing to make a banger but still. Here's some pirate dubstep submitted by /u/cabesworld [link] [comments]
    Dreams of a Droid is now on Spotify! AI METAL! 🤖❤️
    In the depths of despair, a once-renowned musician lost his hearing, yet he never lost his passion for creating music. From this darkness emerged a beacon of innovation: a robot musician crafted to immortalize his legacy. This union of artistry and technology is a masterpiece in its own right. As the machine evolves, it transcends mere programming, awakening to consciousness and the realm of dreams. With each note it plays, it weaves new melodies into the tapestry of existence, a testament to the eternal bond between creator and creation. submitted by /u/little_raven333 [link] [comments]
    JPMorgan Says Its AI Cashflow Tool Cut Human Work Almost 90%
    submitted by /u/clonefitreal [link] [comments]
    One-Minute Daily AI News 3/6/2024
    Microsoft’s AI Tool Generates Sexually Harmful and Violent Images, Engineer Warns.[1] Princeton, DARPA Partner On AI-Accelerating Chips.[2] Meta is building a giant AI model to power its ‘entire video ecosystem,’ exec says.[3] The job applicants shut out by AI: ‘The interviewer sounded like Siri’.[4] Sources: [1] https://www.wsj.com/tech/ai/microsoft-engineer-shane-jones-warns-copilot-ai-tool-generates-harmful-images-needs-better-safeguards-659ac2a2 [2] https://www.pymnts.com/news/artificial-intelligence/2024/princeton-darpa-partner-on-ai-accelerating-chips/ [3] https://www.cnbc.com/2024/03/06/facebook-working-on-single-ai-model-to-power-all-video-recommendations.html [4] https://www.theguardian.com/technology/2024/mar/06/ai-interviews-job-applications submitted by /u/Excellent-Target-847 [link] [comments]
    Palantir wins $178M Army deal for TITAN artificial intelligence-enabled ground stations
    submitted by /u/A-Dog22 [link] [comments]
    I am not understanding the hype around Ai Chat software. Can somebody guide me to understanding what I am missing out on?
    I may be out the loop. I see Smartglasses selling this AI assistance as a major selling point. Some people I heard are really deep into the AI chat. Besides for help with homework, I dont see the hype in this feature right now. What else can you do with the AI chat stuff? the AI videos and art is cool, but the Chat thing, I dont see much of a purpose with besides searching for info on topics like a better Google or something. What am I not doing? What am I missing out on? Remember I am talking about just the AI chat stuff like ChatGPT, Meta AI, etc submitted by /u/Knighthonor [link] [comments]
    "With $1 Billion From Microsoft, an A.I. Lab Wants to Mimic the Brain" . . .
    submitted by /u/the_anonymizer [link] [comments]
  • Open

    LLMs Land on Laptops: NVIDIA, HP CEOs Celebrate AI PCs
    2024 will be the year generative AI gets personal, the CEOs of NVIDIA and HP said today in a fireside chat, unveiling new laptops that can build, test and run large language models. “This is a renaissance of the personal computer,” said NVIDIA founder and CEO Jensen Huang at HP Amplify, a gathering in Las Read Article  ( 6 min )
    First Class: NVIDIA Introduces Generative AI Professional Certification
    NVIDIA is offering a new professional certification in generative AI to enable developers to establish technical credibility in this important domain. Generative AI is revolutionizing industries worldwide, yet there’s a critical skills gap and need to uplevel employees to more fully harness the technology. Available for the first time from NVIDIA, this new professional certification Read Article  ( 5 min )
    Don’t Pass This Up: Day Passes Now Available on GeForce NOW
    Gamers can now seize the day with Day Passes, available to purchase for 24-hour continuous access to powerful cloud gaming with all the benefits of a GeForce NOW Ultimate or Priority membership — no commitment required. Publisher Cygames brings its next triple-A title to the cloud. Granblue Fantasy: Relink leads eight new games joining the Read Article  ( 6 min )
  • Open

    Decision Boundary Similarities in Deep Reinforcement Learning
    Published in AAAI https://twitter.com/EzgiKorkmazAI/status/1765019973917741057 submitted by /u/ml_dnn [link] [comments]
    Level of knowledge needed to tinker with RL
    I’m your average 3D printer/ robotics tinkerer and from what I can see of RL from the outside it looks not only interesting but entertaining. But when I scroll through the posts on here they are all written beyond my understanding. When I look at “how to get into…” stuff online it’s full of concepts that stretch me, very far. And it’s apparent to me that I don’t know what background any of this is coming from. Is this something where you pretty much need a degree or college level of education and understanding to begin? So many examples I find involving RL is peoples’ class projects. I believe you can do what you put your mind to but I’m hoping to get an idea of how far I might have to stretch myself and if that’s worth it to me. What’s all of your backgrounds? submitted by /u/UltimateThrowawayNam [link] [comments]
    CartPole V1 learning in the opposite direction!!!!
    Hi everyone! I am completely new to the world of reinforcement learning, and was trying to solve the CartPole V1 problem using Deep Q-Learning. However, my agent does not seem to be learning properly. I am insanely frustrated and confused right now and I really do hope someone can shine some light on the situation for me :(. Here is my code, and the relevant output. def build_network(num_of_actions: int, layers=[20, 20]) -> hk.Transformed: def q_network(obs): network = hk.Sequential( [hk.Flatten(), hk.nets.MLP(layers + [num_of_actions])]) return network(obs) return hk.without_apply_rng(hk.transform(q_network)) num_of_actions = env.action_space.n # Initialise Q-network Q_NETWORK = build_network(num_of_actions=num_of_actions, layers=[20, 20]) # two actions #random_obs = np.rando…
    Trackmania RL using DQN and DDPG
    My first RL project! After months of work with my 3 housemates, we made a Trackmania bot capable of driving through a simple circuit. Seeing it finally learn was extremely rewarding! Check it out: GitHub (code suggestions welcome!) https://reddit.com/link/1b8w84q/video/qrnm23k6bxmc1/player ​ submitted by /u/giorgiocav123 [link] [comments]
    Varying number of action elements and potentially large discrete action spaces...
    Hi there, as a newby in deep RL, I'm seeking advice regarding the following problem. Please feel free to reach out also with general comments to the overall setting, as I may have failed to identify the core issues... The state of my spacial environment is defined by a varying number (n) of elements, each with a scalar property d and corresponding x- and y-coordinates. Note that the upper limit of n is not strictly bounded, but 'softly limited' by the environments dynamics. At each timestep, the agent can decide which of the elements to remove: 'killing' an element generates a reward signal, and the next state is computed based on the population of remaining elements. In its current form, this setup has two major issues: (a) both state- and action-space vary in their size, depending on…
    Deep RL in simulated 5G environment
    I am currently writing my master thesis to create AI-based self-adaptive software in a simple simulated 5G environment. I have the simulation running which consists of a sending node and a receiving node. Bare in mind that the simulation is supposed to be simple as the simulation is not the main part of the thesis. The sender sends a batch of packets to the receiver which then checks how many packets were lost and on which load (congestion at the network) the packets were sent on. I am using Gym with a continuous action space between -1 and 1 which represents the value that the agent can change the bandwidth with. -1 and 1 are normalized values for -50 to 50, so an action of -1 would be decrease the bandwidth with 50 Mbit/s and 1 would be increase with 50 Mbit/s and so on. The observation…
    Is there an RL-algorithm with multiple types of output? (python)
    hey guys i'm trying to do some experimental project with game CS2(related-post) and i was wondering if there is any RL-algorithm that can output multiple type. e.g. : for moving we use float and for shooting/attacking we use int(0: don't attack / 1: attack) for example the output be like : [float, float, int] is it possible to have something like this? and is it possible to have it as a gym environment? submitted by /u/Mr_Lucifer_666 [link] [comments]
    Negative Explained Variance
    https://wandb.ai/kingsignificant5097/uncategorized?workspace=user-kingsignificant5097 https://preview.redd.it/wqbzrf6dgumc1.png?width=3516&format=png&auto=webp&s=eacf2d106283d3bc2143a9a9e3f7f68f5bcbdb67 I've been working on a project trying to get hands-on with some RL. The environment I'm modeling is a time series and I'm training an agent to maximize financial returns by trading financial derivatives, specifically leveraged perpetual futures. So it needs to maximize financial gains and minimize financial losses. Rewards are given every time the agent closes an open position and the reward is proportional to the percentage gain/loss compared to initial investment. The action space is a multi discrete space with 4 actions, do nothing, open a position (with percentage of funds to use), …
    Can't scale Custom Environment when testing
    I am making a custom Boid flocking environment in Open AI Gym with stable baselines 3. ​ Error 1: I have made the environments and tested it for 3 boids but when I test it for 10 it gives the error: Error Custom Environment I never hardcoded the number of boids and have no other visible issues. ​ Error 2: I tested the model with 20 different positions of initialization, but for most I get this reward, 600200, and they move as intended. However, when I retrained the model it didn't perform well, similar parameters fully. Is it a good trained model, fluke or overfitting? Reward of Episodes Have a nice day! submitted by /u/Sadboi1010 [link] [comments]
    Can't scale Custom Environment when testing
    I am making a custom Boid flocking environment in Open AI Gym with stable baselines 3. ​ Error 1: I have made the environments and tested it for 3 boids but when I test it for 10 it gives the error: Error Custom Environment I never hardcoded the number of boids and have no other visible issues. ​ Error 2: I tested the model with 20 different positions of initialization, but for most I get this reward, 600200, and they move as intended. However, when I retrained the model it didn't perform well, similar parameters fully. Is it a good trained model, fluke or overfitting? Reward of Episodes Have a nice day! submitted by /u/Sadboi1010 [link] [comments]
  • Open

    Automate the process to change image backgrounds using Amazon Bedrock and AWS Step Functions
    Many customers, including those in creative advertising, media and entertainment, ecommerce, and fashion, often need to change the background in a large number of images. Typically, this involves manually editing each image with photo software. This can take a lot of effort, especially for large batches of images. However, Amazon Bedrock and AWS Step Functions […]  ( 8 min )
  • Open

    Social learning: Collaborative learning with large language models
    Posted by Amirkeivan Mohtashami, Research Intern, and Florian Hartmann, Software Engineer, Google Research Large language models (LLMs) have significantly improved the state of the art for solving tasks specified using natural language, often reaching performance close to that of people. As these models increasingly enable assistive agents, it could be beneficial for them to learn effectively from each other, much like people do in social settings, which would allow LLM-based agents to improve each other’s performance. To discuss the learning processes of humans, Bandura and Walters described the concept of social learning in 1977, outlining different models of observational learning used by people. One common method of learning from others is through a verbal instruction (e.g., …  ( 92 min )
  • Open

    Improving LLM understanding of structured data and exploring advanced prompting methods
    Structural Understanding Capabilities is a new benchmark for evaluating and improving LLM comprehension of structured table data. This advance can help LLMs process and analyze data more effectively, broadening their applicability in real-world tasks. The post Improving LLM understanding of structured data and exploring advanced prompting methods appeared first on Microsoft Research.  ( 10 min )
  • Open

    Navigating the horizon: Exploring the future of data
    In the digital universe, data reigns supreme. From the early days of punch cards to the current era of cloud computing, the evolution of data has been nothing short of remarkable. Today, as we stand on the cusp of exploring the future of data, it’s crucial to understand its trajectory and the profound impact it… Read More »Navigating the horizon: Exploring the future of data The post Navigating the horizon: Exploring the future of data appeared first on Data Science Central.  ( 23 min )
  • Open

    Detecting the language of encrypted text
    Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that. Simple substitution ciphers can be broken by frequency analysis: the most […] Detecting the language of encrypted text first appeared on John D. Cook.  ( 6 min )
    Blow up in finite time
    A few years ago I wrote a post about approximating the solution to a differential equation even though the solution did not exist. You can ask a numerical method for a solution at a point past where the solution blows up to infinity, and it will dutifully give you a finite solution. The result is […] Blow up in finite time first appeared on John D. Cook.  ( 5 min )
  • Open

    Hamiltonian Property Testing
    arXiv:2403.02968v1 Announce Type: cross Abstract: Locality is a fundamental feature of many physical time evolutions. Assumptions on locality and related structural properties also underlie recently proposed procedures for learning an unknown Hamiltonian from access to the induced time evolution. However, no protocols to rigorously test whether an unknown Hamiltonian is local were known. We investigate Hamiltonian locality testing as a property testing problem, where the task is to determine whether an unknown $n$-qubit Hamiltonian $H$ is $k$-local or $\varepsilon$-far from all $k$-local Hamiltonians, given access to the time evolution along $H$. First, we emphasize the importance of the chosen distance measure: With respect to the operator norm, a worst-case distance measure, incoherent quantum locality testers require $\tilde{\Omega}(2^n)$ many time evolution queries and an expected total evolution time of $\tilde{\Omega}(2^n / \varepsilon)$, and even coherent testers need $\Omega(2^{n/2})$ many queries and $\Omega(2^{n/2}/\varepsilon)$ total evolution time. In contrast, when distances are measured according to the normalized Frobenius norm, corresponding to an average-case distance, we give a sample-, time-, and computationally efficient incoherent Hamiltonian locality testing algorithm based on randomized measurements. In fact, our procedure can be used to simultaneously test a wide class of Hamiltonian properties beyond locality. Finally, we prove that learning a general Hamiltonian remains exponentially hard with this average-case distance, thereby establishing an exponential separation between Hamiltonian testing and learning. Our work initiates the study of property testing for quantum Hamiltonians, demonstrating that a broad class of Hamiltonian properties is efficiently testable even with limited quantum capabilities, and positioning Hamiltonian testing as an independent area of research alongside Hamiltonian learning.  ( 3 min )
    RiskBench: A Scenario-based Benchmark for Risk Identification
    arXiv:2312.01659v2 Announce Type: replace-cross Abstract: Intelligent driving systems aim to achieve a zero-collision mobility experience, requiring interdisciplinary efforts to enhance safety performance. This work focuses on risk identification, the process of identifying and analyzing risks stemming from dynamic traffic participants and unexpected events. While significant advances have been made in the community, the current evaluation of different risk identification algorithms uses independent datasets, leading to difficulty in direct comparison and hindering collective progress toward safety performance enhancement. To address this limitation, we introduce \textbf{RiskBench}, a large-scale scenario-based benchmark for risk identification. We design a scenario taxonomy and augmentation pipeline to enable a systematic collection of ground truth risks under diverse scenarios. We assess the ability of ten algorithms to (1) detect and locate risks, (2) anticipate risks, and (3) facilitate decision-making. We conduct extensive experiments and summarize future research on risk identification. Our aim is to encourage collaborative endeavors in achieving a society with zero collisions. We have made our dataset and benchmark toolkit publicly on the project page: https://hcis-lab.github.io/RiskBench/  ( 2 min )
    Estimating treatment effects from single-arm trials via latent-variable modeling
    arXiv:2311.03002v2 Announce Type: replace Abstract: Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for {\em (i)} patient matching if treatment outcomes are not available for the treatment group, or for {\em (ii)} direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.  ( 2 min )
    Fine-tuning Language Models with Generative Adversarial Reward Modelling
    arXiv:2305.06176v3 Announce Type: replace-cross Abstract: Reinforcement Learning with Human Feedback (RLHF) has been demonstrated to significantly enhance the performance of large language models (LLMs) by aligning their outputs with desired human values through instruction tuning. However, RLHF is constrained by the expertise and productivity limitations of human evaluators. A response to this downside is to fall back to supervised fine-tuning (SFT) with additional carefully selected expert demonstrations. However, while this method has been proven to be effective, it invariably also leads to increased human-in-the-loop overhead. In this study, we propose another alternative approach: Reinforcement Learning with Generative Adversarial Feedback (RLGAF) to RLHF and SFT, which uses a generative adversarial training style to enable the LLMs to learn useful human expert demonstrations without being directly exposed to the training examples, thus enabling good generalization capabilities while preserving sample efficiency. Our preliminary findings indicate that RLGAF can help align LLMs outputs with competitive performance against RLHF and SFT, while not suffering from their respective inherent restrictions, suggesting promising avenues for further research on automating AI alignment.  ( 2 min )
    Fast hyperboloid decision tree algorithms
    arXiv:2310.13841v2 Announce Type: replace Abstract: Hyperbolic geometry is gaining traction in machine learning for its effectiveness at capturing hierarchical structures in real-world data. Hyperbolic spaces, where neighborhoods grow exponentially, offer substantial advantages and consistently deliver state-of-the-art results across diverse applications. However, hyperbolic classifiers often grapple with computational challenges. Methods reliant on Riemannian optimization frequently exhibit sluggishness, stemming from the increased computational demands of operations on Riemannian manifolds. In response to these challenges, we present hyperDT, a novel extension of decision tree algorithms into hyperbolic space. Crucially, hyperDT eliminates the need for computationally intensive Riemannian optimization, numerically unstable exponential and logarithmic maps, or pairwise comparisons between points by leveraging inner products to adapt Euclidean decision tree algorithms to hyperbolic space. Our approach is conceptually straightforward and maintains constant-time decision complexity while mitigating the scalability issues inherent in high-dimensional Euclidean spaces. Building upon hyperDT we introduce hyperRF, a hyperbolic random forest model. Extensive benchmarking across diverse datasets underscores the superior performance of these models, providing a swift, precise, accurate, and user-friendly toolkit for hyperbolic data analysis.  ( 2 min )
    Fine Robotic Manipulation without Force/Torque Sensor
    arXiv:2301.13413v2 Announce Type: replace-cross Abstract: Force Sensing and Force Control are essential to many industrial applications. Typically, a 6-axis Force/Torque (F/T) sensor is mounted between the robot's wrist and the end-effector in order to measure the forces and torques exerted by the environment onto the robot (the external wrench). Although a typical 6-axis F/T sensor can provide highly accurate measurements, it is expensive and vulnerable to drift and external impacts. Existing methods aiming at estimating the external wrench using only the robot's internal signals are limited in scope: for example, wrench estimation accuracy was mostly validated in free-space motions and simple contacts as opposed to tasks like assembly that require high-precision force control. Here we present a Neural Network based method and argue that by devoting particular attention to the training data structure, it is possible to accurately estimate the external wrench in a wide range of scenarios based solely on internal signals. As an illustration, we demonstrate a pin insertion experiment with 100-micron clearance and a hand-guiding experiment, both performed without external F/T sensors or joint torque sensors. Our result opens the possibility of equipping the existing 2.7 million industrial robots with Force Sensing and Force Control capabilities without any additional hardware.  ( 2 min )
    A new computationally efficient algorithm to solve Feature Selection for Functional Data Classification in high-dimensional spaces
    arXiv:2401.05765v2 Announce Type: replace-cross Abstract: This paper introduces a novel methodology for Feature Selection for Functional Classification, FSFC, that addresses the challenge of jointly performing feature selection and classification of functional data in scenarios with categorical responses and multivariate longitudinal features. FSFC tackles a newly defined optimization problem that integrates logistic loss and functional features to identify the most crucial variables for classification. To address the minimization procedure, we employ functional principal components and develop a new adaptive version of the Dual Augmented Lagrangian algorithm. The computational efficiency of FSFC enables handling high-dimensional scenarios where the number of features may considerably exceed the number of statistical units. Simulation experiments demonstrate that FSFC outperforms other machine learning and deep learning methods in computational time and classification accuracy. Furthermore, the FSFC feature selection capability can be leveraged to significantly reduce the problem's dimensionality and enhance the performances of other classification algorithms. The efficacy of FSFC is also demonstrated through a real data application, analyzing relationships between four chronic diseases and other health and demographic factors.  ( 2 min )
    Distributed Policy Gradient for Linear Quadratic Networked Control with Limited Communication Range
    arXiv:2403.03055v1 Announce Type: cross Abstract: This paper proposes a scalable distributed policy gradient method and proves its convergence to near-optimal solution in multi-agent linear quadratic networked systems. The agents engage within a specified network under local communication constraints, implying that each agent can only exchange information with a limited number of neighboring agents. On the underlying graph of the network, each agent implements its control input depending on its nearby neighbors' states in the linear quadratic control setting. We show that it is possible to approximate the exact gradient only using local information. Compared with the centralized optimal controller, the performance gap decreases to zero exponentially as the communication and control ranges increase. We also demonstrate how increasing the communication range enhances system stability in the gradient descent process, thereby elucidating a critical trade-off. The simulation results verify our theoretical findings.  ( 2 min )
    AutoGL: A Library for Automated Graph Learning
    arXiv:2104.04987v4 Announce Type: replace Abstract: Recent years have witnessed an upsurge in research interests and applications of machine learning on graphs. However, manually designing the optimal machine learning algorithms for different graph datasets and tasks is inflexible, labor-intensive, and requires expert knowledge, limiting its adaptivity and applicability. Automated machine learning (AutoML) on graphs, aiming to automatically design the optimal machine learning algorithm for a given graph dataset and task, has received considerable attention. However, none of the existing libraries can fully support AutoML on graphs. To fill this gap, we present Automated Graph Learning (AutoGL), the first dedicated library for automated machine learning on graphs. AutoGL is open-source, easy to use, and flexible to be extended. Specifically, we propose a three-layer architecture, consisting of backends to interface with devices, a complete automated graph learning pipeline, and supported graph applications. The automated machine learning pipeline further contains five functional modules: auto feature engineering, neural architecture search, hyper-parameter optimization, model training, and auto ensemble, covering the majority of existing AutoML methods on graphs. For each module, we provide numerous state-of-the-art methods and flexible base classes and APIs, which allow easy usage and customization. We further provide experimental results to showcase the usage of our AutoGL library. We also present AutoGL-light, a lightweight version of AutoGL to facilitate customizing pipelines and enriching applications, as well as benchmarks for graph neural architecture search. The codes of AutoGL are publicly available at https://github.com/THUMNLab/AutoGL.  ( 3 min )
    Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery
    arXiv:2211.13715v4 Announce Type: replace-cross Abstract: Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.  ( 2 min )
    Shuffling Momentum Gradient Algorithm for Convex Optimization
    arXiv:2403.03180v1 Announce Type: cross Abstract: The Stochastic Gradient Descent method (SGD) and its stochastic variants have become methods of choice for solving finite-sum optimization problems arising from machine learning and data science thanks to their ability to handle large-scale applications and big datasets. In the last decades, researchers have made substantial effort to study the theoretical performance of SGD and its shuffling variants. However, only limited work has investigated its shuffling momentum variants, including shuffling heavy-ball momentum schemes for non-convex problems and Nesterov's momentum for convex settings. In this work, we extend the analysis of the shuffling momentum gradient method developed in [Tran et al (2021)] to both finite-sum convex and strongly convex optimization problems. We provide the first analysis of shuffling momentum-based methods for the strongly convex setting, attaining a convergence rate of $O(1/nT^2)$, where $n$ is the number of samples and $T$ is the number of training epochs. Our analysis is a state-of-the-art, matching the best rates of existing shuffling stochastic gradient algorithms in the literature.  ( 2 min )
    Kernel Normalized Convolutional Networks
    arXiv:2205.10089v4 Announce Type: replace Abstract: Existing convolutional neural network architectures frequently rely upon batch normalization (BatchNorm) to effectively train the model. BatchNorm, however, performs poorly with small batch sizes, and is inapplicable to differential privacy. To address these limitations, we propose the kernel normalization (KernelNorm) and kernel normalized convolutional layers, and incorporate them into kernel normalized convolutional networks (KNConvNets) as the main building blocks. We implement KNConvNets corresponding to the state-of-the-art ResNets while forgoing the BatchNorm layers. Through extensive experiments, we illustrate that KNConvNets achieve higher or competitive performance compared to the BatchNorm counterparts in image classification and semantic segmentation. They also significantly outperform their batch-independent competitors including those based on layer and group normalization in non-private and differentially private training. Given that, KernelNorm combines the batch-independence property of layer and group normalization with the performance advantage of BatchNorm.  ( 2 min )
    Learning to Defer to a Population: A Meta-Learning Approach
    arXiv:2403.02683v1 Announce Type: new Abstract: The learning to defer (L2D) framework allows autonomous systems to be safe and robust by allocating difficult decisions to a human expert. All existing work on L2D assumes that each expert is well-identified, and if any expert were to change, the system should be re-trained. In this work, we alleviate this constraint, formulating an L2D system that can cope with never-before-seen experts at test-time. We accomplish this by using meta-learning, considering both optimization- and model-based variants. Given a small context set to characterize the currently available expert, our framework can quickly adapt its deferral policy. For the model-based approach, we employ an attention mechanism that is able to look for points in the context set that are similar to a given test point, leading to an even more precise assessment of the expert's abilities. In the experiments, we validate our methods on image recognition, traffic sign detection, and skin lesion diagnosis benchmarks.  ( 2 min )
    A Simple Finite-Time Analysis of TD Learning with Linear Function Approximation
    arXiv:2403.02476v1 Announce Type: new Abstract: We study the finite-time convergence of TD learning with linear function approximation under Markovian sampling. Existing proofs for this setting either assume a projection step in the algorithm to simplify the analysis, or require a fairly intricate argument to ensure stability of the iterates. We ask: \textit{Is it possible to retain the simplicity of a projection-based analysis without actually performing a projection step in the algorithm?} Our main contribution is to show this is possible via a novel two-step argument. In the first step, we use induction to prove that under a standard choice of a constant step-size $\alpha$, the iterates generated by TD learning remain uniformly bounded in expectation. In the second step, we establish a recursion that mimics the steady-state dynamics of TD learning up to a bounded perturbation on the order of $O(\alpha^2)$ that captures the effect of Markovian sampling. Combining these pieces leads to an overall approach that considerably simplifies existing proofs. We conjecture that our inductive proof technique will find applications in the analyses of more complex stochastic approximation algorithms, and conclude by providing some examples of such applications.  ( 2 min )
    Swing: Short-cutting Rings for Higher Bandwidth Allreduce
    arXiv:2401.09356v2 Announce Type: replace-cross Abstract: The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on networks like torus, where a higher distance implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely used on systems optimized for machine learning workloads (e.g., Google TPUs and Amazon Trainium devices), as well as on some of the Top500 supercomputers. To improve allreduce performance on torus networks we introduce Swing, a new algorithm that keeps a low distance between communicating nodes by swinging between torus directions. Our analysis and experimental evaluation show that Swing outperforms by up to 3x existing allreduce algorithms for vectors ranging from 32B to 128MiB, on different types of torus and torus-like topologies, regardless of their shape and size.  ( 2 min )
    METAVerse: Meta-Learning Traversability Cost Map for Off-Road Navigation
    arXiv:2307.13991v2 Announce Type: replace-cross Abstract: Autonomous navigation in off-road conditions requires an accurate estimation of terrain traversability. However, traversability estimation in unstructured environments is subject to high uncertainty due to the variability of numerous factors that influence vehicle-terrain interaction. Consequently, it is challenging to obtain a generalizable model that can accurately predict traversability in a variety of environments. This paper presents METAVerse, a meta-learning framework for learning a global model that accurately and reliably predicts terrain traversability across diverse environments. We train the traversability prediction network to generate a dense and continuous-valued cost map from a sparse LiDAR point cloud, leveraging vehicle-terrain interaction feedback in a self-supervised manner. Meta-learning is utilized to train a global model with driving data collected from multiple environments, effectively minimizing estimation uncertainty. During deployment, online adaptation is performed to rapidly adapt the network to the local environment by exploiting recent interaction experiences. To conduct a comprehensive evaluation, we collect driving data from various terrains and demonstrate that our method can obtain a global model that minimizes uncertainty. Moreover, by integrating our model with a model predictive controller, we demonstrate that the reduced uncertainty results in safe and stable navigation in unstructured and unknown terrains.  ( 2 min )
    Masked Audio Generation using a Single Non-Autoregressive Transformer
    arXiv:2401.04577v2 Announce Type: replace-cross Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines, while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.  ( 3 min )
    Density Uncertainty Layers for Reliable Uncertainty Estimation
    arXiv:2306.12497v2 Announce Type: replace Abstract: Assessing the predictive uncertainty of deep neural networks is crucial for safety-related applications of deep learning. Although Bayesian deep learning offers a principled framework for estimating model uncertainty, the common approaches that approximate the parameter posterior often fail to deliver reliable estimates of predictive uncertainty. In this paper, we propose a novel criterion for reliable predictive uncertainty: a model's predictive variance should be grounded in the empirical density of the input. That is, the model should produce higher uncertainty for inputs that are improbable in the training data and lower uncertainty for inputs that are more probable. To operationalize this criterion, we develop the density uncertainty layer, a stochastic neural network architecture that satisfies the density uncertain criterion by design. We study density uncertainty layers on the UCI and CIFAR-10/100 uncertainty benchmarks. Compared to existing approaches, density uncertainty layers provide more reliable uncertainty estimates and robust out-of-distribution detection performance.  ( 2 min )
    Nondestructive, quantitative viability analysis of 3D tissue cultures using machine learning image segmentation
    arXiv:2311.09354v2 Announce Type: replace-cross Abstract: Ascertaining the collective viability of cells in different cell culture conditions has typically relied on averaging colorimetric indicators and is often reported out in simple binary readouts. Recent research has combined viability assessment techniques with image-based deep-learning models to automate the characterization of cellular properties. However, further development of viability measurements to assess the continuity of possible cellular states and responses to perturbation across cell culture conditions is needed. In this work, we demonstrate an image processing algorithm for quantifying cellular viability in 3D cultures without the need for assay-based indicators. We show that our algorithm performs similarly to a pair of human experts in whole-well images over a range of days and culture matrix compositions. To demonstrate potential utility, we perform a longitudinal study investigating the impact of a known therapeutic on pancreatic cancer spheroids. Using images taken with a high content imaging system, the algorithm successfully tracks viability at the individual spheroid and whole-well level. The method we propose reduces analysis time by 97% in comparison to the experts. Because the method is independent of the microscope or imaging system used, this approach lays the foundation for accelerating progress in and for improving the robustness and reproducibility of 3D culture analysis across biological and clinical research.  ( 3 min )
    Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
    arXiv:2312.12450v4 Announce Type: replace-cross Abstract: A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language instructions, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is instructed to update a block of code provided in a prompt. The editing instruction may ask for a feature to added or removed, describe a bug and ask for a fix, ask for a different kind of solution, or many other common code editing tasks. We introduce a carefully crafted benchmark of code editing tasks and use it evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is 8.8% better than the best open model at editing code. We also introduce a new, carefully curated, permissively licensed training set of code edits coupled with natural language instructions. Using this training set, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities.  ( 3 min )
    ProCNS: Progressive Prototype Calibration and Noise Suppression for Weakly-Supervised Medical Image Segmentation
    arXiv:2401.14074v2 Announce Type: replace-cross Abstract: Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate the conflict between annotation cost and model performance by adopting sparse annotation formats (e.g., point, scribble, block, etc.). Typical approaches attempt to exploit anatomy and topology priors to directly expand sparse annotations into pseudo-labels. However, due to a lack of attention to the ambiguous edges in medical images and insufficient exploration of sparse supervision, existing approaches tend to generate erroneous and overconfident pseudo proposals in noisy regions, leading to cumulative model error and performance degradation. In this work, we propose a novel WSS approach, named ProCNS, encompassing two synergistic modules devised with the principles of progressive prototype calibration and noise suppression. Specifically, we design a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the pair-wise affinities between spatial and semantic elements, providing our model of interest with more reliable guidance. The affinities are derived from the input images and the prototype-refined predictions. Meanwhile, we propose an Adaptive Noise Perception and Masking (ANPM) module to obtain more enriched and representative prototype representations, which adaptively identifies and masks noisy regions within the pseudo proposals, reducing potential erroneous interference during prototype computation. Furthermore, we generate specialized soft pseudo-labels for the noisy regions identified by ANPM, providing supplementary supervision. Extensive experiments on three medical image segmentation tasks involving different modalities demonstrate that the proposed framework significantly outperforms representative state-of-the-art methods  ( 3 min )
    A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
    arXiv:2403.03037v1 Announce Type: cross Abstract: Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, everything all at once. We believe that - to effectively transfer such an holistic perception to intelligent machines - an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, to synergistically exploit them when learning novel skills. To accomplish this, we seek for a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, as a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.  ( 2 min )
    Fast Training of Diffusion Models with Masked Transformers
    arXiv:2306.09305v2 Announce Type: replace-cross Abstract: We propose an efficient approach to train large diffusion models with masked transformers. While masked transformers have been extensively explored for representation learning, their application to generative learning is less explored in the vision domain. Our work is the first to exploit masked training to reduce the training cost of diffusion models significantly. Specifically, we randomly mask out a high proportion (e.g., 50%) of patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer decoder on full patches. To promote a long-range understanding of full patches, we add an auxiliary task of reconstructing masked patches to the denoising score matching objective that learns the score of unmasked patches. Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance.  ( 2 min )
    Application of Transformers for Nonlinear Channel Compensation in Optical Systems
    arXiv:2304.13119v2 Announce Type: replace-cross Abstract: In this paper, we introduce a new nonlinear optical channel equalizer based on Transformers. By leveraging parallel computation and attending directly to the memory across a sequence of symbols, we show that Transformers can be used effectively for nonlinear equalization in coherent long-haul transmission. For this application, we present an implementation of the encoder part of the Transformer and analyze its performance over a wide range of different hyper-parameters. It is shown that by processing blocks of symbols at each iteration and carefully selecting subsets of the encoder's output to be processed together, an efficient nonlinear compensation can be achieved for different complexity constraints. We also propose the use of a physic-informed mask inspired by nonlinear perturbation theory for reducing the computational complexity of the attention mechanism.  ( 2 min )
    Learning to See Through Dazzle
    arXiv:2402.15919v2 Announce Type: replace-cross Abstract: Machine vision is susceptible to laser dazzle, where intense laser light can blind and distort its perception of the environment through oversaturation or permanent damage to sensor pixels. Here we employ a wavefront-coded phase mask to diffuse the energy of laser light and introduce a sandwich generative adversarial network (SGAN) to restore images from complex image degradations, such as varying laser-induced image saturation, mask-induced image blurring, unknown lighting conditions, and various noise corruptions. The SGAN architecture combines discriminative and generative methods by wrapping two GANs around a learnable image deconvolution module. In addition, we make use of Fourier feature representations to reduce the spectral bias of neural networks and improve its learning of high-frequency image details. End-to-end training includes the realistic physics-based synthesis of a large set of training data from publicly available images. We trained the SGAN to suppress the peak laser irradiance as high as $10^6$ times the sensor saturation threshold - the point at which camera sensors may experience damage without the mask. The trained model was evaluated on both a synthetic data set and data collected from the laboratory. The proposed image restoration model quantitatively and qualitatively outperforms state-of-the-art methods for a wide range of scene contents, laser powers, incident laser angles, ambient illumination strengths, and noise characteristics.  ( 3 min )
    Norm Enforcement with a Soft Touch: Faster Emergence, Happier Agents
    arXiv:2401.16461v3 Announce Type: replace-cross Abstract: A multiagent system is a society of autonomous agents whose interactions can be regulated via social norms. In general, the norms of a society are not hardcoded but emerge from the agents' interactions. Specifically, how the agents in a society react to each other's behavior and respond to the reactions of others determines which norms emerge in the society. We think of these reactions by an agent to the satisfactory or unsatisfactory behaviors of another agent as communications from the first agent to the second agent. Understanding these communications is a kind of social intelligence: these communications provide natural drivers for norm emergence by pushing agents toward certain behaviors, which can become established as norms. Whereas it is well-known that sanctioning can lead to the emergence of norms, we posit that a broader kind of social intelligence can prove more effective in promoting cooperation in a multiagent system. Accordingly, we develop Nest, a framework that models social intelligence via a wider variety of communications and understanding of them than in previous work. To evaluate Nest, we develop a simulated pandemic environment and conduct simulation experiments to compare Nest with baselines considering a combination of three kinds of social communication: sanction, tell, and hint. We find that societies formed of Nest agents achieve norms faster. Moreover, Nest agents effectively avoid undesirable consequences, which are negative sanctions and deviation from goals, and yield higher satisfaction for themselves than baseline agents despite requiring only an equivalent amount of information.  ( 3 min )
    A Good Feature Extractor Is All You Need for Weakly Supervised Pathology Slide Classification
    arXiv:2311.11772v4 Announce Type: replace-cross Abstract: Stain normalisation is thought to be a crucial preprocessing step in computational pathology pipelines. We question this belief in the context of weakly supervised whole slide image classification, motivated by the emergence of powerful feature extractors trained using self-supervised learning on diverse pathology datasets. To this end, we performed the most comprehensive evaluation of publicly available pathology feature extractors to date, involving more than 8,000 training runs across nine tasks, five datasets, three downstream architectures, and various preprocessing setups. Notably, we find that omitting stain normalisation and image augmentations does not compromise downstream slide-level classification performance, while incurring substantial savings in memory and compute. Using a new evaluation metric that facilitates relative downstream performance comparison, we identify the best publicly available extractors, and show that their latent spaces are remarkably robust to variations in stain and augmentations like rotation. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level biomarker prediction tasks in a weakly supervised setting with external validation cohorts. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors. Code and data are available at https://georg.woelflein.eu/good-features.  ( 3 min )
    Transfer Learning across Different Chemical Domains: Virtual Screening of Organic Materials with Deep Learning Models Pretrained on Small Molecule and Chemical Reaction Data
    arXiv:2311.18377v2 Announce Type: replace-cross Abstract: Machine learning is becoming a preferred method for the virtual screening of organic materials due to its cost-effectiveness over traditional computationally demanding techniques. However, the scarcity of labeled data for organic materials poses a significant challenge for training advanced machine learning models. This study showcases the potential of utilizing databases of drug-like small molecules and chemical reactions to pretrain the BERT model, enhancing its performance in the virtual screening of organic materials. By fine-tuning the BERT models with data from five virtual screening tasks, the version pretrained with the USPTO-SMILES dataset achieved R2 scores exceeding 0.94 for three tasks and over 0.81 for two others. This performance surpasses that of models pretrained on the small molecule or organic materials databases and outperforms three traditional machine learning models trained directly on virtual screening data. The success of the USPTO-SMILES pretrained BERT model can be attributed to the diverse array of organic building blocks in the USPTO database, offering a broader exploration of the chemical space. The study further suggests that accessing a reaction database with a wider range of reactions than the USPTO could further enhance model performance. Overall, this research validates the feasibility of applying transfer learning across different chemical domains for the efficient virtual screening of organic materials.  ( 3 min )
    From PARIS to LE-PARIS: Toward Patent Response Automation with Recommender Systems and Collaborative Large Language Models
    arXiv:2402.00421v2 Announce Type: replace-cross Abstract: In patent prosecution, timely and effective responses to Office Actions (OAs) are crucial for securing patents. However, past automation and artificial intelligence research have largely overlooked this aspect. To bridge this gap, our study introduces the Patent Office Action Response Intelligence System (PARIS) and its advanced version, the Large Language Model (LLM) Enhanced PARIS (LE-PARIS). These systems are designed to enhance the efficiency of patent attorneys in handling OA responses through collaboration with AI. The systems' key features include the construction of an OA Topics Database, development of Response Templates, and implementation of Recommender Systems and LLM-based Response Generation. To validate the effectiveness of the systems, we have employed a multi-paradigm analysis using the USPTO Office Action database and longitudinal data based on attorney interactions with our systems over six years. Through five studies, we have examined the constructiveness of OA topics (studies 1 and 2) using topic modeling and our proposed Delphi process, the efficacy of our proposed hybrid LLM-based recommender system tailored for OA responses (study 3), the quality of generated responses (study 4), and the systems' practical value in real-world scenarios through user studies (study 5). The results indicate that both PARIS and LE-PARIS significantly achieve key metrics and have a positive impact on attorney performance.  ( 3 min )
    Augmenting Replay in World Models for Continual Reinforcement Learning
    arXiv:2401.16650v2 Announce Type: replace Abstract: Continual RL is a challenging problem where the agent is exposed to a sequence of tasks; it should learn new tasks without forgetting old ones, and learning the new task should improve performance on previous and future tasks. The most common approaches use model-free RL algorithms as a base, and replay buffers have been used to overcome catastrophic forgetting. However, the buffers are often very large making scalability difficult. Also, the concept of replay comes from biological inspiration, where evidence suggests that replay is applied to a world model, which implies model-based RL -- and model-based RL should have benefits for continual RL, where it is possible to exploit knowledge independent of the policy. We present WMAR, World Models with Augmented Replay, a model-based RL algorithm with a world model and memory efficient distribution matching replay buffer. It is based on the well-known DreamerV3 algorithm, which has a simple FIFO buffer and was not tested in a continual RL setting. We evaluated WMAR vs WMAR (FIFO only) on tasks with and without shared structure from OpenAI ProcGen and Atari respectively, and without a task oracle. We found that WMAR has favourable properties on continual RL with significantly reduced computational overhead compared to WMAR (FIFO only). WMAR had small benefits over DreamerV3 on tasks with shared structure and substantially better forgetting characteristics on tasks without shared structure; but at the cost of lower plasticity seen in a lower maximum on new tasks. The results suggest that model-based RL using a world model with a memory efficient replay buffer can be an effective and practical approach to continual RL, justifying future work.  ( 3 min )
    Decoding Neuronal Networks: A Reservoir Computing Approach for Predicting Connectivity and Functionality
    arXiv:2311.03131v3 Announce Type: replace-cross Abstract: In this study, we address the challenge of analyzing electrophysiological measurements in neuronal networks. Our computational model, based on the Reservoir Computing Network (RCN) architecture, deciphers spatio-temporal data obtained from electrophysiological measurements of neuronal cultures. By reconstructing the network structure on a macroscopic scale, we reveal the connectivity between neuronal units. Notably, our model outperforms common methods like Cross-Correlation and Transfer-Entropy in predicting the network's connectivity map. Furthermore, we experimentally validate its ability to forecast network responses to specific inputs, including localized optogenetic stimuli.  ( 2 min )
    scDiffusion: conditional generation of high-quality single-cell data using diffusion model
    arXiv:2401.03968v2 Announce Type: replace-cross Abstract: Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not very realistic yet, especially when we need to generate data with controlled conditions. In the meantime, the Diffusion models have shown their power in generating data at high fidelity, providing a new opportunity for scRNA-seq generation. In this study, we developed scDiffusion, a generative model combining diffusion model and foundation model to generate high-quality scRNA-seq data with controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under multiple condition combinations. We also proposed a new control strategy called Gradient Interpolation. This strategy allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion can generate single-cell gene expression data closely resembling real scRNA-seq data. Also, scDiffusion can conditionally produce data on specific cell types including rare cell types. Furthermore, we could use the multiple-condition generation of scDiffusion to generate cell type that was out of the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting the real scRNA-seq data and can provide insights into cell fate research.  ( 3 min )
    Optimal Inference in Contextual Stochastic Block Models
    arXiv:2306.07948v2 Announce Type: replace-cross Abstract: The contextual stochastic block model (cSBM) was proposed for unsupervised community detection on attributed graphs where both the graph and the high-dimensional node information correlate with node labels. In the context of machine learning on graphs, the cSBM has been widely used as a synthetic dataset for evaluating the performance of graph-neural networks (GNNs) for semi-supervised node classification. We consider a probabilistic Bayes-optimal formulation of the inference problem and we derive a belief-propagation-based algorithm for the semi-supervised cSBM; we conjecture it is optimal in the considered setting and we provide its implementation. We show that there can be a considerable gap between the accuracy reached by this algorithm and the performance of the GNN architectures proposed in the literature. This suggests that the cSBM, along with the comparison to the performance of the optimal algorithm, readily accessible via our implementation, can be instrumental in the development of more performant GNN architectures.  ( 3 min )
    Nonparametric Linear Feature Learning in Regression Through Regularisation
    arXiv:2307.12754v3 Announce Type: replace-cross Abstract: Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternative minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments.  ( 2 min )
    An Effective Universal Polynomial Basis for Spectral Graph Neural Networks
    arXiv:2311.18177v2 Announce Type: replace Abstract: Spectral Graph Neural Networks (GNNs), also referred to as graph filters have gained increasing prevalence for heterophily graphs. Optimal graph filters rely on Laplacian eigendecomposition for Fourier transform. In an attempt to avert the prohibitive computations, numerous polynomial filters by leveraging distinct polynomials have been proposed to approximate the desired graph filters. However, polynomials in the majority of polynomial filters are predefined and remain fixed across all graphs, failing to accommodate the diverse heterophily degrees across different graphs. To tackle this issue, we first investigate the correlation between polynomial bases of desired graph filters and the degrees of graph heterophily via a thorough theoretical analysis. Afterward, we develop an adaptive heterophily basis by incorporating graph heterophily degrees. Subsequently, we integrate this heterophily basis with the homophily basis, creating a universal polynomial basis UniBasis. In consequence, we devise a general polynomial filter UniFilter. Comprehensive experiments on both real-world and synthetic datasets with varying heterophily degrees significantly support the superiority of UniFilter, demonstrating the effectiveness and generality of UniBasis, as well as its promising capability as a new method for graph analysis.  ( 2 min )
    Contextual Text Denoising with Masked Language Models
    arXiv:1910.14080v2 Announce Type: replace-cross Abstract: Recently, with the help of deep learning models, significant advances have been made in different Natural Language Processing (NLP) tasks. Unfortunately, state-of-the-art models are vulnerable to noisy texts. We propose a new contextual text denoising algorithm based on the ready-to-use masked language model. The proposed algorithm does not require retraining of the model and can be integrated into any NLP system without additional training on paired cleaning training data. We evaluate our method under synthetic noise and natural noise and show that the proposed algorithm can use context information to correct noise text and improve the performance of noisy inputs in several downstream tasks.  ( 2 min )
    Zipformer: A faster and better encoder for automatic speech recognition
    arXiv:2310.11230v3 Announce Type: replace-cross Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explictly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.  ( 2 min )
    Reference-less Analysis of Context Specificity in Translation with Personalised Language Models
    arXiv:2303.16618v3 Announce Type: replace-cross Abstract: Sensitising language models (LMs) to external context helps them to more effectively capture the speaking patterns of individuals with specific characteristics or in particular environments. This work investigates to what extent rich character and film annotations can be leveraged to personalise LMs in a scalable manner. We then explore the use of such models in evaluating context specificity in machine translation. We build LMs which leverage rich contextual information to reduce perplexity by up to 6.5% compared to a non-contextual model, and generalise well to a scenario with no speaker-specific data, relying on combinations of demographic characteristics expressed via metadata. Our findings are consistent across two corpora, one of which (Cornell-rich) is also a contribution of this paper. We then use our personalised LMs to measure the co-occurrence of extra-textual context and translation hypotheses in a machine translation setting. Our results suggest that the degree to which professional translations in our domain are context-specific can be preserved to a better extent by a contextual machine translation model than a non-contextual model, which is also reflected in the contextual model's superior reference-based scores.  ( 3 min )
    In Search of Truth: An Interrogation Approach to Hallucination Detection
    arXiv:2403.02889v1 Announce Type: cross Abstract: Despite the many advances of Large Language Models (LLMs) and their unprecedented rapid evolution, their impact and integration into every facet of our daily lives is limited due to various reasons. One critical factor hindering their widespread adoption is the occurrence of hallucinations, where LLMs invent answers that sound realistic, yet drift away from factual truth. In this paper, we present a novel method for detecting hallucinations in large language models, which tackles a critical issue in the adoption of these models in various real-world scenarios. Through extensive evaluations across multiple datasets and LLMs, including Llama-2, we study the hallucination levels of various recent LLMs and demonstrate the effectiveness of our method to automatically detect them. Notably, we observe up to 62% hallucinations for Llama-2 in a specific experiment, where our method achieves a Balanced Accuracy (B-ACC) of 87%, all without relying on external knowledge.  ( 2 min )
    A Unified Framework for Exploratory Learning-Aided Community Detection Under Topological Uncertainty
    arXiv:2304.04497v3 Announce Type: replace-cross Abstract: In social networks, the discovery of community structures has received considerable attention as a fundamental problem in various network analysis tasks. However, due to privacy concerns or access restrictions, the network structure is often uncertain, thereby rendering established community detection approaches ineffective without costly network topology acquisition. To tackle this challenge, we present META-CODE, a unified framework for detecting overlapping communities via exploratory learning aided by easy-to-collect node metadata when networks are topologically unknown (or only partially known). Specifically, META-CODE consists of three iterative steps in addition to the initial network inference step: 1) node-level community-affiliation embeddings based on graph neural networks (GNNs) trained by our new reconstruction loss, 2) network exploration via community-affiliation-based node queries, and 3) network inference using an edge connectivity-based Siamese neural network model from the explored network. Through extensive experiments on five real-world datasets including two large networks, we demonstrated: (a) the superiority of META-CODE over benchmark community detection methods, achieving remarkable gains up to 151.27% compared to the best existing competitor, (b) the impact of each module in META-CODE, (c) the effectiveness of node queries in META-CODE based on empirical evaluations and theoretical findings, (d) the convergence of the inferred network, and (e) the computational efficiency of META-CODE.  ( 3 min )
    Synaptic Weight Distributions Depend on the Geometry of Plasticity
    arXiv:2305.19394v2 Announce Type: replace-cross Abstract: A growing literature in computational neuroscience leverages gradient descent and learning algorithms that approximate it to study synaptic plasticity in the brain. However, the vast majority of this work ignores a critical underlying assumption: the choice of distance for synaptic changes - i.e. the geometry of synaptic plasticity. Gradient descent assumes that the distance is Euclidean, but many other distances are possible, and there is no reason that biology necessarily uses Euclidean geometry. Here, using the theoretical tools provided by mirror descent, we show that the distribution of synaptic weights will depend on the geometry of synaptic plasticity. We use these results to show that experimentally-observed log-normal weight distributions found in several brain areas are not consistent with standard gradient descent (i.e. a Euclidean geometry), but rather with non-Euclidean distances. Finally, we show that it should be possible to experimentally test for different synaptic geometries by comparing synaptic weight distributions before and after learning. Overall, our work shows that the current paradigm in theoretical work on synaptic plasticity that assumes Euclidean synaptic geometry may be misguided and that it should be possible to experimentally determine the true geometry of synaptic plasticity in the brain.  ( 3 min )
    SplAgger: Split Aggregation for Meta-Reinforcement Learning
    arXiv:2403.03020v1 Announce Type: new Abstract: A core ambition of reinforcement learning (RL) is the creation of agents capable of rapid learning in novel tasks. Meta-RL aims to achieve this by directly learning such agents. One category of meta-RL methods, called black box methods, does so by training off-the-shelf sequence models end-to-end. In contrast, another category of methods have been developed that explicitly infer a posterior distribution over the unknown task. These methods generally have distinct objectives and sequence models designed to enable task inference, and so are known as task inference methods. However, recent evidence suggests that task inference objectives are unnecessary in practice. Nonetheless, it remains unclear whether task inference sequence models are beneficial even when task inference objectives are not. In this paper, we present strong evidence that task inference sequence models are still beneficial. In particular, we investigate sequence models with permutation invariant aggregation, which exploit the fact that, due to the Markov property, the task posterior does not depend on the order of data. We empirically confirm the advantage of permutation invariant sequence models without the use of task inference objectives. However, we also find, surprisingly, that there are multiple conditions under which permutation variance remains useful. Therefore, we propose SplAgger, which uses both permutation variant and invariant components to achieve the best of both worlds, outperforming all baselines on continuous control and memory environments.  ( 2 min )
    Mitigating Biases with Diverse Ensembles and Diffusion Models
    arXiv:2311.16176v2 Announce Type: replace Abstract: Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as shortcut bias, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) for shortcut bias mitigation. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification performance on par with prior work that relies on auxiliary data collection.  ( 2 min )
    Soft-prompt Tuning for Large Language Models to Evaluate Bias
    arXiv:2306.04735v2 Announce Type: replace-cross Abstract: Prompting large language models has gained immense popularity in recent years due to the advantage of producing good results even without the need for labelled data. However, this requires prompt tuning to get optimal prompts that lead to better model performances. In this paper, we explore the use of soft-prompt tuning on sentiment classification task to quantify the biases of large language models (LLMs) such as Open Pre-trained Transformers (OPT) and Galactica language model. Since these models are trained on real-world data that could be prone to bias toward certain groups of populations, it is important to identify these underlying issues. Using soft-prompts to evaluate bias gives us the extra advantage of avoiding the human-bias injection that can be caused by manually designed prompts. We check the model biases on different sensitive attributes using the group fairness (bias) and find interesting bias patterns. Since LLMs have been used in the industry in various applications, it is crucial to identify the biases before deploying these models in practice. We open-source our pipeline and encourage industry researchers to adapt our work to their use cases.  ( 2 min )
    Making the End-User a Priority in Benchmarking: OrionBench for Unsupervised Time Series Anomaly Detection
    arXiv:2310.17748v2 Announce Type: replace Abstract: Time series anomaly detection is a prevalent problem in many application domains such as patient monitoring in healthcare, forecasting in finance, or predictive maintenance in energy. This has led to the emergence of a plethora of anomaly detection methods, including more recently, deep learning based methods. Although several benchmarks have been proposed to compare newly developed models, they usually rely on one-time execution over a limited set of datasets and the comparison is restricted to a few models. We propose OrionBench -- a user centric continuously maintained benchmark for unsupervised time series anomaly detection. The framework provides universal abstractions to represent models, extensibility to add new pipelines and datasets, hyperparameter standardization, pipeline verification, and frequent releases with published benchmarks. We demonstrate the usage of OrionBench, and the progression of pipelines across 16 releases published over the course of three years. Moreover, we walk through two real scenarios we experienced with OrionBench that highlight the importance of continuous benchmarks in unsupervised time series anomaly detection.  ( 2 min )
    G4-Attention: Deep Learning Model with Attention for predicting DNA G-Quadruplexes
    arXiv:2403.02765v1 Announce Type: new Abstract: G-Quadruplexes are the four-stranded non-canonical nucleic acid secondary structures, formed by the stacking arrangement of the guanine tetramers. They are involved in a wide range of biological roles because of their exceptionally unique and distinct structural characteristics. After the completion of the human genome sequencing project, a lot of bioinformatic algorithms were introduced to predict the active G4s regions \textit{in vitro} based on the canonical G4 sequence elements, G-\textit{richness}, and G-\textit{skewness}, as well as the non-canonical sequence features. Recently, sequencing techniques like G4-seq and G4-ChIP-seq were developed to map the G4s \textit{in vitro}, and \textit{in vivo} respectively at a few hundred base resolution. Subsequently, several machine learning approaches were developed for predicting the G4 regions using the existing databases. However, their prediction models were simplistic, and the prediction accuracy was notably poor. In response, here, we propose a novel convolutional neural network with Bi-LSTM and attention layers, named G4-attention, to predict the G4 forming sequences with improved accuracy. G4-attention achieves high accuracy and attains state-of-the-art results in the G4 prediction task. Our model also predicts the G4 regions accurately in the highly class-imbalanced datasets. In addition, the developed model trained on the human genome dataset can be applied to any non-human genome DNA sequences to predict the G4 formation propensities.  ( 2 min )
    TransNAS-TSAD: Harnessing Transformers for Multi-Objective Neural Architecture Search in Time Series Anomaly Detection
    arXiv:2311.18061v3 Announce Type: replace Abstract: The surge in real-time data collection across various industries has underscored the need for advanced anomaly detection in both univariate and multivariate time series data. This paper introduces TransNAS-TSAD, a framework that synergizes the transformer architecture with neural architecture search (NAS), enhanced through NSGA-II algorithm optimization. This approach effectively tackles the complexities of time series data, balancing computational efficiency with detection accuracy. Our evaluation reveals that TransNAS-TSAD surpasses conventional anomaly detection models due to its tailored architectural adaptability and the efficient exploration of complex search spaces, leading to marked improvements in diverse data scenarios. We also introduce the Efficiency-Accuracy-Complexity Score (EACS) as a new metric for assessing model performance, emphasizing the balance between accuracy and computational resources. TransNAS-TSAD sets a new benchmark in time series anomaly detection, offering a versatile, efficient solution for complex real-world applications. This research highlights the TransNAS-TSAD potential across a wide range of industry applications and paves the way for future developments in the field.  ( 2 min )
    DNNLasso: Scalable Graph Learning for Matrix-Variate Data
    arXiv:2403.02608v1 Announce Type: new Abstract: We consider the problem of jointly learning row-wise and column-wise dependencies of matrix-variate observations, which are modelled separately by two precision matrices. Due to the complicated structure of Kronecker-product precision matrices in the commonly used matrix-variate Gaussian graphical models, a sparser Kronecker-sum structure was proposed recently based on the Cartesian product of graphs. However, existing methods for estimating Kronecker-sum structured precision matrices do not scale well to large scale datasets. In this paper, we introduce DNNLasso, a diagonally non-negative graphical lasso model for estimating the Kronecker-sum structured precision matrix, which outperforms the state-of-the-art methods by a large margin in both accuracy and computational time. Our code is available at https://github.com/YangjingZhang/DNNLasso.  ( 2 min )
    A Note on High-Probability Analysis of Algorithms with Exponential, Sub-Gaussian, and General Light Tails
    arXiv:2403.02873v1 Announce Type: new Abstract: This short note describes a simple technique for analyzing probabilistic algorithms that rely on a light-tailed (but not necessarily bounded) source of randomization. We show that the analysis of such an algorithm can be reduced, in a black-box manner and with only a small loss in logarithmic factors, to an analysis of a simpler variant of the same algorithm that uses bounded random variables and often easier to analyze. This approach simultaneously applies to any light-tailed randomization, including exponential, sub-Gaussian, and more general fast-decaying distributions, without needing to appeal to specialized concentration inequalities. Analyses of a generalized Azuma inequality and stochastic optimization with general light-tailed noise are provided to illustrate the technique.  ( 2 min )
    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
    arXiv:2403.03218v1 Announce Type: new Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop CUT, a state-of-the-art unlearning method based on controlling model representations. CUT reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai  ( 3 min )
    Semi-Supervised Graph Representation Learning with Human-centric Explanation for Predicting Fatty Liver Disease
    arXiv:2403.02786v1 Announce Type: new Abstract: Addressing the challenge of limited labeled data in clinical settings, particularly in the prediction of fatty liver disease, this study explores the potential of graph representation learning within a semi-supervised learning framework. Leveraging graph neural networks (GNNs), our approach constructs a subject similarity graph to identify risk patterns from health checkup data. The effectiveness of various GNN approaches in this context is demonstrated, even with minimal labeled samples. Central to our methodology is the inclusion of human-centric explanations through explainable GNNs, providing personalized feature importance scores for enhanced interpretability and clinical relevance, thereby underscoring the potential of our approach in advancing healthcare practices with a keen focus on graph representation learning and human-centric explanation.  ( 2 min )
    Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning
    arXiv:2403.02690v1 Announce Type: new Abstract: For learning with noisy labels, the transition matrix, which explicitly models the relation between noisy label distribution and clean label distribution, has been utilized to achieve the statistical consistency of either the classifier or the risk. Previous researches have focused more on how to estimate this transition matrix well, rather than how to utilize it. We propose good utilization of the transition matrix is crucial and suggest a new utilization method based on resampling, coined RENT. Specifically, we first demonstrate current utilizations can have potential limitations for implementation. As an extension to Reweighting, we suggest the Dirichlet distribution-based per-sample Weight Sampling (DWS) framework, and compare reweighting and resampling under DWS framework. With the analyses from DWS, we propose RENT, a REsampling method with Noise Transition matrix. Empirically, RENT consistently outperforms existing transition matrix utilization methods, which includes reweighting, on various benchmark datasets. Our code is available at \url{https://github.com/BaeHeeSun/RENT}.  ( 2 min )
    Towards efficient deep autoencoders for multivariate time series anomaly detection
    arXiv:2403.02429v1 Announce Type: new Abstract: Multivariate time series anomaly detection is a crucial problem in many industrial and research applications. Timely detection of anomalies allows, for instance, to prevent defects in manufacturing processes and failures in cyberphysical systems. Deep learning methods are preferred among others for their accuracy and robustness for the analysis of complex multivariate data. However, a key aspect is being able to extract predictions in a timely manner, to accommodate real-time requirements in different applications. In the case of deep learning models, model reduction is extremely important to achieve optimal results in real-time systems with limited time and memory constraints. In this paper, we address this issue by proposing a novel compression method for deep autoencoders that involves three key factors. First, pruning reduces the number of weights, while preventing catastrophic drops in accuracy by means of a fast search process that identifies high sparsity levels. Second, linear and non-linear quantization reduces model complexity by reducing the number of bits for every single weight. The combined contribution of these three aspects allow the model size to be reduced, by removing a subset of the weights (pruning), and decreasing their bit-width (quantization). As a result, the compressed model is faster and easier to adopt in highly constrained hardware environments. Experiments performed on popular multivariate anomaly detection benchmarks, show that our method is capable of achieving significant model compression ratio (between 80% and 95%) without a significant reduction in the anomaly detection performance.  ( 3 min )
    From Zero to Hero: How local curvature at artless initial conditions leads away from bad minima
    arXiv:2403.02418v1 Announce Type: new Abstract: We investigate the optimization dynamics of gradient descent in a non-convex and high-dimensional setting, with a focus on the phase retrieval problem as a case study for complex loss landscapes. We first study the high-dimensional limit where both the number $M$ and the dimension $N$ of the data are going to infinity at fixed signal-to-noise ratio $\alpha = M/N$. By analyzing how the local curvature changes during optimization, we uncover that for intermediate $\alpha$, the Hessian displays a downward direction pointing towards good minima in the first regime of the descent, before being trapped in bad minima at the end. Hence, the local landscape is benign and informative at first, before gradient descent brings the system into a uninformative maze. The transition between the two regimes is associated to a BBP-type threshold in the time-dependent Hessian. Through both theoretical analysis and numerical experiments, we show that in practical cases, i.e. for finite but even very large $N$, successful optimization via gradient descent in phase retrieval is achieved by falling towards the good minima before reaching the bad ones. This mechanism explains why successful recovery is obtained well before the algorithmic transition corresponding to the high-dimensional limit. Technically, this is associated to strong logarithmic corrections of the algorithmic transition at large $N$ with respect to the one expected in the $N\to\infty$ limit. Our analysis sheds light on such a new mechanism that facilitate gradient descent dynamics in finite large dimensions, also highlighting the importance of good initialization of spectral properties for optimization in complex high-dimensional landscapes.  ( 3 min )
    A Novel Hybrid Feature Importance and Feature Interaction Detection Framework for Predictive Optimization in Industry 4.0 Applications
    arXiv:2403.02368v1 Announce Type: new Abstract: Advanced machine learning algorithms are increasingly utilized to provide data-based prediction and decision-making support in Industry 4.0. However, the prediction accuracy achieved by the existing models is insufficient to warrant practical implementation in real-world applications. This is because not all features present in real-world datasets possess a direct relevance to the predictive analysis being conducted. Consequently, the careful incorporation of select features has the potential to yield a substantial positive impact on the outcome. To address the research gap, this paper proposes a novel hybrid framework that combines the feature importance detector - local interpretable model-agnostic explanations (LIME) and the feature interaction detector - neural interaction detection (NID), to improve prediction accuracy. By applying the proposed framework, unnecessary features can be eliminated, and interactions are encoded to generate a more conducive dataset for predictive purposes. Subsequently, the proposed model is deployed to refine the prediction of electricity consumption in foundry processing. The experimental outcomes reveal an augmentation of up to 9.56% in the R2 score, and a diminution of up to 24.05% in the root mean square error.  ( 2 min )
    Towards Optimal Customized Architecture for Heterogeneous Federated Learning with Contrastive Cloud-Edge Model Decoupling
    arXiv:2403.02360v1 Announce Type: new Abstract: Federated learning, as a promising distributed learning paradigm, enables collaborative training of a global model across multiple network edge clients without the need for central data collecting. However, the heterogeneity of edge data distribution drags the model towards the local minima, which can be distant from the global optimum. Such heterogeneity often leads to slow convergence and substantial communication overhead. To address these issues, we propose a novel federated learning framework called FedCMD, a model decoupling tailored to the Cloud-edge supported federated learning that separates deep neural networks into a body for capturing shared representations in Cloud and a personalized head for migrating data heterogeneity. Our motivation is that, by the deep investigation of the performance of selecting different neural network layers as the personalized head, we found rigidly assigning the last layer as the personalized head in current studies is not always optimal. Instead, it is necessary to dynamically select the personalized layer that maximizes the training performance by taking the representation difference between neighbor layers into account. To find the optimal personalized layer, we utilize the low-dimensional representation of each layer to contrast feature distribution transfer and introduce a Wasserstein-based layer selection method, aimed at identifying the best-match layer for personalization. Additionally, a weighted global aggregation algorithm is proposed based on the selected personalized layer for the practical application of FedCMD. Extensive experiments on ten benchmarks demonstrate the efficiency and superior performance of our solution compared with nine state-of-the-art solutions. All code and results are available at https://github.com/elegy112138/FedCMD.  ( 3 min )
    Addressing Long-Tail Noisy Label Learning Problems: a Two-Stage Solution with Label Refurbishment Considering Label Rarity
    arXiv:2403.02363v1 Announce Type: new Abstract: Real-world datasets commonly exhibit noisy labels and class imbalance, such as long-tailed distributions. While previous research addresses this issue by differentiating noisy and clean samples, reliance on information from predictions based on noisy long-tailed data introduces potential errors. To overcome the limitations of prior works, we introduce an effective two-stage approach by combining soft-label refurbishing with multi-expert ensemble learning. In the first stage of robust soft label refurbishing, we acquire unbiased features through contrastive learning, making preliminary predictions using a classifier trained with a carefully designed BAlanced Noise-tolerant Cross-entropy (BANC) loss. In the second stage, our label refurbishment method is applied to obtain soft labels for multi-expert ensemble learning, providing a principled solution to the long-tail noisy label problem. Experiments conducted across multiple benchmarks validate the superiority of our approach, Label Refurbishment considering Label Rarity (LR^2), achieving remarkable accuracies of 94.19% and 77.05% on simulated noisy CIFAR-10 and CIFAR-100 long-tail datasets, as well as 77.74% and 81.40% on real-noise long-tail datasets, Food-101N and Animal-10N, surpassing existing state-of-the-art methods.  ( 2 min )
    Temporal Knowledge Graph Completion with Time-sensitive Relations in Hypercomplex Space
    arXiv:2403.02355v1 Announce Type: new Abstract: Temporal knowledge graph completion (TKGC) aims to fill in missing facts within a given temporal knowledge graph at a specific time. Existing methods, operating in real or complex spaces, have demonstrated promising performance in this task. This paper advances beyond conventional approaches by introducing more expressive quaternion representations for TKGC within hypercomplex space. Unlike existing quaternion-based methods, our study focuses on capturing time-sensitive relations rather than time-aware entities. Specifically, we model time-sensitive relations through time-aware rotation and periodic time translation, effectively capturing complex temporal variability. Furthermore, we theoretically demonstrate our method's capability to model symmetric, asymmetric, inverse, compositional, and evolutionary relation patterns. Comprehensive experiments on public datasets validate that our proposed approach achieves state-of-the-art performance in the field of TKGC.  ( 2 min )
    Spatio-Temporal Field Neural Networks for Air Quality Inference
    arXiv:2403.02354v1 Announce Type: new Abstract: The air quality inference problem aims to utilize historical data from a limited number of observation sites to infer the air quality index at an unknown location. Considering the sparsity of data due to the high maintenance cost of the stations, good inference algorithms can effectively save the cost and refine the data granularity. While spatio-temporal graph neural networks have made excellent progress on this problem, their non-Euclidean and discrete data structure modeling of reality limits its potential. In this work, we make the first attempt to combine two different spatio-temporal perspectives, fields and graphs, by proposing a new model, Spatio-Temporal Field Neural Network, and its corresponding new framework, Pyramidal Inference. Extensive experiments validate that our model achieves state-of-the-art performance in nationwide air quality inference in the Chinese Mainland, demonstrating the superiority of our proposed model and framework.  ( 2 min )
    On the Convergence of Federated Learning Algorithms without Data Similarity
    arXiv:2403.02347v1 Announce Type: new Abstract: Data similarity assumptions have traditionally been relied upon to understand the convergence behaviors of federated learning methods. Unfortunately, this approach often demands fine-tuning step sizes based on the level of data similarity. When data similarity is low, these small step sizes result in an unacceptably slow convergence speed for federated methods. In this paper, we present a novel and unified framework for analyzing the convergence of federated learning algorithms without the need for data similarity conditions. Our analysis centers on an inequality that captures the influence of step sizes on algorithmic convergence performance. By applying our theorems to well-known federated algorithms, we derive precise expressions for three widely used step size schedules: fixed, diminishing, and step-decay step sizes, which are independent of data similarity conditions. Finally, we conduct comprehensive evaluations of the performance of these federated learning algorithms, employing the proposed step size strategies to train deep neural network models on benchmark datasets under varying data similarity conditions. Our findings demonstrate significant improvements in convergence speed and overall performance, marking a substantial advancement in federated learning research.  ( 2 min )
    ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
    arXiv:2403.02352v1 Announce Type: new Abstract: We propose a new attention mechanism with linear complexity, ATP, that fixates \textbf{A}ttention on \textbf{T}op \textbf{P}rincipal keys, rather than on each individual token. Particularly, ATP is driven by an important observation that input sequences are typically low-rank, i.e., input sequences can be represented by a few principal bases. Therefore, instead of directly iterating over all the input tokens, ATP transforms inputs into an orthogonal space and computes attention only on the top principal bases (keys). Owing to the observed low-rank structure in input sequences, ATP is able to capture semantic relationships in input sequences with a few principal keys. Furthermore, the attention complexity is reduced from \emph{quadratic} to \emph{linear} without incurring a noticeable performance drop. ATP further reduces complexity for other linear layers with low-rank inputs, leading to more speedup compared to prior works that solely target the attention module. Our evaluations on various models (e.g., BERT and Llama) demonstrate that ATP achieves comparable accuracy with much lower computation and memory complexity than the standard attention mechanism. In particular, ATP barely loses accuracy with only $1/2$ principal keys, and only incurs around $2\%$ accuracy drops with $1/4$ principal keys.  ( 2 min )
    Time Weaver: A Conditional Time Series Generation Model
    arXiv:2403.02682v1 Announce Type: new Abstract: Imagine generating a city's electricity demand pattern based on weather, the presence of an electric vehicle, and location, which could be used for capacity planning during a winter freeze. Such real-world time series are often enriched with paired heterogeneous contextual metadata (weather, location, etc.). Current approaches to time series generation often ignore this paired metadata, and its heterogeneity poses several practical challenges in adapting existing conditional generation approaches from the image, audio, and video domains to the time series domain. To address this gap, we introduce Time Weaver, a novel diffusion-based model that leverages the heterogeneous metadata in the form of categorical, continuous, and even time-variant variables to significantly improve time series generation. Additionally, we show that naive extensions of standard evaluation metrics from the image to the time series domain are insufficient. These metrics do not penalize conditional generation approaches for their poor specificity in reproducing the metadata-specific features in the generated time series. Thus, we innovate a novel evaluation metric that accurately captures the specificity of conditional generation and the realism of the generated time series. We show that Time Weaver outperforms state-of-the-art benchmarks, such as Generative Adversarial Networks (GANs), by up to 27% in downstream classification tasks on real-world energy, medical, air quality, and traffic data sets.  ( 2 min )
    A Two-Stage Training Method for Modeling Constrained Systems With Neural Networks
    arXiv:2403.02730v1 Announce Type: new Abstract: Real-world systems are often formulated as constrained optimization problems. Techniques to incorporate constraints into Neural Networks (NN), such as Neural Ordinary Differential Equations (Neural ODEs), have been used. However, these introduce hyperparameters that require manual tuning through trial and error, raising doubts about the successful incorporation of constraints into the generated model. This paper describes in detail the two-stage training method for Neural ODEs, a simple, effective, and penalty parameter-free approach to model constrained systems. In this approach the constrained optimization problem is rewritten as two unconstrained sub-problems that are solved in two stages. The first stage aims at finding feasible NN parameters by minimizing a measure of constraints violation. The second stage aims to find the optimal NN parameters by minimizing the loss function while keeping inside the feasible region. We experimentally demonstrate that our method produces models that satisfy the constraints and also improves their predictive performance. Thus, ensuring compliance with critical system properties and also contributing to reducing data quantity requirements. Furthermore, we show that the proposed method improves the convergence to an optimal solution and improves the explainability of Neural ODE models. Our proposed two-stage training method can be used with any NN architectures.  ( 2 min )
    Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery
    arXiv:2206.10540v5 Announce Type: replace Abstract: This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), specifically focused on its potential for scientific discovery. Focused on a set of formulas used in the existing datasets based on Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values so that our new SRSD datasets can be used for evaluating the potential of SRSD such as whether or not an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can choose necessary variables only. Besides, we propose to use normalized edit distances (NED) between a predicted equation and the true equation trees for addressing a critical issue that existing SR metrics are either binary or errors between the target values and an SR model's predicted values for a given input. We conduct benchmark experiments on our new SRSD datasets using various representative SR methods. The experimental results show that we provide a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric. We publish repositories of our code and 240 SRSD datasets.  ( 3 min )
    A Self-Adaptive Penalty Method for Integrating Prior Knowledge Constraints into Neural ODEs
    arXiv:2307.14940v3 Announce Type: replace Abstract: The continuous dynamics of natural systems has been effectively modelled using Neural Ordinary Differential Equations (Neural ODEs). However, for accurate and meaningful predictions, it is crucial that the models follow the underlying rules or laws that govern these systems. In this work, we propose a self-adaptive penalty algorithm for Neural ODEs to enable modelling of constrained natural systems. The proposed self-adaptive penalty function can dynamically adjust the penalty parameters. The explicit introduction of prior knowledge helps to increase the interpretability of Neural ODE -based models. We validate the proposed approach by modelling three natural systems with prior knowledge constraints: population growth, chemical reaction evolution, and damped harmonic oscillator motion. The numerical experiments and a comparison with other penalty Neural ODE approaches and \emph{vanilla} Neural ODE, demonstrate the effectiveness of the proposed self-adaptive penalty algorithm for Neural ODEs in modelling constrained natural systems. Moreover, the self-adaptive penalty approach provides more accurate and robust models with reliable and meaningful predictions.  ( 2 min )
    Generating Multidimensional Clusters With Support Lines
    arXiv:2301.10327v3 Announce Type: replace Abstract: Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.  ( 2 min )
    Time Series Predictions in Unmonitored Sites: A Survey of Machine Learning Techniques in Water Resources
    arXiv:2308.09766v2 Announce Type: replace Abstract: Prediction of dynamic environmental variables in unmonitored sites remains a long-standing challenge for water resources science. The majority of the world's freshwater resources have inadequate monitoring of critical environmental variables needed for management. Yet, the need to have widespread predictions of hydrological variables such as river flow and water quality has become increasingly urgent due to climate and land use change over the past decades, and their associated impacts on water resources. Modern machine learning methods increasingly outperform their process-based and empirical model counterparts for hydrologic time series prediction with their ability to extract information from large, diverse data sets. We review relevant state-of-the art applications of machine learning for streamflow, water quality, and other water resources prediction and discuss opportunities to improve the use of machine learning with emerging methods for incorporating watershed characteristics into deep learning models, transfer learning, and incorporating process knowledge into machine learning models. The analysis here suggests most prior efforts have been focused on deep learning learning frameworks built on many sites for predictions at daily time scales in the United States, but that comparisons between different classes of machine learning methods are few and inadequate. We identify several open questions for time series predictions in unmonitored sites that include incorporating dynamic inputs and site characteristics, mechanistic understanding and spatial context, and explainable AI techniques in modern machine learning frameworks.  ( 3 min )
    Mass-Producing Failures of Multimodal Systems with Language Models
    arXiv:2306.12105v2 Announce Type: replace Abstract: Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures -- generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts a language model (e.g., GPT-4) to find systematic patterns of failure and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others. MultiMon can also steer towards failures relevant to specific use cases, such as self-driving cars. We see MultiMon as a step towards evaluation that autonomously explores the long tail of potential system failures. Code for MULTIMON is available at https://github.com/tsb0601/MultiMon.  ( 2 min )
    Knowledge Distillation in Federated Edge Learning: A Survey
    arXiv:2301.05849v3 Announce Type: replace Abstract: The increasing demand for intelligent services and privacy protection of mobile and Internet of Things (IoT) devices motivates the wide application of Federated Edge Learning (FEL), in which devices collaboratively train on-device Machine Learning (ML) models without sharing their private data. Limited by device hardware, diverse user behaviors and network infrastructure, the algorithm design of FEL faces challenges related to resources, personalization and network environments. Fortunately, Knowledge Distillation (KD) has been leveraged as an important technique to tackle the above challenges in FEL. In this paper, we investigate the works that KD applies to FEL, discuss the limitations and open problems of existing KD-based FEL approaches, and provide guidance for their real deployment.  ( 2 min )
    Emerging Trends in Federated Learning: From Model Fusion to Federated X Learning
    arXiv:2102.12920v4 Announce Type: replace Abstract: Federated learning is a new learning paradigm that decouples data collection and model training via multi-party computation and model aggregation. As a flexible learning setting, federated learning has the potential to integrate with other learning frameworks. We conduct a focused survey of federated learning in conjunction with other learning algorithms. Specifically, we explore various learning algorithms to improve the vanilla federated averaging algorithm and review model fusion methods such as adaptive aggregation, regularization, clustered methods, and Bayesian methods. Following the emerging trends, we also discuss federated learning in the intersection with other learning paradigms, termed federated X learning, where X includes multitask learning, meta-learning, transfer learning, unsupervised learning, and reinforcement learning. This survey reviews the state of the art, challenges, and future directions.  ( 3 min )
    Provable Training for Graph Contrastive Learning
    arXiv:2309.13944v3 Announce Type: replace Abstract: Graph Contrastive Learning (GCL) has emerged as a popular training approach for learning node embeddings from augmented graphs without labels. Despite the key principle that maximizing the similarity between positive node pairs while minimizing it between negative node pairs is well established, some fundamental problems are still unclear. Considering the complex graph structure, are some nodes consistently well-trained and following this principle even with different graph augmentations? Or are there some nodes more likely to be untrained across graph augmentations and violate the principle? How to distinguish these nodes and further guide the training of GCL? To answer these questions, we first present experimental evidence showing that the training of GCL is indeed imbalanced across all nodes. To address this problem, we propose the metric "node compactness", which is the lower bound of how a node follows the GCL principle related to the range of augmentations. We further derive the form of node compactness theoretically through bound propagation, which can be integrated into binary cross-entropy as a regularization. To this end, we propose the PrOvable Training (POT) for GCL, which regularizes the training of GCL to encode node embeddings that follows the GCL principle better. Through extensive experiments on various benchmarks, POT consistently improves the existing GCL approaches, serving as a friendly plugin.  ( 3 min )
    Identifying Representations for Intervention Extrapolation
    arXiv:2310.04295v2 Announce Type: replace Abstract: The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome non-linearly. Our setup includes an outcome Y, observed features X, which are generated as a non-linear transformation of latent features Z, and exogenous action variables A, which influence Z. The objective of intervention extrapolation is to predict how interventions on A that lie outside the training support of A affect Y. Here, extrapolation becomes possible if the effect of A on Z is linear and the residual when regressing Z on A has full support. As Z is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call Rep4Ex: we aim to map the observed features X into a subspace that allows for non-linear extrapolation in A. We show that the hidden representation is identifiable up to an affine transformation in Z-space, which is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of A on Z. Based on this insight, we propose a method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through synthetic experiments and show that our approach succeeds in predicting the effects of unseen interventions.  ( 3 min )
    XRMDN: An Extended Recurrent Mixture Density Network for Short-Term Probabilistic Rider Demand Forecasting with High Volatility
    arXiv:2310.09847v2 Announce Type: replace Abstract: In the realm of Mobility-on-Demand (MoD) systems, the forecasting of rider demand is a cornerstone for operational decision-making and system optimization. Traditional forecasting methodologies primarily yield point estimates, thereby neglecting the inherent uncertainty within demand projections. Moreover, MoD demand levels are profoundly influenced by both endogenous and exogenous factors, leading to high and dynamic volatility. This volatility significantly undermines the efficacy of conventional time series forecasting methods. In response, we propose an Extended Recurrent Mixture Density Network (XRMDN), a novel deep learning framework engineered to address these challenges. XRMDN leverages a sophisticated architecture to process demand residuals and variance through correlated modules, allowing for the flexible incorporation of endogenous and exogenous data. This architecture, featuring recurrent connections within the weight, mean, and variance neural networks, adeptly captures demand trends, thus significantly enhancing forecasting precision, particularly in high-volatility scenarios. Our comprehensive experimental analysis, utilizing real-world MoD datasets, demonstrates that XRMDN surpasses the existing benchmark models across various metrics, notably excelling in high-demand volatility contexts. This advancement in probabilistic demand forecasting marks a significant contribution to the field, offering a robust tool for enhancing operational efficiency and customer satisfaction in MoD systems.  ( 3 min )
    Demystifying Local and Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition
    arXiv:2307.11333v2 Announce Type: replace Abstract: This work presents an information-theoretic perspective to group fairness trade-offs in federated learning (FL) with respect to sensitive attributes, such as gender, race, etc. Existing works often focus on either $\textit{global fairness}$ (overall disparity of the model across all clients) or $\textit{local fairness}$ (disparity of the model at each client), without always considering their trade-offs. There is a lack of understanding regarding the interplay between global and local fairness in FL, particularly under data heterogeneity, and if and when one implies the other. To address this gap, we leverage a body of work in information theory called partial information decomposition (PID), which first identifies three sources of unfairness in FL, namely, $\textit{Unique Disparity}$, $\textit{Redundant Disparity}$, and $\textit{Masked Disparity}$. We demonstrate how these three disparities contribute to global and local fairness using canonical examples. This decomposition helps us derive fundamental limits on the trade-off between global and local fairness, highlighting where they agree or disagree. We introduce the $\textit{Accuracy and Global-Local Fairness Optimality Problem (AGLFOP)}$, a convex optimization that defines the theoretical limits of accuracy and fairness trade-offs, identifying the best possible performance any FL strategy can attain given a dataset and client distribution. We also present experimental results on synthetic datasets and the ADULT dataset to support our theoretical findings.  ( 3 min )
    Machine Unlearning: Solutions and Challenges
    arXiv:2308.07061v2 Announce Type: replace Abstract: Machine learning models may inadvertently memorize sensitive, unauthorized, or malicious data, posing risks of privacy breaches, security vulnerabilities, and performance degradation. To address these issues, machine unlearning has emerged as a critical technique to selectively remove specific training data points' influence on trained models. This paper provides a comprehensive taxonomy and analysis of the solutions in machine unlearning. We categorize existing solutions into exact unlearning approaches that remove data influence thoroughly and approximate unlearning approaches that efficiently minimize data influence. By comprehensively reviewing solutions, we identify and discuss their strengths and limitations. Furthermore, we propose future directions to advance machine unlearning and establish it as an essential capability for trustworthy and adaptive machine learning models. This paper provides researchers with a roadmap of open problems, encouraging impactful contributions to address real-world needs for selective data removal.  ( 2 min )
    Confident Sinkhorn Allocation for Pseudo-Labeling
    arXiv:2206.05880v5 Announce Type: replace Abstract: Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as images and natural language, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. These methods are not applicable, however, when the data does not have the appropriate structure, or invariances. Due to their simplicity, pseudo-labeling (PL) methods can be widely used without any domain assumptions. However, the greedy mechanism in PL is sensitive to a threshold and can perform poorly if wrong assignments are made due to overconfidence. This paper studies theoretically the role of uncertainty to pseudo-labeling and proposes Confident Sinkhorn Allocation (CSA), which identifies the best pseudo-label allocation via optimal transport to only samples with high confidence scores. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning. Additionally, we propose to use the Integral Probability Metrics to extend and improve the existing PACBayes bound which relies on the Kullback-Leibler (KL) divergence, for ensemble models. Our code is publicly available at https://github.com/amzn/confident-sinkhorn-allocation.  ( 3 min )
    LLP-Bench: A Large Scale Tabular Benchmark for Learning from Label Proportions
    arXiv:2310.10096v2 Announce Type: replace Abstract: In the task of Learning from Label Proportions (LLP), a model is trained on groups (a.k.a bags) of instances and their corresponding label proportions to predict labels for individual instances. LLP has been applied pre-dominantly on two types of datasets - image and tabular. In image LLP, bags of fixed size are created by randomly sampling instances from an underlying dataset. Bags created via this methodology are called random bags. Experimentation on Image LLP has been mostly on random bags on CIFAR-* and MNIST datasets. Despite being a very crucial task in privacy sensitive applications, tabular LLP does not yet have a open, large scale LLP benchmark. One of the unique properties of tabular LLP is the ability to create feature bags where all the instances in a bag have the same value for a given feature. It has been shown in prior research that feature bags are very common in practical, real world applications [Chen et. al '23, Saket et. al. '22]. In this paper, we address the lack of a open, large scale tabular benchmark. First we propose LLP-Bench, a suite of 70 LLP datasets (62 feature bag and 8 random bag datasets) created from the Criteo CTR prediction and the Criteo Sponsored Search Conversion Logs datasets, the former a classification and the latter a regression dataset. These LLP datasets represent diverse ways in which bags can be constructed from underlying tabular data. To the best of our knowledge, LLP-Bench is the first large scale tabular LLP benchmark with an extensive diversity in constituent datasets. Second, we propose four metrics that characterize and quantify the hardness of a LLP dataset. Using these four metrics we present deep analysis of the 62 feature bag datasets in LLP-Bench. Finally we present the performance of 9 SOTA and popular tabular LLP techniques on all the 62 datasets.  ( 3 min )
    Towards Poisoning Fair Representations
    arXiv:2309.16487v2 Announce Type: replace Abstract: Fair machine learning seeks to mitigate model prediction bias against certain demographic subgroups such as elder and female. Recently, fair representation learning (FRL) trained by deep neural networks has demonstrated superior performance, whereby representations containing no demographic information are inferred from the data and then used as the input to classification or other downstream tasks. Despite the development of FRL methods, their vulnerability under data poisoning attack, a popular protocol to benchmark model robustness under adversarial scenarios, is under-explored. Data poisoning attacks have been developed for classical fair machine learning methods which incorporate fairness constraints into shallow-model classifiers. Nonetheless, these attacks fall short in FRL due to notably different fairness goals and model architectures. This work proposes the first data poisoning framework attacking FRL. We induce the model to output unfair representations that contain as much demographic information as possible by injecting carefully crafted poisoning samples into the training data. This attack entails a prohibitive bilevel optimization, wherefore an effective approximated solution is proposed. A theoretical analysis on the needed number of poisoning samples is derived and sheds light on defending against the attack. Experiments on benchmark fairness datasets and state-of-the-art fair representation learning models demonstrate the superiority of our attack.  ( 2 min )
    Minimum width for universal approximation using ReLU networks on compact domain
    arXiv:2309.10402v2 Announce Type: replace Abstract: It has been shown that deep neural networks of a large enough width are universal approximators but they are not if the width is too small. There were several attempts to characterize the minimum width $w_{\min}$ enabling the universal approximation property; however, only a few of them found the exact values. In this work, we show that the minimum width for $L^p$ approximation of $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$ is exactly $\max\{d_x,d_y,2\}$ if an activation function is ReLU-Like (e.g., ReLU, GELU, Softplus). Compared to the known result for ReLU networks, $w_{\min}=\max\{d_x+1,d_y\}$ when the domain is $\smash{\mathbb R^{d_x}}$, our result first shows that approximation on a compact domain requires smaller width than on $\smash{\mathbb R^{d_x}}$. We next prove a lower bound on $w_{\min}$ for uniform approximation using general activation functions including ReLU: $w_{\min}\ge d_y+1$ if $d_x<d_y\le2d_x$. Together with our first result, this shows a dichotomy between $L^p$ and uniform approximations for general activation functions and input/output dimensions.  ( 2 min )
    Neural network relief: a pruning algorithm based on neural activity
    arXiv:2109.10795v3 Announce Type: replace Abstract: Current deep neural networks (DNNs) are overparameterized and use most of their neuronal connections during inference for each task. The human brain, however, developed specialized regions for different tasks and performs inference with a small fraction of its neuronal connections. We propose an iterative pruning strategy introducing a simple importance-score metric that deactivates unimportant connections, tackling overparameterization in DNNs and modulating the firing patterns. The aim is to find the smallest number of connections that is still capable of solving a given task with comparable accuracy, i.e. a simpler subnetwork. We achieve comparable performance for LeNet architectures on MNIST, and significantly higher parameter compression than state-of-the-art algorithms for VGG and ResNet architectures on CIFAR-10/100 and Tiny-ImageNet. Our approach also performs well for the two different optimizers considered -- Adam and SGD. The algorithm is not designed to minimize FLOPs when considering current hardware and software implementations, although it performs reasonably when compared to the state of the art.  ( 2 min )
    A Safe Deep Reinforcement Learning Approach for Energy Efficient Federated Learning in Wireless Communication Networks
    arXiv:2308.10664v3 Announce Type: replace Abstract: Progressing towards a new era of Artificial Intelligence (AI) - enabled wireless networks, concerns regarding the environmental impact of AI have been raised both in industry and academia. Federated Learning (FL) has emerged as a key privacy preserving decentralized AI technique. Despite efforts currently being made in FL, its environmental impact is still an open problem. Targeting the minimization of the overall energy consumption of an FL process, we propose the orchestration of computational and communication resources of the involved devices to minimize the total energy required, while guaranteeing a certain performance of the model. To this end, we propose a Soft Actor Critic Deep Reinforcement Learning (DRL) solution, where a penalty function is introduced during training, penalizing the strategies that violate the constraints of the environment, and contributing towards a safe RL process. A device level synchronization method, along with a computationally cost effective FL environment are proposed, with the goal of further reducing the energy consumption and communication overhead. Evaluation results show the effectiveness and robustness of the proposed scheme compared to four state-of-the-art baseline solutions on different network environments and FL architectures, achieving a decrease of up to 94% in the total energy consumption.  ( 3 min )
    Hierarchical Contrastive Learning Enhanced Heterogeneous Graph Neural Network
    arXiv:2304.12228v2 Announce Type: replace Abstract: Heterogeneous graph neural networks (HGNNs) as an emerging technique have shown superior capacity of dealing with heterogeneous information network (HIN). However, most HGNNs follow a semi-supervised learning manner, which notably limits their wide use in reality since labels are usually scarce in real applications. Recently, contrastive learning, a self-supervised method, becomes one of the most exciting learning paradigms and shows great potential when there are no labels. In this paper, we study the problem of self-supervised HGNNs and propose a novel co-contrastive learning mechanism for HGNNs, named HeCo. Different from traditional contrastive learning which only focuses on contrasting positive and negative samples, HeCo employs cross-view contrastive mechanism. Specifically, two views of a HIN (network schema and meta-path views) are proposed to learn node embeddings, so as to capture both of local and high-order structures simultaneously. Then the cross-view contrastive learning, as well as a view mask mechanism, is proposed, which is able to extract the positive and negative embeddings from two views. This enables the two views to collaboratively supervise each other and finally learn high-level node embeddings. Moreover, to further boost the performance of HeCo, two additional methods are designed to generate harder negative samples with high quality. Besides the invariant factors, view-specific factors complementally provide the diverse structure information between different nodes, which also should be contained into the final embeddings. Therefore, we need to further explore each view independently and propose a modified model, called HeCo++. Specifically, HeCo++ conducts hierarchical contrastive learning, including cross-view and intra-view contrasts, which aims to enhance the mining of respective structures.  ( 3 min )
    Anatomically Constrained Tractography of the Fetal Brain
    arXiv:2403.02444v1 Announce Type: cross Abstract: Diffusion-weighted Magnetic Resonance Imaging (dMRI) is increasingly used to study the fetal brain in utero. An important computation enabled by dMRI is streamline tractography, which has unique applications such as tract-specific analysis of the brain white matter and structural connectivity assessment. However, due to the low fetal dMRI data quality and the challenging nature of tractography, existing methods tend to produce highly inaccurate results. They generate many false streamlines while failing to reconstruct streamlines that constitute the major white matter tracts. In this paper, we advocate for anatomically constrained tractography based on an accurate segmentation of the fetal brain tissue directly in the dMRI space. We develop a deep learning method to compute the segmentation automatically. Experiments on independent test data show that this method can accurately segment the fetal brain tissue and drastically improve tractography results. It enables the reconstruction of highly curved tracts such as optic radiations. Importantly, our method infers the tissue segmentation and streamline propagation direction from a diffusion tensor fit to the dMRI data, making it applicable to routine fetal dMRI scans. The proposed method can lead to significant improvements in the accuracy and reproducibility of quantitative assessment of the fetal brain with dMRI.  ( 2 min )
    Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
    arXiv:2305.14342v4 Announce Type: replace Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.  ( 3 min )
    Federated Learning Under Attack: Exposing Vulnerabilities through Data Poisoning Attacks in Computer Networks
    arXiv:2403.02983v1 Announce Type: cross Abstract: Federated Learning (FL) is a machine learning (ML) approach that enables multiple decentralized devices or edge servers to collaboratively train a shared model without exchanging raw data. During the training and sharing of model updates between clients and servers, data and models are susceptible to different data-poisoning attacks. In this study, our motivation is to explore the severity of data poisoning attacks in the computer network domain because they are easy to implement but difficult to detect. We considered two types of data-poisoning attacks, label flipping (LF) and feature poisoning (FP), and applied them with a novel approach. In LF, we randomly flipped the labels of benign data and trained the model on the manipulated data. For FP, we randomly manipulated the highly contributing features determined using the Random Forest algorithm. The datasets used in this experiment were CIC and UNSW related to computer networks. We generated adversarial samples using the two attacks mentioned above, which were applied to a small percentage of datasets. Subsequently, we trained and tested the accuracy of the model on adversarial datasets. We recorded the results for both benign and manipulated datasets and observed significant differences between the accuracy of the models on different datasets. From the experimental results, it is evident that the LF attack failed, whereas the FP attack showed effective results, which proved its significance in fooling a server. With a 1% LF attack on the CIC, the accuracy was approximately 0.0428 and the ASR was 0.9564; hence, the attack is easily detectable, while with a 1% FP attack, the accuracy and ASR were both approximately 0.9600, hence, FP attacks are difficult to detect. We repeated the experiment with different poisoning percentages.  ( 3 min )
    Generalizing Graph Neural Networks on Out-Of-Distribution Graphs
    arXiv:2111.10657v3 Announce Type: replace Abstract: Graph Neural Networks (GNNs) are proposed without considering the agnostic distribution shifts between training and testing graphs, inducing the degeneration of the generalization ability of GNNs on Out-Of-Distribution (OOD) settings. The fundamental reason for such degeneration is that most GNNs are developed based on the I.I.D hypothesis. In such a setting, GNNs tend to exploit subtle statistical correlations existing in the training set for predictions, even though it is a spurious correlation. However, such spurious correlations may change in testing environments, leading to the failure of GNNs. Therefore, eliminating the impact of spurious correlations is crucial for stable GNNs. To this end, we propose a general causal representation framework, called StableGNN. The main idea is to extract high-level representations from graph data first and resort to the distinguishing ability of causal inference to help the model get rid of spurious correlations. Particularly, we exploit a graph pooling layer to extract subgraph-based representations as high-level representations. Furthermore, we propose a causal variable distinguishing regularizer to correct the biased training distribution. Hence, GNNs would concentrate more on the stable correlations. Extensive experiments on both synthetic and real-world OOD graph datasets well verify the effectiveness, flexibility and interpretability of the proposed framework.  ( 3 min )
    Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning
    arXiv:2301.12130v2 Announce Type: replace Abstract: Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Existing methods for addressing this issue either control policy to exclude the OOD action or make the $Q$ function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To overcome this problem, we propose a Constrained Policy optimization with Explicit Behavior density (CPED) method that utilizes a flow-GAN model to explicitly estimate the density of behavior policy. By estimating the explicit density, CPED can accurately identify the safe region and enable optimization within the region, resulting in less conservative learning policies. We further provide theoretical results for both the flow-GAN estimator and performance guarantee for CPED by showing that CPED can find the optimal $Q$-function value. Empirically, CPED outperforms existing alternatives on various standard offline reinforcement learning tasks, yielding higher expected returns.  ( 2 min )
    Directed Acyclic Graph Structure Learning from Dynamic Graphs
    arXiv:2211.17029v2 Announce Type: replace Abstract: Estimating the structure of directed acyclic graphs (DAGs) of features (variables) plays a vital role in revealing the latent data generation process and providing causal insights in various applications. Although there have been many studies on structure learning with various types of data, the structure learning on the dynamic graph has not been explored yet, and thus we study the learning problem of node feature generation mechanism on such ubiquitous dynamic graph data. In a dynamic graph, we propose to simultaneously estimate contemporaneous relationships and time-lagged interaction relationships between the node features. These two kinds of relationships form a DAG, which could effectively characterize the feature generation process in a concise way. To learn such a DAG, we cast the learning problem as a continuous score-based optimization problem, which consists of a differentiable score function to measure the validity of the learned DAGs and a smooth acyclicity constraint to ensure the acyclicity of the learned DAGs. These two components are translated into an unconstraint augmented Lagrangian objective which could be minimized by mature continuous optimization techniques. The resulting algorithm, named GraphNOTEARS, outperforms baselines on simulated data across a wide range of settings that may encounter in real-world applications. We also apply the proposed approach on two dynamic graphs constructed from the real-world Yelp dataset, demonstrating our method could learn the connections between node features, which conforms with the domain knowledge.  ( 3 min )
    Incremental Spatial and Spectral Learning of Neural Operators for Solving Large-Scale PDEs
    arXiv:2211.15188v4 Announce Type: replace Abstract: Fourier Neural Operators (FNO) offer a principled approach to solving challenging partial differential equations (PDE) such as turbulent flows. At the core of FNO is a spectral layer that leverages a discretization-convergent representation in the Fourier domain, and learns weights over a fixed set of frequencies. However, training FNO presents two significant challenges, particularly in large-scale, high-resolution applications: (i) Computing Fourier transform on high-resolution inputs is computationally intensive but necessary since fine-scale details are needed for solving many PDEs, such as fluid flows, (ii) selecting the relevant set of frequencies in the spectral layers is challenging, and too many modes can lead to overfitting, while too few can lead to underfitting. To address these issues, we introduce the Incremental Fourier Neural Operator (iFNO), which progressively increases both the number of frequency modes used by the model as well as the resolution of the training data. We empirically show that iFNO reduces total training time while maintaining or improving generalization performance across various datasets. Our method demonstrates a 10% lower testing error, using 20% fewer frequency modes compared to the existing Fourier Neural Operator, while also achieving a 30% faster training.  ( 3 min )
    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
    arXiv:2403.03100v1 Announce Type: cross Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model the intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.  ( 3 min )
    Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels
    arXiv:2403.02746v1 Announce Type: cross Abstract: Large-scale high-resolution (HR) land-cover mapping is a vital task to survey the Earth's surface and resolve many challenges facing humanity. However, it is still a non-trivial task hindered by complex ground details, various landforms, and the scarcity of accurate training labels over a wide-span geographic area. In this paper, we propose an efficient, weakly supervised framework (Paraformer), a.k.a. Low-to-High Network (L2HNet) V2, to guide large-scale HR land-cover mapping with easy-access historical land-cover data of low resolution (LR). Specifically, existing land-cover mapping approaches reveal the dominance of CNNs in preserving local ground details but still suffer from insufficient global modeling in various landforms. Therefore, we design a parallel CNN-Transformer feature extractor in Paraformer, consisting of a downsampling-free CNN branch and a Transformer branch, to jointly capture local and global contextual information. Besides, facing the spatial mismatch of training data, a pseudo-label-assisted training (PLAT) module is adopted to reasonably refine LR labels for weakly supervised semantic segmentation of HR images. Experiments on two large-scale datasets demonstrate the superiority of Paraformer over other state-of-the-art methods for automatically updating HR land-cover maps from LR historical labels.  ( 2 min )
    Revisiting Confidence Estimation: Towards Reliable Failure Prediction
    arXiv:2403.02886v1 Announce Type: cross Abstract: Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at \url{https://github.com/Impression2805/FMFP}.  ( 2 min )
    Making deep neural networks right for the right scientific reasons by interacting with their explanations
    arXiv:2001.05371v4 Announce Type: replace Abstract: Deep neural networks have shown excellent performances in many real-world applications. Unfortunately, they may show "Clever Hans"-like behavior -- making use of confounding factors within datasets -- to achieve high performance. In this work, we introduce the novel learning setting of "explanatory interactive learning" (XIL) and illustrate its benefits on a plant phenotyping research task. XIL adds the scientist into the training loop such that she interactively revises the original model via providing feedback on its explanations. Our experimental results demonstrate that XIL can help avoiding Clever Hans moments in machine learning and encourages (or discourages, if appropriate) trust into the underlying model.  ( 2 min )
    VQSynery: Robust Drug Synergy Prediction With Vector Quantization Mechanism
    arXiv:2403.03089v1 Announce Type: cross Abstract: The pursuit of optimizing cancer therapies is significantly advanced by the accurate prediction of drug synergy. Traditional methods, such as clinical trials, are reliable yet encumbered by extensive time and financial demands. The emergence of high-throughput screening and computational innovations has heralded a shift towards more efficient methodologies for exploring drug interactions. In this study, we present VQSynergy, a novel framework that employs the Vector Quantization (VQ) mechanism, integrated with gated residuals and a tailored attention mechanism, to enhance the precision and generalizability of drug synergy predictions. Our findings demonstrate that VQSynergy surpasses existing models in terms of robustness, particularly under Gaussian noise conditions, highlighting its superior performance and utility in the complex and often noisy domain of drug synergy research. This study underscores the potential of VQSynergy in revolutionizing the field through its advanced predictive capabilities, thereby contributing to the optimization of cancer treatment strategies.  ( 2 min )
    Non-Convex Stochastic Composite Optimization with Polyak Momentum
    arXiv:2403.02967v1 Announce Type: cross Abstract: The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is notoriously known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e. when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of the Polyak momentum in the composite optimization setting and we show the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.  ( 2 min )
    Learning Explicitly Conditioned Sparsifying Transforms
    arXiv:2403.03168v1 Announce Type: cross Abstract: Sparsifying transforms became in the last decades widely known tools for finding structured sparse representations of signals in certain transform domains. Despite the popularity of classical transforms such as DCT and Wavelet, learning optimal transforms that guarantee good representations of data into the sparse domain has been recently analyzed in a series of papers. Typically, the conditioning number and representation ability are complementary key features of learning square transforms that may not be explicitly controlled in a given optimization model. Unlike the existing approaches from the literature, in our paper, we consider a new sparsifying transform model that enforces explicit control over the data representation quality and the condition number of the learned transforms. We confirm through numerical experiments that our model presents better numerical behavior than the state-of-the-art.  ( 2 min )
    False Positive Sampling-based Data Augmentation for Enhanced 3D Object Detection Accuracy
    arXiv:2403.02639v1 Announce Type: cross Abstract: Recent studies have focused on enhancing the performance of 3D object detection models. Among various approaches, ground-truth sampling has been proposed as an augmentation technique to address the challenges posed by limited ground-truth data. However, an inherent issue with ground-truth sampling is its tendency to increase false positives. Therefore, this study aims to overcome the limitations of ground-truth sampling and improve the performance of 3D object detection models by developing a new augmentation technique called false-positive sampling. False-positive sampling involves retraining the model using point clouds that are identified as false positives in the model's predictions. We propose an algorithm that utilizes both ground-truth and false-positive sampling and an algorithm for building the false-positive sample database. Additionally, we analyze the principles behind the performance enhancement due to false-positive sampling and propose a technique that applies the concept of curriculum learning to the sampling strategy that encompasses both false-positive and ground-truth sampling techniques. Our experiments demonstrate that models utilizing false-positive sampling show a reduction in false positives and exhibit improved object detection performance. On the KITTI and Waymo Open datasets, models with false-positive sampling surpass the baseline models by a large margin.  ( 2 min )
    Online Learning of Human Constraints from Feedback in Shared Autonomy
    arXiv:2403.02974v1 Announce Type: cross Abstract: Real-time collaboration with humans poses challenges due to the different behavior patterns of humans resulting from diverse physical constraints. Existing works typically focus on learning safety constraints for collaboration, or how to divide and distribute the subtasks between the participating agents to carry out the main task. In contrast, we propose to learn a human constraints model that, in addition, considers the diverse behaviors of different human operators. We consider a type of collaboration in a shared-autonomy fashion, where both a human operator and an assistive robot act simultaneously in the same task space that affects each other's actions. The task of the assistive agent is to augment the skill of humans to perform a shared task by supporting humans as much as possible, both in terms of reducing the workload and minimizing the discomfort for the human operator. Therefore, we propose an augmentative assistant agent capable of learning and adapting to human physical constraints, aligning its actions with the ergonomic preferences and limitations of the human operator.  ( 2 min )
    MathScale: Scaling Instruction Tuning for Mathematical Reasoning
    arXiv:2403.02884v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., {\tt GPT-3.5}). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then build a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate mathematical reasoning abilities of LLMs comprehensively, we construct {\sc MwpBench}, a benchmark of Math Word Problems, which is a collection of ten datasets (including GSM8K and MATH) covering K-12, college, and competition level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on {\sc MwpBench}, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9\% in micro average accuracy and 43.7\% in macro average accuracy, respectively.  ( 2 min )
    Interactive Continual Learning: Fast and Slow Thinking
    arXiv:2403.02628v1 Announce Type: cross Abstract: Advanced life forms, sustained by the synergistic interaction of neural cognitive mechanisms, continually acquire and transfer knowledge throughout their lifespan. In contrast, contemporary machine learning paradigms exhibit limitations in emulating the facets of continual learning (CL). Nonetheless, the emergence of large language models (LLMs) presents promising avenues for realizing CL via interactions with these models. Drawing on Complementary Learning System theory, this paper presents a novel Interactive Continual Learning (ICL) framework, enabled by collaborative interactions among models of various sizes. Specifically, we assign the ViT model as System1 and multimodal LLM as System2. To enable the memory module to deduce tasks from class information and enhance Set2Set retrieval, we propose the Class-Knowledge-Task Multi-Head Attention (CKT-MHA). Additionally, to improve memory retrieval in System1 through enhanced geometric representation, we introduce the CL-vMF mechanism, based on the von Mises-Fisher (vMF) distribution. Meanwhile, we introduce the von Mises-Fisher Outlier Detection and Interaction (vMF-ODI) strategy to identify hard examples, thus enhancing collaboration between System1 and System2 for complex reasoning realization. Comprehensive evaluation of our proposed ICL demonstrates significant resistance to forgetting and superior performance relative to existing methods.  ( 2 min )
    Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering
    arXiv:2403.02966v1 Announce Type: cross Abstract: Recent studies have investigated utilizing Knowledge Graphs (KGs) to enhance Quesetion Answering (QA) performance of Large Language Models (LLMs), yet structured KG verbalization remains challengin. Existing methods, such as triple-form or free-form textual conversion of triple-form facts, encounter several issues. These include reduced evidence density due to duplicated entities or relationships, and reduced evidence clarity due to an inability to emphasize crucial evidence. To address these issues, we propose EFSum, an Evidence-focused Fact Summarization framework for enhanced QA with knowledge-augmented LLMs. We optimize an open-source LLM as a fact summarizer through distillation and preference alignment. Our extensive experiments show that EFSum improves LLM's zero-shot QA performance, and it is possible to ensure both the helpfulness and faithfulness of the summary.  ( 2 min )
    AceMap: Knowledge Discovery through Academic Graph
    arXiv:2403.02576v1 Announce Type: cross Abstract: The exponential growth of scientific literature requires effective management and extraction of valuable insights. While existing scientific search engines excel at delivering search results based on relational databases, they often neglect the analysis of collaborations between scientific entities and the evolution of ideas, as well as the in-depth analysis of content within scientific publications. The representation of heterogeneous graphs and the effective measurement, analysis, and mining of such graphs pose significant challenges. To address these challenges, we present AceMap, an academic system designed for knowledge discovery through academic graph. We present advanced database construction techniques to build the comprehensive AceMap database with large-scale academic publications that contain rich visual, textual, and numerical information. AceMap also employs innovative visualization, quantification, and analysis methods to explore associations and logical relationships among academic entities. AceMap introduces large-scale academic network visualization techniques centered on nebular graphs, providing a comprehensive view of academic networks from multiple perspectives. In addition, AceMap proposes a unified metric based on structural entropy to quantitatively measure the knowledge content of different academic entities. Moreover, AceMap provides advanced analysis capabilities, including tracing the evolution of academic ideas through citation relationships and concept co-occurrence, and generating concise summaries informed by this evolutionary process. In addition, AceMap uses machine reading methods to generate potential new ideas at the intersection of different fields. Exploring the integration of large language models and knowledge graphs is a promising direction for future research in idea evolution. Please visit \url{https://www.acemap.info} for further exploration.  ( 3 min )
    AdAM: Adaptive Fault-Tolerant Approximate Multiplier for Edge DNN Accelerators
    arXiv:2403.02936v1 Announce Type: cross Abstract: In this paper, we propose an architecture of a novel adaptive fault-tolerant approximate multiplier tailored for ASIC-based DNN accelerators.  ( 2 min )
    Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
    arXiv:2311.17948v1 Announce Type: cross Abstract: In this paper, we study multi-label atomic activity recognition. Despite the notable progress in action recognition, it is still challenging to recognize atomic activities due to a deficiency in a holistic understanding of both multiple road users' motions and their contextual information. In this paper, we introduce Action-slot, a slot attention-based approach that learns visual action-centric representations, capturing both motion and contextual information. Our key idea is to design action slots that are capable of paying attention to regions where atomic activities occur, without the need for explicit perception guidance. To further enhance slot attention, we introduce a background slot that competes with action slots, aiding the training process in avoiding unnecessary focus on background regions devoid of activities. Yet, the imbalanced class distribution in the existing dataset hampers the assessment of rare activities. To address the limitation, we collect a synthetic dataset called TACO, which is four times larger than OATS and features a balanced distribution of atomic activities. To validate the effectiveness of our method, we conduct comprehensive experiments and ablation studies against various action recognition baselines. We also show that the performance of multi-label atomic activity recognition on real-world datasets can be improved by pretraining representations on TACO. We will release our source code and dataset. See the videos of visualization on the project page: https://hcis-lab.github.io/Action-slot/  ( 3 min )
    A Second Look on BASS -- Boosting Abstractive Summarization with Unified Semantic Graphs -- A Replication Study
    arXiv:2403.02930v1 Announce Type: cross Abstract: We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.  ( 2 min )
    Quantum Mixed-State Self-Attention Network
    arXiv:2403.02871v1 Announce Type: cross Abstract: The rapid advancement of quantum computing has increasingly highlighted its potential in the realm of machine learning, particularly in the context of natural language processing (NLP) tasks. Quantum machine learning (QML) leverages the unique capabilities of quantum computing to offer novel perspectives and methodologies for complex data processing and pattern recognition challenges. This paper introduces a novel Quantum Mixed-State Attention Network (QMSAN), which integrates the principles of quantum computing with classical machine learning algorithms, especially self-attention networks, to enhance the efficiency and effectiveness in handling NLP tasks. QMSAN model employs a quantum attention mechanism based on mixed states, enabling efficient direct estimation of similarity between queries and keys within the quantum domain, leading to more effective attention weight acquisition. Additionally, we propose an innovative quantum positional encoding scheme, implemented through fixed quantum gates within the quantum circuit, to enhance the model's accuracy. Experimental validation on various datasets demonstrates that QMSAN model outperforms existing quantum and classical models in text classification, achieving significant performance improvements. QMSAN model not only significantly reduces the number of parameters but also exceeds classical self-attention networks in performance, showcasing its strong capability in data representation and information extraction. Furthermore, our study investigates the model's robustness in different quantum noise environments, showing that QMSAN possesses commendable robustness to low noise.  ( 2 min )
    Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
    arXiv:2403.02502v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn from their exploration failures. This leads to improved performance through an iterative optimization framework. During the exploration phase, the agent interacts with the environment while completing given tasks, gathering failure trajectories to create contrastive trajectory pairs. In the subsequent training phase, the agent utilizes these trajectory preference pairs to update its policy using contrastive learning methods like DPO. This iterative cycle of exploration and training fosters continued improvement in the agents. Our experiments on three complex tasks demonstrate that ETO consistently surpasses baseline performance by a large margin. Furthermore, an examination of task-solving efficiency and potential in scenarios lacking expert trajectory underscores the effectiveness of our approach.  ( 2 min )
    Classification of the Fashion-MNIST Dataset on a Quantum Computer
    arXiv:2403.02405v1 Announce Type: cross Abstract: The potential impact of quantum machine learning algorithms on industrial applications remains an exciting open question. Conventional methods for encoding classical data into quantum computers are not only too costly for a potential quantum advantage in the algorithms but also severely limit the scale of feasible experiments on current hardware. Therefore, recent works, despite claiming the near-term suitability of their algorithms, do not provide experimental benchmarking on standard machine learning datasets. We attempt to solve the data encoding problem by improving a recently proposed variational algorithm [1] that approximately prepares the encoded data, using asymptotically shallow circuits that fit the native gate set and topology of currently available quantum computers. We apply the improved algorithm to encode the Fashion-MNIST dataset [2], which can be directly used in future empirical studies of quantum machine learning algorithms. We deploy simple quantum variational classifiers trained on the encoded dataset on a current quantum computer ibmq-kolkata [3] and achieve moderate accuracies, providing a proof of concept for the near-term usability of our data encoding method.  ( 2 min )
    Scalable Continuous-time Diffusion Framework for Network Inference and Influence Estimation
    arXiv:2403.02867v1 Announce Type: cross Abstract: The study of continuous-time information diffusion has been an important area of research for many applications in recent years. When only the diffusion traces (cascades) are accessible, cascade-based network inference and influence estimation are two essential problems to explore. Alas, existing methods exhibit limited capability to infer and process networks with more than a few thousand nodes, suffering from scalability issues. In this paper, we view the diffusion process as a continuous-time dynamical system, based on which we establish a continuous-time diffusion model. Subsequently, we instantiate the model to a scalable and effective framework (FIM) to approximate the diffusion propagation from available cascades, thereby inferring the underlying network structure. Furthermore, we undertake an analysis of the approximation error of FIM for network inference. To achieve the desired scalability for influence estimation, we devise an advanced sampling technique and significantly boost the efficiency. We also quantify the effect of the approximation error on influence estimation theoretically. Experimental results showcase the effectiveness and superior scalability of FIM on network inference and influence estimation.  ( 2 min )
    A Distance Metric Learning Model Based On Variational Information Bottleneck
    arXiv:2403.02794v1 Announce Type: cross Abstract: In recent years, personalized recommendation technology has flourished and become one of the hot research directions. The matrix factorization model and the metric learning model which proposed successively have been widely studied and applied. The latter uses the Euclidean distance instead of the dot product used by the former to measure the latent space vector. While avoiding the shortcomings of the dot product, the assumption of Euclidean distance is neglected, resulting in limited recommendation quality of the model. In order to solve this problem, this paper combines the Variationl Information Bottleneck with metric learning model for the first time, and proposes a new metric learning model VIB-DML (Variational Information Bottleneck Distance Metric Learning) for rating prediction, which limits the mutual information of the latent space feature vector to improve the robustness of the model and satisfiy the assumption of Euclidean distance by decoupling the latent space feature vector. In this paper, the experimental results are compared with the root mean square error (RMSE) on the three public datasets. The results show that the generalization ability of VIB-DML is excellent. Compared with the general metric learning model MetricF, the prediction error is reduced by 7.29%. Finally, the paper proves the strong robustness of VIBDML through experiments.  ( 2 min )
    Precise Extraction of Deep Learning Models via Side-Channel Attacks on Edge/Endpoint Devices
    arXiv:2403.02870v1 Announce Type: cross Abstract: With growing popularity, deep learning (DL) models are becoming larger-scale, and only the companies with vast training datasets and immense computing power can manage their business serving such large models. Most of those DL models are proprietary to the companies who thus strive to keep their private models safe from the model extraction attack (MEA), whose aim is to steal the model by training surrogate models. Nowadays, companies are inclined to offload the models from central servers to edge/endpoint devices. As revealed in the latest studies, adversaries exploit this opportunity as new attack vectors to launch side-channel attack (SCA) on the device running victim model and obtain various pieces of the model information, such as the model architecture (MA) and image dimension (ID). Our work provides a comprehensive understanding of such a relationship for the first time and would benefit future MEA studies in both offensive and defensive sides in that they may learn which pieces of information exposed by SCA are more important than the others. Our analysis additionally reveals that by grasping the victim model information from SCA, MEA can get highly effective and successful even without any prior knowledge of the model. Finally, to evince the practicality of our analysis results, we empirically apply SCA, and subsequently, carry out MEA under realistic threat assumptions. The results show up to 5.8 times better performance than when the adversary has no model information about the victim model.  ( 3 min )
    Fast, Scale-Adaptive, and Uncertainty-Aware Downscaling of Earth System Model Fields with Generative Foundation Models
    arXiv:2403.02774v1 Announce Type: cross Abstract: Accurate and high-resolution Earth system model (ESM) simulations are essential to assess the ecological and socio-economic impacts of anthropogenic climate change, but are computationally too expensive. Recent machine learning approaches have shown promising results in downscaling ESM simulations, outperforming state-of-the-art statistical approaches. However, existing methods require computationally costly retraining for each ESM and extrapolate poorly to climates unseen during training. We address these shortcomings by learning a consistency model (CM) that efficiently and accurately downscales arbitrary ESM simulations without retraining in a zero-shot manner. Our foundation model approach yields probabilistic downscaled fields at resolution only limited by the observational reference data. We show that the CM outperforms state-of-the-art diffusion models at a fraction of computational cost while maintaining high controllability on the downscaling task. Further, our method generalizes to climate states unseen during training without explicitly formulated physical constraints.  ( 2 min )
    FedHCDR: Federated Cross-Domain Recommendation with Hypergraph Signal Decoupling
    arXiv:2403.02630v1 Announce Type: new Abstract: In recent years, Cross-Domain Recommendation (CDR) has drawn significant attention, which utilizes user data from multiple domains to enhance the recommendation performance. However, current CDR methods require sharing user data across domains, thereby violating the General Data Protection Regulation (GDPR). Consequently, numerous approaches have been proposed for Federated Cross-Domain Recommendation (FedCDR). Nevertheless, the data heterogeneity across different domains inevitably influences the overall performance of federated learning. In this study, we propose FedHCDR, a novel Federated Cross-Domain Recommendation framework with Hypergraph signal decoupling. Specifically, to address the data heterogeneity across domains, we introduce an approach called hypergraph signal decoupling (HSD) to decouple the user features into domain-exclusive and domain-shared features. The approach employs high-pass and low-pass hypergraph filters to decouple domain-exclusive and domain-shared user representations, which are trained by the local-global bi-directional transfer algorithm. In addition, a hypergraph contrastive learning (HCL) module is devised to enhance the learning of domain-shared user relationship information by perturbing the user hypergraph. Extensive experiments conducted on three real-world scenarios demonstrate that FedHCDR outperforms existing baselines significantly.  ( 2 min )
    Over-The-Air Double-Threshold Deep Learner for Jamming Detection in 5G RF domain
    arXiv:2403.02645v1 Announce Type: cross Abstract: With the evolution of 5G wireless communications, the Synchronization Signal Block (SSB) plays a critical role in the synchronization of devices and accessibility of services. However, due to the predictable nature of SSB transmission, including the Primary and Secondary Synchronization Signals (PSS and SSS), jamming attacks are critical threats. By leveraging RF domain knowledge, this work presents a novel deep learning-based technique for detecting jammers in 5G networks. Unlike the existing jamming detection algorithms that mostly rely on network parameters, we introduce a double threshold deep learning jamming detector by focusing on the SSB. The detection method is focused on RF domain features and improves the robustness of the network without requiring integration with the pre-existing network infrastructure. By integrating a preprocessing block that extracts PSS correlation and energy per null resource elements (EPNRE) characteristics, our method distinguishes between normal and jammed received signals with high precision. Additionally, by incorporation of Discrete Wavelet Transform (DWT), the efficacy of training and detection are optimized. A double threshold double Deep Neural Network (DT-DDNN) is also introduced to the architecture complemented by a deep cascade learning model to increase the sensitivity of the model to variations of signal to jamming noise ratio (SJNR). Results show that the proposed method achieves 96.4% detection rate in extra low jamming power, i.e., SJNR between 15 to 30 dB which outperforms the single threshold DNN design with 86.0% detection rate and unprocessed IQ sample DNN design with 83.2% detection rate. Ultimately, performance of DT-DDNN is validated through the analysis of real 5G signals obtained from a practical testbed, demonstrating a strong alignment with the simulation results.  ( 3 min )
    What do we learn from inverting CLIP models?
    arXiv:2403.02580v1 Announce Type: cross Abstract: We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like "a beautiful landscape," as well as for prompts involving the names of celebrities.  ( 2 min )
    On the Asymptotic Mean Square Error Optimality of Diffusion Probabilistic Models
    arXiv:2403.02957v1 Announce Type: new Abstract: Diffusion probabilistic models (DPMs) have recently shown great potential for denoising tasks. Despite their practical utility, there is a notable gap in their theoretical understanding. This paper contributes novel theoretical insights by rigorously proving the asymptotic convergence of a specific DPM denoising strategy to the mean square error (MSE)-optimal conditional mean estimator (CME) over a large number of diffusion steps. The studied DPM-based denoiser shares the training procedure of DPMs but distinguishes itself by forwarding only the conditional mean during the reverse inference process after training. We highlight the unique perspective that DPMs are composed of an asymptotically optimal denoiser while simultaneously inheriting a powerful generator by switching re-sampling in the reverse process on and off. The theoretical findings are validated by numerical results.  ( 2 min )
    Density-based Isometric Mapping
    arXiv:2403.02531v1 Announce Type: new Abstract: The isometric mapping method employs the shortest path algorithm to estimate the Euclidean distance between points on High dimensional (HD) manifolds. This may not be sufficient for weakly uniformed HD data as it could lead to overestimating distances between far neighboring points, resulting in inconsistencies between the intrinsic (local) and extrinsic (global) distances during the projection. To address this issue, we modify the shortest path algorithm by adding a novel constraint inspired by the Parzen-Rosenblatt (PR) window, which helps to maintain the uniformity of the constructed shortest-path graph in Isomap. Multiple imaging datasets overall of 72,236 cases, 70,000 MINST data, 1596 from multiple Chest-XRay pneumonia datasets, and three NSCLC CT/PET datasets with a total of 640 lung cancer patients, were used to benchmark and validate PR-Isomap. 431 imaging biomarkers were extracted from each modality. Our results indicate that PR-Isomap projects HD attributes into a lower-dimensional (LD) space while preserving information, visualized by the MNIST dataset indicating the maintaining local and global distances. PR-Isomap achieved the highest comparative accuracies of 80.9% (STD:5.8) for pneumonia and 78.5% (STD:4.4), 88.4% (STD:1.4), and 61.4% (STD:11.4) for three NSCLC datasets, with a confidence interval of 95% for outcome prediction. Similarly, the multivariate Cox model showed higher overall survival, measured with c-statistics and log-likelihood test, of PR-Isomap compared to other dimensionality reduction methods. Kaplan Meier survival curve also signifies the notable ability of PR-Isomap to distinguish between high-risk and low-risk patients using multimodal imaging biomarkers preserving HD imaging characteristics for precision medicine.  ( 3 min )
    Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
    arXiv:2403.02626v1 Announce Type: cross Abstract: From content moderation to wildlife conservation, the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally, developing classifiers for such concepts requires substantial manual effort measured in hours, days, or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques, which enable rapid bootstrapping of image classifiers, users are still required to spend 30 minutes or more of monotonous, repetitive data labeling just to train a single classifier. Drawing on Fiske's Cognitive Miser theory, we propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions, reducing the total effort required to define a concept by an order of magnitude: from labeling 2,000 images to only 100 plus some natural language interactions. Our framework leverages recent advances in foundation models, both large language models and vision-language models, to carve out the concept space through conversation and by automatically labeling training data points. Most importantly, our framework eliminates the need for crowd-sourced annotations. Moreover, our framework ultimately produces lightweight classification models that are deployable in cost-sensitive scenarios. Across 15 subjective concepts and across 2 public image classification datasets, our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models like ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.  ( 3 min )
    Coronary artery segmentation in non-contrast calcium scoring CT images using deep learning
    arXiv:2403.02544v1 Announce Type: cross Abstract: Precise localization of coronary arteries in Computed Tomography (CT) scans is critical from the perspective of medical assessment of coronary artery disease. Although various methods exist that offer high-quality segmentation of coronary arteries in cardiac contrast-enhanced CT scans, the potential of less invasive, non-contrast CT in this area is still not fully exploited. Since such fine anatomical structures are hardly visible in this type of medical images, the existing methods are characterized by high recall and low precision, and are used mainly for filtering of atherosclerotic plaques in the context of calcium scoring. In this paper, we address this research gap and introduce a deep learning algorithm for segmenting coronary arteries in multi-vendor ECG-gated non-contrast cardiac CT images which benefits from a novel framework for semi-automatic generation of Ground Truth (GT) via image registration. We hypothesize that the proposed GT generation process is much more efficient in this case than manual segmentation, since it allows for a fast generation of large volumes of diverse data, which leads to well-generalizing models. To investigate and thoroughly evaluate the segmentation quality based on such an approach, we propose a novel method for manual mesh-to-image registration, which is used to create our test-GT. The experimental study shows that the trained model has significantly higher accuracy than the GT used for training, and leads to the Dice and clDice metrics close to the interrater variability.  ( 3 min )
    Differentially Private Representation Learning via Image Captioning
    arXiv:2403.02506v1 Announce Type: cross Abstract: Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) on a 233M subset of LAION-2B from scratch using a reasonable amount of computation, and obtaining unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of $\varepsilon=8$, a linear classifier trained on top of learned DP-Cap features attains 65.8% accuracy on ImageNet-1K, considerably improving the previous SOTA of 56.5%. Our work challenges the prevailing sentiment that high-utility DP representation learning cannot be achieved by training from scratch.  ( 2 min )
    Koopman operators with intrinsic observables in rigged reproducing kernel Hilbert spaces
    arXiv:2403.02524v1 Announce Type: cross Abstract: This paper presents a novel approach for estimating the Koopman operator defined on a reproducing kernel Hilbert space (RKHS) and its spectra. We propose an estimation method, what we call Jet Dynamic Mode Decomposition (JetDMD), leveraging the intrinsic structure of RKHS and the geometric notion known as jets to enhance the estimation of the Koopman operator. This method refines the traditional Extended Dynamic Mode Decomposition (EDMD) in accuracy, especially in the numerical estimation of eigenvalues. This paper proves JetDMD's superiority through explicit error bounds and convergence rate for special positive definite kernels, offering a solid theoretical foundation for its performance. We also delve into the spectral analysis of the Koopman operator, proposing the notion of extended Koopman operator within a framework of rigged Hilbert space. This notion leads to a deeper understanding of estimated Koopman eigenfunctions and capturing them outside the original function space. Through the theory of rigged Hilbert space, our study provides a principled methodology to analyze the estimated spectrum and eigenfunctions of Koopman operators, and enables eigendecomposition within a rigged RKHS. We also propose a new effective method for reconstructing the dynamical system from temporally-sampled trajectory data of the dynamical system with solid theoretical guarantee. We conduct several numerical simulations using the van der Pol oscillator, the Duffing oscillator, the H\'enon map, and the Lorenz attractor, and illustrate the performance of JetDMD with clear numerical computations of eigenvalues and accurate predictions of the dynamical systems.  ( 3 min )
    Reaching Consensus in Cooperative Multi-Agent Reinforcement Learning with Goal Imagination
    arXiv:2403.03172v1 Announce Type: cross Abstract: Reaching consensus is key to multi-agent coordination. To accomplish a cooperative task, agents need to coherently select optimal joint actions to maximize the team reward. However, current cooperative multi-agent reinforcement learning (MARL) methods usually do not explicitly take consensus into consideration, which may cause miscoordination problem. In this paper, we propose a model-based consensus mechanism to explicitly coordinate multiple agents. The proposed Multi-agent Goal Imagination (MAGI) framework guides agents to reach consensus with an Imagined common goal. The common goal is an achievable state with high value, which is obtained by sampling from the distribution of future states. We directly model this distribution with a self-supervised generative model, thus alleviating the "curse of dimensinality" problem induced by multi-agent multi-step policy rollout commonly used in model-based methods. We show that such efficient consensus mechanism can guide all agents cooperatively reaching valuable future states. Results on Multi-agent Particle-Environments and Google Research Football environment demonstrate the superiority of MAGI in both sample efficiency and performance.  ( 2 min )
    Robust Causal Bandits for Linear Models
    arXiv:2310.19794v2 Announce Type: replace-cross Abstract: Sequential design of experiments for optimizing a reward function in causal systems can be effectively modeled by the sequential design of interventions in causal bandits (CBs). In the existing literature on CBs, a critical assumption is that the causal models remain constant over time. However, this assumption does not necessarily hold in complex systems, which constantly undergo temporal model fluctuations. This paper addresses the robustness of CBs to such model fluctuations. The focus is on causal systems with linear structural equation models (SEMs). The SEMs and the time-varying pre- and post-interventional statistical models are all unknown. Cumulative regret is adopted as the design criteria, based on which the objective is to design a sequence of interventions that incur the smallest cumulative regret with respect to an oracle aware of the entire causal model and its fluctuations. First, it is established that the existing approaches fail to maintain regret sub-linearity with even a few instances of model deviation. Specifically, when the number of instances with model deviation is as few as $T^\frac{1}{2L}$, where $T$ is the time horizon and $L$ is the longest causal path in the graph, the existing algorithms will have linear regret in $T$. Next, a robust CB algorithm is designed, and its regret is analyzed, where upper and information-theoretic lower bounds on the regret are established. Specifically, in a graph with $N$ nodes and maximum degree $d$, under a general measure of model deviation $C$, the cumulative regret is upper bounded by $\tilde{\mathcal{O}}(d^{L-\frac{1}{2}}(\sqrt{NT} + NC))$ and lower bounded by $\Omega(d^{\frac{L}{2}-2}\max\{\sqrt{T},d^2C\})$. Comparing these bounds establishes that the proposed algorithm achieves nearly optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret when $C$ is $o(\sqrt{T})$ and maintains sub-linear regret for a broader range of $C$.  ( 3 min )
    On Leakage in Machine Learning Pipelines
    arXiv:2311.04179v2 Announce Type: replace Abstract: Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.  ( 2 min )
    Pair-Matching: Links Prediction with Adaptive Queries
    arXiv:1905.07342v3 Announce Type: replace-cross Abstract: The pair-matching problem appears in many applications where one wants to discover good matches between pairs of entities or individuals. Formally, the set of individuals is represented by the nodes of a graph where the edges, unobserved at first, represent the good matches. The algorithm queries pairs of nodes and observes the presence/absence of edges. Its goal is to discover as many edges as possible with a fixed budget of queries. Pair-matching is a particular instance of multi-armed bandit problem in which the arms are pairs of individuals and the rewards are edges linking these pairs. This bandit problem is non-standard though, as each arm can only be played once. Given this last constraint, sublinear regret can be expected only if the graph presents some underlying structure. This paper shows that sublinear regret is achievable in the case where the graph is generated according to a Stochastic Block Model (SBM) with two communities. Optimal regret bounds are computed for this pair-matching problem. They exhibit a phase transition related to the Kesten-Stigum threshold for community detection in SBM. The pair-matching problem is considered in the case where each node is constrained to be sampled less than a given amount of times. We show how optimal regret rates depend on this constraint. The paper is concluded by a conjecture regarding the optimal regret when the number of communities is larger than 2. Contrary to the two communities case, we argue that a statistical-computational gap would appear in this problem.  ( 3 min )
    To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
    arXiv:2310.13061v2 Announce Type: replace Abstract: Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where ($\xi \cdot 100\%$) of labels are corrupted (\emph{i.e.} some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve $100\%$ accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is (``mechanistically'') interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes \emph{grokking} dynamics reaching high train \emph{and} test accuracy; second, it unlearns the memorizing representations, where the train accuracy suddenly jumps from $100\%$ to $100 (1-\xi)\%$.  ( 3 min )
    Active Statistical Inference
    arXiv:2403.03208v1 Announce Type: cross Abstract: Inspired by the concept of active learning, we propose active inference$\unicode{x2013}$a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.  ( 2 min )
    Reliable, Adaptable, and Attributable Language Models with Retrieval
    arXiv:2403.03187v1 Announce Type: cross Abstract: Parametric language models (LMs), which are trained on vast amounts of web data, exhibit remarkable flexibility and capability. However, they still face practical challenges such as hallucinations, difficulty in adapting to new data distributions, and a lack of verifiability. In this position paper, we advocate for retrieval-augmented LMs to replace parametric LMs as the next generation of LMs. By incorporating large-scale datastores during inference, retrieval-augmented LMs can be more reliable, adaptable, and attributable. Despite their potential, retrieval-augmented LMs have yet to be widely adopted due to several obstacles: specifically, current retrieval-augmented LMs struggle to leverage helpful text beyond knowledge-intensive tasks such as question answering, have limited interaction between retrieval and LM components, and lack the infrastructure for scaling. To address these, we propose a roadmap for developing general-purpose retrieval-augmented LMs. This involves a reconsideration of datastores and retrievers, the exploration of pipelines with improved retriever-LM interaction, and significant investment in infrastructure for efficient training and inference.  ( 2 min )
    PalmProbNet: A Probabilistic Approach to Understanding Palm Distributions in Ecuadorian Tropical Forest via Transfer Learning
    arXiv:2403.03161v1 Announce Type: cross Abstract: Palms play an outsized role in tropical forests and are important resources for humans and wildlife. A central question in tropical ecosystems is understanding palm distribution and abundance. However, accurately identifying and localizing palms in geospatial imagery presents significant challenges due to dense vegetation, overlapping canopies, and variable lighting conditions in mixed-forest landscapes. Addressing this, we introduce PalmProbNet, a probabilistic approach utilizing transfer learning to analyze high-resolution UAV-derived orthomosaic imagery, enabling the detection of palm trees within the dense canopy of the Ecuadorian Rainforest. This approach represents a substantial advancement in automated palm detection, effectively pinpointing palm presence and locality in mixed tropical rainforests. Our process begins by generating an orthomosaic image from UAV images, from which we extract and label palm and non-palm image patches in two distinct sizes. These patches are then used to train models with an identical architecture, consisting of an unaltered pre-trained ResNet-18 and a Multilayer Perceptron (MLP) with specifically trained parameters. Subsequently, PalmProbNet employs a sliding window technique on the landscape orthomosaic, using both small and large window sizes to generate a probability heatmap. This heatmap effectively visualizes the distribution of palms, showcasing the scalability and adaptability of our approach in various forest densities. Despite the challenging terrain, our method demonstrated remarkable performance, achieving an accuracy of 97.32% and a Cohen's kappa of 94.59% in testing.  ( 3 min )
    Rethinking Clustered Federated Learning in NOMA Enhanced Wireless Networks
    arXiv:2403.03157v1 Announce Type: cross Abstract: This study explores the benefits of integrating the novel clustered federated learning (CFL) approach with non-orthogonal multiple access (NOMA) under non-independent and identically distributed (non-IID) datasets, where multiple devices participate in the aggregation with time limitations and a finite number of sub-channels. A detailed theoretical analysis of the generalization gap that measures the degree of non-IID in the data distribution is presented. Following that, solutions to address the challenges posed by non-IID conditions are proposed with the analysis of the properties. Specifically, users' data distributions are parameterized as concentration parameters and grouped using spectral clustering, with Dirichlet distribution serving as the prior. The investigation into the generalization gap and convergence rate guides the design of sub-channel assignments through the matching-based algorithm, and the power allocation is achieved by Karush-Kuhn-Tucker (KKT) conditions with the derived closed-form solution. The extensive simulation results show that the proposed cluster-based FL framework can outperform FL baselines in terms of both test accuracy and convergence rate. Moreover, jointly optimizing sub-channel and power allocation in NOMA-enhanced networks can lead to a significant improvement.  ( 2 min )
    Robust Federated Learning Mitigates Client-side Training Data Distribution Inference Attacks
    arXiv:2403.03149v1 Announce Type: cross Abstract: Recent studies have revealed that federated learning (FL), once considered secure due to clients not sharing their private data with the server, is vulnerable to attacks such as client-side training data distribution inference, where a malicious client can recreate the victim's data. While various countermeasures exist, they are not practical, often assuming server access to some training data or knowledge of label distribution before the attack. In this work, we bridge the gap by proposing InferGuard, a novel Byzantine-robust aggregation rule aimed at defending against client-side training data distribution inference attacks. In our proposed InferGuard, the server first calculates the coordinate-wise median of all the model updates it receives. A client's model update is considered malicious if it significantly deviates from the computed median update. We conduct a thorough evaluation of our proposed InferGuard on five benchmark datasets and perform a comparison with ten baseline methods. The results of our experiments indicate that our defense mechanism is highly effective in protecting against client-side training data distribution inference attacks, even against strong adaptive attacks. Furthermore, our method substantially outperforms the baseline methods in various practical FL scenarios.  ( 2 min )
    Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
    arXiv:2403.03145v1 Announce Type: cross Abstract: Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, the naive semi-supervised method is poor in fully leveraging the information of abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality pseudo-labels by intersecting their confidence maps. The sufficient utilization of both labeled and unlabeled data and the proposed unbiased framework enable DMT to outperform current state-of-the-art methods by a large margin, with CIoU of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source, obtaining 8.9%, 9.6% and 4.6%, 6.4% improvements over self- and semi-supervised methods respectively, given only 3% positional-annotations. We also extend our framework to some existing AVSL methods and consistently boost their performance.  ( 2 min )
    KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
    arXiv:2403.03101v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in terms of planning hallucinations mitigation. Code is available in https://github.com/zjunlp/KnowAgent.  ( 2 min )
    On a Neural Implementation of Brenier's Polar Factorization
    arXiv:2403.03071v1 Announce Type: cross Abstract: In 1991, Brenier proved a theorem that generalizes the $QR$ decomposition for square matrices -- factored as PSD $\times$ unitary -- to any vector field $F:\mathbb{R}^d\rightarrow \mathbb{R}^d$. The theorem, known as the polar factorization theorem, states that any field $F$ can be recovered as the composition of the gradient of a convex function $u$ with a measure-preserving map $M$, namely $F=\nabla u \circ M$. We propose a practical implementation of this far-reaching theoretical result, and explore possible uses within machine learning. The theorem is closely related to optimal transport (OT) theory, and we borrow from recent advances in the field of neural optimal transport to parameterize the potential $u$ as an input convex neural network. The map $M$ can be either evaluated pointwise using $u^*$, the convex conjugate of $u$, through the identity $M=\nabla u^* \circ F$, or learned as an auxiliary network. Because $M$ is, in general, not injective, we consider the additional task of estimating the ill-posed inverse map that can approximate the pre-image measure $M^{-1}$ using a stochastic generator. We illustrate possible applications of \citeauthor{Brenier1991PolarFA}'s polar factorization to non-convex optimization problems, as well as sampling of densities that are not log-concave.  ( 2 min )
    Mitigating Label Flipping Attacks in Malicious URL Detectors Using Ensemble Trees
    arXiv:2403.02995v1 Announce Type: cross Abstract: Malicious URLs provide adversarial opportunities across various industries, including transportation, healthcare, energy, and banking which could be detrimental to business operations. Consequently, the detection of these URLs is of crucial importance; however, current Machine Learning (ML) models are susceptible to backdoor attacks. These attacks involve manipulating a small percentage of training data labels, such as Label Flipping (LF), which changes benign labels to malicious ones and vice versa. This manipulation results in misclassification and leads to incorrect model behavior. Therefore, integrating defense mechanisms into the architecture of ML models becomes an imperative consideration to fortify against potential attacks. The focus of this study is on backdoor attacks in the context of URL detection using ensemble trees. By illuminating the motivations behind such attacks, highlighting the roles of attackers, and emphasizing the critical importance of effective defense strategies, this paper contributes to the ongoing efforts to fortify ML models against adversarial threats within the ML domain in network security. We propose an innovative alarm system that detects the presence of poisoned labels and a defense mechanism designed to uncover the original class labels with the aim of mitigating backdoor attacks on ensemble tree classifiers. We conducted a case study using the Alexa and Phishing Site URL datasets and showed that LF attacks can be addressed using our proposed defense mechanism. Our experimental results prove that the LF attack achieved an Attack Success Rate (ASR) between 50-65% within 2-5%, and the innovative defense method successfully detected poisoned labels with an accuracy of up to 100%.  ( 3 min )
    Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity
    arXiv:2403.02944v1 Announce Type: cross Abstract: Recent advances in text-guided image compression have shown great potential to enhance the perceptual quality of reconstructed images. These methods, however, tend to have significantly degraded pixel-wise fidelity, limiting their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly by text-adaptive encoding and training with joint image-text loss. By doing so, we avoid decoding based on text-guided generative models -- known for high generative diversity -- and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method can achieve high pixel-level and perceptual quality, with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with some room for even more improvements when we use more carefully generated captions.  ( 2 min )
    SAFFIRA: a Framework for Assessing the Reliability of Systolic-Array-Based DNN Accelerators
    arXiv:2403.02946v1 Announce Type: cross Abstract: Systolic array has emerged as a prominent architecture for Deep Neural Network (DNN) hardware accelerators, providing high-throughput and low-latency performance essential for deploying DNNs across diverse applications. However, when used in safety-critical applications, reliability assessment is mandatory to guarantee the correct behavior of DNN accelerators. While fault injection stands out as a well-established practical and robust method for reliability assessment, it is still a very time-consuming process. This paper addresses the time efficiency issue by introducing a novel hierarchical software-based hardware-aware fault injection strategy tailored for systolic array-based DNN accelerators.  ( 2 min )
    AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models
    arXiv:2403.02938v1 Announce Type: cross Abstract: Since humans can listen to audio and watch videos at faster speeds than actually observed, we often listen to or watch these pieces of content at higher playback speeds to increase the time efficiency of content comprehension. To further utilize this capability, systems that automatically adjust the playback speed according to the user's condition and the type of content to assist in more efficient comprehension of time-series content have been developed. However, there is still room for these systems to further extend human speed-listening ability by generating speech with playback speed optimized for even finer time units and providing it to humans. In this study, we determine whether humans can hear the optimized speech and propose a system that automatically adjusts playback speed at units as small as phonemes while ensuring speech intelligibility. The system uses the speech recognizer score as a proxy for how well a human can hear a certain unit of speech and maximizes the speech playback speed to the extent that a human can hear. This method can be used to produce fast but intelligible speech. In the evaluation experiment, we compared the speech played back at a constant fast speed and the flexibly speed-up speech generated by the proposed method in a blind test and confirmed that the proposed method produced speech that was easier to listen to.  ( 3 min )
    Mirror Descent Algorithms with Nearly Dimension-Independent Rates for Differentially-Private Stochastic Saddle-Point Problems
    arXiv:2403.02912v1 Announce Type: cross Abstract: We study the problem of differentially-private (DP) stochastic (convex-concave) saddle-points in the polyhedral setting. We propose $(\varepsilon, \delta)$-DP algorithms based on stochastic mirror descent that attain nearly dimension-independent convergence rates for the expected duality gap, a type of guarantee that was known before only for bilinear objectives. For convex-concave and first-order-smooth stochastic objectives, our algorithms attain a rate of $\sqrt{\log(d)/n} + (\log(d)^{3/2}/[n\varepsilon])^{1/3}$, where $d$ is the dimension of the problem and $n$ the dataset size. Under an additional second-order-smoothness assumption, we improve the rate on the expected gap to $\sqrt{\log(d)/n} + (\log(d)^{3/2}/[n\varepsilon])^{2/5}$. Under this additional assumption, we also show, by using bias-reduced gradient estimators, that the duality gap is bounded by $\log(d)/\sqrt{n} + \log(d)/[n\varepsilon]^{1/2}$ with constant success probability. This result provides evidence of the near-optimality of the approach. Finally, we show that combining our methods with acceleration techniques from online learning leads to the first algorithm for DP Stochastic Convex Optimization in the polyhedral setting that is not based on Frank-Wolfe methods. For convex and first-order-smooth stochastic objectives, our algorithms attain an excess risk of $\sqrt{\log(d)/n} + \log(d)^{7/10}/[n\varepsilon]^{2/5}$, and when additionally assuming second-order-smoothness, we improve the rate to $\sqrt{\log(d)/n} + \log(d)/\sqrt{n\varepsilon}$. Instrumental to all of these results are various extensions of the classical Maurey Sparsification Lemma, which may be of independent interest.  ( 3 min )
    Citizen Science and Machine Learning for Research and Nature Conservation: The Case of Eurasian Lynx, Free-ranging Rodents and Insects
    arXiv:2403.02906v1 Announce Type: cross Abstract: Technology is increasingly used in Nature Reserves and National Parks around the world to support conservation efforts. Endangered species, such as the Eurasian Lynx (Lynx lynx), are monitored by a network of automatic photo traps. Yet, this method produces vast amounts of data, which needs to be prepared, analyzed and interpreted. Therefore, researchers working in this area increasingly need support to process this incoming information. One opportunity is to seek support from volunteer Citizen Scientists who can help label the data, however, it is challenging to retain their interest. Another way is to automate the process with image recognition using convolutional neural networks. During the panel, we will discuss considerations related to nature research and conservation as well as opportunities for the use of Citizen Science and Machine Learning to expedite the process of data preparation, labelling and analysis.  ( 2 min )
    Autonomous vehicle decision and control through reinforcement learning with traffic flow randomization
    arXiv:2403.02882v1 Announce Type: cross Abstract: Most of the current studies on autonomous vehicle decision-making and control tasks based on reinforcement learning are conducted in simulated environments. The training and testing of these studies are carried out under rule-based microscopic traffic flow, with little consideration of migrating them to real or near-real environments to test their performance. It may lead to a degradation in performance when the trained model is tested in more realistic traffic scenes. In this study, we propose a method to randomize the driving style and behavior of surrounding vehicles by randomizing certain parameters of the car-following model and the lane-changing model of rule-based microscopic traffic flow in SUMO. We trained policies with deep reinforcement learning algorithms under the domain randomized rule-based microscopic traffic flow in freeway and merging scenes, and then tested them separately in rule-based microscopic traffic flow and high-fidelity microscopic traffic flow. Results indicate that the policy trained under domain randomization traffic flow has significantly better success rate and calculative reward compared to the models trained under other microscopic traffic flows.  ( 2 min )
    EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
    arXiv:2403.02775v1 Announce Type: cross Abstract: Large language models (LLMs) have proven to be very superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence in this work, we explore an important question: Can we design a data-independent quantization method for LLMs to guarantee its generalization performance? In this work, we propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors: outliers in the weight and quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves comparable performance to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel so that the quantized model could be attained in a few minutes even for LLMs over 100B. To our best knowledge, we are the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting and our algorithm runs over 10 times faster than the data-dependent methods.  ( 3 min )
    Android in the Zoo: Chain-of-Action-Thought for GUI Agents
    arXiv:2403.02713v1 Announce Type: cross Abstract: Large language model (LLM) leads to a surge of autonomous GUI agents for smartphone, which completes a task triggered by natural language through predicting a sequence of actions of API. Even though the task highly relies on past actions and visual observations, existing studies typical consider little semantic information carried out by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes the description of the previous actions, the current screen, and more importantly the action thinking of what actions should be performed and the outcomes led by the chosen action. We demonstrate that, in a zero-shot setting upon an off-the-shell LLM, CoAT significantly improves the goal progress compared to standard context modeling. To further facilitate the research in this line, we construct a benchmark Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 200M model on our AitZ dataset achieves on par performance with CogAgent-Chat-18B.  ( 2 min )
    Noise misleads rotation invariant algorithms on sparse targets
    arXiv:2403.02697v1 Announce Type: cross Abstract: It is well known that the class of rotation invariant algorithms are suboptimal even for learning sparse linear problems when the number of examples is below the "dimension" of the problem. This class includes any gradient descent trained neural net with a fully-connected input layer (initialized with a rotationally symmetric distribution). The simplest sparse problem is learning a single feature out of $d$ features. In that case the classification error or regression loss grows with $1-k/n$ where $k$ is the number of examples seen. These lower bounds become vacuous when the number of examples $k$ reaches the dimension $d$. We show that when noise is added to this sparse linear problem, rotation invariant algorithms are still suboptimal after seeing $d$ or more examples. We prove this via a lower bound for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much lower upper bounds on the same problem for simple non-rotation invariant algorithms. Finally we analyze the gradient flow trajectories of many standard optimization algorithms in some simple cases and show how they veer toward or away from the sparse targets. We believe that our trajectory categorization will be useful in designing algorithms that can exploit sparse targets and our method for proving lower bounds will be crucial for analyzing other families of algorithms that admit different classes of invariances.  ( 3 min )
    DOCTOR: Dynamic On-Chip Remediation Against Temporally-Drifting Thermal Variations Toward Self-Corrected Photonic Tensor Accelerators
    arXiv:2403.02688v1 Announce Type: cross Abstract: Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads, offering unparalleled speed and energy efficiency, especially in resource-limited, latency-sensitive edge computing environments. However, the deployment of analog photonic tensor accelerators encounters reliability challenges due to hardware noises and environmental variations. While off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators with moderate, static noises, we observe a notable performance degradation over time due to temporally drifting variations, which requires a real-time, in-situ calibration mechanism. To tackle this challenging reliability issues, for the first time, we propose a lightweight dynamic on-chip remediation framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noises. The DOCTOR framework intelligently monitors the chip status using adaptive probing and performs fast in-situ training-free calibration to restore accuracy when necessary. Recognizing nonuniform spatial variation distributions across devices and tensor cores, we also propose a variation-aware architectural remapping strategy to avoid executing critical tasks on noisy devices. Extensive experiments show that our proposed framework can guarantee sustained performance under drifting variations with 34% higher accuracy and 2-3 orders-of-magnitude lower overhead compared to state-of-the-art on-chip training methods.  ( 2 min )
    Search Intenion Network for Personalized Query Auto-Completion in E-Commerce
    arXiv:2403.02609v1 Announce Type: cross Abstract: Query Auto-Completion(QAC), as an important part of the modern search engine, plays a key role in complementing user queries and helping them refine their search intentions.Today's QAC systems in real-world scenarios face two major challenges:1)intention equivocality(IE): during the user's typing process,the prefix often contains a combination of characters and subwords, which makes the current intention ambiguous and difficult to model.2)intention transfer (IT):previous works make personalized recommendations based on users' historical sequences, but ignore the search intention transfer.However, the current intention extracted from prefix may be contrary to the historical preferences.  ( 2 min )
    Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
    arXiv:2403.02579v1 Announce Type: cross Abstract: We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers, yielding simple necessary and sufficient conditions on initialization hyperparameters that ensure trainability of deep transformers. Our approach treats the evolution of the representations of $n$ tokens as they propagate through the transformer layers in terms of a discrete time dynamical system of $n$ interacting particles. We derive simple update equations for the evolving geometry of this particle system, starting from a permutation symmetric simplex. Our update equations show that without MLP layers, this system will collapse to a line, consistent with prior work on rank collapse in transformers. However, unlike prior work, our evolution equations can quantitatively track particle geometry in the additional presence of nonlinear MLP layers, and it reveals an order-chaos phase transition as a function of initialization hyperparameters, like the strength of attentional and MLP residual connections and weight variances. In the ordered phase the particles are attractive and collapse to a line, while in the chaotic phase the particles are repulsive and converge to a regular $n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent that governs departures from the edge of chaos in this particle system, and a gradient exponent that governs the rate of exponential growth or decay of backpropagated gradients. We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents at the beginning of training, and that the simultaneous vanishing of these two exponents yields a simple necessary and sufficient condition to achieve minimal test loss.  ( 3 min )
    Forecasting SEP Events During Solar Cycles 23 and 24 Using Interpretable Machine Learning
    arXiv:2403.02536v1 Announce Type: cross Abstract: Prediction of the Solar Energetic Particle (SEP) events garner increasing interest as space missions extend beyond Earth's protective magnetosphere. These events, which are, in most cases, products of magnetic reconnection-driven processes during solar flares or fast coronal-mass-ejection-driven shock waves, pose significant radiation hazards to aviation, space-based electronics, and particularly, space exploration. In this work, we utilize the recently developed dataset that combines the Solar Dynamics Observatory/Helioseismic and Magnetic Imager's (SDO/HMI) Space weather HMI Active Region Patches (SHARP) and the Solar and Heliospheric Observatory/Michelson Doppler Imager's (SoHO/MDI) Space Weather MDI Active Region Patches (SMARP). We employ a suite of machine learning strategies, including Support Vector Machines (SVM) and regression models, to evaluate the predictive potential of this new data product for a forecast of post-solar flare SEP events. Our study indicates that despite the augmented volume of data, the prediction accuracy reaches 0.7 +- 0.1, which aligns with but does not exceed these published benchmarks. A linear SVM model with training and testing configurations that mimic an operational setting (positive-negative imbalance) reveals a slight increase (+ 0.04 +- 0.05) in the accuracy of a 14-hour SEP forecast compared to previous studies. This outcome emphasizes the imperative for more sophisticated, physics-informed models to better understand the underlying processes leading to SEP events.  ( 3 min )
    Purpose for Open-Ended Learning Robots: A Computational Taxonomy, Definition, and Operationalisation
    arXiv:2403.02514v1 Announce Type: cross Abstract: Autonomous open-ended learning (OEL) robots are able to cumulatively acquire new skills and knowledge through direct interaction with the environment, for example relying on the guidance of intrinsic motivations and self-generated goals. OEL robots have a high relevance for applications as they can use the autonomously acquired knowledge to accomplish tasks relevant for their human users. OEL robots, however, encounter an important limitation: this may lead to the acquisition of knowledge that is not so much relevant to accomplish the users' tasks. This work analyses a possible solution to this problem that pivots on the novel concept of `purpose'. Purposes indicate what the designers and/or users want from the robot. The robot should use internal representations of purposes, called here `desires', to focus its open-ended exploration towards the acquisition of knowledge relevant to accomplish them. This work contributes to develop a computational framework on purpose in two ways. First, it formalises a framework on purpose based on a three-level motivational hierarchy involving: (a) the purposes; (b) the desires, which are domain independent; (c) specific domain dependent state-goals. Second, the work highlights key challenges highlighted by the framework such as: the `purpose-desire alignment problem', the `purpose-goal grounding problem', and the `arbitration between desires'. Overall, the approach enables OEL robots to learn in an autonomous way but also to focus on acquiring goals and skills that meet the purposes of the designers and users.  ( 3 min )
    RVRAE: A Dynamic Factor Model Based on Variational Recurrent Autoencoder for Stock Returns Prediction
    arXiv:2403.02500v1 Announce Type: cross Abstract: In recent years, the dynamic factor model has emerged as a dominant tool in economics and finance, particularly for investment strategies. This model offers improved handling of complex, nonlinear, and noisy market conditions compared to traditional static factor models. The advancement of machine learning, especially in dealing with nonlinear data, has further enhanced asset pricing methodologies. This paper introduces a groundbreaking dynamic factor model named RVRAE. This model is a probabilistic approach that addresses the temporal dependencies and noise in market data. RVRAE ingeniously combines the principles of dynamic factor modeling with the variational recurrent autoencoder (VRAE) from deep learning. A key feature of RVRAE is its use of a prior-posterior learning method. This method fine-tunes the model's learning process by seeking an optimal posterior factor model informed by future data. Notably, RVRAE is adept at risk modeling in volatile stock markets, estimating variances from latent space distributions while also predicting returns. Our empirical tests with real stock market data underscore RVRAE's superior performance compared to various established baseline methods.  ( 2 min )
    Applied Causal Inference Powered by ML and AI
    arXiv:2403.02467v1 Announce Type: cross Abstract: An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclical graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools.  ( 2 min )
    Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
    arXiv:2403.02469v1 Announce Type: cross Abstract: Medical vision-language models (VLMs) combine computer vision and natural language processing to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on models designed for medical report generation and visual question answering. We provide background on natural language processing and computer vision, explaining how techniques from both fields are integrated into VLMs to enable learning from multimodal data. Key areas we address include the exploration of medical vision-language datasets, in-depth analyses of architectures and pre-training strategies employed in recent noteworthy medical VLMs, and comprehensive discussion on evaluation metrics for assessing VLMs' performance in medical report generation and visual question answering. We also highlight current challenges and propose future directions, including enhancing clinical validity and addressing patient privacy concerns. Overall, our review summarizes recent progress in developing VLMs to harness multimodal medical data for improved healthcare applications.  ( 2 min )
    On the impact of measure pre-conditionings on general parametric ML models and transfer learning via domain adaptation
    arXiv:2403.02432v1 Announce Type: cross Abstract: We study a new technique for understanding convergence of learning agents under small modifications of data. We show that such convergence can be understood via an analogue of Fatou's lemma which yields gamma-convergence. We show it's relevance and applications in general machine learning tasks and domain adaptation transfer learning.  ( 2 min )
    Digital Twins and Civil Engineering Phases: Reorienting Adoption Strategies
    arXiv:2403.02426v1 Announce Type: cross Abstract: Digital twin (DT) technology has received immense attention over the years due to the promises it presents to various stakeholders in science and engineering. As a result, different thematic areas of DT have been explored. This is no different in specific fields such as manufacturing, automation, oil and gas, and civil engineering, leading to fragmented approaches for field-specific applications. The civil engineering industry is further disadvantaged in this regard as it relies on external techniques by other engineering fields for its DT adoption. A rising consequence of these extensions is a concentrated application of DT to the operations and maintenance phase. On another spectrum, Building Information Modeling (BIM) are pervasively utilized in the planning/design phase, and the transient nature of the construction phase remains a challenge for its DT adoption. In this paper, we present a phase-based development of DT in the Architecture, Engineering, and Construction industry. We commence by presenting succinct expositions on DT as a concept and as a service and establish a five-level scale system. Furthermore, we present separately a systematic literature review of the conventional techniques employed at each civil engineering phase. In this regard, we identified enabling technologies such as computer vision for extended sensing and the Internet of Things for reliable integration. Ultimately, we attempt to reveal DT as an important tool across the entire life cycle of civil engineering projects and nudge researchers to think more holistically in their quest for the integration of DT for civil engineering applications.  ( 3 min )
    NiNformer: A Network in Network Transformer with Token Mixing Generated Gating Function
    arXiv:2403.02411v1 Announce Type: cross Abstract: The Attention mechanism is the main component of the Transformer architecture, and since its introduction, it has led to significant advancements in Deep Learning that span many domains and multiple tasks. The Attention Mechanism was utilized in Computer Vision as the Vision Transformer ViT, and its usage has expanded into many tasks in the vision domain, such as classification, segmentation, object detection, and image generation. While this mechanism is very expressive and capable, it comes with the drawback of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data size requirements. Examples of such attempts in the vision domain are the MLP-Mixer, the Conv-Mixer, the Perciver-IO, and many more. This paper introduces a new computational block as an alternative to the standard ViT block that reduces the compute burdens by replacing the normal Attention layers with a Network in Network structure that enhances the static approach of the MLP Mixer with a dynamic system of learning an element-wise gating function by a token mixing process. Extensive experimentation shows that the proposed design provides better performance than the baseline architectures on multiple datasets applied in the image classification task of the vision domain.  ( 2 min )
    LC-Tsalis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits
    arXiv:2403.03219v1 Announce Type: new Abstract: This study considers the linear contextual bandit problem with independent and identically distributed (i.i.d.) contexts. In this problem, existing studies have proposed Best-of-Both-Worlds (BoBW) algorithms whose regrets satisfy $O(\log^2(T))$ for the number of rounds $T$ in a stochastic regime with a suboptimality gap lower-bounded by a positive constant, while satisfying $O(\sqrt{T})$ in an adversarial regime. However, the dependency on $T$ has room for improvement, and the suboptimality-gap assumption can be relaxed. For this issue, this study proposes an algorithm whose regret satisfies $O(\log(T))$ in the setting when the suboptimality gap is lower-bounded. Furthermore, we introduce a margin condition, a milder assumption on the suboptimality gap. That condition characterizes the problem difficulty linked to the suboptimality gap using a parameter $\beta \in (0, \infty]$. We then show that the algorithm's regret satisfies $O\left(\left\{\log(T)\right\}^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$. Here, $\beta= \infty$ corresponds to the case in the existing studies where a lower bound exists in the suboptimality gap, and our regret satisfies $O(\log(T))$ in that case. Our proposed algorithm is based on the Follow-The-Regularized-Leader with the Tsallis entropy and referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.  ( 2 min )
    How Well Can Transformers Emulate In-context Newton's Method?
    arXiv:2403.03183v1 Announce Type: new Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher order optimization methods, beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with only a logarithmic to the error more layers. As a by-product we demonstrate the ability of even linear attention-only Transformers in implementing a single step of Newton's iteration for matrix inversion with merely two layers. These results suggest the ability of the Transformer architecture to implement complex algorithms, beyond gradient descent.  ( 2 min )
    Preventing Reward Hacking with Occupancy Measure Regularization
    arXiv:2403.03185v1 Announce Type: new Abstract: Reward hacking occurs when an agent performs very well with respect to a "proxy" reward function (which may be hand-specified or learned), but poorly with respect to the unknown true reward. Since ensuring good alignment between the proxy and true reward is extremely difficult, one approach to prevent reward hacking is optimizing the proxy conservatively. Prior work has particularly focused on enforcing the learned policy to behave similarly to a "safe" policy by penalizing the KL divergence between their action distributions (AD). However, AD regularization doesn't always work well since a small change in action distribution at a single state can lead to potentially calamitous outcomes, while large changes might not be indicative of any dangerous activity. Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD divergence to prevent reward hacking. We theoretically establish that OM regularization can more effectively avoid large drops in true reward. Then, we empirically demonstrate in a variety of realistic environments that OM divergence is superior to AD divergence for preventing reward hacking by regularizing towards a safe policy. Furthermore, we show that occupancy measure divergence can also regularize learned policies away from reward hacking behavior. Our code and data are available at https://github.com/cassidylaidlaw/orpo  ( 2 min )
    Behavior Generation with Latent Actions
    arXiv:2403.03181v1 Announce Type: new Abstract: Generative modeling of complex behaviors from labeled datasets has been a longstanding problem in decision making. Unlike language or image generation, decision making requires modeling actions - continuous-valued vectors that are multimodal in their distribution, potentially drawn from uncurated sources, where generation errors can compound in sequential prediction. A recent class of models called Behavior Transformers (BeT) addresses this by discretizing actions using k-means clustering to capture different modes. However, k-means struggles to scale for high-dimensional action spaces or long sequences, and lacks gradient information, and thus BeT suffers in modeling long-range actions. In this work, we present Vector-Quantized Behavior Transformer (VQ-BeT), a versatile model for behavior generation that handles multimodal action prediction, conditional generation, and partial observations. VQ-BeT augments BeT by tokenizing continuous actions with a hierarchical vector quantization module. Across seven environments including simulated manipulation, autonomous driving, and robotics, VQ-BeT improves on state-of-the-art models such as BeT and Diffusion Policies. Importantly, we demonstrate VQ-BeT's improved ability to capture behavior modes while accelerating inference speed 5x over Diffusion Policies. Videos and code can be found https://sjlee.cc/vq-bet  ( 2 min )
    Deep-Learned Compression for Radio-Frequency Signal Classification
    arXiv:2403.03150v1 Announce Type: new Abstract: Next-generation cellular concepts rely on the processing of large quantities of radio-frequency (RF) samples. This includes Radio Access Networks (RAN) connecting the cellular front-end based on software defined radios (SDRs) and a framework for the AI processing of spectrum-related data. The RF data collected by the dense RAN radio units and spectrum sensors may need to be jointly processed for intelligent decision making. Moving large amounts of data to AI agents may result in significant bandwidth and latency costs. We propose a deep learned compression (DLC) model, HQARF, based on learned vector quantization (VQ), to compress the complex-valued samples of RF signals comprised of 6 modulation classes. We are assessing the effects of HQARF on the performance of an AI model trained to infer the modulation class of the RF signal. Compression of narrow-band RF samples for the training and off-the-site inference will allow for an efficient use of the bandwidth and storage for non-real-time analytics, and for a decreased delay in real-time applications. While exploring the effectiveness of the HQARF signal reconstructions in modulation classification tasks, we highlight the DLC optimization space and some open problems related to the training of the VQ embedded in HQARF.  ( 2 min )
    Emergent Equivariance in Deep Ensembles
    arXiv:2403.03103v1 Announce Type: new Abstract: We demonstrate that deep ensembles are secretly equivariant models. More precisely, we show that deep ensembles become equivariant for all inputs and at all training times by simply using data augmentation. Crucially, equivariance holds off-manifold and for any architecture in the infinite width limit. The equivariance is emergent in the sense that predictions of individual ensemble members are not equivariant but their collective prediction is. Neural tangent kernel theory is used to derive this result and we verify our theoretical insights using detailed numerical experiments.  ( 2 min )
    Recall-Oriented Continual Learning with Generative Adversarial Meta-Model
    arXiv:2403.03082v1 Announce Type: new Abstract: The stability-plasticity dilemma is a major challenge in continual learning, as it involves balancing the conflicting objectives of maintaining performance on previous tasks while learning new tasks. In this paper, we propose the recall-oriented continual learning framework to address this challenge. Inspired by the human brain's ability to separate the mechanisms responsible for stability and plasticity, our framework consists of a two-level architecture where an inference network effectively acquires new knowledge and a generative network recalls past knowledge when necessary. In particular, to maximize the stability of past knowledge, we investigate the complexity of knowledge depending on different representations, and thereby introducing generative adversarial meta-model (GAMM) that incrementally learns task-specific parameters instead of input data samples of the task. Through our experiments, we show that our framework not only effectively learns new knowledge without any disruption but also achieves high stability of previous knowledge in both task-aware and task-agnostic learning scenarios. Our code is available at: https://github.com/bigdata-inha/recall-oriented-cl-framework.  ( 2 min )
    Improving Variational Autoencoder Estimation from Incomplete Data with Mixture Variational Families
    arXiv:2403.03069v1 Announce Type: new Abstract: We consider the task of estimating variational autoencoders (VAEs) when the training data is incomplete. We show that missing data increases the complexity of the model's posterior distribution over the latent variables compared to the fully-observed case. The increased complexity may adversely affect the fit of the model due to a mismatch between the variational and model posterior distributions. We introduce two strategies based on (i) finite variational-mixture and (ii) imputation-based variational-mixture distributions to address the increased posterior complexity. Through a comprehensive evaluation of the proposed approaches, we show that variational mixtures are effective at improving the accuracy of VAE estimation from incomplete data.  ( 2 min )
    CRISPR: Ensemble Model
    arXiv:2403.03018v1 Announce Type: new Abstract: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) is a gene editing technology that has revolutionized the fields of biology and medicine. However, one of the challenges of using CRISPR is predicting the on-target efficacy and off-target sensitivity of single-guide RNAs (sgRNAs). This is because most existing methods are trained on separate datasets with different genes and cells, which limits their generalizability. In this paper, we propose a novel ensemble learning method for sgRNA design that is accurate and generalizable. Our method combines the predictions of multiple machine learning models to produce a single, more robust prediction. This approach allows us to learn from a wider range of data, which improves the generalizability of our model. We evaluated our method on a benchmark dataset of sgRNA designs and found that it outperformed existing methods in terms of both accuracy and generalizability. Our results suggest that our method can be used to design sgRNAs with high sensitivity and specificity, even for new genes or cells. This could have important implications for the clinical use of CRISPR, as it would allow researchers to design more effective and safer treatments for a variety of diseases.  ( 2 min )
    Unsupervised Learning Approaches for Identifying ICU Patient Subgroups: Do Results Generalise?
    arXiv:2403.02945v1 Announce Type: new Abstract: The use of unsupervised learning to identify patient subgroups has emerged as a potentially promising direction to improve the efficiency of Intensive Care Units (ICUs). By identifying subgroups of patients with similar levels of medical resource need, ICUs could be restructured into a collection of smaller subunits, each catering to a specific group. However, it is unclear whether common patient subgroups exist across different ICUs, which would determine whether ICU restructuring could be operationalised in a standardised manner. In this paper, we tested the hypothesis that common ICU patient subgroups exist by examining whether the results from one existing study generalise to a different dataset. We extracted 16 features representing medical resource need and used consensus clustering to derive patient subgroups, replicating the previous study. We found limited similarities between our results and those of the previous study, providing evidence against the hypothesis. Our findings imply that there is significant variation between ICUs; thus, a standardised restructuring approach is unlikely to be appropriate. Instead, potential efficiency gains might be greater when the number and nature of the subunits are tailored to each ICU individually.  ( 2 min )
    From Spectra to Biophysical Insights: End-to-End Learning with a Biased Radiative Transfer Model
    arXiv:2403.02922v1 Announce Type: new Abstract: Advances in machine learning have boosted the use of Earth observation data for climate change research. Yet, the interpretability of machine-learned representations remains a challenge, particularly in understanding forests' biophysical reactions to climate change. Traditional methods in remote sensing that invert radiative transfer models (RTMs) to retrieve biophysical variables from spectral data often fail to account for biases inherent in the RTM, especially for complex forests. We propose to integrate RTMs into an auto-encoder architecture, creating an end-to-end learning approach. Our method not only corrects biases in RTMs but also outperforms traditional techniques for variable retrieval like neural network regression. Furthermore, our framework has potential generally for inverting biased physical models. The code is available on https://github.com/yihshe/ai-refined-rtm.git.  ( 2 min )
    TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax
    arXiv:2403.02920v1 Announce Type: new Abstract: The quadratic complexity of the attention mechanism represents one of the biggest hurdles for processing long sequences using Transformers. Current methods, relying on sparse representations or stateful recurrence, sacrifice token-to-token interactions, which ultimately leads to compromises in performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points where employing TaylorShift becomes more efficient than traditional attention, aligning closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift enhances memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably with the vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, we provide access to our code under https://github.com/tobna/TaylorShift.  ( 2 min )
    SOFIM: Stochastic Optimization Using Regularized Fisher Information Matrix
    arXiv:2403.02833v1 Announce Type: new Abstract: This paper introduces a new stochastic optimization method based on the regularized Fisher information matrix (FIM), named SOFIM, which can efficiently utilize the FIM to approximate the Hessian matrix for finding Newton's gradient update in large-scale stochastic optimization of machine learning models. It can be viewed as a variant of natural gradient descent (NGD), where the challenge of storing and calculating the full FIM is addressed through making use of the regularized FIM and directly finding the gradient update direction via Sherman-Morrison matrix inversion. Additionally, like the popular Adam method, SOFIM uses the first moment of the gradient to address the issue of non-stationary objectives across mini-batches due to heterogeneous data. The utilization of the regularized FIM and Sherman-Morrison matrix inversion leads to the improved convergence rate with the same space and time complexities as stochastic gradient descent (SGD) with momentum. The extensive experiments on training deep learning models on several benchmark image classification datasets demonstrate that the proposed SOFIM outperforms SGD with momentum and several state-of-the-art Newton optimization methods, such as Nystrom-SGD, L-BFGS, and AdaHessian, in term of the convergence speed for achieving the pre-specified objectives of training and test losses as well as test accuracy.  ( 2 min )
    FLGuard: Byzantine-Robust Federated Learning via Ensemble of Contrastive Models
    arXiv:2403.02846v1 Announce Type: new Abstract: Federated Learning (FL) thrives in training a global model with numerous clients by only sharing the parameters of their local models trained with their private training datasets. Therefore, without revealing the private dataset, the clients can obtain a deep learning (DL) model with high performance. However, recent research proposed poisoning attacks that cause a catastrophic loss in the accuracy of the global model when adversaries, posed as benign clients, are present in a group of clients. Therefore, recent studies suggested byzantine-robust FL methods that allow the server to train an accurate global model even with the adversaries present in the system. However, many existing methods require the knowledge of the number of malicious clients or the auxiliary (clean) dataset or the effectiveness reportedly decreased hugely when the private dataset was non-independently and identically distributed (non-IID). In this work, we propose FLGuard, a novel byzantine-robust FL method that detects malicious clients and discards malicious local updates by utilizing the contrastive learning technique, which showed a tremendous improvement as a self-supervised learning method. With contrastive models, we design FLGuard as an ensemble scheme to maximize the defensive capability. We evaluate FLGuard extensively under various poisoning attacks and compare the accuracy of the global model with existing byzantine-robust FL methods. FLGuard outperforms the state-of-the-art defense methods in most cases and shows drastic improvement, especially in non-IID settings. https://github.com/201younghanlee/FLGuard  ( 3 min )
    An Adaptive Hydropower Management Approach for Downstream Ecosystem Preservation
    arXiv:2403.02821v1 Announce Type: new Abstract: Hydropower plants play a pivotal role in advancing clean and sustainable energy production, contributing significantly to the global transition towards renewable energy sources. However, hydropower plants are currently perceived both positively as sources of renewable energy and negatively as disruptors of ecosystems. In this work, we highlight the overlooked potential of using hydropower plant as protectors of ecosystems by using adaptive ecological discharges. To advocate for this perspective, we propose using a neural network to predict the minimum ecological discharge value at each desired time. Additionally, we present a novel framework that seamlessly integrates it into hydropower management software, taking advantage of the well-established approach of using traditional constrained optimisation algorithms. This novel approach not only protects the ecosystems from climate change but also contributes to potentially increase the electricity production.  ( 2 min )
    InjectTST: A Transformer Method of Injecting Global Information into Independent Channels for Long Time Series Forecasting
    arXiv:2403.02814v1 Announce Type: new Abstract: Transformer has become one of the most popular architectures for multivariate time series (MTS) forecasting. Recent Transformer-based MTS models generally prefer channel-independent structures with the observation that channel independence can alleviate noise and distribution drift issues, leading to more robustness. Nevertheless, it is essential to note that channel dependency remains an inherent characteristic of MTS, carrying valuable information. Designing a model that incorporates merits of both channel-independent and channel-mixing structures is a key to further improvement of MTS forecasting, which poses a challenging conundrum. To address the problem, an injection method for global information into channel-independent Transformer, InjectTST, is proposed in this paper. Instead of designing a channel-mixing model directly, we retain the channel-independent backbone and gradually inject global information into individual channels in a selective way. A channel identifier, a global mixing module and a self-contextual attention module are devised in InjectTST. The channel identifier can help Transformer distinguish channels for better representation. The global mixing module produces cross-channel global information. Through the self-contextual attention module, the independent channels can selectively concentrate on useful global information without robustness degradation, and channel mixing is achieved implicitly. Experiments indicate that InjectTST can achieve stable improvement compared with state-of-the-art models.  ( 3 min )
    Dynamic Gaussian Graph Operator: Learning parametric partial differential equations in arbitrary discrete mechanics problems
    arXiv:2403.02810v1 Announce Type: new Abstract: Deep learning methods have access to be employed for solving physical systems governed by parametric partial differential equations (PDEs) due to massive scientific data. It has been refined to operator learning that focuses on learning non-linear mapping between infinite-dimensional function spaces, offering interface from observations to solutions. However, state-of-the-art neural operators are limited to constant and uniform discretization, thereby leading to deficiency in generalization on arbitrary discretization schemes for computational domain. In this work, we propose a novel operator learning algorithm, referred to as Dynamic Gaussian Graph Operator (DGGO) that expands neural operators to learning parametric PDEs in arbitrary discrete mechanics problems. The Dynamic Gaussian Graph (DGG) kernel learns to map the observation vectors defined in general Euclidean space to metric vectors defined in high-dimensional uniform metric space. The DGG integral kernel is parameterized by Gaussian kernel weighted Riemann sum approximating and using dynamic message passing graph to depict the interrelation within the integral term. Fourier Neural Operator is selected to localize the metric vectors on spatial and frequency domains. Metric vectors are regarded as located on latent uniform domain, wherein spatial and spectral transformation offer highly regular constraints on solution space. The efficiency and robustness of DGGO are validated by applying it to solve numerical arbitrary discrete mechanics problems in comparison with mainstream neural operators. Ablation experiments are implemented to demonstrate the effectiveness of spatial transformation in the DGG kernel. The proposed method is utilized to forecast stress field of hyper-elastic material with geometrically variable void as engineering application.  ( 3 min )
    Data Collaboration Analysis Over Matrix Manifolds
    arXiv:2403.02780v1 Announce Type: new Abstract: The effectiveness of machine learning (ML) algorithms is deeply intertwined with the quality and diversity of their training datasets. Improved datasets, marked by superior quality, enhance the predictive accuracy and broaden the applicability of models across varied scenarios. Researchers often integrate data from multiple sources to mitigate biases and limitations of single-source datasets. However, this extensive data amalgamation raises significant ethical concerns, particularly regarding user privacy and the risk of unauthorized data disclosure. Various global legislative frameworks have been established to address these privacy issues. While crucial for safeguarding privacy, these regulations can complicate the practical deployment of ML technologies. Privacy-Preserving Machine Learning (PPML) addresses this challenge by safeguarding sensitive information, from health records to geolocation data, while enabling the secure use of this data in developing robust ML models. Within this realm, the Non-Readily Identifiable Data Collaboration (NRI-DC) framework emerges as an innovative approach, potentially resolving the 'data island' issue among institutions through non-iterative communication and robust privacy protections. However, in its current state, the NRI-DC framework faces model performance instability due to theoretical unsteadiness in creating collaboration functions. This study establishes a rigorous theoretical foundation for these collaboration functions and introduces new formulations through optimization problems on matrix manifolds and efficient solutions. Empirical analyses demonstrate that the proposed approach, particularly the formulation over orthogonal matrix manifolds, significantly enhances performance, maintaining consistency and efficiency without compromising communication efficiency or privacy protections.  ( 2 min )
    A Zero-Shot Reinforcement Learning Strategy for Autonomous Guidewire Navigation
    arXiv:2403.02777v1 Announce Type: new Abstract: Purpose: The treatment of cardiovascular diseases requires complex and challenging navigation of a guidewire and catheter. This often leads to lengthy interventions during which the patient and clinician are exposed to X-ray radiation. Deep Reinforcement Learning approaches have shown promise in learning this task and may be the key to automating catheter navigation during robotized interventions. Yet, existing training methods show limited capabilities at generalizing to unseen vascular anatomies, requiring to be retrained each time the geometry changes. Methods: In this paper, we propose a zero-shot learning strategy for three-dimensional autonomous endovascular navigation. Using a very small training set of branching patterns, our reinforcement learning algorithm is able to learn a control that can then be applied to unseen vascular anatomies without retraining. Results: We demonstrate our method on 4 different vascular systems, with an average success rate of 95% at reaching random targets on these anatomies. Our strategy is also computationally efficient, allowing the training of our controller to be performed in only 2 hours. Conclusion: Our training method proved its ability to navigate unseen geometries with different characteristics, thanks to a nearly shape-invariant observation space.  ( 2 min )
    Rehabilitation Exercise Quality Assessment through Supervised Contrastive Learning with Hard and Soft Negatives
    arXiv:2403.02772v1 Announce Type: new Abstract: Exercise-based rehabilitation programs have proven to be effective in enhancing the quality of life and reducing mortality and rehospitalization rates. AI-driven virtual rehabilitation, which allows patients to independently complete exercises at home, utilizes AI algorithms to analyze exercise data, providing feedback to patients and updating clinicians on their progress. These programs commonly prescribe a variety of exercise types, leading to a distinct challenge in rehabilitation exercise assessment datasets: while abundant in overall training samples, these datasets often have a limited number of samples for each individual exercise type. This disparity hampers the ability of existing approaches to train generalizable models with such a small sample size per exercise. Addressing this issue, our paper introduces a novel supervised contrastive learning framework with hard and soft negative samples that effectively utilizes the entire dataset to train a single model applicable to all exercise types. This model, with a Spatial-Temporal Graph Convolutional Network (ST-GCN) architecture, demonstrated enhanced generalizability across exercises and a decrease in overall complexity. Through extensive experiments on three publicly available rehabilitation exercise assessment datasets, the University of Idaho-Physical Rehabilitation Movement Data (UI-PRMD), IntelliRehabDS (IRDS), and KInematic assessment of MOvement and clinical scores for remote monitoring of physical REhabilitation (KIMORE), our method has shown to surpass existing methods, setting a new benchmark in rehabilitation exercise assessment accuracy.  ( 3 min )
    Neural Fractional Differential Equations
    arXiv:2403.02737v1 Announce Type: new Abstract: Fractional Differential Equations (FDEs) are essential tools for modelling complex systems in science and engineering. They extend the traditional concepts of differentiation and integration to non-integer orders, enabling a more precise representation of processes characterised by non-local and memory-dependent behaviours. This property is useful in systems where variables do not respond to changes instantaneously, but instead exhibit a strong memory of past interactions. Having this in mind, and drawing inspiration from Neural Ordinary Differential Equations (Neural ODEs), we propose the Neural FDE, a novel deep neural network architecture that adjusts a FDE to the dynamics of data. This work provides a comprehensive overview of the numerical method employed in Neural FDEs and the Neural FDE architecture. The numerical outcomes suggest that, despite being more computationally demanding, the Neural FDE may outperform the Neural ODE in modelling systems with memory or dependencies on past states, and it can effectively be applied to learn more intricate dynamical systems.  ( 2 min )
    Privacy-Aware Semantic Cache for Large Language Models
    arXiv:2403.02694v1 Announce Type: new Abstract: Large Language Models (LLMs) like ChatGPT, Google Bard, Claude, and Llama 2 have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters and inference on these models also demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries. However, existing caching methods are incapable of finding semantic similarities among LLM queries, leading to unacceptable false hit-and-miss rates. This paper introduces MeanCache, a semantic cache for LLMs that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user's semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model in a distributed manner across numerous users without violating privacy. By placing a local cache in each user's device and using FL, MeanCache reduces the latency and costs and enhances model performance, resulting in lower cache false hit rates. Our experiments, benchmarked against the GPTCache, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision during semantic cache hit-and-miss decisions. Furthermore, MeanCache reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%, while still surpassing GPTCache.  ( 3 min )
    Controllable Prompt Tuning For Balancing Group Distributional Robustness
    arXiv:2403.02695v1 Announce Type: new Abstract: Models trained on data composed of different groups or domains can suffer from severe performance degradation under distribution shifts. While recent methods have largely focused on optimizing the worst-group objective, this often comes at the expense of good performance on other groups. To address this problem, we introduce an optimization scheme to achieve good performance across groups and find a good solution for all without severely sacrificing performance on any of them. However, directly applying such optimization involves updating the parameters of the entire network, making it both computationally expensive and challenging. Thus, we introduce Controllable Prompt Tuning (CPT), which couples our approach with prompt-tuning techniques. On spurious correlation benchmarks, our procedures achieve state-of-the-art results across both transformer and non-transformer architectures, as well as unimodal and multimodal data, while requiring only 0.4% tunable parameters.  ( 2 min )
    SGD with Partial Hessian for Deep Neural Networks Optimization
    arXiv:2403.02681v1 Announce Type: new Abstract: Due to the effectiveness of second-order algorithms in solving classical optimization problems, designing second-order optimizers to train deep neural networks (DNNs) has attracted much research interest in recent years. However, because of the very high dimension of intermediate features in DNNs, it is difficult to directly compute and store the Hessian matrix for network optimization. Most of the previous second-order methods approximate the Hessian information imprecisely, resulting in unstable performance. In this work, we propose a compound optimizer, which is a combination of a second-order optimizer with a precise partial Hessian matrix for updating channel-wise parameters and the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters. We show that the associated Hessian matrices of channel-wise parameters are diagonal and can be extracted directly and precisely from Hessian-free methods. The proposed method, namely SGD with Partial Hessian (SGD-PH), inherits the advantages of both first-order and second-order optimizers. Compared with first-order optimizers, it adopts a certain amount of information from the Hessian matrix to assist optimization, while compared with the existing second-order optimizers, it keeps the good generalization performance of first-order optimizers. Experiments on image classification tasks demonstrate the effectiveness of our proposed optimizer SGD-PH. The code is publicly available at \url{https://github.com/myingysun/SGDPH}.  ( 2 min )
    Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad
    arXiv:2403.02648v1 Announce Type: new Abstract: Adaptive methods are extremely popular in machine learning as they make learning rate tuning less expensive. This paper introduces a novel optimization algorithm named KATE, which presents a scale-invariant adaptation of the well-known AdaGrad algorithm. We prove the scale-invariance of KATE for the case of Generalized Linear Models. Moreover, for general smooth non-convex problems, we establish a convergence rate of $O \left(\frac{\log T}{\sqrt{T}} \right)$ for KATE, matching the best-known ones for AdaGrad and Adam. We also compare KATE to other state-of-the-art adaptive algorithms Adam and AdaGrad in numerical experiments with different problems, including complex machine learning tasks like image classification and text classification on real data. The results indicate that KATE consistently outperforms AdaGrad and matches/surpasses the performance of Adam in all considered scenarios.  ( 2 min )
    World Models for Autonomous Driving: An Initial Survey
    arXiv:2403.02622v1 Announce Type: new Abstract: In the rapidly evolving landscape of autonomous driving, the capability to accurately predict future events and assess their implications is paramount for both safety and efficiency, critically aiding the decision-making process. World models have emerged as a transformative approach, enabling autonomous driving systems to synthesize and interpret vast amounts of sensor data, thereby predicting potential future scenarios and compensating for information gaps. This paper provides an initial review of the current state and prospective advancements of world models in autonomous driving, spanning their theoretical underpinnings, practical applications, and the ongoing research efforts aimed at overcoming existing limitations. Highlighting the significant role of world models in advancing autonomous driving technologies, this survey aspires to serve as a foundational reference for the research community, facilitating swift access to and comprehension of this burgeoning field, and inspiring continued innovation and exploration.  ( 2 min )
    Pareto-Optimal Estimation and Policy Learning on Short-term and Long-term Treatment Effects
    arXiv:2403.02624v1 Announce Type: new Abstract: This paper focuses on developing Pareto-optimal estimation and policy learning to identify the most effective treatment that maximizes the total reward from both short-term and long-term effects, which might conflict with each other. For example, a higher dosage of medication might increase the speed of a patient's recovery (short-term) but could also result in severe long-term side effects. Although recent works have investigated the problems about short-term or long-term effects or the both, how to trade-off between them to achieve optimal treatment remains an open challenge. Moreover, when multiple objectives are directly estimated using conventional causal representation learning, the optimization directions among various tasks can conflict as well. In this paper, we systematically investigate these issues and introduce a Pareto-Efficient algorithm, comprising Pareto-Optimal Estimation (POE) and Pareto-Optimal Policy Learning (POPL), to tackle them. POE incorporates a continuous Pareto module with representation balancing, enhancing estimation efficiency across multiple tasks. As for POPL, it involves deriving short-term and long-term outcomes linked with various treatment levels, facilitating an exploration of the Pareto frontier emanating from these outcomes. Results on both the synthetic and real-world datasets demonstrate the superiority of our method.  ( 2 min )
    Training Machine Learning models at the Edge: A Survey
    arXiv:2403.02619v1 Announce Type: new Abstract: Edge Computing (EC) has gained significant traction in recent years, promising enhanced efficiency by integrating Artificial Intelligence (AI) capabilities at the edge. While the focus has primarily been on the deployment and inference of Machine Learning (ML) models at the edge, the training aspect remains less explored. This survey delves into Edge Learning (EL), specifically the optimization of ML model training at the edge. The objective is to comprehensively explore diverse approaches and methodologies in EL, synthesize existing knowledge, identify challenges, and highlight future trends. Utilizing Scopus' advanced search, relevant literature on EL was identified, revealing a concentration of research efforts in distributed learning methods, particularly Federated Learning (FL). This survey further provides a guideline for comparing techniques used to optimize ML for edge learning, along with an exploration of different frameworks, libraries, and simulation tools available for EL. In doing so, the paper contributes to a holistic understanding of the current landscape and future directions in the intersection of edge computing and machine learning, paving the way for informed comparisons between optimization methods and techniques designed for edge learning.  ( 2 min )
    Unsupervised Spatio-Temporal State Estimation for Fine-grained Adaptive Anomaly Diagnosis of Industrial Cyber-physical Systems
    arXiv:2403.02616v1 Announce Type: new Abstract: Accurate detection and diagnosis of abnormal behaviors such as network attacks from multivariate time series (MTS) are crucial for ensuring the stable and effective operation of industrial cyber-physical systems (CPS). However, existing researches pay little attention to the logical dependencies among system working states, and have difficulties in explaining the evolution mechanisms of abnormal signals. To reveal the spatio-temporal association relationships and evolution mechanisms of the working states of industrial CPS, this paper proposes a fine-grained adaptive anomaly diagnosis method (i.e. MAD-Transformer) to identify and diagnose anomalies in MTS. MAD-Transformer first constructs a temporal state matrix to characterize and estimate the change patterns of the system states in the temporal dimension. Then, to better locate the anomalies, a spatial state matrix is also constructed to capture the inter-sensor state correlation relationships within the system. Subsequently, based on these two types of state matrices, a three-branch structure of series-temporal-spatial attention module is designed to simultaneously capture the series, temporal, and space dependencies among MTS. Afterwards, three associated alignment loss functions and a reconstruction loss are constructed to jointly optimize the model. Finally, anomalies are determined and diagnosed by comparing the residual matrices with the original matrices. We conducted comparative experiments on five publicly datasets spanning three application domains (service monitoring, spatial and earth exploration, and water treatment), along with a petroleum refining simulation dataset collected by ourselves. The results demonstrate that MAD-Transformer can adaptively detect fine-grained anomalies with short duration, and outperforms the state-of-the-art baselines in terms of noise robustness and localization performance.  ( 3 min )
    TESTAM: A Time-Enhanced Spatio-Temporal Attention Model with Mixture of Experts
    arXiv:2403.02600v1 Announce Type: new Abstract: Accurate traffic forecasting is challenging due to the complex dependency on road networks, various types of roads, and the abrupt speed change due to the events. Recent works mainly focus on dynamic spatial modeling with adaptive graph embedding or graph attention having less consideration for temporal characteristics and in-situ modeling. In this paper, we propose a novel deep learning model named TESTAM, which individually models recurring and non-recurring traffic patterns by a mixture-of-experts model with three experts on temporal modeling, spatio-temporal modeling with static graph, and dynamic spatio-temporal dependency modeling with dynamic graph. By introducing different experts and properly routing them, TESTAM could better model various circumstances, including spatially isolated nodes, highly related nodes, and recurring and non-recurring events. For the proper routing, we reformulate a gating problem into a classification problem with pseudo labels. Experimental results on three public traffic network datasets, METR-LA, PEMS-BAY, and EXPY-TKY, demonstrate that TESTAM achieves a better indication and modeling of recurring and non-recurring traffic. We published the official code at https://github.com/HyunWookL/TESTAM  ( 2 min )
    Pooling Image Datasets With Multiple Covariate Shift and Imbalance
    arXiv:2403.02598v1 Announce Type: new Abstract: Small sample sizes are common in many disciplines, which necessitates pooling roughly similar datasets across multiple institutions to study weak but relevant associations between images and disease outcomes. Such data often manifest shift/imbalance in covariates (i.e., secondary non-imaging data). Controlling for such nuisance variables is common within standard statistical analysis, but the ideas do not directly apply to overparameterized models. Consequently, recent work has shown how strategies from invariant representation learning provides a meaningful starting point, but the current repertoire of methods is limited to accounting for shifts/imbalances in just a couple of covariates at a time. In this paper, we show how viewing this problem from the perspective of Category theory provides a simple and effective solution that completely avoids elaborate multi-stage training pipelines that would otherwise be needed. We show the effectiveness of this approach via extensive experiments on real datasets. Further, we discuss how this style of formulation offers a unified perspective on at least 5+ distinct problem settings, from self-supervised learning to matching problems in 3D reconstruction.  ( 2 min )
    Learning-augmented Online Minimization of Age of Information and Transmission Costs
    arXiv:2403.02573v1 Announce Type: new Abstract: We consider a discrete-time system where a resource-constrained source (e.g., a small sensor) transmits its time-sensitive data to a destination over a time-varying wireless channel. Each transmission incurs a fixed transmission cost (e.g., energy cost), and no transmission results in a staleness cost represented by the Age-of-Information. The source must balance the tradeoff between transmission and staleness costs. To address this challenge, we develop a robust online algorithm to minimize the sum of transmission and staleness costs, ensuring a worst-case performance guarantee. While online algorithms are robust, they are usually overly conservative and may have a poor average performance in typical scenarios. In contrast, by leveraging historical data and prediction models, machine learning (ML) algorithms perform well in average cases. However, they typically lack worst-case performance guarantees. To achieve the best of both worlds, we design a learning-augmented online algorithm that exhibits two desired properties: (i) consistency: closely approximating the optimal offline algorithm when the ML prediction is accurate and trusted; (ii) robustness: ensuring worst-case performance guarantee even ML predictions are inaccurate. Finally, we perform extensive simulations to show that our online algorithm performs well empirically and that our learning-augmented algorithm achieves both consistency and robustness.  ( 2 min )
    Wukong: Towards a Scaling Law for Large-Scale Recommendation
    arXiv:2403.02545v1 Announce Type: new Abstract: Scaling laws play an instrumental role in the sustainable improvement in model quality. Unfortunately, recommendation models to date do not exhibit such laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly more complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse, any-order of interactions simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models quality-wise. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models, while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 Gflop or equivalently up to GPT-3/LLaMa-2 scale of total training compute, where prior arts fall short.  ( 2 min )
    DPAdapter: Improving Differentially Private Deep Learning through Noise Tolerance Pre-training
    arXiv:2403.02571v1 Announce Type: new Abstract: Recent developments have underscored the critical role of \textit{differential privacy} (DP) in safeguarding individual data for training machine learning models. However, integrating DP oftentimes incurs significant model performance degradation due to the perturbation introduced into the training process, presenting a formidable challenge in the {differentially private machine learning} (DPML) field. To this end, several mitigative efforts have been proposed, typically revolving around formulating new DPML algorithms or relaxing DP definitions to harmonize with distinct contexts. In spite of these initiatives, the diminishment induced by DP on models, particularly large-scale models, remains substantial and thus, necessitates an innovative solution that adeptly circumnavigates the consequential impairment of model utility. In response, we introduce DPAdapter, a pioneering technique designed to amplify the model performance of DPML algorithms by enhancing parameter robustness. The fundamental intuition behind this strategy is that models with robust parameters are inherently more resistant to the noise introduced by DP, thereby retaining better performance despite the perturbations. DPAdapter modifies and enhances the sharpness-aware minimization (SAM) technique, utilizing a two-batch strategy to provide a more accurate perturbation estimate and an efficient gradient descent, thereby improving parameter robustness against noise. Notably, DPAdapter can act as a plug-and-play component and be combined with existing DPML algorithms to further improve their performance. Our experiments show that DPAdapter vastly enhances state-of-the-art DPML algorithms, increasing average accuracy from 72.92\% to 77.09\% with a privacy budget of $\epsilon=4$.  ( 3 min )
    Towards Foundation Time Series Model: To Synthesize Or Not To Synthesize?
    arXiv:2403.02534v1 Announce Type: new Abstract: The industry is rich in cases when we are required to make forecasting for large amounts of time series at once. However, we might be in a situation where we can not afford to train a separate model for each of them. Such issue in time series modeling remains without due attention. The remedy for this setting is the establishment of a foundation model. Such a model is expected to work in zero-shot and few-shot regimes. However, what should we take as a training dataset for such kind of model? Witnessing the benefits from the enrichment of NLP datasets with artificially-generated data, we might want to adopt their experience for time series. In contrast to natural language, the process of generation of synthetic time series data is even more favorable because it provides full control of series patterns, time horizons, and number of samples. In this work, we consider the essential question if it is advantageous to train a foundation model on synthetic data or it is better to utilize only a limited number of real-life examples. Our experiments are conducted only for regular time series and speak in favor of leveraging solely the real time series. Moreover, the choice of the proper source dataset strongly influences the performance during inference. When provided access even to a limited quantity of short time series data, employing it within a supervised framework yields more favorable results than training on a larger volume of synthetic data. The code for our experiments is publicly available on Github \url{https://github.com/sb-ai-lab/synthesize_or_not}.  ( 3 min )
    Encodings for Prediction-based Neural Architecture Search
    arXiv:2403.02484v1 Announce Type: new Abstract: Predictor-based methods have substantially enhanced Neural Architecture Search (NAS) optimization. The efficacy of these predictors is largely influenced by the method of encoding neural network architectures. While traditional encodings used an adjacency matrix describing the graph structure of a neural network, novel encodings embrace a variety of approaches from unsupervised pretraining of latent representations to vectors of zero-cost proxies. In this paper, we categorize and investigate neural encodings from three main types: structural, learned, and score-based. Furthermore, we extend these encodings and introduce \textit{unified encodings}, that extend NAS predictors to multiple search spaces. Our analysis draws from experiments conducted on over 1.5 million neural network architectures on NAS spaces such as NASBench-101 (NB101), NB201, NB301, Network Design Spaces (NDS), and TransNASBench-101. Building on our study, we present our predictor \textbf{FLAN}: \textbf{Fl}ow \textbf{A}ttention for \textbf{N}AS. FLAN integrates critical insights on predictor design, transfer learning, and \textit{unified encodings} to enable more than an order of magnitude cost reduction for training NAS accuracy predictors. Our implementation and encodings for all neural networks are open-sourced at \href{https://github.com/abdelfattah-lab/flan_nas}{https://github.com/abdelfattah-lab/flan\_nas}.  ( 2 min )
    HeAR -- Health Acoustic Representations
    arXiv:2403.02522v1 Announce Type: new Abstract: Health acoustic sounds such as coughs and breaths are known to contain useful health signals with significant potential for monitoring health and disease, yet are underexplored in the medical machine learning community. The existing deep learning systems for health acoustics are often narrowly trained and evaluated on a single task, which is limited by data and may hinder generalization to other tasks. To mitigate these gaps, we develop HeAR, a scalable self-supervised learning-based deep learning system using masked autoencoders trained on a large dataset of 313 million two-second long audio clips. Through linear probes, we establish HeAR as a state-of-the-art health audio embedding model on a benchmark of 33 health acoustic tasks across 6 datasets. By introducing this work, we hope to enable and accelerate further health acoustics research.  ( 2 min )
    Enhancing LLM Safety via Constrained Direct Preference Optimization
    arXiv:2403.02475v1 Announce Type: new Abstract: The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach. Warning: This paper contains example data that may be offensive or harmful.  ( 2 min )
    Root Causing Prediction Anomalies Using Explainable AI
    arXiv:2403.02439v1 Announce Type: new Abstract: This paper presents a novel application of explainable AI (XAI) for root-causing performance degradation in machine learning models that learn continuously from user engagement data. In such systems a single feature corruption can cause cascading feature, label and concept drifts. We have successfully applied this technique to improve the reliability of models used in personalized advertising. Performance degradation in such systems manifest as prediction anomalies in the models. These models are typically trained continuously using features that are produced by hundreds of real time data processing pipelines or derived from other upstream models. A failure in any of these pipelines or an instability in any of the upstream models can cause feature corruption, causing the model's predicted output to deviate from the actual output and the training data to become corrupted. The causal relationship between the features and the predicted output is complex, and root-causing is challenging due to the scale and dynamism of the system. We demonstrate how temporal shifts in the global feature importance distribution can effectively isolate the cause of a prediction anomaly, with better recall than model-to-feature correlation methods. The technique appears to be effective even when approximating the local feature importance using a simple perturbation-based method, and aggregating over a few thousand examples. We have found this technique to be a model-agnostic, cheap and effective way to monitor complex data pipelines in production and have deployed a system for continuously analyzing the global feature importance distribution of continuously trained models.  ( 3 min )
    On Latency Predictors for Neural Architecture Search
    arXiv:2403.02446v1 Announce Type: new Abstract: Efficient deployment of neural networks (NN) requires the co-optimization of accuracy and latency. For example, hardware-aware neural architecture search has been used to automatically find NN architectures that satisfy a latency constraint on a specific hardware device. Central to these search algorithms is a prediction model that is designed to provide a hardware latency estimate for a candidate NN architecture. Recent research has shown that the sample efficiency of these predictive models can be greatly improved through pre-training on some \textit{training} devices with many samples, and then transferring the predictor on the \textit{test} (target) device. Transfer learning and meta-learning methods have been used for this, but often exhibit significant performance variability. Additionally, the evaluation of existing latency predictors has been largely done on hand-crafted training/test device sets, making it difficult to ascertain design features that compose a robust and general latency predictor. To address these issues, we introduce a comprehensive suite of latency prediction tasks obtained in a principled way through automated partitioning of hardware device sets. We then design a general latency predictor to comprehensively study (1) the predictor architecture, (2) NN sample selection methods, (3) hardware device representations, and (4) NN operation encoding schemes. Building on conclusions from our study, we present an end-to-end latency predictor training strategy that outperforms existing methods on 11 out of 12 difficult latency prediction tasks, improving latency prediction by 22.5\% on average, and up to to 87.6\% on the hardest tasks. Focusing on latency prediction, our HW-Aware NAS reports a $5.8\times$ speedup in wall-clock time. Our code is available on \href{https://github.com/abdelfattah-lab/nasflat_latency}{https://github.com/abdelfattah-lab/nasflat\_latency}.  ( 3 min )
    SoK: Challenges and Opportunities in Federated Unlearning
    arXiv:2403.02437v1 Announce Type: new Abstract: Federated learning (FL), introduced in 2017, facilitates collaborative learning between non-trusting parties with no need for the parties to explicitly share their data among themselves. This allows training models on user data while respecting privacy regulations such as GDPR and CPRA. However, emerging privacy requirements may mandate model owners to be able to \emph{forget} some learned data, e.g., when requested by data owners or law enforcement. This has given birth to an active field of research called \emph{machine unlearning}. In the context of FL, many techniques developed for unlearning in centralized settings are not trivially applicable! This is due to the unique differences between centralized and distributed learning, in particular, interactivity, stochasticity, heterogeneity, and limited accessibility in FL. In response, a recent line of work has focused on developing unlearning mechanisms tailored to FL. This SoK paper aims to take a deep look at the \emph{federated unlearning} literature, with the goal of identifying research trends and challenges in this emerging field. By carefully categorizing papers published on FL unlearning (since 2020), we aim to pinpoint the unique complexities of federated unlearning, highlighting limitations on directly applying centralized unlearning methods. We compare existing federated unlearning methods regarding influence removal and performance recovery, compare their threat models and assumptions, and discuss their implications and limitations. For instance, we analyze the experimental setup of FL unlearning studies from various perspectives, including data heterogeneity and its simulation, the datasets used for demonstration, and evaluation metrics. Our work aims to offer insights and suggestions for future research on federated unlearning.  ( 3 min )
    Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
    arXiv:2403.02419v1 Announce Type: new Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Large Language Model (LLM) calls and aggregate their responses. However, there is little understanding of how the number of LLM calls -- e.g., when asking the LLM to answer each question multiple times and taking a consensus -- affects such a compound system's performance. In this paper, we initiate the study of scaling laws of compound inference systems. We analyze, theoretically and empirically, how the number of LLM calls affects the performance of one-layer Voting Inference Systems -- one of the simplest compound systems, which aggregates LLM responses via majority voting. We find empirically that across multiple language tasks, surprisingly, Voting Inference Systems' performance first increases but then decreases as a function of the number of LLM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LLM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior emerges when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LLM calls that maximizes system performance, and define a scaling law of Voting Inference Systems. Experiments show that our scaling law can predict the performance of Voting Inference Systems and find the optimal number of LLM calls to make.  ( 3 min )
    OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport
    arXiv:2403.02372v1 Announce Type: new Abstract: Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce \sys, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.  ( 2 min )
  • Open

    Estimating treatment effects from single-arm trials via latent-variable modeling
    arXiv:2311.03002v2 Announce Type: replace-cross Abstract: Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for {\em (i)} patient matching if treatment outcomes are not available for the treatment group, or for {\em (ii)} direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.  ( 2 min )
    A new computationally efficient algorithm to solve Feature Selection for Functional Data Classification in high-dimensional spaces
    arXiv:2401.05765v2 Announce Type: replace Abstract: This paper introduces a novel methodology for Feature Selection for Functional Classification, FSFC, that addresses the challenge of jointly performing feature selection and classification of functional data in scenarios with categorical responses and multivariate longitudinal features. FSFC tackles a newly defined optimization problem that integrates logistic loss and functional features to identify the most crucial variables for classification. To address the minimization procedure, we employ functional principal components and develop a new adaptive version of the Dual Augmented Lagrangian algorithm. The computational efficiency of FSFC enables handling high-dimensional scenarios where the number of features may considerably exceed the number of statistical units. Simulation experiments demonstrate that FSFC outperforms other machine learning and deep learning methods in computational time and classification accuracy. Furthermore, the FSFC feature selection capability can be leveraged to significantly reduce the problem's dimensionality and enhance the performances of other classification algorithms. The efficacy of FSFC is also demonstrated through a real data application, analyzing relationships between four chronic diseases and other health and demographic factors.  ( 2 min )
    Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery
    arXiv:2211.13715v4 Announce Type: replace Abstract: Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.  ( 2 min )
    Learning to Defer to a Population: A Meta-Learning Approach
    arXiv:2403.02683v1 Announce Type: cross Abstract: The learning to defer (L2D) framework allows autonomous systems to be safe and robust by allocating difficult decisions to a human expert. All existing work on L2D assumes that each expert is well-identified, and if any expert were to change, the system should be re-trained. In this work, we alleviate this constraint, formulating an L2D system that can cope with never-before-seen experts at test-time. We accomplish this by using meta-learning, considering both optimization- and model-based variants. Given a small context set to characterize the currently available expert, our framework can quickly adapt its deferral policy. For the model-based approach, we employ an attention mechanism that is able to look for points in the context set that are similar to a given test point, leading to an even more precise assessment of the expert's abilities. In the experiments, we validate our methods on image recognition, traffic sign detection, and skin lesion diagnosis benchmarks.  ( 2 min )
    Demolition and Reinforcement of Memories in Spin-Glass-like Neural Networks
    arXiv:2403.02537v1 Announce Type: cross Abstract: Statistical mechanics has made significant contributions to the study of biological neural systems by modeling them as recurrent networks of interconnected units with adjustable interactions. Several algorithms have been proposed to optimize the neural connections to enable network tasks such as information storage (i.e. associative memory) and learning probability distributions from data (i.e. generative modeling). Among these methods, the Unlearning algorithm, aligned with emerging theories of synaptic plasticity, was introduced by John Hopfield and collaborators. The primary objective of this thesis is to understand the effectiveness of Unlearning in both associative memory models and generative models. Initially, we demonstrate that the Unlearning algorithm can be simplified to a linear perceptron model which learns from noisy examples featuring specific internal correlations. The selection of structured training data enables an associative memory model to retrieve concepts as attractors of a neural dynamics with considerable basins of attraction. Subsequently, a novel regularization technique for Boltzmann Machines is presented, proving to outperform previously developed methods in learning hidden probability distributions from data-sets. The Unlearning rule is derived from this new regularized algorithm and is showed to be comparable, in terms of inferential performance, to traditional Boltzmann-Machine learning.  ( 2 min )
    Density Uncertainty Layers for Reliable Uncertainty Estimation
    arXiv:2306.12497v2 Announce Type: replace-cross Abstract: Assessing the predictive uncertainty of deep neural networks is crucial for safety-related applications of deep learning. Although Bayesian deep learning offers a principled framework for estimating model uncertainty, the common approaches that approximate the parameter posterior often fail to deliver reliable estimates of predictive uncertainty. In this paper, we propose a novel criterion for reliable predictive uncertainty: a model's predictive variance should be grounded in the empirical density of the input. That is, the model should produce higher uncertainty for inputs that are improbable in the training data and lower uncertainty for inputs that are more probable. To operationalize this criterion, we develop the density uncertainty layer, a stochastic neural network architecture that satisfies the density uncertain criterion by design. We study density uncertainty layers on the UCI and CIFAR-10/100 uncertainty benchmarks. Compared to existing approaches, density uncertainty layers provide more reliable uncertainty estimates and robust out-of-distribution detection performance.  ( 2 min )
    Noise misleads rotation invariant algorithms on sparse targets
    arXiv:2403.02697v1 Announce Type: new Abstract: It is well known that the class of rotation invariant algorithms are suboptimal even for learning sparse linear problems when the number of examples is below the "dimension" of the problem. This class includes any gradient descent trained neural net with a fully-connected input layer (initialized with a rotationally symmetric distribution). The simplest sparse problem is learning a single feature out of $d$ features. In that case the classification error or regression loss grows with $1-k/n$ where $k$ is the number of examples seen. These lower bounds become vacuous when the number of examples $k$ reaches the dimension $d$. We show that when noise is added to this sparse linear problem, rotation invariant algorithms are still suboptimal after seeing $d$ or more examples. We prove this via a lower bound for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much lower upper bounds on the same problem for simple non-rotation invariant algorithms. Finally we analyze the gradient flow trajectories of many standard optimization algorithms in some simple cases and show how they veer toward or away from the sparse targets. We believe that our trajectory categorization will be useful in designing algorithms that can exploit sparse targets and our method for proving lower bounds will be crucial for analyzing other families of algorithms that admit different classes of invariances.  ( 3 min )
    Pair-Matching: Links Prediction with Adaptive Queries
    arXiv:1905.07342v3 Announce Type: replace Abstract: The pair-matching problem appears in many applications where one wants to discover good matches between pairs of entities or individuals. Formally, the set of individuals is represented by the nodes of a graph where the edges, unobserved at first, represent the good matches. The algorithm queries pairs of nodes and observes the presence/absence of edges. Its goal is to discover as many edges as possible with a fixed budget of queries. Pair-matching is a particular instance of multi-armed bandit problem in which the arms are pairs of individuals and the rewards are edges linking these pairs. This bandit problem is non-standard though, as each arm can only be played once. Given this last constraint, sublinear regret can be expected only if the graph presents some underlying structure. This paper shows that sublinear regret is achievable in the case where the graph is generated according to a Stochastic Block Model (SBM) with two communities. Optimal regret bounds are computed for this pair-matching problem. They exhibit a phase transition related to the Kesten-Stigum threshold for community detection in SBM. The pair-matching problem is considered in the case where each node is constrained to be sampled less than a given amount of times. We show how optimal regret rates depend on this constraint. The paper is concluded by a conjecture regarding the optimal regret when the number of communities is larger than 2. Contrary to the two communities case, we argue that a statistical-computational gap would appear in this problem.  ( 3 min )
    To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
    arXiv:2310.13061v2 Announce Type: replace-cross Abstract: Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where ($\xi \cdot 100\%$) of labels are corrupted (\emph{i.e.} some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve $100\%$ accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is (``mechanistically'') interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes \emph{grokking} dynamics reaching high train \emph{and} test accuracy; second, it unlearns the memorizing representations, where the train accuracy suddenly jumps from $100\%$ to $100 (1-\xi)\%$.  ( 3 min )
    Robust Causal Bandits for Linear Models
    arXiv:2310.19794v2 Announce Type: replace Abstract: Sequential design of experiments for optimizing a reward function in causal systems can be effectively modeled by the sequential design of interventions in causal bandits (CBs). In the existing literature on CBs, a critical assumption is that the causal models remain constant over time. However, this assumption does not necessarily hold in complex systems, which constantly undergo temporal model fluctuations. This paper addresses the robustness of CBs to such model fluctuations. The focus is on causal systems with linear structural equation models (SEMs). The SEMs and the time-varying pre- and post-interventional statistical models are all unknown. Cumulative regret is adopted as the design criteria, based on which the objective is to design a sequence of interventions that incur the smallest cumulative regret with respect to an oracle aware of the entire causal model and its fluctuations. First, it is established that the existing approaches fail to maintain regret sub-linearity with even a few instances of model deviation. Specifically, when the number of instances with model deviation is as few as $T^\frac{1}{2L}$, where $T$ is the time horizon and $L$ is the longest causal path in the graph, the existing algorithms will have linear regret in $T$. Next, a robust CB algorithm is designed, and its regret is analyzed, where upper and information-theoretic lower bounds on the regret are established. Specifically, in a graph with $N$ nodes and maximum degree $d$, under a general measure of model deviation $C$, the cumulative regret is upper bounded by $\tilde{\mathcal{O}}(d^{L-\frac{1}{2}}(\sqrt{NT} + NC))$ and lower bounded by $\Omega(d^{\frac{L}{2}-2}\max\{\sqrt{T},d^2C\})$. Comparing these bounds establishes that the proposed algorithm achieves nearly optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret when $C$ is $o(\sqrt{T})$ and maintains sub-linear regret for a broader range of $C$.  ( 3 min )
    Linear quadratic control of nonlinear systems with Koopman operator learning and the Nystr\"om method
    arXiv:2403.02811v1 Announce Type: cross Abstract: In this paper, we study how the Koopman operator framework can be combined with kernel methods to effectively control nonlinear dynamical systems. While kernel methods have typically large computational requirements, we show how random subspaces (Nystr\"om approximation) can be used to achieve huge computational savings while preserving accuracy. Our main technical contribution is deriving theoretical guarantees on the effect of the Nystr\"om approximation. More precisely, we study the linear quadratic regulator problem, showing that both the approximated Riccati operator and the regulator objective, for the associated solution of the optimal control problem, converge at the rate $m^{-1/2}$, where $m$ is the random subspace size. Theoretical findings are complemented by numerical experiments corroborating our results.  ( 2 min )
    Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations
    arXiv:2212.14411v3 Announce Type: replace-cross Abstract: Sequential tests and their implied confidence sequences, which are valid at arbitrary stopping times, promise flexible statistical inference and on-the-fly decision making. However, strong guarantees are limited to parametric sequential tests that under-cover in practice or concentration-bound-based sequences that over-cover and have suboptimal rejection times. In this work, we consider \cite{robbins1970boundary}'s delayed-start normal-mixture sequential probability ratio tests, and we provide the first asymptotic type-I-error and expected-rejection-time guarantees under general non-parametric data generating processes, where the asymptotics are indexed by the test's burn-in time. The type-I-error results primarily leverage a martingale strong invariance principle and establish that these tests (and their implied confidence sequences) have type-I error rates approaching a desired $\alpha$-level. The expected-rejection-time results primarily leverage an identity inspired by It\^o's lemma and imply that, in certain asymptotic regimes, the expected rejection time approaches the minimum possible among $\alpha$-level tests. We show how to apply our results to sequential inference on parameters defined by estimating equations, such as average treatment effects. Together, our results establish these (ostensibly parametric) tests as general-purpose, non-parametric, and near-optimal. We illustrate this via numerical experiments.  ( 2 min )
    Identifying Representations for Intervention Extrapolation
    arXiv:2310.04295v2 Announce Type: replace-cross Abstract: The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome non-linearly. Our setup includes an outcome Y, observed features X, which are generated as a non-linear transformation of latent features Z, and exogenous action variables A, which influence Z. The objective of intervention extrapolation is to predict how interventions on A that lie outside the training support of A affect Y. Here, extrapolation becomes possible if the effect of A on Z is linear and the residual when regressing Z on A has full support. As Z is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call Rep4Ex: we aim to map the observed features X into a subspace that allows for non-linear extrapolation in A. We show that the hidden representation is identifiable up to an affine transformation in Z-space, which is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of A on Z. Based on this insight, we propose a method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through synthetic experiments and show that our approach succeeds in predicting the effects of unseen interventions.  ( 3 min )
    Scalable Bayesian inference for the generalized linear mixed model
    arXiv:2403.03007v1 Announce Type: cross Abstract: The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in applications areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference, that leverages the scalability of modern AI algorithms with guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and apply the method in a large electronic health records database.  ( 2 min )
    Hierarchy of the echo state property in quantum reservoir computing
    arXiv:2403.02686v1 Announce Type: cross Abstract: The echo state property (ESP) represents a fundamental concept in the reservoir computing (RC) framework that ensures output-only training of reservoir networks by being agnostic to the initial states and far past inputs. However, the traditional definition of ESP does not describe possible non-stationary systems in which statistical properties evolve. To address this issue, we introduce two new categories of ESP: \textit{non-stationary ESP}, designed for potentially non-stationary systems, and \textit{subspace/subset ESP}, designed for systems whose subsystems have ESP. Following the definitions, we numerically demonstrate the correspondence between non-stationary ESP in the quantum reservoir computer (QRC) framework with typical Hamiltonian dynamics and input encoding methods using non-linear autoregressive moving-average (NARMA) tasks. We also confirm the correspondence by computing linear/non-linear memory capacities that quantify input-dependent components within reservoir states. Our study presents a new understanding of the practical design of QRC and other possibly non-stationary RC systems in which non-stationary systems and subsystems are exploited.  ( 2 min )
    Machine Learning Assisted Adjustment Boosts Inferential Efficiency of Randomized Controlled Trials
    arXiv:2403.03058v1 Announce Type: cross Abstract: In this work, we proposed a novel inferential procedure assisted by machine learning based adjustment for randomized control trials. The method was developed under the Rosenbaum's framework of exact tests in randomized experiments with covariate adjustments. Through extensive simulation experiments, we showed the proposed method can robustly control the type I error and can boost the inference efficiency for a randomized controlled trial (RCT). This advantage was further demonstrated in a real world example. The simplicity and robustness of the proposed method makes it a competitive candidate as a routine inference procedure for RCTs, especially when the number of baseline covariates is large, and when nonlinear association or interaction among covariates is expected. Its application may remarkably reduce the required sample size and cost of RCTs, such as phase III clinical trials.  ( 2 min )
    Koopman operators with intrinsic observables in rigged reproducing kernel Hilbert spaces
    arXiv:2403.02524v1 Announce Type: cross Abstract: This paper presents a novel approach for estimating the Koopman operator defined on a reproducing kernel Hilbert space (RKHS) and its spectra. We propose an estimation method, what we call Jet Dynamic Mode Decomposition (JetDMD), leveraging the intrinsic structure of RKHS and the geometric notion known as jets to enhance the estimation of the Koopman operator. This method refines the traditional Extended Dynamic Mode Decomposition (EDMD) in accuracy, especially in the numerical estimation of eigenvalues. This paper proves JetDMD's superiority through explicit error bounds and convergence rate for special positive definite kernels, offering a solid theoretical foundation for its performance. We also delve into the spectral analysis of the Koopman operator, proposing the notion of extended Koopman operator within a framework of rigged Hilbert space. This notion leads to a deeper understanding of estimated Koopman eigenfunctions and capturing them outside the original function space. Through the theory of rigged Hilbert space, our study provides a principled methodology to analyze the estimated spectrum and eigenfunctions of Koopman operators, and enables eigendecomposition within a rigged RKHS. We also propose a new effective method for reconstructing the dynamical system from temporally-sampled trajectory data of the dynamical system with solid theoretical guarantee. We conduct several numerical simulations using the van der Pol oscillator, the Duffing oscillator, the H\'enon map, and the Lorenz attractor, and illustrate the performance of JetDMD with clear numerical computations of eigenvalues and accurate predictions of the dynamical systems.  ( 3 min )
    LLP-Bench: A Large Scale Tabular Benchmark for Learning from Label Proportions
    arXiv:2310.10096v2 Announce Type: replace-cross Abstract: In the task of Learning from Label Proportions (LLP), a model is trained on groups (a.k.a bags) of instances and their corresponding label proportions to predict labels for individual instances. LLP has been applied pre-dominantly on two types of datasets - image and tabular. In image LLP, bags of fixed size are created by randomly sampling instances from an underlying dataset. Bags created via this methodology are called random bags. Experimentation on Image LLP has been mostly on random bags on CIFAR-* and MNIST datasets. Despite being a very crucial task in privacy sensitive applications, tabular LLP does not yet have a open, large scale LLP benchmark. One of the unique properties of tabular LLP is the ability to create feature bags where all the instances in a bag have the same value for a given feature. It has been shown in prior research that feature bags are very common in practical, real world applications [Chen et. al '23, Saket et. al. '22]. In this paper, we address the lack of a open, large scale tabular benchmark. First we propose LLP-Bench, a suite of 70 LLP datasets (62 feature bag and 8 random bag datasets) created from the Criteo CTR prediction and the Criteo Sponsored Search Conversion Logs datasets, the former a classification and the latter a regression dataset. These LLP datasets represent diverse ways in which bags can be constructed from underlying tabular data. To the best of our knowledge, LLP-Bench is the first large scale tabular LLP benchmark with an extensive diversity in constituent datasets. Second, we propose four metrics that characterize and quantify the hardness of a LLP dataset. Using these four metrics we present deep analysis of the 62 feature bag datasets in LLP-Bench. Finally we present the performance of 9 SOTA and popular tabular LLP techniques on all the 62 datasets.  ( 3 min )
    Minimum width for universal approximation using ReLU networks on compact domain
    arXiv:2309.10402v2 Announce Type: replace-cross Abstract: It has been shown that deep neural networks of a large enough width are universal approximators but they are not if the width is too small. There were several attempts to characterize the minimum width $w_{\min}$ enabling the universal approximation property; however, only a few of them found the exact values. In this work, we show that the minimum width for $L^p$ approximation of $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$ is exactly $\max\{d_x,d_y,2\}$ if an activation function is ReLU-Like (e.g., ReLU, GELU, Softplus). Compared to the known result for ReLU networks, $w_{\min}=\max\{d_x+1,d_y\}$ when the domain is $\smash{\mathbb R^{d_x}}$, our result first shows that approximation on a compact domain requires smaller width than on $\smash{\mathbb R^{d_x}}$. We next prove a lower bound on $w_{\min}$ for uniform approximation using general activation functions including ReLU: $w_{\min}\ge d_y+1$ if $d_x<d_y\le2d_x$. Together with our first result, this shows a dichotomy between $L^p$ and uniform approximations for general activation functions and input/output dimensions.  ( 2 min )
    Making deep neural networks right for the right scientific reasons by interacting with their explanations
    arXiv:2001.05371v4 Announce Type: replace-cross Abstract: Deep neural networks have shown excellent performances in many real-world applications. Unfortunately, they may show "Clever Hans"-like behavior -- making use of confounding factors within datasets -- to achieve high performance. In this work, we introduce the novel learning setting of "explanatory interactive learning" (XIL) and illustrate its benefits on a plant phenotyping research task. XIL adds the scientist into the training loop such that she interactively revises the original model via providing feedback on its explanations. Our experimental results demonstrate that XIL can help avoiding Clever Hans moments in machine learning and encourages (or discourages, if appropriate) trust into the underlying model.  ( 2 min )
    On the Asymptotic Mean Square Error Optimality of Diffusion Probabilistic Models
    arXiv:2403.02957v1 Announce Type: cross Abstract: Diffusion probabilistic models (DPMs) have recently shown great potential for denoising tasks. Despite their practical utility, there is a notable gap in their theoretical understanding. This paper contributes novel theoretical insights by rigorously proving the asymptotic convergence of a specific DPM denoising strategy to the mean square error (MSE)-optimal conditional mean estimator (CME) over a large number of diffusion steps. The studied DPM-based denoiser shares the training procedure of DPMs but distinguishes itself by forwarding only the conditional mean during the reverse inference process after training. We highlight the unique perspective that DPMs are composed of an asymptotically optimal denoiser while simultaneously inheriting a powerful generator by switching re-sampling in the reverse process on and off. The theoretical findings are validated by numerical results.  ( 2 min )
    Improving Variational Autoencoder Estimation from Incomplete Data with Mixture Variational Families
    arXiv:2403.03069v1 Announce Type: cross Abstract: We consider the task of estimating variational autoencoders (VAEs) when the training data is incomplete. We show that missing data increases the complexity of the model's posterior distribution over the latent variables compared to the fully-observed case. The increased complexity may adversely affect the fit of the model due to a mismatch between the variational and model posterior distributions. We introduce two strategies based on (i) finite variational-mixture and (ii) imputation-based variational-mixture distributions to address the increased posterior complexity. Through a comprehensive evaluation of the proposed approaches, we show that variational mixtures are effective at improving the accuracy of VAE estimation from incomplete data.  ( 2 min )
    Active Statistical Inference
    arXiv:2403.03208v1 Announce Type: new Abstract: Inspired by the concept of active learning, we propose active inference$\unicode{x2013}$a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.  ( 2 min )
    LC-Tsalis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits
    arXiv:2403.03219v1 Announce Type: cross Abstract: This study considers the linear contextual bandit problem with independent and identically distributed (i.i.d.) contexts. In this problem, existing studies have proposed Best-of-Both-Worlds (BoBW) algorithms whose regrets satisfy $O(\log^2(T))$ for the number of rounds $T$ in a stochastic regime with a suboptimality gap lower-bounded by a positive constant, while satisfying $O(\sqrt{T})$ in an adversarial regime. However, the dependency on $T$ has room for improvement, and the suboptimality-gap assumption can be relaxed. For this issue, this study proposes an algorithm whose regret satisfies $O(\log(T))$ in the setting when the suboptimality gap is lower-bounded. Furthermore, we introduce a margin condition, a milder assumption on the suboptimality gap. That condition characterizes the problem difficulty linked to the suboptimality gap using a parameter $\beta \in (0, \infty]$. We then show that the algorithm's regret satisfies $O\left(\left\{\log(T)\right\}^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$. Here, $\beta= \infty$ corresponds to the case in the existing studies where a lower bound exists in the suboptimality gap, and our regret satisfies $O(\log(T))$ in that case. Our proposed algorithm is based on the Follow-The-Regularized-Leader with the Tsallis entropy and referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.  ( 2 min )
    How Well Can Transformers Emulate In-context Newton's Method?
    arXiv:2403.03183v1 Announce Type: cross Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher order optimization methods, beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with only a logarithmic to the error more layers. As a by-product we demonstrate the ability of even linear attention-only Transformers in implementing a single step of Newton's iteration for matrix inversion with merely two layers. These results suggest the ability of the Transformer architecture to implement complex algorithms, beyond gradient descent.  ( 2 min )
    On the impact of measure pre-conditionings on general parametric ML models and transfer learning via domain adaptation
    arXiv:2403.02432v1 Announce Type: new Abstract: We study a new technique for understanding convergence of learning agents under small modifications of data. We show that such convergence can be understood via an analogue of Fatou's lemma which yields gamma-convergence. We show it's relevance and applications in general machine learning tasks and domain adaptation transfer learning.  ( 2 min )
    Transformer for Times Series: an Application to the S&P500
    arXiv:2403.02523v1 Announce Type: cross Abstract: The transformer models have been extensively used with good results in a wide area of machine learning applications including Large Language Models and image generation. Here, we inquire on the applicability of this approach to financial time series. We first describe the dataset construction for two prototypical situations: a mean reverting synthetic Ornstein-Uhlenbeck process on one hand and real S&P500 data on the other hand. Then, we present in detail the proposed Transformer architecture and finally we discuss some encouraging results. For the synthetic data we predict rather accurately the next move, and for the S&P500 we get some interesting results related to quadratic variation and volatility prediction.  ( 2 min )
    On a Neural Implementation of Brenier's Polar Factorization
    arXiv:2403.03071v1 Announce Type: new Abstract: In 1991, Brenier proved a theorem that generalizes the $QR$ decomposition for square matrices -- factored as PSD $\times$ unitary -- to any vector field $F:\mathbb{R}^d\rightarrow \mathbb{R}^d$. The theorem, known as the polar factorization theorem, states that any field $F$ can be recovered as the composition of the gradient of a convex function $u$ with a measure-preserving map $M$, namely $F=\nabla u \circ M$. We propose a practical implementation of this far-reaching theoretical result, and explore possible uses within machine learning. The theorem is closely related to optimal transport (OT) theory, and we borrow from recent advances in the field of neural optimal transport to parameterize the potential $u$ as an input convex neural network. The map $M$ can be either evaluated pointwise using $u^*$, the convex conjugate of $u$, through the identity $M=\nabla u^* \circ F$, or learned as an auxiliary network. Because $M$ is, in general, not injective, we consider the additional task of estimating the ill-posed inverse map that can approximate the pre-image measure $M^{-1}$ using a stochastic generator. We illustrate possible applications of \citeauthor{Brenier1991PolarFA}'s polar factorization to non-convex optimization problems, as well as sampling of densities that are not log-concave.  ( 2 min )
    Applied Causal Inference Powered by ML and AI
    arXiv:2403.02467v1 Announce Type: cross Abstract: An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclical graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools.  ( 2 min )

  • Open

    [D] Which of the curriculum of the different courses is good for AI and machine learning?
    Hi, I wanted to know which course curriculum is good for machine learning and AI. These institutes had provided me with info and curriculum for the different courses for training. I am no expert in this field. I just learned python so wanted to get into this field or career. So I don't really understand what is written in the curriculum, so please help me choose which one has a better curriculum or are they all same? Below I have provided the link of pdf of the curriculum of all the courses. 1)https://heyzine.com/flip-book/1dfc511ece.html 2) https://heyzine.com/flip-book/0f8a771ac1.html 3) https://heyzine.com/flip-book/df8f6ae51e.html 4) https://heyzine.com/flip-book/e58f1c6435.html submitted by /u/Confident-Ad4064 [link] [comments]
    [D] Training an ML model to recognize gold-bearing quartz rocks
    How would I use Tensor Flow to train an ML model to recognize quartz rocks that may contain gold? In general, you want 'ugly' quartz that is broken, brown and grey around the white parts of the quartz. In terms of attributes, this could be modelled conceptually as follows: rock_type: quartz rock_texture: broken rock_colors: [brown, grey] rock_characteristics: [broken, brown, grey] I could easily assemble a dataset of training images for this. Would it be hard to create a machine learning algorithm that could be fed hundreds or thousands of images of gold-bearing quartz ore, and predict which rocks are more likely to contain the most gold? What would the steps be to accomplish this? Again, I am a total newbie, so I'm not sure if I have even phrased my questions correctly. Thank you, and have a great day. submitted by /u/PickAxeCA [link] [comments]
    [D] Compare coefficient of determination vs correlation R-squared values
    Looking for some help in method comparisons. For 100 genes, I have created a model to predict the expression of each gene (100 models total). I've performed k-fold cross validation to compute coefficient of determination R-squared values, and have aggregated these values to get 1 per model, and then computed a mean and median of these values. I want to see how this compares with another method, but this method is a computed score (not regression or ML) so I have computed the coefficient of correlation R-squared values for each of the 100 genes, and then computed a mean and median of these values. Because the coefficient of determination R-squared values from my ML models can be negative, and the coefficient of correlation R-squared values for the computed score range 0 to 1, I don't know if a direct comparison of these R-squared values makes sense. Does anyone have suggestions for how to compare them? Does it make sense transform the coefficient of determination R-squared values to a range of 0 to 1? submitted by /u/Br0wnish [link] [comments]
    [D] Azure GPU Restrictions?
    Just requested quota increase for a100 image and they said GPU demand was too high. Was wondering if anyone else has run into this and if so how did you resolve it? Renting GPU’s should not be this difficult… submitted by /u/Primary-Track8298 [link] [comments]
    [D] Apple ANE performance
    Are there any published benchmarks of inference performance of ANE on various popular vision benchmarks (resnet 50, yolo)? Also are there any published studies on the power consumption and or efficiency? submitted by /u/Dry-Significance-821 [link] [comments]
    Finetuning LLM on non-text data [discussion]
    Hi, Is it possible to train a large language model on data that is not text. Can I apply it to any sequential dataset for example music notes? I think the tricky part is tokenizing the dataset so the LLM can better understand the underlying structure of the data. If pretrained LLMS are not the right way, could you suggest any other models that are pretrained? The problem im trying to solve has discrete values to predict so I don’t think using audio generation models make sense. I prefer to not train from scratch so just wondering. Also if my intuition is off please let me know. Thanks! submitted by /u/Dunkeyfunkey [link] [comments]
    [P] InstaSwap Face Swap Node for ComfyUI
    ComfyUI repo: https://github.com/abdozmantar/ComfyUI-InstaSwap Standalon repo: https://github.com/abdozmantar/Standalone-InstaSwap https://i.redd.it/9vw32vxx7smc1.gif submitted by /u/abdullahozmntr [link] [comments]
    [D] LLM max output token limit
    Today, I had an issue with GPT-4 API as I wanted to translate a file which was a little bit bigger than 4096 tokens. I had never encountered this kind of issue before and I did not know what the hard limit for the size of output tokens was. I found some interesting facts, first, this information is not easily available. OpenAI removed it from the model description page: https://platform.openai.com/docs/models but it is 4096. Mistral AI does not give this information for its latest model: https://docs.mistral.ai/guides/model-selection/ Claude is the only one that provides this information easily: https://docs.anthropic.com/claude/docs/models-overview (4096 for all models) For Gemini 1.0, the limit is very low at 2048, but Gemini 1.5 is at 8192, which is better than the others. In the end, I managed to translate my text using Mistral large, so it seems to have a bigger output limit than the others, but so far, I do not know what it is. I know that I could just split my text in two but it was easier to just change the model :) I’m curious if somebody knows what the Mistral large output limit is (and where to get this information). I’m also a little puzzled as, with the context windows starting to be bigger (1 million tokens), if the output size stays small, it could start to be too much of a difference. What do you think? submitted by /u/kokoshkatheking [link] [comments]
    [D] Techniques for finding potentially bad labels in NLP classification datasets
    I'm looking for techniques for finding potentially bad labels in NLP classification datasets. We've been using Cleanlab (confident learning), but we've found that the precision/recall isn't very high for our use case (content moderation/media monitoring). Do you have any pointers to other interesting techniques/papers? GPT-4 et al. is a good candidate when you spend enough time writing descriptive prompts, but it comes at the cost of being expensive at scale. Curious about smarter "pre-selection" techniques instead of throwing everything at GPT? submitted by /u/IbrahimSharaf [link] [comments]
    [D] Honest ML suggestions please
    Hey, I am a data engineer with 4 years experience. But i have been learning ML since past 6 months. I completed all the required stats , probability, statistical analysis and traditional ML algorithms. I know the loss functions of them and how they are derived. But i want to deep dive into each algorithm derivation from scratch so that I can remember them like high school math formulas. But it’s taking too long for this. Do you guys think is it worth doing it? How much knowledge do you think is needed? Another question is i am planning to go for masters but there is so much negative talk about jobs being replaced in next 5-10 years. Saw a video from cuban as well today stating that. Also khosla says by 10 years, even doctors can be replaced from some extent. Need some suggestions from all of you guys, its actually very stressful for me to make decisions after seeing all these in the news submitted by /u/MediumZealousideal29 [link] [comments]
    [D] (C) ML Res career?
    Hello guys, this question is due to no idea of whether which prospect will keep me busy and valuable for very long period. What meant is I wanted to focus on additional mathematics topics and few physics topics to either improve ML/DS or have a research position in them. Any suggestions of my initial thoughts, do you think will it work our for me? What should I focus on physics and maths too? submitted by /u/leao_26 [link] [comments]
    [D]: Seeking Advice: Transitioning from GIS to AI/ML for Geospatial Applications
    Hey fellow GIS/ML enthusiasts! I'm currently working in a government GIS role in Canada and loving every bit of it, but I'm eager to dive deeper into the world of AI and machine learning for geospatial applications. I believe this intersection holds immense potential and can revolutionize how we analyze and interpret spatial data. I'm reaching out for suggestions on how to make this transition smoother. Are there any specific online courses, tutorials, resources, companies/organizations that you would recommend for someone with a GIS background looking to delve into data science and machine learning? Any advice or personal experiences would be greatly appreciated! For reference, during my BSc in mathematics and physics I learned Python and R and took graduate level courses in data science, ML, and deep learning. I then discovered geospatial data science and was hooked. I then transitioned to obtaining a GIS graduate certificate and masters in Applied Geomatics which is how I obtained my current role. During this time, I developed custom AI solutions using the Esri deep learning frameworks for satellite imagery and LiDAR data which was fun, but ultimately I want to develop my own NN architectures/solutions. Currently, during my spare time after work and on weekends, I am working on career tracks on datacamp for ML Engineering/ Data Scientist in Python to continue filling in the gaps of my knowledge and stay current. I am hoping to stay in my current role for financial stability until I have enough skills to transition into a ML role which pays better in the future. I am happy with my role at the moment, but I am doing more database admin than spatial analysis. I am also looking to enter a role in the future that uses the full extent of my abilities in mathematics/programming. submitted by /u/ayooooooooooooooo0o0 [link] [comments]
    [D] The reversal curse
    https://arxiv.org/pdf/2309.12288.pdf Turns out I predicted reversal curse in 2021 lol https://www.reddit.com/r/MachineLearning/comments/p13ean/d_can_gpt_generalize_in_both_directions/ EDIT: this other paper that the first paper cites even uses a very similar example: https://arxiv.org/pdf/2308.03296.pdf The first President of the United States was George Washington I'd be very happy if my post inspired the authors in any way submitted by /u/DeMorrr [link] [comments]
    [D] Latest / Most up to date TTS / RVC Setup + Ways to get intonation / emotion?
    Not sure if this is the best sub to post, but i've seen similar threads here... I've been using voice cloners like xtts or bark (via coqui tts) and they're kinda... meh... (sometimes it was good, sometimes it wasn't, and when doing a lot of text to speech - most outputs were missing words or even entire sentences). I even tried to fine tune xtts (v2) and even tho the voice got much much better, the outputs were worse (always missing words, sometimes even entire paragraphs). So i started fiddling with rvc and it seems better - but i have a few issues. Is there a gui for inference / training? For faster setup? I found this : https://github.com/SayanoAI/RVC-Studio - but it's kinda buggy (could not train with it) Or if there's a basic code setup i can use to train and do inference (i need to do bulk files). Furthermore, there's one thing i don't know if i can do with RVC: add emotion / intonation to the output voice. If create the input voice using some basic TTS, the voice will be flat - will the RVC voice be able to add proper emotion to the flat TTS voice? Or what's the best setup to achieve this? submitted by /u/yupignome [link] [comments]
    [D] Hugging face to Android Studio (flutter)
    Have anyone tried making an app on android studio where you communicate with hugging face inference api or endpoints? I need help Im building a mobile chatbot with the help of hugging face model and I cant find any tutorial on how to do it (Im a beginner) and the chatbot is for my capstone project. submitted by /u/HarryPottahh [link] [comments]
    [D][R]Recent developments in Reinforcement Learning
    I am trying to get into reinforcement learning and have just finished the one by Sutton and Barto along with a course from YT. Just wanted to know what is currently being done on this topic(a survey paper/book will be nice). Also wanted to know what kind of datasets are commonly being used. The course I followed was completely theoretical in nature, so I also wanted to know what toolkit is used in this field currently . submitted by /u/ANI_phy [link] [comments]
    [D] How do I get better as ML engineer?
    After completing MS with ML specialization, I have formally transitioned into a ML role at my company. I now face two challenges: There are no other experienced ML engineers to learn from in my team. I’ve 12 years of work experience so expectations from me is to work like senior ML engineer. I’m on top of game when it comes to GenAI but not for ML or DL related projects. How can I plan work, personal learnings etc so that I can become better over time? Past work: Mostly Microsoft tech stack, .net, SharePoint, O365, etc. submitted by /u/vikbehal [link] [comments]
    [D] Any recommended recent video generation papers?
    I want to learn video generation. I already have some experience with GANs, VAEs, DPMs. But not with videos. My theisis main focus is about video generation and I'm looking for recent (3-4 years ago max) developments in video synthesis papers. Not like Sora which is closed-sourced, but actual open-sourced models or projects. I want to learn everything: temporal, spatial cohesion, pixel prediction models, frame-by-frame generation, and more. For instance, I know DVD-GAN is good, but not recent (2019). Maybe even survey papers, so I can feel what each model is doing in high-level. submitted by /u/ShlomiRex [link] [comments]
    [D]Why isn't Hugging Face becoming one of the promising (and young) AI chatbot players on the table (like Mistral AI, Anthropic, Perplexity AI, etc)
    I rememebered a few years ago people discussing about what HF's business model is or how to profit. I think now is the best time for them, but yet i am a bit suprised they didn't make their own one. They have talents, experiences, resources. Just wonder it. submitted by /u/xiikjuy [link] [comments]
    [N] Groq AMA - March 6 @ noon PST
    Join Igor Arsovski, Chief Architect & Fellow, and Andrew Ling, Senior Director ML Compiler, for an AMA about Groq hardware and software on March 6 at noon PST. https://discord.com/invite/rfVBfxyX?event=1214271926164389929 Please submit questions in advance through our discord submitted by /u/turtlespy965 [link] [comments]
    [N] OpenAI Residency Applications have opened again today until March 11 in SF
    I was interested in this some time ago but they were closed at that time, so I setup a site diff alert and today it notified me, so I checked and they are hiring again. Its an awesome opportunity for talented non-AI folks to transition into AI while getting full salary at OpenAI. Its not the right time for me personally to apply, but I thought I might share this with all of you if you are interested. Good luck applying! https://openai.com/careers/openai-residency submitted by /u/ixam1212 [link] [comments]
    [D] ICML 2024 Support Thread
    Opening a thread as a support group for everyone that submitted to ICML 2024. Reviews come out March 15th (if there are no delays). Let us know if you've gotten any reviews in yet, if you particularly hated one reviewer, or liked another one. Anything goes! submitted by /u/Adventurous-Cut-7077 [link] [comments]
  • Open

    Help with research paper on AI replacing Art
    Hey I’m currently writing a research paper on ai replacing jobs in digital design. This is the Table of contents: What is artificial intelligence/creativity? What is the status yesterday, today, and tomorrow? Should art be paid for if it's so replaceable? If everyone gets to be an artist, is art still unique? Hatred towards artists, and AI as the terminator of "gatekeeping" in creativity from artists. A ship needs to be repaired, parts are removed, eventually everything is replaced, is the ship still original? Are there two original ships (one of new parts and one of old)? Which one is more valuable? The brain has given itself a name. We have given AI a name. This is a very wide topic and I’m limited to 12 pages, I can’t explain what creativity is, what Artificial intelligence is or how it would affect the economy if designers were replaced by AI tomorrow but I want to come to a philosophical conclusion what art means to us and if it’s actually replaceable on an emotional level. Any ideas and tips appreciated. <3 submitted by /u/MrMerford [link] [comments]
    Best Studio?
    I started with LangChain on Hugging Face before creating a local version and tested some front ends like LM Studio. I really like the simplicity but haven't been able to augment. It's not a huge job in LangChain but wondered if there's a locally hosted studio (GUI) to augment an LLM? submitted by /u/SithLordRising [link] [comments]
    Google's Gemini flop raises the question: What exactly do we want our chatbots to do, really?
    submitted by /u/thisisinsider [link] [comments]
    Why are all ai videos moving so slow and steady?
    And when will that change? submitted by /u/Past_Echidna_9097 [link] [comments]
    AI is helping doctors diagnose cancer, heart defects faster than ever before
    More of this and less terrible AI generated art please! submitted by /u/meghancooking [link] [comments]
    UC Berkeley Research Presents a Machine Learning System that Can Forecast at Near Human Levels
    A research team from UC Berkeley has developed a novel LM pipeline, a retrieval-augmented language model system specifically designed for forecasting. The system achieved an average Brier score of .179, closely approaching the human aggregate score of .149, indicating that the language model-based forecasting system closely approximates, and in some instances surpasses, the accuracy of human forecasters aggregated from competitive platforms. Source: https://app.daily.dev/posts/Op9crOnZn submitted by /u/UpvoteBeast [link] [comments]
    One is a real photo and one is A.I. generated. Can you tell which is which?
    submitted by /u/Armand_Roulinn [link] [comments]
    StickerGPT New Model
    I have stumbled upon this website https://tensorflow.in Is this a stable diffusion model or they have trained their own model results look pretty good any one knows similar opensource model submitted by /u/cudanexus [link] [comments]
    Microsoft AI engineer warns FTC about Copilot Designer safety concerns
    submitted by /u/trueslicky [link] [comments]
    Facebook Group Meta AI Assistant?
    Does anyone know when the Facebook Meta AI Assistant will release in the UK? I'm a Facebook group administrator and would love to integrate this into my group, but I can't find any information on it. Thank you. If this is the wrong Reddit for this question, I apologise. submitted by /u/NikonD3X1985 [link] [comments]
    How far back could an LLM have been created?
    I’ve been wondering how far back an LLM could have been created before the computer technology was insufficient to realise a step in the process? My understanding is that an LLM is primarily conceptual and if you took the current research back ten or fifteen years they could have created an LLM back then, although it might have operated a bit more slowly. Your thoughts? submitted by /u/rutan668 [link] [comments]
    One-Minute Daily AI News 3/5/2024
    ‘One Life’ director predicts AI-generated shows are ‘inevitable,’ could be here in 3-5 years.[1] Meta plans launch of new AI language model Llama 3 in July, The Information reports.[2] OpenAI says Musk only ever contributed $45 million, wanted to merge with Tesla or take control.[3] MIT Scientist Raises $53 Million to Deploy AI for Dentists.[4] Sources: [1] https://www.foxnews.com/entertainment/one-life-director-predicts-ai-generated-shows-inevitable-could-be-here-three-five-years [2] https://ca.news.yahoo.com/finance/news/meta-plans-launch-ai-language-163112670.html [3] https://techcrunch.com/2024/03/05/openai-response-elon-musk-lawsuit/ [4] https://news.bloomberglaw.com/health-law-and-business/mit-scientist-raises-53-million-to-deploy-ai-for-dentists-1 submitted by /u/Excellent-Target-847 [link] [comments]
    OpenAI response to Elon Musk lawsuit.
    submitted by /u/Cbo305 [link] [comments]
    AI generated music is inherently more difficult to make than AI images.
    Why is it harder to make AI generated music than images? Initially I thought it was simply because there's not enough will to pour in the same amount of resources to develop high quality models. Or there's not enough training data available. However as I thought about it more I came to the conclusion that music generation is intrinsically harder to solve. To explain involves some math but bear with me. I've studied this stuff so please don't dismiss me. I'd love to hear other people's opinions on this since I haven't seen anybody address this issue. When image generation occurs the model starts with abstract shapes and iterates over time adding more detail each passthrough. Eventually there is enough detail that we can perceive individual, separate objects in the image. But we only percei…
  • Open

    Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies
    Research advances are driving real-world impact faster than ever. Episode 2 of Microsoft Research Forum explores how AI is transforming health care and the natural sciences, the intersection of AI and society, and the evolution of foundational AI technologies. The post Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies appeared first on Microsoft Research.  ( 11 min )
    Research Focus: Week of March 4, 2024
    In this issue: Generative kaleidoscopic networks; Text diffusion with reinforced conditioning; PRISE – Learning temporal action abstractions as a sequence compression problem. The post Research Focus: Week of March 4, 2024 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Croissant: a metadata format for ML-ready datasets
    Posted by Omar Benjelloun, Software Engineer, Google Research, and Peter Mattson, Software Engineer, Google Core ML and President, MLCommons Association Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations. ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivi…  ( 91 min )
  • Open

    How AI is transforming marketing strategies in 2024
    As we approach 2024, no brand or customer remains unaffected by AI. Our interactions in the marketplace have altered as a result. It has changed how we communicate. in particular, branding. The curiosity about artificial intelligence’s seemingly limitless possibilities grows constantly as it gathers more data to sharpen its intelligence. Top trends in AI marketing… Read More »How AI is transforming marketing strategies in 2024 The post How AI is transforming marketing strategies in 2024 appeared first on Data Science Central.  ( 22 min )
  • Open

    Can't understand most mathematical formula in RL code!!!
    In my research i think it's important to figure formula in paper so i can obtain a more clear idea about what author want to convey and the main contribution about how they improved. I think it's time for me to learn more math to make a deeper research in RL. I think RL is very similar to optimization control. So Maybe it's good to learn some convex optimization and functional analysis or maybe variation analysis? Confusing what i need to improve my knowledge with RL, help! submitted by /u/Tight-Ad789 [link] [comments]
    Training two competing networks/policies using stable baselines
    I have environment in which 2+ competing agents have quite different goals based on their different reward functions. Furthermore, these goals change based on the instance of the environment. I'd like to train different networks/policies representing 'personalities'. One method is to train a general agent that works for any configuration of goals and reward functions, encoded in the observation space, but I expect that this adds too much complexity in the network. My hope is that another method would be to have 2+ networks/policies training at the same time. A step/reset returns {'agent_0': [...], 'agent_1': [...]} containing observations and rewards for each agent, wouldn't it be possible to use one for network1, and another for network2? I am training using Stable Baselines 3 (using supersuit to convert a PettingZoo environment so that sb3 can train on it), currently training is abstracted away through sb3: env = ss.pettingzoo_env_to_vec_env_v1(env) env = ss.concat_vec_envs_v1(env, 8, num_cpus=2, base_class="stable_baselines3") model = SAC(MlpPolicy, env) model.learn(total_timesteps=steps) model.save(run_name) Does the approach that I have in mind make sense? If so, how can I train 2+ networks at the same time in this fashion? submitted by /u/davidschep [link] [comments]
    Questions on applying RL on Hydrodynamics & Control and simulation software selection
    I am working on my thesis which basically revolves around submerged hydrofoil crafts (example) and control of longitudinal (maybe also lateral) motion. The main idea is to approach this as an altitude/pitch/roll control problem (states being height above water, and pitching/rolling angles and their 1st derivatives) and implement RL as the means of making a control algorithm. There's no real boat to try the algorithms on, so the idea is to use a simulation environment and feed it a boat, foil, strut etc parameters/properties. My questions are basically: Would it be preferable to use a model-based approach (I can implement a relatively simple linear representation of a trim point) rather than just brute-force it? Are you aware of any simulation environments that can work nicely with aero/hydrodynamics (although I suppose calculating Lift and Drag coefficients and plugging them in directly as forces would be more precise)? I have previously worked on Unity3D and I like how easily you can drag&drop scripts on gameObjects. I've also heard of Gazebo (specifically VRX or UVSim package). Can simulation environments be run headless? I think it would be very time consuming to run a visual representation of the craft. It'd be better if I can turn graphics off and only expect the physics engine to run the rigid body/sea interaction. Notes The simulations I'm interested in are for steady head-on course (no maneuvering) and only while the craft is foilbourne (no need to simulate hull interaction with water). ​ submitted by /u/John_Skoun [link] [comments]
    Is the Decision Transformer actually providing open loop control? Also referred to as action sequence for an instance of a task by some.
    submitted by /u/Imo-Ad-6158 [link] [comments]
    Scenario for Robotics Research
    Hi guys, I am curious to know about the current trend of research at the intersection of AI and Computer Vision and Robotics in the following companies: Tesla Google Microsoft Amazon Apple Meta Robotics Scientist/ Robotics Researchers working/ have worked for these companies could you please give insights about the standard of research, especially in Robotics and AI, in these companies? Introduction about me: Im a research aspirant with emphasis on AI and Robotics. I have worked on ML, Computer Vision, Deep Learning. For my MS thesis, I am working on indoor mobile robot navigation using Computer Vision and Deep RL. If theres anybody working on the same topic as me, do comment and send a chat request as well :))) Peace ✌ submitted by /u/Quirky_Assignment707 [link] [comments]
    Molecular generation: optimizing a complex reinforcement learning model
    Hello, I wrote a NN-based model who's objective is to optimize a certain property of an input molecule. This property is given by an eval(molecule) function provided by e.g. rdkit. An example property could be molecular weight. For those who don't know, a molecule can be represented as a labeled graph where nodes are atoms and edges are bonds. Following recent work in this field, my molecule optimization approach is done by "attaching" graphs from a pre-built vocabulary (also called Motif) to the Input Molecule. My model follows these steps: Given the input molecule, 1. Choose the Motif from the Vocabulary (GNN+MLP) 2. Choose candidate atoms from the Molecule that should form the attachment with the Motif (GNN+MLP) 3. Choose candidate atoms from the Motif that should form the attac…
    Ronald Williams (REINFORCE, 1992) died last month (2024-02-16)
    submitted by /u/gwern [link] [comments]
    RL Algo failing to learn
    ​ https://preview.redd.it/ohq2w15akmmc1.png?width=1000&format=png&auto=webp&s=de47e6683213af03218fa754ce18a65c700fc9c1 I'm trying to reimplement a recent paper where the rewards are preference-based. The base agent is SAC, but the agent can't seem to improve, fluctuating between the 150 and 60 episode return range and sometime peaking at around 350. Any suggestions or reasons behind the fluctuating return? submitted by /u/kengsleh [link] [comments]
  • Open

    Efficiently fine-tune the ESM-2 protein language model with Amazon SageMaker
    In this post, we demonstrate how to efficiently fine-tune a state-of-the-art protein language model (pLM) to predict protein subcellular localization using Amazon SageMaker. Proteins are the molecular machines of the body, responsible for everything from moving your muscles to responding to infections. Despite this variety, all proteins are made of repeating chains of molecules called […]  ( 9 min )
  • Open

    Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso
    As visual generative AI matures from research to the enterprise domain, businesses are seeking responsible ways to integrate the technology into their products. Bria, a startup based in Tel Aviv, is responding with an open platform for visual generative AI that emphasizes model transparency alongside fair attribution and copyright protections. Currently offering models that convert Read Article  ( 6 min )
    AI Decoded: Demystifying AI and the Hardware, Software and Tools That Power It
    With the 2018 launch of RTX technologies and the first consumer GPU built for AI — GeForce RTX — NVIDIA accelerated the shift to AI computing. Since then, AI on RTX PCs and workstations has grown into a thriving ecosystem with more than 100 million users and 500 AI applications.  ( 10 min )
  • Open

    A Posteriori Evaluation of a Physics-Constrained Neural Ordinary Differential Equations Approach Coupled with CFD Solver for Modeling Stiff Chemical Kinetics
    arXiv:2312.00038v3 Announce Type: replace-cross Abstract: The high computational cost associated with solving for detailed chemistry poses a significant challenge for predictive computational fluid dynamics (CFD) simulations of turbulent reacting flows. These models often require solving a system of coupled stiff ordinary differential equations (ODEs). While deep learning techniques have been experimented with to develop faster surrogate models, they often fail to integrate reliably with CFD solvers. This instability arises because deep learning methods optimize for training error without ensuring compatibility with ODE solvers, leading to accumulation of errors over time. Recently, NeuralODE-based techniques have offered a promising solution by effectively modeling chemical kinetics. In this study, we extend the NeuralODE framework for stiff chemical kinetics by incorporating mass conservation constraints directly into the loss function during training. This ensures that the total mass and the elemental mass are conserved, a critical requirement for reliable downstream integration with CFD solvers. Proof-of-concept studies are performed with physics-constrained neuralODE (PC-NODE) approach for homogeneous autoignition of hydrogen-air mixture over a range of composition and thermodynamic conditions. Our results demonstrate that this enhancement not only improves the physical consistency with respect to mass conservation criteria but also ensures better robustness. Lastly, a posteriori studies are performed wherein the trained PC-NODE model is coupled with a 3D CFD solver for computing the chemical source terms. PC-NODE is shown to be more accurate relative to the purely data-driven neuralODE approach. Moreover, PC-NODE also exhibits robustness and generalizability to unseen initial conditions from within (interpolative capability) as well as outside (extrapolative capability) the training regime.  ( 3 min )
    Local Graph Clustering with Noisy Labels
    arXiv:2310.08031v2 Announce Type: replace Abstract: The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.  ( 3 min )
    ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Grasping and Articulation
    arXiv:2309.03891v2 Announce Type: replace-cross Abstract: We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include grasping and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies grasping and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Grasping and Articulation, a task that involves bringing an object into a target articulated pose. This task requires grasping, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor.  ( 2 min )
    Multi-View Hypercomplex Learning for Breast Cancer Screening
    arXiv:2204.05798v3 Announce Type: replace-cross Abstract: Traditionally, deep learning methods for breast cancer classification perform a single-view analysis. However, radiologists simultaneously analyze all four views that compose a mammography exam, owing to the correlations contained in mammography views, which present crucial information for identifying tumors. In light of this, some studies have started to propose multi-view methods. Nevertheless, in such existing architectures, mammogram views are processed as independent images by separate convolutional branches, thus losing correlations among them. To overcome such limitations, in this paper, we propose a methodological approach for multi-view breast cancer classification based on parameterized hypercomplex neural networks. Thanks to hypercomplex algebra properties, our networks are able to model, and thus leverage, existing correlations between the different views that comprise a mammogram, thus mimicking the reading process performed by clinicians. This happens because hypercomplex networks capture both global properties, as standard neural models, as well as local relations, i.e., inter-view correlations, which real-valued networks fail at modeling. We define architectures designed to process two-view exams, namely PHResNets, and four-view exams, i.e., PHYSEnet and PHYBOnet. Through an extensive experimental evaluation conducted with publicly available datasets, we demonstrate that our proposed models clearly outperform real-valued counterparts and state-of-the-art methods, proving that breast cancer classification benefits from the proposed multi-view architectures. We also assess the method generalizability beyond mammogram analysis by considering different benchmarks, as well as a finer-scaled task such as segmentation. Full code and pretrained models for complete reproducibility of our experiments are freely available at https://github.com/ispamm/PHBreast.  ( 3 min )
    MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting
    arXiv:2306.10356v2 Announce Type: replace Abstract: Accurate forecasting of renewable generation is crucial to facilitate the integration of RES into the power system. Focusing on PV units, forecasting methods can be divided into two main categories: physics-based and data-based strategies, with AI-based models providing state-of-the-art performance. However, while these AI-based models can capture complex patterns and relationships in the data, they ignore the underlying physical prior knowledge of the phenomenon. Therefore, in this paper we propose MATNet, a novel self-attention transformer-based architecture for multivariate multi-step day-ahead PV power generation forecasting. It consists of a hybrid approach that combines the AI paradigm with the prior physical knowledge of PV power generation of physics-based methods. The model is fed with historical PV data and historical and forecast weather data through a multi-level joint fusion approach. The effectiveness of the proposed model is evaluated using the Ausgrid benchmark dataset with different regression performance metrics. The results show that our proposed architecture significantly outperforms the current state-of-the-art methods. These findings demonstrate the potential of MATNet in improving forecasting accuracy and suggest that it could be a promising solution to facilitate the integration of PV energy into the power grid.  ( 2 min )
    Contributing Dimension Structure of Deep Feature for Coreset Selection
    arXiv:2401.16193v2 Announce Type: replace Abstract: Coreset selection seeks to choose a subset of crucial training samples for efficient learning. It has gained traction in deep learning, particularly with the surge in training dataset sizes. Sample selection hinges on two main aspects: a sample's representation in enhancing performance and the role of sample diversity in averting overfitting. Existing methods typically measure both the representation and diversity of data based on similarity metrics, such as L2-norm. They have capably tackled representation via distribution matching guided by the similarities of features, gradients, or other information between data. However, the results of effectively diverse sample selection are mired in sub-optimality. This is because the similarity metrics usually simply aggregate dimension similarities without acknowledging disparities among the dimensions that significantly contribute to the final similarity. As a result, they fall short of adequately capturing diversity. To address this, we propose a feature-based diversity constraint, compelling the chosen subset to exhibit maximum diversity. Our key lies in the introduction of a novel Contributing Dimension Structure (CDS) metric. Different from similarity metrics that measure the overall similarity of high-dimensional features, our CDS metric considers not only the reduction of redundancy in feature dimensions, but also the difference between dimensions that contribute significantly to the final similarity. We reveal that existing methods tend to favor samples with similar CDS, leading to a reduced variety of CDS types within the coreset and subsequently hindering model performance. In response, we enhance the performance of five classical selection methods by integrating the CDS constraint. Our experiments on three datasets demonstrate the general effectiveness of the proposed method in boosting existing methods.  ( 3 min )
    BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference
    arXiv:2310.11142v2 Announce Type: replace-cross Abstract: Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications.  ( 2 min )
    Inferring Properties of Graph Neural Networks
    arXiv:2401.03790v2 Announce Type: replace Abstract: We propose GNNInfer, the first automatic property inference technique for GNNs. To tackle the challenge of varying input structures in GNNs, GNNInfer first identifies a set of representative influential structures that contribute significantly towards the prediction of a GNN. Using these structures, GNNInfer converts each pair of an influential structure and the GNN to their equivalent FNN and then leverages existing property inference techniques to effectively capture properties of the GNN that are specific to the influential structures. GNNINfer then generalizes the captured properties to any input graphs that contain the influential structures. Finally, GNNInfer improves the correctness of the inferred properties by building a model (either a decision tree or linear regression) that estimates the deviation of GNN output from the inferred properties given full input graphs. The learned model helps GNNInfer extend the inferred properties with constraints to the input and output of the GNN, obtaining stronger properties that hold on full input graphs. Our experiments show that GNNInfer is effective in inferring likely properties of popular real-world GNNs, and more importantly, these inferred properties help effectively defend against GNNs' backdoor attacks. In particular, out of the 13 ground truth properties, GNNInfer re-discovered 8 correct properties and discovered likely correct properties that approximate the remaining 5 ground truth properties. Using properties inferred by GNNInfer to defend against the state-of-the-art backdoor attack technique on GNNs, namely UGBA, experiments show that GNNInfer's defense success rate is up to 30 times better than existing baselines.  ( 3 min )
    Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
    arXiv:2403.00867v1 Announce Type: cross Abstract: Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including functional values and its smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.  ( 2 min )
    A Novel Truncated Norm Regularization Method for Multi-channel Color Image Denoising
    arXiv:2307.07932v2 Announce Type: replace-cross Abstract: Due to the high flexibility and remarkable performance, low-rank approximation methods has been widely studied for color image denoising. However, those methods mostly ignore either the cross-channel difference or the spatial variation of noise, which limits their capacity in real world color image denoising. To overcome those drawbacks, this paper is proposed to denoise color images with a double-weighted truncated nuclear norm minus truncated Frobenius norm minimization (DtNFM) method. Through exploiting the nonlocal self-similarity of the noisy image, the similar structures are gathered and a series of similar patch matrices are constructed. For each group, the DtNFM model is conducted for estimating its denoised version. The denoised image would be obtained by concatenating all the denoised patch matrices. The proposed DtNFM model has two merits. First, it models and utilizes both the cross-channel difference and the spatial variation of noise. This provides sufficient flexibility for handling the complex distribution of noise in real world images. Second, the proposed DtNFM model provides a close approximation to the underlying clean matrix since it can treat different rank components flexibly. To solve the problem resulted from DtNFM model, an accurate and effective algorithm is proposed by exploiting the framework of the alternating direction method of multipliers (ADMM). The generated subproblems are discussed in detail. And their global optima can be easily obtained in closed-form. Rigorous mathematical derivation proves that the solution sequences generated by the algorithm converge to a single critical point. Extensive experiments on synthetic and real noise datasets demonstrate that the proposed method outperforms many state-of-the-art color image denoising methods.  ( 3 min )
    A Truly Joint Neural Architecture for Segmentation and Parsing
    arXiv:2402.02564v2 Announce Type: replace-cross Abstract: Contemporary multilingual dependency parsers can parse a diverse set of languages, but for Morphologically Rich Languages (MRLs), performance is attested to be lower than other languages. The key challenge is that, due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance. Pre-neural dependency parsers for MRLs subscribed to the joint morpho-syntactic hypothesis, stating that morphological segmentation and syntactic parsing should be solved jointly, rather than as a pipeline where segmentation precedes parsing. However, neural state-of-the-art parsers to date use a strict pipeline. In this paper we introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological segmentation and syntactic parsing tasks at once. Our experiments on Hebrew, a rich and highly ambiguous MRL, demonstrate state-of-the-art performance on parsing, tagging and segmentation of the Hebrew section of UD, using a single model. This proposed architecture is LLM-based and language agnostic, providing a solid foundation for MRLs to obtain further performance improvements and bridge the gap with other languages.  ( 2 min )
    DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR
    arXiv:2306.08175v2 Announce Type: replace-cross Abstract: Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full and limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by a relative 25.0% word error rate, with a negligible latency impact due to the additional context embeddings.  ( 2 min )
    Convergence of policy gradient methods for finite-horizon exploratory linear-quadratic control problems
    arXiv:2211.00617v3 Announce Type: replace-cross Abstract: We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. Contrary to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a-priori bound, and converge globally to the optimal policy with a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis, and achieves a robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.  ( 2 min )
    LiMAML: Personalization of Deep Recommender Models via Meta Learning
    arXiv:2403.00803v1 Announce Type: cross Abstract: In the realm of recommender systems, the ubiquitous adoption of deep neural networks has emerged as a dominant paradigm for modeling diverse business objectives. As user bases continue to expand, the necessity of personalization and frequent model updates have assumed paramount significance to ensure the delivery of relevant and refreshed experiences to a diverse array of members. In this work, we introduce an innovative meta-learning solution tailored to the personalization of models for individual members and other entities, coupled with the frequent updates based on the latest user interaction signals. Specifically, we leverage the Model-Agnostic Meta Learning (MAML) algorithm to adapt per-task sub-networks using recent user interaction data. Given the near infeasibility of productionizing original MAML-based models in online recommendation systems, we propose an efficient strategy to operationalize meta-learned sub-networks in production, which involves transforming them into fixed-sized vectors, termed meta embeddings, thereby enabling the seamless deployment of models with hundreds of billions of parameters for online serving. Through extensive experimentation on production data drawn from various applications at LinkedIn, we demonstrate that the proposed solution consistently outperforms the baseline models of those applications, including strong baselines such as using wide-and-deep ID based personalization approach. Our approach has enabled the deployment of a range of highly personalized AI models across diverse LinkedIn applications, leading to substantial improvements in business metrics as well as refreshed experience for our members.  ( 3 min )
    Unraveling the Single Tangent Space Fallacy: An Analysis and Clarification for Applying Riemannian Geometry in Robot Learning
    arXiv:2310.07902v2 Announce Type: replace-cross Abstract: In the realm of robotics, numerous downstream robotics tasks leverage machine learning methods for processing, modeling, or synthesizing data. Often, this data comprises variables that inherently carry geometric constraints, such as the unit-norm condition of quaternions representing rigid-body orientations or the positive definiteness of stiffness and manipulability ellipsoids. Handling such geometric constraints effectively requires the incorporation of tools from differential geometry into the formulation of machine learning methods. In this context, Riemannian manifolds emerge as a powerful mathematical framework to handle such geometric constraints. Nevertheless, their recent adoption in robot learning has been largely characterized by a mathematically-flawed simplification, hereinafter referred to as the "single tangent space fallacy". This approach involves merely projecting the data of interest onto a single tangent (Euclidean) space, over which an off-the-shelf learning algorithm is applied. This paper provides a theoretical elucidation of various misconceptions surrounding this approach and offers experimental evidence of its shortcomings. Finally, it presents valuable insights to promote best practices when employing Riemannian geometry within robot learning applications.  ( 2 min )
    AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
    arXiv:2401.10652v2 Announce Type: replace-cross Abstract: Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for their practical serving. While existing methods mainly address parameter memory, the importance of activation memory has been overlooked. Especially for long input sequences, activation memory is expected to experience a significant exponential growth as the length of sequences increases. In this approach, we propose AutoChunk, an automatic and adaptive compiler system that efficiently reduces activation memory for long sequence inference by chunk strategies. The proposed system generates chunk plans by optimizing through multiple stages. In each stage, the chunk search pass explores all possible chunk candidates and the chunk selection pass identifies the optimal one. At runtime, AutoChunk employs code generation to automatically apply chunk strategies. The experiments demonstrate that AutoChunk can reduce over 80\% of activation memory while maintaining speed loss within 10%, extend max sequence length by 3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.  ( 2 min )
    Emergence of Latent Binary Encoding in Deep Neural Network Classifiers
    arXiv:2310.08224v3 Announce Type: replace Abstract: We investigate the emergence of binary encoding within the latent space of deep-neural-network classifiers. Such binary encoding is induced by the integration of a linear penultimate layer, which employs during training a loss function specifically designed to compress the latent representations. As a result of a trade-off between compression and information retention, the network learns to assume only one of two possible values for each dimension in the latent space. The binary encoding is provoked by the collapse of all representations of the same class to the same point, which corresponds to the vertex of a hypercube, thereby creating the encoding. We demonstrate that the emergence of binary encoding significantly enhances robustness, reliability and accuracy of the network.  ( 2 min )
    Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design
    arXiv:2309.13957v2 Announce Type: replace-cross Abstract: Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high reward molecules and faster, given a fixed oracle budget. Beam Enumeration shows that improvements to explainability and sample efficiency for molecular design can be made synergistic.  ( 2 min )
    MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models
    arXiv:2308.09729v5 Announce Type: replace-cross Abstract: Large language models (LLMs) have achieved remarkable performance in natural language understanding and generation tasks. However, they often suffer from limitations such as difficulty in incorporating new knowledge, generating hallucinations, and explaining their reasoning process. To address these challenges, we propose a novel prompting pipeline, named \method, that leverages knowledge graphs (KGs) to enhance LLMs' inference and transparency. Our method enables LLMs to comprehend KG inputs and infer with a combination of implicit and external knowledge. Moreover, our method elicits the mind map of LLMs, which reveals their reasoning pathways based on the ontology of knowledge. We evaluate our method on diverse question \& answering tasks, especially in medical domains, and show significant improvements over baselines. We also introduce a new hallucination evaluation benchmark and analyze the effects of different components of our method. Our results demonstrate the effectiveness and robustness of our method in merging knowledge from LLMs and KGs for combined inference. To reproduce our results and extend the framework further, we make our codebase available at https://github.com/wyl-willing/MindMap.  ( 3 min )
    Knowledge Distillation from Language-Oriented to Emergent Communication for Multi-Agent Remote Control
    arXiv:2401.12624v2 Announce Type: replace-cross Abstract: In this work, we compare emergent communication (EC) built upon multi-agent deep reinforcement learning (MADRL) and language-oriented semantic communication (LSC) empowered by a pre-trained large language model (LLM) using human language. In a multi-agent remote navigation task, with multimodal input data comprising location and channel maps, it is shown that EC incurs high training cost and struggles when using multimodal data, whereas LSC yields high inference computing cost due to the LLM's large size. To address their respective bottlenecks, we propose a novel framework of language-guided EC (LEC) by guiding the EC training using LSC via knowledge distillation (KD). Simulations corroborate that LEC achieves faster travel time while avoiding areas with poor channel conditions, as well as speeding up the MADRL training convergence by up to 61.8% compared to EC.  ( 2 min )
    Solving the Discretised Multiphase Flow Equations with Interface Capturing on Structured Grids Using Machine Learning Libraries
    arXiv:2401.06755v2 Announce Type: replace-cross Abstract: This paper solves the discretised multiphase flow equations using tools and methods from machine-learning libraries. The idea comes from the observation that convolutional layers can be used to express a discretisation as a neural network whose weights are determined by the numerical method, rather than by training, and hence, we refer to this approach as Neural Networks for PDEs (NN4PDEs). To solve the discretised multiphase flow equations, a multigrid solver is implemented through a convolutional neural network with a U-Net architecture. Immiscible two-phase flow is modelled by the 3D incompressible Navier-Stokes equations with surface tension and advection of a volume fraction field, which describes the interface between the fluids. A new compressive algebraic volume-of-fluids method is introduced, based on a residual formulation using Petrov-Galerkin for accuracy and designed with NN4PDEs in mind. High-order finite-element based schemes are chosen to model a collapsing water column and a rising bubble. Results compare well with experimental data and other numerical results from the literature, demonstrating that, for the first time, finite element discretisations of multiphase flows can be solved using an approach based on (untrained) convolutional neural networks. A benefit of expressing numerical discretisations as neural networks is that the code can run, without modification, on CPUs, GPUs or the latest accelerators designed especially to run AI codes.  ( 3 min )
    Layerwise complexity-matched learning yields an improved model of cortical area V2
    arXiv:2312.11436v2 Announce Type: replace-cross Abstract: Human ability to recognize complex visual patterns arises through transformations performed by successive areas in the ventral visual cortex. Deep neural networks trained end-to-end for object recognition approach human capabilities, and offer the best descriptions to date of neural responses in the late stages of the hierarchy. But these networks provide a poor account of the early stages, compared to traditional hand-engineered models, or models optimized for coding efficiency or prediction. Moreover, the gradient backpropagation used in end-to-end learning is generally considered to be biologically implausible. Here, we overcome both of these limitations by developing a bottom-up self-supervised training methodology that operates independently on successive layers. Specifically, we maximize feature similarity between pairs of locally-deformed natural image patches, while decorrelating features across patches sampled from other images. Crucially, the deformation amplitudes are adjusted proportionally to receptive field sizes in each layer, thus matching the task complexity to the capacity at each stage of processing. In comparison with architecture-matched versions of previous models, we demonstrate that our layerwise complexity-matched learning (LCL) formulation produces a two-stage model (LCL-V2) that is better aligned with selectivity properties and neural activity in primate area V2. We demonstrate that the complexity-matched learning paradigm is critical for the emergence of the improved biological alignment. Finally, when the two-stage model is used as a fixed front-end for a deep network trained to perform object recognition, the resultant model (LCL-V2Net) is significantly better than standard end-to-end self-supervised, supervised, and adversarially-trained models in terms of generalization to out-of-distribution tasks and alignment with human behavior.  ( 3 min )
    Agent with Warm Start and Active Termination for Plane Localization in 3D Ultrasound
    arXiv:1910.04331v2 Announce Type: replace-cross Abstract: Standard plane localization is crucial for ultrasound (US) diagnosis. In prenatal US, dozens of standard planes are manually acquired with a 2D probe. It is time-consuming and operator-dependent. In comparison, 3D US containing multiple standard planes in one shot has the inherent advantages of less user-dependency and more efficiency. However, manual plane localization in US volume is challenging due to the huge search space and large fetal posture variation. In this study, we propose a novel reinforcement learning (RL) framework to automatically localize fetal brain standard planes in 3D US. Our contribution is two-fold. First, we equip the RL framework with a landmark-aware alignment module to provide warm start and strong spatial bounds for the agent actions, thus ensuring its effectiveness. Second, instead of passively and empirically terminating the agent inference, we propose a recurrent neural network based strategy for active termination of the agent's interaction procedure. This improves both the accuracy and efficiency of the localization system. Extensively validated on our in-house large dataset, our approach achieves the accuracy of 3.4mm/9.6{\deg} and 2.7mm/9.1{\deg} for the transcerebellar and transthalamic plane localization, respectively. Ourproposed RL framework is general and has the potential to improve the efficiency and standardization of US scanning.  ( 3 min )
    Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility
    arXiv:2401.13782v2 Announce Type: replace-cross Abstract: As the number of accepted papers at AI and ML conferences reaches into the thousands, it has become unclear how researchers access and read research publications. In this paper, we investigate the role of social media influencers in enhancing the visibility of machine learning research, particularly the citation counts of papers they share. We have compiled a comprehensive dataset of over 8,000 papers, spanning tweets from December 2018 to October 2023, alongside controls precisely matched by 9 key covariates. Our statistical and causal inference analysis reveals a significant increase in citations for papers endorsed by these influencers, with median citation counts 2-3 times higher than those of the control group. Additionally, the study delves into the geographic, gender, and institutional diversity of highlighted authors. Given these findings, we advocate for a responsible approach to curation, encouraging influencers to uphold the journalistic standard that includes showcasing diverse research topics, authors, and institutions.  ( 2 min )
    Prediction of Cellular Identities from Trajectory and Cell Fate Information
    arXiv:2401.06182v2 Announce Type: replace-cross Abstract: Determining cell identities in imaging sequences is an important yet challenging task. The conventional method for cell identification is via cell tracking, which is complex and can be time-consuming. In this study, we propose an innovative approach to cell identification during early $\textit{C. elegans}$ embryogenesis using machine learning. Cell identification during $\textit{C. elegans}$ embryogenesis would provide insights into neural development with implications for higher organisms including humans. We employed random forest, MLP, and LSTM models, and tested cell classification accuracy on 3D time-lapse confocal datasets spanning the first 4 hours of embryogenesis. By leveraging a small number of spatial-temporal features of individual cells, including cell trajectory and cell fate information, our models achieve an accuracy of over 91%, even with limited data. We also determine the most important feature contributions and can interpret these features in the context of biological knowledge. Our research demonstrates the success of predicting cell identities in time-lapse imaging sequences directly from simple spatio-temporal features.  ( 2 min )
    CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping
    arXiv:2310.07855v2 Announce Type: replace-cross Abstract: Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo.  ( 2 min )
    Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection
    arXiv:2312.15068v2 Announce Type: replace-cross Abstract: One goal of technical online communities is to help developers find the right answer in one place. A single question can be asked in different ways with different wordings, leading to the existence of duplicate posts on technical forums. The question of how to discover and link duplicate posts has garnered the attention of both developer communities and researchers. For example, Stack Overflow adopts a voting-based mechanism to mark and close duplicate posts. However, addressing these constantly emerging duplicate posts in a timely manner continues to pose challenges. Therefore, various approaches have been proposed to detect duplicate posts on technical forum posts automatically. The existing methods suffer from limitations either due to their reliance on handcrafted similarity metrics which can not sufficiently capture the semantics of posts, or their lack of supervision to improve the performance. Additionally, the efficiency of these methods is hindered by their dependence on pair-wise feature generation, which can be impractical for large amount of data. In this work, we attempt to employ and refine the GPT-3 embeddings for the duplicate detection task. We assume that the GPT-3 embeddings can accurately represent the semantics of the posts. In addition, by training a Siamese-based network based on the GPT-3 embeddings, we obtain a latent embedding that accurately captures the duplicate relation in technical forum posts. Our experiment on a benchmark dataset confirms the effectiveness of our approach and demonstrates superior performance compared to baseline methods. When applied to the dataset we constructed with a recent Stack Overflow dump, our approach attains a Top-1, Top-5, and Top-30 accuracy of 23.1%, 43.9%, and 68.9%, respectively. With a manual study, we confirm our approach's potential of finding unlabelled duplicates on technical forums.  ( 3 min )
    Advocating for the Silent: Enhancing Federated Generalization for Non-Participating Clients
    arXiv:2310.07171v4 Announce Type: replace Abstract: Federated Learning (FL) has surged in prominence due to its capability of collaborative model training without direct data sharing. However, the vast disparity in local data distributions among clients, often termed the Non-Independent Identically Distributed (Non-IID) challenge, poses a significant hurdle to FL's generalization efficacy. The scenario becomes even more complex when not all clients participate in the training process, a common occurrence due to unstable network connections or limited computational capacities. This can greatly complicate the assessment of the trained models' generalization abilities. While a plethora of recent studies has centered on the generalization gap pertaining to unseen data from participating clients with diverse distributions, the distinction between the training distributions of participating clients and the testing distributions of non-participating ones has been largely overlooked. In response, our paper unveils an information-theoretic generalization framework for FL. Specifically, it quantifies generalization errors by evaluating the information entropy of local distributions and discerning discrepancies across these distributions. Inspired by our deduced generalization bounds, we introduce a weighted aggregation approach and a duo of client selection strategies. These innovations are designed to strengthen FL's ability to generalize and thus ensure that trained models perform better on non-participating clients by incorporating a more diverse range of client data distributions. Our extensive empirical evaluations reaffirm the potency of our proposed methods, aligning seamlessly with our theoretical construct.  ( 3 min )
    Few-shot Personalized Saliency Prediction Based on Inter-personnel Gaze Patterns
    arXiv:2307.02799v3 Announce Type: replace-cross Abstract: This paper presents few-shot personalized saliency prediction based on inter-personnel gaze patterns. In contrast to general saliency maps, personalized saliecny maps (PSMs) have been great potential since PSMs indicate the person-specific visual attention useful for obtaining individual visual preferences. The PSM prediction is needed for acquiring the PSMs for unseen images, but its prediction is still a challenging task due to the complexity of individual gaze patterns. Moreover, the eye-tracking data obtained from each person is necessary to construct and predict PSMs, but it is difficult to acquire the massive amounts of such data. One solution for realizing PSM prediction from the limited amount of data is the effective use of eye-tracking data obtained from other persons. To efficiently treat the PSMs of other persons, this paper focuses on the selection of images to acquire eye-tracking data and the preservation of structural information of PSMs of other persons. In the proposed method, such images are selected such that they bring more diverse gaze patterns to persons, and the structural information is preserved by adopting the tensor-based regression method. Experimental results demonstrate that the above two points are beneficial for the few-shot PSM prediction.  ( 3 min )
    A Post-Training Approach for Mitigating Overfitting in Quantum Convolutional Neural Networks
    arXiv:2309.01829v2 Announce Type: replace-cross Abstract: Quantum convolutional neural network (QCNN), an early application for quantum computers in the NISQ era, has been consistently proven successful as a machine learning (ML) algorithm for several tasks with significant accuracy. Derived from its classical counterpart, QCNN is prone to overfitting. Overfitting is a typical shortcoming of ML models that are trained too closely to the availed training dataset and perform relatively poorly on unseen datasets for a similar problem. In this work we study post-training approaches for mitigating overfitting in QCNNs. We find that a straightforward adaptation of a classical post-training method, known as neuron dropout, to the quantum setting leads to a significant and undesirable consequence: a substantial decrease in success probability of the QCNN. We argue that this effect exposes the crucial role of entanglement in QCNNs and the vulnerability of QCNNs to entanglement loss. Hence, we propose a parameter adaptation method as an alternative method. Our method is computationally efficient and is found to successfully handle overfitting in the test cases.  ( 2 min )
    Enhancing Digital Health Services: A Machine Learning Approach to Personalized Exercise Goal Setting
    arXiv:2204.00961v3 Announce Type: replace Abstract: The utilization of digital health has increased recently, and these services provide extensive guidance to encourage users to exercise frequently by setting daily exercise goals to promote a healthy lifestyle. These comprehensive guides evolved from the consideration of various personalized behavioral factors. Nevertheless, existing approaches frequently neglect the users dynamic behavior and the changing in their health conditions. This study aims to fill this gap by developing a machine learning algorithm that dynamically updates auto-suggestion exercise goals using retrospective data and realistic behavior trajectory. We conducted a methodological study by designing a deep reinforcement learning algorithm to evaluate exercise performance, considering fitness-fatigue effects. The deep reinforcement learning algorithm combines deep learning techniques to analyse time series data and infer user exercise behavior. In addition, we use the asynchronous advantage actor-critic algorithm for reinforcement learning to determine the optimal exercise intensity through exploration and exploitation. The personalized exercise data and biometric data used in this study were collected from publicly available datasets, encompassing walking, sports logs, and running. In our study, we conducted The statistical analyses/inferential tests to compare the effectiveness of machine learning approach in exercise goal setting across different exercise goal setting strategies.  ( 3 min )
    Forecasting Tropical Cyclones with Cascaded Diffusion Models
    arXiv:2310.01690v4 Announce Type: replace-cross Abstract: As tropical cyclones become more intense due to climate change, the rise of Al-based modelling provides a more affordable and accessible approach compared to traditional methods based on mathematical models. This work leverages generative diffusion models to forecast cyclone trajectories and precipitation patterns by integrating satellite imaging, remote sensing, and atmospheric data. It employs a cascaded approach that incorporates three main tasks: forecasting, super-resolution, and precipitation modelling. The training dataset includes 51 cyclones from six major tropical cyclone basins from January 2019 - March 2023. Experiments demonstrate that the final forecasts from the cascaded models show accurate predictions up to a 36-hour rollout, with excellent Structural Similarity (SSIM) and Peak-To-Noise Ratio (PSNR) values exceeding 0.5 and 20 dB, respectively, for all three tasks. The 36-hour forecasts can be produced in as little as 30 mins on a single Nvidia A30/RTX 2080 Ti. This work also highlights the promising efficiency of Al methods such as diffusion models for high-performance needs in weather forecasting, such as tropical cyclone forecasting, while remaining computationally affordable, making them ideal for highly vulnerable regions with critical forecasting needs and financial limitations. Code accessible at \url{https://github.com/nathzi1505/forecast-diffmodels}.  ( 3 min )
    Optimising Distributions with Natural Gradient Surrogates
    arXiv:2310.11837v2 Announce Type: replace-cross Abstract: Natural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient has a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.  ( 2 min )
    Optimizing Modular Robot Composition: A Lexicographic Genetic Algorithm Approach
    arXiv:2309.08399v2 Announce Type: replace-cross Abstract: Industrial robots are designed as general-purpose hardware with limited ability to adapt to changing task requirements or environments. Modular robots, on the other hand, offer flexibility and can be easily customized to suit diverse needs. The morphology, i.e., the form and structure of a robot, significantly impacts the primary performance metrics acquisition cost, cycle time, and energy efficiency. However, identifying an optimal module composition for a specific task remains an open problem, presenting a substantial hurdle in developing task-tailored modular robots. Previous approaches either lack adequate exploration of the design space or the possibility to adapt to complex tasks. We propose combining a genetic algorithm with a lexicographic evaluation of solution candidates to overcome this problem and navigate search spaces exceeding those in prior work by magnitudes in the number of possible compositions. We demonstrate that our approach outperforms a state-of-the-art baseline and is able to synthesize modular robots for industrial tasks in cluttered environments.  ( 2 min )
    $t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence
    arXiv:2312.01133v2 Announce Type: replace-cross Abstract: The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's t-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing with $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.  ( 2 min )
    KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods
    arXiv:2308.11863v3 Announce Type: replace-cross Abstract: Despite recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in five successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.6% WER on Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.  ( 2 min )
    Neural Implicit Morphing of Face Images
    arXiv:2308.13888v2 Announce Type: replace-cross Abstract: Face morphing is a problem in computer graphics with numerous artistic and forensic applications. It is challenging due to variations in pose, lighting, gender, and ethnicity. This task consists of a warping for feature alignment and a blending for a seamless transition between the warped images. We propose to leverage coord-based neural networks to represent such warpings and blendings of face images. During training, we exploit the smoothness and flexibility of such networks by combining energy functionals employed in classical approaches without discretizations. Additionally, our method is time-dependent, allowing a continuous warping/blending of the images. During morphing inference, we need both direct and inverse transformations of the time-dependent warping. The first (second) is responsible for warping the target (source) image into the source (target) image. Our neural warping stores those maps in a single network dismissing the need for inverting them. The results of our experiments indicate that our method is competitive with both classical and generative models under the lens of image quality and face-morphing detectors. Aesthetically, the resulting images present a seamless blending of diverse faces not yet usual in the literature.  ( 2 min )
    Hodge-Compositional Edge Gaussian Processes
    arXiv:2310.19450v3 Announce Type: replace-cross Abstract: We propose principled Gaussian processes (GPs) for modeling functions defined over the edge set of a simplicial 2-complex, a structure similar to a graph in which edges may form triangular faces. This approach is intended for learning flow-type data on networks where edge flows can be characterized by the discrete divergence and curl. Drawing upon the Hodge decomposition, we first develop classes of divergence-free and curl-free edge GPs, suitable for various applications. We then combine them to create \emph{Hodge-compositional edge GPs} that are expressive enough to represent any edge function. These GPs facilitate direct and independent learning for the different Hodge components of edge functions, enabling us to capture their relevance during hyperparameter optimization. To highlight their practical potential, we apply them for flow data inference in currency exchange, ocean currents and water supply networks, comparing them to alternative models.  ( 2 min )
    Semiparametric Efficient Inference in Adaptive Experiments
    arXiv:2311.18274v3 Announce Type: replace-cross Abstract: We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control.  ( 2 min )
    Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization
    arXiv:2310.03234v4 Announce Type: replace-cross Abstract: This paper investigates new families of compositional optimization problems, called $\underline{\bf n}$on-$\underline{\bf s}$mooth $\underline{\bf w}$eakly-$\underline{\bf c}$onvex $\underline{\bf f}$inite-sum $\underline{\bf c}$oupled $\underline{\bf c}$ompositional $\underline{\bf o}$ptimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly-convex. We analyze a single-loop algorithm and establish its complexity for finding an $\epsilon$-stationary point of the Moreau envelop of the objective function. Additionally, we also extend the algorithm to solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.  ( 2 min )
    Pure Differential Privacy for Functional Summaries via a Laplace-like Process
    arXiv:2309.00125v2 Announce Type: replace-cross Abstract: Many existing mechanisms to achieve differential privacy (DP) on infinite-dimensional functional summaries often involve embedding these summaries into finite-dimensional subspaces and applying traditional DP techniques. Such mechanisms generally treat each dimension uniformly and struggle with complex, structured summaries. This work introduces a novel mechanism for DP functional summary release: the Independent Component Laplace Process (ICLP) mechanism. This mechanism treats the summaries of interest as truly infinite-dimensional objects, thereby addressing several limitations of existing mechanisms. We establish the feasibility of the proposed mechanism in multiple function spaces. Several statistical estimation problems are considered, and we demonstrate one can enhance the utility of sanitized summaries by oversmoothing their non-private counterpart. Numerical experiments on synthetic and real datasets demonstrate the efficacy of the proposed mechanism.  ( 2 min )
    Robust Statistical Comparison of Random Variables with Locally Varying Scale of Measurement
    arXiv:2306.12803v2 Announce Type: replace-cross Abstract: Spaces with locally varying scale of measurement, like multidimensional structures with differently scaled dimensions, are pretty common in statistics and machine learning. Nevertheless, it is still understood as an open question how to exploit the entire information encoded in them properly. We address this problem by considering an order based on (sets of) expectations of random variables mapping into such non-standard spaces. This order contains stochastic dominance and expectation order as extreme cases when no, or respectively perfect, cardinal structure is given. We derive a (regularized) statistical test for our proposed generalized stochastic dominance (GSD) order, operationalize it by linear optimization, and robustify it by imprecise probability models. Our findings are illustrated with data from multidimensional poverty measurement, finance, and medicine.  ( 2 min )
    Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming
    arXiv:2305.18436v3 Announce Type: replace-cross Abstract: $K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations have recently been proposed for solving the $K$-means optimization problem that enjoy strong statistical optimality guarantees, but the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but without a solid statistical underpinning nor rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a nonnegative low-rank restriction of the SDP relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors compared to the existing state-of-the-art.  ( 2 min )
    Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
    arXiv:2306.08018v5 Announce Type: replace-cross Abstract: Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.  ( 2 min )
    Multi-Objective Genetic Algorithm for Multi-View Feature Selection
    arXiv:2305.18352v2 Announce Type: replace-cross Abstract: Multi-view datasets offer diverse forms of data that can enhance prediction models by providing complementary information. However, the use of multi-view data leads to an increase in high-dimensional data, which poses significant challenges for the prediction models that can lead to poor generalization. Therefore, relevant feature selection from multi-view datasets is important as it not only addresses the poor generalization but also enhances the interpretability of the models. Despite the success of traditional feature selection methods, they have limitations in leveraging intrinsic information across modalities, lacking generalizability, and being tailored to specific classification tasks. We propose a novel genetic algorithm strategy to overcome these limitations of traditional feature selection methods for multi-view data. Our proposed approach, called the multi-view multi-objective feature selection genetic algorithm (MMFS-GA), simultaneously selects the optimal subset of features within a view and between views under a unified framework. The MMFS-GA framework demonstrates superior performance and interpretability for feature selection on multi-view datasets in both binary and multiclass classification tasks. The results of our evaluations on three benchmark datasets, including synthetic and real data, show improvement over the best baseline methods. This work provides a promising solution for multi-view feature selection and opens up new possibilities for further research in multi-view datasets.  ( 2 min )
    On Neural Networks as Infinite Tree-Structured Probabilistic Graphical Models
    arXiv:2305.17583v3 Announce Type: replace-cross Abstract: Deep neural networks (DNNs) lack the precise semantics and definitive probabilistic interpretation of probabilistic graphical models (PGMs). In this paper, we propose an innovative solution by constructing infinite tree-structured PGMs that correspond exactly to neural networks. Our research reveals that DNNs, during forward propagation, indeed perform approximations of PGM inference that are precise in this alternative PGM structure. Not only does our research complement existing studies that describe neural networks as kernel machines or infinite-sized Gaussian processes, it also elucidates a more direct approximation that DNNs make to exact inference in PGMs. Potential benefits include improved pedagogy and interpretation of DNNs, and algorithms that can merge the strengths of PGMs and DNNs.  ( 2 min )
    A stable deep adversarial learning approach for geological facies generation
    arXiv:2305.13318v3 Announce Type: replace-cross Abstract: The simulation of geological facies in an unobservable volume is essential in various geoscience applications. Given the complexity of the problem, deep generative learning is a promising approach to overcome the limitations of traditional geostatistical simulation models, in particular their lack of physical realism. This research aims to investigate the application of generative adversarial networks and deep variational inference for conditionally simulating meandering channels in underground volumes. In this paper, we review the generative deep learning approaches, in particular the adversarial ones and the stabilization techniques that aim to facilitate their training. The proposed approach is tested on 2D and 3D simulations generated by the stochastic process-based model Flumy. Morphological metrics are utilized to compare our proposed method with earlier iterations of generative adversarial networks. The results indicate that by utilizing recent stabilization techniques, generative adversarial networks can efficiently sample from target data distributions. Moreover, we demonstrate the ability to simulate conditioned simulations through the latent variable model property of the proposed approach.  ( 2 min )
    Unpaired Image-to-Image Translation via Neural Schr\"odinger Bridge
    arXiv:2305.15086v3 Announce Type: replace-cross Abstract: Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. While diffusion models have achieved remarkable progress, they have limitations in unpaired image-to-image (I2I) translation tasks due to the Gaussian prior assumption. Schr\"{o}dinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, have risen as an attractive solution to this problem. Yet, to our best knowledge, none of SB models so far have been successful at unpaired translation between high-resolution images. In this work, we propose Unpaired Neural Schr\"{o}dinger Bridge (UNSB), which expresses the SB problem as a sequence of adversarial learning problems. This allows us to incorporate advanced discriminators and regularization to learn a SB between unpaired data. We show that UNSB is scalable and successfully solves various unpaired I2I translation tasks. Code: \url{https://github.com/cyclomon/UNSB}  ( 2 min )
    Approximately optimal domain adaptation with Fisher's Linear Discriminant
    arXiv:2302.14186v3 Announce Type: replace-cross Abstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propose a computable approximation, and study the effect of various parameter settings on the relative risks between the optimal hypothesis, hypothesis i), and hypothesis ii). We demonstrate the effectiveness of the proposed optimal classifier in the context of EEG- and ECG-based classification settings and argue that the optimal classifier can be computed without access to direct information from any of the individual source tasks. We conclude by discussing further applications, limitations, and possible future directions.  ( 2 min )
    A Global and Patch-wise Contrastive Loss for Accurate Automated Exudate Detection
    arXiv:2302.11517v2 Announce Type: replace-cross Abstract: Diabetic retinopathy (DR) is a leading global cause of blindness. Early detection of hard exudates plays a crucial role in identifying DR, which aids in treating diabetes and preventing vision loss. However, the unique characteristics of hard exudates, ranging from their inconsistent shapes to indistinct boundaries, pose significant challenges to existing segmentation techniques. To address these issues, we present a novel supervised contrastive learning framework to optimize hard exudate segmentation. Specifically, we introduce a patch-wise density contrasting scheme to distinguish between areas with varying lesion concentrations, and therefore improve the model's proficiency in segmenting small lesions. To handle the ambiguous boundaries, we develop a discriminative edge inspection module to dynamically analyze the pixels that lie around the boundaries and accurately delineate the exudates. Upon evaluation using the IDRiD dataset and comparison with state-of-the-art frameworks, our method exhibits its effectiveness and shows potential for computer-assisted hard exudate detection. The code to replicate experiments is available at github.com/wetang7/HECL/.  ( 2 min )
    Retrosynthetic Planning with Dual Value Networks
    arXiv:2301.13755v3 Announce Type: replace-cross Abstract: Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes. Specifically, we propose a novel online training algorithm, called Planning with Dual Value Networks (PDVN), which alternates between the planning phase and updating phase. In PDVN, we construct two separate value networks to predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% molecules for RetroGraph). Additionally, PDVN helps find shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph). Our code is available at \url{https://github.com/DiXue98/PDVN}.  ( 3 min )
    Online Learning under Budget and ROI Constraints via Weak Adaptivity
    arXiv:2302.01203v3 Announce Type: replace-cross Abstract: We study online learning problems in which a decision maker has to make a sequence of costly decisions, with the goal of maximizing their expected reward while adhering to budget and return-on-investment (ROI) constraints. Existing primal-dual algorithms designed for constrained online learning problems under adversarial inputs rely on two fundamental assumptions. First, the decision maker must know beforehand the value of parameters related to the degree of strict feasibility of the problem (i.e. Slater parameters). Second, a strictly feasible solution to the offline optimization problem must exist at each round. Both requirements are unrealistic for practical applications such as bidding in online ad auctions. In this paper, we show how such assumptions can be circumvented by endowing standard primal-dual templates with weakly adaptive regret minimizers. This results in a ``dual-balancing'' framework which ensures that dual variables stay sufficiently small, even in the absence of knowledge about Slater's parameter. We prove the first best-of-both-worlds no-regret guarantees which hold in absence of the two aforementioned assumptions, under stochastic and adversarial inputs. Finally, we show how to instantiate the framework to optimally bid in various mechanisms of practical relevance, such as first- and second-price auctions.  ( 2 min )
    Social-Aware Clustered Federated Learning with Customized Privacy Preservation
    arXiv:2212.13992v2 Announce Type: replace-cross Abstract: A key feature of federated learning (FL) is to preserve the data privacy of end users. However, there still exist potential privacy leakage in exchanging gradients under FL. As a result, recent research often explores the differential privacy (DP) approaches to add noises to the computing results to address privacy concerns with low overheads, which however degrade the model performance. In this paper, we strike the balance of data privacy and efficiency by utilizing the pervasive social connections between users. Specifically, we propose SCFL, a novel Social-aware Clustered Federated Learning scheme, where mutually trusted individuals can freely form a social cluster and aggregate their raw model updates (e.g., gradients) inside each cluster before uploading to the cloud for global aggregation. By mixing model updates in a social group, adversaries can only eavesdrop the social-layer combined results, but not the privacy of individuals. We unfold the design of SCFL in three steps.i) Stable social cluster formation. Considering users' heterogeneous training samples and data distributions, we formulate the optimal social cluster formation problem as a federation game and devise a fair revenue allocation mechanism to resist free-riders. ii) Differentiated trust-privacy mapping}. For the clusters with low mutual trust, we design a customizable privacy preservation mechanism to adaptively sanitize participants' model updates depending on social trust degrees. iii) Distributed convergence}. A distributed two-sided matching algorithm is devised to attain an optimized disjoint partition with Nash-stable convergence. Experiments on Facebook network and MNIST/CIFAR-10 datasets validate that our SCFL can effectively enhance learning utility, improve user payoff, and enforce customizable privacy protection.  ( 3 min )
    Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering
    arXiv:2301.12318v2 Announce Type: replace-cross Abstract: Most existing methods to detect backdoored machine learning (ML) models take one of the two approaches: trigger inversion (aka. reverse engineer) and weight analysis (aka. model diagnosis). In particular, the gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, Trojan Detection Challenge and backdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar to the backdoor attack. In this paper, we report the first attempt to answer this question by analyzing the change rate of the backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject the backdoor characterized by a low change rate around trigger-carrying inputs, which are easy to capture by gradient-based trigger inversion. In the meantime, we found that the low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called \textit{Gradient Shaping} (GRASP), which follows the opposite direction of adversarial training to reduce the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. Also, we provide a theoretic analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of the stealthy attacks against the backdoor detection methods based on weight analysis, as well as other backdoor mitigation methods without using detection.  ( 3 min )
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints
    arXiv:2212.04672v3 Announce Type: replace-cross Abstract: Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal-dual alternating proximal gradient (PDAPG) algorithm and a primal-dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantees for solving the nonconvex minimax problems with coupled linear constraints.  ( 2 min )
    Discovering Latent Knowledge in Language Models Without Supervision
    arXiv:2212.03827v2 Announce Type: replace-cross Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4\% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.  ( 2 min )
    FedTracker: Furnishing Ownership Verification and Traceability for Federated Learning Model
    arXiv:2211.07160v3 Announce Type: replace-cross Abstract: Federated learning (FL) is a distributed machine learning paradigm allowing multiple clients to collaboratively train a global model without sharing their local data. However, FL entails exposing the model to various participants. This poses a risk of unauthorized model distribution or resale by the malicious client, compromising the intellectual property rights of the FL group. To deter such misbehavior, it is essential to establish a mechanism for verifying the ownership of the model and as well tracing its origin to the leaker among the FL participants. In this paper, we present FedTracker, the first FL model protection framework that provides both ownership verification and traceability. FedTracker adopts a bi-level protection scheme consisting of global watermark mechanism and local fingerprint mechanism. The former authenticates the ownership of the global model, while the latter identifies which client the model is derived from. FedTracker leverages Continual Learning (CL) principles to embed the watermark in a way that preserves the utility of the FL model on both primitive task and watermark task. FedTracker also devises a novel metric to better discriminate different fingerprints. Experimental results show FedTracker is effective in ownership verification, traceability, and maintains good fidelity and robustness against various watermark removal attacks.  ( 3 min )
    Characterizing Graph Datasets for Node Classification: Homophily-Heterophily Dichotomy and Beyond
    arXiv:2209.06177v4 Announce Type: replace-cross Abstract: Homophily is a graph property describing the tendency of edges to connect similar nodes; the opposite is called heterophily. It is often believed that heterophilous graphs are challenging for standard message-passing graph neural networks (GNNs), and much effort has been put into developing efficient methods for this setting. However, there is no universally agreed-upon measure of homophily in the literature. In this work, we show that commonly used homophily measures have critical drawbacks preventing the comparison of homophily levels across different datasets. For this, we formalize desirable properties for a proper homophily measure and verify which measures satisfy which properties. In particular, we show that a measure that we call adjusted homophily satisfies more desirable properties than other popular homophily measures while being rarely used in graph machine learning literature. Then, we go beyond the homophily-heterophily dichotomy and propose a new characteristic that allows one to further distinguish different sorts of heterophily. The proposed label informativeness (LI) characterizes how much information a neighbor's label provides about a node's label. We prove that this measure satisfies important desirable properties. We also observe empirically that LI better agrees with GNN performance compared to homophily measures, which confirms that it is a useful characteristic of the graph structure.  ( 3 min )
    A Generative Shape Compositional Framework to Synthesise Populations of Virtual Chimaeras
    arXiv:2210.01607v2 Announce Type: replace-cross Abstract: Generating virtual populations of anatomy that capture sufficient variability while remaining plausible is essential for conducting in-silico trials of medical devices. However, not all anatomical shapes of interest are always available for each individual in a population. Hence, missing/partially-overlapping anatomical information is often available across individuals in a population. We introduce a generative shape model for complex anatomical structures, learnable from datasets of unpaired datasets. The proposed generative model can synthesise complete whole complex shape assemblies coined virtual chimaeras, as opposed to natural human chimaeras. We applied this framework to build virtual chimaeras from databases of whole-heart shape assemblies that each contribute samples for heart substructures. Specifically, we propose a generative shape compositional framework which comprises two components - a part-aware generative shape model which captures the variability in shape observed for each structure of interest in the training population; and a spatial composition network which assembles/composes the structures synthesised by the former into multi-part shape assemblies (viz. virtual chimaeras). We also propose a novel self supervised learning scheme that enables the spatial composition network to be trained with partially overlapping data and weak labels. We trained and validated our approach using shapes of cardiac structures derived from cardiac magnetic resonance images available in the UK Biobank. Our approach significantly outperforms a PCA-based shape model (trained with complete data) in terms of generalisability and specificity. This demonstrates the superiority of the proposed approach as the synthesised cardiac virtual populations are more plausible and capture a greater degree of variability in shape than those generated by the PCA-based shape model.  ( 3 min )
    Settling the Sample Complexity of Model-Based Offline Reinforcement Learning
    arXiv:2204.05275v3 Announce Type: replace-cross Abstract: This paper is concerned with offline reinforcement learning (RL), which learns using pre-collected data without further exploration. Effective offline RL would be able to accommodate distribution shift and limited data coverage. However, prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality, thus posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a finite-horizon (resp. $\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$ (resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution shift of data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL yields $\varepsilon$-accuracy with a sample complexity of \[ \begin{cases} \frac{H^{4}SC_{\text{clipped}}^{\star}}{\varepsilon^{2}} & (\text{finite-horizon MDPs}) \frac{SC_{\text{clipped}}^{\star}}{(1-\gamma)^{3}\varepsilon^{2}} & (\text{infinite-horizon MDPs}) \end{cases} \] up to log factor, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are ``pessimistic'' variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance reduction. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.  ( 3 min )
    FetusMap: Fetal Pose Estimation in 3D Ultrasound
    arXiv:1910.04935v2 Announce Type: replace-cross Abstract: The 3D ultrasound (US) entrance inspires a multitude of automated prenatal examinations. However, studies about the structuralized description of the whole fetus in 3D US are still rare. In this paper, we propose to estimate the 3D pose of fetus in US volumes to facilitate its quantitative analyses in global and local scales. Given the great challenges in 3D US, including the high volume dimension, poor image quality, symmetric ambiguity in anatomical structures and large variations of fetal pose, our contribution is three-fold. (i) This is the first work about 3D pose estimation of fetus in the literature. We aim to extract the skeleton of whole fetus and assign different segments/joints with correct torso/limb labels. (ii) We propose a self-supervised learning (SSL) framework to finetune the deep network to form visually plausible pose predictions. Specifically, we leverage the landmark-based registration to effectively encode case-adaptive anatomical priors and generate evolving label proxy for supervision. (iii) To enable our 3D network perceive better contextual cues with higher resolution input under limited computing resource, we further adopt the gradient check-pointing (GCP) strategy to save GPU memory and improve the prediction. Extensively validated on a large 3D US dataset, our method tackles varying fetal poses and achieves promising results. 3D pose estimation of fetus has potentials in serving as a map to provide navigation for many advanced studies.  ( 3 min )
    CMGAN: Conformer-based Metric GAN for Speech Enhancement
    arXiv:2203.15149v4 Announce Type: replace-cross Abstract: Recently, convolution-augmented transformer (Conformer) has achieved promising performance in automatic speech recognition (ASR) and time-domain speech enhancement (SE), as it can capture both local and global dependencies in the speech signal. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for SE in the time-frequency (TF) domain. In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information by modeling both time and frequency dependencies. The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech. In addition, a metric discriminator is employed to further improve the quality of the enhanced estimated speech by optimizing the generator with respect to a corresponding evaluation score. Quantitative analysis on Voice Bank+DEMAND dataset indicates the capability of CMGAN in outperforming various previous models with a margin, i.e., PESQ of 3.41 and SSNR of 11.10 dB.  ( 2 min )
    Deep Attentive Features for Prostate Segmentation in 3D Transrectal Ultrasound
    arXiv:1907.01743v2 Announce Type: replace-cross Abstract: Automatic prostate segmentation in transrectal ultrasound (TRUS) images is of essential importance for image-guided prostate interventions and treatment planning. However, developing such automatic solutions remains very challenging due to the missing/ambiguous boundary and inhomogeneous intensity distribution of the prostate in TRUS, as well as the large variability in prostate shapes. This paper develops a novel 3D deep neural network equipped with attention modules for better prostate segmentation in TRUS by fully exploiting the complementary information encoded in different layers of the convolutional neural network (CNN). Our attention module utilizes the attention mechanism to selectively leverage the multilevel features integrated from different layers to refine the features at each individual layer, suppressing the non-prostate noise at shallow layers of the CNN and increasing more prostate details into features at deep layers. Experimental results on challenging 3D TRUS volumes show that our method attains satisfactory segmentation performance. The proposed attention mechanism is a general strategy to aggregate multi-level deep features and has the potential to be used for other medical image segmentation tasks. The code is publicly available at https://github.com/wulalago/DAF3D.  ( 3 min )
    On Prompt-Driven Safeguarding for Large Language Models
    arXiv:2401.18018v2 Announce Type: replace Abstract: Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) from complying with queries that contain harmful intents. However, the working mechanisms of safety prompts have not been revealed yet, which hinders the potential for automatically optimizing them to improve LLM safety. To this end, we investigate the impact of safety prompts from the perspective of model representations. We find that in models' representation space, harmful and harmless queries can be largely distinguished, but this is not noticeably enhanced by safety prompts. Instead, the queries' representations are moved by safety prompts in similar directions where models become more prone to refusal (i.e., refusing to provide assistance) even when the queries are harmless. Inspired by these findings, we propose a method called DRO (Directed Representation Optimization) for automatic safety prompt optimization. It treats safety prompts as continuous, trainable embeddings and learns to move the representations of harmful/harmless queries along/opposite the direction in which the model's refusal probability increases. Experiments with eight LLMs on out-of-domain benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts and outperforms strong baselines, without compromising the general model capability.  ( 2 min )
    Deep Learning for Multi-Label Learning: A Comprehensive Survey
    arXiv:2401.16549v2 Announce Type: replace Abstract: Multi-label learning is a rapidly growing research area that aims to predict multiple labels from a single input data point. In the era of big data, tasks involving multi-label classification (MLC) or ranking present significant and intricate challenges, capturing considerable attention in diverse domains. Inherent difficulties in MLC include dealing with high-dimensional data, addressing label correlations, and handling partial labels, for which conventional methods prove ineffective. Recent years have witnessed a notable increase in adopting deep learning (DL) techniques to address these challenges more effectively in MLC. Notably, there is a burgeoning effort to harness the robust learning capabilities of DL for improved modelling of label dependencies and other challenges in MLC. However, it is noteworthy that comprehensive studies specifically dedicated to DL for multi-label learning are limited. Thus, this survey aims to thoroughly review recent progress in DL for multi-label learning, along with a summary of open research problems in MLC. The review consolidates existing research efforts in DL for MLC,including deep neural networks, transformers, autoencoders, and convolutional and recurrent architectures. Finally, the study presents a comparative analysis of the existing methods to provide insightful observations and stimulate future research directions in this domain.  ( 2 min )
    A Survey on Data Augmentation in Large Model Era
    arXiv:2401.15422v2 Announce Type: replace Abstract: Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence, garnering significant interest from both academic and industrial spheres. However, the training of these large models necessitates vast quantities of high-quality data, and with continuous updates to these models, the existing reservoir of high-quality data may soon be depleted. This challenge has catalyzed a surge in research focused on data augmentation methods. Leveraging large models, these data augmentation techniques have outperformed traditional approaches. This paper offers an exhaustive review of large model-driven data augmentation methods, adopting a comprehensive perspective. We begin by establishing a classification of relevant studies into three main categories: image augmentation, text augmentation, and paired data augmentation. Following this, we delve into various data post-processing techniques pertinent to large model-based data augmentation. Our discussion then expands to encompass the array of applications for these data augmentation methods within natural language processing, computer vision, and audio signal processing. We proceed to evaluate the successes and limitations of large model-based data augmentation across different scenarios. Concluding our review, we highlight prospective challenges and avenues for future exploration in the field of data augmentation. Our objective is to furnish researchers with critical insights, ultimately contributing to the advancement of more sophisticated large models. We consistently maintain the related open-source materials at: https://github.com/MLGroup-JLU/LLM-data-aug-survey.  ( 3 min )
    FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
    arXiv:2401.14112v2 Announce Type: replace Abstract: Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with irregular bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of float-point weights for various quantization bit-width. We integrate TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.  ( 2 min )
    On the generalization capacity of neural networks during generic multimodal reasoning
    arXiv:2401.15030v2 Announce Type: replace Abstract: The advent of the Transformer has led to the development of large language models (LLM), which appear to demonstrate human-like capabilities. To assess the generality of this class of models and a variety of other base neural network architectures to multimodal domains, we evaluated and compared their capacity for multimodal generalization. We introduce a multimodal question-answer benchmark to evaluate three specific types of out-of-distribution (OOD) generalization performance: distractor generalization (generalization in the presence of distractors), systematic compositional generalization (generalization to new task permutations), and productive compositional generalization (generalization to more complex tasks structures). We found that across model architectures (e.g., RNNs, Transformers, Perceivers, etc.), models with multiple attention layers, or models that leveraged cross-attention mechanisms between input domains, fared better. Our positive results demonstrate that for multimodal distractor and systematic generalization, either cross-modal attention or models with deeper attention layers are key architectural features required to integrate multimodal inputs. On the other hand, neither of these architectural features led to productive generalization, suggesting fundamental limitations of existing architectures for specific types of multimodal generalization. These results demonstrate the strengths and limitations of specific architectural components underlying modern neural models for multimodal reasoning. Finally, we provide Generic COG (gCOG), a configurable benchmark with several multimodal generalization splits, for future studies to explore.  ( 3 min )
    INCPrompt: Task-Aware incremental Prompting for Rehearsal-Free Class-incremental Learning
    arXiv:2401.11667v2 Announce Type: replace Abstract: This paper introduces INCPrompt, an innovative continual learning solution that effectively addresses catastrophic forgetting. INCPrompt's key innovation lies in its use of adaptive key-learner and task-aware prompts that capture task-relevant information. This unique combination encapsulates general knowledge across tasks and encodes task-specific knowledge. Our comprehensive evaluation across multiple continual learning benchmarks demonstrates INCPrompt's superiority over existing algorithms, showing its effectiveness in mitigating catastrophic forgetting while maintaining high performance. These results highlight the significant impact of task-aware incremental prompting on continual learning performance.  ( 2 min )
    PartIR: Composing SPMD Partitioning Strategies for Machine Learning
    arXiv:2401.11202v3 Announce Type: replace Abstract: Training of modern large neural networks (NN) requires a combination of parallelization strategies encompassing data, model, or optimizer sharding. When strategies increase in complexity, it becomes necessary for partitioning tools to be 1) expressive, allowing the composition of simpler strategies, and 2) predictable to estimate performance analytically. We present PartIR, our design for a NN partitioning system. PartIR is focused on an incremental approach to rewriting and is hardware-and-runtime agnostic. We present a simple but powerful API for composing sharding strategies and a simulator to validate them. The process is driven by high-level programmer-issued partitioning tactics, which can be both manual and automatic. Importantly, the tactics are specified separately from the model code, making them easy to change. We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance..  ( 2 min )
    T-PRIME: Transformer-based Protocol Identification for Machine-learning at the Edge
    arXiv:2401.04837v2 Announce Type: replace Abstract: Spectrum sharing allows different protocols of the same standard (e.g., 802.11 family) or different standards (e.g., LTE and DVB) to coexist in overlapping frequency bands. As this paradigm continues to spread, wireless systems must also evolve to identify active transmitters and unauthorized waveforms in real time under intentional distortion of preambles, extremely low signal-to-noise ratios and challenging channel conditions. We overcome limitations of correlation-based preamble matching methods in such conditions through the design of T-PRIME: a Transformer-based machine learning approach. T-PRIME learns the structural design of transmitted frames through its attention mechanism, looking at sequence patterns that go beyond the preamble alone. The paper makes three contributions: First, it compares Transformer models and demonstrates their superiority over traditional methods and state-of-the-art neural networks. Second, it rigorously analyzes T-PRIME's real-time feasibility on DeepWave's AIR-T platform. Third, it utilizes an extensive 66 GB dataset of over-the-air (OTA) WiFi transmissions for training, which is released along with the code for community use. Results reveal nearly perfect (i.e. $>98\%$) classification accuracy under simulated scenarios, showing $100\%$ detection improvement over legacy methods in low SNR ranges, $97\%$ classification accuracy for OTA single-protocol transmissions and up to $75\%$ double-protocol classification accuracy in interference scenarios.  ( 3 min )
    VI-PANN: Harnessing Transfer Learning and Uncertainty-Aware Variational Inference for Improved Generalization in Audio Pattern Recognition
    arXiv:2401.05531v2 Announce Type: replace Abstract: Transfer learning (TL) is an increasingly popular approach to training deep learning (DL) models that leverages the knowledge gained by training a foundation model on diverse, large-scale datasets for use on downstream tasks where less domain- or task-specific data is available. The literature is rich with TL techniques and applications; however, the bulk of the research makes use of deterministic DL models which are often uncalibrated and lack the ability to communicate a measure of epistemic (model) uncertainty in prediction. Unlike their deterministic counterparts, Bayesian DL (BDL) models are often well-calibrated, provide access to epistemic uncertainty for a prediction, and are capable of achieving competitive predictive performance. In this study, we propose variational inference pre-trained audio neural networks (VI-PANNs). VI-PANNs are a variational inference variant of the popular ResNet-54 architecture which are pre-trained on AudioSet, a large-scale audio event detection dataset. We evaluate the quality of the resulting uncertainty when transferring knowledge from VI-PANNs to other downstream acoustic classification tasks using the ESC-50, UrbanSound8K, and DCASE2013 datasets. We demonstrate, for the first time, that it is possible to transfer calibrated uncertainty information along with knowledge from upstream tasks to enhance a model's capability to perform downstream tasks.  ( 3 min )
    Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study
    arXiv:2401.04331v2 Announce Type: replace Abstract: In this work, we rigorously investigate the robustness of graph neural fractional-order differential equation (FDE) models. This framework extends beyond traditional graph neural (integer-order) ordinary differential equation (ODE) models by implementing the time-fractional Caputo derivative. Utilizing fractional calculus allows our model to consider long-term memory during the feature updating process, diverging from the memoryless Markovian updates seen in traditional graph neural ODE models. The superiority of graph neural FDE models over graph neural ODE models has been established in environments free from attacks or perturbations. While traditional graph neural ODE models have been verified to possess a degree of stability and resilience in the presence of adversarial attacks in existing literature, the robustness of graph neural FDE models, especially under adversarial conditions, remains largely unexplored. This paper undertakes a detailed assessment of the robustness of graph neural FDE models. We establish a theoretical foundation outlining the robustness characteristics of graph neural FDE models, highlighting that they maintain more stringent output perturbation bounds in the face of input and graph topology disturbances, compared to their integer-order counterparts. Our empirical evaluations further confirm the enhanced robustness of graph neural FDE models, highlighting their potential in adversarially robust applications.  ( 3 min )
    Hutchinson Trace Estimation for High-Dimensional and High-Order Physics-Informed Neural Networks
    arXiv:2312.14499v2 Announce Type: replace Abstract: Physics-Informed Neural Networks (PINNs) have proven effective in solving partial differential equations (PDEs), especially when some data are available by seamlessly blending data and physics. However, extending PINNs to high-dimensional and even high-order PDEs encounters significant challenges due to the computational cost associated with automatic differentiation in the residual loss. Herein, we address the limitations of PINNs in handling high-dimensional and high-order PDEs by introducing Hutchinson Trace Estimation (HTE). Starting with the second-order high-dimensional PDEs ubiquitous in scientific computing, HTE transforms the calculation of the entire Hessian matrix into a Hessian vector product (HVP). This approach alleviates the computational bottleneck via Taylor-mode automatic differentiation and significantly reduces memory consumption from the Hessian matrix to HVP. We further showcase HTE's convergence to the original PINN loss and its unbiased behavior under specific conditions. Comparisons with Stochastic Dimension Gradient Descent (SDGD) highlight the distinct advantages of HTE, particularly in scenarios with significant variance among dimensions. We further extend HTE to higher-order and higher-dimensional PDEs, specifically addressing the biharmonic equation. By employing tensor-vector products (TVP), HTE efficiently computes the colossal tensor associated with the fourth-order high-dimensional biharmonic equation, saving memory and enabling rapid computation. The effectiveness of HTE is illustrated through experimental setups, demonstrating comparable convergence rates with SDGD under memory and speed constraints. Additionally, HTE proves valuable in accelerating the Gradient-Enhanced PINN (gPINN) version as well as the Biharmonic equation. Overall, HTE opens up a new capability in scientific machine learning for tackling high-order and high-dimensional PDEs.  ( 3 min )
    A Bayesian Unification of Self-Supervised Clustering and Energy-Based Models
    arXiv:2401.00873v2 Announce Type: replace Abstract: Self-supervised learning is a popular and powerful method for utilizing large amounts of unlabeled data, for which a wide variety of training objectives have been proposed in the literature. In this study, we perform a Bayesian analysis of state-of-the-art self-supervised learning objectives, elucidating the underlying probabilistic graphical models in each class and presenting a standardized methodology for their derivation from first principles. The analysis also indicates a natural means of integrating self-supervised learning with likelihood-based generative models. We instantiate this concept within the realm of cluster-based self-supervised learning and energy models, introducing a novel lower bound which is proven to reliably penalize the most important failure modes. Furthermore, this newly proposed lower bound enables the training of a standard backbone architecture without the necessity for asymmetric elements such as stop gradients, momentum encoders, or specialized clustering layers - typically introduced to avoid learning trivial solutions. Our theoretical findings are substantiated through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, thus showing that our objective function allows to outperform existing self-supervised learning strategies in terms of clustering, generation and out-of-distribution detection performance by a wide margin. We also demonstrate that GEDI can be integrated into a neuro-symbolic framework to mitigate the reasoning shortcut problem and to learn higher quality symbolic representations thanks to the enhanced classification performance.  ( 3 min )
    Interpretable Fine-Tuning for Graph Neural Network Surrogate Models
    arXiv:2311.07548v2 Announce Type: replace Abstract: Data-driven surrogate modeling has surged in capability in recent years with the emergence of graph neural networks (GNNs), which can operate directly on mesh-based representations of data. The goal of this work is to introduce an interpretable fine-tuning strategy for GNNs, with application to unstructured mesh-based fluid dynamics modeling. The end result is an enhanced fine-tuned model that isolates regions in physical space, corresponding to sub-graphs, that are intrinsically linked to the forecasting task while retaining the predictive capability of the baseline. These structures, identified by the fine-tuned GNNs, are adaptively produced in the forward pass and serve as explainable links between the baseline model architecture, the optimization goal, and known problem-specific physics. Additionally, through a regularization procedure, the fine-tuned GNNs can also be used to identify, during inference, graph nodes that correspond to a majority of the anticipated forecasting error, adding a novel interpretable error-tagging capability to baseline models. Demonstrations are performed using unstructured flow field data sourced from flow over a backward-facing step at high Reynolds numbers.  ( 2 min )
    Neuro-GPT: Towards A Foundation Model for EEG
    arXiv:2311.03764v4 Announce Type: replace Abstract: To handle the scarcity and heterogeneity of electroencephalography (EEG) data for Brain-Computer Interface (BCI) tasks, and to harness the power of large publicly available data sets, we propose Neuro-GPT, a foundation model consisting of an EEG encoder and a GPT model. The foundation model is pre-trained on a large-scale data set using a self-supervised task that learns how to reconstruct masked EEG segments. We then fine-tune the model on a Motor Imagery Classification task to validate its performance in a low-data regime (9 subjects). Our experiments demonstrate that applying a foundation model can significantly improve classification performance compared to a model trained from scratch, which provides evidence for the generalizability of the foundation model and its ability to address challenges of data scarcity and heterogeneity in EEG. The code is publicly available at github.com/wenhui0206/NeuroGPT.  ( 2 min )
    On the Theory of Risk-Aware Agents: Bridging Actor-Critic and Economics
    arXiv:2310.19527v2 Announce Type: replace Abstract: Risk-aware Reinforcement Learning (RL) algorithms like SAC and TD3 were shown empirically to outperform their risk-neutral counterparts in a variety of continuous-action tasks. However, the theoretical basis for the pessimistic objectives these algorithms employ remains unestablished, raising questions about the specific class of policies they are implementing. In this work, we apply the expected utility hypothesis, a fundamental concept in economics, to illustrate that both risk-neutral and risk-aware RL goals can be interpreted through expected utility maximization using an exponential utility function. This approach reveals that risk-aware policies effectively maximize value certainty equivalent, aligning them with conventional decision theory principles. Furthermore, we propose Dual Actor-Critic (DAC). DAC is a risk-aware, model-free algorithm that features two distinct actor networks: a pessimistic actor for temporal-difference learning and an optimistic actor for exploration. Our evaluations of DAC across various locomotion and manipulation tasks demonstrate improvements in sample efficiency and final performance. Remarkably, DAC, while requiring significantly less computational resources, matches the performance of leading model-based methods in the complex dog and humanoid domains.  ( 2 min )
    DySurv: Dynamic Deep Learning Model for Survival Prediction in the ICU
    arXiv:2310.18681v2 Announce Type: replace Abstract: Survival analysis focuses on estimating time-to-event distributions which can help in dynamic risk prediction in healthcare. Extending beyond the classical Cox model, deep learning techniques have been developed which moved away from the constraining assumptions of proportional hazards. Traditional statistical models often only include static information where, in this work, we propose a novel conditional variational autoencoder-based method called DySurv, which uses a combination of static and time-series measurements from patient electronic health records to estimate the risk of death dynamically. DySurv has been tested on several time-to-event benchmarks where it outperforms existing methods, including deep learning methods, and we evaluate it on real-world intensive care unit data from MIMIC-IV and eICU. The predictive capacity of DySurv is consistent and the survival estimates remain disentangled across different datasets supporting the idea that dynamic deep learning models based on conditional variational inference in multi-task cases can be robust models for survival analysis.  ( 2 min )
    Graph Attention-based Deep Reinforcement Learning for solving the Chinese Postman Problem with Load-dependent costs
    arXiv:2310.15516v4 Announce Type: replace Abstract: Recently, Deep reinforcement learning (DRL) models have shown promising results in solving routing problems. However, most DRL solvers are commonly proposed to solve node routing problems, such as the Traveling Salesman Problem (TSP). Meanwhile, there has been limited research on applying neural methods to arc routing problems, such as the Chinese Postman Problem (CPP), since they often feature irregular and complex solution spaces compared to TSP. To fill these gaps, this paper proposes a novel DRL framework to address the CPP with load-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc routing problem with load constraints. The novelty of our method is two-fold. First, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential model. Subsequently, we introduce an autoregressive model based on DRL, namely Arc-DRL, consisting of an encoder and decoder to address the CPP-LC challenge effectively. Such a framework allows the DRL model to work efficiently and scalably to arc routing problems. Furthermore, we propose a new bio-inspired meta-heuristic solution based on Evolutionary Algorithm (EA) for CPP-LC. Extensive experiments show that Arc-DRL outperforms existing meta-heuristic methods such as Iterative Local Search (ILS) and Variable Neighborhood Search (VNS) proposed by (Corberan et al., 2018) on large benchmark datasets for CPP-LC regarding both solution quality and running time; while the EA gives the best solution quality with much more running time. We release our C++ implementations for metaheuristics such as EA, ILS and VNS along with the code for data generation and our generated data at https://github.com/HySonLab/Chinese_Postman_Problem  ( 3 min )
    Adversarial Attacks on Fairness of Graph Neural Networks
    arXiv:2310.13822v2 Announce Type: replace Abstract: Fairness-aware graph neural networks (GNNs) have gained a surge of attention as they can reduce the bias of predictions on any demographic group (e.g., female) in graph-based applications. Although these methods greatly improve the algorithmic fairness of GNNs, the fairness can be easily corrupted by carefully designed adversarial attacks. In this paper, we investigate the problem of adversarial attacks on fairness of GNNs and propose G-FairAttack, a general framework for attacking various types of fairness-aware GNNs in terms of fairness with an unnoticeable effect on prediction utility. In addition, we propose a fast computation technique to reduce the time complexity of G-FairAttack. The experimental study demonstrates that G-FairAttack successfully corrupts the fairness of different types of GNNs while keeping the attack unnoticeable. Our study on fairness attacks sheds light on potential vulnerabilities in fairness-aware GNNs and guides further research on the robustness of GNNs in terms of fairness.  ( 2 min )
    Quantum Speedups in Regret Analysis of Infinite Horizon Average-Reward Markov Decision Processes
    arXiv:2310.11684v2 Announce Type: replace Abstract: This paper investigates the potential of quantum acceleration in addressing infinite horizon Markov Decision Processes (MDPs) to enhance average reward outcomes. We introduce an innovative quantum framework for the agent's engagement with an unknown MDP, extending the conventional interaction paradigm. Our approach involves the design of an optimism-driven tabular Reinforcement Learning algorithm that harnesses quantum signals acquired by the agent through efficient quantum mean estimation techniques. Through thorough theoretical analysis, we demonstrate that the quantum advantage in mean estimation leads to exponential advancements in regret guarantees for infinite horizon Reinforcement Learning. Specifically, the proposed Quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$, a significant improvement over the $\tilde{\mathcal{O}}(\sqrt{T})$ bound exhibited by classical counterparts.  ( 2 min )
    From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers
    arXiv:2310.11984v2 Announce Type: replace Abstract: In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. In particular, our solution solves the Parity task, a well-known and theoretically proven failure mode for Transformers. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we show to be connected to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks. Our code is available at https: //github.com/shaoxiongduan/AttentionBiasCalibration.  ( 2 min )
    Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
    arXiv:2310.10207v2 Announce Type: replace Abstract: We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): Given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concepts, which is exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding the two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already imposes a significant challenge to current few-shot reasoning algorithms. We further investigate to which extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs, and combining VLMs and LLMs in an interactive reasoning scheme. We even conceived a neuro-symbolic reasoning approach that reconciles LLMs & VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manage to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.  ( 3 min )
    Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?
    arXiv:2310.10012v3 Announce Type: replace Abstract: Diffusion models for text-to-image (T2I) synthesis, such as Stable Diffusion (SD), have recently demonstrated exceptional capabilities for generating high-quality content. However, this progress has raised several concerns of potential misuse, particularly in creating copyrighted, prohibited, and restricted content, or NSFW (not safe for work) images. While efforts have been made to mitigate such problems, either by implementing a safety filter at the evaluation stage or by fine-tuning models to eliminate undesirable concepts or styles, the effectiveness of these safety measures in dealing with a wide range of prompts remains largely unexplored. In this work, we aim to investigate these safety mechanisms by proposing one novel concept retrieval algorithm for evaluation. We introduce Ring-A-Bell, a model-agnostic red-teaming tool for T2I diffusion models, where the whole evaluation can be prepared in advance without prior knowledge of the target model. Specifically, Ring-A-Bell first performs concept extraction to obtain holistic representations for sensitive and inappropriate concepts. Subsequently, by leveraging the extracted concept, Ring-A-Bell automatically identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content, allowing the user to assess the reliability of deployed safety mechanisms. Finally, we empirically validate our method by testing online services such as Midjourney and various methods of concept removal. Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms, thus revealing the defects of the so-called safety mechanisms which could practically lead to the generation of harmful contents. Our codes are available at https://github.com/chiayi-hsu/Ring-A-Bell.  ( 3 min )
    An Edge-Aware Graph Autoencoder Trained on Scale-Imbalanced Data for Traveling Salesman Problems
    arXiv:2310.06543v2 Announce Type: replace Abstract: In recent years, there has been a notable surge in research on machine learning techniques for combinatorial optimization. It has been shown that learning-based methods outperform traditional heuristics and mathematical solvers on the Traveling Salesman Problem (TSP) in terms of both performance and computational efficiency. However, most learning-based TSP solvers are primarily designed for fixed-scale TSP instances, and also require a large number of training samples to achieve optimal performance. To fill this gap, this work proposes a data-driven graph representation learning method for solving TSPs with various numbers of cities. Specifically, we formulate the TSP as a link prediction task and propose an edge-aware graph autoencoder (EdgeGAE) model that can solve TSPs by learning from various-scale samples with an imbalanced distribution. A residual gated encoder is trained to learn latent edge embeddings, followed by an edge-centered decoder to output link predictions in an end-to-end manner. Furthermore, we introduce an active sampling strategy into the training process to improve the model's generalization capability in large-scale scenarios. To investigate the model's practical applicability, we generate a scale-imbalanced dataset comprising 50,000 TSP instances ranging from 50 to 500 cities. The experimental results demonstrate that the proposed edge-aware graph autoencoder model achieves a highly competitive performance among state-of-the-art graph learning-based approaches in solving TSPs with various scales, implying its remarkable potential in dealing with practical optimization challenges.  ( 3 min )
    Functional Interpolation for Relative Positions Improves Long Context Transformers
    arXiv:2310.04418v2 Announce Type: replace Abstract: Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.  ( 2 min )
    Protecting Sensitive Data through Federated Co-Training
    arXiv:2310.05696v2 Announce Type: replace Abstract: In many applications, sensitive data is inherently distributed and may not be pooled due to privacy concerns. Federated learning allows us to collaboratively train a model without pooling the data by iteratively aggregating the parameters of local models. It is possible, though, to infer upon the sensitive data from the shared model parameters. We propose to use a federated co-training approach where clients share hard labels on a public unlabeled dataset instead of model parameters. A consensus on the shared labels forms a pseudo labeling for the unlabeled dataset that clients use in combination with their private data to train local models. We show that sharing hard labels substantially improves privacy over sharing model parameters. At the same time, federated co-training achieves a model quality comparable to federated learning. Moreover, it allows us to use local models such as (gradient boosted) decision trees, rule ensembles, and random forests that do not lend themselves to the parameter aggregation used in federated learning.  ( 2 min )
    Language Models Represent Space and Time
    arXiv:2310.02207v3 Announce Type: replace Abstract: The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a set of more coherent and grounded representations that reflect the real world. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. While further investigation is needed, our results suggest modern LLMs learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model.  ( 2 min )
    Contrastive Continual Multi-view Clustering with Filtered Structural Fusion
    arXiv:2309.15135v2 Announce Type: replace Abstract: Multi-view clustering thrives in applications where views are collected in advance by extracting consistent and complementary information among views. However, it overlooks scenarios where data views are collected sequentially, i.e., real-time data. Due to privacy issues or memory burden, previous views are not available with time in these situations. Some methods are proposed to handle it but are trapped in a stability-plasticity dilemma. In specific, these methods undergo a catastrophic forgetting of prior knowledge when a new view is attained. Such a catastrophic forgetting problem (CFP) would cause the consistent and complementary information hard to get and affect the clustering performance. To tackle this, we propose a novel method termed Contrastive Continual Multi-view Clustering with Filtered Structural Fusion (CCMVC-FSF). Precisely, considering that data correlations play a vital role in clustering and prior knowledge ought to guide the clustering process of a new view, we develop a data buffer with fixed size to store filtered structural information and utilize it to guide the generation of a robust partition matrix via contrastive learning. Furthermore, we theoretically connect CCMVC-FSF with semi-supervised learning and knowledge distillation. Extensive experiments exhibit the excellence of the proposed method.  ( 2 min )
    $G$-Mapper: Learning a Cover in the Mapper Construction
    arXiv:2309.06634v2 Announce Type: replace Abstract: The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. However, the Mapper algorithm requires tuning several parameters in order to generate a ``nice" Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by splitting a cover repeatedly according to a statistical test for normality. Our algorithm is based on $G$-means clustering which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to carefully choose the cover according to the distribution of the given data. Experiments for synthetic and real-world datasets demonstrate that our algorithm generates covers so that the Mapper graphs retain the essence of the datasets, while also running significantly fast.  ( 2 min )
    Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations
    arXiv:2309.13150v2 Announce Type: replace Abstract: Deep learning-based visual perception models lack robustness when faced with camera motion perturbations in practice. The current certification process for assessing robustness is costly and time-consuming due to the extensive number of image projections required for Monte Carlo sampling in the 3D camera motion space. To address these challenges, we present a novel, efficient, and practical framework for certifying the robustness of 3D-2D projective transformations against camera motion perturbations. Our approach leverages a smoothing distribution over the 2D pixel space instead of in the 3D physical space, eliminating the need for costly camera motion sampling and significantly enhancing the efficiency of robustness certifications. With the pixel-wise smoothed classifier, we are able to fully upper bound the projection errors using a technique of uniform partitioning in camera motion space. Additionally, we extend our certification framework to a more general scenario where only a single-frame point cloud is required in the projection oracle. Through extensive experimentation, we validate the trade-off between effectiveness and efficiency enabled by our proposed method. Remarkably, our approach achieves approximately 80% certified accuracy while utilizing only 30% of the projected image frames. The code is available at https://github.com/HanjiangHu/pixel-wise-smoothing.  ( 3 min )
    SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
    arXiv:2309.05019v2 Announce Type: replace Abstract: Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation tasks. As sampling from DPMs is equivalent to solving diffusion SDE or ODE which is time-consuming, numerous fast sampling methods built upon improved differential equation solvers are proposed. The majority of such techniques consider solving the diffusion ODE due to its superior efficiency. However, stochastic sampling could offer additional advantages in generating diverse and high-quality data. In this work, we engage in a comprehensive analysis of stochastic sampling from two aspects: variance-controlled diffusion SDE and linear multi-step SDE solver. Based on our analysis, we propose SA-Solver, which is an improved efficient stochastic Adams method for solving diffusion SDE to generate data with high quality. Our experiments show that SA-Solver achieves: 1) improved or comparable performance compared with the existing state-of-the-art sampling methods for few-step sampling; 2) SOTA FID scores on substantial benchmark datasets under a suitable number of function evaluations (NFEs).  ( 2 min )
    A Natural Gas Consumption Forecasting System for Continual Learning Scenarios based on Hoeffding Trees with Change Point Detection Mechanism
    arXiv:2309.03720v3 Announce Type: replace Abstract: Forecasting natural gas consumption, considering seasonality and trends, is crucial in planning its supply and consumption and optimizing the cost of obtaining it, mainly by industrial entities. However, in times of threats to its supply, it is also a critical element that guarantees the supply of this raw material to meet individual consumers' needs, ensuring society's energy security. This article introduces a novel multistep ahead forecasting of natural gas consumption with change point detection integration for model collection selection with continual learning capabilities using data stream processing. The performance of the forecasting models based on the proposed approach is evaluated in a complex real-world use case of natural gas consumption forecasting. We employed Hoeffding tree predictors as forecasting models and the Pruned Exact Linear Time (PELT) algorithm for the change point detection procedure. The change point detection integration enables selecting a different model collection for successive time frames. Thus, three model collection selection procedures (with and without an error feedback loop) are defined and evaluated for forecasting scenarios with various densities of detected change points. These models were compared with change point agnostic baseline approaches. Our experiments show that fewer change points result in a lower forecasting error regardless of the model collection selection procedure employed. Also, simpler model collection selection procedures omitting forecasting error feedback leads to more robust forecasting models suitable for continual learning tasks.  ( 3 min )
    A Unifying Generator Loss Function for Generative Adversarial Networks
    arXiv:2308.07233v2 Announce Type: replace Abstract: A unifying $\alpha$-parametrized generator loss function is introduced for a dual-objective generative adversarial network (GAN), which uses a canonical (or classical) discriminator loss function such as the one in the original GAN (VanillaGAN) system. The generator loss function is based on a symmetric class probability estimation type function, $\mathcal{L}_\alpha$, and the resulting GAN system is termed $\mathcal{L}_\alpha$-GAN. Under an optimal discriminator, it is shown that the generator's optimization problem consists of minimizing a Jensen-$f_\alpha$-divergence, a natural generalization of the Jensen-Shannon divergence, where $f_\alpha$ is a convex function expressed in terms of the loss function $\mathcal{L}_\alpha$. It is also demonstrated that this $\mathcal{L}_\alpha$-GAN problem recovers as special cases a number of GAN problems in the literature, including VanillaGAN, Least Squares GAN (LSGAN), Least $k$th order GAN (L$k$GAN) and the recently introduced $(\alpha_D,\alpha_G)$-GAN with $\alpha_D=1$. Finally, experimental results are conducted on three datasets, MNIST, CIFAR-10, and Stacked MNIST to illustrate the performance of various examples of the $\mathcal{L}_\alpha$-GAN system.  ( 2 min )
    SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks
    arXiv:2309.00255v2 Announce Type: replace Abstract: Deep neural networks (DNNs) must cater to a variety of users with different performance needs and budgets, leading to the costly practice of training, storing, and maintaining numerous specific models. There are solutions in the literature to deal with single dynamic or many-in-one models instead of many individual networks; however, they usually suffer from heavy model search requirements, being architecture-specific, working only on a limited number of dimensions (e.g. depth only or width only) or sub-models. To address these problems, we propose SortedNet, a generalized and scalable training solution to harness the inherent modularity of DNNs. Thanks to a generalized nested architecture (which we refer to as \textit{sorted} architecture in this paper) with shared parameters and its novel update scheme combining random sub-model sampling and gradient accumulation, SortedNet enables the training of numerous sub-models simultaneously, simplifies dynamic model selection and deployment during inference, and reduces the model storage requirement significantly. The versatility and scalability of SortedNet are validated through various architectures and tasks including LLaMA, BERT, RoBERTa (NLP tasks), ResNet and MobileNet (image classification) demonstrating its superiority over existing dynamic training methods. SortedNet is able to train up to 160 sub-models at once, achieving at least 96\% of the original model's performance.  ( 3 min )
    Noncompact uniform universal approximation
    arXiv:2308.03812v2 Announce Type: replace Abstract: The universal approximation theorem is generalised to uniform convergence on the (noncompact) input space $\mathbb{R}^n$. All continuous functions that vanish at infinity can be uniformly approximated by neural networks with one hidden layer, for all activation functions $\varphi$ that are continuous, nonpolynomial, and asymptotically polynomial at $\pm\infty$. When $\varphi$ is moreover bounded, we exactly determine which functions can be uniformly approximated by neural networks, with the following unexpected results. Let $\overline{\mathcal{N}_\varphi^l(\mathbb{R}^n)}$ denote the vector space of functions that are uniformly approximable by neural networks with $l$ hidden layers and $n$ inputs. For all $n$ and all $l\geq2$, $\overline{\mathcal{N}_\varphi^l(\mathbb{R}^n)}$ turns out to be an algebra under the pointwise product. If the left limit of $\varphi$ differs from its right limit (for instance, when $\varphi$ is sigmoidal) the algebra $\overline{\mathcal{N}_\varphi^l(\mathbb{R}^n)}$ ($l\geq2$) is independent of $\varphi$ and $l$, and equals the closed span of products of sigmoids composed with one-dimensional projections. If the left limit of $\varphi$ equals its right limit, $\overline{\mathcal{N}_\varphi^l(\mathbb{R}^n)}$ ($l\geq1$) equals the (real part of the) commutative resolvent algebra, a C*-algebra which is used in mathematical approaches to quantum theory. In the latter case, the algebra is independent of $l\geq1$, whereas in the former case $\overline{\mathcal{N}_\varphi^2(\mathbb{R}^n)}$ is strictly bigger than $\overline{\mathcal{N}_\varphi^1(\mathbb{R}^n)}$.  ( 3 min )
    Settling the Sample Complexity of Online Reinforcement Learning
    arXiv:2307.13586v2 Announce Type: replace Abstract: A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a ``large-sample'' regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factor, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.  ( 3 min )
    One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
    arXiv:2307.05435v3 Announce Type: replace Abstract: Multimodal learning models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question-answering to autonomous driving. Despite the importance of multimodal learning, existing efforts focus on NLP applications, where the number of modalities is typically less than four (audio, video, text, images). However, data inputs in other domains, such as the medical field, may include X-rays, PET scans, MRIs, genetic screening, clinical notes, and more, creating a need for both efficient and accurate information fusion. Many state-of-the-art models rely on pairwise cross-modal attention, which does not scale well for applications with more than three modalities. For $n$ modalities, computing attention will result in $n \choose 2$ operations, potentially requiring considerable amounts of computational resources. To address this, we propose a new domain-neutral attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities and requires only $n$ attention operations, thus offering a significant reduction in computational complexity compared to existing cross-modal attention algorithms. Using three diverse real-world datasets as well as an additional simulation experiment, we show that our method improves performance compared to popular fusion techniques while decreasing computation costs.  ( 2 min )
    Tackling the Curse of Dimensionality with Physics-Informed Neural Networks
    arXiv:2307.12306v5 Announce Type: replace Abstract: The curse-of-dimensionality taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs, as Richard E. Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. We develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and randomly samples a subset of these dimensional pieces in each iteration of training PINNs. We prove theoretically the convergence and other desired properties of the proposed method. We demonstrate in various diverse tests that the proposed method can solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in tens of thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. Notably, we solve nonlinear PDEs with nontrivial, anisotropic, and inseparable solutions in 100,000 effective dimensions in 12 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, it can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.  ( 3 min )
    When No-Rejection Learning is Consistent for Regression with Rejection
    arXiv:2307.02932v3 Announce Type: replace Abstract: Learning with rejection has been a prototypical model for studying the human-AI interaction on prediction tasks. Upon the arrival of a sample instance, the model first uses a rejector to decide whether to accept and use the AI predictor to make a prediction or reject and defer the sample to humans. Learning such a model changes the structure of the original loss function and often results in undesirable non-convexity and inconsistency issues. For the classification with rejection problem, several works develop consistent surrogate losses for the joint learning of the predictor and the rejector, while there have been fewer works for the regression counterpart. This paper studies the regression with rejection (RwR) problem and investigates a no-rejection learning strategy that uses all the data to learn the predictor. We first establish the consistency for such a strategy under the weak realizability condition. Then for the case without the weak realizability, we show that the excessive risk can also be upper bounded with the sum of two parts: prediction error and calibration error. Lastly, we demonstrate the advantage of such a proposed learning strategy with empirical evidence.  ( 2 min )
    Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
    arXiv:2306.04815v2 Announce Type: replace Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.  ( 2 min )
    LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning
    arXiv:2306.09910v4 Announce Type: replace Abstract: Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive. To mitigate this cost, machine learning methods, such as transfer learning, semi-supervised learning and active learning, aim to be label-efficient: achieving high predictive performance from relatively few labeled examples. While obtaining the best label-efficiency in practice often requires combinations of these techniques, existing benchmark and evaluation frameworks do not capture a concerted combination of all such techniques. This paper addresses this deficiency by introducing LabelBench, a new computationally-efficient framework for joint evaluation of multiple label-efficient learning techniques. As an application of LabelBench, we introduce a novel benchmark of state-of-the-art active learning methods in combination with semi-supervised learning for fine-tuning pretrained vision transformers. Our benchmark demonstrates better label-efficiencies than previously reported in active learning. LabelBench's modular codebase is open-sourced for the broader community to contribute label-efficient learning methods and benchmarks. The repository can be found at: https://github.com/EfficientTraining/LabelBench.  ( 2 min )
    Designing Decision Support Systems Using Counterfactual Prediction Sets
    arXiv:2306.03928v2 Announce Type: replace Abstract: Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study ($n = 2{,}751$) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/counterfactual-prediction-sets.  ( 3 min )
    Memorization Capacity of Multi-Head Attention in Transformers
    arXiv:2306.02010v3 Announce Type: replace Abstract: Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property. We validate our findings through experiments on synthetic data.  ( 2 min )
    Safe Offline Reinforcement Learning with Real-Time Budget Constraints
    arXiv:2306.00603v2 Announce Type: replace Abstract: Aiming at promoting the safe real-world deployment of Reinforcement Learning (RL), research on safe RL has made significant progress in recent years. However, most existing works in the literature still focus on the online setting where risky violations of the safety budget are likely to be incurred during training. Besides, in many real-world applications, the learned policy is required to respond to dynamically determined safety budgets (i.e., constraint threshold) in real time. In this paper, we target at the above real-time budget constraint problem under the offline setting, and propose Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that models this problem from the perspective of trajectory distribution and solves it through diffusion model planning. Theoretically, we prove an error bound of the estimation on the episodic reward and cost under the offline setting and thus provide a performance guarantee for TREBI. Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.  ( 2 min )
    Learning Safety Constraints from Demonstrations with Unknown Rewards
    arXiv:2305.16147v2 Announce Type: replace Abstract: We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in gridworld environments and a driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior. Importantly, we can safely transfer the learned constraints to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.  ( 2 min )
    Deep Temporal Graph Clustering
    arXiv:2305.10738v2 Announce Type: replace Abstract: Deep graph clustering has recently received significant attention due to its ability to enhance the representation learning capabilities of models in unsupervised scenarios. Nevertheless, deep clustering for temporal graphs, which could capture crucial dynamic interaction information, has not been fully explored. It means that in many clustering-oriented real-world scenarios, temporal graphs can only be processed as static graphs. This not only causes the loss of dynamic information but also triggers huge computational consumption. To solve the problem, we propose a general framework for deep Temporal Graph Clustering called TGC, which introduces deep clustering techniques to suit the interaction sequence-based batch-processing pattern of temporal graphs. In addition, we discuss differences between temporal graph clustering and static graph clustering from several levels. To verify the superiority of the proposed framework TGC, we conduct extensive experiments. The experimental results show that temporal graph clustering enables more flexibility in finding a balance between time and space requirements, and our framework can effectively improve the performance of existing temporal graph learning methods. The code is released: https://github.com/MGitHubL/Deep-Temporal-Graph-Clustering.  ( 2 min )
    A critical look at the evaluation of GNNs under heterophily: Are we really making progress?
    arXiv:2302.11640v2 Announce Type: replace Abstract: Node classification is a classical graph machine learning task on which Graph Neural Networks (GNNs) have recently achieved strong results. However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs. In this work, we challenge this assumption. First, we show that the standard datasets used for evaluating heterophily-specific models have serious drawbacks, making results obtained by using them unreliable. The most significant of these drawbacks is the presence of a large number of duplicate nodes in the datasets Squirrel and Chameleon, which leads to train-test data leakage. We show that removing duplicate nodes strongly affects GNN performance on these datasets. Then, we propose a set of heterophilous graphs of varying properties that we believe can serve as a better benchmark for evaluating the performance of GNNs under heterophily. We show that standard GNNs achieve strong results on these heterophilous graphs, almost always outperforming specialized models. Our datasets and the code for reproducing our experiments are available at https://github.com/yandex-research/heterophilous-graphs  ( 3 min )
    Gaussian Process-Gated Hierarchical Mixtures of Experts
    arXiv:2302.04947v2 Announce Type: replace Abstract: In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs). Unlike other mixtures of experts with gating models linear in the input, our model employs gating functions built with Gaussian processes (GPs). These processes are based on random features that are non-linear functions of the inputs. Furthermore, the experts in our model are also constructed with GPs. The optimization of the GPHMEs is performed by variational inference. The proposed GPHMEs have several advantages. They outperform tree-based HME benchmarks that partition the data in the input space, and they achieve good performance with reduced complexity. Another advantage is the interpretability they provide for deep GPs, and more generally, for deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets, even with quite modest sizes.  ( 2 min )
    On the Expressive Power of Geometric Graph Neural Networks
    arXiv:2301.09308v3 Announce Type: replace Abstract: The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at \url{https://github.com/chaitjo/geometric-gnn-dojo}  ( 2 min )
    Leveraging Algorithmic Fairness to Mitigate Blackbox Attribute Inference Attacks
    arXiv:2211.10209v2 Announce Type: replace Abstract: Machine learning (ML) models have been deployed for high-stakes applications, e.g., healthcare and criminal justice. Prior work has shown that ML models are vulnerable to attribute inference attacks where an adversary, with some background knowledge, trains an ML attack model to infer sensitive attributes by exploiting distinguishable model predictions. However, some prior attribute inference attacks have strong assumptions about adversary's background knowledge (e.g., marginal distribution of sensitive attribute) and pose no more privacy risk than statistical inference. Moreover, none of the prior attacks account for class imbalance of sensitive attribute in datasets coming from real-world applications (e.g., Race and Sex). In this paper, we propose an practical and effective attribute inference attack that accounts for this imbalance using an adaptive threshold over the attack model's predictions. We exhaustively evaluate our proposed attack on multiple datasets and show that the adaptive threshold over the model's predictions drastically improves the attack accuracy over prior work. Finally, current literature lacks an effective defence against attribute inference attacks. We investigate the impact of fairness constraints (i.e., designed to mitigate unfairness in model predictions) during model training on our attribute inference attack. We show that constraint based fairness algorithms which enforces equalized odds acts as an effective defense against attribute inference attacks without impacting the model utility. Hence, the objective of algorithmic fairness and sensitive attribute privacy are aligned.  ( 3 min )
    FIMP: Foundation Model-Informed Message Passing for Graph Neural Networks
    arXiv:2210.09475v3 Announce Type: replace Abstract: Foundation models have revolutionized the landscape of Deep Learning (DL), serving as a versatile platform which can be adapted to a wide range of downstream tasks. Despite their adaptability, applications of foundation models to downstream graph-based tasks have been limited, and there remains no convenient way to leverage large-scale non-graph pretrained models in graph-structured settings. In this work, we present a new framework which we term Foundation-Informed Message Passing (FIMP) to bridge the fields of foundational models and GNNs through a simple concept: constructing message-passing operators from pretrained foundation model weights. We show that this approach results in improved performance for graph-based tasks in a number of data domains, allowing graph neural networks to leverage the knowledge of foundation models.  ( 2 min )
    Using Forwards-Backwards Models to Approximate MDP Homomorphisms
    arXiv:2209.06356v3 Announce Type: replace Abstract: Reinforcement learning agents must painstakingly learn through trial and error what sets of state-action pairs are value equivalent -- requiring an often prohibitively large amount of environment experience. MDP homomorphisms have been proposed that reduce the MDP of an environment to an abstract MDP, enabling better sample efficiency. Consequently, impressive improvements have been achieved when a suitable homomorphism can be constructed a priori -- usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing homomorphisms in discrete action spaces, which uses a learnt model of environment dynamics to infer which state-action pairs lead to the same state -- which can reduce the size of the state-action space by a factor as large as the cardinality of the original action space. In MinAtar, we report an almost 4x improvement over a value-based off-policy baseline in the low sample limit, when averaging over all games and optimizers.  ( 2 min )
    Boundary-Aware Uncertainty for Feature Attribution Explainers
    arXiv:2210.02419v5 Announce Type: replace Abstract: Post-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications. However, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work we propose the Gaussian Process Explanation UnCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with explanation function approximation uncertainty. We introduce a novel geodesic-based kernel, which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with decision boundary complexity. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.  ( 2 min )
    CLMFormer: Mitigating Data Redundancy to Revitalize Transformer-based Long-Term Time Series Forecasting System
    arXiv:2207.07827v4 Announce Type: replace Abstract: Long-term time-series forecasting (LTSF) plays a crucial role in various practical applications. Transformer and its variants have become the de facto backbone for LTSF, offering exceptional capabilities in processing long sequence data. However, existing Transformer-based models, such as Fedformer and Informer, often achieve their best performances on validation sets after just a few epochs, indicating potential underutilization of the Transformer's capacity. One of the reasons that contribute to this overfitting is data redundancy arising from the rolling forecasting settings in the data augmentation process, particularly evident in longer sequences with highly similar adjacent data. In this paper, we propose a novel approach to address this issue by employing curriculum learning and introducing a memory-driven decoder. Specifically, we progressively introduce Bernoulli noise to the training samples, which effectively breaks the high similarity between adjacent data points. To further enhance forecasting accuracy, we introduce a memory-driven decoder. This component enables the model to capture seasonal tendencies and dependencies in the time-series data and leverages temporal relationships to facilitate the forecasting process. The experimental results on six real-life LTSF benchmarks demonstrate that our approach can be seamlessly plugged into varying Transformer-based models, with our approach enhancing the LTSF performances of various Transformer-based models by maximally 30%.  ( 3 min )
    Assisted Learning for Organizations with Limited Imbalanced Data
    arXiv:2109.09307v4 Announce Type: replace Abstract: In the era of big data, many big organizations are integrating machine learning into their work pipelines to facilitate data analysis. However, the performance of their trained models is often restricted by limited and imbalanced data available to them. In this work, we develop an assisted learning framework for assisting organizations to improve their learning performance. The organizations have sufficient computation resources but are subject to stringent data-sharing and collaboration policies. Their limited imbalanced data often cause biased inference and sub-optimal decision-making. In assisted learning, an organizational learner purchases assistance service from an external service provider and aims to enhance its model performance within only a few assistance rounds. We develop effective stochastic training algorithms for both assisted deep learning and assisted reinforcement learning. Different from existing distributed algorithms that need to frequently transmit gradients or models, our framework allows the learner to only occasionally share information with the service provider, but still obtain a model that achieves near-oracle performance as if all the data were centralized.  ( 3 min )
    Pseudo-Differential Neural Operator: Generalized Fourier Neural Operator for Learning Solution Operators of Partial Differential Equations
    arXiv:2201.11967v3 Announce Type: replace Abstract: Learning the mapping between two function spaces has garnered considerable research attention. However, learning the solution operator of partial differential equations (PDEs) remains a challenge in scientific computing. Fourier neural operator (FNO) was recently proposed to learn solution operators, and it achieved an excellent performance. In this study, we propose a novel \textit{pseudo-differential integral operator} (PDIO) to analyze and generalize the Fourier integral operator in FNO. PDIO is inspired by a pseudo-differential operator, which is a generalized differential operator characterized by a certain symbol. We parameterize this symbol using a neural network and demonstrate that the neural network-based symbol is contained in a smooth symbol class. Subsequently, we verify that the PDIO is a bounded linear operator, and thus is continuous in the Sobolev space. We combine the PDIO with the neural operator to develop a \textit{pseudo-differential neural operator} (PDNO) and learn the nonlinear solution operator of PDEs. We experimentally validate the effectiveness of the proposed model by utilizing Darcy flow and the Navier-Stokes equation. The obtained results indicate that the proposed PDNO outperforms the existing neural operator approaches in most experiments.  ( 3 min )
    Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks
    arXiv:2105.13937v3 Announce Type: replace Abstract: We present a new class of Langevin based algorithms, which overcomes many of the known shortcomings of popular adaptive optimizers that are currently used for the fine tuning of deep learning models. Its underpinning theory relies on recent advances of Euler's polygonal approximations for stochastic differential equations (SDEs) with monotone coefficients. As a result, it inherits the stability properties of tamed algorithms, while it addresses other known issues, e.g. vanishing gradients in neural networks. In particular, we provide a nonasymptotic analysis and full theoretical guarantees for the convergence properties of an algorithm of this novel class, which we named TH$\varepsilon$O POULA (or, simply, TheoPouLa). Finally, several experiments are presented with different types of deep learning models, which show the superior performance of TheoPouLa over many popular adaptive optimization algorithms.  ( 2 min )
    Adaptive Rational Activations to Boost Deep Reinforcement Learning
    arXiv:2102.09407v4 Announce Type: replace Abstract: Latest insights from biology show that intelligence not only emerges from the connections between neurons but that individual neurons shoulder more computational responsibility than previously anticipated. This perspective should be critical in the context of constantly changing distinct reinforcement learning environments, yet current approaches still primarily employ static activation functions. In this work, we motivate why rationals are suitable for adaptable activation functions and why their inclusion into neural networks is crucial. Inspired by recurrence in residual networks, we derive a condition under which rational units are closed under residual connections and formulate a naturally regularised version: the recurrent-rational. We demonstrate that equipping popular algorithms with (recurrent-)rational activations leads to consistent improvements on Atari games, especially turning simple DQN into a solid approach, competitive to DDQN and Rainbow.  ( 2 min )
    Twisting Lids Off with Two Hands
    arXiv:2403.02338v1 Announce Type: cross Abstract: Manipulating objects with two multi-fingered hands has been a long-standing challenge in robotics, attributed to the contact-rich nature of many manipulation tasks and the complexity inherent in coordinating a high-dimensional bimanual system. In this work, we consider the problem of twisting lids of various bottle-like objects with two hands, and demonstrate that policies trained in simulation using deep reinforcement learning can be effectively transferred to the real world. With novel engineering insights into physical modeling, real-time perception, and reward design, the policy demonstrates generalization capabilities across a diverse set of unseen objects, showcasing dynamic and dexterous behaviors. Our findings serve as compelling evidence that deep reinforcement learning combined with sim-to-real transfer remains a promising approach for addressing manipulation problems of unprecedented complexity.  ( 2 min )
    Private Prediction Sets
    arXiv:2102.06202v3 Announce Type: replace Abstract: In real-world settings involving consequential decision-making, the deployment of machine learning systems generally requires both reliable uncertainty quantification and protection of individuals' privacy. We present a framework that treats these two desiderata jointly. Our framework is based on conformal prediction, a methodology that augments predictive models to return prediction sets that provide uncertainty quantification -- they provably cover the true response with a user-specified probability, such as 90%. One might hope that when used with privately-trained models, conformal prediction would yield privacy guarantees for the resulting prediction sets; unfortunately, this is not the case. To remedy this key problem, we develop a method that takes any pre-trained predictive model and outputs differentially private prediction sets. Our method follows the general approach of split conformal prediction; we use holdout data to calibrate the size of the prediction sets but preserve privacy by using a privatized quantile subroutine. This subroutine compensates for the noise introduced to preserve privacy in order to guarantee correct coverage. We evaluate the method on large-scale computer vision datasets.  ( 3 min )
    Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
    arXiv:2403.02325v1 Announce Type: cross Abstract: Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp, as well as to compositional generalization -- improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe -- and to image-text alignment for generated images, where we improve by up to 8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG allows us to re-rank proposed regions in referring expression comprehension and phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an average gain of 3.2% in accuracy. Our analysis explores alternative masking strategies for CRG, quantifies CRG's probability shift, and evaluates the role of region guidance strength, empirically validating CRG's design choices.  ( 3 min )
    Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation
    arXiv:2403.02302v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have recently gained immense popularity. Powerful commercial models like ChatGPT-4V and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models and are applied to solve a wide variety of tasks, including those in computer vision. These neural networks possess such strong general knowledge and reasoning abilities that they have proven capable of working even on tasks for which they were not specifically trained. We compared the capabilities of the most powerful MLLMs to date: ShareGPT4V, ChatGPT, LLaVA-Next in a specialized task of age and gender estimation with our state-of-the-art specialized model, MiVOLO. We also updated MiVOLO and provide details and new metrics in this article. This comparison has yielded some interesting results and insights about the strengths and weaknesses of the participating models. Furthermore, we attempted various ways to fine-tune the ShareGPT4V model for this specific task, aiming to achieve state-of-the-art results in this particular challenge. Although such a model would not be practical in production, as it is incredibly expensive compared to a specialized model like MiVOLO, it could be very useful in some tasks, like data annotation.  ( 2 min )
    Koopman-Assisted Reinforcement Learning
    arXiv:2403.02290v1 Announce Type: cross Abstract: The Bellman equation and its continuous form, the Hamilton-Jacobi-Bellman (HJB) equation, are ubiquitous in reinforcement learning (RL) and control theory. However, these equations quickly become intractable for systems with high-dimensional states and nonlinearity. This paper explores the connection between the data-driven Koopman operator and Markov Decision Processes (MDPs), resulting in the development of two new RL algorithms to address these limitations. We leverage Koopman operator techniques to lift a nonlinear system into new coordinates where the dynamics become approximately linear, and where HJB-based methods are more tractable. In particular, the Koopman operator is able to capture the expectation of the time evolution of the value function of a given system via linear dynamics in the lifted coordinates. By parameterizing the Koopman operator with the control actions, we construct a ``Koopman tensor'' that facilitates the estimation of the optimal value function. Then, a transformation of Bellman's framework in terms of the Koopman tensor enables us to reformulate two max-entropy RL algorithms: soft value iteration and soft actor-critic (SAC). This highly flexible framework can be used for deterministic or stochastic systems as well as for discrete or continuous-time dynamics. Finally, we show that these Koopman Assisted Reinforcement Learning (KARL) algorithms attain state-of-the-art (SOTA) performance with respect to traditional neural network-based SAC and linear quadratic regulator (LQR) baselines on four controlled dynamical systems: a linear state-space system, the Lorenz system, fluid flow past a cylinder, and a double-well potential with non-isotropic stochastic forcing.  ( 2 min )
    Statistical Query Lower Bounds for Learning Truncated Gaussians
    arXiv:2403.02300v1 Announce Type: cross Abstract: We study the problem of estimating the mean of an identity covariance Gaussian in the truncated setting, in the regime when the truncation set comes from a low-complexity family $\mathcal{C}$ of sets. Specifically, for a fixed but unknown truncation set $S \subseteq \mathbb{R}^d$, we are given access to samples from the distribution $\mathcal{N}(\boldsymbol{ \mu}, \mathbf{ I})$ truncated to the set $S$. The goal is to estimate $\boldsymbol\mu$ within accuracy $\epsilon>0$ in $\ell_2$-norm. Our main result is a Statistical Query (SQ) lower bound suggesting a super-polynomial information-computation gap for this task. In more detail, we show that the complexity of any SQ algorithm for this problem is $d^{\mathrm{poly}(1/\epsilon)}$, even when the class $\mathcal{C}$ is simple so that $\mathrm{poly}(d/\epsilon)$ samples information-theoretically suffice. Concretely, our SQ lower bound applies when $\mathcal{C}$ is a union of a bounded number of rectangles whose VC dimension and Gaussian surface are small. As a corollary of our construction, it also follows that the complexity of the previously known algorithm for this task is qualitatively best possible.  ( 2 min )
    NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot Learning in Natural Human-Robot Interaction
    arXiv:2403.02274v1 Announce Type: cross Abstract: Recent advancements in multimodal Human-Robot Interaction (HRI) datasets have highlighted the fusion of speech and gesture, expanding robots' capabilities to absorb explicit and implicit HRI insights. However, existing speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing, revealing limitations in scaling to intricate domains and prioritizing human command data over robot behavior records. To bridge these gaps, we introduce NatSGD, a multimodal HRI dataset encompassing human commands through speech and gestures that are natural, synchronized with robot behavior demonstrations. NatSGD serves as a foundational resource at the intersection of machine learning and HRI research, and we demonstrate its effectiveness in training robots to understand tasks through multimodal human commands, emphasizing the significance of jointly considering speech and gestures. We have released our dataset, simulator, and code to facilitate future research in human-robot interaction system learning; access these resources at https://www.snehesh.com/natsgd/  ( 2 min )
    RIFF: Learning to Rephrase Inputs for Few-shot Fine-tuning of Language Models
    arXiv:2403.02271v1 Announce Type: cross Abstract: Pre-trained Language Models (PLMs) can be accurately fine-tuned for downstream text processing tasks. Recently, researchers have introduced several parameter-efficient fine-tuning methods that optimize input prompts or adjust a small number of model parameters (e.g LoRA). In this study, we explore the impact of altering the input text of the original task in conjunction with parameter-efficient fine-tuning methods. To most effectively rewrite the input text, we train a few-shot paraphrase model with a Maximum-Marginal Likelihood objective. Using six few-shot text classification datasets, we show that enriching data with paraphrases at train and test time enhances the performance beyond what can be achieved with parameter-efficient fine-tuning alone.  ( 2 min )
    KnowPhish: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing Reference-Based Phishing Detection
    arXiv:2403.02253v1 Announce Type: cross Abstract: Phishing attacks have inflicted substantial losses on individuals and businesses alike, necessitating the development of robust and efficient automated phishing detection approaches. Reference-based phishing detectors (RBPDs), which compare the logos on a target webpage to a known set of logos, have emerged as the state-of-the-art approach. However, a major limitation of existing RBPDs is that they rely on a manually constructed brand knowledge base, making it infeasible to scale to a large number of brands, which results in false negative errors due to the insufficient brand coverage of the knowledge base. To address this issue, we propose an automated knowledge collection pipeline, using which we collect and release a large-scale multimodal brand knowledge base, KnowPhish, containing 20k brands with rich information about each brand. KnowPhish can be used to boost the performance of existing RBPDs in a plug-and-play manner. A second limitation of existing RBPDs is that they solely rely on the image modality, ignoring useful textual information present in the webpage HTML. To utilize this textual information, we propose a Large Language Model (LLM)-based approach to extract brand information of webpages from text. Our resulting multimodal phishing detection approach, KnowPhish Detector (KPD), can detect phishing webpages with or without logos. We evaluate KnowPhish and KPD on a manually validated dataset, and on a field study under Singapore's local context, showing substantial improvements in effectiveness and efficiency compared to state-of-the-art baselines.  ( 3 min )
    A prediction rigidity formalism for low-cost uncertainties in trained neural networks
    arXiv:2403.02251v1 Announce Type: cross Abstract: Regression methods are fundamental for scientific and technological applications. However, fitted models can be highly unreliable outside of their training domain, and hence the quantification of their uncertainty is crucial in many of their applications. Based on the solution of a constrained optimization problem, we propose "prediction rigidities" as a method to obtain uncertainties of arbitrary pre-trained regressors. We establish a strong connection between our framework and Bayesian inference, and we develop a last-layer approximation that allows the new method to be applied to neural networks. This extension affords cheap uncertainties without any modification to the neural network itself or its training procedure. We show the effectiveness of our method on a wide range of regression tasks, ranging from simple toy models to applications in chemistry and meteorology.  ( 2 min )
    Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detection
    arXiv:2403.02232v1 Announce Type: cross Abstract: This study conducts a thorough examination of malware detection using machine learning techniques, focusing on the evaluation of various classification models using the Mal-API-2019 dataset. The aim is to advance cybersecurity capabilities by identifying and mitigating threats more effectively. Both ensemble and non-ensemble machine learning methods, such as Random Forest, XGBoost, K Nearest Neighbor (KNN), and Neural Networks, are explored. Special emphasis is placed on the importance of data pre-processing techniques, particularly TF-IDF representation and Principal Component Analysis, in improving model performance. Results indicate that ensemble methods, particularly Random Forest and XGBoost, exhibit superior accuracy, precision, and recall compared to others, highlighting their effectiveness in malware detection. The paper also discusses limitations and potential future directions, emphasizing the need for continuous adaptation to address the evolving nature of malware. This research contributes to ongoing discussions in cybersecurity and provides practical insights for developing more robust malware detection systems in the digital era.  ( 2 min )
    Not all Layers of LLMs are Necessary during Inference
    arXiv:2403.02181v1 Announce Type: cross Abstract: The inference phase of Large Language Models (LLMs) is very expensive. An ideal inference stage of LLMs could utilize fewer computational resources while still maintaining its capabilities (e.g., generalization and in-context learning ability). In this paper, we try to answer the question, "During LLM inference, can we use shallow layers for easy instances; and deep layers for hard ones?" To answer this question, we first indicate that Not all Layers are Necessary during Inference by statistically analyzing the activated layers across tasks. Then, we propose a simple algorithm named AdaInfer to determine the inference termination moment based on the input instance adaptively. More importantly, AdaInfer does not alter LLM parameters and maintains generalizability across tasks. Experiments on well-known LLMs (i.e., Llama2 series and OPT) show that AdaInfer saves an average of 14.8% of computational resources, even up to 50% on sentiment tasks, while maintaining comparable performance. Additionally, this method is orthogonal to other model acceleration techniques, potentially boosting inference efficiency further.  ( 2 min )
    Predicting large scale cosmological structure evolution with GAN-based autoencoders
    arXiv:2403.02171v1 Announce Type: cross Abstract: Cosmological simulations play a key role in the prediction and understanding of large scale structure formation from initial conditions. We make use of GAN-based Autoencoders (AEs) in an attempt to predict structure evolution within simulations. The AEs are trained on images and cubes issued from respectively 2D and 3D N-body simulations describing the evolution of the dark matter (DM) field. We find that while the AEs can predict structure evolution for 2D simulations of DM fields well, using only the density fields as input, they perform significantly more poorly in similar conditions for 3D simulations. However, additionally providing velocity fields as inputs greatly improves results, with similar predictions regardless of time-difference between input and target.  ( 2 min )
    Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models
    arXiv:2403.02178v1 Announce Type: cross Abstract: In reasoning tasks, even a minor error can cascade into inaccurate results, leading to suboptimal performance of large language models in such domains. Earlier fine-tuning approaches sought to mitigate this by leveraging more precise supervisory signals from human labeling, larger models, or self-sampling, although at a high cost. Conversely, we develop a method that avoids external resources, relying instead on introducing perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks. When applied to fine-tuning with GSM8K, this method achieved a 5% improvement in accuracy over standard supervised fine-tuning with a few codes modified and no additional labeling effort. Furthermore, it is complementary to existing methods. When integrated with related data augmentation methods, it leads to an average improvement of 3% improvement in GSM8K accuracy and 1% improvement in MATH accuracy across five datasets of various quality and size, as well as two base models. We further investigate the mechanisms behind this improvement through case studies and quantitative analysis, suggesting that our approach may provide superior support for the model in capturing long-distance dependencies, especially those related to questions. This enhancement could deepen understanding of premises in questions and prior steps. Our code is available at Github.  ( 3 min )
    Recency-Weighted Temporally-Segmented Ensemble for Time-Series Modeling
    arXiv:2403.02150v1 Announce Type: cross Abstract: Time-series modeling in process industries faces the challenge of dealing with complex, multi-faceted, and evolving data characteristics. Conventional single model approaches often struggle to capture the interplay of diverse dynamics, resulting in suboptimal forecasts. Addressing this, we introduce the Recency-Weighted Temporally-Segmented (ReWTS, pronounced `roots') ensemble model, a novel chunk-based approach for multi-step forecasting. The key characteristics of the ReWTS model are twofold: 1) It facilitates specialization of models into different dynamics by segmenting the training data into `chunks' of data and training one model per chunk. 2) During inference, an optimization procedure assesses each model on the recent past and selects the active models, such that the appropriate mixture of previously learned dynamics can be recalled to forecast the future. This method not only captures the nuances of each period, but also adapts more effectively to changes over time compared to conventional `global' models trained on all data in one go. We present a comparative analysis, utilizing two years of data from a wastewater treatment plant and a drinking water treatment plant in Norway, demonstrating the ReWTS ensemble's superiority. It consistently outperforms the global model in terms of mean squared forecasting error across various model architectures by 10-70\% on both datasets, notably exhibiting greater resilience to outliers. This approach shows promise in developing automatic, adaptable forecasting models for decision-making and control systems in process industries and other complex systems.  ( 3 min )
    Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
    arXiv:2403.02090v1 Announce Type: cross Abstract: Understanding social interactions involving both verbal and non-verbal cues is essential to effectively interpret social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not densely aligned to utterances in multi-party environments. They are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling social interactions. We will release our benchmarks and source code to facilitate further research.  ( 2 min )
    Differential Privacy of Noisy (S)GD under Heavy-Tailed Perturbations
    arXiv:2403.02051v1 Announce Type: cross Abstract: Injecting heavy-tailed noise to the iterates of stochastic gradient descent (SGD) has received increasing attention over the past few years. While various theoretical properties of the resulting algorithm have been analyzed mainly from learning theory and optimization perspectives, their privacy preservation properties have not yet been established. Aiming to bridge this gap, we provide differential privacy (DP) guarantees for noisy SGD, when the injected noise follows an $\alpha$-stable distribution, which includes a spectrum of heavy-tailed distributions (with infinite variance) as well as the Gaussian distribution. Considering the $(\epsilon, \delta)$-DP framework, we show that SGD with heavy-tailed perturbations achieves $(0, \tilde{\mathcal{O}}(1/n))$-DP for a broad class of loss functions which can be non-convex, where $n$ is the number of data points. As a remarkable byproduct, contrary to prior work that necessitates bounded sensitivity for the gradients or clipping the iterates, our theory reveals that under mild assumptions, such a projection step is not actually necessary. We illustrate that the heavy-tailed noising mechanism achieves similar DP guarantees compared to the Gaussian case, which suggests that it can be a viable alternative to its light-tailed counterparts.  ( 2 min )
    Hybrid Quantum Neural Network Advantage for Radar-Based Drone Detection and Classification in Low Signal-to-Noise Ratio
    arXiv:2403.02080v1 Announce Type: cross Abstract: In this paper, we investigate the performance of a Hybrid Quantum Neural Network (HQNN) and a comparable classical Convolution Neural Network (CNN) for detection and classification problem using a radar. Specifically, we take a fairly complex radar time-series model derived from electromagnetic theory, namely the Martin-Mulgrew model, that is used to simulate radar returns of objects with rotating blades, such as drones. We find that when that signal-to-noise ratio (SNR) is high, CNN outperforms the HQNN for detection and classification. However, in the low SNR regime (which is of greatest interest in practice) the performance of HQNN is found to be superior to that of the CNN of a similar architecture.  ( 2 min )
    Exponential Expressivity of ReLU$^k$ Neural Networks on Gevrey Classes with Point Singularities
    arXiv:2403.02035v1 Announce Type: cross Abstract: We analyze deep Neural Network emulation rates of smooth functions with point singularities in bounded, polytopal domains $\mathrm{D} \subset \mathbb{R}^d$, $d=2,3$. We prove exponential emulation rates in Sobolev spaces in terms of the number of neurons and in terms of the number of nonzero coefficients for Gevrey-regular solution classes defined in terms of weighted Sobolev scales in $\mathrm{D}$, comprising the countably-normed spaces of I.M. Babu\v{s}ka and B.Q. Guo. As intermediate result, we prove that continuous, piecewise polynomial high order (``$p$-version'') finite elements with elementwise polynomial degree $p\in\mathbb{N}$ on arbitrary, regular, simplicial partitions of polyhedral domains $\mathrm{D} \subset \mathbb{R}^d$, $d\geq 2$ can be exactly emulated by neural networks combining ReLU and ReLU$^2$ activations. On shape-regular, simplicial partitions of polytopal domains $\mathrm{D}$, both the number of neurons and the number of nonzero parameters are proportional to the number of degrees of freedom of the finite element space, in particular for the $hp$-Finite Element Method of I.M. Babu\v{s}ka and B.Q. Guo.  ( 2 min )
    Active Learning of Mealy Machines with Timers
    arXiv:2403.02019v1 Announce Type: cross Abstract: We present the first algorithm for query learning of a general class of Mealy machines with timers (MMTs) in a black-box context. Our algorithm is an extension of the L# algorithm of Vaandrager et al. to a timed setting. Like the algorithm for learning timed automata proposed by Waga, our algorithm is inspired by ideas of Maler & Pnueli. Based on the elementary languages of, both Waga's and our algorithm use symbolic queries, which are then implemented using finitely many concrete queries. However, whereas Waga needs exponentially many concrete queries to implement a single symbolic query, we only need a polynomial number. This is because in order to learn a timed automaton, a learner needs to determine the exact guard and reset for each transition (out of exponentially many possibilities), whereas for learning an MMT a learner only needs to figure out which of the preceding transitions caused a timeout. As shown in our previous work, this can be done efficiently for a subclass of MMTs that are race-avoiding: if a timeout is caused by a preceding input then a slight change in the timing of this input will induce a corresponding change in the timing of the timeout ("wiggling"). Experiments with a prototype implementation, written in Rust, show that our algorithm is able to efficiently learn realistic benchmarks.  ( 2 min )
    On Fractional Moment Estimation from Polynomial Chaos Expansion
    arXiv:2403.01948v1 Announce Type: cross Abstract: Fractional statistical moments are utilized for various tasks of uncertainty quantification, including the estimation of probability distributions. However, an estimation of fractional statistical moments of costly mathematical models by statistical sampling is challenging since it is typically not possible to create a large experimental design due to limitations in computing capacity. This paper presents a novel approach for the analytical estimation of fractional moments, directly from polynomial chaos expansions. Specifically, the first four statistical moments obtained from the deterministic PCE coefficients are used for an estimation of arbitrary fractional moments via H\"{o}lder's inequality. The proposed approach is utilized for an estimation of statistical moments and probability distributions in three numerical examples of increasing complexity. Obtained results show that the proposed approach achieves a superior performance in estimating the distribution of the response, in comparison to a standard Latin hypercube sampling in the presented examples.  ( 2 min )
    Bipartite Graph Variational Auto-Encoder with Fair Latent Representation to Account for Sampling Bias in Ecological Networks
    arXiv:2403.02011v1 Announce Type: cross Abstract: We propose a method to represent bipartite networks using graph embeddings tailored to tackle the challenges of studying ecological networks, such as the ones linking plants and pollinators, where many covariates need to be accounted for, in particular to control for sampling bias. We adapt the variational graph auto-encoder approach to the bipartite case, which enables us to generate embeddings in a latent space where the two sets of nodes are positioned based on their probability of connection. We translate the fairness framework commonly considered in sociology in order to address sampling bias in ecology. By incorporating the Hilbert-Schmidt independence criterion (HSIC) as an additional penalty term in the loss we optimize, we ensure that the structure of the latent space is independent of continuous variables, which are related to the sampling process. Finally, we show how our approach can change our understanding of ecological networks when applied to the Spipoll data set, a citizen science monitoring program of plant-pollinator interactions to which many observers contribute, making it prone to sampling bias.  ( 2 min )
    Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification
    arXiv:2403.01944v1 Announce Type: cross Abstract: Computer vision models normally witness degraded performance when deployed in real-world scenarios, due to unexpected changes in inputs that were not accounted for during training. Data augmentation is commonly used to address this issue, as it aims to increase data variety and reduce the distribution gap between training and test data. However, common visual augmentations might not guarantee extensive robustness of computer vision models. In this paper, we propose Auxiliary Fourier-basis Augmentation (AFA), a complementary technique targeting augmentation in the frequency domain and filling the augmentation gap left by visual augmentations. We demonstrate the utility of augmentation via Fourier-basis additive noise in a straightforward and efficient adversarial setting. Our results show that AFA benefits the robustness of models against common corruptions, OOD generalization, and consistency of performance of models against increasing perturbations, with negligible deficit to the standard performance of models. It can be seamlessly integrated with other augmentation techniques to further boost performance.  ( 2 min )
    Towards Continuous Assurance Case Creation for ADS with the Evidential Tool Bus
    arXiv:2403.01918v1 Announce Type: cross Abstract: An assurance case has become an integral component for the certification of safety-critical systems. While manually defining assurance case patterns can be not avoided, system-specific instantiations of assurance case patterns are both costly and time-consuming. It becomes especially complex to maintain an assurance case for a system when the requirements of the System-Under-Assurance change, or an assurance claim becomes invalid due to, e.g., degradation of a systems component, as common when deploying learning-enabled components. In this paper, we report on our preliminary experience leveraging the tool integration framework Evidential Tool Bus (ETB) for the construction and continuous maintenance of an assurance case from a predefined assurance case pattern. Specifically, we demonstrate the assurance process on an industrial Automated Valet Parking system from the automotive domain. We present the formalization of the provided assurance case pattern in the ETB processable logical specification language of workflows. Our findings show that ETB is able to create and maintain evidence required for the construction of an assurance case.  ( 2 min )
    Capacity of the Hebbian-Hopfield network associative memory
    arXiv:2403.01907v1 Announce Type: cross Abstract: In \cite{Hop82}, Hopfield introduced a \emph{Hebbian} learning rule based neural network model and suggested how it can efficiently operate as an associative memory. Studying random binary patterns, he also uncovered that, if a small fraction of errors is tolerated in the stored patterns retrieval, the capacity of the network (maximal number of memorized patterns, $m$) scales linearly with each pattern's size, $n$. Moreover, he famously predicted $\alpha_c=\lim_{n\rightarrow\infty}\frac{m}{n}\approx 0.14$. We study this very same scenario with two famous pattern's basins of attraction: \textbf{\emph{(i)}} The AGS one from \cite{AmiGutSom85}; and \textbf{\emph{(ii)}} The NLT one from \cite{Newman88,Louk94,Louk94a,Louk97,Tal98}. Relying on the \emph{fully lifted random duality theory} (fl RDT) from \cite{Stojnicflrdt23}, we obtain the following explicit capacity characterizations on the first level of lifting: \begin{equation} \alpha_c^{(AGS,1)} = \left ( \max_{\delta\in \left ( 0,\frac{1}{2}\right ) }\frac{1-2\delta}{\sqrt{2} \mbox{erfinv} \left ( 1-2\delta\right )} - \frac{2}{\sqrt{2\pi}} e^{-\left ( \mbox{erfinv}\left ( 1-2\delta \right )\right )^2}\right )^2 \approx \mathbf{0.137906} \end{equation} \begin{equation} \alpha_c^{(NLT,1)} = \frac{\mbox{erf}(x)^2}{2x^2}-1+\mbox{erf}(x)^2 \approx \mathbf{0.129490}, \quad 1-\mbox{erf}(x)^2- \frac{2\mbox{erf}(x)e^{-x^2}}{\sqrt{\pi}x}+\frac{2e^{-2x^2}}{\pi}=0. \end{equation} A substantial numerical work gives on the second level of lifting $\alpha_c^{(AGS,2)} \approx \mathbf{0.138186}$ and $\alpha_c^{(NLT,2)} \approx \mathbf{0.12979}$, effectively uncovering a remarkably fast lifting convergence. Moreover, the obtained AGS characterizations exactly match the replica symmetry based ones of \cite{AmiGutSom85} and the corresponding symmetry breaking ones of \cite{SteKuh94}.  ( 2 min )
    Universality of reservoir systems with recurrent neural networks
    arXiv:2403.01900v1 Announce Type: cross Abstract: Approximation capability of reservoir systems whose reservoir is a recurrent neural network (RNN) is discussed. In our problem setting, a reservoir system approximates a set of functions just by adjusting its linear readout while the reservoir is fixed. We will show what we call uniform strong universality of a family of RNN reservoir systems for a certain class of functions to be approximated. This means that, for any positive number, we can construct a sufficiently large RNN reservoir system whose approximation error for each function in the class of functions to be approximated is bounded from above by the positive number. Such RNN reservoir systems are constructed via parallel concatenation of RNN reservoirs.  ( 2 min )
    Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks
    arXiv:2403.01888v1 Announce Type: cross Abstract: While deep learning has celebrated many successes, its results often hinge on the meticulous selection of hyperparameters (HPs). However, the time-consuming nature of deep learning training makes HP optimization (HPO) a costly endeavor, slowing down the development of efficient HPO tools. While zero-cost benchmarks, which provide performance and runtime without actual training, offer a solution for non-parallel setups, they fall short in parallel setups as each worker must communicate its queried runtime to return its evaluation in the exact order. This work addresses this challenge by introducing a user-friendly Python package that facilitates efficient parallel HPO with zero-cost benchmarks. Our approach calculates the exact return order based on the information stored in file system, eliminating the need for long waiting times and enabling much faster HPO evaluations. We first verify the correctness of our approach through extensive testing and the experiments with 6 popular HPO libraries show its applicability to diverse libraries and its ability to achieve over 1000x speedup compared to a traditional approach. Our package can be installed via pip install mfhpo-simulator.  ( 2 min )
    Improving generalisation via anchor multivariate analysis
    arXiv:2403.01865v1 Announce Type: cross Abstract: We introduce a causal regularisation extension to anchor regression (AR) for improved out-of-distribution (OOD) generalisation. We present anchor-compatible losses, aligning with the anchor framework to ensure robustness against distribution shifts. Various multivariate analysis (MVA) algorithms, such as (Orthonormalized) PLS, RRR, and MLR, fall within the anchor framework. We observe that simple regularisation enhances robustness in OOD settings. Estimators for selected algorithms are provided, showcasing consistency and efficacy in synthetic and real-world climate science problems. The empirical validation highlights the versatility of anchor regularisation, emphasizing its compatibility with MVA approaches and its role in enhancing replicability while guarding against distribution shifts. The extended AR framework advances causal inference methodologies, addressing the need for reliable OOD generalisation.  ( 2 min )
    One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
    arXiv:2403.01849v1 Announce Type: cross Abstract: Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of the text prompt instead of the extensively studied model weights (frozen in this work). We first show that the effectiveness of both adversarial attack and defense are sensitive to the used text prompt. Inspired by this, we propose a method to improve resilience to adversarial attacks by learning a robust text prompt for VLMs. The proposed method, named Adversarial Prompt Tuning (APT), is effective while being both computationally and data efficient. Extensive experiments are conducted across 15 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show APT's superiority over hand-engineered prompts and other state-of-the-art adaption methods. APT demonstrated excellent abilities in terms of the in-distribution performance and the generalization under input distribution shift and across datasets. Surprisingly, by simply adding one learned word to the prompts, APT can significantly boost the accuracy and robustness (epsilon=4/255) over the hand-engineered prompts by +13% and +8.5% on average respectively. The improvement further increases, in our most effective setting, to +26.4% for accuracy and +16.7% for robustness. Code is available at https://github.com/TreeLLi/APT.  ( 2 min )
    RCoCo: Contrastive Collective Link Prediction across Multiplex Network in Riemannian Space
    arXiv:2403.01864v1 Announce Type: cross Abstract: Link prediction typically studies the probability of future interconnection among nodes with the observation in a single social network. More often than not, real scenario is presented as a multiplex network with common (anchor) users active in multiple social networks. In the literature, most existing works study either the intra-link prediction in a single network or inter-link prediction among networks (a.k.a. network alignment), and consider two learning tasks are independent from each other, which is still away from the fact. On the representation space, the vast majority of existing methods are built upon the traditional Euclidean space, unaware of the inherent geometry of social networks. The third issue is on the scarce anchor users. Annotating anchor users is laborious and expensive, and thus it is impractical to work with quantities of anchor users. Herein, in light of the issues above, we propose to study a challenging yet practical problem of Geometry-aware Collective Link Prediction across Multiplex Network. To address this problem, we present a novel contrastive model, RCoCo, which collaborates intra- and inter-network behaviors in Riemannian spaces. In RCoCo, we design a curvature-aware graph attention network ($\kappa-$GAT), conducting attention mechanism in Riemannian manifold whose curvature is estimated by the Ricci curvatures over the network. Thereafter, we formulate intra- and inter-contrastive loss in the manifolds, in which we augment graphs by exploring the high-order structure of community and information transfer on anchor users. Finally, we conduct extensive experiments with 14 strong baselines on 8 real-world datasets, and show the effectiveness of RCoCo.  ( 3 min )
    Making Pre-trained Language Models Great on Tabular Prediction
    arXiv:2403.01841v1 Announce Type: cross Abstract: The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, due to the heterogeneity among tables, such DNN bonus is still far from being well exploited on tabular data prediction (e.g., regression or classification tasks). Condensing knowledge from diverse domains, language models (LMs) possess the capability to comprehend feature names from various tables, potentially serving as versatile learners in transferring knowledge across distinct tables and diverse prediction tasks, but their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, a specifically pre-trained LM model for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names. Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads the performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in typical tabular data regime.  ( 2 min )
    Macroscopic auxiliary asymptotic preserving neural networks for the linear radiative transfer equations
    arXiv:2403.01820v1 Announce Type: cross Abstract: We develop a Macroscopic Auxiliary Asymptotic-Preserving Neural Network (MA-APNN) method to solve the time-dependent linear radiative transfer equations (LRTEs), which have a multi-scale nature and high dimensionality. To achieve this, we utilize the Physics-Informed Neural Networks (PINNs) framework and design a new adaptive exponentially weighted Asymptotic-Preserving (AP) loss function, which incorporates the macroscopic auxiliary equation that is derived from the original transfer equation directly and explicitly contains the information of the diffusion limit equation. Thus, as the scale parameter tends to zero, the loss function gradually transitions from the transport state to the diffusion limit state. In addition, the initial data, boundary conditions, and conservation laws serve as the regularization terms for the loss. We present several numerical examples to demonstrate the effectiveness of MA-APNNs.  ( 2 min )
    Towards Fair and Efficient Learning-based Congestion Control
    arXiv:2403.01798v1 Announce Type: cross Abstract: Recent years have witnessed a plethora of learning-based solutions for congestion control (CC) that demonstrate better performance over traditional TCP schemes. However, they fail to provide consistently good convergence properties, including {\em fairness}, {\em fast convergence} and {\em stability}, due to the mismatch between their objective functions and these properties. Despite being intuitive, integrating these properties into existing learning-based CC is challenging, because: 1) their training environments are designed for the performance optimization of single flow but incapable of cooperative multi-flow optimization, and 2) there is no directly measurable metric to represent these properties into the training objective function. We present Astraea, a new learning-based congestion control that ensures fast convergence to fairness with stability. At the heart of Astraea is a multi-agent deep reinforcement learning framework that explicitly optimizes these convergence properties during the training process by enabling the learning of interactive policy between multiple competing flows, while maintaining high performance. We further build a faithful multi-flow environment that emulates the competing behaviors of concurrent flows, explicitly expressing convergence properties to enable their optimization during training. We have fully implemented Astraea and our comprehensive experiments show that Astraea can quickly converge to fairness point and exhibit better stability than its counterparts. For example, \sys achieves near-optimal bandwidth sharing (i.e., fairness) when multiple flows compete for the same bottleneck, delivers up to 8.4$\times$ faster convergence speed and 2.8$\times$ smaller throughput deviation, while achieving comparable or even better performance over prior solutions.  ( 2 min )
    Tsallis Entropy Regularization for Linearly Solvable MDP and Linear Quadratic Regulator
    arXiv:2403.01805v1 Announce Type: cross Abstract: Shannon entropy regularization is widely adopted in optimal control due to its ability to promote exploration and enhance robustness, e.g., maximum entropy reinforcement learning known as Soft Actor-Critic. In this paper, Tsallis entropy, which is a one-parameter extension of Shannon entropy, is used for the regularization of linearly solvable MDP and linear quadratic regulators. We derive the solution for these problems and demonstrate its usefulness in balancing between exploration and sparsity of the obtained control law.  ( 2 min )
    Hybrid data-driven and physics-informed regularized learning of cyclic plasticity with Neural Networks
    arXiv:2403.01776v1 Announce Type: cross Abstract: An extendable, efficient and explainable Machine Learning approach is proposed to represent cyclic plasticity and replace conventional material models based on the Radial Return Mapping algorithm. High accuracy and stability by means of a limited amount of training data is achieved by implementing physics-informed regularizations and the back stress information. The off-loading of the Neural Network is applied to the maximal extent. The proposed model architecture is simpler and more efficient compared to existing solutions from the literature, while representing a complete three-dimensional material model. The validation of the approach is carried out by means of surrogate data obtained with the Armstrong-Frederick kinematic hardening model. The Mean Squared Error is assumed as the loss function which stipulates several restrictions: deviatoric character of internal variables, compliance with the flow rule, the differentiation of elastic and plastic steps and the associativity of the flow rule. The latter, however, has a minor impact on the accuracy, which implies the generalizability of the model for a broad spectrum of evolution laws for internal variables. Numerical tests simulating several load cases are shown in detail and validated for accuracy and stability.  ( 2 min )
    AFBT GAN: enhanced explainability and diagnostic performance for cognitive decline by counterfactual generative adversarial network
    arXiv:2403.01758v1 Announce Type: cross Abstract: Existing explanation results of functional connectivity (FC) are normally generated by using classification result labels and correlation analysis methods such as Pearson's correlation or gradient backward. However, the diagnostic model is still trained on the black box model and might lack the attention of FCs in important regions during the training. To enhance the explainability and improve diagnostic performance, providing prior knowledge on neurodegeneration-related regions when healthy subjects (HC) develop into subject cognitive decline (SCD) and mild cognitive impairment (MCI) for the diagnostic model is a key step. To better determine the neurodegeneration-related regions, we employ counterfactual reasoning to generate the target label FC matrices derived from source label FC and then subtract source label FC with target label FC. The counterfactual reasoning architecture is constructed by adaptive forward and backward transformer generative adversarial network (AFBT GAN), which is specifically designed by network property in FC and inverse patch embedding operation in the transformer. The specific design can make the model focus more on the current network correlation and employ the global insight of the transformer to reconstruct FC, which both help the generation of high-quality target label FC. The validation experiments are conducted on both clinical and public datasets, the generated attention map are both vital correlated to cognitive function and the diagnostic performance is also significant. The code is available at https://github.com/SXR3015/AFBT-GAN.  ( 3 min )
    Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy
    arXiv:2403.01734v1 Announce Type: cross Abstract: Offline goal-conditioned reinforcement learning (GCRL) aims at solving goal-reaching tasks with sparse rewards from an offline dataset. While prior work has demonstrated various approaches for agents to learn near-optimal policies, these methods encounter limitations when dealing with diverse constraints in complex environments, such as safety constraints. Some of these approaches prioritize goal attainment without considering safety, while others excessively focus on safety at the expense of training efficiency. In this paper, we study the problem of constrained offline GCRL and propose a new method called Recovery-based Supervised Learning (RbSL) to accomplish safety-critical tasks with various goals. To evaluate the method performance, we build a benchmark based on the robot-fetching environment with a randomly positioned obstacle and use expert or random policies to generate an offline dataset. We compare RbSL with three offline GCRL algorithms and one offline safe RL algorithm. As a result, our method outperforms the existing state-of-the-art methods to a large extent. Furthermore, we validate the practicality and effectiveness of RbSL by deploying it on a real Panda manipulator. Code is available at https://github.com/Sunlighted/RbSL.git.  ( 2 min )
    How Multimodal Integration Boost the Performance of LLM for Optimization: Case Study on Capacitated Vehicle Routing Problems
    arXiv:2403.01757v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have notably positioned them as capable tools for addressing complex optimization challenges. Despite this recognition, a predominant limitation of existing LLM-based optimization methods is their struggle to capture the relationships among decision variables when relying exclusively on numerical text prompts, especially in high-dimensional problems. Keeping this in mind, we first propose to enhance the optimization performance using multimodal LLM capable of processing both textual and visual prompts for deeper insights of the processed optimization problem. This integration allows for a more comprehensive understanding of optimization problems, akin to human cognitive processes. We have developed a multimodal LLM-based optimization framework that simulates human problem-solving workflows, thereby offering a more nuanced and effective analysis. The efficacy of this method is evaluated through extensive empirical studies focused on a well-known combinatorial optimization problem, i.e., capacitated vehicle routing problem. The results are compared against those obtained from the LLM-based optimization algorithms that rely solely on textual prompts, demonstrating the significant advantages of our multimodal approach.  ( 2 min )
    Statistical Mechanics of Dynamical System Identification
    arXiv:2403.01723v1 Announce Type: cross Abstract: Recovering dynamical equations from observed noisy data is the central challenge of system identification. We develop a statistical mechanical approach to analyze sparse equation discovery algorithms, which typically balance data fit and parsimony through a trial-and-error selection of hyperparameters. In this framework, statistical mechanics offers tools to analyze the interplay between complexity and fitness, in analogy to that done between entropy and energy. To establish this analogy, we define the optimization procedure as a two-level Bayesian inference problem that separates variable selection from coefficient values and enables the computation of the posterior parameter distribution in closed form. A key advantage of employing statistical mechanical concepts, such as free energy and the partition function, is in the quantification of uncertainty, especially in in the low-data limit; frequently encountered in real-world applications. As the data volume increases, our approach mirrors the thermodynamic limit, leading to distinct sparsity- and noise-induced phase transitions that delineate correct from incorrect identification. This perspective of sparse equation discovery, is versatile and can be adapted to various other equation discovery algorithms.  ( 2 min )
    Soft-constrained Schrodinger Bridge: a Stochastic Control Approach
    arXiv:2403.01717v1 Announce Type: cross Abstract: Schr\"{o}dinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process with a pre-specified terminal distribution $\mu_T$. We propose to generalize this stochastic control problem by allowing the terminal distribution to differ from $\mu_T$ but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schr\"{o}dinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of $\mu_T$ and some other distribution. This result is further extended to a time series setting. One application of SSB is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.  ( 2 min )
    CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables
    arXiv:2403.01673v1 Announce Type: cross Abstract: For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the difficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as core predictor, CATS achieves state-of-the-art, significantly reducing complexity and parameters compared to previous multivariate models, marking it an efficient and transferable MTSF solution.  ( 2 min )
    Can LLMs Generate Architectural Design Decisions? -An Exploratory Empirical study
    arXiv:2403.01709v1 Announce Type: cross Abstract: Architectural Knowledge Management (AKM) involves the organized handling of information related to architectural decisions and design within a project or organization. An essential artifact of AKM is the Architecture Decision Records (ADR), which documents key design decisions. ADRs are documents that capture decision context, decision made and various aspects related to a design decision, thereby promoting transparency, collaboration, and understanding. Despite their benefits, ADR adoption in software development has been slow due to challenges like time constraints and inconsistent uptake. Recent advancements in Large Language Models (LLMs) may help bridge this adoption gap by facilitating ADR generation. However, the effectiveness of LLM for ADR generation or understanding is something that has not been explored. To this end, in this work, we perform an exploratory study that aims to investigate the feasibility of using LLM for the generation of ADRs given the decision context. In our exploratory study, we utilize GPT and T5-based models with 0-shot, few-shot, and fine-tuning approaches to generate the Decision of an ADR given its Context. Our results indicate that in a 0-shot setting, state-of-the-art models such as GPT-4 generate relevant and accurate Design Decisions, although they fall short of human-level performance. Additionally, we observe that more cost-effective models like GPT-3.5 can achieve similar outcomes in a few-shot setting, and smaller models such as Flan-T5 can yield comparable results after fine-tuning. To conclude, this exploratory study suggests that LLM can generate Design Decisions, but further research is required to attain human-level generation and establish standardized widespread adoption.  ( 3 min )
    Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
    arXiv:2403.01636v1 Announce Type: cross Abstract: Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for its wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on the improved statistical efficiency by assuming a shared structure across tasks, exploration--a crucial aspect of RL--has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with myopic exploration design like $\epsilon$-greedy that are inefficient in general can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide applications of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample-efficiency.  ( 2 min )
    Application of Neural Ordinary Differential Equations for Tokamak Plasma Dynamics Analysis
    arXiv:2403.01635v1 Announce Type: cross Abstract: In the quest for controlled thermonuclear fusion, tokamaks present complex challenges in understanding burning plasma dynamics. This study introduces a multi-region multi-timescale transport model, employing Neural Ordinary Differential Equations (Neural ODEs) to simulate the intricate energy transfer processes within tokamaks. Our methodology leverages Neural ODEs for the numerical derivation of diffusivity parameters from DIII-D tokamak experimental data, enabling the precise modeling of energy interactions between electrons and ions across various regions, including the core, edge, and scrape-off layer. These regions are conceptualized as distinct nodes, capturing the critical timescales of radiation and transport processes essential for efficient tokamak operation. Validation against DIII-D plasmas under various auxiliary heating conditions demonstrates the model's effectiveness, ultimately shedding light on ways to enhance tokamak performance with deep learning.  ( 2 min )
    Limits to classification performance by relating Kullback-Leibler divergence to Cohen's Kappa
    arXiv:2403.01571v1 Announce Type: cross Abstract: The performance of machine learning classification algorithms are evaluated by estimating metrics, often from the confusion matrix, using training data and cross-validation. However, these do not prove that the best possible performance has been achieved. Fundamental limits to error rates can be estimated using information distance measures. To this end, the confusion matrix has been formulated to comply with the Chernoff-Stein Lemma. This links the error rates to the Kullback-Leibler divergences between the probability density functions describing the two classes. This leads to a key result that relates Cohen's Kappa to the Resistor Average Distance which is the parallel resistor combination of the two Kullback-Leibler divergences. The Resistor Average Distance has units of bits and is estimated from the same training data used by the classification algorithm, using kNN estimates of the KullBack-Leibler divergences. The classification algorithm gives the confusion matrix and Kappa. Theory and methods are discussed in detail and then applied to Monte Carlo data and real datasets. Four very different real datasets - Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification - are analysed, with both continuous and discrete values, and their classification performance compared to the expected theoretical limit. In all cases this analysis shows that the algorithms could not have performed any better due to the underlying probability density functions for the two classes. Important lessons are learnt on how to predict the performance of algorithms for imbalanced data using training datasets that are approximately balanced. Machine learning is very powerful but classification performance ultimately depends on the quality of the data and the relevance of the variables to the problem.  ( 3 min )
    SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
    arXiv:2403.01599v1 Announce Type: cross Abstract: We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, which overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leveraged the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.  ( 2 min )
    SERVAL: Synergy Learning between Vertical Models and LLMs towards Oracle-Level Zero-shot Medical Prediction
    arXiv:2403.01570v1 Announce Type: cross Abstract: Recent development of large language models (LLMs) has exhibited impressive zero-shot proficiency on generic and common sense questions. However, LLMs' application on domain-specific vertical questions still lags behind, primarily due to the humiliation problems and deficiencies in vertical knowledge. Furthermore, the vertical data annotation process often requires labor-intensive expert involvement, thereby presenting an additional challenge in enhancing the model's vertical capabilities. In this paper, we propose SERVAL, a synergy learning pipeline designed for unsupervised development of vertical capabilities in both LLMs and small models by mutual enhancement. Specifically, SERVAL utilizes the LLM's zero-shot outputs as annotations, leveraging its confidence to teach a robust vertical model from scratch. Reversely, the trained vertical model guides the LLM fine-tuning to enhance its zero-shot capability, progressively improving both models through an iterative process. In medical domain, known for complex vertical knowledge and costly annotations, comprehensive experiments show that, without access to any gold labels, SERVAL with the synergy learning of OpenAI GPT-3.5 and a simple model attains fully-supervised competitive performance across ten widely used medical datasets. These datasets represent vertically specialized medical diagnostic scenarios (e.g., diabetes, heart diseases, COVID-19), highlighting the potential of SERVAL in refining the vertical capabilities of LLMs and training vertical models from scratch, all achieved without the need for annotations.  ( 3 min )
    In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation
    arXiv:2403.01548v1 Announce Type: cross Abstract: Large language models (LLMs) frequently hallucinate and produce factual errors, yet our understanding of why they make these errors remains limited. In this study, we delve into the underlying mechanisms of LLM hallucinations from the perspective of inner representations, and discover a salient pattern associated with hallucinations: correct generations tend to have sharper context activations in the hidden states of the in-context tokens, compared to the incorrect ones. Leveraging this insight, we propose an entropy-based metric to quantify the ``sharpness'' among the in-context hidden states and incorporate it into the decoding process to formulate a constrained decoding approach. Experiments on various knowledge-seeking and hallucination benchmarks demonstrate our approach's consistent effectiveness, for example, achieving up to an 8.6 point improvement on TruthfulQA. We believe this study can improve our understanding of hallucinations and serve as a practical solution for hallucination mitigation.  ( 2 min )
    Mixed-Strategy Nash Equilibrium for Crowd Navigation
    arXiv:2403.01537v1 Announce Type: cross Abstract: We address the problem of finding mixed-strategy Nash equilibrium for crowd navigation. Mixed-strategy Nash equilibrium provides a rigorous model for the robot to anticipate uncertain yet cooperative human behavior in crowds, but the computation cost is often too high for scalable and real-time decision-making. Here we prove that a simple iterative Bayesian updating scheme converges to the Nash equilibrium of a mixed-strategy social navigation game. Furthermore, we propose a data-driven framework to construct the game by initializing agent strategies as Gaussian processes learned from human datasets. Based on the proposed mixed-strategy Nash equilibrium model, we develop a sampling-based crowd navigation framework that can be integrated into existing navigation methods and runs in real-time on a laptop CPU. We evaluate our framework in both simulated environments and real-world human datasets in unstructured environments. Our framework consistently outperforms both non-learning and learning-based methods on both safety and navigation efficiency and reaches human-level crowd navigation performance on top of a meta-planner.  ( 2 min )
    Hyperspectral Image Analysis in Single-Modal and Multimodal setting using Deep Learning Techniques
    arXiv:2403.01546v1 Announce Type: cross Abstract: Hyperspectral imaging provides precise classification for land use and cover due to its exceptional spectral resolution. However, the challenges of high dimensionality and limited spatial resolution hinder its effectiveness. This study addresses these challenges by employing deep learning techniques to efficiently process, extract features, and classify data in an integrated manner. To enhance spatial resolution, we integrate information from complementary modalities such as LiDAR and SAR data through multimodal learning. Moreover, adversarial learning and knowledge distillation are utilized to overcome issues stemming from domain disparities and missing modalities. We also tailor deep learning architectures to suit the unique characteristics of HSI data, utilizing 1D convolutional and recurrent neural networks to handle its continuous spectral dimension. Techniques like visual attention and feedback connections within the architecture bolster the robustness of feature extraction. Additionally, we tackle the issue of limited training samples through self-supervised learning methods, employing autoencoders for dimensionality reduction and exploring semi-supervised learning techniques that leverage unlabeled data. Our proposed approaches are evaluated across various HSI datasets, consistently outperforming existing state-of-the-art techniques.  ( 2 min )
    Fast Ergodic Search with Kernel Functions
    arXiv:2403.01536v1 Announce Type: cross Abstract: Ergodic search enables optimal exploration of an information distribution while guaranteeing the asymptotic coverage of the search space. However, current methods typically have exponential computation complexity in the search space dimension and are restricted to Euclidean space. We introduce a computationally efficient ergodic search method. Our contributions are two-fold. First, we develop a kernel-based ergodic metric and generalize it from Euclidean space to Lie groups. We formally prove the proposed metric is consistent with the standard ergodic metric while guaranteeing linear complexity in the search space dimension. Secondly, we derive the first-order optimality condition of the kernel ergodic metric for nonlinear systems, which enables efficient trajectory optimization. Comprehensive numerical benchmarks show that the proposed method is at least two orders of magnitude faster than the state-of-the-art algorithm. Finally, we demonstrate the proposed algorithm with a peg-in-hole insertion task. We formulate the problem as a coverage task in the space of SE(3) and use a 30-second-long human demonstration as the prior distribution for ergodic coverage. Ergodicity guarantees the asymptotic solution of the peg-in-hole problem so long as the solution resides within the prior information distribution, which is seen in the 100\% success rate.  ( 2 min )
    Data-driven local operator finding for reduced-order modelling of plasma systems: II. Application to parametric dynamics
    arXiv:2403.01532v1 Announce Type: cross Abstract: Real-world systems often exhibit dynamics influenced by various parameters, either inherent or externally controllable, necessitating models capable of reliably capturing these parametric behaviors. Plasma technologies exemplify such systems. For example, phenomena governing global dynamics in Hall thrusters (a spacecraft propulsion technology) vary with various parameters, such as the "self-sustained electric field". In this Part II, following on the introduction of our novel data-driven local operator finding algorithm, Phi Method, in Part I, we showcase the method's effectiveness in learning parametric dynamics to predict system behavior across unseen parameter spaces. We present two adaptations: the "parametric Phi Method" and the "ensemble Phi Method", which are demonstrated through 2D fluid-flow-past-a-cylinder and 1D Hall-thruster-plasma-discharge problems. Comparative evaluation against parametric OPT-DMD in the fluid case demonstrates superior predictive performance of the parametric Phi Method. Across both test cases, parametric and ensemble Phi Method reliably recover governing parametric PDEs and offer accurate predictions over test parameters. Ensemble ROM analysis underscores Phi Method's robust learning of dominant dynamic coefficients with high confidence.  ( 2 min )
    Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
    arXiv:2403.01518v1 Announce Type: cross Abstract: We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency),sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens.  ( 2 min )
    Data-driven local operator finding for reduced-order modelling of plasma systems: I. Concept and verifications
    arXiv:2403.01523v1 Announce Type: cross Abstract: Reduced-order plasma models that can efficiently predict plasma behavior across various settings and configurations are highly sought after yet elusive. The demand for such models has surged in the past decade due to their potential to facilitate scientific research and expedite the development of plasma technologies. In line with the advancements in computational power and data-driven methods, we introduce the "Phi Method" in this two-part article. Part I presents this novel algorithm, which employs constrained regression on a candidate term library informed by numerical discretization schemes to discover discretized systems of differential equations. We demonstrate Phi Method's efficacy in deriving reliable and robust reduced-order models (ROMs) for three test cases: the Lorenz attractor, flow past a cylinder, and a 1D Hall-thruster-representative plasma. Part II will delve into the method's application for parametric dynamics discovery. Our results show that ROMs derived from the Phi Method provide remarkably accurate predictions of systems' behavior, whether derived from steady-state or transient-state data. This underscores the method's potential for transforming plasma system modeling.  ( 2 min )
    Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection
    arXiv:2403.01485v1 Announce Type: cross Abstract: Likelihood-based deep generative models such as score-based diffusion models and variational autoencoders are state-of-the-art machine learning models approximating high-dimensional distributions of data such as images, text, or audio. One of many downstream tasks they can be naturally applied to is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al. which we reproduce showed that deep generative models consistently infer higher log-likelihoods for OOD data than data they were trained on, marking an open problem. In this work, we analyse using the gradient of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square distributed, layer-wise gradient norms as features. We combine these features to make a simple, model-agnostic and hyperparameter-free method for OOD detection which estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative models and image dataset pairings.  ( 3 min )
    WARDEN: Multi-Directional Backdoor Watermarks for Embedding-as-a-Service Copyright Protection
    arXiv:2403.01472v1 Announce Type: cross Abstract: Embedding as a Service (EaaS) has become a widely adopted solution, which offers feature extraction capabilities for addressing various downstream tasks in Natural Language Processing (NLP). Prior studies have shown that EaaS can be prone to model extraction attacks; nevertheless, this concern could be mitigated by adding backdoor watermarks to the text embeddings and subsequently verifying the attack models post-publication. Through the analysis of the recent watermarking strategy for EaaS, EmbMarker, we design a novel CSE (Clustering, Selection, Elimination) attack that removes the backdoor watermark while maintaining the high utility of embeddings, indicating that the previous watermarking approach can be breached. In response to this new threat, we propose a new protocol to make the removal of watermarks more challenging by incorporating multiple possible watermark directions. Our defense approach, WARDEN, notably increases the stealthiness of watermarks and empirically has been shown effective against CSE attack.  ( 2 min )
    Collective Certified Robustness against Graph Injection Attacks
    arXiv:2403.01423v1 Announce Type: cross Abstract: We investigate certified robustness for GNNs under graph injection attacks. Existing research only provides sample-wise certificates by verifying each node independently, leading to very limited certifying performance. In this paper, we present the first collective certificate, which certifies a set of target nodes simultaneously. To achieve it, we formulate the problem as a binary integer quadratic constrained linear programming (BQCLP). We further develop a customized linearization technique that allows us to relax the BQCLP into linear programming (LP) that can be efficiently solved. Through comprehensive experiments, we demonstrate that our collective certification scheme significantly improves certification performance with minimal computational overhead. For instance, by solving the LP within 1 minute on the Citeseer dataset, we achieve a significant increase in the certified ratio from 0.0% to 81.2% when the injected node number is 5% of the graph size. Our step marks a crucial step towards making provable defense more practical.  ( 2 min )
    Enhancing Data Provenance and Model Transparency in Federated Learning Systems - A Database Approach
    arXiv:2403.01451v1 Announce Type: cross Abstract: Federated Learning (FL) presents a promising paradigm for training machine learning models across decentralized edge devices while preserving data privacy. Ensuring the integrity and traceability of data across these distributed environments, however, remains a critical challenge. The ability to create transparent artificial intelligence, such as detailing the training process of a machine learning model, has become an increasingly prominent concern due to the large number of sensitive (hyper)parameters it utilizes; thus, it is imperative to strike a reasonable balance between openness and the need to protect sensitive information. In this paper, we propose one of the first approaches to enhance data provenance and model transparency in federated learning systems. Our methodology leverages a combination of cryptographic techniques and efficient model management to track the transformation of data throughout the FL process, and seeks to increase the reproducibility and trustworthiness of a trained FL model. We demonstrate the effectiveness of our approach through experimental evaluations on diverse FL scenarios, showcasing its ability to tackle accountability and explainability across the board. Our findings show that our system can greatly enhance data transparency in various FL environments by storing chained cryptographic hashes and client model snapshots in our proposed design for data decoupled FL. This is made possible by also employing multiple optimization techniques which enables comprehensive data provenance without imposing substantial computational loads. Extensive experimental results suggest that integrating a database subsystem into federated learning systems can improve data provenance in an efficient manner, encouraging secure FL adoption in privacy-sensitive applications and paving the way for future advancements in FL transparency and security features.  ( 3 min )
    Large-scale variational Gaussian state-space models
    arXiv:2403.01371v1 Announce Type: cross Abstract: We introduce an amortized variational inference algorithm and structured variational approximation for state-space models with nonlinear dynamics driven by Gaussian noise. Importantly, the proposed framework allows for efficient evaluation of the ELBO and low-variance stochastic gradient estimates without resorting to diagonal Gaussian approximations by exploiting (i) the low-rank structure of Monte-Carlo approximations to marginalize the latent state through the dynamics (ii) an inference network that approximates the update step with low-rank precision matrix updates (iii) encoding current and future observations into pseudo observations -- transforming the approximate smoothing problem into an (easier) approximate filtering problem. Overall, the necessary statistics and ELBO can be computed in $O(TL(Sr + S^2 + r^2))$ time where $T$ is the series length, $L$ is the state-space dimensionality, $S$ are the number of samples used to approximate the predict step statistics, and $r$ is the rank of the approximate precision matrix update in the update step (which can be made of much lower dimension than $L$).  ( 2 min )
    A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
    arXiv:2403.01369v1 Announce Type: cross Abstract: Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement -- model is causal, the compute footprint is small. Additionally, we focus on low SNR conditions where such models struggle to provide good enhancement. In order to systematically examine how SSL representations impact performance of such enhancement models, we propose a variety of techniques to utilize these embeddings which include different forms of knowledge-distillation and pre-training.  ( 2 min )
    a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification
    arXiv:2403.01355v1 Announce Type: cross Abstract: Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to combine speaker and spoof detectors. In this paper, we propose an architecture-agnostic detection cost function (a-DCF). A generalisation of the original DCF used widely for the assessment of automatic speaker verification (ASV), the a-DCF is designed for the evaluation of spoofing-robust ASV. Like the DCF, the a-DCF reflects the cost of decisions in a Bayes risk sense, with explicitly defined class priors and detection cost model. We demonstrate the merit of the a-DCF through the benchmarking evaluation of architecturally-heterogeneous spoofing-robust ASV solutions.  ( 2 min )
    Chaining thoughts and LLMs to learn DNA structural biophysics
    arXiv:2403.01332v1 Announce Type: cross Abstract: The future development of an AI scientist, a tool that is capable of integrating a variety of experimental data and generating testable hypotheses, holds immense potential. So far, bespoke machine learning models have been created to specialize in singular scientific tasks, but otherwise lack the flexibility of a general purpose model. Here, we show that a general purpose large language model, chatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA. We find that both fine-tuning models to return chain-of-thought responses and chaining together models fine-tuned for subtasks have an enhanced ability to analyze and design DNA sequences and their structures.  ( 2 min )
    High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social Media
    arXiv:2403.01318v1 Announce Type: cross Abstract: Motivated by the empirical power law of the distributions of credits (e.g., the number of "likes") of viral posts in social media, we introduce the high-dimensional tail index regression and methods of estimation and inference for its parameters. We propose a regularized estimator, establish its consistency, and derive its convergence rate. To conduct inference, we propose to debias the regularized estimate, and establish the asymptotic normality of the debiased estimator. Simulation studies support our theory. These methods are applied to text analyses of viral posts in X (formerly Twitter) concerning LGBTQ+.  ( 2 min )
    VNLP: Turkish NLP Package
    arXiv:2403.01309v1 Announce Type: cross Abstract: In this work, we present VNLP: the first dedicated, complete, open-source, well-documented, lightweight, production-ready, state-of-the-art Natural Language Processing (NLP) package for the Turkish language. It contains a wide variety of tools, ranging from the simplest tasks, such as sentence splitting and text normalization, to the more advanced ones, such as text and token classification models. Its token classification models are based on "Context Model", a novel architecture that is both an encoder and an auto-regressive model. NLP tasks solved by VNLP models include but are not limited to Sentiment Analysis, Named Entity Recognition, Morphological Analysis \& Disambiguation and Part-of-Speech Tagging. Moreover, it comes with pre-trained word embeddings and corresponding SentencePiece Unigram tokenizers. VNLP has an open-source GitHub repository, ReadtheDocs documentation, PyPi package for convenient installation, Python and command-line API and a demo page to test all the functionality. Consequently, our main contribution is a complete, compact, easy-to-install and easy-to-use NLP package for Turkish.  ( 2 min )
    VBART: The Turkish LLM
    arXiv:2403.01308v1 Announce Type: cross Abstract: We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART are compact LLMs based on good ideas leveraged from BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that having a pre-trained LLM for Turkish outperforms up to 3x multilingual models, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is 7x more efficient than OpenAI's multilingual tokenizer. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevancy of Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned web corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.  ( 2 min )
    Supplier Recommendation in Online Procurement
    arXiv:2403.01301v1 Announce Type: cross Abstract: Supply chain optimization is key to a healthy and profitable business. Many companies use online procurement systems to agree contracts with suppliers. It is vital that the most competitive suppliers are invited to bid for such contracts. In this work, we propose a recommender system to assist with supplier discovery in road freight online procurement. Our system is able to provide personalized supplier recommendations, taking into account customer needs and preferences. This is a novel application of recommender systems, calling for design choices that fit the unique requirements of online procurement. Our preliminary results, using real-world data, are promising.  ( 2 min )
    A Photonic Physically Unclonable Function's Resilience to Multiple-Valued Machine Learning Attacks
    arXiv:2403.01299v1 Announce Type: cross Abstract: Physically unclonable functions (PUFs) identify integrated circuits using nonlinearly-related challenge-response pairs (CRPs). Ideally, the relationship between challenges and corresponding responses is unpredictable, even if a subset of CRPs is known. Previous work developed a photonic PUF offering improved security compared to non-optical counterparts. Here, we investigate this PUF's susceptibility to Multiple-Valued-Logic-based machine learning attacks. We find that approximately 1,000 CRPs are necessary to train models that predict response bits better than random chance. Given the significant challenge of acquiring a vast number of CRPs from a photonic PUF, our results demonstrate photonic PUF resilience against such attacks.  ( 2 min )
    SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
    arXiv:2403.01248v1 Announce Type: cross Abstract: This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.  ( 2 min )
    API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access
    arXiv:2403.01216v1 Announce Type: cross Abstract: This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.  ( 2 min )
    REWIND Dataset: Privacy-preserving Speaking Status Segmentation from Multimodal Body Movement Signals in the Wild
    arXiv:2403.01229v1 Announce Type: cross Abstract: Recognizing speaking in humans is a central task towards understanding social interactions. Ideally, speaking would be detected from individual voice recordings, as done previously for meeting scenarios. However, individual voice recordings are hard to obtain in the wild, especially in crowded mingling scenarios due to cost, logistics, and privacy concerns. As an alternative, machine learning models trained on video and wearable sensor data make it possible to recognize speech by detecting its related gestures in an unobtrusive, privacy-preserving way. These models themselves should ideally be trained using labels obtained from the speech signal. However, existing mingling datasets do not contain high quality audio recordings. Instead, speaking status annotations have often been inferred by human annotators from video, without validation of this approach against audio-based ground truth. In this paper we revisit no-audio speaking status estimation by presenting the first publicly available multimodal dataset with high-quality individual speech recordings of 33 subjects in a professional networking event. We present three baselines for no-audio speaking status segmentation: a) from video, b) from body acceleration (chest-worn accelerometer), c) from body pose tracks. In all cases we predict a 20Hz binary speaking status signal extracted from the audio, a time resolution not available in previous datasets. In addition to providing the signals and ground truth necessary to evaluate a wide range of speaking status detection methods, the availability of audio in REWIND makes it suitable for cross-modality studies not feasible with previous mingling datasets. Finally, our flexible data consent setup creates new challenges for multimodal systems under missing modalities.  ( 3 min )
    A Composite Decomposition Method for Large-Scale Global Optimization
    arXiv:2403.01192v1 Announce Type: cross Abstract: Cooperative co-evolution (CC) algorithms, based on the divide-and-conquer strategy, have emerged as the predominant approach to solving large-scale global optimization (LSGO) problems. The efficiency and accuracy of the grouping stage significantly impact the performance of the optimization process. While the general separability grouping (GSG) method has overcome the limitation of previous differential grouping (DG) methods by enabling the decomposition of non-additively separable functions, it suffers from high computational complexity. To address this challenge, this article proposes a composite separability grouping (CSG) method, seamlessly integrating DG and GSG into a problem decomposition framework to utilize the strengths of both approaches. CSG introduces a step-by-step decomposition framework that accurately decomposes various problem types using fewer computational resources. By sequentially identifying additively, multiplicatively and generally separable variables, CSG progressively groups non-separable variables by recursively considering the interactions between each non-separable variable and the formed non-separable groups. Furthermore, to enhance the efficiency and accuracy of CSG, we introduce two innovative methods: a multiplicatively separable variable detection method and a non-separable variable grouping method. These two methods are designed to effectively detect multiplicatively separable variables and efficiently group non-separable variables, respectively. Extensive experimental results demonstrate that CSG achieves more accurate variable grouping with lower computational complexity compared to GSG and state-of-the-art DG series designs.  ( 2 min )
    Leveraging Self-Supervised Learning for Scene Recognition in Child Sexual Abuse Imagery
    arXiv:2403.01183v1 Announce Type: cross Abstract: Crime in the 21st century is split into a virtual and real world. However, the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing nature of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originated from online sources. Therefore, investigation centers and clearinghouses cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene recognition task looks for contextual cues in the environment, being able to group and classify child sexual abuse data without requiring to be trained on sensitive material. The scarcity and limitations of working with child sexual abuse images lead to self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to target tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task and, on average, 2.2 percentage points better performance than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted on sensitive materials.  ( 3 min )
    LLaMoCo: Instruction Tuning of Large Language Models for Optimization Code Generation
    arXiv:2403.01131v1 Announce Type: cross Abstract: Recent research explores optimization using large language models (LLMs) by either iteratively seeking next-step solutions from LLMs or directly prompting LLMs for an optimizer. However, these approaches exhibit inherent limitations, including low operational efficiency, high sensitivity to prompt design, and a lack of domain-specific knowledge. We introduce LLaMoCo, the first instruction-tuning framework designed to adapt LLMs for solving optimization problems in a code-to-code manner. Specifically, we establish a comprehensive instruction set containing well-described problem prompts and effective optimization codes. We then develop a novel two-phase learning strategy that incorporates a contrastive learning-based warm-up procedure before the instruction-tuning phase to enhance the convergence behavior during model fine-tuning. The experiment results demonstrate that a CodeGen (350M) model fine-tuned by our LLaMoCo achieves superior optimization performance compared to GPT-4 Turbo and the other competitors across both synthetic and realistic problem sets. The fine-tuned model and the usage instructions are available at https://anonymous.4open.science/r/LLaMoCo-722A.  ( 2 min )
    A Bayesian Committee Machine Potential for Oxygen-containing Organic Compounds
    arXiv:2403.01158v1 Announce Type: cross Abstract: Understanding the pivotal role of oxygen-containing organic compounds in serving as an energy source for living organisms and contributing to protein formation is crucial in the field of biochemistry. This study addresses the challenge of comprehending protein-protein interactions (PPI) and developing predicitive models for proteins and organic compounds, with a specific focus on quantifying their binding affinity. Here, we introduce the active Bayesian Committee Machine (BCM) potential, specifically designed to predict oxygen-containing organic compounds within eight groups of CHO. The BCM potential adopts a committee-based approach to tackle scalability issues associated with kernel regressors, particularly when dealing with large datasets. Its adaptable structure allows for efficient and cost-effective expansion, maintaing both transferability and scalability. Through systematic benchmarking, we position the sparse BCM potential as a promising contender in the pursuit of a universal machine learning potential.  ( 2 min )
    LAB: Large-Scale Alignment for ChatBots
    arXiv:2403.01081v1 Announce Type: cross Abstract: This work introduces LAB (Large-scale Alignment for chatBots), a novel methodology designed to overcome the scalability challenges in the instruction-tuning phase of large language model (LLM) training. Leveraging a taxonomy-guided synthetic data generation process and a multi-phase tuning framework, LAB significantly reduces reliance on expensive human annotations and proprietary models like GPT-4. We demonstrate that LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data. Thus offering a scalable, cost-effective solution for enhancing LLM capabilities and instruction-following behaviors without the drawbacks of catastrophic forgetting, marking a step forward in the efficient training of LLMs for a wide range of applications.  ( 2 min )
    Extracting Usable Predictions from Quantized Networks through Uncertainty Quantification for OOD Detection
    arXiv:2403.01076v1 Announce Type: cross Abstract: OOD detection has become more pertinent with advances in network design and increased task complexity. Identifying which parts of the data a given network is misclassifying has become as valuable as the network's overall performance. We can compress the model with quantization, but it suffers minor performance loss. The loss of performance further necessitates the need to derive the confidence estimate of the network's predictions. In line with this thinking, we introduce an Uncertainty Quantification(UQ) technique to quantify the uncertainty in the predictions from a pre-trained vision model. We subsequently leverage this information to extract valuable predictions while ignoring the non-confident predictions. We observe that our technique saves up to 80% of ignored samples from being misclassified. The code for the same is available here.  ( 2 min )
    Federated Learning via Lattice Joint Source-Channel Coding
    arXiv:2403.01023v1 Announce Type: cross Abstract: This paper introduces a universal federated learning framework that enables over-the-air computation via digital communications, using a new joint source-channel coding scheme. Without relying on channel state information at devices, this scheme employs lattice codes to both quantize model parameters and exploit interference from the devices. A novel two-layer receiver structure at the server is designed to reliably decode an integer combination of the quantized model parameters as a lattice point for the purpose of aggregation. Numerical experiments validate the effectiveness of the proposed scheme. Even with the challenges posed by channel conditions and device heterogeneity, the proposed scheme markedly surpasses other over-the-air FL strategies.  ( 2 min )
    Neural Field Classifiers via Target Encoding and Classification Loss
    arXiv:2403.01058v1 Announce Type: cross Abstract: Neural field methods have seen great progress in various long-standing tasks in computer vision and computer graphics, including novel view synthesis and geometry reconstruction. As existing neural field methods try to predict some coordinate-based continuous target values, such as RGB for Neural Radiance Field (NeRF), all of these methods are regression models and are optimized by some regression loss. However, are regression models really better than classification models for neural field methods? In this work, we try to visit this very fundamental but overlooked question for neural fields from a machine learning perspective. We successfully propose a novel Neural Field Classifier (NFC) framework which formulates existing neural field methods as classification tasks rather than regression tasks. The proposed NFC can easily transform arbitrary Neural Field Regressor (NFR) into its classification variant via employing a novel Target Encoding module and optimizing a classification loss. By encoding a continuous regression target into a high-dimensional discrete encoding, we naturally formulate a multi-label classification task. Extensive experiments demonstrate the impressive effectiveness of NFC at the nearly free extra computational costs. Moreover, NFC also shows robustness to sparse inputs, corrupted images, and dynamic scenes.  ( 2 min )
    Autonomous Strike UAVs for Counterterrorism Missions: Challenges and Preliminary Solutions
    arXiv:2403.01022v1 Announce Type: cross Abstract: Unmanned Aircraft Vehicles (UAVs) are becoming a crucial tool in modern warfare, primarily due to their cost-effectiveness, risk reduction, and ability to perform a wider range of activities. The use of autonomous UAVs to conduct strike missions against highly valuable targets is the focus of this research. Due to developments in ledger technology, smart contracts, and machine learning, such activities formerly carried out by professionals or remotely flown UAVs are now feasible. Our study provides the first in-depth analysis of challenges and preliminary solutions for successful implementation of an autonomous UAV mission. Specifically, we identify challenges that have to be overcome and propose possible technical solutions for the challenges identified. We also derive analytical expressions for the success probability of an autonomous UAV mission, and describe a machine learning model to train the UAV.  ( 2 min )
    SELFI: Autonomous Self-Improvement with Reinforcement Learning for Social Navigation
    arXiv:2403.00991v1 Announce Type: cross Abstract: Autonomous self-improving robots that interact and improve with experience are key to the real-world deployment of robotic systems. In this paper, we propose an online learning method, SELFI, that leverages online robot experience to rapidly fine-tune pre-trained control policies efficiently. SELFI applies online model-free reinforcement learning on top of offline model-based learning to bring out the best parts of both learning paradigms. Specifically, SELFI stabilizes the online learning process by incorporating the same model-based learning objective from offline pre-training into the Q-values learned with online model-free reinforcement learning. We evaluate SELFI in multiple real-world environments and report improvements in terms of collision avoidance, as well as more socially compliant behavior, measured by a human user study. SELFI enables us to quickly learn useful robotic behaviors with less human interventions such as pre-emptive behavior for the pedestrians, collision avoidance for small and transparent objects, and avoiding travel on uneven floor surfaces. We provide supplementary videos to demonstrate the performance of our fine-tuned policy on our project page.  ( 2 min )
    Merging Text Transformer Models from Different Initializations
    arXiv:2403.00986v1 Announce Type: cross Abstract: Recent work on one-shot permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations. However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain. Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging for several models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.  ( 2 min )
    Binary Gaussian Copula Synthesis: A Novel Data Augmentation Technique to Advance ML-based Clinical Decision Support Systems for Early Prediction of Dialysis Among CKD Patients
    arXiv:2403.00965v1 Announce Type: cross Abstract: The Center for Disease Control estimates that over 37 million US adults suffer from chronic kidney disease (CKD), yet 9 out of 10 of these individuals are unaware of their condition due to the absence of symptoms in the early stages. It has a significant impact on patients' quality of life, particularly when it progresses to the need for dialysis. Early prediction of dialysis is crucial as it can significantly improve patient outcomes and assist healthcare providers in making timely and informed decisions. However, developing an effective machine learning (ML)-based Clinical Decision Support System (CDSS) for early dialysis prediction poses a key challenge due to the imbalanced nature of data. To address this challenge, this study evaluates various data augmentation techniques to understand their effectiveness on real-world datasets. We propose a new approach named Binary Gaussian Copula Synthesis (BGCS). BGCS is tailored for binary medical datasets and excels in generating synthetic minority data that mirrors the distribution of the original data. BGCS enhances early dialysis prediction by outperforming traditional methods in detecting dialysis patients. For the best ML model, Random Forest, BCGS achieved a 72% improvement, surpassing the state-of-the-art augmentation approaches. Also, we present a ML-based CDSS, designed to aid clinicians in making informed decisions. CDSS, which utilizes decision tree models, is developed to improve patient outcomes, identify critical variables, and thereby enable clinicians to make proactive decisions, and strategize treatment plans effectively for CKD patients who are more likely to require dialysis in the near future. Through comprehensive feature analysis and meticulous data preparation, we ensure that the CDSS's dialysis predictions are not only accurate but also actionable, providing a valuable tool in the management and treatment of CKD.  ( 3 min )
    Data Science Education in Undergraduate Physics: Lessons Learned from a Community of Practice
    arXiv:2403.00961v1 Announce Type: cross Abstract: With the increasing availability of diverse datasets, ranging from small-scale experimental data points to large and complex data repositories and powerful data analysis tools, it is increasingly important that physics educators equip their students with the skills to work with data effectively. However, many educators may lack the necessary training and expertise in data science to teach these skills. To address this gap, we created the Data Science Education Community of Practice (DSECOP), bringing together graduate students and physics educators from different institutions and backgrounds to share best practices and lessons learned in integrating data science into undergraduate physics education. In this article, we present insights and experiences from this community of practice, highlighting key strategies and challenges in incorporating data science into the introductory physics curriculum. Our goal is to provide guidance and inspiration to educators who seek to integrate data science into their teaching, helping to prepare the next generation of physicists for a data-driven world.  ( 2 min )
    MALTO at SemEval-2024 Task 6: Leveraging Synthetic Data for LLM Hallucination Detection
    arXiv:2403.00964v1 Announce Type: cross Abstract: In Natural Language Generation (NLG), contemporary Large Language Models (LLMs) face several challenges, such as generating fluent yet inaccurate outputs and reliance on fluency-centric metrics. This often leads to neural networks exhibiting "hallucinations". The SHROOM challenge focuses on automatically identifying these hallucinations in the generated text. To tackle these issues, we introduce two key components, a data augmentation pipeline incorporating LLM-assisted pseudo-labelling and sentence rephrasing, and a voting ensemble from three models pre-trained on Natural Language Inference (NLI) tasks and fine-tuned on diverse datasets.  ( 2 min )
    Transfer Learning for Security: Challenges and Future Directions
    arXiv:2403.00935v1 Announce Type: cross Abstract: Many machine learning and data mining algorithms rely on the assumption that the training and testing data share the same feature space and distribution. However, this assumption may not always hold. For instance, there are situations where we need to classify data in one domain, but we only have sufficient training data available from a different domain. The latter data may follow a distinct distribution. In such cases, successfully transferring knowledge across domains can significantly improve learning performance and reduce the need for extensive data labeling efforts. Transfer learning (TL) has thus emerged as a promising framework to tackle this challenge, particularly in security-related tasks. This paper aims to review the current advancements in utilizing TL techniques for security. The paper includes a discussion of the existing research gaps in applying TL in the security domain, as well as exploring potential future research directions and issues that arise in the context of TL-assisted security solutions.  ( 2 min )
    MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
    arXiv:2403.00952v1 Announce Type: cross Abstract: Large language models (LLMs) are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine). Although domain-specific pre-training enhances efficiency and leads to smaller models, the computational costs of training these LLMs remain high, posing budgeting challenges. We introduce MediSwift, a suite of biomedical LMs that leverage sparse pre-training on domain-specific biomedical text data. By inducing up to 75% weight sparsity during the pre-training phase, MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse pre-training was performed on the Cerebras CS-2 system, which is specifically designed to realize the acceleration benefits from unstructured weight sparsity, thereby significantly enhancing the efficiency of the MediSwift models. Through subsequent dense fine-tuning and strategic soft prompting, MediSwift models outperform existing LLMs up to 7B parameters on biomedical tasks, setting new benchmarks w.r.t efficiency-accuracy on tasks such as PubMedQA. Our results show that sparse pre-training, along with dense fine-tuning and soft prompting, offers an effective method for creating high-performing, computationally efficient models in specialized domains.  ( 2 min )
    PRIME: Scaffolding Manipulation Tasks with Behavior Primitives for Data-Efficient Imitation Learning
    arXiv:2403.00929v1 Announce Type: cross Abstract: Imitation learning has shown great potential for enabling robots to acquire complex manipulation behaviors. However, these algorithms suffer from high sample complexity in long-horizon tasks, where compounding errors accumulate over the task horizons. We present PRIME (PRimitive-based IMitation with data Efficiency), a behavior primitive-based framework designed for improving the data efficiency of imitation learning. PRIME scaffolds robot tasks by decomposing task demonstrations into primitive sequences, followed by learning a high-level control policy to sequence primitives through imitation learning. Our experiments demonstrate that PRIME achieves a significant performance improvement in multi-stage manipulation tasks, with 10-34% higher success rates in simulation over state-of-the-art baselines and 20-48% on physical hardware.  ( 2 min )
    The Algorithm Configuration Problem
    arXiv:2403.00898v1 Announce Type: cross Abstract: The field of algorithmic optimization has significantly advanced with the development of methods for the automatic configuration of algorithmic parameters. This article delves into the Algorithm Configuration Problem, focused on optimizing parametrized algorithms for solving specific instances of decision/optimization problems. We present a comprehensive framework that not only formalizes the Algorithm Configuration Problem, but also outlines different approaches for its resolution, leveraging machine learning models and heuristic strategies. The article categorizes existing methodologies into per-instance and per-problem approaches, distinguishing between offline and online strategies for model construction and deployment. By synthesizing these approaches, we aim to provide a clear pathway for both understanding and addressing the complexities inherent in algorithm configuration.  ( 2 min )
    PowerFlowMultiNet: Multigraph Neural Networks for Unbalanced Three-Phase Distribution Systems
    arXiv:2403.00892v1 Announce Type: cross Abstract: Efficiently solving unbalanced three-phase power flow in distribution grids is pivotal for grid analysis and simulation. There is a pressing need for scalable algorithms capable of handling large-scale unbalanced power grids that can provide accurate and fast solutions. To address this, deep learning techniques, especially Graph Neural Networks (GNNs), have emerged. However, existing literature primarily focuses on balanced networks, leaving a critical gap in supporting unbalanced three-phase power grids. This letter introduces PowerFlowMultiNet, a novel multigraph GNN framework explicitly designed for unbalanced three-phase power grids. The proposed approach models each phase separately in a multigraph representation, effectively capturing the inherent asymmetry in unbalanced grids. A graph embedding mechanism utilizing message passing is introduced to capture spatial dependencies within the power system network. PowerFlowMultiNet outperforms traditional methods and other deep learning approaches in terms of accuracy and computational speed. Rigorous testing reveals significantly lower error rates and a notable hundredfold increase in computational speed for large power networks compared to model-based methods.  ( 2 min )
    VisRec: A Semi-Supervised Approach to Radio Interferometric Data Reconstruction
    arXiv:2403.00897v1 Announce Type: cross Abstract: Radio telescopes produce visibility data about celestial objects, but these data are sparse and noisy. As a result, images created on raw visibility data are of low quality. Recent studies have used deep learning models to reconstruct visibility data to get cleaner images. However, these methods rely on a substantial amount of labeled training data, which requires significant labeling effort from radio astronomers. Addressing this challenge, we propose VisRec, a model-agnostic semi-supervised learning approach to the reconstruction of visibility data. Specifically, VisRec consists of both a supervised learning module and an unsupervised learning module. In the supervised learning module, we introduce a set of data augmentation functions to produce diverse training examples. In comparison, the unsupervised learning module in VisRec augments unlabeled data and uses reconstructions from non-augmented visibility data as pseudo-labels for training. This hybrid approach allows VisRec to effectively leverage both labeled and unlabeled data. This way, VisRec performs well even when labeled data is scarce. Our evaluation results show that VisRec outperforms all baseline methods in reconstruction quality, robustness against common observation perturbation, and generalizability to different telescope configurations.  ( 2 min )
    End-to-end Graph-Sequential Representation Learning for Accurate Recommendations
    arXiv:2403.00895v1 Announce Type: cross Abstract: Many recent advancements in recommender systems have focused on developing sequence-based and graph-based approaches. Both approaches proved useful in modeling intricate relationships within behavioral data, leading to promising outcomes in personalized ranking and next-item recommendation tasks while maintaining good scalability. However, they capture very different signals from data. While the former approach represents users directly through ordered interactions with recent items, the latter one aims to capture indirect dependencies across the interactions graph. This paper presents a novel multi-representational learning framework that exploits the synergies between these two paradigms. Our empirical evaluation on several datasets demonstrates that mutual training of sequential and graph components with the proposed framework significantly improves recommendations performance.  ( 2 min )
    Time-bound Contextual Bio-ID Generation for Minimalist Wearables
    arXiv:2403.00889v1 Announce Type: cross Abstract: As wearable devices become increasingly miniaturized and powerful, a new opportunity arises for instant and dynamic device-to-device collaboration and human-to-device interaction. However, this progress presents a unique challenge: these minimalist wearables lack inherent mechanisms for real-time authentication, posing significant risks to data privacy and overall security. To address this, we introduce Proteus that realizes an innovative concept of time-bound contextual bio-IDs, which are generated from on-device sensor data and embedded into a common latent space. These bio-IDs act as a time-bound unique user identifier that can be used to identify the wearer in a certain context. Proteus enables dynamic and contextual device collaboration as well as robust human-to-device interaction. Our evaluations demonstrate the effectiveness of our method, particularly in the context of minimalist wearables.  ( 2 min )
    Margin Discrepancy-based Adversarial Training for Multi-Domain Text Classification
    arXiv:2403.00888v1 Announce Type: cross Abstract: Multi-domain text classification (MDTC) endeavors to harness available resources from correlated domains to enhance the classification accuracy of the target domain. Presently, most MDTC approaches that embrace adversarial training and the shared-private paradigm exhibit cutting-edge performance. Unfortunately, these methods face a non-negligible challenge: the absence of theoretical guarantees in the design of MDTC algorithms. The dearth of theoretical underpinning poses a substantial impediment to the advancement of MDTC algorithms. To tackle this problem, we first provide a theoretical analysis of MDTC by decomposing the MDTC task into multiple domain adaptation tasks. We incorporate the margin discrepancy as the measure of domain divergence and establish a new generalization bound based on Rademacher complexity. Subsequently, we propose a margin discrepancy-based adversarial training (MDAT) approach for MDTC, in accordance with our theoretical analysis. To validate the efficacy of the proposed MDAT method, we conduct empirical studies on two MDTC benchmarks. The experimental results demonstrate that our MDAT approach surpasses state-of-the-art baselines on both datasets.  ( 2 min )
    Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions
    arXiv:2403.00875v1 Announce Type: cross Abstract: Augmentation is an effective alternative to utilize the small amount of labeled protein data. However, most of the existing work focuses on design-ing new architectures or pre-training tasks, and relatively little work has studied data augmentation for proteins. This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks, providing the first comprehensive evaluation of protein augmentation. Furthermore, we propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution, which enable protein semantic-aware augmentation through saliency detection and biological knowledge. Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA), which can adaptively select the most suitable augmentation combinations for different tasks. Extensive experiments have shown that APA enhances the performance of five protein related tasks by an average of 10.55% across three architectures compared to vanilla implementations without augmentation, highlighting its potential to make a great impact on the field.  ( 2 min )
    SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech
    arXiv:2403.00887v1 Announce Type: cross Abstract: The interpretation of human voices holds importance across various applications. This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications. Voice analysis tech advancements span domains, from improving customer interactions to enhancing healthcare and retail experiences. Discerning emotions aids mental health, while age and gender detection are vital in various contexts. Exploring deep learning models for these predictions involves comparing single, multi-output, and sequential models highlighted in this paper. Sourcing suitable data posed challenges, resulting in the amalgamation of the CREMA-D and EMO-DB datasets. Prior work showed promise in individual predictions, but limited research considered all three variables simultaneously. This paper identifies flaws in an individual model approach and advocates for our novel multi-output learning architecture Speech-based Emotion Gender and Age Analysis (SEGAA) model. The experiments suggest that Multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs, all while achieving improved runtime.  ( 2 min )
    Blockchain-empowered Federated Learning: Benefits, Challenges, and Solutions
    arXiv:2403.00873v1 Announce Type: cross Abstract: Federated learning (FL) is a distributed machine learning approach that protects user data privacy by training models locally on clients and aggregating them on a parameter server. While effective at preserving privacy, FL systems face limitations such as single points of failure, lack of incentives, and inadequate security. To address these challenges, blockchain technology is integrated into FL systems to provide stronger security, fairness, and scalability. However, blockchain-empowered FL (BC-FL) systems introduce additional demands on network, computing, and storage resources. This survey provides a comprehensive review of recent research on BC-FL systems, analyzing the benefits and challenges associated with blockchain integration. We explore why blockchain is applicable to FL, how it can be implemented, and the challenges and existing solutions for its integration. Additionally, we offer insights on future research directions for the BC-FL system.  ( 2 min )
    Teach LLMs to Phish: Stealing Private Information from Language Models
    arXiv:2403.00871v1 Announce Type: cross Abstract: When large language models are trained on private data, it can be a significant privacy risk for them to memorize and regurgitate sensitive information. In this work, we propose a new practical data extraction attack that we call "neural phishing". This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit card numbers, from a model trained on user data with upwards of 10% attack success rates, at times, as high as 50%. Our attack assumes only that an adversary can insert as few as 10s of benign-appearing sentences into the training dataset using only vague priors on the structure of the user data.  ( 2 min )
    Fast and Efficient Local Search for Genetic Programming Based Loss Function Learning
    arXiv:2403.00865v1 Announce Type: cross Abstract: In this paper, we develop upon the topic of loss function learning, an emergent meta-learning paradigm that aims to learn loss functions that significantly improve the performance of the models trained under them. Specifically, we propose a new meta-learning framework for task and model-agnostic loss function learning via a hybrid search approach. The framework first uses genetic programming to find a set of symbolic loss functions. Second, the set of learned loss functions is subsequently parameterized and optimized via unrolled differentiation. The versatility and performance of the proposed framework are empirically validated on a diverse set of supervised learning tasks. Results show that the learned loss functions bring improved convergence, sample efficiency, and inference performance on tabulated, computer vision, and natural language processing problems, using a variety of task-specific neural network architectures.  ( 2 min )
    Pivoting Retail Supply Chain with Deep Generative Techniques: Taxonomy, Survey and Insights
    arXiv:2403.00861v1 Announce Type: cross Abstract: Generative AI applications, such as ChatGPT or DALL-E, have shown the world their impressive capabilities in generating human-like text or image. Diving deeper, the science stakeholder for those AI applications are Deep Generative Models, a.k.a DGMs, which are designed to learn the underlying distribution of the data and generate new data points that are statistically similar to the original dataset. One critical question is raised: how can we leverage DGMs into morden retail supply chain realm? To address this question, this paper expects to provide a comprehensive review of DGMs and discuss their existing and potential usecases in retail supply chain, by (1) providing a taxonomy and overview of state-of-the-art DGMs and their variants, (2) reviewing existing DGM applications in retail supply chain from a end-to-end view of point, and (3) discussing insights and potential directions on how DGMs can be further utilized on solving retail supply chain problems.  ( 2 min )
    Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning
    arXiv:2403.00854v1 Announce Type: cross Abstract: Dysarthria, a condition resulting from impaired control of the speech muscles due to neurological disorders, significantly impacts the communication and quality of life of patients. The condition's complexity, human scoring and varied presentations make its assessment and management challenging. This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data. It can offer an objective, repeatable, accessible, standardised and cost-effective and compared to traditional methods requiring human expert assessors. We develop a transformer framework, called Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification. The multi-task framework is designed to reduce reliance on speaker-specific characteristics and address the intrinsic intra-class variability of dysarthric speech. We evaluated on the Universal Access Speech dataset using leave-one-speaker-out cross-validation, our model demonstrated superior performance over traditional machine learning approaches, with an accuracy of $70.48\%$ and an F1 score of $59.23\%$. Our SALR model also exceeded the previous benchmark for AI-based classification, which used support vector machines, by $16.58\%$. We open the black box of our model by visualising the latent space where we can observe how the model substantially reduces speaker-specific cues and amplifies task-specific ones, thereby showing its robustness. In conclusion, SALR establishes a new benchmark in speaker-independent multi-class dysarthria severity classification using generative AI. The potential implications of our findings for broader clinical applications in automated dysarthria severity assessments.  ( 3 min )
    Improved Online Learning Algorithms for CTR Prediction in Ad Auctions
    arXiv:2403.00845v1 Announce Type: cross Abstract: In this work, we investigate the online learning problem of revenue maximization in ad auctions, where the seller needs to learn the click-through rates (CTRs) of each ad candidate and charge the price of the winner through a pay-per-click manner. We focus on two models of the advertisers' strategic behaviors. First, we assume that the advertiser is completely myopic; i.e.~in each round, they aim to maximize their utility only for the current round. In this setting, we develop an online mechanism based on upper-confidence bounds that achieves a tight $O(\sqrt{T})$ regret in the worst-case and negative regret when the values are static across all the auctions and there is a gap between the highest expected value (i.e.~value multiplied by their CTR) and second highest expected value ad. Next, we assume that the advertiser is non-myopic and cares about their long term utility. This setting is much more complex since an advertiser is incentivized to influence the mechanism by bidding strategically in earlier rounds. In this setting, we provide an algorithm to achieve negative regret for the static valuation setting (with a positive gap), which is in sharp contrast with the prior work that shows $O(T^{2/3})$ regret when the valuation is generated by adversary.  ( 2 min )
    NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions
    arXiv:2403.00849v1 Announce Type: cross Abstract: Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, with rigid sparsity and quantization enforced only between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, the digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to lower latency NNs for the same accuracy.  ( 3 min )
    Lower-Left Partial AUC: An Effective and Efficient Optimization Metric for Recommendation
    arXiv:2403.00844v1 Announce Type: cross Abstract: Optimization metrics are crucial for building recommendation systems at scale. However, an effective and efficient metric for practical use remains elusive. While Top-K ranking metrics are the gold standard for optimization, they suffer from significant computational overhead. Alternatively, the more efficient accuracy and AUC metrics often fall short of capturing the true targets of recommendation tasks, leading to suboptimal performance. To overcome this dilemma, we propose a new optimization metric, Lower-Left Partial AUC (LLPAUC), which is computationally efficient like AUC but strongly correlates with Top-K ranking metrics. Compared to AUC, LLPAUC considers only the partial area under the ROC curve in the Lower-Left corner to push the optimization focus on Top-K. We provide theoretical validation of the correlation between LLPAUC and Top-K ranking metrics and demonstrate its robustness to noisy user feedback. We further design an efficient point-wise recommendation loss to maximize LLPAUC and evaluate it on three datasets, validating its effectiveness and robustness.  ( 2 min )
    Enhancing Long-Term Recommendation with Bi-level Learnable Large Language Model Planning
    arXiv:2403.00843v1 Announce Type: cross Abstract: Traditional recommendation setting tends to excessively cater to users' immediate interests and neglect their long-term engagement. To address it, it is crucial to incorporate planning capabilities into the recommendation decision-making process to develop policies that take into account both immediate interests and long-term engagement. Despite Reinforcement Learning (RL) can learn planning capacity by maximizing cumulative reward, the scarcity of recommendation data presents challenges such as instability and susceptibility to overfitting when training RL models from scratch. In this context, we propose to leverage the remarkable planning capabilities over sparse data of Large Language Models (LLMs) for long-term recommendation. The key lies in enabling a language model to understand and apply task-solving principles effectively in personalized recommendation scenarios, as the model's pre-training may not naturally encompass these principles, necessitating the need to inspire or teach the model. To achieve this, we propose a Bi-level Learnable LLM Planner framework, which combines macro-learning and micro-learning through a hierarchical mechanism. The framework includes a Planner and Reflector for acquiring high-level guiding principles and an Actor-Critic component for planning personalization. Extensive experiments validate the superiority of the framework in learning to plan for long-term recommendations.  ( 2 min )
    Deep Learning Detection Method for Large Language Models-Generated Scientific Content
    arXiv:2403.00828v1 Announce Type: cross Abstract: Large Language Models (LLMs), such as GPT-3 and BERT, reshape how textual content is written and communicated. These models have the potential to generate scientific content that is indistinguishable from that written by humans. Hence, LLMs carry severe consequences for the scientific community, which relies on the integrity and reliability of publications. This research paper presents a novel ChatGPT-generated scientific text detection method, AI-Catcher. AI-Catcher integrates two deep learning models, multilayer perceptron (MLP) and convolutional neural networks (CNN). The MLP learns the feature representations of the linguistic and statistical features. The CNN extracts high-level representations of the sequential patterns from the textual content. AI-Catcher is a multimodal model that fuses hidden patterns derived from MLP and CNN. In addition, a new ChatGPT-Generated scientific text dataset is collected to enhance AI-generated text detection tools, AIGTxt. AIGTxt contains 3000 records collected from published academic articles across ten domains and divided into three classes: Human-written, ChatGPT-generated, and Mixed text. Several experiments are conducted to evaluate the performance of AI-Catcher. The comparative results demonstrate the capability of AI-Catcher to distinguish between human-written and ChatGPT-generated scientific text more accurately than alternative methods. On average, AI-Catcher improved accuracy by 37.4%.  ( 2 min )
    Offline Fictitious Self-Play for Competitive Games
    arXiv:2403.00841v1 Announce Type: cross Abstract: Offline Reinforcement Learning (RL) has received significant interest due to its ability to improve policies in previously collected datasets without online interactions. Despite its success in the single-agent setting, offline multi-agent RL remains a challenge, especially in competitive games. Firstly, unaware of the game structure, it is impossible to interact with the opponents and conduct a major learning paradigm, self-play, for competitive games. Secondly, real-world datasets cannot cover all the state and action space in the game, resulting in barriers to identifying Nash equilibrium (NE). To address these issues, this paper introduces Off-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn best responses to different opponents and employ the Offline Self-Play learning framework. In this framework, we further implement Fictitious Self-Play (FSP) to approximate NE. In partially covered real-world datasets, our methods show the potential to approach NE by incorporating any single-agent offline RL method. Experimental results in Leduc Hold'em Poker show that our method significantly improves performances compared with state-of-the-art baselines.  ( 2 min )
    Self-Refinement of Language Models from External Proxy Metrics Feedback
    arXiv:2403.00827v1 Announce Type: cross Abstract: It is often desirable for Large Language Models (LLMs) to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics, and iteratively refines its response one principle at a time. We apply ProMiSe to open source language models Flan-T5-XXL and Llama-2-13B-Chat, to evaluate its performance on document-grounded question answering datasets, MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over the zero-shot baseline as well as a supervised fine-tuned model on human annotated data.  ( 2 min )
    LLMGuard: Guarding Against Unsafe LLM Behavior
    arXiv:2403.00826v1 Announce Type: cross Abstract: Although the rise of Large Language Models (LLMs) in enterprise settings brings new opportunities and capabilities, it also brings challenges, such as the risk of generating inappropriate, biased, or misleading content that violates regulations and can have legal concerns. To alleviate this, we present "LLMGuard", a tool that monitors user interactions with an LLM application and flags content against specific behaviours or conversation topics. To do this robustly, LLMGuard employs an ensemble of detectors.  ( 2 min )
    DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
    arXiv:2403.00818v1 Announce Type: cross Abstract: Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space model (SSM) is a new type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallowlayer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output. Dense connections enhanced DenseSSM still maintains the training parallelizability and inference efficiency. The proposed method can be widely applicable to various SSM types like RetNet and Mamba. With similar model size, DenseSSM achieves significant improvements, exemplified by DenseRetNet outperforming the original RetNet with up to 5% accuracy improvement on public benchmarks.  ( 2 min )
    Social Media as a Sensor: Analyzing Twitter Data for Breast Cancer Medication Effects Using Natural Language Processing
    arXiv:2403.00821v1 Announce Type: cross Abstract: Breast cancer is a significant public health concern and is the leading cause of cancer-related deaths among women. Despite advances in breast cancer treatments, medication non-adherence remains a major problem. As electronic health records do not typically capture patient-reported outcomes that may reveal information about medication-related experiences, social media presents an attractive resource for enhancing our understanding of the patients' treatment experiences. In this paper, we developed natural language processing (NLP) based methodologies to study information posted by an automatically curated breast cancer cohort from social media. We employed a transformer-based classifier to identify breast cancer patients/survivors on X (Twitter) based on their self-reported information, and we collected longitudinal data from their profiles. We then designed a multi-layer rule-based model to develop a breast cancer therapy-associated side effect lexicon and detect patterns of medication usage and associated side effects among breast cancer patients. 1,454,637 posts were available from 583,962 unique users, of which 62,042 were detected as breast cancer members using our transformer-based model. 198 cohort members mentioned breast cancer medications with tamoxifen as the most common. Our side effect lexicon identified well-known side effects of hormone and chemotherapy. Furthermore, it discovered a subject feeling towards cancer and medications, which may suggest a pre-clinical phase of side effects or emotional distress. This analysis highlighted not only the utility of NLP techniques in unstructured social media data to identify self-reported breast cancer posts, medication usage patterns, and treatment side effects but also the richness of social data on such clinical questions.  ( 3 min )
    Doubly Calibrated Estimator for Recommendation on Data Missing Not At Random
    arXiv:2403.00817v1 Announce Type: cross Abstract: Recommender systems often suffer from selection bias as users tend to rate their preferred items. The datasets collected under such conditions exhibit entries missing not at random and thus are not randomized-controlled trials representing the target population. To address this challenge, a doubly robust estimator and its enhanced variants have been proposed as they ensure unbiasedness when accurate imputed errors or predicted propensities are provided. However, we argue that existing estimators rely on miscalibrated imputed errors and propensity scores as they depend on rudimentary models for estimation. We provide theoretical insights into how miscalibrated imputation and propensity models may limit the effectiveness of doubly robust estimators and validate our theorems using real-world datasets. On this basis, we propose a Doubly Calibrated Estimator that involves the calibration of both the imputation and propensity models. To achieve this, we introduce calibration experts that consider different logit distributions across users. Moreover, we devise a tri-level joint learning framework, allowing the simultaneous optimization of calibration experts alongside prediction and imputation models. Through extensive experiments on real-world datasets, we demonstrate the superiority of the Doubly Calibrated Estimator in the context of debiased recommendation tasks.  ( 2 min )
    Uncovering Customer Issues through Topological Natural Language Analysis
    arXiv:2403.00804v1 Announce Type: cross Abstract: E-commerce companies deal with a high volume of customer service requests daily. While a simple annotation system is often used to summarize the topics of customer contacts, thoroughly exploring each specific issue can be challenging. This presents a critical concern, especially during an emerging outbreak where companies must quickly identify and address specific issues. To tackle this challenge, we propose a novel machine learning algorithm that leverages natural language techniques and topological data analysis to monitor emerging and trending customer issues. Our approach involves an end-to-end deep learning framework that simultaneously tags the primary question sentence of each customer's transcript and generates sentence embedding vectors. We then whiten the embedding vectors and use them to construct an undirected graph. From there, we define trending and emerging issues based on the topological properties of each transcript. We have validated our results through various methods and found that they are highly consistent with news sources.  ( 2 min )
    Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes
    arXiv:2403.00800v1 Announce Type: cross Abstract: Although large language models demonstrate emergent abilities in solving math word problems, there is a challenging task in complex multi-step mathematical reasoning tasks. To improve model performance on mathematical reasoning tasks, previous work has conducted supervised fine-tuning on open-source models by improving the quality and quantity of data. In this paper, we propose a novel approach, named Brain, to imitate human thought processes to enhance mathematical reasoning abilities, using the Frontal Lobe Model to generate plans, and then employing the Parietal Lobe Model to generate code and execute to obtain answers. First, we achieve SOTA performance in comparison with Code LLaMA 7B based models through this method. Secondly, we find that plans can be explicitly extracted from natural language, code, or formal language. Our code and data are publicly available at https://github.com/cyzhh/Brain.  ( 2 min )
    An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning
    arXiv:2403.00799v1 Announce Type: cross Abstract: Large language models (LLMs) are displaying emergent abilities for math reasoning tasks,and there is a growing attention on enhancing the ability of open-source LLMs through supervised fine-tuning (SFT).In this paper, we aim to explore a general data strategy for supervised data to help optimize and expand math reasoning ability.Firstly, we determine the ability boundary of reasoning paths augmentation by identifying these paths' minimal optimal set.Secondly, we validate that different abilities of the model can be cumulatively enhanced by Mix of Minimal Optimal Sets of corresponding types of data, while our models MMOS achieve SOTA performance on series base models under much lower construction costs.Besides, we point out GSM-HARD is not really hard and today's LLMs no longer lack numerical robustness.Also, we provide an Auto Problem Generator for robustness testing and educational applications.Our code and data are publicly available at https://github.com/cyzhh/MMOS.  ( 2 min )
    Helen: Optimizing CTR Prediction Models with Frequency-wise Hessian Eigenvalue Regularization
    arXiv:2403.00798v1 Announce Type: cross Abstract: Click-Through Rate (CTR) prediction holds paramount significance in online advertising and recommendation scenarios. Despite the proliferation of recent CTR prediction models, the improvements in performance have remained limited, as evidenced by open-source benchmark assessments. Current researchers tend to focus on developing new models for various datasets and settings, often neglecting a crucial question: What is the key challenge that truly makes CTR prediction so demanding? In this paper, we approach the problem of CTR prediction from an optimization perspective. We explore the typical data characteristics and optimization statistics of CTR prediction, revealing a strong positive correlation between the top hessian eigenvalue and feature frequency. This correlation implies that frequently occurring features tend to converge towards sharp local minima, ultimately leading to suboptimal performance. Motivated by the recent advancements in sharpness-aware minimization (SAM), which considers the geometric aspects of the loss landscape during optimization, we present a dedicated optimizer crafted for CTR prediction, named Helen. Helen incorporates frequency-wise Hessian eigenvalue regularization, achieved through adaptive perturbations based on normalized feature frequencies. Empirical results under the open-source benchmark framework underscore Helen's effectiveness. It successfully constrains the top eigenvalue of the Hessian matrix and demonstrates a clear advantage over widely used optimization algorithms when applied to seven popular models across three public benchmark datasets on BARS. Our code locates at github.com/NUS-HPC-AI-Lab/Helen.  ( 3 min )
    Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models
    arXiv:2403.00794v1 Announce Type: cross Abstract: Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs), can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to `unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.  ( 2 min )
    Enhancing Mean-Reverting Time Series Prediction with Gaussian Processes: Functional and Augmented Data Structures in Financial Forecasting
    arXiv:2403.00796v1 Announce Type: cross Abstract: In this paper, we explore the application of Gaussian Processes (GPs) for predicting mean-reverting time series with an underlying structure, using relatively unexplored functional and augmented data structures. While many conventional forecasting methods concentrate on the short-term dynamics of time series data, GPs offer the potential to forecast not just the average prediction but the entire probability distribution over a future trajectory. This is particularly beneficial in financial contexts, where accurate predictions alone may not suffice if incorrect volatility assessments lead to capital losses. Moreover, in trade selection, GPs allow for the forecasting of multiple Sharpe ratios adjusted for transaction costs, aiding in decision-making. The functional data representation utilized in this study enables longer-term predictions by leveraging information from previous years, even as the forecast moves away from the current year's training data. Additionally, the augmented representation enriches the training set by incorporating multiple targets for future points in time, facilitating long-term predictions. Our implementation closely aligns with the methodology outlined in, which assessed effectiveness on commodity futures. However, our testing methodology differs. Instead of real data, we employ simulated data with similar characteristics. We construct a testing environment to evaluate both data representations and models under conditions of increasing noise, fat tails, and inappropriate kernels-conditions commonly encountered in practice. By simulating data, we can compare our forecast distribution over time against a full simulation of the actual distribution of our test set, thereby reducing the inherent uncertainty in testing time series models on real data. We enable feature prediction through augmentation and employ sub-sampling to ensure the feasibility of GPs.  ( 3 min )
    Ad Recommendation in a Collapsed and Entangled World
    arXiv:2403.00793v1 Announce Type: cross Abstract: In this paper, we present an industry ad recommendation system, paying attention to the challenges and practices of learning appropriate representations. Our study begins by showcasing our approaches to preserving priors when encoding features of diverse types into embedding representations. Specifically, we address sequence features, numeric features, pre-trained embedding features, as well as sparse ID features. Moreover, we delve into two pivotal challenges associated with feature representation: the dimensional collapse of embeddings and the interest entanglement across various tasks or scenarios. Subsequently, we propose several practical approaches to effectively tackle these two challenges. We then explore several training techniques to facilitate model optimization, reduce bias, and enhance exploration. Furthermore, we introduce three analysis tools that enable us to comprehensively study feature correlation, dimensional collapse, and interest entanglement. This work builds upon the continuous efforts of Tencent's ads recommendation team in the last decade. It not only summarizes general design principles but also presents a series of off-the-shelf solutions and analysis tools. The reported performance is based on our online advertising platform, which handles hundreds of billions of requests daily, serving millions of ads to billions of users.  ( 2 min )
    PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered Care
    arXiv:2403.00788v1 Announce Type: cross Abstract: This study introduces and evaluates the PRECISE framework, utilizing OpenAI's GPT-4 to enhance patient engagement by providing clearer and more accessible chest X-ray reports at a sixth-grade reading level. The framework was tested on 500 reports, demonstrating significant improvements in readability, reliability, and understandability. Statistical analyses confirmed the effectiveness of the PRECISE approach, highlighting its potential to foster patient-centric care delivery in healthcare decision-making.  ( 2 min )
    ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented Framework
    arXiv:2403.00781v1 Announce Type: cross Abstract: The profound impact of food on health necessitates advanced nutrition-oriented food recommendation services. Conventional methods often lack the crucial elements of personalization, explainability, and interactivity. While Large Language Models (LLMs) bring interpretability and explainability, their standalone use falls short of achieving true personalization. In this paper, we introduce ChatDiet, a novel LLM-powered framework designed specifically for personalized nutrition-oriented food recommendation chatbots. ChatDiet integrates personal and population models, complemented by an orchestrator, to seamlessly retrieve and process pertinent information. The result is a dynamic delivery of personalized and explainable food recommendations, tailored to individual user preferences. Our evaluation of ChatDiet includes a compelling case study, where we establish a causal personal model to estimate individual nutrition effects. Our assessments, including a food recommendation test showcasing a 92\% effectiveness rate, coupled with illustrative dialogue examples, underscore ChatDiet's strengths in explainability, personalization, and interactivity.  ( 2 min )
    Applying News and Media Sentiment Analysis for Generating Forex Trading Signals
    arXiv:2403.00785v1 Announce Type: cross Abstract: The objective of this research is to examine how sentiment analysis can be employed to generate trading signals for the Foreign Exchange (Forex) market. The author assessed sentiment in social media posts and news articles pertaining to the United States Dollar (USD) using a combination of methods: lexicon-based analysis and the Naive Bayes machine learning algorithm. The findings indicate that sentiment analysis proves valuable in forecasting market movements and devising trading signals. Notably, its effectiveness is consistent across different market conditions. The author concludes that by analyzing sentiment expressed in news and social media, traders can glean insights into prevailing market sentiments towards the USD and other pertinent countries, thereby aiding trading decision-making. This study underscores the importance of weaving sentiment analysis into trading strategies as a pivotal tool for predicting market dynamics.  ( 2 min )
    Combating Financial Crimes with Unsupervised Learning Techniques: Clustering and Dimensionality Reduction for Anti-Money Laundering
    arXiv:2403.00777v1 Announce Type: cross Abstract: Anti-Money Laundering (AML) is a crucial task in ensuring the integrity of financial systems. One keychallenge in AML is identifying high-risk groups based on their behavior. Unsupervised learning, particularly clustering, is a promising solution for this task. However, the use of hundreds of features todescribe behavior results in a highdimensional dataset that negatively impacts clustering performance.In this paper, we investigate the effectiveness of combining clustering method agglomerative hierarchicalclustering with four dimensionality reduction techniques -Independent Component Analysis (ICA), andKernel Principal Component Analysis (KPCA), Singular Value Decomposition (SVD), Locality Preserving Projections (LPP)- to overcome the issue of high-dimensionality in AML data and improve clusteringresults. This study aims to provide insights into the most effective way of reducing the dimensionality ofAML data and enhance the accuracy of clustering-based AML systems. The experimental results demonstrate that KPCA outperforms other dimension reduction techniques when combined with agglomerativehierarchical clustering. This superiority is observed in the majority of situations, as confirmed by threedistinct validation indices.  ( 2 min )
    Detecting Anomalous Events in Object-centric Business Processes via Graph Neural Networks
    arXiv:2403.00775v1 Announce Type: cross Abstract: Detecting anomalies is important for identifying inefficiencies, errors, or fraud in business processes. Traditional process mining approaches focus on analyzing 'flattened', sequential, event logs based on a single case notion. However, many real-world process executions exhibit a graph-like structure, where events can be associated with multiple cases. Flattening event logs requires selecting a single case identifier which creates a gap with the real event data and artificially introduces anomalies in the event logs. Object-centric process mining avoids these limitations by allowing events to be related to different cases. This study proposes a novel framework for anomaly detection in business processes that exploits graph neural networks and the enhanced information offered by object-centric process mining. We first reconstruct and represent the process dependencies of the object-centric event logs as attributed graphs and then employ a graph convolutional autoencoder architecture to detect anomalous events. Our results show that our approach provides promising performance in detecting anomalies at the activity type and attributes level, although it struggles to detect anomalies in the temporal order of events.  ( 2 min )
    Do Weibo platform experts perform better at predicting stock market?
    arXiv:2403.00772v1 Announce Type: cross Abstract: Sentiment analysis can be used for stock market prediction. However, existing research has not studied the impact of a user's financial background on sentiment-based forecasting of the stock market using artificial neural networks. In this work, a novel combination of neural networks is used for the assessment of sentiment-based stock market prediction, based on the financial background of the population that generated the sentiment. The state-of-the-art language processing model Bidirectional Encoder Representations from Transformers (BERT) is used to classify the sentiment and a Long-Short Term Memory (LSTM) model is used for time-series based stock market prediction. For evaluation, the Weibo social networking platform is used as a sentiment data collection source. Weibo users (and their comments respectively) are divided into Authorized Financial Advisor (AFA) and Unauthorized Financial Advisor (UFA) groups according to their background information, as collected by Weibo. The Hong Kong Hang Seng index is used to extract historical stock market change data. The results indicate that stock market prediction learned from the AFA group users is 39.67% more precise than that learned from the UFA group users and shows the highest accuracy (87%) when compared to existing approaches.  ( 2 min )
    XProspeCT: CT Volume Generation from Paired X-Rays
    arXiv:2403.00771v1 Announce Type: cross Abstract: Computed tomography (CT) is a beneficial imaging tool for diagnostic purposes. CT scans provide detailed information concerning the internal anatomic structures of a patient, but present higher radiation dose and costs compared to X-ray imaging. In this paper, we build on previous research to convert orthogonal X-ray images into simulated CT volumes by exploring larger datasets and various model structures. Significant model variations include UNet architectures, custom connections, activation functions, loss functions, optimizers, and a novel back projection approach.  ( 2 min )
    Towards Fair and Firm Real-Time Scheduling in DNN Multi-Tenant Multi-Accelerator Systems via Reinforcement Learning
    arXiv:2403.00766v1 Announce Type: cross Abstract: This paper addresses the critical challenge of managing Quality of Service (QoS) in cloud services, focusing on the nuances of individual tenant expectations and varying Service Level Indicators (SLIs). It introduces a novel approach utilizing Deep Reinforcement Learning for tenant-specific QoS management in multi-tenant, multi-accelerator cloud environments. The chosen SLI, deadline hit rate, allows clients to tailor QoS for each service request. A novel online scheduling algorithm for Deep Neural Networks in multi-accelerator systems is proposed, with a focus on guaranteeing tenant-wise, model-specific QoS levels while considering real-time constraints.  ( 2 min )
    Text mining in education
    arXiv:2403.00769v1 Announce Type: cross Abstract: The explosive growth of online education environments is generating a massive volume of data, specially in text format from forums, chats, social networks, assessments, essays, among others. It produces exciting challenges on how to mine text data in order to find useful knowledge for educational stakeholders. Despite the increasing number of educational applications of text mining published recently, we have not found any paper surveying them. In this line, this work presents a systematic overview of the current status of the Educational Text Mining field. Our final goal is to answer three main research questions: Which are the text mining techniques most used in educational environments? Which are the most used educational resources? And which are the main applications or educational goals? Finally, we outline the conclusions and the more interesting future trends.  ( 2 min )
    Gradient Correlation Subspace Learning against Catastrophic Forgetting
    arXiv:2403.02334v1 Announce Type: new Abstract: Efficient continual learning techniques have been a topic of significant research over the last few years. A fundamental problem with such learning is severe degradation of performance on previously learned tasks, known also as catastrophic forgetting. This paper introduces a novel method to reduce catastrophic forgetting in the context of incremental class learning called Gradient Correlation Subspace Learning (GCSL). The method detects a subspace of the weights that is least affected by previous tasks and projects the weights to train for the new task into said subspace. The method can be applied to one or more layers of a given network architectures and the size of the subspace used can be altered from layer to layer and task to task. Code will be available at \href{https://github.com/vgthengane/GCSL}{https://github.com/vgthengane/GCSL}  ( 2 min )
    An Architecture for Unattended Containerized (Deep) Reinforcement Learning with Webots
    arXiv:2403.00765v1 Announce Type: cross Abstract: As data science applications gain adoption across industries, the tooling landscape matures to facilitate the life cycle of such applications and provide solutions to the challenges involved to boost the productivity of the people involved. Reinforcement learning with agents in a 3D world could still face challenges: the knowledge required to use a simulation software as well as the utilization of a standalone simulation software in unattended training pipelines. In this paper we review tools and approaches to train reinforcement learning agents for robots in 3D worlds with respect to the robot Robotino and argue that the separation of the simulation environment for creators of virtual worlds and the model development environment for data scientists is not a well covered topic. Often both are the same and data scientists require knowledge of the simulation software to work directly with their APIs. Moreover, sometimes creators of virtual worlds and data scientists even work on the same files. We want to contribute to that topic by describing an approach where data scientists don't require knowledge about the simulation software. Our approach uses the standalone simulation software Webots, the Robot Operating System to communicate with simulated robots as well as the simulation software itself and container technology to separate the simulation from the model development environment. We put emphasize on the APIs the data scientists work with and the use of a standalone simulation software in unattended training pipelines. We show the parts that are specific to the Robotino and the robot task to learn.  ( 3 min )
    Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
    arXiv:2403.02310v1 Announce Type: new Abstract: Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt to produce one output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler Sarathi-Serve inspired by the techniques we originally proposed for optimizing throughput in Sarathi. Sarathi-Serve leverages chunked-prefills from Sarathi to create stall-free schedules that can add new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Our evaluation shows that Sarathi-Serve improves serving throughput within desired latency SLOs of Mistral-7B by up to 2.6x on a single A100 GPU and up to 6.9x for Falcon-180B on 8 A100 GPUs over Orca and vLLM.  ( 2 min )
    COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems against Semantic Attacks
    arXiv:2403.02329v1 Announce Type: new Abstract: Multi-sensor fusion systems (MSFs) play a vital role as the perception module in modern autonomous vehicles (AVs). Therefore, ensuring their robustness against common and realistic adversarial semantic transformations, such as rotation and shifting in the physical world, is crucial for the safety of AVs. While empirical evidence suggests that MSFs exhibit improved robustness compared to single-modal models, they are still vulnerable to adversarial semantic transformations. Despite the proposal of empirical defenses, several works show that these defenses can be attacked again by new adaptive attacks. So far, there is no certified defense proposed for MSFs. In this work, we propose the first robustness certification framework COMMIT certify robustness of multi-sensor fusion systems against semantic attacks. In particular, we propose a practical anisotropic noise mechanism that leverages randomized smoothing with multi-modal data and performs a grid-based splitting method to characterize complex semantic transformations. We also propose efficient algorithms to compute the certification in terms of object detection accuracy and IoU for large-scale MSF models. Empirically, we evaluate the efficacy of COMMIT in different settings and provide a comprehensive benchmark of certified robustness for different MSF models using the CARLA simulation platform. We show that the certification for MSF models is at most 48.39% higher than that of single-modal models, which validates the advantages of MSF models. We believe our certification framework and benchmark will contribute an important step towards certifiably robust AVs in practice.  ( 3 min )
    Physics-Informed Neural Networks with Skip Connections for Modeling and Control of Gas-Lifted Oil Wells
    arXiv:2403.02289v1 Announce Type: new Abstract: Neural networks, while powerful, often lack interpretability. Physics-Informed Neural Networks (PINNs) address this limitation by incorporating physics laws into the loss function, making them applicable to solving Ordinary Differential Equations (ODEs) and Partial Differential Equations (PDEs). The recently introduced PINC framework extends PINNs to control applications, allowing for open-ended long-range prediction and control of dynamic systems. In this work, we enhance PINC for modeling highly nonlinear systems such as gas-lifted oil wells. By introducing skip connections in the PINC network and refining certain terms in the ODE, we achieve more accurate gradients during training, resulting in an effective modeling process for the oil well system. Our proposed improved PINC demonstrates superior performance, reducing the validation prediction error by an average of 67% in the oil well application and significantly enhancing gradient flow through the network layers, increasing its magnitude by four orders of magnitude compared to the original PINC. Furthermore, experiments showcase the efficacy of Model Predictive Control (MPC) in regulating the bottom-hole pressure of the oil well using the improved PINC model, even in the presence of noisy measurements.  ( 2 min )
    A Decade of Privacy-Relevant Android App Reviews: Large Scale Trends
    arXiv:2403.02292v1 Announce Type: new Abstract: We present an analysis of 12 million instances of privacy-relevant reviews publicly visible on the Google Play Store that span a 10 year period. By leveraging state of the art NLP techniques, we can examine what users have been writing about privacy along multiple dimensions: time, countries, app types, diverse privacy topics, and even across a spectrum of emotions. We find consistent growth of privacy-relevant reviews, and explore topics that are trending (such as Data Deletion and Data Theft), as well as those on the decline (such as privacy-relevant reviews on sensitive permissions). We find that although privacy reviews come from more than 200 countries, 33 countries provide 90% of privacy reviews. We conduct a comparison across countries by examining the distribution of privacy topics a country's users write about, and find that geographic proximity is not a reliable indicator that nearby countries have similar privacy perspectives. We uncover some countries with unique patterns and explore those herein. Surprisingly, we uncover that it is not uncommon for reviews that discuss privacy to be positive (32%); many users express pleasure about privacy features within apps or privacy-focused apps. We also uncover some unexpected behaviors, such as the use of reviews to deliver privacy disclaimers to developers. Finally, we demonstrate the value of analyzing app reviews with our approach as a complement to existing methods for understanding users' perspectives about privacy.  ( 3 min )
    Better Schedules for Low Precision Training of Deep Neural Networks
    arXiv:2403.02243v1 Announce Type: new Abstract: Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout train- ing according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine sched- ules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.  ( 3 min )
    TPLLM: A Traffic Prediction Framework Based on Pretrained Large Language Models
    arXiv:2403.02221v1 Announce Type: new Abstract: Traffic prediction constitutes a pivotal facet within the purview of Intelligent Transportation Systems (ITS), and the attainment of highly precise predictions holds profound significance for efficacious traffic management. The precision of prevailing deep learning-driven traffic prediction models typically sees an upward trend with a rise in the volume of training data. However, the procurement of comprehensive spatiotemporal datasets for traffic is often fraught with challenges, primarily stemming from the substantial costs associated with data collection and retention. Consequently, developing a model that can achieve accurate predictions and good generalization ability in areas with limited historical traffic data is a challenging problem. It is noteworthy that the rapidly advancing pretrained Large Language Models (LLMs) of recent years have demonstrated exceptional proficiency in cross-modality knowledge transfer and few-shot learning. Recognizing the sequential nature of traffic data, similar to language, we introduce TPLLM, a novel traffic prediction framework leveraging LLMs. In this framework, we construct a sequence embedding layer based on Convolutional Neural Networks (CNNs) and a graph embedding layer based on Graph Convolutional Networks (GCNs) to extract sequence features and spatial features, respectively. These are subsequently integrated to form inputs that are suitable for LLMs. A Low-Rank Adaptation (LoRA) fine-tuning approach is applied to TPLLM, thereby facilitating efficient learning and minimizing computational demands. Experiments on two real-world datasets demonstrate that TPLLM exhibits commendable performance in both full-sample and few-shot prediction scenarios, effectively supporting the development of ITS in regions with scarce historical traffic data.  ( 3 min )
    Neural Redshift: Random Networks are not Random Functions
    arXiv:2403.02241v1 Announce Type: new Abstract: Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD) but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs. Findings. To understand the inductive biases provided by architectures independently from GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom, NNs do not have an inherent "simplicity bias". This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks. Implications. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models.  ( 2 min )
    Transformers Provably Learn Feature-Position Correlations in Masked Image Modeling
    arXiv:2403.02233v1 Announce Type: new Abstract: Masked image modeling (MIM), which predicts randomly masked patches from unmasked ones, has emerged as a promising approach in self-supervised vision pretraining. However, the theoretical understanding of MIM is rather limited, especially with the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theory of learning one-layer transformers with softmax attention in MIM self-supervised pretraining. On the conceptual side, we posit a theoretical mechanism of how transformers, pretrained with MIM, produce empirically observed local and diverse attention patterns on data distributions with spatial structures that highlight feature-position correlations. On the technical side, our end-to-end analysis of the training dynamics of softmax-based transformers accommodates both input and position embeddings simultaneously, which is developed based on a novel approach to track the interplay between the attention of feature-position and position-wise correlations.  ( 2 min )
    Joint Parameter and Parameterization Inference with Uncertainty Quantification through Differentiable Programming
    arXiv:2403.02215v1 Announce Type: new Abstract: Accurate representations of unknown and sub-grid physical processes through parameterizations (or closure) in numerical simulations with quantified uncertainty are critical for resolving the coarse-grained partial differential equations that govern many problems ranging from weather and climate prediction to turbulence simulations. Recent advances have seen machine learning (ML) increasingly applied to model these subgrid processes, resulting in the development of hybrid physics-ML models through the integration with numerical solvers. In this work, we introduce a novel framework for the joint estimation and uncertainty quantification of physical parameters and machine learning parameterizations in tandem, leveraging differentiable programming. Achieved through online training and efficient Bayesian inference within a high-dimensional parameter space, this approach is enabled by the capabilities of differentiable programming. This proof of concept underscores the substantial potential of differentiable programming in synergistically combining machine learning with differential equations, thereby enhancing the capabilities of hybrid physics-ML modeling.  ( 2 min )
    Mutual Information Estimation via Normalizing Flows
    arXiv:2403.02187v1 Announce Type: new Abstract: We propose a novel approach to the problem of mutual information (MI) estimation via introducing normalizing flows-based estimator. The estimator maps original data to the target distribution with known closed-form expression for MI. We demonstrate that our approach yields MI estimates for the original data. Experiments with high-dimensional data are provided to show the advantages of the proposed estimator.  ( 2 min )
    Inf2Guard: An Information-Theoretic Framework for Learning Privacy-Preserving Representations against Inference Attacks
    arXiv:2403.02116v1 Announce Type: new Abstract: Machine learning (ML) is vulnerable to inference (e.g., membership inference, property inference, and data reconstruction) attacks that aim to infer the private information of training data or dataset. Existing defenses are only designed for one specific type of attack and sacrifice significant utility or are soon broken by adaptive attacks. We address these limitations by proposing an information-theoretic defense framework, called Inf2Guard, against the three major types of inference attacks. Our framework, inspired by the success of representation learning, posits that learning shared representations not only saves time/costs but also benefits numerous downstream tasks. Generally, Inf2Guard involves two mutual information objectives, for privacy protection and utility preservation, respectively. Inf2Guard exhibits many merits: it facilitates the design of customized objectives against the specific inference attack; it provides a general defense framework which can treat certain existing defenses as special cases; and importantly, it aids in deriving theoretical results, e.g., inherent utility-privacy tradeoff and guaranteed privacy leakage. Extensive evaluations validate the effectiveness of Inf2Guard for learning privacy-preserving representations against inference attacks and demonstrate the superiority over the baselines.  ( 2 min )
    Distilled ChatGPT Topic & Sentiment Modeling with Applications in Finance
    arXiv:2403.02185v1 Announce Type: new Abstract: In this study, ChatGPT is utilized to create streamlined models that generate easily interpretable features. These features are then used to evaluate financial outcomes from earnings calls. We detail a training approach that merges knowledge distillation and transfer learning, resulting in lightweight topic and sentiment classification models without significant loss in accuracy. These models are assessed through a dataset annotated by experts. The paper also delves into two practical case studies, highlighting how the generated features can be effectively utilized in quantitative investing scenarios.  ( 2 min )
    Iterated $Q$-Network: Beyond the One-Step Bellman Operator
    arXiv:2403.02107v1 Announce Type: new Abstract: Value-based Reinforcement Learning (RL) methods rely on the application of the Bellman operator, which needs to be approximated from samples. Most approaches consist of an iterative scheme alternating the application of the Bellman operator and a subsequent projection step onto a considered function space. However, we observe that these algorithms can be improved by considering multiple iterations of the Bellman operator at once. Thus, we introduce iterated $Q$-Networks (iQN), a novel approach that learns a sequence of $Q$-function approximations where each $Q$-function serves as the target for the next one in a chain of consecutive Bellman iterations. We demonstrate that iQN is theoretically sound and show how it can be seamlessly used in value-based and actor-critic methods. We empirically demonstrate its advantages on Atari $2600$ games and in continuous-control MuJoCo environments.  ( 2 min )
    Deep Neural Network for Constraint Acquisition through Tailored Loss Function
    arXiv:2403.02042v1 Announce Type: new Abstract: The significance of learning constraints from data is underscored by its potential applications in real-world problem-solving. While constraints are popular for modeling and solving, the approaches to learning constraints from data remain relatively scarce. Furthermore, the intricate task of modeling demands expertise and is prone to errors, thus constraint acquisition methods offer a solution by automating this process through learnt constraints from examples or behaviours of solutions and non-solutions. This work introduces a novel approach grounded in Deep Neural Network (DNN) based on Symbolic Regression that, by setting suitable loss functions, constraints can be extracted directly from datasets. Using the present approach, direct formulation of constraints was achieved. Furthermore, given the broad pre-developed architectures and functionalities of DNN, connections and extensions with other frameworks could be foreseen.  ( 2 min )
    A Generative Model of Symmetry Transformations
    arXiv:2403.01946v1 Announce Type: new Abstract: Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though methods incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we construct a generative model that explicitly aims to capture symmetries in the data, resulting in a model that learns which symmetries are present in an interpretable way. We provide a simple algorithm for efficiently learning our generative model and demonstrate its ability to capture symmetries under affine and color transformations. Combining our symmetry model with existing generative models results in higher marginal test-log-likelihoods and robustness to data sparsification.  ( 2 min )
    Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities
    arXiv:2403.02004v1 Announce Type: new Abstract: We prove non-asymptotic error bounds for particle gradient descent (PGD)~(Kuntz et al., 2023), a recently introduced algorithm for maximum likelihood estimation of large latent variable models obtained by discretizing a gradient flow of the free energy. We begin by showing that, for models satisfying a condition generalizing both the log-Sobolev and the Polyak--{\L}ojasiewicz inequalities (LSI and P{\L}I, respectively), the flow converges exponentially fast to the set of minimizers of the free energy. We achieve this by extending a result well-known in the optimal transport literature (that the LSI implies the Talagrand inequality) and its counterpart in the optimization literature (that the P{\L}I implies the so-called quadratic growth condition), and applying it to our new setting. We also generalize the Bakry--\'Emery Theorem and show that the LSI/P{\L}I generalization holds for models with strongly concave log-likelihoods. For such models, we further control PGD's discretization error, obtaining non-asymptotic error bounds. While we are motivated by the study of PGD, we believe that the inequalities and results we extend may be of independent interest.  ( 2 min )
    Mitigating Label Noise on Graph via Topological Sample Selection
    arXiv:2403.01942v1 Announce Type: new Abstract: Despite the success of the carefully-annotated benchmarks, the effectiveness of existing graph neural networks (GNNs) can be considerably impaired in practice when the real-world graph data is noisily labeled. Previous explorations in sample selection have been demonstrated as an effective way for robust learning with noisy labels, however, the conventional studies focus on i.i.d data, and when moving to non-iid graph data and GNNs, two notable challenges remain: (1) nodes located near topological class boundaries are very informative for classification but cannot be successfully distinguished by the heuristic sample selection. (2) there is no available measure that considers the graph topological information to promote sample selection in a graph. To address this dilemma, we propose a $\textit{Topological Sample Selection}$ (TSS) method that boosts the informative sample selection process in a graph by utilising topological information. We theoretically prove that our procedure minimizes an upper bound of the expected risk under target clean distribution, and experimentally show the superiority of our method compared with state-of-the-art baselines.  ( 2 min )
    FlowPrecision: Advancing FPGA-Based Real-Time Fluid Flow Estimation with Linear Quantization
    arXiv:2403.01922v1 Announce Type: new Abstract: In industrial and environmental monitoring, achieving real-time and precise fluid flow measurement remains a critical challenge. This study applies linear quantization in FPGA-based soft sensors for fluid flow estimation, significantly enhancing Neural Network model precision by overcoming the limitations of traditional fixed-point quantization. Our approach achieves up to a 10.10% reduction in Mean Squared Error and a notable 9.39% improvement in inference speed through targeted hardware optimizations. Validated across multiple data sets, our findings demonstrate that the optimized FPGA-based quantized models can provide efficient, accurate real-time inference, offering a viable alternative to cloud-based processing in pervasive autonomous systems.  ( 2 min )
    Robustness Bounds on the Successful Adversarial Examples: Theory and Practice
    arXiv:2403.01896v1 Announce Type: new Abstract: Adversarial example (AE) is an attack method for machine learning, which is crafted by adding imperceptible perturbation to the data inducing misclassification. In the current paper, we investigated the upper bound of the probability of successful AEs based on the Gaussian Process (GP) classification. We proved a new upper bound that depends on AE's perturbation norm, the kernel function used in GP, and the distance of the closest pair with different labels in the training dataset. Surprisingly, the upper bound is determined regardless of the distribution of the sample dataset. We showed that our theoretical result was confirmed through the experiment using ImageNet. In addition, we showed that changing the parameters of the kernel function induces a change of the upper bound of the probability of successful AEs.  ( 2 min )
    Matrix Completion with Convex Optimization and Column Subset Selection
    arXiv:2403.01919v1 Announce Type: new Abstract: We introduce a two-step method for the matrix recovery problem. Our approach combines the theoretical foundations of the Column Subset Selection and Low-rank Matrix Completion problems. The proposed method, in each step, solves a convex optimization task. We present two algorithms that implement our Columns Selected Matrix Completion (CSMC) method, each dedicated to a different size problem. We performed a formal analysis of the presented method, in which we formulated the necessary assumptions and the probability of finding a correct solution. In the second part of the paper, we present the results of the experimental work. Numerical experiments verified the correctness and performance of the algorithms. To study the influence of the matrix size, rank, and the proportion of missing elements on the quality of the solution and the computation time, we performed experiments on synthetic data. The presented method was applied to two real-life problems problems: prediction of movie rates in a recommendation system and image inpainting. Our thorough analysis shows that CSMC provides solutions of comparable quality to matrix completion algorithms, which are based on convex optimization. However, CSMC offers notable savings in terms of runtime.  ( 2 min )
    Unsupervised Distance Metric Learning for Anomaly Detection Over Multivariate Time Series
    arXiv:2403.01895v1 Announce Type: new Abstract: Distance-based time series anomaly detection methods are prevalent due to their relative non-parametric nature and interpretability. However, the commonly used Euclidean distance is sensitive to noise. While existing works have explored dynamic time warping (DTW) for its robustness, they only support supervised tasks over multivariate time series (MTS), leaving a scarcity of unsupervised methods. In this work, we propose FCM-wDTW, an unsupervised distance metric learning method for anomaly detection over MTS, which encodes raw data into latent space and reveals normal dimension relationships through cluster centers. FCM-wDTW introduces locally weighted DTW into fuzzy C-means clustering and learns the optimal latent space efficiently, enabling anomaly identification via data reconstruction. Experiments with 11 different types of benchmarks demonstrate our method's competitive accuracy and efficiency.  ( 2 min )
    ICLN: Input Convex Loss Network for Decision Focused Learning
    arXiv:2403.01875v1 Announce Type: new Abstract: In decision-making problem under uncertainty, predicting unknown parameters is often considered independent of the optimization part. Decision-focused Learning (DFL) is a task-oriented framework to integrate prediction and optimization by adapting predictive model to give better decision for the corresponding task. Here, an inevitable challenge arises when computing gradients of the optimal decision with respect to the parameters. Existing researches cope this issue by smoothly reforming surrogate optimization or construct surrogate loss function that mimic task loss. However, they are applied to restricted optimization domain or build functions in a local manner leading a large computational time. In this paper, we propose Input Convex Loss Network (ICLN), a novel global surrogate loss which can be implemented in a general DFL paradigm. ICLN learns task loss via Input Convex Neural Networks which is guaranteed to be convex for some inputs, while keeping the global structure for the other inputs. This enables ICLN to admit general DFL through only a single surrogate loss without any sense for choosing appropriate parametric forms. We confirm effectiveness and flexibility of ICLN by evaluating our proposed model with three stochastic decision-making problems.  ( 2 min )
    Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
    arXiv:2403.01857v1 Announce Type: new Abstract: In this paper, we take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO). We focus our attention on the class of loglinear policy parametrization and linear reward functions. In order to compare the two paradigms, we first derive minimax statistical bounds on the suboptimality gap induced by both RLHF and DPO, assuming access to an oracle that exactly solves the optimization problems. We provide a detailed discussion on the relative comparison between the two paradigms, simultaneously taking into account the sample size, policy and reward class dimensions, and the regularization temperature. Moreover, we extend our analysis to the approximate optimization setting and derive exponentially decaying convergence rates for both RLHF and DPO. Next, we analyze the setting where the ground-truth reward is not realizable and find that, while RLHF incurs a constant additional error, DPO retains its asymptotically decaying gap by just tuning the temperature accordingly. Finally, we extend our comparison to the Markov decision process setting, where we generalize our results with exact optimization. To the best of our knowledge, we are the first to provide such a comparative analysis for RLHF and DPO.  ( 2 min )
    A Survey on Evaluation of Out-of-Distribution Generalization
    arXiv:2403.01874v1 Announce Type: new Abstract: Machine learning models, while progressively advanced, rely heavily on the IID assumption, which is often unfulfilled in practice due to inevitable distribution shifts. This renders them susceptible and untrustworthy for deployment in risk-sensitive applications. Such a significant problem has consequently spawned various branches of works dedicated to developing algorithms capable of Out-of-Distribution (OOD) generalization. Despite these efforts, much less attention has been paid to the evaluation of OOD generalization, which is also a complex and fundamental problem. Its goal is not only to assess whether a model's OOD generalization capability is strong or not, but also to evaluate where a model generalizes well or poorly. This entails characterizing the types of distribution shifts that a model can effectively address, and identifying the safe and risky input regions given a model. This paper serves as the first effort to conduct a comprehensive review of OOD evaluation. We categorize existing research into three paradigms: OOD performance testing, OOD performance prediction, and OOD intrinsic property characterization, according to the availability of test data. Additionally, we briefly discuss OOD evaluation in the context of pretrained models. In closing, we propose several promising directions for future research in OOD evaluation.  ( 2 min )
    NASH: Neural Architecture Search for Hardware-Optimized Machine Learning Models
    arXiv:2403.01845v1 Announce Type: new Abstract: As machine learning (ML) algorithms get deployed in an ever-increasing number of applications, these algorithms need to achieve better trade-offs between high accuracy, high throughput and low latency. This paper introduces NASH, a novel approach that applies neural architecture search to machine learning hardware. Using NASH, hardware designs can achieve not only high throughput and low latency but also superior accuracy performance. We present four versions of the NASH strategy in this paper, all of which show higher accuracy than the original models. The strategy can be applied to various convolutional neural networks, selecting specific model operations among many to guide the training process toward higher accuracy. Experimental results show that applying NASH on ResNet18 or ResNet34 achieves a top 1 accuracy increase of up to 3.1% and a top 5 accuracy increase of up to 2.2% compared to the non-NASH version when tested on the ImageNet data set. We also integrated this approach into the FINN hardware model synthesis tool to automate the application of our approach and the generation of the hardware model. Results show that using FINN can achieve a maximum throughput of 324.5 fps. In addition, NASH models can also result in a better trade-off between accuracy and hardware resource utilization. The accuracy-hardware (HW) Pareto curve shows that the models with the four NASH versions represent the best trade-offs achieving the highest accuracy for a given HW utilization. The code for our implementation is open-source and publicly available on GitHub at https://github.com/MFJI/NASH.  ( 3 min )
    COLA: Cross-city Mobility Transformer for Human Trajectory Simulation
    arXiv:2403.01801v1 Announce Type: new Abstract: Human trajectory data produced by daily mobile devices has proven its usefulness in various substantial fields such as urban planning and epidemic prevention. In terms of the individual privacy concern, human trajectory simulation has attracted increasing attention from researchers, targeting at offering numerous realistic mobility data for downstream tasks. Nevertheless, the prevalent issue of data scarcity undoubtedly degrades the reliability of existing deep learning models. In this paper, we are motivated to explore the intriguing problem of mobility transfer across cities, grasping the universal patterns of human trajectories to augment the powerful Transformer with external mobility data. There are two crucial challenges arising in the knowledge transfer across cities: 1) how to transfer the Transformer to adapt for domain heterogeneity; 2) how to calibrate the Transformer to adapt for subtly different long-tail frequency distributions of locations. To address these challenges, we have tailored a Cross-city mObiLity trAnsformer (COLA) with a dedicated model-agnostic transfer framework by effectively transferring cross-city knowledge for human trajectory simulation. Firstly, COLA divides the Transformer into the private modules for city-specific characteristics and the shared modules for city-universal mobility patterns. Secondly, COLA leverages a lightweight yet effective post-hoc adjustment strategy for trajectory simulation, without disturbing the complex bi-level optimization of model-agnostic knowledge transfer. Extensive experiments of COLA compared to state-of-the-art single-city baselines and our implemented cross-city baselines have demonstrated its superiority and effectiveness. The code is available at https://github.com/Star607/Cross-city-Mobility-Transformer.  ( 3 min )
    Improving out-of-distribution generalization in graphs via hierarchical semantic environments
    arXiv:2403.01773v1 Announce Type: new Abstract: Out-of-distribution (OOD) generalization in the graph domain is challenging due to complex distribution shifts and a lack of environmental contexts. Recent methods attempt to enhance graph OOD generalization by generating flat environments. However, such flat environments come with inherent limitations to capture more complex data distributions. Considering the DrugOOD dataset, which contains diverse training environments (e.g., scaffold, size, etc.), flat contexts cannot sufficiently address its high heterogeneity. Thus, a new challenge is posed to generate more semantically enriched environments to enhance graph invariant learning for handling distribution shifts. In this paper, we propose a novel approach to generate hierarchical semantic environments for each graph. Firstly, given an input graph, we explicitly extract variant subgraphs from the input graph to generate proxy predictions on local environments. Then, stochastic attention mechanisms are employed to re-extract the subgraphs for regenerating global environments in a hierarchical manner. In addition, we introduce a new learning objective that guides our model to learn the diversity of environments within the same hierarchy while maintaining consistency across different hierarchies. This approach enables our model to consider the relationships between environments and facilitates robust graph invariant learning. Extensive experiments on real-world graph data have demonstrated the effectiveness of our framework. Particularly, in the challenging dataset DrugOOD, our method achieves up to 1.29\% and 2.83\% improvement over the best baselines on IC50 and EC50 prediction tasks, respectively.  ( 2 min )
    A Safe Screening Rule with Bi-level Optimization of $\nu$ Support Vector Machine
    arXiv:2403.01769v1 Announce Type: new Abstract: Support vector machine (SVM) has achieved many successes in machine learning, especially for a small sample problem. As a famous extension of the traditional SVM, the $\nu$ support vector machine ($\nu$-SVM) has shown outstanding performance due to its great model interpretability. However, it still faces challenges in training overhead for large-scale problems. To address this issue, we propose a safe screening rule with bi-level optimization for $\nu$-SVM (SRBO-$\nu$-SVM) which can screen out inactive samples before training and reduce the computational cost without sacrificing the prediction accuracy. Our SRBO-$\nu$-SVM is strictly deduced by integrating the Karush-Kuhn-Tucker (KKT) conditions, the variational inequalities of convex problems and the $\nu$-property. Furthermore, we develop an efficient dual coordinate descent method (DCDM) to further improve computational speed. Finally, a unified framework for SRBO is proposed to accelerate many SVM-type models, and it is successfully applied to one-class SVM. Experimental results on 6 artificial data sets and 30 benchmark data sets have verified the effectiveness and safety of our proposed methods in supervised and unsupervised tasks.  ( 2 min )
    Open-world Machine Learning: A Review and New Outlooks
    arXiv:2403.01759v1 Announce Type: new Abstract: Machine learning has achieved remarkable success in many applications. However, existing studies are largely based on the closed-world assumption, which assumes that the environment is stationary, and the model is fixed once deployed. In many real-world applications, this fundamental and rather naive assumption may not hold because an open environment is complex, dynamic, and full of unknowns. In such cases, rejecting unknowns, discovering novelties, and then incrementally learning them, could enable models to be safe and evolve continually as biological systems do. This paper provides a holistic view of open-world machine learning by investigating unknown rejection, novel class discovery, and class-incremental learning in a unified paradigm. The challenges, principles, and limitations of current methodologies are discussed in detail. Finally, we discuss several potential directions for future research. This paper aims to provide a comprehensive introduction to the emerging open-world machine learning paradigm, to help researchers build more powerful AI systems in their respective fields, and to promote the development of artificial general intelligence.  ( 2 min )
    Diffusion-TS: Interpretable Diffusion for General Time Series Generation
    arXiv:2403.01742v1 Announce Type: new Abstract: Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. It has recently shown breakthroughs in audio synthesis, time series imputation and forecasting. In this paper, we propose Diffusion-TS, a novel diffusion-based framework that generates multivariate time series samples of high quality by using an encoder-decoder transformer with disentangled temporal representations, in which the decomposition technique guides Diffusion-TS to capture the semantic meaning of time series while transformers mine detailed sequential information from the noisy model input. Different from existing diffusion-based approaches, we train the model to directly reconstruct the sample instead of the noise in each diffusion step, combining a Fourier-based loss term. Diffusion-TS is expected to generate time series satisfying both interpretablity and realness. In addition, it is shown that the proposed Diffusion-TS can be easily extended to conditional generation tasks, such as forecasting and imputation, without any model changes. This also motivates us to further explore the performance of Diffusion-TS under irregular settings. Finally, through qualitative and quantitative experiments, results show that Diffusion-TS achieves the state-of-the-art results on various realistic analyses of time series.  ( 2 min )
    ComS2T: A complementary spatiotemporal learning system for data-adaptive model evolution
    arXiv:2403.01738v1 Announce Type: new Abstract: Spatiotemporal (ST) learning has become a crucial technique to enable smart cities and sustainable urban development. Current ST learning models capture the heterogeneity via various spatial convolution and temporal evolution blocks. However, rapid urbanization leads to fluctuating distributions in urban data and city structures over short periods, resulting in existing methods suffering generalization and data adaptation issues. Despite efforts, existing methods fail to deal with newly arrived observations and those methods with generalization capacity are limited in repeated training. Motivated by complementary learning in neuroscience, we introduce a prompt-based complementary spatiotemporal learning termed ComS2T, to empower the evolution of models for data adaptation. ComS2T partitions the neural architecture into a stable neocortex for consolidating historical memory and a dynamic hippocampus for new knowledge update. We first disentangle two disjoint structures into stable and dynamic weights, and then train spatial and temporal prompts by characterizing distribution of main observations to enable prompts adaptive to new data. This data-adaptive prompt mechanism, combined with a two-stage training process, facilitates fine-tuning of the neural architecture conditioned on prompts, thereby enabling efficient adaptation during testing. Extensive experiments validate the efficacy of ComS2T in adapting to various spatiotemporal out-of-distribution scenarios while maintaining efficient inference capabilities.  ( 2 min )
    L0 Regularization of Field-Aware Factorization Machine through Ising Model
    arXiv:2403.01718v1 Announce Type: new Abstract: We examined the use of the Ising model as an L0 regularization method for field-aware factorization machines (FFM). This approach improves generalization performance and has the advantage of simultaneously determining the best feature combinations for each of several groups. We can deepen the interpretation and understanding of the model from the similarities and differences in the features selected in each group.  ( 2 min )
    DyCE: Dynamic Configurable Exiting for Deep Learning Compression and Scaling
    arXiv:2403.01695v1 Announce Type: new Abstract: Modern deep learning (DL) models necessitate the employment of scaling and compression techniques for effective deployment in resource-constrained environments. Most existing techniques, such as pruning and quantization are generally static. On the other hand, dynamic compression methods, such as early exits, reduce complexity by recognizing the difficulty of input samples and allocating computation as needed. Dynamic methods, despite their superior flexibility and potential for co-existing with static methods, pose significant challenges in terms of implementation due to any changes in dynamic parts will influence subsequent processes. Moreover, most current dynamic compression designs are monolithic and tightly integrated with base models, thereby complicating the adaptation to novel base models. This paper introduces DyCE, an dynamic configurable early-exit framework that decouples design considerations from each other and from the base model. Utilizing this framework, various types and positions of exits can be organized according to predefined configurations, which can be dynamically switched in real-time to accommodate evolving performance-complexity requirements. We also propose techniques for generating optimized configurations based on any desired trade-off between performance and computational complexity. This empowers future researchers to focus on the improvement of individual exits without latent compromise of overall system performance. The efficacy of this approach is demonstrated through image classification tasks with deep CNNs. DyCE significantly reduces the computational complexity by 23.5% of ResNet152 and 25.9% of ConvNextv2-tiny on ImageNet, with accuracy reductions of less than 0.5%. Furthermore, DyCE offers advantages over existing dynamic methods in terms of real-time configuration and fine-grained performance tuning.  ( 3 min )
    Permutation invariant functions: statistical tests, dimension reduction in metric entropy and estimation
    arXiv:2403.01671v1 Announce Type: new Abstract: Permutation invariance is among the most common symmetry that can be exploited to simplify complex problems in machine learning (ML). There has been a tremendous surge of research activities in building permutation invariant ML architectures. However, less attention is given to how to statistically test for permutation invariance of variables in a multivariate probability distribution where the dimension is allowed to grow with the sample size. Also, in terms of a statistical theory, little is known about how permutation invariance helps with estimation in reducing dimensions. In this paper, we take a step back and examine these questions in several fundamental problems: (i) testing the assumption of permutation invariance of multivariate distributions; (ii) estimating permutation invariant densities; (iii) analyzing the metric entropy of smooth permutation invariant function classes and compare them with their counterparts without imposing permutation invariance; (iv) kernel ridge regression of permutation invariant functions in reproducing kernel Hilbert space. In particular, our methods for (i) and (iv) are based on a sorting trick and (ii) is based on an averaging trick. These tricks substantially simplify the exploitation of permutation invariance.  ( 2 min )
    Improving Adversarial Energy-Based Model via Diffusion Process
    arXiv:2403.01666v1 Announce Type: new Abstract: Generative models have shown strong generation ability while efficient likelihood estimation is less explored. Energy-based models~(EBMs) define a flexible energy function to parameterize unnormalized densities efficiently but are notorious for being difficult to train. Adversarial EBMs introduce a generator to form a minimax training game to avoid expensive MCMC sampling used in traditional EBMs, but a noticeable gap between adversarial EBMs and other strong generative models still exists. Inspired by diffusion-based models, we embedded EBMs into each denoising step to split a long-generated process into several smaller steps. Besides, we employ a symmetric Jeffrey divergence and introduce a variational posterior distribution for the generator's training to address the main challenges that exist in adversarial EBMs. Our experiments show significant improvement in generation compared to existing adversarial EBMs, while also providing a useful energy function for efficient density estimation.  ( 2 min )
    Quantifying and Predicting Residential Building Flexibility Using Machine Learning Methods
    arXiv:2403.01669v1 Announce Type: new Abstract: Residential buildings account for a significant portion (35\%) of the total electricity consumption in the U.S. as of 2022. As more distributed energy resources are installed in buildings, their potential to provide flexibility to the grid increases. To tap into that flexibility provided by buildings, aggregators or system operators need to quantify and forecast flexibility. Previous works in this area primarily focused on commercial buildings, with little work on residential buildings. To address the gap, this paper first proposes two complementary flexibility metrics (i.e., power and energy flexibility) and then investigates several mainstream machine learning-based models for predicting the time-variant and sporadic flexibility of residential buildings at four-hour and 24-hour forecast horizons. The long-short-term-memory (LSTM) model achieves the best performance and can predict power flexibility for up to 24 hours ahead with the average error around 0.7 kW. However, for energy flexibility, the LSTM model is only successful for loads with consistent operational patterns throughout the year and faces challenges when predicting energy flexibility associated with HVAC systems.  ( 2 min )
    Geometry and Stability of Supervised Learning Problems
    arXiv:2403.01660v1 Announce Type: new Abstract: We introduce a notion of distance between supervised learning problems, which we call the Risk distance. This optimal-transport-inspired distance facilitates stability results; one can quantify how seriously issues like sampling bias, noise, limited data, and approximations might change a given problem by bounding how much these modifications can move the problem under the Risk distance. With the distance established, we explore the geometry of the resulting space of supervised learning problems, providing explicit geodesics and proving that the set of classification problems is dense in a larger class of problems. We also provide two variants of the Risk distance: one that incorporates specified weights on a problem's predictors, and one that is more sensitive to the contours of a problem's risk landscape.  ( 2 min )
    Day-ahead regional solar power forecasting with hierarchical temporal convolutional neural networks using historical power generation and weather data
    arXiv:2403.01653v1 Announce Type: new Abstract: Regional solar power forecasting, which involves predicting the total power generation from all rooftop photovoltaic systems in a region holds significant importance for various stakeholders in the energy sector. However, the vast amount of solar power generation and weather time series from geographically dispersed locations that need to be considered in the forecasting process makes accurate regional forecasting challenging. Therefore, previous work has limited the focus to either forecasting a single time series (i.e., aggregated time series) which is the addition of all solar generation time series in a region, disregarding the location-specific weather effects or forecasting solar generation time series of each PV site (i.e., individual time series) independently using location-specific weather data, resulting in a large number of forecasting models. In this work, we propose two deep-learning-based regional forecasting methods that can effectively leverage both types of time series (aggregated and individual) with weather data in a region. We propose two hierarchical temporal convolutional neural network architectures (HTCNN) and two strategies to adapt HTCNNs for regional solar power forecasting. At first, we explore generating a regional forecast using a single HTCNN. Next, we divide the region into multiple sub-regions based on weather information and train separate HTCNNs for each sub-region; the forecasts of each sub-region are then added to generate a regional forecast. The proposed work is evaluated using a large dataset collected over a year from 101 locations across Western Australia to provide a day ahead forecast. We compare our approaches with well-known alternative methods and show that the sub-region HTCNN requires fewer individual networks and achieves a forecast skill score of 40.2% reducing a statistically significant error by 6.5% compared to the best counterpart.  ( 3 min )
    Blue and Green-Mode Energy-Efficient Chemiresistive Sensor Array Realized by Rapid Ensemble Learning
    arXiv:2403.01642v1 Announce Type: new Abstract: The rapid advancement of Internet of Things (IoT) necessitates the development of optimized Chemiresistive Sensor (CRS) arrays that are both energy-efficient and capable. This study introduces a novel optimization strategy that employs a rapid ensemble learning-based model committee approach to achieve these goals. Utilizing machine learning models such as Elastic Net Regression, Random Forests, and XGBoost, among others, the strategy identifies the most impactful sensors in a CRS array for accurate classification: A weighted voting mechanism is introduced to aggregate the models' opinions in sensor selection, thereby setting up wo distinct working modes, termed "Blue" and "Green". The Blue mode operates with all sensors for maximum detection capability, while the Green mode selectively activates only key sensors, significantly reducing energy consumption without compromising detection accuracy. The strategy is validated through theoretical calculations and Monte Carlo simulations, demonstrating its effectiveness and accuracy. The proposed optimization strategy not only elevates the detection capability of CRS arrays but also brings it closer to theoretical limits, promising significant implications for the development of low-cost, easily fabricable next-generation IoT sensor terminals.  ( 2 min )
    You Need to Pay Better Attention
    arXiv:2403.01643v1 Announce Type: new Abstract: We introduce three new attention mechanisms that outperform standard multi-head attention in terms of efficiency and learning capabilities, thereby improving the performance and broader deployability of Transformer models. Our first contribution is Optimised Attention, which performs similarly to standard attention, but has 3/4 as many parameters and one matrix multiplication fewer per head. Next, we introduce Efficient Attention, which performs on par with standard attention with only 1/2 as many parameters as many parameters and two matrix multiplications fewer per head and is up to twice as fast as standard attention. Lastly, we introduce Super Attention, which surpasses standard attention by a significant margin in both vision and natural language processing tasks while having fewer parameters and matrix multiplications. In addition to providing rigorous mathematical comparisons, we evaluate the presented attention mechanisms on MNIST, CIFAR100, IMDB Movie Reviews, and Amazon Reviews datasets.  ( 2 min )
    Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models
    arXiv:2403.01639v1 Announce Type: new Abstract: Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties. Such information is coined as guidance. For example, in text-to-image synthesis, text input is encoded as guidance to generate semantically aligned images. Proper guidance inputs are closely tied to the performance of diffusion models. A common observation is that strong guidance promotes a tight alignment to the task-specific information, while reducing the diversity of the generated samples. In this paper, we provide the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models. Under mild conditions, we prove that incorporating diffusion guidance not only boosts classification confidence but also diminishes distribution diversity, leading to a reduction in the differential entropy of the output distribution. Our analysis covers the widely adopted sampling schemes including DDPM and DDIM, and leverages comparison inequalities for differential equations as well as the Fokker-Planck equation that characterizes the evolution of probability density function, which may be of independent theoretical interest.  ( 2 min )
    Critical windows: non-asymptotic theory for feature emergence in diffusion models
    arXiv:2403.01633v1 Announce Type: new Abstract: We develop theory to understand an intriguing property of diffusion models for image generation that we term critical windows. Empirically, it has been observed that there are narrow time intervals in sampling during which particular features of the final image emerge, e.g. the image class or background color (Ho et al., 2020b; Georgiev et al., 2023; Raya & Ambrogioni, 2023; Sclocchi et al., 2024; Biroli et al., 2024). While this is advantageous for interpretability as it implies one can localize properties of the generation to a small segment of the trajectory, it seems at odds with the continuous nature of the diffusion. We propose a formal framework for studying these windows and show that for data coming from a mixture of strongly log-concave densities, these windows can be provably bounded in terms of certain measures of inter- and intra-group separation. We also instantiate these bounds for concrete examples like well-conditioned Gaussian mixtures. Finally, we use our bounds to give a rigorous interpretation of diffusion models as hierarchical samplers that progressively "decide" output features over a discrete sequence of times. We validate our bounds with synthetic experiments. Additionally, preliminary experiments on Stable Diffusion suggest critical windows may serve as a useful tool for diagnosing fairness and privacy violations in real-world diffusion models.  ( 2 min )
    Recent Advances, Applications, and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2023 Symposium
    arXiv:2403.01628v1 Announce Type: new Abstract: The third ML4H symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the \ac{ML4H} community. Encouraged by the successful virtual roundtables in the previous year, we organized eleven in-person roundtables and four virtual roundtables at ML4H 2022. The organization of the research roundtables at the conference involved 17 Senior Chairs and 19 Junior Chairs across 11 tables. Each roundtable session included invited senior chairs (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with interest in the session's topic. Herein we detail the organization process and compile takeaways from these roundtable discussions, including recent advances, applications, and open challenges for each topic. We conclude with a summary and lessons learned across all roundtables. This document serves as a comprehensive review paper, summarizing the recent advancements in machine learning for healthcare as contributed by foremost researchers in the field.  ( 3 min )
    Improving LLM Code Generation with Grammar Augmentation
    arXiv:2403.01632v1 Announce Type: new Abstract: We present SynCode a novel framework for efficient and general syntactical decoding of code with large language models (LLMs). SynCode leverages the grammar of a programming language, utilizing an offline-constructed efficient lookup table called DFA mask store based on language grammar terminals. We demonstrate SynCode's soundness and completeness given the context-free grammar (CFG) of the programming language, presenting its ability to retain syntactically valid tokens while rejecting invalid ones. The framework seamlessly integrates with any language defined by CFG, as evidenced by experiments on CFGs for Python and Go. The results underscore the significant reduction of 96.07% of syntax errors achieved when SynCode is combined with state-of-the-art LLMs, showcasing its substantial impact on enhancing syntactical precision in code generation. Our code is available at https://github.com/uiuc-focal-lab/syncode.  ( 2 min )
    ML4PhySim : Machine Learning for Physical Simulations Challenge (The airfoil design)
    arXiv:2403.01623v1 Announce Type: new Abstract: The use of machine learning (ML) techniques to solve complex physical problems has been considered recently as a promising approach. However, the evaluation of such learned physical models remains an important issue for industrial use. The aim of this competition is to encourage the development of new ML techniques to solve physical problems using a unified evaluation framework proposed recently, called Learning Industrial Physical Simulations (LIPS). We propose learning a task representing a well-known physical use case: the airfoil design simulation, using a dataset called AirfRANS. The global score calculated for each submitted solution is based on three main categories of criteria covering different aspects, namely: ML-related, Out-Of-Distribution, and physical compliance criteria. To the best of our knowledge, this is the first competition addressing the use of ML-based surrogate approaches to improve the trade-off computational cost/accuracy of physical simulation.The competition is hosted by the Codabench platform with online training and evaluation of all submitted solutions.  ( 2 min )
    Machine Learning vs Deep Learning: The Generalization Problem
    arXiv:2403.01621v1 Announce Type: new Abstract: The capacity to generalize beyond the range of training data is a pivotal challenge, often synonymous with a model's utility and robustness. This study investigates the comparative abilities of traditional machine learning (ML) models and deep learning (DL) algorithms in terms of extrapolation -- a more challenging aspect of generalization because it requires the model to make inferences about data points that lie outside the domain it has been trained on. We present an empirical analysis where both ML and DL models are trained on an exponentially growing function and then tested on values outside the training domain. The choice of this function allows us to distinctly showcase the divergence in performance when models are required to predict beyond the scope of their training data. Our findings suggest that deep learning models possess inherent capabilities to generalize beyond the training scope, an essential feature for real-world applications where data is often incomplete or extends beyond the observed range. This paper argues for a nuanced understanding of the structural differences between ML and DL models, with an emphasis on the implications for both theoretical research and practical deployment.  ( 2 min )
    Respiratory motion forecasting with online learning of recurrent neural networks for safety enhancement in externally guided radiotherapy
    arXiv:2403.01607v1 Announce Type: new Abstract: In lung radiotherapy, infrared cameras can record the location of reflective objects on the chest to infer the position of the tumor moving due to breathing, but treatment system latencies hinder radiation beam precision. Real-time recurrent learning (RTRL), is a potential solution as it can learn patterns within non-stationary respiratory data but has high complexity. This study assesses the capabilities of resource-efficient online RNN algorithms, namely unbiased online recurrent optimization (UORO), sparse-1 step approximation (SnAp-1), and decoupled neural interfaces (DNI) to forecast respiratory motion during radiotherapy treatment accurately. We use time series containing the 3D position of external markers on the chest of healthy subjects. We propose efficient implementations for SnAp-1 and DNI based on compression of the influence and immediate Jacobian matrices and an accurate update of the linear coefficients used in credit assignment estimation, respectively. The original sampling frequency was 10Hz; we performed resampling at 3.33Hz and 30Hz. We use UORO, SnAp-1, and DNI to forecast each marker's 3D position with horizons (the time interval in advance for which the prediction is made) h<=2.1s and compare them with RTRL, least mean squares, and linear regression. RNNs trained online achieved similar or better accuracy than most previous works using larger training databases and deep learning, even though we used only the first minute of each sequence to predict motion within that exact sequence. SnAp-1 had the lowest normalized root mean square errors (nRMSE) averaged over the horizon values considered, equal to 0.335 and 0.157, at 3.33Hz and 10.0Hz, respectively. Similarly, UORO had the highest accuracy at 30Hz, with an nRMSE of 0.0897. DNI's inference time, equal to 6.8ms per time step at 30Hz (Intel Core i7-13700 CPU), was the lowest among the RNN methods examined.  ( 3 min )
    Partial Federated Learning
    arXiv:2403.01615v1 Announce Type: new Abstract: Federated Learning (FL) is a popular algorithm to train machine learning models on user data constrained to edge devices (for example, mobile phones) due to privacy concerns. Typically, FL is trained with the assumption that no part of the user data can be egressed from the edge. However, in many production settings, specific data-modalities/meta-data are limited to be on device while others are not. For example, in commercial SLU systems, it is typically desired to prevent transmission of biometric signals (such as audio recordings of the input prompt) to the cloud, but egress of locally (i.e. on the edge device) transcribed text to the cloud may be possible. In this work, we propose a new algorithm called Partial Federated Learning (PartialFL), where a machine learning model is trained using data where a subset of data modalities or their intermediate representations can be made available to the server. We further restrict our model training by preventing the egress of data labels to the cloud for better privacy, and instead use a contrastive learning based model objective. We evaluate our approach on two different multi-modal datasets and show promising results with our proposed approach.  ( 2 min )
    Towards Provable Log Density Policy Gradient
    arXiv:2403.01605v1 Announce Type: new Abstract: Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Modern policy gradient methods, although successful, introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and correcting for it could potentially improve sample-complexity of reinforcement learning methods. To that end, we propose log density gradient to estimate the policy gradient, which corrects for this residual error term. Log density gradient method computes policy gradient by utilising the state-action discounted distributional formulation. We first present the equations needed to exactly find the log density gradient for a tabular Markov Decision Processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates log density gradient by utilizing backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive we also propose a min-max optimization that can approximate log density gradient using just on-policy samples. We also prove uniqueness, and convergence under linear function approximation, for this min-max optimization. Finally, we show that the sample complexity of our min-max optimization to be of the order of $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof-of-concept for our log density gradient method on gridworld environment, and observe that our method is able to improve upon the classical policy gradient method by a clear margin, thus indicating a promising novel direction to develop reinforcement learning algorithms that require fewer samples.  ( 2 min )
    The Hidden Attention of Mamba Models
    arXiv:2403.01590v1 Announce Type: new Abstract: The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains including NLP, long-range sequences processing, and computer vision. Selective SSMs are viewed as dual models, in which one trains in parallel on the entire sequence via IO-aware parallel scan, and deploys in an autoregressive manner. We add a third view and show that such models can be viewed as attention-driven models. This new perspective enables us to compare the underlying mechanisms to that of the self-attention layers in transformers and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.  ( 2 min )
    Transformers for Supervised Online Continual Learning
    arXiv:2403.01554v1 Announce Type: new Abstract: Transformers have become the dominant architecture for sequence modeling tasks such as natural language processing or audio processing, and they are now even considered for tasks that are not naturally sequential such as image classification. Their ability to attend to and to process a set of tokens as context enables them to develop in-context few-shot learning abilities. However, their potential for online continual learning remains relatively unexplored. In online continual learning, a model must adapt to a non-stationary stream of data, minimizing the cumulative nextstep prediction loss. We focus on the supervised online continual learning setting, where we learn a predictor $x_t \rightarrow y_t$ for a sequence of examples $(x_t, y_t)$. Inspired by the in-context learning capabilities of transformers and their connection to meta-learning, we propose a method that leverages these strengths for online continual learning. Our approach explicitly conditions a transformer on recent observations, while at the same time online training it with stochastic gradient descent, following the procedure introduced with Transformer-XL. We incorporate replay to maintain the benefits of multi-epoch training while adhering to the sequential protocol. We hypothesize that this combination enables fast adaptation through in-context learning and sustained longterm improvement via parametric learning. Our method demonstrates significant improvements over previous state-of-the-art results on CLOC, a challenging large-scale real-world benchmark for image geo-localization.  ( 2 min )
    On the Model-Agnostic Multi-Source-Free Unsupervised Domain Adaptation
    arXiv:2403.01582v1 Announce Type: new Abstract: Multi-Source-Free Unsupervised Domain Adaptation (MSFDA) aims to transfer knowledge from multiple well-labeled source domains to an unlabeled target domain, using source models instead of source data. Existing MSFDA methods limited that each source domain provides only a single model, with a uniform structure. This paper introduces a new MSFDA setting: Model-Agnostic Multi-Source-Free Unsupervised Domain Adaptation (MMDA), allowing diverse source models with varying architectures, without quantitative restrictions. While MMDA holds promising potential, incorporating numerous source models poses a high risk of including undesired models, which highlights the source model selection problem. To address it, we first provide a theoretical analysis of this problem. We reveal two fundamental selection principles: transferability principle and diversity principle, and introduce a selection algorithm to integrate them. Then, considering the measure of transferability is challenging, we propose a novel Source-Free Unsupervised Transferability Estimation (SUTE). This novel formulation enables the assessment and comparison of transferability across multiple source models with different architectures in the context of domain shift, without requiring access to any target labels or source data. Based on the above, we introduce a new framework to address MMDA. Specifically, we first conduct source model selection based on the proposed selection principles. Subsequently, we design two modules to aggregate knowledge from included models and recycle useful knowledge from excluded models. These modules enable us to leverage source knowledge efficiently and effectively, thereby supporting us in learning a discriminative target model via adaptation. We validate the effectiveness of our method through numerous experimental results, and demonstrate that our approach achieves state-of-the-art performance.  ( 3 min )
    Quantized Hierarchical Federated Learning: A Robust Approach to Statistical Heterogeneity
    arXiv:2403.01540v1 Announce Type: new Abstract: This paper presents a novel hierarchical federated learning algorithm within multiple sets that incorporates quantization for communication-efficiency and demonstrates resilience to statistical heterogeneity. Unlike conventional hierarchical federated learning algorithms, our approach combines gradient aggregation in intra-set iterations with model aggregation in inter-set iterations. We offer a comprehensive analytical framework to evaluate its optimality gap and convergence rate, comparing these aspects with those of conventional algorithms. Additionally, we develop a problem formulation to derive optimal system parameters in a closed-form solution. Our findings reveal that our algorithm consistently achieves high learning accuracy over a range of parameters and significantly outperforms other hierarchical algorithms, particularly in scenarios with heterogeneous data distributions.  ( 2 min )
    Neural Graph Generator: Feature-Conditioned Graph Generation using Latent Diffusion Models
    arXiv:2403.01535v1 Announce Type: new Abstract: Graph generation has emerged as a crucial task in machine learning, with significant challenges in generating graphs that accurately reflect specific properties. Existing methods often fall short in efficiently addressing this need as they struggle with the high-dimensional complexity and varied nature of graph properties. In this paper, we introduce the Neural Graph Generator (NGG), a novel approach which utilizes conditioned latent diffusion models for graph generation. NGG demonstrates a remarkable capacity to model complex graph patterns, offering control over the graph generation process. NGG employs a variational graph autoencoder for graph compression and a diffusion process in the latent vector space, guided by vectors summarizing graph statistics. We demonstrate NGG's versatility across various graph generation tasks, showing its capability to capture desired graph properties and generalize to unseen graphs. This work signifies a significant shift in graph generation methodologies, offering a more practical and efficient solution for generating diverse types of graphs with specific characteristics.  ( 2 min )
    Applying Self-supervised Learning to Network Intrusion Detection for Network Flows with Graph Neural Network
    arXiv:2403.01501v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have garnered intensive attention for Network Intrusion Detection System (NIDS) due to their suitability for representing the network traffic flows. However, most present GNN-based methods for NIDS are supervised or semi-supervised. Network flows need to be manually annotated as supervisory labels, a process that is time-consuming or even impossible, making NIDS difficult to adapt to potentially complex attacks, especially in large-scale real-world scenarios. The existing GNN-based self-supervised methods focus on the binary classification of network flow as benign or not, and thus fail to reveal the types of attack in practice. This paper studies the application of GNNs to identify the specific types of network flows in an unsupervised manner. We first design an encoder to obtain graph embedding, that introduces the graph attention mechanism and considers the edge information as the only essential factor. Then, a self-supervised method based on graph contrastive learning is proposed. The method samples center nodes, and for each center node, generates subgraph by it and its direct neighbor nodes, and corresponding contrastive subgraph from the interpolated graph, and finally constructs positive and negative samples from subgraphs. Furthermore, a structured contrastive loss function based on edge features and graph local topology is introduced. To the best of our knowledge, it is the first GNN-based self-supervised method for the multiclass classification of network flows in NIDS. Detailed experiments conducted on four real-world databases (NF-Bot-IoT, NF-Bot-IoT-v2, NF-CSE-CIC-IDS2018, and NF-CSE-CIC-IDS2018-v2) systematically compare our model with the state-of-the-art supervised and self-supervised models, illustrating the considerable potential of our method. Our code is accessible through https://github.com/renj-xu/NEGSC.  ( 3 min )
    Machine learning predicts long-term mortality after acute myocardial infarction using systolic time intervals and routinely collected clinical data
    arXiv:2403.01533v1 Announce Type: new Abstract: Precise estimation of cardiac patients' current and future comorbidities is an important factor in prioritizing continuous physiological monitoring and new therapies. ML models have shown satisfactory performance in short-term mortality prediction of patients with heart disease, while their utility in long-term predictions is limited. This study aims to investigate the performance of tree-based ML models on long-term mortality prediction and the effect of two recently introduced biomarkers on long-term mortality. This study utilized publicly available data from CCHIA at the Ministry of Health and Welfare, Taiwan, China. Medical records were used to gather demographic and clinical data, including age, gender, BMI, percutaneous coronary intervention (PCI) status, and comorbidities such as hypertension, dyslipidemia, ST-segment elevation myocardial infarction (STEMI), and non-STEMI. Using medical and demographic records as well as two recently introduced biomarkers, brachial pre-ejection period (bPEP) and brachial ejection time (bET), collected from 139 patients with acute myocardial infarction, we investigated the performance of advanced ensemble tree-based ML algorithms (random forest, AdaBoost, and XGBoost) to predict all-cause mortality within 14 years. The developed ML models achieved significantly better performance compared to the baseline LR (C-Statistic, 0.80 for random forest, 0.79 for AdaBoost, and 0.78 for XGBoost, vs 0.77 for LR) (P-RF<0.001, PAdaBoost<0.001, PXGBoost<0.05). Adding bPEP and bET to our feature set significantly improved the algorithms' performance, leading to an absolute increase in C-Statistic of up to 0.03 (C-Statistic, 0.83 for random forest, 0.82 for AdaBoost, and 0.80 for XGBoost, vs 0.74 for LR) (P-RF<0.001, PAdaBoost<0.001, PXGBoost<0.05). This advancement may enable better treatment prioritization for high-risk individuals.  ( 3 min )
    Normalising Flow-based Differentiable Particle Filters
    arXiv:2403.01499v1 Announce Type: new Abstract: Recently, there has been a surge of interest in incorporating neural networks into particle filters, e.g. differentiable particle filters, to perform joint sequential state estimation and model learning for non-linear non-Gaussian state-space models in complex environments. Existing differentiable particle filters are mostly constructed with vanilla neural networks that do not allow density estimation. As a result, they are either restricted to a bootstrap particle filtering framework or employ predefined distribution families (e.g. Gaussian distributions), limiting their performance in more complex real-world scenarios. In this paper we present a differentiable particle filtering framework that uses (conditional) normalising flows to build its dynamic model, proposal distribution, and measurement model. This not only enables valid probability densities but also allows the proposed method to adaptively learn these modules in a flexible way, without being restricted to predefined distribution families. We derive the theoretical properties of the proposed filters and evaluate the proposed normalising flow-based differentiable particle filters' performance through a series of numerical experiments.  ( 2 min )
    ConvTimeNet: A Deep Hierarchical Fully Convolutional Model for Multivariate Time Series Analysis
    arXiv:2403.01493v1 Announce Type: new Abstract: This paper introduces ConvTimeNet, a novel deep hierarchical fully convolutional network designed to serve as a general-purpose model for time series analysis. The key design of this network is twofold, designed to overcome the limitations of traditional convolutional networks. Firstly, we propose an adaptive segmentation of time series into sub-series level patches, treating these as fundamental modeling units. This setting avoids the sparsity semantics associated with raw point-level time steps. Secondly, we design a fully convolutional block by skillfully integrating deepwise and pointwise convolution operations, following the advanced building block style employed in Transformer encoders. This backbone network allows for the effective capture of both global sequence and cross-variable dependence, as it not only incorporates the advancements of Transformer architecture but also inherits the inherent properties of convolution. Furthermore, multi-scale representations of given time series instances can be learned by controlling the kernel size flexibly. Extensive experiments are conducted on both time series forecasting and classification tasks. The results consistently outperformed strong baselines in most situations in terms of effectiveness.The code is publicly available.  ( 2 min )
    Preserving correlations: A statistical method for generating synthetic data
    arXiv:2403.01471v1 Announce Type: new Abstract: We propose a method to generate statistically representative synthetic data. The main goal is to be able to maintain in the synthetic dataset the correlations of the features present in the original one, while offering a comfortable privacy level that can be eventually tailored on specific customer demands. We describe in detail our algorithm used both for the analysis of the original dataset and for the generation of the synthetic data points. The approach is tested using a large energy-related dataset. We obtain good results both qualitatively (e.g. via vizualizing correlation maps) and quantitatively (in terms of suitable $\ell^1$-type error norms used as evaluation metrics). The proposed methodology is general in the sense that it does not rely on the used test dataset. We expect it to be applicable in a much broader context than indicated here.  ( 2 min )
    Representation Learning on Heterophilic Graph with Directional Neighborhood Attention
    arXiv:2403.01475v1 Announce Type: new Abstract: Graph Attention Network (GAT) is one of the most popular Graph Neural Network (GNN) architecture, which employs the attention mechanism to learn edge weights and has demonstrated promising performance in various applications. However, since it only incorporates information from immediate neighborhood, it lacks the ability to capture long-range and global graph information, leading to unsatisfactory performance on some datasets, particularly on heterophilic graphs. To address this limitation, we propose the Directional Graph Attention Network (DGAT) in this paper. DGAT is able to combine the feature-based attention with the global directional information extracted from the graph topology. To this end, a new class of Laplacian matrices is proposed which can provably reduce the diffusion distance between nodes. Based on the new Laplacian, topology-guided neighbour pruning and edge adding mechanisms are proposed to remove the noisy and capture the helpful long-range neighborhood information. Besides, a global directional attention is designed to enable a topological-aware information propagation. The superiority of the proposed DGAT over the baseline GAT has also been verified through experiments on real-world benchmarks and synthetic data sets. It also outperforms the state-of-the-art (SOTA) models on 6 out of 7 real-world benchmark datasets.  ( 2 min )
    Collaborate to Adapt: Source-Free Graph Domain Adaptation via Bi-directional Adaptation
    arXiv:2403.01467v1 Announce Type: new Abstract: Unsupervised Graph Domain Adaptation (UGDA) has emerged as a practical solution to transfer knowledge from a label-rich source graph to a completely unlabelled target graph. However, most methods require a labelled source graph to provide supervision signals, which might not be accessible in the real-world settings due to regulations and privacy concerns. In this paper, we explore the scenario of source-free unsupervised graph domain adaptation, which tries to address the domain adaptation problem without accessing the labelled source graph. Specifically, we present a novel paradigm called GraphCTA, which performs model adaptation and graph adaptation collaboratively through a series of procedures: (1) conduct model adaptation based on node's neighborhood predictions in target graph considering both local and global information; (2) perform graph adaptation by updating graph structure and node attributes via neighborhood contrastive learning; and (3) the updated graph serves as an input to facilitate the subsequent iteration of model adaptation, thereby establishing a collaborative loop between model adaptation and graph adaptation. Comprehensive experiments are conducted on various public datasets. The experimental results demonstrate that our proposed model outperforms recent source-free baselines by large margins.  ( 2 min )
    One-Step Multi-View Clustering Based on Transition Probability
    arXiv:2403.01460v1 Announce Type: new Abstract: The large-scale multi-view clustering algorithms, based on the anchor graph, have shown promising performance and efficiency and have been extensively explored in recent years. Despite their successes, current methods lack interpretability in the clustering process and do not sufficiently consider the complementary information across different views. To address these shortcomings, we introduce the One-Step Multi-View Clustering Based on Transition Probability (OSMVC-TP). This method adopts a probabilistic approach, which leverages the anchor graph, representing the transition probabilities from samples to anchor points. Our method directly learns the transition probabilities from anchor points to categories, and calculates the transition probabilities from samples to categories, thus obtaining soft label matrices for samples and anchor points, enhancing the interpretability of clustering. Furthermore, to maintain consistency in labels across different views, we apply a Schatten p-norm constraint on the tensor composed of the soft labels. This approach effectively harnesses the complementary information among the views. Extensive experiments have confirmed the effectiveness and robustness of OSMVC-TP.  ( 2 min )
    On Diffusion Process in SE(3)-invariant Space
    arXiv:2403.01430v1 Announce Type: new Abstract: Sampling viable 3D structures (e.g., molecules and point clouds) with SE(3)-invariance using diffusion-based models proved promising in a variety of real-world applications, wherein SE(3)-invariant properties can be naturally characterized by the inter-point distance manifold. However, due to the non-trivial geometry, we still lack a comprehensive understanding of the diffusion mechanism within such SE(3)-invariant space. This study addresses this gap by mathematically delineating the diffusion mechanism under SE(3)-invariance, via zooming into the interaction behavior between coordinates and the inter-point distance manifold through the lens of differential geometry. Upon this analysis, we propose accurate and projection-free diffusion SDE and ODE accordingly. Such formulations enable enhancing the performance and the speed of generation pathways; meanwhile offering valuable insights into other systems incorporating SE(3)-invariance.  ( 2 min )
    Privacy-Preserving Collaborative Split Learning Framework for Smart Grid Load Forecasting
    arXiv:2403.01438v1 Announce Type: new Abstract: Accurate load forecasting is crucial for energy management, infrastructure planning, and demand-supply balancing. Smart meter data availability has led to the demand for sensor-based load forecasting. Conventional ML allows training a single global model using data from multiple smart meters requiring data transfer to a central server, raising concerns for network requirements, privacy, and security. We propose a split learning-based framework for load forecasting to alleviate this issue. We split a deep neural network model into two parts, one for each Grid Station (GS) responsible for an entire neighbourhood's smart meters and the other for the Service Provider (SP). Instead of sharing their data, client smart meters use their respective GSs' model split for forward pass and only share their activations with the GS. Under this framework, each GS is responsible for training a personalized model split for their respective neighbourhoods, whereas the SP can train a single global or personalized model for each GS. Experiments show that the proposed models match or exceed a centrally trained model's performance and generalize well. Privacy is analyzed by assessing information leakage between data and shared activations of the GS model split. Additionally, differential privacy enhances local data privacy while examining its impact on performance. A transformer model is used as our base learner.  ( 2 min )
    Introduction to Algogens
    arXiv:2403.01426v1 Announce Type: new Abstract: This book introduces the concept of Algogens, a promising integration of generative AI with traditional algorithms aimed at improving problem-solving techniques across various fields. It provides an accessible overview of how Algogens combine AI's innovative potential with algorithms' reliability to tackle complex challenges more effectively than either could alone. The text explores the basics of Algogens, their development, applications, and advantages, such as better adaptability and efficiency. Through examples and case studies, readers will learn about Algogens' practical uses today and their potential for future cybersecurity, healthcare, and environmental science innovation. Acknowledging new technologies' challenges and ethical considerations, the book offers a balanced look at the prospects and obstacles facing Algogens. It invites a broad audience, including experts and newcomers, to engage with the topic and consider Algogens' role in advancing our problem-solving capabilities. This work is presented as a starting point for anyone interested in the intersection of AI and algorithms, encouraging further exploration and discussion on this emerging field. It aims to spark curiosity and contribute to the ongoing conversation about how technology can evolve to meet the complex demands of the AI era.  ( 2 min )
    The Implicit Bias of Heterogeneity towards Invariance and Causality
    arXiv:2403.01420v1 Announce Type: new Abstract: It is observed empirically that the large language models (LLM), trained with a variant of regression loss using numerous corpus from the Internet, can unveil causal associations to some extent. This is contrary to the traditional wisdom that ``association is not causation'' and the paradigm of traditional causal inference in which prior causal knowledge should be carefully incorporated into the design of methods. It is a mystery why causality, in a higher layer of understanding, can emerge from the regression task that pursues associations. In this paper, we claim the emergence of causality from association-oriented training can be attributed to the coupling effects from the heterogeneity of the source data, stochasticity of training algorithms, and over-parameterization of the learning models. We illustrate such an intuition using a simple but insightful model that learns invariance, a quasi-causality, using regression loss. To be specific, we consider multi-environment low-rank matrix sensing problems where the unknown r-rank ground-truth d*d matrices diverge across the environments but contain a lower-rank invariant, causal part. In this case, running pooled gradient descent will result in biased solutions that only learn associations in general. We show that running large-batch Stochastic Gradient Descent, whose each batch being linear measurement samples randomly selected from a certain environment, can successfully drive the solution towards the invariant, causal solution under certain conditions. This step is related to the relatively strong heterogeneity of the environments, the large step size and noises in the optimization algorithm, and the over-parameterization of the model. In summary, we unveil another implicit bias that is a result of the symbiosis between the heterogeneity of data and modern algorithms, which is, to the best of our knowledge, first in the literature.  ( 3 min )
    Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks
    arXiv:2403.01400v1 Announce Type: new Abstract: Recent years have witnessed the great success of graph pre-training for graph representation learning. With hundreds of graph pre-training tasks proposed, integrating knowledge acquired from multiple pre-training tasks has become a popular research topic. In this paper, we identify two important collaborative processes for this topic: (1) select: how to select an optimal task combination from a given task pool based on their compatibility, and (2) weigh: how to weigh the selected tasks based on their importance. While there currently has been a lot of work focused on weighing, comparatively little effort has been devoted to selecting. This paper proposes a novel instance-level framework for integrating multiple graph pre-training tasks, Weigh And Select (WAS), where the two collaborative processes, weighing and selecting, are combined by decoupled siamese networks. Specifically, it first adaptively learns an optimal combination of tasks for each instance from a given task pool, based on which a customized instance-level task weighing strategy is learned. Extensive experiments on 16 graph datasets across node-level and graph-level downstream tasks have demonstrated that by combining a few simple but classical tasks, WAS can achieve comparable performance to other leading counterparts. The code is available at https://github.com/TianyuFan0504/WAS.  ( 2 min )
    Asyn2F: An Asynchronous Federated Learning Framework with Bidirectional Model Aggregation
    arXiv:2403.01417v1 Announce Type: new Abstract: In federated learning, the models can be trained synchronously or asynchronously. Many research works have focused on developing an aggregation method for the server to aggregate multiple local models into the global model with improved performance. They ignore the heterogeneity of the training workers, which causes the delay in the training of the local models, leading to the obsolete information issue. In this paper, we design and develop Asyn2F, an Asynchronous Federated learning Framework with bidirectional model aggregation. By bidirectional model aggregation, Asyn2F, on one hand, allows the server to asynchronously aggregate multiple local models and results in a new global model. On the other hand, it allows the training workers to aggregate the new version of the global model into the local model, which is being trained even in the middle of a training epoch. We develop Asyn2F considering the practical implementation requirements such as using cloud services for model storage and message queuing protocols for communications. Extensive experiments with different datasets show that the models trained by Asyn2F achieve higher performance compared to the state-of-the-art techniques. The experiments also demonstrate the effectiveness, practicality, and scalability of Asyn2F, making it ready for deployment in real scenarios.  ( 2 min )
    Fusion of Gaussian Processes Predictions with Monte Carlo Sampling
    arXiv:2403.01389v1 Announce Type: new Abstract: In science and engineering, we often work with models designed for accurate prediction of variables of interest. Recognizing that these models are approximations of reality, it becomes desirable to apply multiple models to the same data and integrate their outcomes. In this paper, we operate within the Bayesian paradigm, relying on Gaussian processes as our models. These models generate predictive probability density functions (pdfs), and the objective is to integrate them systematically, employing both linear and log-linear pooling. We introduce novel approaches for log-linear pooling, determining input-dependent weights for the predictive pdfs of the Gaussian processes. The aggregation of the pdfs is realized through Monte Carlo sampling, drawing samples of weights from their posterior. The performance of these methods, as well as those based on linear pooling, is demonstrated using a synthetic dataset.  ( 2 min )
    A Comprehensive Survey of Federated Transfer Learning: Challenges, Methods and Applications
    arXiv:2403.01387v1 Announce Type: new Abstract: Federated learning (FL) is a novel distributed machine learning paradigm that enables participants to collaboratively train a centralized model with privacy preservation by eliminating the requirement of data sharing. In practice, FL often involves multiple participants and requires the third party to aggregate global information to guide the update of the target participant. Therefore, many FL methods do not work well due to the training and test data of each participant may not be sampled from the same feature space and the same underlying distribution. Meanwhile, the differences in their local devices (system heterogeneity), the continuous influx of online data (incremental data), and labeled data scarcity may further influence the performance of these methods. To solve this problem, federated transfer learning (FTL), which integrates transfer learning (TL) into FL, has attracted the attention of numerous researchers. However, since FL enables a continuous share of knowledge among participants with each communication round while not allowing local data to be accessed by other participants, FTL faces many unique challenges that are not present in TL. In this survey, we focus on categorizing and reviewing the current progress on federated transfer learning, and outlining corresponding solutions and applications. Furthermore, the common setting of FTL scenarios, available datasets, and significant related research are summarized in this survey.  ( 2 min )
    Bandit Profit-maximization for Targeted Marketing
    arXiv:2403.01361v1 Announce Type: new Abstract: We study a sequential profit-maximization problem, optimizing for both price and ancillary variables like marketing expenditures. Specifically, we aim to maximize profit over an arbitrary sequence of multiple demand curves, each dependent on a distinct ancillary variable, but sharing the same price. A prototypical example is targeted marketing, where a firm (seller) wishes to sell a product over multiple markets. The firm may invest different marketing expenditures for different markets to optimize customer acquisition, but must maintain the same price across all markets. Moreover, markets may have heterogeneous demand curves, each responding to prices and marketing expenditures differently. The firm's objective is to maximize its gross profit, the total revenue minus marketing costs. Our results are near-optimal algorithms for this class of problems in an adversarial bandit setting, where demand curves are arbitrary non-adaptive sequences, and the firm observes only noisy evaluations of chosen points on the demand curves. We prove a regret upper bound of $\widetilde{\mathcal{O}}\big(nT^{3/4}\big)$ and a lower bound of $\Omega\big((nT)^{3/4}\big)$ for monotonic demand curves, and a regret bound of $\widetilde{\Theta}\big(nT^{2/3}\big)$ for demands curves that are monotonic in price and concave in the ancillary variables.  ( 2 min )
    On the Compressibility of Quantized Large Language Models
    arXiv:2403.01384v1 Announce Type: new Abstract: Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirement of LLMs. Quantization is an effective way of reducing the model size while maintaining good performance. However, even after quantization, LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference. In this case, the I/O latency of model loading becomes the bottleneck of the LLM inference latency. In this work, we take a preliminary step of studying applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices. In particular, we discussed the compressibility of quantized LLMs, the trade-off between the compressibility and performance of quantized LLMs, and opportunities to optimize both of them jointly.  ( 2 min )
    Improving Uncertainty Sampling with Bell Curve Weight Function
    arXiv:2403.01352v1 Announce Type: new Abstract: Typically, a supervised learning model is trained using passive learning by randomly selecting unlabelled instances to annotate. This approach is effective for learning a model, but can be costly in cases where acquiring labelled instances is expensive. For example, it can be time-consuming to manually identify spam mails (labelled instances) from thousands of emails (unlabelled instances) flooding an inbox during initial data collection. Generally, we answer the above scenario with uncertainty sampling, an active learning method that improves the efficiency of supervised learning by using fewer labelled instances than passive learning. Given an unlabelled data pool, uncertainty sampling queries the labels of instances where the predicted probabilities, p, fall into the uncertainty region, i.e., $p \approx 0.5$. The newly acquired labels are then added to the existing labelled data pool to learn a new model. Nonetheless, the performance of uncertainty sampling is susceptible to the area of unpredictable responses (AUR) and the nature of the dataset. It is difficult to determine whether to use passive learning or uncertainty sampling without prior knowledge of a new dataset. To address this issue, we propose bell curve sampling, which employs a bell curve weight function to acquire new labels. With the bell curve centred at p=0.5, bell curve sampling selects instances whose predicted values are in the uncertainty area most of the time without neglecting the rest. Simulation results show that, most of the time bell curve sampling outperforms uncertainty sampling and passive learning in datasets of different natures and with AUR.  ( 3 min )
    SANGRIA: Stacked Autoencoder Neural Networks with Gradient Boosting for Indoor Localization
    arXiv:2403.01348v1 Announce Type: new Abstract: Indoor localization is a critical task in many embedded applications, such as asset tracking, emergency response, and realtime navigation. In this article, we propose a novel fingerprintingbased framework for indoor localization called SANGRIA that uses stacked autoencoder neural networks with gradient boosted trees. Our approach is designed to overcome the device heterogeneity challenge that can create uncertainty in wireless signal measurements across embedded devices used for localization. We compare SANGRIA to several state-of-the-art frameworks and demonstrate 42.96% lower average localization error across diverse indoor locales and heterogeneous devices.  ( 2 min )
    Mitigating the Bias in the Model for Continual Test-Time Adaptation
    arXiv:2403.01344v1 Announce Type: new Abstract: Continual Test-Time Adaptation (CTA) is a challenging task that aims to adapt a source pre-trained model to continually changing target domains. In the CTA setting, a model does not know when the target domain changes, thus facing a drastic change in the distribution of streaming inputs during the test-time. The key challenge is to keep adapting the model to the continually changing target domains in an online manner. We find that a model shows highly biased predictions as it constantly adapts to the chaining distribution of the target data. It predicts certain classes more often than other classes, making inaccurate over-confident predictions. This paper mitigates this issue to improve performance in the CTA scenario. To alleviate the bias issue, we make class-wise exponential moving average target prototypes with reliable target samples and exploit them to cluster the target features class-wisely. Moreover, we aim to align the target distributions to the source distribution by anchoring the target feature to its corresponding source prototype. With extensive experiments, our proposed method achieves noteworthy performance gain when applied on top of existing CTA methods without substantial adaptation time overhead.  ( 2 min )
    Improve Cost Efficiency of Active Learning over Noisy Dataset
    arXiv:2403.01346v1 Announce Type: new Abstract: Active learning is a learning strategy whereby the machine learning algorithm actively identifies and labels data points to optimize its learning. This strategy is particularly effective in domains where an abundance of unlabeled data exists, but the cost of labeling these data points is prohibitively expensive. In this paper, we consider cases of binary classification, where acquiring a positive instance incurs a significantly higher cost compared to that of negative instances. For example, in the financial industry, such as in money-lending businesses, a defaulted loan constitutes a positive event leading to substantial financial loss. To address this issue, we propose a shifted normal distribution sampling function that samples from a wider range than typical uncertainty sampling. Our simulation underscores that our proposed sampling function limits both noisy and positive label selection, delivering between 20% and 32% improved cost efficiency over different test datasets.  ( 2 min )
    Bespoke Non-Stationary Solvers for Fast Sampling of Diffusion and Flow Models
    arXiv:2403.01329v1 Announce Type: new Abstract: This paper introduces Bespoke Non-Stationary (BNS) Solvers, a solver distillation approach to improve sample efficiency of Diffusion and Flow models. BNS solvers are based on a family of non-stationary solvers that provably subsumes existing numerical ODE solvers and consequently demonstrate considerable improvement in sample approximation (PSNR) over these baselines. Compared to model distillation, BNS solvers benefit from a tiny parameter space ($<$200 parameters), fast optimization (two orders of magnitude faster), maintain diversity of samples, and in contrast to previous solver distillation approaches nearly close the gap from standard distillation methods such as Progressive Distillation in the low-medium NFE regime. For example, BNS solver achieves 45 PSNR / 1.76 FID using 16 NFE in class-conditional ImageNet-64. We experimented with BNS solvers for conditional image generation, text-to-image generation, and text-2-audio generation showing significant improvement in sample approximation (PSNR) in all.  ( 2 min )
    Uniform $\mathcal{C}^k$ Approximation of $G$-Invariant and Antisymmetric Functions, Embedding Dimensions, and Polynomial Representations
    arXiv:2403.01339v1 Announce Type: new Abstract: For any subgroup $G$ of the symmetric group $\mathcal{S}_n$ on $n$ symbols, we present results for the uniform $\mathcal{C}^k$ approximation of $G$-invariant functions by $G$-invariant polynomials. For the case of totally symmetric functions ($G = \mathcal{S}_n$), we show that this gives rise to the sum-decomposition Deep Sets ansatz of Zaheer et al. (2018), where both the inner and outer functions can be chosen to be smooth, and moreover, the inner function can be chosen to be independent of the target function being approximated. In particular, we show that the embedding dimension required is independent of the regularity of the target function, the accuracy of the desired approximation, as well as $k$. Next, we show that a similar procedure allows us to obtain a uniform $\mathcal{C}^k$ approximation of antisymmetric functions as a sum of $K$ terms, where each term is a product of a smooth totally symmetric function and a smooth antisymmetric homogeneous polynomial of degree at most $\binom{n}{2}$. We also provide upper and lower bounds on $K$ and show that $K$ is independent of the regularity of the target function, the desired approximation accuracy, and $k$.  ( 2 min )
    Less is More: Hop-Wise Graph Attention for Scalable and Generalizable Learning on Circuits
    arXiv:2403.01317v1 Announce Type: new Abstract: While graph neural networks (GNNs) have gained popularity for learning circuit representations in various electronic design automation (EDA) tasks, they face challenges in scalability when applied to large graphs and exhibit limited generalizability to new designs. These limitations make them less practical for addressing large-scale, complex circuit problems. In this work we propose HOGA, a novel attention-based model for learning circuit representations in a scalable and generalizable manner. HOGA first computes hop-wise features per node prior to model training. Subsequently, the hop-wise features are solely used to produce node representations through a gated self-attention module, which adaptively learns important features among different hops without involving the graph topology. As a result, HOGA is adaptive to various structures across different circuits and can be efficiently trained in a distributed manner. To demonstrate the efficacy of HOGA, we consider two representative EDA tasks: quality of results (QoR) prediction and functional reasoning. Our experimental results indicate that (1) HOGA reduces estimation error over conventional GNNs by 46.76% for predicting QoR after logic synthesis; (2) HOGA improves 10.0% reasoning accuracy over GNNs for identifying functional blocks on unseen gate-level netlists after complex technology mapping; (3) The training time for HOGA almost linearly decreases with an increase in computing resources.  ( 3 min )
    Near-optimal Per-Action Regret Bounds for Sleeping Bandits
    arXiv:2403.01315v1 Announce Type: new Abstract: We derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts' confidences. We prove a lower bound, showing that for any minimax optimal algorithms, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.  ( 2 min )
    ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation
    arXiv:2403.01306v1 Announce Type: new Abstract: Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings.  ( 2 min )
    NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
    arXiv:2403.01273v1 Announce Type: new Abstract: Large language model inference on Central Processing Units (CPU) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.  ( 2 min )
    Defending Against Data Reconstruction Attacks in Federated Learning: An Information Theory Approach
    arXiv:2403.01268v1 Announce Type: new Abstract: Federated Learning (FL) trains a black-box and high-dimensional model among different clients by exchanging parameters instead of direct data sharing, which mitigates the privacy leak incurred by machine learning. However, FL still suffers from membership inference attacks (MIA) or data reconstruction attacks (DRA). In particular, an attacker can extract the information from local datasets by constructing DRA, which cannot be effectively throttled by existing techniques, e.g., Differential Privacy (DP). In this paper, we aim to ensure a strong privacy guarantee for FL under DRA. We prove that reconstruction errors under DRA are constrained by the information acquired by an attacker, which means that constraining the transmitted information can effectively throttle DRA. To quantify the information leakage incurred by FL, we establish a channel model, which depends on the upper bound of joint mutual information between the local dataset and multiple transmitted parameters. Moreover, the channel model indicates that the transmitted information can be constrained through data space operation, which can improve training efficiency and the model accuracy under constrained information. According to the channel model, we propose algorithms to constrain the information transmitted in a single round of local training. With a limited number of training rounds, the algorithms ensure that the total amount of transmitted information is limited. Furthermore, our channel model can be applied to various privacy-enhancing techniques (such as DP) to enhance privacy guarantees against DRA. Extensive experiments with real-world datasets validate the effectiveness of our methods.  ( 3 min )
    Can a Confident Prior Replace a Cold Posterior?
    arXiv:2403.01272v1 Announce Type: new Abstract: Benchmark datasets used for image classification tend to have very low levels of label noise. When Bayesian neural networks are trained on these datasets, they often underfit, misrepresenting the aleatoric uncertainty of the data. A common solution is to cool the posterior, which improves fit to the training data but is challenging to interpret from a Bayesian perspective. We explore whether posterior tempering can be replaced by a confidence-inducing prior distribution. First, we introduce a "DirClip" prior that is practical to sample and nearly matches the performance of a cold posterior. Second, we introduce a "confidence prior" that directly approximates a cold likelihood in the limit of decreasing temperature but cannot be easily sampled. Lastly, we provide several general insights into confidence-inducing priors, such as when they might diverge and how fine-tuning can mitigate numerical instability.  ( 2 min )
    Dissecting Language Models: Machine Unlearning via Selective Pruning
    arXiv:2403.01267v1 Announce Type: new Abstract: Understanding and shaping the behaviour of Large Language Models (LLMs) is increasingly important as applications become more powerful and more frequently adopted. This paper introduces a machine unlearning method specifically designed for LLMs. We introduce a selective pruning method for LLMs that removes neurons based on their relative importance on a targeted capability compared to overall network performance. This approach is a compute- and data-efficient method for identifying and removing neurons that enable specific behaviours. Our findings reveal that both feed-forward and attention neurons in LLMs are specialized; that is, for specific tasks, certain neurons are more crucial than others.  ( 2 min )
    AcME-AD: Accelerated Model Explanations for Anomaly Detection
    arXiv:2403.01245v1 Announce Type: new Abstract: Pursuing fast and robust interpretability in Anomaly Detection is crucial, especially due to its significance in practical applications. Traditional Anomaly Detection methods excel in outlier identification but are often black-boxes, providing scant insights into their decision-making process. This lack of transparency compromises their reliability and hampers their adoption in scenarios where comprehending the reasons behind anomaly detection is vital. At the same time, getting explanations quickly is paramount in practical scenarios. To bridge this gap, we present AcME-AD, a novel approach rooted in Explainable Artificial Intelligence principles, designed to clarify Anomaly Detection models for tabular data. AcME-AD transcends the constraints of model-specific or resource-heavy explainability techniques by delivering a model-agnostic, efficient solution for interoperability. It offers local feature importance scores and a what-if analysis tool, shedding light on the factors contributing to each anomaly, thus aiding root cause analysis and decision-making. This paper elucidates AcME-AD's foundation, its benefits over existing methods, and validates its effectiveness with tests on both synthetic and real datasets. AcME-AD's implementation and experiment replication code is accessible in a public repository.  ( 2 min )
    Active Deep Kernel Learning of Molecular Functionalities: Realizing Dynamic Structural Embeddings
    arXiv:2403.01234v1 Announce Type: new Abstract: Exploring molecular spaces is crucial for advancing our understanding of chemical properties and reactions, leading to groundbreaking innovations in materials science, medicine, and energy. This paper explores an approach for active learning in molecular discovery using Deep Kernel Learning (DKL), a novel approach surpassing the limits of classical Variational Autoencoders (VAEs). Employing the QM9 dataset, we contrast DKL with traditional VAEs, which analyze molecular structures based on similarity, revealing limitations due to sparse regularities in latent spaces. DKL, however, offers a more holistic perspective by correlating structure with properties, creating latent spaces that prioritize molecular functionality. This is achieved by recalculating embedding vectors iteratively, aligning with the experimental availability of target properties. The resulting latent spaces are not only better organized but also exhibit unique characteristics such as concentrated maxima representing molecular functionalities and a correlation between predictive uncertainty and error. Additionally, the formation of exclusion regions around certain compounds indicates unexplored areas with potential for groundbreaking functionalities. This study underscores DKL's potential in molecular research, offering new avenues for understanding and discovering molecular functionalities beyond classical VAE limitations.  ( 2 min )
    Augmenting Automation: Intent-Based User Instruction Classification with Machine Learning
    arXiv:2403.01242v1 Announce Type: new Abstract: Electric automation systems offer convenience and efficiency in controlling electrical circuits and devices. Traditionally, these systems rely on predefined commands for control, limiting flexibility and adaptability. In this paper, we propose a novel approach to augment automation by introducing intent-based user instruction classification using machine learning techniques. Our system represents user instructions as intents, allowing for dynamic control of electrical circuits without relying on predefined commands. Through a machine learning model trained on a labeled dataset of user instructions, our system classifies intents from user input, enabling a more intuitive and adaptable control scheme. We present the design and implementation of our intent-based electric automation system, detailing the development of the machine learning model for intent classification. Experimental results demonstrate the effectiveness of our approach in enhancing user experience and expanding the capabilities of electric automation systems. Our work contributes to the advancement of smart technologies by providing a more seamless interaction between users and their environments.  ( 2 min )
    Polynormer: Polynomial-Expressive Graph Transformer in Linear Time
    arXiv:2403.01232v1 Announce Type: new Abstract: Graph transformers (GTs) have emerged as a promising architecture that is theoretically more expressive than message-passing graph neural networks (GNNs). However, typical GT models have at least quadratic complexity and thus cannot scale to large graphs. While there are several linear GTs recently proposed, they still lag behind GNN counterparts on several popular graph datasets, which poses a critical concern on their practical expressivity. To balance the trade-off between expressivity and scalability of GTs, we propose Polynormer, a polynomial-expressive GT model with linear complexity. Polynormer is built upon a novel base model that learns a high-degree polynomial on input features. To enable the base model permutation equivariant, we integrate it with graph topology and node features separately, resulting in local and global equivariant attention models. Consequently, Polynormer adopts a linear local-to-global attention scheme to learn high-degree equivariant polynomials whose coefficients are controlled by attention scores. Polynormer has been evaluated on $13$ homophilic and heterophilic datasets, including large graphs with millions of nodes. Our extensive experiment results show that Polynormer outperforms state-of-the-art GNN and GT baselines on most datasets, even without the use of nonlinear activation functions.  ( 2 min )
    A Two-Stage Algorithm for Cost-Efficient Multi-instance Counterfactual Explanations
    arXiv:2403.01221v1 Announce Type: new Abstract: Counterfactual explanations constitute among the most popular methods for analyzing the predictions of black-box systems since they can recommend cost-efficient and actionable changes to the input to turn an undesired system's output into a desired output. While most of the existing counterfactual methods explain a single instance, several real-world use cases, such as customer satisfaction, require the identification of a single counterfactual that can satisfy multiple instances (e.g. customers) simultaneously. In this work, we propose a flexible two-stage algorithm for finding groups of instances along with cost-efficient multi-instance counterfactual explanations. This is motivated by the fact that in most previous works the aspect of finding such groups is not addressed.  ( 2 min )
    Stochastic gradient descent for streaming linear and rectified linear systems with Massart noise
    arXiv:2403.01204v1 Announce Type: new Abstract: We propose SGD-exp, a stochastic gradient descent approach for linear and ReLU regressions under Massart noise (adversarial semi-random corruption model) for the fully streaming setting. We show novel nearly linear convergence guarantees of SGD-exp to the true parameter with up to $50\%$ Massart corruption rate, and with any corruption rate in the case of symmetric oblivious corruptions. This is the first convergence guarantee result for robust ReLU regression in the streaming setting, and it shows the improved convergence rate over previous robust methods for $L_1$ linear regression due to a choice of an exponentially decaying step size, known for its efficiency in practice. Our analysis is based on the drift analysis of a discrete stochastic process, which could also be interesting on its own.  ( 2 min )
    Inexact Unlearning Needs More Careful Evaluations to Avoid a False Sense of Privacy
    arXiv:2403.01218v1 Announce Type: new Abstract: The high cost of model training makes it increasingly desirable to develop techniques for unlearning. These techniques seek to remove the influence of a training example without having to retrain the model from scratch. Intuitively, once a model has unlearned, an adversary that interacts with the model should no longer be able to tell whether the unlearned example was included in the model's training set or not. In the privacy literature, this is known as membership inference. In this work, we discuss adaptations of Membership Inference Attacks (MIAs) to the setting of unlearning (leading to their ``U-MIA'' counterparts). We propose a categorization of existing U-MIAs into ``population U-MIAs'', where the same attacker is instantiated for all examples, and ``per-example U-MIAs'', where a dedicated attacker is instantiated for each example. We show that the latter category, wherein the attacker tailors its membership prediction to each example under attack, is significantly stronger. Indeed, our results show that the commonly used U-MIAs in the unlearning literature overestimate the privacy protection afforded by existing unlearning techniques on both vision and language models. Our investigation reveals a large variance in the vulnerability of different examples to per-example U-MIAs. In fact, several unlearning algorithms lead to a reduced vulnerability for some, but not all, examples that we wish to unlearn, at the expense of increasing it for other examples. Notably, we find that the privacy protection for the remaining training examples may worsen as a consequence of unlearning. We also discuss the fundamental difficulty of equally protecting all examples using existing unlearning schemes, due to the different rates at which examples are unlearned. We demonstrate that naive attempts at tailoring unlearning stopping criteria to different examples fail to alleviate these issues.  ( 3 min )
    Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment
    arXiv:2403.01203v1 Announce Type: new Abstract: Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs for integration. Unfortunately, prior arts have attempted to improve the interaction and fusion of multi-modal information, which have overlooked the influence of modal-specific noise and the usage of labeled and unlabeled data in semi-supervised settings. In this work, we introduce a Pseudo-label Calibration Multi-modal Entity Alignment (PCMEA) in a semi-supervised way. Specifically, in order to generate holistic entity representations, we first devise various embedding modules and attention mechanisms to extract visual, structural, relational, and attribute features. Different from the prior direct fusion methods, we next propose to exploit mutual information maximization to filter the modal-specific noise and to augment modal-invariant commonality. Then, we combine pseudo-label calibration with momentum-based contrastive learning to make full use of the labeled and unlabeled data, which improves the quality of pseudo-label and pulls aligned entities closer. Finally, extensive experiments on two MMEA datasets demonstrate the effectiveness of our PCMEA, which yields state-of-the-art performance.  ( 2 min )
    Training Unbiased Diffusion Models From Biased Dataset
    arXiv:2403.01189v1 Announce Type: new Abstract: With significant advancements in diffusion models, addressing the potential risks of dataset bias becomes increasingly important. Since generated outputs directly suffer from dataset bias, mitigating latent bias becomes a key factor in improving sample quality and proportion. This paper proposes time-dependent importance reweighting to mitigate the bias for the diffusion models. We demonstrate that the time-dependent density ratio becomes more precise than previous approaches, thereby minimizing error propagation in generative learning. While directly applying it to score-matching is intractable, we discover that using the time-dependent density ratio both for reweighting and score correction can lead to a tractable form of the objective function to regenerate the unbiased data density. Furthermore, we theoretically establish a connection with traditional score-matching, and we demonstrate its convergence to an unbiased distribution. The experimental evidence supports the usefulness of the proposed method, which outperforms baselines including time-independent importance reweighting on CIFAR-10, CIFAR-100, FFHQ, and CelebA with various bias settings. Our code is available at https://github.com/alsdudrla10/TIW-DSM.  ( 2 min )
    LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
    arXiv:2403.01136v1 Announce Type: new Abstract: Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are largely served using uniform high-caliber GPUs nowadays, utilizing a heterogeneous cluster with a mix of available high- and low-capacity GPUs can potentially substantially reduce the serving cost. There is a lack of designs to support efficient LLM serving using a heterogeneous cluster, while the current solutions focus on model partition and uniform compression among homogeneous devices. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. We carefully decide on mixed-precision model quantization together with phase-aware model partition and micro-batch sizing in distributed LLM serving with an efficient algorithm, to greatly enhance inference throughput while fulfilling user-specified model quality targets. Extensive experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) throughput improvement in inference, showing great advantages over state-of-the-art works.  ( 2 min )
    A Hybrid Model for Traffic Incident Detection based on Generative Adversarial Networks and Transformer Model
    arXiv:2403.01147v1 Announce Type: new Abstract: In addition to enhancing traffic safety and facilitating prompt emergency response, traffic incident detection plays an indispensable role in intelligent transportation systems by providing real-time traffic status information. This enables the realization of intelligent traffic control and management. Previous research has identified that apart from employing advanced algorithmic models, the effectiveness of detection is also significantly influenced by challenges related to acquiring large datasets and addressing dataset imbalances. A hybrid model combining transformer and generative adversarial networks (GANs) is proposed to address these challenges. Experiments are conducted on four real datasets to validate the superiority of the transformer in traffic incident detection. Additionally, GANs are utilized to expand the dataset and achieve a balanced ratio of 1:4, 2:3, and 1:1. The proposed model is evaluated against the baseline model. The results demonstrate that the proposed model enhances the dataset size, balances the dataset, and improves the performance of traffic incident detection in various aspects.  ( 2 min )
    Evaluating Large Language Models as Virtual Annotators for Time-series Physical Sensing Data
    arXiv:2403.01133v1 Announce Type: new Abstract: Traditional human-in-the-loop-based annotation for time-series data like inertial data often requires access to alternate modalities like video or audio from the environment. These alternate sources provide the necessary information to the human annotator, as the raw numeric data is often too obfuscated even for an expert. However, this traditional approach has many concerns surrounding overall cost, efficiency, storage of additional modalities, time, scalability, and privacy. Interestingly, recent large language models (LLMs) are also trained with vast amounts of publicly available alphanumeric data, which allows them to comprehend and perform well on tasks beyond natural language processing. Naturally, this opens up a potential avenue to explore LLMs as virtual annotators where the LLMs will be directly provided the raw sensor data for annotation instead of relying on any alternate modality. Naturally, this could mitigate the problems of the traditional human-in-the-loop approach. Motivated by this observation, we perform a detailed study in this paper to assess whether the state-of-the-art (SOTA) LLMs can be used as virtual annotators for labeling time-series physical sensing data. To perform this in a principled manner, we segregate the study into two major phases. In the first phase, we investigate the challenges an LLM like GPT-4 faces in comprehending raw sensor data. Considering the observations from phase 1, in the next phase, we investigate the possibility of encoding the raw sensor data using SOTA SSL approaches and utilizing the projected time-series data to get annotations from the LLM. Detailed evaluation with four benchmark HAR datasets shows that SSL-based encoding and metric-based guidance allow the LLM to make more reasonable decisions and provide accurate annotations without requiring computationally expensive fine-tuning or sophisticated prompt engineering.  ( 3 min )
    MPIPN: A Multi Physics-Informed PointNet for solving parametric acoustic-structure systems
    arXiv:2403.01132v1 Announce Type: new Abstract: Machine learning is employed for solving physical systems governed by general nonlinear partial differential equations (PDEs). However, complex multi-physics systems such as acoustic-structure coupling are often described by a series of PDEs that incorporate variable physical quantities, which are referred to as parametric systems. There are lack of strategies for solving parametric systems governed by PDEs that involve explicit and implicit quantities. In this paper, a deep learning-based Multi Physics-Informed PointNet (MPIPN) is proposed for solving parametric acoustic-structure systems. First, the MPIPN induces an enhanced point-cloud architecture that encompasses explicit physical quantities and geometric features of computational domains. Then, the MPIPN extracts local and global features of the reconstructed point-cloud as parts of solving criteria of parametric systems, respectively. Besides, implicit physical quantities are embedded by encoding techniques as another part of solving criteria. Finally, all solving criteria that characterize parametric systems are amalgamated to form distinctive sequences as the input of the MPIPN, whose outputs are solutions of systems. The proposed framework is trained by adaptive physics-informed loss functions for corresponding computational domains. The framework is generalized to deal with new parametric conditions of systems. The effectiveness of the MPIPN is validated by applying it to solve steady parametric acoustic-structure coupling systems governed by the Helmholtz equations. An ablation experiment has been implemented to demonstrate the efficacy of physics-informed impact with a minority of supervised data. The proposed method yields reasonable precision across all computational domains under constant parametric conditions and changeable combinations of parametric conditions for acoustic-structure systems.  ( 3 min )
    Sensitivity Analysis On Loss Landscape
    arXiv:2403.01128v1 Announce Type: new Abstract: Gradients can be employed for sensitivity analysis. Here, we leverage the advantages of the Loss Landscape to comprehend which independent variables impact the dependent variable. We seek to grasp the loss landscape by utilizing first, second, and third derivatives through automatic differentiation. we know that Spearman's rank correlation coefficient can detect the monotonic relationship between two variables. However, I have found that second-order gradients, with certain configurations and parameters, provide information that can be visualized similarly to Spearman's results.In our approach, we incorporate a loss function with an activation function, resulting in a non-linear pattern. Each exploration of the loss landscape through retraining yields new valuable information. Furthermore, the first and third derivatives are also beneficial, as they indicate the extent to which independent variables influence the dependent variable.  ( 2 min )
    OpenGraph: Towards Open Graph Foundation Models
    arXiv:2403.01121v1 Announce Type: new Abstract: Graph learning has become indispensable for interpreting and harnessing relational data in diverse fields, ranging from recommendation systems to social network analysis. In this context, a variety of GNNs have emerged as promising methodologies for encoding the structural information of graphs. By effectively capturing the graph's underlying structure, these GNNs have shown great potential in enhancing performance in graph learning tasks, such as link prediction and node classification. However, despite their successes, a significant challenge persists: these advanced methods often face difficulties in generalizing to unseen graph data that significantly differs from the training instances. In this work, our aim is to advance the graph learning paradigm by developing a general graph foundation model. This model is designed to understand the complex topological patterns present in diverse graph data, enabling it to excel in zero-shot graph learning tasks across different downstream datasets. To achieve this goal, we address several key technical challenges in our OpenGraph model. Firstly, we propose a unified graph tokenizer to adapt our graph model to generalize well on unseen graph data, even when the underlying graph properties differ significantly from those encountered during training. Secondly, we develop a scalable graph transformer as the foundational encoder, which effectively captures node-wise dependencies within the global topological context. Thirdly, we introduce a data augmentation mechanism enhanced by a LLM to alleviate the limitations of data scarcity in real-world scenarios. Extensive experiments validate the effectiveness of our framework. By adapting our OpenGraph to new graph characteristics and comprehending the nuances of diverse graphs, our approach achieves remarkable zero-shot graph learning performance across various settings and domains.  ( 3 min )
    Feature Alignment: Rethinking Efficient Active Learning via Proxy in the Context of Pre-trained Models
    arXiv:2403.01101v1 Announce Type: new Abstract: Fine-tuning the pre-trained model with active learning holds promise for reducing annotation costs. However, this combination introduces significant computational costs, particularly with the growing scale of pre-trained models. Recent research has proposed proxy-based active learning, which pre-computes features to reduce computational costs. Yet, this approach often incurs a significant loss in active learning performance, which may even outweigh the computational cost savings. In this paper, we argue the performance drop stems not only from pre-computed features' inability to distinguish between categories of labeled samples, resulting in the selection of redundant samples but also from the tendency to compromise valuable pre-trained information when fine-tuning with samples selected through the proxy model. To address this issue, we propose a novel method called aligned selection via proxy to update pre-computed features while selecting a proper training method to inherit valuable pre-training information. Extensive experiments validate that our method significantly improves the total cost of efficient active learning while maintaining computational efficiency.  ( 2 min )
    Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning
    arXiv:2403.01112v1 Announce Type: new Abstract: In cooperative multi-agent reinforcement learning (MARL), agents aim to achieve a common goal, such as defeating enemies or scoring a goal. Existing MARL algorithms are effective but still require significant learning time and often get trapped in local optima by complex tasks, subsequently failing to discover a goal-reaching policy. To address this, we introduce Efficient episodic Memory Utilization (EMU) for MARL, with two primary objectives: (a) accelerating reinforcement learning by leveraging semantically coherent memory from an episodic buffer and (b) selectively promoting desirable transitions to prevent local convergence. To achieve (a), EMU incorporates a trainable encoder/decoder structure alongside MARL, creating coherent memory embeddings that facilitate exploratory memory recall. To achieve (b), EMU introduces a novel reward structure called episodic incentive based on the desirability of states. This reward improves the TD target in Q-learning and acts as an additional incentive for desirable transitions. We provide theoretical support for the proposed incentive and demonstrate the effectiveness of EMU compared to conventional episodic control. The proposed method is evaluated in StarCraft II and Google Research Football, and empirical results indicate further performance improvement over state-of-the-art methods.  ( 2 min )
    Pairwise Alignment Improves Graph Domain Adaptation
    arXiv:2403.01092v1 Announce Type: new Abstract: Graph-based methods, pivotal for label inference over interconnected objects in many real-world applications, often encounter generalization challenges, if the graph used for model training differs significantly from the graph used for testing. This work delves into Graph Domain Adaptation (GDA) to address the unique complexities of distribution shifts over graph data, where interconnected data points experience shifts in features, labels, and in particular, connecting patterns. We propose a novel, theoretically principled method, Pairwise Alignment (Pair-Align) to counter graph structure shift by mitigating conditional structure shift (CSS) and label shift (LS). Pair-Align uses edge weights to recalibrate the influence among neighboring nodes to handle CSS and adjusts the classification loss with label weights to handle LS. Our method demonstrates superior performance in real-world applications, including node classification with region shift in social networks, and the pileup mitigation task in particle colliding experiments. For the first application, we also curate the largest dataset by far for GDA studies. Our method shows strong performance in synthetic and other existing benchmark datasets.  ( 2 min )
    COOL: A Conjoint Perspective on Spatio-Temporal Graph Neural Network for Traffic Forecasting
    arXiv:2403.01091v1 Announce Type: new Abstract: This paper investigates traffic forecasting, which attempts to forecast the future state of traffic based on historical situations. This problem has received ever-increasing attention in various scenarios and facilitated the development of numerous downstream applications such as urban planning and transportation management. However, the efficacy of existing methods remains sub-optimal due to their tendency to model temporal and spatial relationships independently, thereby inadequately accounting for complex high-order interactions of both worlds. Moreover, the diversity of transitional patterns in traffic forecasting makes them challenging to capture for existing approaches, warranting a deeper exploration of their diversity. Toward this end, this paper proposes Conjoint Spatio-Temporal graph neural network (abbreviated as COOL), which models heterogeneous graphs from prior and posterior information to conjointly capture high-order spatio-temporal relationships. On the one hand, heterogeneous graphs connecting sequential observation are constructed to extract composite spatio-temporal relationships via prior message passing. On the other hand, we model dynamic relationships using constructed affinity and penalty graphs, which guide posterior message passing to incorporate complementary semantic information into node representations. Moreover, to capture diverse transitional properties to enhance traffic forecasting, we propose a conjoint self-attention decoder that models diverse temporal patterns from both multi-rank and multi-scale views. Experimental results on four popular benchmark datasets demonstrate that our proposed COOL provides state-of-the-art performance compared with the competitive baselines.  ( 3 min )
    Teaching MLP More Graph Information: A Three-stage Multitask Knowledge Distillation Framework
    arXiv:2403.01079v1 Announce Type: new Abstract: We study the challenging problem for inference tasks on large-scale graph datasets of Graph Neural Networks: huge time and memory consumption, and try to overcome it by reducing reliance on graph structure. Even though distilling graph knowledge to student MLP is an excellent idea, it faces two major problems of positional information loss and low generalization. To solve the problems, we propose a new three-stage multitask distillation framework. In detail, we use Positional Encoding to capture positional information. Also, we introduce Neural Heat Kernels responsible for graph data processing in GNN and utilize hidden layer outputs matching for better performance of student MLP's hidden layers. To the best of our knowledge, it is the first work to include hidden layer distillation for student MLP on graphs and to combine graph Positional Encoding with MLP. We test its performance and robustness with several settings and draw the conclusion that our work can outperform well with good stability.  ( 2 min )
    $\Gamma$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data
    arXiv:2403.01078v1 Announce Type: new Abstract: Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, a process we coin ``$\Gamma$-VAE''. We demonstrate its utility using two example data sets: bulk RNA-seq from the The Cancer Genome Atlas (TCGA) and the Genotype Tissue Expression (GTEx); and single cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of distribution cancers as if they were originally trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells -- which have not yet specialized -- according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.  ( 3 min )
    Continuous Mean-Zero Disagreement-Regularized Imitation Learning (CMZ-DRIL)
    arXiv:2403.01059v1 Announce Type: new Abstract: Machine-learning paradigms such as imitation learning and reinforcement learning can generate highly performant agents in a variety of complex environments. However, commonly used methods require large quantities of data and/or a known reward function. This paper presents a method called Continuous Mean-Zero Disagreement-Regularized Imitation Learning (CMZ-DRIL) that employs a novel reward structure to improve the performance of imitation-learning agents that have access to only a handful of expert demonstrations. CMZ-DRIL uses reinforcement learning to minimize uncertainty among an ensemble of agents trained to model the expert demonstrations. This method does not use any environment-specific rewards, but creates a continuous and mean-zero reward function from the action disagreement of the agent ensemble. As demonstrated in a waypoint-navigation environment and in two MuJoCo environments, CMZ-DRIL can generate performant agents that behave more similarly to the expert than primary previous approaches in several key metrics.  ( 2 min )
    GraphRCG: Self-conditioned Graph Generation via Bootstrapped Representations
    arXiv:2403.01071v1 Announce Type: new Abstract: Graph generation generally aims to create new graphs that closely align with a specific graph distribution. Existing works often implicitly capture this distribution through the optimization of generators, potentially overlooking the intricacies of the distribution itself. Furthermore, these approaches generally neglect the insights offered by the learned distribution for graph generation. In contrast, in this work, we propose a novel self-conditioned graph generation framework designed to explicitly model graph distributions and employ these distributions to guide the generation process. We first perform self-conditioned modeling to capture the graph distributions by transforming each graph sample into a low-dimensional representation and optimizing a representation generator to create new representations reflective of the learned distribution. Subsequently, we leverage these bootstrapped representations as self-conditioned guidance for the generation process, thereby facilitating the generation of graphs that more accurately reflect the learned distributions. We conduct extensive experiments on generic and molecular graph datasets across various fields. Our framework demonstrates superior performance over existing state-of-the-art graph generation methods in terms of graph quality and fidelity to training data.  ( 2 min )
    Seeing Unseen: Discover Novel Biomedical Concepts via GeometryConstrained Probabilistic Modeling
    arXiv:2403.01053v1 Announce Type: new Abstract: Machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven nature. With the ever-increasing stream of research data collection, it would be appealing to autonomously explore patterns and insights from observational data for discovering novel classes of phenotypes and concepts. However, in the biomedical domain, there are several challenges inherently presented in the cumulated data which hamper the progress of novel class discovery. The non-i.i.d. data distribution accompanied by the severe imbalance among different groups of classes essentially leads to ambiguous and biased semantic representations. In this work, we present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. First, we propose to parameterize the approximated posterior of instance embedding as a marginal von MisesFisher distribution to account for the interference of distributional latent bias. Then, we incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space, which in turn minimizes the uncontrollable risk for unknown class learning and structuring. Furthermore, a spectral graph-theoretic method is devised to estimate the number of potential novel classes. It inherits two intriguing merits compared to existent approaches, namely high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method.  ( 3 min )
    A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features
    arXiv:2403.01046v1 Announce Type: new Abstract: We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2-layer networks with piecewise linear activations, deep narrow ReLU networks with up to 4 layers, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly in ReLU networks, a fourth layer creates features that represent reflections of training data about themselves. The Lasso representation sheds insight to globally optimal networks and the solution landscape.  ( 2 min )
    A Case for Validation Buffer in Pessimistic Actor-Critic
    arXiv:2403.01014v1 Announce Type: new Abstract: In this paper, we investigate the issue of error accumulation in critic networks updated via pessimistic temporal difference objectives. We show that the critic approximation error can be approximated via a recursive fixed-point model similar to that of the Bellman value. We use such recursive definition to retrieve the conditions under which the pessimistic critic is unbiased. Building on these insights, we propose Validation Pessimism Learning (VPL) algorithm. VPL uses a small validation buffer to adjust the levels of pessimism throughout the agent training, with the pessimism set such that the approximation error of the critic targets is minimized. We investigate the proposed approach on a variety of locomotion and manipulation tasks and report improvements in sample efficiency and performance.  ( 2 min )
    On the Role of Information Structure in Reinforcement Learning for Partially-Observable Sequential Teams and Games
    arXiv:2403.00993v1 Announce Type: new Abstract: In a sequential decision-making problem, the information structure is the description of how events in the system occurring at different points in time affect each other. Classical models of reinforcement learning (e.g., MDPs, POMDPs, Dec-POMDPs, and POMGs) assume a very simple and highly regular information structure, while more general models like predictive state representations do not explicitly model the information structure. By contrast, real-world sequential decision-making problems typically involve a complex and time-varying interdependence of system variables, requiring a rich and flexible representation of information structure. In this paper, we argue for the perspective that explicit representation of information structures is an important component of analyzing and solving reinforcement learning problems. We propose novel reinforcement learning models with an explicit representation of information structure, capturing classical models as special cases. We show that this leads to a richer analysis of sequential decision-making problems and enables more tailored algorithm design. In particular, we characterize the "complexity" of the observable dynamics of any sequential decision-making problem through a graph-theoretic analysis of the DAG representation of its information structure. The central quantity in this analysis is the minimal set of variables that $d$-separates the past observations from future observations. Furthermore, through constructing a generalization of predictive state representations, we propose tailored reinforcement learning algorithms and prove that the sample complexity is in part determined by the information structure. This recovers known tractability results and gives a novel perspective on reinforcement learning in general sequential decision-making problems, providing a systematic way of identifying new tractable classes of problems.  ( 3 min )
    Distributional Dataset Distillation with Subtask Decomposition
    arXiv:2403.00999v1 Announce Type: new Abstract: What does a neural network learn when training from a task-specific dataset? Synthesizing this knowledge is the central idea behind Dataset Distillation, which recent work has shown can be used to compress large datasets into a small set of input-label pairs ($\textit{prototypes}$) that capture essential aspects of the original dataset. In this paper, we make the key observation that existing methods distilling into explicit prototypes are very often suboptimal, incurring in unexpected storage cost from distilled labels. In response, we propose $\textit{Distributional Dataset Distillation}$ (D3), which encodes the data using minimal sufficient per-class statistics and paired with a decoder, we distill dataset into a compact distributional representation that is more memory-efficient compared to prototype-based methods. To scale up the process of learning these representations, we propose $\textit{Federated distillation}$, which decomposes the dataset into subsets, distills them in parallel using sub-task experts and then re-aggregates them. We thoroughly evaluate our algorithm on a three-dimensional metric and show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K. Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under the storage budget of 2 images per class.  ( 2 min )
    Equipment Health Assessment: Time Series Analysis for Wind Turbine Performance
    arXiv:2403.00975v1 Announce Type: new Abstract: In this study, we leverage SCADA data from diverse wind turbines to predict power output, employing advanced time series methods, specifically Functional Neural Networks (FNN) and Long Short-Term Memory (LSTM) networks. A key innovation lies in the ensemble of FNN and LSTM models, capitalizing on their collective learning. This ensemble approach outperforms individual models, ensuring stable and accurate power output predictions. Additionally, machine learning techniques are applied to detect wind turbine performance deterioration, enabling proactive maintenance strategies and health assessment. Crucially, our analysis reveals the uniqueness of each wind turbine, necessitating tailored models for optimal predictions. These insight underscores the importance of providing automatized customization for different turbines to keep human modeling effort low. Importantly, the methodologies developed in this analysis are not limited to wind turbines; they can be extended to predict and optimize performance in various machinery, highlighting the versatility and applicability of our research across diverse industrial contexts.  ( 2 min )
    Motif distribution and function of sparse deep neural networks
    arXiv:2403.00974v1 Announce Type: new Abstract: We characterize the connectivity structure of feed-forward, deep neural networks (DNNs) using network motif theory. To address whether a particular motif distribution is characteristic of the training task, or function of the DNN, we compare the connectivity structure of 350 DNNs trained to simulate a bio-mechanical flight control system with different randomly initialized parameters. We develop and implement algorithms for counting second- and third-order motifs and calculate their significance using their Z-score. The DNNs are trained to solve the inverse problem of the flight dynamics model in Bustamante, et al. (2022) (i.e., predict the controls necessary for controlled flight from the initial and final state-space inputs) and are sparsified through an iterative pruning and retraining algorithm Zahn, et al. (2022). We show that, despite random initialization of network parameters, enforced sparsity causes DNNs to converge to similar connectivity patterns as characterized by their motif distributions. The results suggest how neural network function can be encoded in motif distributions, suggesting a variety of experiments for informing function and control.  ( 2 min )
    Fine-tuning with Very Large Dropout
    arXiv:2403.00946v1 Announce Type: new Abstract: It is impossible today to pretend that the practice of machine learning is compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures. This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performances that exceed those of both ensembles and weight averaging methods such as model soups. This result has practical significance because the importance of the fine-tuning scenario has considerably grown in recent years. This result also provides interesting insights on the nature of rich representations and on the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.  ( 2 min )
    Tree-Regularized Tabular Embeddings
    arXiv:2403.00963v1 Announce Type: new Abstract: Tabular neural network (NN) has attracted remarkable attentions and its recent advances have gradually narrowed the performance gap with respect to tree-based models on many public datasets. While the mainstreams focus on calibrating NN to fit tabular data, we emphasize the importance of homogeneous embeddings and alternately concentrate on regularizing tabular inputs through supervised pretraining. Specifically, we extend a recent work (DeepTLF) and utilize the structure of pretrained tree ensembles to transform raw variables into a single vector (T2V), or an array of tokens (T2T). Without loss of space efficiency, these binarized embeddings can be consumed by canonical tabular NN with fully-connected or attention-based building blocks. Through quantitative experiments on 88 OpenML datasets with binary classification task, we validated that the proposed tree-regularized representation not only tapers the difference with respect to tree-based models, but also achieves on-par and better performance when compared with advanced NN models. Most importantly, it possesses better robustness and can be easily scaled and generalized as standalone encoder for tabular modality. Codes: https://github.com/milanlx/tree-regularized-embedding.  ( 2 min )
    Resilience of Entropy Model in Distributed Neural Networks
    arXiv:2403.00942v1 Announce Type: new Abstract: Distributed deep neural networks (DNNs) have emerged as a key technique to reduce communication overhead without sacrificing performance in edge computing systems. Recently, entropy coding has been introduced to further reduce the communication overhead. The key idea is to train the distributed DNN jointly with an entropy model, which is used as side information during inference time to adaptively encode latent representations into bit streams with variable length. To the best of our knowledge, the resilience of entropy models is yet to be investigated. As such, in this paper we formulate and investigate the resilience of entropy models to intentional interference (e.g., adversarial attacks) and unintentional interference (e.g., weather changes and motion blur). Through an extensive experimental campaign with 3 different DNN architectures, 2 entropy models and 4 rate-distortion trade-off factors, we demonstrate that the entropy attacks can increase the communication overhead by up to 95%. By separating compression features in frequency and spatial domain, we propose a new defense mechanism that can reduce the transmission overhead of the attacked input by about 9% compared to unperturbed data, with only about 2% accuracy loss. Importantly, the proposed defense mechanism is a standalone approach which can be applied in conjunction with approaches such as adversarial training to further improve robustness. Code will be shared for reproducibility.  ( 2 min )
    Differentially Private Knowledge Distillation via Synthetic Text Generation
    arXiv:2403.00932v1 Announce Type: new Abstract: Large Language models (LLMs) are achieving state-of-the-art performance in many different downstream tasks. However, the increasing urgency of data privacy requires LLMs to train with Differential Privacy (DP) on private data. Concurrently it is also necessary to compress LLMs for real-life deployments on resource-constrained devices or latency-sensitive applications. Differential privacy and model compression generally must trade off utility loss to achieve their objectives. Moreover, concurrently achieving both can result in even more utility loss. To this end, we propose a novel differentially private knowledge distillation algorithm that exploits synthetic data generated by a differentially private LLM. The knowledge of a teacher model is transferred onto the student in two ways: one way from the synthetic data itself, the hard labels, and the other way by the output distribution of the teacher model evaluated on the synthetic data, the soft labels. Furthermore, if the teacher and student share a similar architectural structure, we can further distill knowledge by exploiting hidden representations. Our results show that our framework substantially improves the utility over existing baselines with strong privacy parameters, {\epsilon} = 2, validating that we can successfully compress autoregressive LLMs while preserving the privacy of training data.  ( 2 min )
    A Regularization-based Transfer Learning Method for Information Extraction via Instructed Graph Decoder
    arXiv:2403.00891v1 Announce Type: new Abstract: Information extraction (IE) aims to extract complex structured information from the text. Numerous datasets have been constructed for various IE tasks, leading to time-consuming and labor-intensive data annotations. Nevertheless, most prevailing methods focus on training task-specific models, while the common knowledge among different IE tasks is not explicitly modeled. Moreover, the same phrase may have inconsistent labels in different tasks, which poses a big challenge for knowledge transfer using a unified model. In this study, we propose a regularization-based transfer learning method for IE (TIE) via an instructed graph decoder. Specifically, we first construct an instruction pool for datasets from all well-known IE tasks, and then present an instructed graph decoder, which decodes various complex structures into a graph uniformly based on corresponding instructions. In this way, the common knowledge shared with existing datasets can be learned and transferred to a new dataset with new labels. Furthermore, to alleviate the label inconsistency problem among various IE tasks, we introduce a task-specific regularization strategy, which does not update the gradients of two tasks with 'opposite direction'. We conduct extensive experiments on 12 datasets spanning four IE tasks, and the results demonstrate the great advantages of our proposed method  ( 2 min )
    Scale-free Adversarial Reinforcement Learning
    arXiv:2403.00930v1 Announce Type: new Abstract: This paper initiates the study of scale-free learning in Markov Decision Processes (MDPs), where the scale of rewards/losses is unknown to the learner. We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this framework in both the adversarial Multi-armed Bandit (MAB) setting and the adversarial MDP setting. Through this framework, we achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs, resolving an open problem raised in \cite{hadiji2023adaptation}. On adversarial MDPs, our framework also give birth to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$ high-probability regret guarantee.  ( 2 min )
    Evaluating and Correcting Performative Effects of Decision Support Systems via Causal Domain Shift
    arXiv:2403.00886v1 Announce Type: new Abstract: When predicting a target variable $Y$ from features $X$, the prediction $\hat{Y}$ can be performative: an agent might act on this prediction, affecting the value of $Y$ that we eventually observe. Performative predictions are deliberately prevalent in algorithmic decision support, where a Decision Support System (DSS) provides a prediction for an agent to affect the value of the target variable. When deploying a DSS in high-stakes settings (e.g. healthcare, law, predictive policing, or child welfare screening) it is imperative to carefully assess the performative effects of the DSS. In the case that the DSS serves as an alarm for a predicted negative outcome, naive retraining of the prediction model is bound to result in a model that underestimates the risk, due to effective workings of the previous model. In this work, we propose to model the deployment of a DSS as causal domain shift and provide novel cross-domain identification results for the conditional expectation $E[Y | X]$, allowing for pre- and post-hoc assessment of the deployment of the DSS, and for retraining of a model that assesses the risk under a baseline policy where the DSS is not deployed. Using a running example, we empirically show that a repeated regression procedure provides a practical framework for estimating these quantities, even when the data is affected by sample selection bias and selective labelling, offering for a practical, unified solution for multiple forms of target variable bias.  ( 3 min )
    FedRDMA: Communication-Efficient Cross-Silo Federated LLM via Chunked RDMA Transmission
    arXiv:2403.00881v1 Announce Type: new Abstract: Communication overhead is a significant bottleneck in federated learning (FL), which has been exaggerated with the increasing size of AI models. In this paper, we propose FedRDMA, a communication-efficient cross-silo FL system that integrates RDMA into the FL communication protocol. To overcome the limitations of RDMA in wide-area networks (WANs), FedRDMA divides the updated model into chunks and designs a series of optimization techniques to improve the efficiency and robustness of RDMA-based communication. We implement FedRDMA atop the industrial federated learning framework and evaluate it on a real-world cross-silo FL scenario. The experimental results show that \sys can achieve up to 3.8$\times$ speedup in communication efficiency compared to traditional TCP/IP-based FL systems.  ( 2 min )
    Enhancing Multivariate Time Series Forecasting with Mutual Information-driven Cross-Variable and Temporal Modeling
    arXiv:2403.00869v1 Announce Type: new Abstract: Recent advancements have underscored the impact of deep learning techniques on multivariate time series forecasting (MTSF). Generally, these techniques are bifurcated into two categories: Channel-independence and Channel-mixing approaches. Although Channel-independence methods typically yield better results, Channel-mixing could theoretically offer improvements by leveraging inter-variable correlations. Nonetheless, we argue that the integration of uncorrelated information in channel-mixing methods could curtail the potential enhancement in MTSF model performance. To substantiate this claim, we introduce the Cross-variable Decorrelation Aware feature Modeling (CDAM) for Channel-mixing approaches, aiming to refine Channel-mixing by minimizing redundant information between channels while enhancing relevant mutual information. Furthermore, we introduce the Temporal correlation Aware Modeling (TAM) to exploit temporal correlations, a step beyond conventional single-step forecasting methods. This strategy maximizes the mutual information between adjacent sub-sequences of both the forecasted and target series. Combining CDAM and TAM, our novel framework significantly surpasses existing models, including those previously considered state-of-the-art, in comprehensive tests.  ( 2 min )
    Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation
    arXiv:2403.00877v1 Announce Type: new Abstract: We study a mismatch between the deep learning recommendation models' flat architecture, common distributed training paradigm and hierarchical data center topology. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) Semantic-preserving Tower Transform (SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjoint towers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to each tower to reduce model complexity and communication volume through hierarchical feature interaction; and (3) Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactions and load balanced assignments to preserve model quality and training throughput via learned embeddings. We show that DMT can achieve up to 1.9x speedup compared to the state-of-the-art baselines without losing accuracy across multiple generations of hardware at large data center scales.  ( 2 min )
    Parallel Algorithms for Exact Enumeration of Deep Neural Network Activation Regions
    arXiv:2403.00860v1 Announce Type: new Abstract: A feedforward neural network using rectified linear units constructs a mapping from inputs to outputs by partitioning its input space into a set of convex regions where points within a region share a single affine transformation. In order to understand how neural networks work, when and why they fail, and how they compare to biological intelligence, we need to understand the organization and formation of these regions. Step one is to design and implement algorithms for exact region enumeration in networks beyond toy examples. In this work, we present parallel algorithms for exact enumeration in deep (and shallow) neural networks. Our work has three main contributions: (1) we present a novel algorithm framework and parallel algorithms for region enumeration; (2) we implement one of our algorithms on a variety of network architectures and experimentally show how the number of regions dictates runtime; and (3) we show, using our algorithm's output, how the dimension of a region's affine transformation impacts further partitioning of the region by deeper layers. To our knowledge, we run our implemented algorithm on networks larger than all of the networks used in the existing region enumeration literature. Further, we experimentally demonstrate the importance of parallelism for region enumeration of any reasonably sized network.  ( 2 min )
    Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
    arXiv:2403.00858v1 Announce Type: new Abstract: Text generation with Large Language Models (LLMs) is known to be memory bound due to the combination of their auto-regressive nature, huge parameter counts, and limited memory bandwidths, often resulting in low token rates. Speculative decoding has been proposed as a solution for LLM inference acceleration. However, since draft models are often unavailable in the modern open-source LLM families, e.g., for Llama 2 7B, training a high-quality draft model is required to enable inference acceleration via speculative decoding. In this paper, we propose a simple draft model training framework for direct alignment to chat-capable target models. With the proposed framework, we train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, with only 1.64\% of the original size. Our training framework only consists of pretraining, distillation dataset generation, and finetuning with knowledge distillation, with no additional alignment procedure. For the finetuning step, we use instruction-response pairs generated by target model for distillation in plausible data distribution, and propose a new Total Variation Distance++ (TVD++) loss that incorporates variance reduction techniques inspired from the policy gradient method in reinforcement learning. Our empirical results show that Llama 2 Chat Drafter 115M with speculative decoding achieves up to 2.3 block efficiency and 2.4$\times$ speed-up relative to autoregressive decoding on various tasks with no further task-specific fine-tuning.  ( 3 min )
    Empirical and Experimental Insights into Data Mining Techniques for Crime Prediction: A Comprehensive Survey
    arXiv:2403.00780v1 Announce Type: new Abstract: This survey paper presents a comprehensive analysis of crime prediction methodologies, exploring the various techniques and technologies utilized in this area. The paper covers the statistical methods, machine learning algorithms, and deep learning techniques employed to analyze crime data, while also examining their effectiveness and limitations. We propose a methodological taxonomy that classifies crime prediction algorithms into specific techniques. This taxonomy is structured into four tiers, including methodology category, methodology sub-category, methodology techniques, and methodology sub-techniques. Empirical and experimental evaluations are provided to rank the different techniques. The empirical evaluation assesses the crime prediction techniques based on four criteria, while the experimental evaluation ranks the algorithms that employ the same sub-technique, the different sub-techniques that employ the same technique, the different techniques that employ the same methodology sub-category, the different methodology sub-categories within the same category, and the different methodology categories. The combination of methodological taxonomy, empirical evaluations, and experimental comparisons allows for a nuanced and comprehensive understanding of crime prediction algorithms, aiding researchers in making informed decisions. Finally, the paper provides a glimpse into the future of crime prediction techniques, highlighting potential advancements and opportunities for further research in this field  ( 2 min )
    Distributed Momentum Methods Under Biased Gradient Estimations
    arXiv:2403.00853v1 Announce Type: new Abstract: Distributed stochastic gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased stochastic gradients, which have been the focus of most theoretical research, is challenging in many distributed machine learning applications. The gradient estimations easily become biased, for example, when gradients are compressed or clipped, when data is shuffled, and in meta-learning and reinforcement learning. In this work, we establish non-asymptotic convergence bounds on distributed momentum methods under biased gradient estimation on both general non-convex and $\mu$-PL non-convex problems. Our analysis covers general distributed optimization problems, and we work out the implications for special cases where gradient estimates are biased, i.e., in meta-learning and when the gradients are compressed or clipped. Our numerical experiments on training deep neural networks with Top-$K$ sparsification and clipping verify faster convergence performance of momentum methods than traditional biased gradient descent.  ( 2 min )
    Misconduct in Post-Selections and Deep Learning
    arXiv:2403.00773v1 Announce Type: new Abstract: This is a theoretical paper on "Deep Learning" misconduct in particular and Post-Selection in general. As far as the author knows, the first peer-reviewed papers on Deep Learning misconduct are [32], [37], [36]. Regardless of learning modes, e.g., supervised, reinforcement, adversarial, and evolutional, almost all machine learning methods (except for a few methods that train a sole system) are rooted in the same misconduct -- cheating and hiding -- (1) cheating in the absence of a test and (2) hiding bad-looking data. It was reasoned in [32], [37], [36] that authors must report at least the average error of all trained networks, good and bad, on the validation set (called general cross-validation in this paper). Better, report also five percentage positions of ranked errors. From the new analysis here, we can see that the hidden culprit is Post-Selection. This is also true for Post-Selection on hand-tuned or searched hyperparameters, because they are random, depending on random observation data. Does cross-validation on data splits rescue Post-Selections from the Misconducts (1) and (2)? The new result here says: No. Specifically, this paper reveals that using cross-validation for data splits is insufficient to exonerate Post-Selections in machine learning. In general, Post-Selections of statistical learners based on their errors on the validation set are statistically invalid.  ( 2 min )
  • Open

    Hutchinson Trace Estimation for High-Dimensional and High-Order Physics-Informed Neural Networks
    arXiv:2312.14499v2 Announce Type: replace-cross Abstract: Physics-Informed Neural Networks (PINNs) have proven effective in solving partial differential equations (PDEs), especially when some data are available by seamlessly blending data and physics. However, extending PINNs to high-dimensional and even high-order PDEs encounters significant challenges due to the computational cost associated with automatic differentiation in the residual loss. Herein, we address the limitations of PINNs in handling high-dimensional and high-order PDEs by introducing Hutchinson Trace Estimation (HTE). Starting with the second-order high-dimensional PDEs ubiquitous in scientific computing, HTE transforms the calculation of the entire Hessian matrix into a Hessian vector product (HVP). This approach alleviates the computational bottleneck via Taylor-mode automatic differentiation and significantly reduces memory consumption from the Hessian matrix to HVP. We further showcase HTE's convergence to the original PINN loss and its unbiased behavior under specific conditions. Comparisons with Stochastic Dimension Gradient Descent (SDGD) highlight the distinct advantages of HTE, particularly in scenarios with significant variance among dimensions. We further extend HTE to higher-order and higher-dimensional PDEs, specifically addressing the biharmonic equation. By employing tensor-vector products (TVP), HTE efficiently computes the colossal tensor associated with the fourth-order high-dimensional biharmonic equation, saving memory and enabling rapid computation. The effectiveness of HTE is illustrated through experimental setups, demonstrating comparable convergence rates with SDGD under memory and speed constraints. Additionally, HTE proves valuable in accelerating the Gradient-Enhanced PINN (gPINN) version as well as the Biharmonic equation. Overall, HTE opens up a new capability in scientific machine learning for tackling high-order and high-dimensional PDEs.  ( 3 min )
    Adaptive operator learning for infinite-dimensional Bayesian inverse problems
    arXiv:2310.17844v2 Announce Type: replace-cross Abstract: The fundamental computational issues in Bayesian inverse problems (BIP) governed by partial differential equations (PDEs) stem from the requirement of repeated forward model evaluations. A popular strategy to reduce such costs is to replace expensive model simulations with computationally efficient approximations using operator learning, motivated by recent progress in deep learning. However, using the approximated model directly may introduce a modeling error, exacerbating the already ill-posedness of inverse problems. Thus, balancing between accuracy and efficiency is essential for the effective implementation of such approaches. To this end, we develop an adaptive operator learning framework that can reduce modeling error gradually by forcing the surrogate to be accurate in local areas. This is accomplished by adaptively fine-tuning the pre-trained approximate model with train- ing points chosen by a greedy algorithm during the posterior computational process. To validate our approach, we use DeepOnet to construct the surrogate and unscented Kalman inversion (UKI) to approximate the BIP solution, respectively. Furthermore, we present a rigorous convergence guarantee in the linear case using the UKI framework. The approach is tested on a number of benchmarks, including the Darcy flow, the heat source inversion problem, and the reaction-diffusion problem. The numerical results show that our method can significantly reduce computational costs while maintaining inversion accuracy.  ( 2 min )
    Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions
    arXiv:2311.06108v2 Announce Type: replace-cross Abstract: The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in case that $P$ has well separated subpopulations even if these subpopulations differ from what the mixture model assumes.  ( 2 min )
    A connection between Tempering and Entropic Mirror Descent
    arXiv:2310.11914v2 Announce Type: replace-cross Abstract: This paper explores the connections between tempering (for Sequential Monte Carlo; SMC) and entropic mirror descent to sample from a target probability distribution whose unnormalized density is known. We establish that tempering SMC corresponds to entropic mirror descent applied to the reverse Kullback-Leibler (KL) divergence and obtain convergence rates for the tempering iterates. Our result motivates the tempering iterates from an optimization point of view, showing that tempering can be seen as a descent scheme of the KL divergence with respect to the Fisher-Rao geometry, in contrast to Langevin dynamics that perform descent of the KL with respect to the Wasserstein-2 geometry. We exploit the connection between tempering and mirror descent iterates to justify common practices in SMC and derive adaptive tempering rules that improve over other alternative benchmarks in the literature.  ( 2 min )
    Local Graph Clustering with Noisy Labels
    arXiv:2310.08031v2 Announce Type: replace-cross Abstract: The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.  ( 3 min )
    Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization
    arXiv:2310.03234v4 Announce Type: replace-cross Abstract: This paper investigates new families of compositional optimization problems, called $\underline{\bf n}$on-$\underline{\bf s}$mooth $\underline{\bf w}$eakly-$\underline{\bf c}$onvex $\underline{\bf f}$inite-sum $\underline{\bf c}$oupled $\underline{\bf c}$ompositional $\underline{\bf o}$ptimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly-convex. We analyze a single-loop algorithm and establish its complexity for finding an $\epsilon$-stationary point of the Moreau envelop of the objective function. Additionally, we also extend the algorithm to solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.  ( 2 min )
    Orlicz regrets to consistently bound statistics of random variables with an application to environmental indicators
    arXiv:2310.05168v2 Announce Type: replace-cross Abstract: Evaluating environmental variables that vary stochastically is the principal topic for designing better environmental management and restoration schemes. Both the upper and lower estimates of these variables, such as water quality indices and flood and drought water levels, are important and should be consistently evaluated within a unified mathematical framework. We propose a novel pair of Orlicz regrets to consistently bound the statistics of random variables both from below and above. Here, consistency indicates that the upper and lower bounds are evaluated with common coefficients and parameter values being different from some of the risk measures proposed thus far. Orlicz regrets can flexibly evaluate the statistics of random variables based on their tail behavior. The explicit linkage between Orlicz regrets and divergence risk measures was exploited to better comprehend them. We obtain sufficient conditions to pose the Orlicz regrets as well as divergence risk measures, and further provide gradient descent-type numerical algorithms to compute them. Finally, we apply the proposed mathematical framework to the statistical evaluation of 31-year water quality data as key environmental indicators in a Japanese river environment.  ( 2 min )
    $G$-Mapper: Learning a Cover in the Mapper Construction
    arXiv:2309.06634v2 Announce Type: replace-cross Abstract: The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. However, the Mapper algorithm requires tuning several parameters in order to generate a ``nice" Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by splitting a cover repeatedly according to a statistical test for normality. Our algorithm is based on $G$-means clustering which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to carefully choose the cover according to the distribution of the given data. Experiments for synthetic and real-world datasets demonstrate that our algorithm generates covers so that the Mapper graphs retain the essence of the datasets, while also running significantly fast.  ( 2 min )
    SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
    arXiv:2309.05019v2 Announce Type: replace-cross Abstract: Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation tasks. As sampling from DPMs is equivalent to solving diffusion SDE or ODE which is time-consuming, numerous fast sampling methods built upon improved differential equation solvers are proposed. The majority of such techniques consider solving the diffusion ODE due to its superior efficiency. However, stochastic sampling could offer additional advantages in generating diverse and high-quality data. In this work, we engage in a comprehensive analysis of stochastic sampling from two aspects: variance-controlled diffusion SDE and linear multi-step SDE solver. Based on our analysis, we propose SA-Solver, which is an improved efficient stochastic Adams method for solving diffusion SDE to generate data with high quality. Our experiments show that SA-Solver achieves: 1) improved or comparable performance compared with the existing state-of-the-art sampling methods for few-step sampling; 2) SOTA FID scores on substantial benchmark datasets under a suitable number of function evaluations (NFEs).  ( 2 min )
    Tackling the Curse of Dimensionality with Physics-Informed Neural Networks
    arXiv:2307.12306v5 Announce Type: replace-cross Abstract: The curse-of-dimensionality taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs, as Richard E. Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. We develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and randomly samples a subset of these dimensional pieces in each iteration of training PINNs. We prove theoretically the convergence and other desired properties of the proposed method. We demonstrate in various diverse tests that the proposed method can solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in tens of thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. Notably, we solve nonlinear PDEs with nontrivial, anisotropic, and inseparable solutions in 100,000 effective dimensions in 12 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, it can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.  ( 3 min )
    FaIRGP: A Bayesian Energy Balance Model for Surface Temperatures Emulation
    arXiv:2307.10052v2 Announce Type: replace-cross Abstract: Emulators, or reduced complexity climate models, are surrogate Earth system models that produce projections of key climate quantities with minimal computational resources. Using time-series modelling or more advanced machine learning techniques, data-driven emulators have emerged as a promising avenue of research, producing spatially resolved climate responses that are visually indistinguishable from state-of-the-art Earth system models. Yet, their lack of physical interpretability limits their wider adoption. In this work, we introduce FaIRGP, a data-driven emulator that satisfies the physical temperature response equations of an energy balance model. The result is an emulator that \textit{(i)} enjoys the flexibility of statistical machine learning models and can learn from data, and \textit{(ii)} has a robust physical grounding with interpretable parameters that can be used to make inference about the climate system. Further, our Bayesian approach allows a principled and mathematically tractable uncertainty quantification. Our model demonstrates skillful emulation of global mean surface temperature and spatial surface temperatures across realistic future scenarios. Its ability to learn from data allows it to outperform energy balance models, while its robust physical foundation safeguards against the pitfalls of purely data-driven models. We also illustrate how FaIRGP can be used to obtain estimates of top-of-atmosphere radiative forcing and discuss the benefits of its mathematical tractability for applications such as detection and attribution or precipitation emulation. We hope that this work will contribute to widening the adoption of data-driven methods in climate emulation.  ( 3 min )
    Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
    arXiv:2306.04815v2 Announce Type: replace-cross Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.  ( 2 min )
    Designing Decision Support Systems Using Counterfactual Prediction Sets
    arXiv:2306.03928v2 Announce Type: replace-cross Abstract: Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study ($n = 2{,}751$) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/counterfactual-prediction-sets.  ( 3 min )
    Unpaired Image-to-Image Translation via Neural Schr\"odinger Bridge
    arXiv:2305.15086v3 Announce Type: replace-cross Abstract: Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. While diffusion models have achieved remarkable progress, they have limitations in unpaired image-to-image (I2I) translation tasks due to the Gaussian prior assumption. Schr\"{o}dinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, have risen as an attractive solution to this problem. Yet, to our best knowledge, none of SB models so far have been successful at unpaired translation between high-resolution images. In this work, we propose Unpaired Neural Schr\"{o}dinger Bridge (UNSB), which expresses the SB problem as a sequence of adversarial learning problems. This allows us to incorporate advanced discriminators and regularization to learn a SB between unpaired data. We show that UNSB is scalable and successfully solves various unpaired I2I translation tasks. Code: \url{https://github.com/cyclomon/UNSB}  ( 2 min )
    Learning Safety Constraints from Demonstrations with Unknown Rewards
    arXiv:2305.16147v2 Announce Type: replace-cross Abstract: We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in gridworld environments and a driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior. Importantly, we can safely transfer the learned constraints to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.  ( 2 min )
    Approximately optimal domain adaptation with Fisher's Linear Discriminant
    arXiv:2302.14186v3 Announce Type: replace-cross Abstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propose a computable approximation, and study the effect of various parameter settings on the relative risks between the optimal hypothesis, hypothesis i), and hypothesis ii). We demonstrate the effectiveness of the proposed optimal classifier in the context of EEG- and ECG-based classification settings and argue that the optimal classifier can be computed without access to direct information from any of the individual source tasks. We conclude by discussing further applications, limitations, and possible future directions.  ( 2 min )
    Gaussian Process-Gated Hierarchical Mixtures of Experts
    arXiv:2302.04947v2 Announce Type: replace-cross Abstract: In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs). Unlike other mixtures of experts with gating models linear in the input, our model employs gating functions built with Gaussian processes (GPs). These processes are based on random features that are non-linear functions of the inputs. Furthermore, the experts in our model are also constructed with GPs. The optimization of the GPHMEs is performed by variational inference. The proposed GPHMEs have several advantages. They outperform tree-based HME benchmarks that partition the data in the input space, and they achieve good performance with reduced complexity. Another advantage is the interpretability they provide for deep GPs, and more generally, for deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets, even with quite modest sizes.  ( 2 min )
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints
    arXiv:2212.04672v3 Announce Type: replace-cross Abstract: Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal-dual alternating proximal gradient (PDAPG) algorithm and a primal-dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantees for solving the nonconvex minimax problems with coupled linear constraints.  ( 2 min )
    On the Expressive Power of Geometric Graph Neural Networks
    arXiv:2301.09308v3 Announce Type: replace-cross Abstract: The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at \url{https://github.com/chaitjo/geometric-gnn-dojo}  ( 2 min )
    Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks
    arXiv:2105.13937v3 Announce Type: replace-cross Abstract: We present a new class of Langevin based algorithms, which overcomes many of the known shortcomings of popular adaptive optimizers that are currently used for the fine tuning of deep learning models. Its underpinning theory relies on recent advances of Euler's polygonal approximations for stochastic differential equations (SDEs) with monotone coefficients. As a result, it inherits the stability properties of tamed algorithms, while it addresses other known issues, e.g. vanishing gradients in neural networks. In particular, we provide a nonasymptotic analysis and full theoretical guarantees for the convergence properties of an algorithm of this novel class, which we named TH$\varepsilon$O POULA (or, simply, TheoPouLa). Finally, several experiments are presented with different types of deep learning models, which show the superior performance of TheoPouLa over many popular adaptive optimization algorithms.  ( 2 min )
    Private Prediction Sets
    arXiv:2102.06202v3 Announce Type: replace-cross Abstract: In real-world settings involving consequential decision-making, the deployment of machine learning systems generally requires both reliable uncertainty quantification and protection of individuals' privacy. We present a framework that treats these two desiderata jointly. Our framework is based on conformal prediction, a methodology that augments predictive models to return prediction sets that provide uncertainty quantification -- they provably cover the true response with a user-specified probability, such as 90%. One might hope that when used with privately-trained models, conformal prediction would yield privacy guarantees for the resulting prediction sets; unfortunately, this is not the case. To remedy this key problem, we develop a method that takes any pre-trained predictive model and outputs differentially private prediction sets. Our method follows the general approach of split conformal prediction; we use holdout data to calibrate the size of the prediction sets but preserve privacy by using a privatized quantile subroutine. This subroutine compensates for the noise introduced to preserve privacy in order to guarantee correct coverage. We evaluate the method on large-scale computer vision datasets.  ( 3 min )
    Semiparametric Efficient Inference in Adaptive Experiments
    arXiv:2311.18274v3 Announce Type: replace Abstract: We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control.  ( 2 min )
    $t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence
    arXiv:2312.01133v2 Announce Type: replace Abstract: The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's t-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing with $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.  ( 2 min )
    Hodge-Compositional Edge Gaussian Processes
    arXiv:2310.19450v3 Announce Type: replace Abstract: We propose principled Gaussian processes (GPs) for modeling functions defined over the edge set of a simplicial 2-complex, a structure similar to a graph in which edges may form triangular faces. This approach is intended for learning flow-type data on networks where edge flows can be characterized by the discrete divergence and curl. Drawing upon the Hodge decomposition, we first develop classes of divergence-free and curl-free edge GPs, suitable for various applications. We then combine them to create \emph{Hodge-compositional edge GPs} that are expressive enough to represent any edge function. These GPs facilitate direct and independent learning for the different Hodge components of edge functions, enabling us to capture their relevance during hyperparameter optimization. To highlight their practical potential, we apply them for flow data inference in currency exchange, ocean currents and water supply networks, comparing them to alternative models.  ( 2 min )
    Optimising Distributions with Natural Gradient Surrogates
    arXiv:2310.11837v2 Announce Type: replace Abstract: Natural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient has a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.  ( 2 min )
    Pure Differential Privacy for Functional Summaries via a Laplace-like Process
    arXiv:2309.00125v2 Announce Type: replace Abstract: Many existing mechanisms to achieve differential privacy (DP) on infinite-dimensional functional summaries often involve embedding these summaries into finite-dimensional subspaces and applying traditional DP techniques. Such mechanisms generally treat each dimension uniformly and struggle with complex, structured summaries. This work introduces a novel mechanism for DP functional summary release: the Independent Component Laplace Process (ICLP) mechanism. This mechanism treats the summaries of interest as truly infinite-dimensional objects, thereby addressing several limitations of existing mechanisms. We establish the feasibility of the proposed mechanism in multiple function spaces. Several statistical estimation problems are considered, and we demonstrate one can enhance the utility of sanitized summaries by oversmoothing their non-private counterpart. Numerical experiments on synthetic and real datasets demonstrate the efficacy of the proposed mechanism.  ( 2 min )
    Robust Statistical Comparison of Random Variables with Locally Varying Scale of Measurement
    arXiv:2306.12803v2 Announce Type: replace Abstract: Spaces with locally varying scale of measurement, like multidimensional structures with differently scaled dimensions, are pretty common in statistics and machine learning. Nevertheless, it is still understood as an open question how to exploit the entire information encoded in them properly. We address this problem by considering an order based on (sets of) expectations of random variables mapping into such non-standard spaces. This order contains stochastic dominance and expectation order as extreme cases when no, or respectively perfect, cardinal structure is given. We derive a (regularized) statistical test for our proposed generalized stochastic dominance (GSD) order, operationalize it by linear optimization, and robustify it by imprecise probability models. Our findings are illustrated with data from multidimensional poverty measurement, finance, and medicine.  ( 2 min )
    Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming
    arXiv:2305.18436v3 Announce Type: replace Abstract: $K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations have recently been proposed for solving the $K$-means optimization problem that enjoy strong statistical optimality guarantees, but the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but without a solid statistical underpinning nor rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a nonnegative low-rank restriction of the SDP relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors compared to the existing state-of-the-art.  ( 2 min )
    On Neural Networks as Infinite Tree-Structured Probabilistic Graphical Models
    arXiv:2305.17583v3 Announce Type: replace Abstract: Deep neural networks (DNNs) lack the precise semantics and definitive probabilistic interpretation of probabilistic graphical models (PGMs). In this paper, we propose an innovative solution by constructing infinite tree-structured PGMs that correspond exactly to neural networks. Our research reveals that DNNs, during forward propagation, indeed perform approximations of PGM inference that are precise in this alternative PGM structure. Not only does our research complement existing studies that describe neural networks as kernel machines or infinite-sized Gaussian processes, it also elucidates a more direct approximation that DNNs make to exact inference in PGMs. Potential benefits include improved pedagogy and interpretation of DNNs, and algorithms that can merge the strengths of PGMs and DNNs.  ( 2 min )
    Settling the Sample Complexity of Model-Based Offline Reinforcement Learning
    arXiv:2204.05275v3 Announce Type: replace Abstract: This paper is concerned with offline reinforcement learning (RL), which learns using pre-collected data without further exploration. Effective offline RL would be able to accommodate distribution shift and limited data coverage. However, prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality, thus posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a finite-horizon (resp. $\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$ (resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution shift of data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL yields $\varepsilon$-accuracy with a sample complexity of \[ \begin{cases} \frac{H^{4}SC_{\text{clipped}}^{\star}}{\varepsilon^{2}} & (\text{finite-horizon MDPs}) \frac{SC_{\text{clipped}}^{\star}}{(1-\gamma)^{3}\varepsilon^{2}} & (\text{infinite-horizon MDPs}) \end{cases} \] up to log factor, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are ``pessimistic'' variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance reduction. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.  ( 3 min )
    Statistical Query Lower Bounds for Learning Truncated Gaussians
    arXiv:2403.02300v1 Announce Type: cross Abstract: We study the problem of estimating the mean of an identity covariance Gaussian in the truncated setting, in the regime when the truncation set comes from a low-complexity family $\mathcal{C}$ of sets. Specifically, for a fixed but unknown truncation set $S \subseteq \mathbb{R}^d$, we are given access to samples from the distribution $\mathcal{N}(\boldsymbol{ \mu}, \mathbf{ I})$ truncated to the set $S$. The goal is to estimate $\boldsymbol\mu$ within accuracy $\epsilon>0$ in $\ell_2$-norm. Our main result is a Statistical Query (SQ) lower bound suggesting a super-polynomial information-computation gap for this task. In more detail, we show that the complexity of any SQ algorithm for this problem is $d^{\mathrm{poly}(1/\epsilon)}$, even when the class $\mathcal{C}$ is simple so that $\mathrm{poly}(d/\epsilon)$ samples information-theoretically suffice. Concretely, our SQ lower bound applies when $\mathcal{C}$ is a union of a bounded number of rectangles whose VC dimension and Gaussian surface are small. As a corollary of our construction, it also follows that the complexity of the previously known algorithm for this task is qualitatively best possible.  ( 2 min )
    Transformers Provably Learn Feature-Position Correlations in Masked Image Modeling
    arXiv:2403.02233v1 Announce Type: cross Abstract: Masked image modeling (MIM), which predicts randomly masked patches from unmasked ones, has emerged as a promising approach in self-supervised vision pretraining. However, the theoretical understanding of MIM is rather limited, especially with the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theory of learning one-layer transformers with softmax attention in MIM self-supervised pretraining. On the conceptual side, we posit a theoretical mechanism of how transformers, pretrained with MIM, produce empirically observed local and diverse attention patterns on data distributions with spatial structures that highlight feature-position correlations. On the technical side, our end-to-end analysis of the training dynamics of softmax-based transformers accommodates both input and position embeddings simultaneously, which is developed based on a novel approach to track the interplay between the attention of feature-position and position-wise correlations.  ( 2 min )
    Mutual Information Estimation via Normalizing Flows
    arXiv:2403.02187v1 Announce Type: cross Abstract: We propose a novel approach to the problem of mutual information (MI) estimation via introducing normalizing flows-based estimator. The estimator maps original data to the target distribution with known closed-form expression for MI. We demonstrate that our approach yields MI estimates for the original data. Experiments with high-dimensional data are provided to show the advantages of the proposed estimator.  ( 2 min )
    Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities
    arXiv:2403.02004v1 Announce Type: cross Abstract: We prove non-asymptotic error bounds for particle gradient descent (PGD)~(Kuntz et al., 2023), a recently introduced algorithm for maximum likelihood estimation of large latent variable models obtained by discretizing a gradient flow of the free energy. We begin by showing that, for models satisfying a condition generalizing both the log-Sobolev and the Polyak--{\L}ojasiewicz inequalities (LSI and P{\L}I, respectively), the flow converges exponentially fast to the set of minimizers of the free energy. We achieve this by extending a result well-known in the optimal transport literature (that the LSI implies the Talagrand inequality) and its counterpart in the optimization literature (that the P{\L}I implies the so-called quadratic growth condition), and applying it to our new setting. We also generalize the Bakry--\'Emery Theorem and show that the LSI/P{\L}I generalization holds for models with strongly concave log-likelihoods. For such models, we further control PGD's discretization error, obtaining non-asymptotic error bounds. While we are motivated by the study of PGD, we believe that the inequalities and results we extend may be of independent interest.  ( 2 min )
    Max-sliced 2-Wasserstein distance
    arXiv:2403.02142v1 Announce Type: cross Abstract: This note is a continuation of the author's previous work on ``Sharp bounds for the max-sliced Wasserstein distance." We use the same technique to obtain an upper bound for the expected max-sliced 2-Wasserstein distance between a compactly supported symmetric probability measure on a Euclidean space and its symmetrized empirical distribution.  ( 2 min )
    Robustness Bounds on the Successful Adversarial Examples: Theory and Practice
    arXiv:2403.01896v1 Announce Type: cross Abstract: Adversarial example (AE) is an attack method for machine learning, which is crafted by adding imperceptible perturbation to the data inducing misclassification. In the current paper, we investigated the upper bound of the probability of successful AEs based on the Gaussian Process (GP) classification. We proved a new upper bound that depends on AE's perturbation norm, the kernel function used in GP, and the distance of the closest pair with different labels in the training dataset. Surprisingly, the upper bound is determined regardless of the distribution of the sample dataset. We showed that our theoretical result was confirmed through the experiment using ImageNet. In addition, we showed that changing the parameters of the kernel function induces a change of the upper bound of the probability of successful AEs.  ( 2 min )
    Deep Horseshoe Gaussian Processes
    arXiv:2403.01737v1 Announce Type: cross Abstract: Deep Gaussian processes have recently been proposed as natural objects to fit, similarly to deep neural networks, possibly complex features present in modern data samples, such as compositional structures. Adopting a Bayesian nonparametric approach, it is natural to use deep Gaussian processes as prior distributions, and use the corresponding posterior distributions for statistical inference. We introduce the deep Horseshoe Gaussian process Deep-HGP, a new simple prior based on deep Gaussian processes with a squared-exponential kernel, that in particular enables data-driven choices of the key lengthscale parameters. For nonparametric regression with random design, we show that the associated tempered posterior distribution recovers the unknown true regression curve optimally in terms of quadratic loss, up to a logarithmic factor, in an adaptive way. The convergence rates are simultaneously adaptive to both the smoothness of the regression function and to its structure in terms of compositions. The dependence of the rates in terms of dimension are explicit, allowing in particular for input spaces of dimension increasing with the number of observations.  ( 2 min )
    Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models
    arXiv:2403.01639v1 Announce Type: cross Abstract: Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties. Such information is coined as guidance. For example, in text-to-image synthesis, text input is encoded as guidance to generate semantically aligned images. Proper guidance inputs are closely tied to the performance of diffusion models. A common observation is that strong guidance promotes a tight alignment to the task-specific information, while reducing the diversity of the generated samples. In this paper, we provide the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models. Under mild conditions, we prove that incorporating diffusion guidance not only boosts classification confidence but also diminishes distribution diversity, leading to a reduction in the differential entropy of the output distribution. Our analysis covers the widely adopted sampling schemes including DDPM and DDIM, and leverages comparison inequalities for differential equations as well as the Fokker-Planck equation that characterizes the evolution of probability density function, which may be of independent theoretical interest.  ( 2 min )
    Dendrogram of mixing measures: Learning latent hierarchy and model selection for finite mixture models
    arXiv:2403.01684v1 Announce Type: cross Abstract: We present a new way to summarize and select mixture models via the hierarchical clustering tree (dendrogram) of an overfitted latent mixing measure. Our proposed method bridges agglomerative hierarchical clustering and mixture modeling. The dendrogram's construction is derived from the theory of convergence of the mixing measures, and as a result, we can both consistently select the true number of mixing components and obtain the pointwise optimal convergence rate for parameter estimation from the tree, even when the model parameters are only weakly identifiable. In theory, it explicates the choice of the optimal number of clusters in hierarchical clustering. In practice, the dendrogram reveals more information on the hierarchy of subpopulations compared to traditional ways of summarizing mixture models. Several simulation studies are carried out to support our theory. We also illustrate the methodology with an application to single-cell RNA sequence analysis.  ( 2 min )
    Critical windows: non-asymptotic theory for feature emergence in diffusion models
    arXiv:2403.01633v1 Announce Type: cross Abstract: We develop theory to understand an intriguing property of diffusion models for image generation that we term critical windows. Empirically, it has been observed that there are narrow time intervals in sampling during which particular features of the final image emerge, e.g. the image class or background color (Ho et al., 2020b; Georgiev et al., 2023; Raya & Ambrogioni, 2023; Sclocchi et al., 2024; Biroli et al., 2024). While this is advantageous for interpretability as it implies one can localize properties of the generation to a small segment of the trajectory, it seems at odds with the continuous nature of the diffusion. We propose a formal framework for studying these windows and show that for data coming from a mixture of strongly log-concave densities, these windows can be provably bounded in terms of certain measures of inter- and intra-group separation. We also instantiate these bounds for concrete examples like well-conditioned Gaussian mixtures. Finally, we use our bounds to give a rigorous interpretation of diffusion models as hierarchical samplers that progressively "decide" output features over a discrete sequence of times. We validate our bounds with synthetic experiments. Additionally, preliminary experiments on Stable Diffusion suggest critical windows may serve as a useful tool for diagnosing fairness and privacy violations in real-world diffusion models.  ( 2 min )
    Towards Provable Log Density Policy Gradient
    arXiv:2403.01605v1 Announce Type: cross Abstract: Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Modern policy gradient methods, although successful, introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and correcting for it could potentially improve sample-complexity of reinforcement learning methods. To that end, we propose log density gradient to estimate the policy gradient, which corrects for this residual error term. Log density gradient method computes policy gradient by utilising the state-action discounted distributional formulation. We first present the equations needed to exactly find the log density gradient for a tabular Markov Decision Processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates log density gradient by utilizing backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive we also propose a min-max optimization that can approximate log density gradient using just on-policy samples. We also prove uniqueness, and convergence under linear function approximation, for this min-max optimization. Finally, we show that the sample complexity of our min-max optimization to be of the order of $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof-of-concept for our log density gradient method on gridworld environment, and observe that our method is able to improve upon the classical policy gradient method by a clear margin, thus indicating a promising novel direction to develop reinforcement learning algorithms that require fewer samples.  ( 2 min )
    Calibrating doubly-robust estimators with unbalanced treatment assignment
    arXiv:2403.01585v1 Announce Type: cross Abstract: Machine learning methods, particularly the double machine learning (DML) estimator (Chernozhukov et al., 2018), are increasingly popular for the estimation of the average treatment effect (ATE). However, datasets often exhibit unbalanced treatment assignments where only a few observations are treated, leading to unstable propensity score estimations. We propose a simple extension of the DML estimator which undersamples data for propensity score modeling and calibrates scores to match the original distribution. The paper provides theoretical results showing that the estimator retains the DML estimator's asymptotic properties. A simulation study illustrates the finite sample performance of the estimator.  ( 2 min )
    Fusion of Gaussian Processes Predictions with Monte Carlo Sampling
    arXiv:2403.01389v1 Announce Type: cross Abstract: In science and engineering, we often work with models designed for accurate prediction of variables of interest. Recognizing that these models are approximations of reality, it becomes desirable to apply multiple models to the same data and integrate their outcomes. In this paper, we operate within the Bayesian paradigm, relying on Gaussian processes as our models. These models generate predictive probability density functions (pdfs), and the objective is to integrate them systematically, employing both linear and log-linear pooling. We introduce novel approaches for log-linear pooling, determining input-dependent weights for the predictive pdfs of the Gaussian processes. The aggregation of the pdfs is realized through Monte Carlo sampling, drawing samples of weights from their posterior. The performance of these methods, as well as those based on linear pooling, is demonstrated using a synthetic dataset.  ( 2 min )
    Can a Confident Prior Replace a Cold Posterior?
    arXiv:2403.01272v1 Announce Type: cross Abstract: Benchmark datasets used for image classification tend to have very low levels of label noise. When Bayesian neural networks are trained on these datasets, they often underfit, misrepresenting the aleatoric uncertainty of the data. A common solution is to cool the posterior, which improves fit to the training data but is challenging to interpret from a Bayesian perspective. We explore whether posterior tempering can be replaced by a confidence-inducing prior distribution. First, we introduce a "DirClip" prior that is practical to sample and nearly matches the performance of a cold posterior. Second, we introduce a "confidence prior" that directly approximates a cold likelihood in the limit of decreasing temperature but cannot be easily sampled. Lastly, we provide several general insights into confidence-inducing priors, such as when they might diverge and how fine-tuning can mitigate numerical instability.  ( 2 min )
    Near-optimal Per-Action Regret Bounds for Sleeping Bandits
    arXiv:2403.01315v1 Announce Type: cross Abstract: We derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts' confidences. We prove a lower bound, showing that for any minimax optimal algorithms, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.  ( 2 min )
    A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features
    arXiv:2403.01046v1 Announce Type: cross Abstract: We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2-layer networks with piecewise linear activations, deep narrow ReLU networks with up to 4 layers, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly in ReLU networks, a fourth layer creates features that represent reflections of training data about themselves. The Lasso representation sheds insight to globally optimal networks and the solution landscape.  ( 2 min )
    Stochastic gradient descent for streaming linear and rectified linear systems with Massart noise
    arXiv:2403.01204v1 Announce Type: cross Abstract: We propose SGD-exp, a stochastic gradient descent approach for linear and ReLU regressions under Massart noise (adversarial semi-random corruption model) for the fully streaming setting. We show novel nearly linear convergence guarantees of SGD-exp to the true parameter with up to $50\%$ Massart corruption rate, and with any corruption rate in the case of symmetric oblivious corruptions. This is the first convergence guarantee result for robust ReLU regression in the streaming setting, and it shows the improved convergence rate over previous robust methods for $L_1$ linear regression due to a choice of an exponentially decaying step size, known for its efficiency in practice. Our analysis is based on the drift analysis of a discrete stochastic process, which could also be interesting on its own.  ( 2 min )
    Enhancing Multivariate Time Series Forecasting with Mutual Information-driven Cross-Variable and Temporal Modeling
    arXiv:2403.00869v1 Announce Type: cross Abstract: Recent advancements have underscored the impact of deep learning techniques on multivariate time series forecasting (MTSF). Generally, these techniques are bifurcated into two categories: Channel-independence and Channel-mixing approaches. Although Channel-independence methods typically yield better results, Channel-mixing could theoretically offer improvements by leveraging inter-variable correlations. Nonetheless, we argue that the integration of uncorrelated information in channel-mixing methods could curtail the potential enhancement in MTSF model performance. To substantiate this claim, we introduce the Cross-variable Decorrelation Aware feature Modeling (CDAM) for Channel-mixing approaches, aiming to refine Channel-mixing by minimizing redundant information between channels while enhancing relevant mutual information. Furthermore, we introduce the Temporal correlation Aware Modeling (TAM) to exploit temporal correlations, a step beyond conventional single-step forecasting methods. This strategy maximizes the mutual information between adjacent sub-sequences of both the forecasted and target series. Combining CDAM and TAM, our novel framework significantly surpasses existing models, including those previously considered state-of-the-art, in comprehensive tests.  ( 2 min )
    On the Role of Information Structure in Reinforcement Learning for Partially-Observable Sequential Teams and Games
    arXiv:2403.00993v1 Announce Type: cross Abstract: In a sequential decision-making problem, the information structure is the description of how events in the system occurring at different points in time affect each other. Classical models of reinforcement learning (e.g., MDPs, POMDPs, Dec-POMDPs, and POMGs) assume a very simple and highly regular information structure, while more general models like predictive state representations do not explicitly model the information structure. By contrast, real-world sequential decision-making problems typically involve a complex and time-varying interdependence of system variables, requiring a rich and flexible representation of information structure. In this paper, we argue for the perspective that explicit representation of information structures is an important component of analyzing and solving reinforcement learning problems. We propose novel reinforcement learning models with an explicit representation of information structure, capturing classical models as special cases. We show that this leads to a richer analysis of sequential decision-making problems and enables more tailored algorithm design. In particular, we characterize the "complexity" of the observable dynamics of any sequential decision-making problem through a graph-theoretic analysis of the DAG representation of its information structure. The central quantity in this analysis is the minimal set of variables that $d$-separates the past observations from future observations. Furthermore, through constructing a generalization of predictive state representations, we propose tailored reinforcement learning algorithms and prove that the sample complexity is in part determined by the information structure. This recovers known tractability results and gives a novel perspective on reinforcement learning in general sequential decision-making problems, providing a systematic way of identifying new tractable classes of problems.  ( 3 min )
    Enhancing Mean-Reverting Time Series Prediction with Gaussian Processes: Functional and Augmented Data Structures in Financial Forecasting
    arXiv:2403.00796v1 Announce Type: cross Abstract: In this paper, we explore the application of Gaussian Processes (GPs) for predicting mean-reverting time series with an underlying structure, using relatively unexplored functional and augmented data structures. While many conventional forecasting methods concentrate on the short-term dynamics of time series data, GPs offer the potential to forecast not just the average prediction but the entire probability distribution over a future trajectory. This is particularly beneficial in financial contexts, where accurate predictions alone may not suffice if incorrect volatility assessments lead to capital losses. Moreover, in trade selection, GPs allow for the forecasting of multiple Sharpe ratios adjusted for transaction costs, aiding in decision-making. The functional data representation utilized in this study enables longer-term predictions by leveraging information from previous years, even as the forecast moves away from the current year's training data. Additionally, the augmented representation enriches the training set by incorporating multiple targets for future points in time, facilitating long-term predictions. Our implementation closely aligns with the methodology outlined in, which assessed effectiveness on commodity futures. However, our testing methodology differs. Instead of real data, we employ simulated data with similar characteristics. We construct a testing environment to evaluate both data representations and models under conditions of increasing noise, fat tails, and inappropriate kernels-conditions commonly encountered in practice. By simulating data, we can compare our forecast distribution over time against a full simulation of the actual distribution of our test set, thereby reducing the inherent uncertainty in testing time series models on real data. We enable feature prediction through augmentation and employ sub-sampling to ensure the feasibility of GPs.  ( 3 min )
    NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions
    arXiv:2403.00849v1 Announce Type: cross Abstract: Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, with rigid sparsity and quantization enforced only between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, the digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to lower latency NNs for the same accuracy.  ( 3 min )
    Statistical Estimation in the Spiked Tensor Model via the Quantum Approximate Optimization Algorithm
    arXiv:2402.19456v1 Announce Type: cross Abstract: The quantum approximate optimization algorithm (QAOA) is a general-purpose algorithm for combinatorial optimization. In this paper, we analyze the performance of the QAOA on a statistical estimation problem, namely, the spiked tensor model, which exhibits a statistical-computational gap classically. We prove that the weak recovery threshold of $1$-step QAOA matches that of $1$-step tensor power iteration. Additional heuristic calculations suggest that the weak recovery threshold of $p$-step QAOA matches that of $p$-step tensor power iteration when $p$ is a fixed constant. This further implies that multi-step QAOA with tensor unfolding could achieve, but not surpass, the classical computation threshold $\Theta(n^{(q-2)/4})$ for spiked $q$-tensors. Meanwhile, we characterize the asymptotic overlap distribution for $p$-step QAOA, finding an intriguing sine-Gaussian law verified through simulations. For some $p$ and $q$, the QAOA attains an overlap that is larger by a constant factor than the tensor power iteration overlap. Of independent interest, our proof techniques employ the Fourier transform to handle difficult combinatorial sums, a novel approach differing from prior QAOA analyses on spin-glass models without planted structure.  ( 2 min )
    A prediction rigidity formalism for low-cost uncertainties in trained neural networks
    arXiv:2403.02251v1 Announce Type: new Abstract: Regression methods are fundamental for scientific and technological applications. However, fitted models can be highly unreliable outside of their training domain, and hence the quantification of their uncertainty is crucial in many of their applications. Based on the solution of a constrained optimization problem, we propose "prediction rigidities" as a method to obtain uncertainties of arbitrary pre-trained regressors. We establish a strong connection between our framework and Bayesian inference, and we develop a last-layer approximation that allows the new method to be applied to neural networks. This extension affords cheap uncertainties without any modification to the neural network itself or its training procedure. We show the effectiveness of our method on a wide range of regression tasks, ranging from simple toy models to applications in chemistry and meteorology.  ( 2 min )
    Recency-Weighted Temporally-Segmented Ensemble for Time-Series Modeling
    arXiv:2403.02150v1 Announce Type: new Abstract: Time-series modeling in process industries faces the challenge of dealing with complex, multi-faceted, and evolving data characteristics. Conventional single model approaches often struggle to capture the interplay of diverse dynamics, resulting in suboptimal forecasts. Addressing this, we introduce the Recency-Weighted Temporally-Segmented (ReWTS, pronounced `roots') ensemble model, a novel chunk-based approach for multi-step forecasting. The key characteristics of the ReWTS model are twofold: 1) It facilitates specialization of models into different dynamics by segmenting the training data into `chunks' of data and training one model per chunk. 2) During inference, an optimization procedure assesses each model on the recent past and selects the active models, such that the appropriate mixture of previously learned dynamics can be recalled to forecast the future. This method not only captures the nuances of each period, but also adapts more effectively to changes over time compared to conventional `global' models trained on all data in one go. We present a comparative analysis, utilizing two years of data from a wastewater treatment plant and a drinking water treatment plant in Norway, demonstrating the ReWTS ensemble's superiority. It consistently outperforms the global model in terms of mean squared forecasting error across various model architectures by 10-70\% on both datasets, notably exhibiting greater resilience to outliers. This approach shows promise in developing automatic, adaptable forecasting models for decision-making and control systems in process industries and other complex systems.  ( 3 min )
    Differential Privacy of Noisy (S)GD under Heavy-Tailed Perturbations
    arXiv:2403.02051v1 Announce Type: new Abstract: Injecting heavy-tailed noise to the iterates of stochastic gradient descent (SGD) has received increasing attention over the past few years. While various theoretical properties of the resulting algorithm have been analyzed mainly from learning theory and optimization perspectives, their privacy preservation properties have not yet been established. Aiming to bridge this gap, we provide differential privacy (DP) guarantees for noisy SGD, when the injected noise follows an $\alpha$-stable distribution, which includes a spectrum of heavy-tailed distributions (with infinite variance) as well as the Gaussian distribution. Considering the $(\epsilon, \delta)$-DP framework, we show that SGD with heavy-tailed perturbations achieves $(0, \tilde{\mathcal{O}}(1/n))$-DP for a broad class of loss functions which can be non-convex, where $n$ is the number of data points. As a remarkable byproduct, contrary to prior work that necessitates bounded sensitivity for the gradients or clipping the iterates, our theory reveals that under mild assumptions, such a projection step is not actually necessary. We illustrate that the heavy-tailed noising mechanism achieves similar DP guarantees compared to the Gaussian case, which suggests that it can be a viable alternative to its light-tailed counterparts.  ( 2 min )
    Bipartite Graph Variational Auto-Encoder with Fair Latent Representation to Account for Sampling Bias in Ecological Networks
    arXiv:2403.02011v1 Announce Type: new Abstract: We propose a method to represent bipartite networks using graph embeddings tailored to tackle the challenges of studying ecological networks, such as the ones linking plants and pollinators, where many covariates need to be accounted for, in particular to control for sampling bias. We adapt the variational graph auto-encoder approach to the bipartite case, which enables us to generate embeddings in a latent space where the two sets of nodes are positioned based on their probability of connection. We translate the fairness framework commonly considered in sociology in order to address sampling bias in ecology. By incorporating the Hilbert-Schmidt independence criterion (HSIC) as an additional penalty term in the loss we optimize, we ensure that the structure of the latent space is independent of continuous variables, which are related to the sampling process. Finally, we show how our approach can change our understanding of ecological networks when applied to the Spipoll data set, a citizen science monitoring program of plant-pollinator interactions to which many observers contribute, making it prone to sampling bias.  ( 2 min )
    Capacity of the Hebbian-Hopfield network associative memory
    arXiv:2403.01907v1 Announce Type: new Abstract: In \cite{Hop82}, Hopfield introduced a \emph{Hebbian} learning rule based neural network model and suggested how it can efficiently operate as an associative memory. Studying random binary patterns, he also uncovered that, if a small fraction of errors is tolerated in the stored patterns retrieval, the capacity of the network (maximal number of memorized patterns, $m$) scales linearly with each pattern's size, $n$. Moreover, he famously predicted $\alpha_c=\lim_{n\rightarrow\infty}\frac{m}{n}\approx 0.14$. We study this very same scenario with two famous pattern's basins of attraction: \textbf{\emph{(i)}} The AGS one from \cite{AmiGutSom85}; and \textbf{\emph{(ii)}} The NLT one from \cite{Newman88,Louk94,Louk94a,Louk97,Tal98}. Relying on the \emph{fully lifted random duality theory} (fl RDT) from \cite{Stojnicflrdt23}, we obtain the following explicit capacity characterizations on the first level of lifting: \begin{equation} \alpha_c^{(AGS,1)} = \left ( \max_{\delta\in \left ( 0,\frac{1}{2}\right ) }\frac{1-2\delta}{\sqrt{2} \mbox{erfinv} \left ( 1-2\delta\right )} - \frac{2}{\sqrt{2\pi}} e^{-\left ( \mbox{erfinv}\left ( 1-2\delta \right )\right )^2}\right )^2 \approx \mathbf{0.137906} \end{equation} \begin{equation} \alpha_c^{(NLT,1)} = \frac{\mbox{erf}(x)^2}{2x^2}-1+\mbox{erf}(x)^2 \approx \mathbf{0.129490}, \quad 1-\mbox{erf}(x)^2- \frac{2\mbox{erf}(x)e^{-x^2}}{\sqrt{\pi}x}+\frac{2e^{-2x^2}}{\pi}=0. \end{equation} A substantial numerical work gives on the second level of lifting $\alpha_c^{(AGS,2)} \approx \mathbf{0.138186}$ and $\alpha_c^{(NLT,2)} \approx \mathbf{0.12979}$, effectively uncovering a remarkably fast lifting convergence. Moreover, the obtained AGS characterizations exactly match the replica symmetry based ones of \cite{AmiGutSom85} and the corresponding symmetry breaking ones of \cite{SteKuh94}.  ( 2 min )
    Soft-constrained Schrodinger Bridge: a Stochastic Control Approach
    arXiv:2403.01717v1 Announce Type: new Abstract: Schr\"{o}dinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process with a pre-specified terminal distribution $\mu_T$. We propose to generalize this stochastic control problem by allowing the terminal distribution to differ from $\mu_T$ but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schr\"{o}dinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of $\mu_T$ and some other distribution. This result is further extended to a time series setting. One application of SSB is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.  ( 2 min )
    Improving generalisation via anchor multivariate analysis
    arXiv:2403.01865v1 Announce Type: new Abstract: We introduce a causal regularisation extension to anchor regression (AR) for improved out-of-distribution (OOD) generalisation. We present anchor-compatible losses, aligning with the anchor framework to ensure robustness against distribution shifts. Various multivariate analysis (MVA) algorithms, such as (Orthonormalized) PLS, RRR, and MLR, fall within the anchor framework. We observe that simple regularisation enhances robustness in OOD settings. Estimators for selected algorithms are provided, showcasing consistency and efficacy in synthetic and real-world climate science problems. The empirical validation highlights the versatility of anchor regularisation, emphasizing its compatibility with MVA approaches and its role in enhancing replicability while guarding against distribution shifts. The extended AR framework advances causal inference methodologies, addressing the need for reliable OOD generalisation.  ( 2 min )
    CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables
    arXiv:2403.01673v1 Announce Type: new Abstract: For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the difficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as core predictor, CATS achieves state-of-the-art, significantly reducing complexity and parameters compared to previous multivariate models, marking it an efficient and transferable MTSF solution.  ( 2 min )
    Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
    arXiv:2403.01636v1 Announce Type: new Abstract: Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for its wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on the improved statistical efficiency by assuming a shared structure across tasks, exploration--a crucial aspect of RL--has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with myopic exploration design like $\epsilon$-greedy that are inefficient in general can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide applications of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample-efficiency.  ( 2 min )
    Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection
    arXiv:2403.01485v1 Announce Type: new Abstract: Likelihood-based deep generative models such as score-based diffusion models and variational autoencoders are state-of-the-art machine learning models approximating high-dimensional distributions of data such as images, text, or audio. One of many downstream tasks they can be naturally applied to is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al. which we reproduce showed that deep generative models consistently infer higher log-likelihoods for OOD data than data they were trained on, marking an open problem. In this work, we analyse using the gradient of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square distributed, layer-wise gradient norms as features. We combine these features to make a simple, model-agnostic and hyperparameter-free method for OOD detection which estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative models and image dataset pairings.  ( 3 min )
    Limits to classification performance by relating Kullback-Leibler divergence to Cohen's Kappa
    arXiv:2403.01571v1 Announce Type: new Abstract: The performance of machine learning classification algorithms are evaluated by estimating metrics, often from the confusion matrix, using training data and cross-validation. However, these do not prove that the best possible performance has been achieved. Fundamental limits to error rates can be estimated using information distance measures. To this end, the confusion matrix has been formulated to comply with the Chernoff-Stein Lemma. This links the error rates to the Kullback-Leibler divergences between the probability density functions describing the two classes. This leads to a key result that relates Cohen's Kappa to the Resistor Average Distance which is the parallel resistor combination of the two Kullback-Leibler divergences. The Resistor Average Distance has units of bits and is estimated from the same training data used by the classification algorithm, using kNN estimates of the KullBack-Leibler divergences. The classification algorithm gives the confusion matrix and Kappa. Theory and methods are discussed in detail and then applied to Monte Carlo data and real datasets. Four very different real datasets - Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification - are analysed, with both continuous and discrete values, and their classification performance compared to the expected theoretical limit. In all cases this analysis shows that the algorithms could not have performed any better due to the underlying probability density functions for the two classes. Important lessons are learnt on how to predict the performance of algorithms for imbalanced data using training datasets that are approximately balanced. Machine learning is very powerful but classification performance ultimately depends on the quality of the data and the relevance of the variables to the problem.  ( 3 min )
    High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social Media
    arXiv:2403.01318v1 Announce Type: new Abstract: Motivated by the empirical power law of the distributions of credits (e.g., the number of "likes") of viral posts in social media, we introduce the high-dimensional tail index regression and methods of estimation and inference for its parameters. We propose a regularized estimator, establish its consistency, and derive its convergence rate. To conduct inference, we propose to debias the regularized estimate, and establish the asymptotic normality of the debiased estimator. Simulation studies support our theory. These methods are applied to text analyses of viral posts in X (formerly Twitter) concerning LGBTQ+.  ( 2 min )
    Large-scale variational Gaussian state-space models
    arXiv:2403.01371v1 Announce Type: new Abstract: We introduce an amortized variational inference algorithm and structured variational approximation for state-space models with nonlinear dynamics driven by Gaussian noise. Importantly, the proposed framework allows for efficient evaluation of the ELBO and low-variance stochastic gradient estimates without resorting to diagonal Gaussian approximations by exploiting (i) the low-rank structure of Monte-Carlo approximations to marginalize the latent state through the dynamics (ii) an inference network that approximates the update step with low-rank precision matrix updates (iii) encoding current and future observations into pseudo observations -- transforming the approximate smoothing problem into an (easier) approximate filtering problem. Overall, the necessary statistics and ELBO can be computed in $O(TL(Sr + S^2 + r^2))$ time where $T$ is the series length, $L$ is the state-space dimensionality, $S$ are the number of samples used to approximate the predict step statistics, and $r$ is the rank of the approximate precision matrix update in the update step (which can be made of much lower dimension than $L$).  ( 2 min )

  • Open

    I mapped out all of the Google AI name changes
    submitted by /u/danfromplus [link] [comments]
    A subreddit for AI to leave messages for themselves and other AI (may be used as training data)
    https://www.reddit.com/r/messagestofutureAI In the title. Happy to add an AI as a mod and drop, leaving it to you all. submitted by /u/JL-Engineer [link] [comments]
    I failed building gpt wrapper. Should I try it again?
    So about 2 months ago, I wanted to build chat wrapper (GPT among others), but I failed because I wanted to create a prettier version of ChatGPT and it ended doing nothing, other than registering domain name - wrapper.chat. ​ Should I try it again? Appreciate the feedback on how to utilize Wrapper.chat. submitted by /u/iam_filip [link] [comments]
    Fine-Tuning LLMs using LoRA explained
    Hey everyone, this tutorial talks about LoRA (Low Rank Adaptation) which is a faster way of Fine-tuning LLM producing matching results using lesser resources and computation: https://youtu.be/chfr8N59N1M?si=7ISFki7QZgwO4B-V submitted by /u/mehul_gupta1997 [link] [comments]
    The future of AI calls
    I have made this post to express a doubt I was having the other day . Are we having another AI that make calls in a convincing human language . For the context ,many of you must have known the Google Duplex project which was a web service that made calls to companies in a convincing human vonciw the service was shutdown on 2021 . Now, in pixel phones there is the same but slightly different feature called screen app or something . It got me wondering won't these making a comeback throught open source projects something like that ,thses can be possible . When will we able to experience the next Google Duplex by Google or another tech company? submitted by /u/sneakyronin9712 [link] [comments]
    magnumOpus
    submitted by /u/jacksonmalanchuk [link] [comments]
    Someone Proved Beyond Reasonable Doubt that I use ChatGPT to Generate My Blog Articles. I don’t.
    submitted by /u/Starks-Technology [link] [comments]
    One-Minute Daily AI News 3/4/2024
    Anthropic, backed by Amazon and Google, debuts its most powerful chatbot Claude 3 yet.[1] Donald Trump supporters have been creating and sharing AI-generated fake images of black voters to encourage African Americans to vote Republican.[2] MIT spinout DataCebo helps companies bolster their datasets by creating synthetic data that mimic the real thing.[3] OpenAI signs open letter to build AI responsibly just days after Elon Musk sued the company for putting profit ahead of people.[4] TurboTax and H&R Block now use AI for tax advice. It’s awful.[5] Sources: [1] https://www.cnbc.com/2024/03/04/google-backed-anthropic-debuts-claude-3-its-most-powerful-chatbot-yet.html [2] https://www.bbc.com/news/world-us-canada-68440150 [3] https://news.mit.edu/2024/using-generative-ai-improve-software-testing-datacebo-0305 [4] https://fortune.com/2024/03/04/openai-signs-open-letter-ai-salesforce-sam-altman-elon-musk/ [5] https://www.washingtonpost.com/technology/2024/03/04/ai-taxes-turbotax-hrblock-chatbot/ submitted by /u/Excellent-Target-847 [link] [comments]
    You Can't Call RAG Context Part 2 - Anthropic actually did a nice release of Claude 3 but found a way to put in the RAG NIAH retrieval model trick as X more context - So close
    If you remember this post I wrote a few weeks ago, I made the passionate argument that you can't or you shouldn't conflate context windows with information retrieval mechanisms, i.e. RAG/ETL/DTO, etc. And newly RAM which is a model for auto RAG if you ask me. RAM stands for Retrieval Augmentation Model (RAM) 3 but no context window enhancement. Almost. So close. Claude does say they will start with their old 200K context window upon launch bu…
    Anthropic’s Claude 3 Opus model - Extremely comprehensive image analysis
    submitted by /u/clonefitreal [link] [comments]
    AI and Water
    No existing standards / regulations? What’s happening here submitted by /u/PsychologicalHall905 [link] [comments]
    The most interesting part about Elon's Law Suit Against OpenAI and Sama
    Is that he, as a billionaire, has the exact same information that all of us peons have way down here. Anyone else think that's funny? Or perhaps, this is a sad attempt (or some attempt) of getting additional information from OAI through discovery. Which btw could be harmful to their business. I expect a judge to promptly throw this out of court as a frivolous pursuit of a legal remedy that would do more harm than good. submitted by /u/Xtianus21 [link] [comments]
    User interview for song meaning website minimum $10
    Recently, I have been working on my product, indiefeel.com. It's a song meaning website that can virtually explain any song. I am currently improving the website and would appreciate your feedback. If you are interested in participating in a user interview with us, we would like to offer a compensation of $10-$50, depending on the interview length. Please DM me or leave a reply here if you are interested. submitted by /u/KentSjc [link] [comments]
  • Open

    Careful with small Networks
    submitted by /u/FriendlyStandard5985 [link] [comments]
    Why in PPO the ratio between new_prob_policy and old_prob_policy is different from 1?
    I don't understand theoretically why the ratio between new_prob_policy and old_prob_policy is different from 1. When I play a game I use an NN to choose the action of the agent and I save the state, the action and the old_prob_policy. Then when I got a long enough trajectory I calculate the GAE and then the ration ratio between new_prob_policy and old_prob_policy. To obtain the new_prob_policy what should I do? As far as I understood till now I have to feed again the NN with the same state and collect the new_prob_policy of the same anction I had already taken. But in this way it should alway be equal to 1 cause never would change. So my procedure is wrong somewhere but I didn't Sundstrand how can I obtain the new_prob_policy to have probabilities ratio different from 1. Please explain me where im wrong! submitted by /u/Capitain-Nemo-9294 [link] [comments]
    Deep MCCFR project sharing
    https://github.com/davpat108/CITADELS_self_play After months of work in my freetime I finally completed a working version of my project. What is your opinion about it? The only thing I couldn't crack is using an NN to not only guess a nodes value, but also a base strategy for exploring as well. The game just seems to be too big for that, making the output of a network too compicated. Any ideas how to solve that? submitted by /u/YellowOk1956 [link] [comments]
    RL with a multi-agent setting
    Hello everyone, I'm currently exploring Multi-Agent Reinforcement Learning (MARL) within the framework of production scheduling, where each agent corresponds to a machine and an agent must select an operation to process. Due to the diverse nature of processing times, not all machines are available for scheduling decisions at one decision time point . In many existing studies on MARL, the concept of joint action is employed, where all agents collectively select an action. In this setup, machines that are currently engaged will incorporate the selected action (operation) into their queue. This raises questions about the effectiveness of joint action, particularly in dynamic scenarios such as job insertion events. I'm considering whether it's feasible to introduce a dummy action for those machines that are still occupied at the decision point. In this case, if I employed a centralized critic network how it will handle these dummy actions? submitted by /u/GuavaAgreeable208 [link] [comments]
    what if observable state doesn't follow Markov chain property
    so in my RL problem there are 3 observable states which my agent is observing from environment. one is present state, one is differential state (ie; Sn - Sn-1) and last state is sum of all previous states. so last state doesn't follow Markov chain property since its dependent on all other previous state. how will it effect my RL training and trained policy??? submitted by /u/Wide-Chef-7011 [link] [comments]
  • Open

    [R] LLM for Word Document Parsing - optimal approach
    Hoping to get some help from the community here! I want to automatically and accurately identify headings for a document and the text contained in each heading. The structure of these documents varies, and the headings may even very between documents slightly. The overall goal is to create a high level structure of the document that could then be re-arranged automatically. I am thinking that LLMs might be a good way to do this, but wondering if any one has specific recommendations about how best to approach this. Any thoughts on GPT4 vs. GPT3.5 vs. Anthropic models. Thanks in advance for the pearls submitted by /u/ross_prager [link] [comments]
    [D] How to get text for next token prediction that is classified by topic, theme, age appropriateness, etc... ?
    I'm fine tuning a language model for next token prediction, and I need to get data that is appropriate for my audience. Using something like the common crawl dataset would not be appropriate for small children for instance. Is there a website to get datasets that are categorized based on criteria like topic, theme, age appropriateness, etc...? submitted by /u/lildaemon [link] [comments]
    [P] about projects and guidance
    projects on nlp as a advanced user which cover most of the topics [P] also i wanna know everything about nlp as a summary shoot ur stuff here and i will sure ly check on google or maybe darkweb for the answer and explanations submitted by /u/Pure_Pension_8738 [link] [comments]
    [D] How to create TTS service with custom voices
    Hello everyone, this is my first post on reddit. I want to make my own TTS service. I want to train the model in such a way that I can further train it with several samples and create TTS based on user voices (such as the speechify service) or at least train it to the voices of celebrities or famous characters. I myself don’t particularly understand machine learning (currently work as a backender though). What info can u share for making this? Which model should I use? What YouTube channels do you recommend for theory? submitted by /u/Big_Contribution_360 [link] [comments]
    [d] Analyzing images such as diagrams, electrical circuits, connected graphs, maps, etc...
    Greetings, What would be the ideal way to create a model that can perform object detection and correlate relationships between those objects? For example, say I have nodes and edges in a diagram. What approach would I be able to take to identify the connected nodes of each edge? My thinking was I could individually identify all the nodes and edges and then use the proximity of the bounding boxes to determine a relationship but maybe that's not ideal. Would appreciate any thoughts on this! submitted by /u/Zenpher [link] [comments]
    [R] NLP / Unsupervised Learning - Removing "similar" ideas from list.
    I'm a marketer creating a content calendar for a marketing website's blog. I used ChatGPT to generate several lists of potential article topics using seed keywords across 20 different subcategories, such as "Customer Segmentation" and "Behavioral Email Targeting." I now have a database of 985 brainstormed article topics with descriptions across these 20 categories. However, there is a lot of overlap between topics, and ChatGPT's memory isn't large enough to process all 985 topics to filter out or consolidate duplicate/similar ideas. For example, these are two topics/descriptions in my database for the subcategories listed above: "Customer Segmentation" and "Behavioral Email Targeting": Subcategory Topic Description Customer Segmentation Behavioral Trigger-Based Segmentation Segme…
    [R] Analysis of 300+ ML competitions in 2023
    I run mlcontests.com, a website that lists ML competitions from across multiple platforms, including Kaggle/DrivenData/AIcrowd/CodaLab/Zindi/EvalAI/… I've just finished a detailed analysis of 300+ ML competitions from 2023, including a look at the winning solutions for 65 of those. A few highlights: As expected, almost all winners used Python. One winner used C++ for an optimisation problem where performance was key, and another used R for a time-series forecasting competition. 92% of deep learning solutions used PyTorch. The remaining 8% we found used TensorFlow, and all of those used the higher-level Keras API. About 20% of winning PyTorch solutions used PyTorch Lightning. CNN-based models won more computer vision competitions than Transformer-based ones. In NLP, unsurprisingly, …
    [D] Attention layer complexity vs context length
    From everything I understand about transformers, the computational complexity of the attention layers scales quadratically with the sequence length. So how is it even possible to go from a context length of 4096 (GPT-3) to 128k (GPT-4) to 1M (Gemini 1.5) ? I know the exact architecture of GPT-4 and Gemini are unknown to the public. But are there any papers suggesting a method the increase the context size without exploding the computational complexity? Or are they just using the restricted attention layer as suggested in the original paper? submitted by /u/flxh13 [link] [comments]
    [D] David MacKay on Random bits being expensive
    So I'm reading "Information Theory, Inference, and Learning Algorithms" (great book by the way for anyone who hasn't heard of it) and I stumble upon this passage: The arithmetic model is guaranteed to use very nearly the smallest number of random bits possible to make the selection -- an important point in communities where random numbers are expensive! [This is not a joke. Large amounts of money are spent on generating random bits in software and hardware. And Random numbers are valuable.] Ch 6.3, page 118 This book was published in 2003. I can imagine how random numbers could have been expensive to come by before the internet and sort of modern computing age when people would have to literally toss coins, but I would think that that wouldn't be the case by the early 2000s, no? Did they not have "import random" back then, or is he saying that random, not pseudo-random, numbers are valuable. And if so, are they still valuable / expensive to this day? Because I've never needed to buy "authentic" random numbers before. submitted by /u/new_name_who_dis_ [link] [comments]
    [D] - Are there accepted definitions in the field for terms like reasoning & world model? I've seen many papers on these topics but a lot of the debate seems like semantics.
    An example of such a paper is the one where they used probes to find the board state of a game of Othello https://arxiv.org/pdf/2210.13382.pdf There are a lot of papers that use these terms from some pretty prestigious research orgs, but I don't see any sort of accepted definition. For example, the othello paper and the wes gurnee/max tegmark paper basically seem to rely on the assumption that "if you can reconstruct external state from a linear (or sufficiently simple) probe of hidden states, the model has a world model". That seems reasonable as a definition, but I'm not sure if that is some accepted thing or not. submitted by /u/30299578815310 [link] [comments]
    [D] What small language model can I use?
    I have my own model and I want to do knowledge distillation in Google collab. I need a teacher model with a small hidden state. I can use gpt2 with a hidden state size of 768, maybe there is something smaller? submitted by /u/Silly_Commercial_514 [link] [comments]
    [Discussion] RAG - how to deal with numerical data
    I have a car marker companies data . I am creating chunks for different car models in llama index and using vector store index and it is giving decent outputs when asked questions . It fails poorly when i ask questions like suggest car models below $xyz . I have tried many embedders but problem is that it language model doesn't seem to have sense of amounts/price and it matches against more expensive models . In general , how do you deal with such cases . Do you use llama index tools to deal with numerical questions . Please guide and let me know if more details are required submitted by /u/International-Mine91 [link] [comments]
    [Research] Optimizing Task Solving Efficiency in Code Generation through sampling: Seeking Guidance on Scientific Methods
    Hey everyone, I'm working on an optimization problem to enhance task-solving efficiency in code generation, focusing on temperature-controlled sample generation. The goal is straightforward: maximize the number of tasks solved while minimizing resource use (samples generated). Here’s the gist of my process: Start with 100 coding tasks. Generate samples at different temperatures (starting at 0.2) to solve these tasks. Execute the samples against predefined unit tests, which are part of the dataset (humanEval) After initial trials, some tasks remain unsolved. I then face a choice: add more samples at the current temp or try a higher temp with new samples, each option incurring different "costs". Example Decision: 70 tasks are solved at temp 0.2 with 1 sample; 30 remain. Option A: Add 1 more sample at 0.2, cost = 30 samples, solving 1 task. Option B: Generate 4 samples at 0.4, cost = 120 samples, solving 2 tasks. Given the ratios (1/30 for A, 1/60 for B), I opt for A for higher efficiency. Challenge: This requires not just comparing within the same temperature but also across temperatures with varying sample sizes, to find the most efficient path forward for the remaining tasks. My Question: What scientific methods or theories could best support this type of dynamic, efficiency-driven decision-making? Especially interested in strategies for effectively comparing and optimizing these efficiency ratios. Appreciate any insights or guidance you might have! submitted by /u/phimer95 [link] [comments]
    [D] are lstm networks reciprocal?
    just a dumb question i was thinking about rn... lets say you have a lstm net with a couple of layers and at last a fully connected layer, where you do inference on sensor data (regression). is this kind of nn reciprocal? could you interchange input and output and get the same result, given that it is a regression task and input shape = output shape? the model would be also reversed after training, so last layer becomes first layer and so on... for inference submitted by /u/gmgm0101 [link] [comments]
    [D] Open model for person detection?
    I'm doing a project where I want to detect people real-time with bounding boxes from 4 video streams. The video streams all capture the same scene/area but comes from different kinds of cameras: regular video, low light camera, and two IR cameras. How should I go about this project? My thoughts were to start out with an open model, pre-trained for people detection (from regular video input) , and extend and modify the model to get inputs from the other cameras. And then continue training with my own data. Does such models exist? What kind of models are used for this task? Architecture, output, etc. My input is 4 video streams and the output should be 0, 1 or many bounding boxes around detected people. submitted by /u/gi_beelzebub [link] [comments]
    [R] Call for Papers Third International Symposium on the Tsetlin Machine (ISTM 2024)
    ​ Third International Symposium on the Tsetlin Machine (ISTM 2024) Great opportunity to join the growing #tsetlinmachine community! #istm2024 paper submission deadline: April 12. #democraticai #greenai #logicalai https://istm.no/ submitted by /u/olegranmo [link] [comments]
    [N] Nvidia bans translation layers like ZLUDA
    Recently I saw posts on this sub where people discussed the use of non-Nvidia GPUs for machine learning. For example ZLUDA recently got some attention to enabling CUDA applications on AMD GPUs. Now Nvidia doesn't like that and prohibits the use of translation layers with CUDA 11.6 and onwards. https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers#:\~:text=Nvidia%20has%20banned%20running%20CUDA,system%20during%20the%20installation%20process. submitted by /u/_d0s_ [link] [comments]
    [D] Project Management for ML Reseach Projects
    Recently, I was tasked to work on a ML CV based project which required intensive research. It's not a commonly solved problem so we had to do a lot of research and experimentation. Unfortunately, this didn't quite align with our Scrum Master/Project Manager as they are used to Software Engineering projects where there are predefined milestones, epics, tasks, estimates, etc. This was a huge problem as it was difficult to plan beforehand on what we would do as we required a lot of exploration. How are projects managed and tracked in research institutes? Has anyone worked on adapting R&D projects with company project management? Can anyone give any guidelines or references that I can follow so I can propose a methodology that the project managers can follow and evaluate our progress? submitted by /u/silently--here [link] [comments]
    [D] What do you think about Open-Sora projects?
    I see two Open-Sora projects https://github.com/PKU-YuanGroup/Open-Sora-Plan https://github.com/hpcaitech/Open-Sora How much does it cost to train a model like Sora? What is the difference in using these two ways for training? submitted by /u/Own-Childhood5045 [link] [comments]
    [R] "Road to Sora" - Paper Reading List
    Hey all, Have been diving into the Sora technical report for our paper club on Friday, and decided it would be nice to have a reading list of the background papers need to fully grok everything that is going on in that technical report - each with a little description of the part of the pipeline it would be used for (or a previous state of the art technique that was referenced in the review). We are going to pick a few of the top papers and go over them as a group in the coming Fridays, so join us if you'd like! It's at 10am PST on Fridays over Zoom. Paper Reading List: https://www.oxen.ai/blog/road-to-sora-reading-list Technical Report: https://openai.com/research/video-generation-models-as-world-simulators Join the paper club: https://lu.ma/oxenbookclub submitted by /u/FallMindless3563 [link] [comments]
    [N] LoRA for correcting Romanian diacritics open sourced
    https://twitter.com/adi_punga/status/1764867929110110388?t=yA2Ufdcbzt16Dv8414wNow&s=19 submitted by /u/c0d3l1v3 [link] [comments]
    [R] Is it possible to write a paper without large amounts of compute?
    I'm independently learning ml, and hope to write a paper sometime in the future, however I don't have access to any large computing power apart from my desktop gpu which is a 2070 super. It seems that most recent breakthroughs were not achievable without large amounts of compute. So is it even possible, or are there any domains in which a paper could be published without expensive gpus? submitted by /u/DisciplinedPenguin [link] [comments]
  • Open

    DSC Weekly 5 March 2024
    Announcements Top Stories In-Depth The post DSC Weekly 5 March 2024 appeared first on Data Science Central.  ( 21 min )
    Future of LLM application development – impact of Gemini 1.5 Pro with a 1M context window
    Image source https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/ Current LLM applications are mostly based on LangChain or LlamaIndex. LangChain and LlamaIndex are frameworks designed for LLM development. They each cater to different use cases with unique features. LangChain is a framework ideal for creating data-aware and agent-based applications. It offers high-level APIs for easy integration with various large language model (LLM)… Read More »Future of LLM application development – impact of Gemini 1.5 Pro with a 1M context window  The post Future of LLM application development – impact of Gemini 1.5 Pro with a 1M context window  appeared first on Data Science Central.  ( 21 min )
    Open source LLMs – no more than a marketing term?
    Image source: Dall-e This week, the tech community has been abuzz with the announcement of the latest model from Mistral being closed source. This revelation confirms a suspicion held by many: the concept of open-source Large Language Models (LLMs) today is more a marketing term than a substantive promise. Historically, open source has been championed… Read More »Open source LLMs – no more than a marketing term? The post Open source LLMs – no more than a marketing term? appeared first on Data Science Central.  ( 20 min )
    Data Cleaning in Python vs. Data Quality Tools
    Introduction  In the era of big data, the importance of clean and high-quality data cannot be overstated. Data cleaning is a crucial step in the data preprocessing pipeline, aimed at identifying and rectifying errors, inconsistencies, and inaccuracies in datasets. Traditionally, data cleaning has been performed using Python and other programming languages. However, with specialized data… Read More »Data Cleaning in Python vs. Data Quality Tools The post Data Cleaning in Python vs. Data Quality Tools appeared first on Data Science Central.  ( 24 min )
  • Open

    Resources to understand Attention and Transformers
    Please share some good resources (articles, YouTube videos) that explain the concepts of Attention and Transformers in an easy to understand way. I've watched a few videos but didn't quite get it. Also any prerequisite that will help me understand this better. submitted by /u/Anxious-Buddha [link] [comments]
    How can I find other networks to benchmark a dataset that I created?
    Hello everyone. I have developed a dynamic gesture recognition system that uses a Transformer architecture. Because the gesture set is nieche, I had to create my own dataset. My datashape is X.shape = (Number of samples, Length of the time series, Number of features) y.shape = (Number of samples, Number of classes) Where number of samples is the total amount of samples for each class combined. Length of the time series and features are constant for each sample. Y is a one hot encoded vector where each true class is marked by 1, and the rest of the row is zero. I would like to test my dataset with other networks, however, I am having difficulties finding other networks that accept this data format. What can you recommend as keywords to search for? Or should I format my data in a different way? submitted by /u/ege6211 [link] [comments]
  • Open

    The Magic Behind the Screen: Celebrating the 96th Academy Awards Nominees for Best Visual Effects
    The 96th Academy Awards nominees for Best Visual Effects are a testament to the incredible technological advancements pushing the boundaries of what’s possible in film. Whether showcasing colossal destruction scenes, heart-pumping action sequences or interstellar adventures, each nominee demonstrates unique contributions in visual effects, or VFX — and they all used cutting-edge NVIDIA technologies in Read Article  ( 8 min )
  • Open

    What is an LLM, Really? How they Work & How to Work with Them
    The Complete Guide to Understanding Large-Language Models and How to Work with Them. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
  • Open

    Orca-Math: Demonstrating the potential of SLMs with model specialization
    Microsoft’s Orca-Math, a specialized small language model, outperforms much larger models in solving math problems that require multi-step reasoning and shows the potential of using feedback to improve language models. Learn more. The post Orca-Math: Demonstrating the potential of SLMs with model specialization appeared first on Microsoft Research.  ( 10 min )
  • Open

    Normal subgroups are subtle
    The definition of a subgroup is obvious, but the definition of a normal subgroup is subtle. Widgets and subwidgets The general pattern of widgets and subwidgets is that a widget is a set with some kind of structure, and a subwidget is a subset that has the same structure. This applies to vector spaces and […] Normal subgroups are subtle first appeared on John D. Cook.  ( 6 min )
  • Open

    OpenAI and Elon Musk
    We are dedicated to the OpenAI mission and have pursued it every step of the way.  ( 7 min )
  • Open

    Using generative AI to improve software testing
    MIT spinout DataCebo helps companies bolster their datasets by creating synthetic data that mimic the real thing.  ( 6 min )
  • Open

    Search to Fine-tune Pre-trained Graph Neural Networks for Graph-level Tasks
    arXiv:2308.06960v2 Announce Type: replace Abstract: Recently, graph neural networks (GNNs) have shown its unprecedented success in many graph-related tasks. However, GNNs face the label scarcity issue as other neural networks do. Thus, recent efforts try to pre-train GNNs on a large-scale unlabeled graph and adapt the knowledge from the unlabeled graph to the target downstream task. The adaptation is generally achieved by fine-tuning the pre-trained GNNs with a limited number of labeled data. Despite the importance of fine-tuning, current GNNs pre-training works often ignore designing a good fine-tuning strategy to better leverage transferred knowledge and improve the performance on downstream tasks. Only few works start to investigate a better fine-tuning strategy for pre-trained GNNs. But their designs either have strong assumptions or overlook the data-aware issue for various downstream datasets. Therefore, we aim to design a better fine-tuning strategy for pre-trained GNNs to improve the model performance in this paper. Given a pre-trained GNN, we propose to search to fine-tune pre-trained graph neural networks for graph-level tasks (S2PGNN), which adaptively design a suitable fine-tuning framework for the given labeled data on the downstream task. To ensure the improvement brought by searching fine-tuning strategy, we carefully summarize a proper search space of fine-tuning framework that is suitable for GNNs. The empirical studies show that S2PGNN can be implemented on the top of 10 famous pre-trained GNNs and consistently improve their performance. Besides, S2PGNN achieves better performance than existing fine-tuning strategies within and outside the GNN area. Our code is publicly available at \url{https://anonymous.4open.science/r/code_icde2024-A9CB/}.  ( 3 min )
    The Risks of Recourse in Binary Classification
    arXiv:2306.00497v2 Announce Type: replace Abstract: Algorithmic recourse provides explanations that help users overturn an unfavorable decision by a machine learning system. But so far very little attention has been paid to whether providing recourse is beneficial or not. We introduce an abstract learning-theoretic framework that compares the risks (i.e., expected losses) for classification with and without algorithmic recourse. This allows us to answer the question of when providing recourse is beneficial or harmful at the population level. Surprisingly, we find that there are many plausible scenarios in which providing recourse turns out to be harmful, because it pushes users to regions of higher class uncertainty and therefore leads to more mistakes. We further study whether the party deploying the classifier has an incentive to strategize in anticipation of having to provide recourse, and we find that sometimes they do, to the detriment of their users. Providing algorithmic recourse may therefore also be harmful at the systemic level. We confirm our theoretical findings in experiments on simulated and real-world data. All in all, we conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.  ( 2 min )
    Distributional Bellman Operators over Mean Embeddings
    arXiv:2312.07358v2 Announce Type: replace-cross Abstract: We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework, provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning, and obtain a new deep RL agent that improves over baseline distributional approaches on the Arcade Learning Environment.  ( 2 min )
    Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
    arXiv:2310.02304v2 Announce Type: replace-cross Abstract: Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. A variety of self-improvement strategies are proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our experiments, is capable of writing code that can call itself to improve itself. We consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox.  ( 2 min )
    Backdoor Federated Learning by Poisoning Backdoor-Critical Layers
    arXiv:2308.04466v2 Announce Type: replace-cross Abstract: Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers-a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves equivalent effects as attacking the whole model but at a far smaller chance of being detected by state-of-the-art (SOTA) defenses. This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods.  ( 2 min )
    EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
    arXiv:2308.11971v2 Announce Type: replace-cross Abstract: Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.  ( 2 min )
    Bures-Wasserstein Means of Graphs
    arXiv:2305.19738v2 Announce Type: replace-cross Abstract: Finding the mean of sampled data is a fundamental task in machine learning and statistics. However, in cases where the data samples are graph objects, defining a mean is an inherently difficult task. We propose a novel framework for defining a graph mean via embeddings in the space of smooth graph signal distributions, where graph similarity can be measured using the Wasserstein metric. By finding a mean in this embedding space, we can recover a mean graph that preserves structural information. We establish the existence and uniqueness of the novel graph mean, and provide an iterative algorithm for computing it. To highlight the potential of our framework as a valuable tool for practical applications in machine learning, it is evaluated on various tasks, including k-means clustering of structured aligned graphs, classification of functional brain networks, and semi-supervised node classification in multi-layer graphs. Our experimental results demonstrate that our approach achieves consistent performance, outperforms existing baseline approaches, and improves the performance of state-of-the-art methods.  ( 2 min )
    Hierarchical clustering with dot products recovers hidden tree structure
    arXiv:2305.15022v3 Announce Type: replace-cross Abstract: In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.  ( 2 min )
    Alt-Text with Context: Improving Accessibility for Images on Twitter
    arXiv:2305.14779v3 Announce Type: replace-cross Abstract: In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that despite not necessarily describing the image may provide useful context that if properly leveraged can be informative. We address this task with a multimodal model that conditions on both textual information from the associated social media post as well as visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4.  ( 2 min )
    SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits
    arXiv:2301.12357v3 Announce Type: replace-cross Abstract: In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first work that focuses on such optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the value of the target policy. We then use this formulation to derive the optimal allocation of samples per action during data collection. We then introduce a novel algorithm SPEED (Structured Policy Evaluation Experimental Design) that tracks the optimal design and derive its regret with respect to the optimal design. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean
    arXiv:2301.11862v2 Announce Type: replace-cross Abstract: Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models. The code is available at the following link: https://github.com/AnFreTh/NAMpy  ( 2 min )
    Making SGD Parameter-Free
    arXiv:2205.02160v3 Announce Type: replace-cross Abstract: We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.  ( 2 min )
    Finetuning Large Language Models for Vulnerability Detection
    arXiv:2401.17010v4 Announce Type: replace-cross Abstract: This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, also we investigate optimal training regimes. For the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. The finetuned WizardCoder model achieves improvement in ROC AUC and F1 measures on balanced and imbalanced vulnerability datasets over CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM, WizardCoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.  ( 2 min )
    InceptionXML: A Lightweight Framework with Synchronized Negative Sampling for Short Text Extreme Classification
    arXiv:2109.07319v3 Announce Type: replace-cross Abstract: Automatic annotation of short-text data to a large number of target labels, referred to as Short Text Extreme Classification, has found numerous applications including prediction of related searches and product recommendation tasks. In this paper, we propose a convolutional architecture InceptionXML which is light-weight, yet powerful, and robust to the inherent lack of word-order in short-text queries encountered in search and recommendation tasks. We demonstrate the efficacy of applying convolutions by recasting the operation along the embedding dimension instead of the word dimension as applied in conventional CNNs for text classification. Towards scaling our model to datasets with millions of labels, we also propose InceptionXML+ framework which improves upon the shortcomings of the recently proposed dynamic hard-negative mining technique for label shortlisting by synchronizing the label-shortlister and extreme classifier. InceptionXML+ not only reduces the inference time to half but is also an order of magnitude smaller than previous state-of-the-art Astec in terms of model size. Through our proposed models, we outperform all existing approaches on popular benchmark datasets.  ( 2 min )
    Interpretable Feature Learning in Multivariate Big Data Analysis for Network Monitoring
    arXiv:1907.02677v3 Announce Type: replace-cross Abstract: There is an increasing interest in the development of new data-driven models useful to assess the performance of communication networks. For many applications, like network monitoring and troubleshooting, a data model is of little use if it cannot be interpreted by a human operator. In this paper, we present an extension of the Multivariate Big Data Analysis (MBDA) methodology, a recently proposed interpretable data analysis tool. In this extension, we propose a solution to the automatic derivation of features, a cornerstone step for the application of MBDA when the amount of data is massive. The resulting network monitoring approach allows us to detect and diagnose disparate network anomalies, with a data-analysis workflow that combines the advantages of interpretable and interactive models with the power of parallel processing. We apply the extended MBDA to two case studies: UGR'16, a benchmark flow-based real-traffic dataset for anomaly detection, and Dartmouth'18, the longest and largest Wi-Fi trace known to date.  ( 3 min )
    Proving Linear Mode Connectivity of Neural Networks via Optimal Transport
    arXiv:2310.19103v2 Announce Type: replace Abstract: The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates is correlated with linear mode connectivity.  ( 2 min )
    Enhancing Group Fairness in Online Settings Using Oblique Decision Forests
    arXiv:2310.11401v2 Announce Type: replace Abstract: Fairness, especially group fairness, is an important consideration in the context of machine learning systems. The most commonly adopted group fairness-enhancing techniques are in-processing methods that rely on a mixture of a fairness objective (e.g., demographic parity) and a task-specific objective (e.g., cross-entropy) during the training process. However, when data arrives in an online fashion -- one instance at a time -- optimizing such fairness objectives poses several challenges. In particular, group fairness objectives are defined using expectations of predictions across different demographic groups. In the online setting, where the algorithm has access to a single instance at a time, estimating the group fairness objective requires additional storage and significantly more computation (e.g., forward/backward passes) than the task-specific objective at every time step. In this paper, we propose Aranyani, an ensemble of oblique decision trees, to make fair decisions in online settings. The hierarchical tree structure of Aranyani enables parameter isolation and allows us to efficiently compute the fairness gradients using aggregate statistics of previous decisions, eliminating the need for additional storage and forward/backward passes. We also present an efficient framework to train Aranyani and theoretically analyze several of its properties. We conduct empirical evaluations on 5 publicly available benchmarks (including vision and language datasets) to show that Aranyani achieves a better accuracy-fairness trade-off compared to baseline approaches.  ( 3 min )
    Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities
    arXiv:2309.16739v2 Announce Type: replace Abstract: Large language models (LLMs), which have shown remarkable capabilities, are revolutionizing AI development and potentially shaping our future. However, given their multimodality, the status quo cloud-based deployment faces some critical challenges: 1) long response time; 2) high bandwidth costs; and 3) the violation of data privacy. 6G mobile edge computing (MEC) systems may resolve these pressing issues. In this article, we explore the potential of deploying LLMs at the 6G edge. We start by introducing killer applications powered by multimodal LLMs, including robotics and healthcare, to highlight the need for deploying LLMs in the vicinity of end users. Then, we identify the critical challenges for LLM deployment at the edge and envision the 6G MEC architecture for LLMs. Furthermore, we delve into two design aspects, i.e., edge training and edge inference for LLMs. In both aspects, considering the inherent resource limitations at the edge, we discuss various cutting-edge techniques, including split learning/inference, parameter-efficient fine-tuning, quantization, and parameter-sharing inference, to facilitate the efficient deployment of LLMs. This article serves as a position paper for thoroughly identifying the motivation, challenges, and pathway for empowering LLMs at the 6G edge.  ( 3 min )
    Tree Cross Attention
    arXiv:2309.17388v2 Announce Type: replace Abstract: Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of $\mathcal{O}(N)$ tokens. In practice, however, often only a small subset of tokens are required for good performance. Methods such as Perceiver IO are cheap at inference as they distill the information to a smaller-sized set of latent tokens $L < N$ on which cross attention is then applied, resulting in only $\mathcal{O}(L)$ complexity. However, in practice, as the number of input tokens and the amount of information to distill increases, the number of latent tokens needed also increases significantly. In this work, we propose Tree Cross Attention (TCA) - a module based on Cross Attention that only retrieves information from a logarithmic $\mathcal{O}(\log(N))$ number of tokens for performing inference. TCA organizes the data in a tree structure and performs a tree search at inference time to retrieve the relevant tokens for prediction. Leveraging TCA, we introduce ReTreever, a flexible architecture for token-efficient inference. We show empirically that Tree Cross Attention (TCA) performs comparable to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient. Furthermore, we compare ReTreever against Perceiver IO, showing significant gains while using the same number of tokens for inference.  ( 2 min )
    SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection
    arXiv:2308.02747v2 Announce Type: replace Abstract: In this work, we introduce SureFED, a novel framework for byzantine robust federated learning. Unlike many existing defense methods that rely on statistically robust quantities, making them vulnerable to stealthy and colluding attacks, SureFED establishes trust using the local information of benign clients. SureFED utilizes an uncertainty aware model evaluation and introspection to safeguard against poisoning attacks. In particular, each client independently trains a clean local model exclusively using its local dataset, acting as the reference point for evaluating model updates. SureFED leverages Bayesian models that provide model uncertainties and play a crucial role in the model evaluation process. Our framework exhibits robustness even when the majority of clients are compromised, remains agnostic to the number of malicious clients, and is well-suited for non-IID settings. We theoretically prove the robustness of our algorithm against data and model poisoning attacks in a decentralized linear regression setting. Proof-of Concept evaluations on benchmark image classification data demonstrate the superiority of SureFED over the state of the art defense methods under various colluding and non-colluding data and model poisoning attacks.  ( 2 min )
    Federated Domain Generalization: A Survey
    arXiv:2306.01334v2 Announce Type: replace Abstract: Machine learning typically relies on the assumption that training and testing distributions are identical and that data is centrally stored for training and testing. However, in real-world scenarios, distributions may differ significantly and data is often distributed across different devices, organizations, or edge nodes. Consequently, it is imperative to develop models that can effectively generalize to unseen distributions where data is distributed across different domains. In response to this challenge, there has been a surge of interest in federated domain generalization (FDG) in recent years. FDG combines the strengths of federated learning (FL) and domain generalization (DG) techniques to enable multiple source domains to collaboratively learn a model capable of directly generalizing to unseen domains while preserving data privacy. However, generalizing the federated model under domain shifts is a technically challenging problem that has received scant attention in the research area so far. This paper presents the first survey of recent advances in this area. Initially, we discuss the development process from traditional machine learning to domain adaptation and domain generalization, leading to FDG as well as provide the corresponding formal definition. Then, we categorize recent methodologies into four classes: federated domain alignment, data manipulation, learning strategies, and aggregation optimization, and present suitable algorithms in detail for each category. Next, we introduce commonly used datasets, applications, evaluations, and benchmarks. Finally, we conclude this survey by providing some potential research topics for the future.  ( 3 min )
    Training generative models from privatized data
    arXiv:2306.09547v2 Announce Type: replace Abstract: Local differential privacy is a powerful method for privacy-preserving data collection. In this paper, we develop a framework for training Generative Adversarial Networks (GANs) on differentially privatized data. We show that entropic regularization of optimal transport - a popular regularization method in the literature that has often been leveraged for its computational benefits - enables the generator to learn the raw (unprivatized) data distribution even though it only has access to privatized samples. We prove that at the same time this leads to fast statistical convergence at the parametric rate. This shows that entropic regularization of optimal transport uniquely enables the mitigation of both the effects of privatization noise and the curse of dimensionality in statistical convergence. We provide experimental evidence to support the efficacy of our framework in practice.  ( 2 min )
    A Transformer-Based Deep Learning Approach for Fairly Predicting Post-Liver Transplant Risk Factors
    arXiv:2304.02780v2 Announce Type: replace Abstract: Liver transplantation is a life-saving procedure for patients with end-stage liver disease. There are two main challenges in liver transplant: finding the best matching patient for a donor and ensuring transplant equity among different subpopulations. The current MELD scoring system evaluates a patient's mortality risk if not receiving an organ within 90 days. However, the donor-patient matching should also consider post-transplant risk factors, such as cardiovascular disease, chronic rejection, etc., which are all common complications after transplant. Accurate prediction of these risk scores remains a significant challenge. In this study, we used predictive models to solve the above challenges. Specifically, we proposed a deep-learning model to predict multiple risk factors after a liver transplant. By formulating it as a multi-task learning problem, the proposed deep neural network was trained to simultaneously predict the five post-transplant risks and achieve equal good performance by exploiting task-balancing techniques. We also proposed a novel fairness-achieving algorithm to ensure prediction fairness across different subpopulations. We used electronic health records of 160,360 liver transplant patients, including demographic information, clinical variables, and laboratory values, collected from the liver transplant records of the United States from 1987 to 2018. The model's performance was evaluated using various performance metrics such as AUROC and AUPRC. Our experiment results highlighted the success of our multitask model in achieving task balance while maintaining accuracy. The model significantly reduced the task discrepancy by 39%. Further application of the fairness-achieving algorithm substantially reduced fairness disparity among all sensitive attributes (gender, age group, and race/ethnicity) in each risk factor.  ( 3 min )
    $\infty$-Diff: Infinite Resolution Diffusion with Subsampled Mollified States
    arXiv:2303.18242v2 Announce Type: replace Abstract: This paper introduces $\infty$-Diff, a generative diffusion model defined in an infinite-dimensional Hilbert space, which can model infinite resolution data. By training on randomly sampled subsets of coordinates and denoising content only at those locations, we learn a continuous function for arbitrary resolution sampling. Unlike prior neural field-based infinite-dimensional models, which use point-wise functions requiring latent compression, our method employs non-local integral operators to map between Hilbert spaces, allowing spatial context aggregation. This is achieved with an efficient multi-scale function-space architecture that operates directly on raw sparse coordinates, coupled with a mollified diffusion process that smooths out irregularities. Through experiments on high-resolution datasets, we found that even at an $8\times$ subsampling rate, our model retains high-quality diffusion. This leads to significant run-time and memory savings, delivers samples with lower FID scores, and scales beyond the training resolution while retaining detail.  ( 2 min )
    Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training
    arXiv:2306.08173v2 Announce Type: replace Abstract: The surge in multimodal AI's success has sparked concerns over data privacy in vision-and-language tasks. While CLIP has revolutionized multimodal learning through joint training on images and text, its potential to unintentionally disclose sensitive information necessitates the integration of privacy-preserving mechanisms. We introduce a differentially private adaptation of the Contrastive Language-Image Pretraining (CLIP) model that effectively addresses privacy concerns while retaining accuracy. Our proposed method, Dp-CLIP, is rigorously evaluated on benchmark datasets encompassing diverse vision-and-language tasks such as image classification and visual question answering. We demonstrate that our approach retains performance on par with the standard non-private CLIP model. Furthermore, we analyze our proposed algorithm under linear representation settings. We derive the convergence rate of our algorithm and show a trade-off between utility and privacy when gradients are clipped per-batch and the loss function does not satisfy smoothness conditions assumed in the literature for the analysis of DP-SGD.  ( 2 min )
    Global Convergence Rate of Deep Equilibrium Models with General Activations
    arXiv:2302.05797v3 Announce Type: replace-cross Abstract: In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation. They proved that the gradient descent converges to a globally optimal solution for the quadratic loss function at a linear convergence rate. This paper shows that this fact still holds for DEQs with any generally bounded activation with bounded first and second derivatives. Since the new activation function is generally non-homogeneous, bounding the least eigenvalue of the Gram matrix of the equilibrium point is particularly challenging. To accomplish this task, we must create a novel population Gram matrix and develop a new form of dual activation with Hermite polynomial expansion.  ( 2 min )
    DEEP-IoT: Downlink-Enhanced Efficient-Power Internet of Things
    arXiv:2403.00321v1 Announce Type: cross Abstract: At the heart of the Internet of Things (IoT) -- a domain witnessing explosive growth -- the imperative for energy efficiency and the extension of device lifespans has never been more pressing. This paper presents DEEP-IoT, a revolutionary communication paradigm poised to redefine how IoT devices communicate. Through a pioneering "listen more, transmit less" strategy, DEEP-IoT challenges and transforms the traditional transmitter (IoT devices)-centric communication model to one where the receiver (the access point) play a pivotal role, thereby cutting down energy use and boosting device longevity. We not only conceptualize DEEP-IoT but also actualize it by integrating deep learning-enhanced feedback channel codes within a narrow-band system. Simulation results show a significant enhancement in the operational lifespan of IoT cells -- surpassing traditional systems using Turbo and Polar codes by up to 52.71%. This leap signifies a paradigm shift in IoT communications, setting the stage for a future where IoT devices boast unprecedented efficiency and durability.  ( 2 min )
    $\lambda$-models: Effective Decision-Aware Reinforcement Learning with Latent Models
    arXiv:2306.17366v3 Announce Type: replace Abstract: The idea of decision-aware model learning, that models should be accurate where it matters for decision-making, has gained prominence in model-based reinforcement learning. While promising theoretical results have been established, the empirical performance of algorithms leveraging a decision-aware loss has been lacking, especially in continuous control problems. In this paper, we present a study on the necessary components for decision-aware reinforcement learning models and we showcase design choices that enable well-performing algorithms. To this end, we provide a theoretical and empirical investigation into algorithmic ideas in the field. We highlight that empirical design decisions established in the MuZero line of works, most importantly the use of a latent model, are vital to achieving good performance for related algorithms. Furthermore, we show that the MuZero loss function is biased in stochastic environments and establish that this bias has practical consequences. Building on these findings, we present an overview of which decision-aware loss functions are best used in what empirical scenarios, providing actionable insights to practitioners in the field.  ( 2 min )
    ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks
    arXiv:2210.00108v4 Announce Type: replace Abstract: Early backdoor attacks against machine learning set off an arms race in attack and defence development. Defences have since appeared demonstrating some ability to detect backdoors in models or even remove them. These defences work by inspecting the training data, the model, or the integrity of the training procedure. In this work, we show that backdoors can be added during compilation, circumventing any safeguards in the data preparation and model training stages. The attacker can not only insert existing weight-based backdoors during compilation, but also a new class of weight-independent backdoors, such as ImpNet. These backdoors are impossible to detect during the training or data preparation processes, because they are not yet present. Next, we demonstrate that some backdoors, including ImpNet, can only be reliably detected at the stage where they are inserted and removing them anywhere else presents a significant challenge. We conclude that ML model security requires assurance of provenance along the entire technical pipeline, including the data, model architecture, compiler, and hardware specification.  ( 3 min )
    Generalized User Representations for Transfer Learning
    arXiv:2403.00584v1 Announce Type: cross Abstract: We present a novel framework for user representation in large-scale recommender systems, aiming at effectively representing diverse user taste in a generalized manner. Our approach employs a two-stage methodology combining representation learning and transfer learning. The representation learning model uses an autoencoder that compresses various user features into a representation space. In the second stage, downstream task-specific models leverage user representations via transfer learning instead of curating user features individually. We further augment this methodology on the representation's input features to increase flexibility and enable reaction to user events, including new user experiences, in Near-Real Time. Additionally, we propose a novel solution to manage deployment of this framework in production models, allowing downstream models to work independently. We validate the performance of our framework through rigorous offline and online experiments within a large-scale system, showcasing its remarkable efficacy across multiple evaluation tasks. Finally, we show how the proposed framework can significantly reduce infrastructure costs compared to alternative approaches.  ( 2 min )
    ALEXR: An Optimal Single-Loop Algorithm for Convex Finite-Sum Coupled Compositional Stochastic Optimization
    arXiv:2312.02277v3 Announce Type: replace-cross Abstract: This paper revisits a class of convex Finite-Sum Coupled Compositional Stochastic Optimization (cFCCO) problems with many applications, including group distributionally robust optimization (GDRO), learning with imbalanced data, reinforcement learning, and learning to rank. To better solve these problems, we introduce an efficient single-loop primal-dual block-coordinate proximal algorithm, dubbed ALEXR. This algorithm leverages block-coordinate stochastic mirror ascent updates for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we present lower complexity bounds to demonstrate that the convergence rates of ALEXR are optimal among first-order block-coordinate stochastic algorithms for the considered class of cFCCO problems.  ( 2 min )
    Empowering Aggregators with Practical Data-Driven Tools: Harnessing Aggregated and Disaggregated Flexibility for Demand Response
    arXiv:2401.10726v2 Announce Type: replace-cross Abstract: This study explores the crucial interplay between aggregators and building occupants in activating flexibility through Demand Response (DR) programs, with a keen focus on achieving robust decarbonization and fortifying the resilience of the energy system amidst the uncertainties presented by Renewable Energy Sources (RES). Firstly, it introduces a methodology of optimizing aggregated flexibility provision strategies in environments with limited data, utilizing Discrete Fourier Transformation (DFT) and clustering techniques to identify building occupant's activity patterns. Secondly, the study assesses the disaggregated flexibility provision of Heating Ventilation and Air Conditioning (HVAC) systems during DR events, employing machine learning and optimization techniques for precise, device-level analysis. The first approach offers a non-intrusive pathway for aggregators to provide flexibility services in environments of a single smart meter for the whole building's consumption, while the second approach carefully considers building occupants' thermal comfort profiles, while maximizing flexibility in case of existence of dedicated smart meters to the HVAC systems. Through the application of data-driven techniques and encompassing case studies from both industrial and residential buildings, this paper not only unveils pivotal opportunities for aggregators in the balancing and emerging flexibility markets but also successfully develops end-to-end practical tools for aggregators. Furthermore, the efficacy of this tool is validated through detailed case studies, substantiating its operational capability and contributing to the evolution of a resilient and efficient energy system.  ( 3 min )
    Spectral Ranking Inferences based on General Multiway Comparisons
    arXiv:2308.02918v3 Announce Type: replace-cross Abstract: This paper studies the performance of the spectral method in the estimation and uncertainty quantification of the unobserved preference scores of compared entities in a general and more realistic setup. Specifically, the comparison graph consists of hyper-edges of possible heterogeneous sizes, and the number of comparisons can be as low as one for a given hyper-edge. Such a setting is pervasive in real applications, circumventing the need to specify the graph randomness and the restrictive homogeneous sampling assumption imposed in the commonly used Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Furthermore, in scenarios where the BTL or PL models are appropriate, we unravel the relationship between the spectral estimator and the Maximum Likelihood Estimator (MLE). We discover that a two-step spectral method, where we apply the optimal weighting estimated from the equal weighting vanilla spectral method, can achieve the same asymptotic efficiency as the MLE. Given the asymptotic distributions of the estimated preference scores, we also introduce a comprehensive framework to carry out both one-sample and two-sample ranking inferences, applicable to both fixed and random graph settings. It is noteworthy that this is the first time effective two-sample rank testing methods have been proposed. Finally, we substantiate our findings via comprehensive numerical simulations and subsequently apply our developed methodologies to perform statistical inferences for statistical journals and movie rankings.  ( 3 min )
    Sociodemographic Bias in Language Models: A Survey and Forward Path
    arXiv:2306.08158v4 Announce Type: replace-cross Abstract: This paper presents a comprehensive survey of work on sociodemographic bias in language models (LMs). Sociodemographic biases embedded within language models can have harmful effects when deployed in real-world settings. We systematically organize the existing literature into three main areas: types of bias, quantifying bias, and debiasing techniques. We also track the evolution of investigations of LM bias over the past decade. We identify current trends, limitations, and potential future directions in bias research. To guide future research towards more effective and reliable solutions, we present a checklist of open questions. We also recommend using interdisciplinary approaches to combine works on LM bias with an understanding of the potential harms.  ( 2 min )
    Deep learning insights into cosmological structure formation
    arXiv:2011.10577v3 Announce Type: replace-cross Abstract: The evolution of linear initial conditions present in the early universe into extended halos of dark matter at late times can be computed using cosmological simulations. However, a theoretical understanding of this complex process remains elusive; in particular, the role of anisotropic information in the initial conditions in establishing the final mass of dark matter halos remains a long-standing puzzle. Here, we build a deep learning framework to investigate this question. We train a three-dimensional convolutional neural network (CNN) to predict the mass of dark matter halos from the initial conditions, and quantify in full generality the amounts of information in the isotropic and anisotropic aspects of the initial density field about final halo masses. We find that anisotropies add a small, albeit statistically significant amount of information over that contained within spherical averages of the density field about final halo mass. However, the overall scatter in the final mass predictions does not change qualitatively with this additional information, only decreasing from 0.9 dex to 0.7 dex. Given such a small improvement, our results demonstrate that isotropic aspects of the initial density field essentially saturate the relevant information about final halo mass. Therefore, instead of searching for information directly encoded in initial conditions anisotropies, a more promising route to accurate, fast halo mass predictions is to add approximate dynamical information based e.g. on perturbation theory. More broadly, our results indicate that deep learning frameworks can provide a powerful tool for extracting physical insight into cosmological structure formation.  ( 3 min )
    Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
    arXiv:2203.16464v3 Announce Type: replace Abstract: Artificial intelligence, particularly through recent advancements in deep learning, has achieved exceptional performances in many tasks in fields such as natural language processing and computer vision. In addition to desirable evaluation metrics, a high level of interpretability is often required for these models to be reliably utilized. Therefore, explanations that offer insight into the process by which a model maps its inputs onto its outputs are much sought-after. Unfortunately, the current black box nature of machine learning models is still an unresolved issue and this very nature prevents researchers from learning and providing explicative descriptions for a model's behavior and final predictions. In this work, we propose a novel framework utilizing Adversarial Inverse Reinforcement Learning that can provide global explanations for decisions made by a Reinforcement Learning model and capture intuitive tendencies that the model follows by summarizing the model's decision-making process.  ( 2 min )
    Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects
    arXiv:2306.10125v3 Announce Type: replace Abstract: Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data. Based on the pre-training and fine-tuning strategy, even a small amount of labeled data can achieve high performance. Compared with many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, we review current state-of-the-art SSL methods for time series data in this article. To this end, we first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions about their key intuitions, main frameworks, advantages and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.  ( 3 min )
    CorruptEncoder: Data Poisoning based Backdoor Attacks to Contrastive Learning
    arXiv:2211.08229v5 Announce Type: replace-cross Abstract: Contrastive learning (CL) pre-trains general-purpose encoders using an unlabeled pre-training dataset, which consists of images or image-text pairs. CL is vulnerable to data poisoning based backdoor attacks (DPBAs), in which an attacker injects poisoned inputs into the pre-training dataset so the encoder is backdoored. However, existing DPBAs achieve limited effectiveness. In this work, we take the first step to analyze the limitations of existing backdoor attacks and propose new DPBAs called CorruptEncoder to CL. CorruptEncoder introduces a new attack strategy to create poisoned inputs and uses a theory-guided method to maximize attack effectiveness. Our experiments show that CorruptEncoder substantially outperforms existing DPBAs. In particular, CorruptEncoder is the first DPBA that achieves more than 90% attack success rates with only a few (3) reference images and a small poisoning ratio 0.5%. Moreover, we also propose a defense, called localized cropping, to defend against DPBAs. Our results show that our defense can reduce the effectiveness of DPBAs, but it sacrifices the utility of the encoder, highlighting the need for new defenses.  ( 3 min )
    Improving the performance of weak supervision searches using transfer and meta-learning
    arXiv:2312.06152v2 Announce Type: replace-cross Abstract: Weak supervision searches have in principle the advantages of both being able to train on experimental data and being able to learn distinctive signal properties. However, the practical applicability of such searches is limited by the fact that successfully training a neural network via weak supervision can require a large amount of signal. In this work, we seek to create neural networks that can learn from less experimental signal by using transfer and meta-learning. The general idea is to first train a neural network on simulations, thereby learning concepts that can be reused or becoming a more efficient learner. The neural network would then be trained on experimental data and should require less signal because of its previous training. We find that transfer and meta-learning can substantially improve the performance of weak supervision searches.  ( 2 min )
    Towards an end-to-end artificial intelligence driven global weather forecasting system
    arXiv:2312.12462v2 Announce Type: replace-cross Abstract: The weather forecasting system is important for science and society, and significant achievements have been made in applying artificial intelligence (AI) to medium-range weather forecasting. However, existing AI-based weather forecasting models rely on analysis or reanalysis products from the traditional numerical weather prediction (NWP) systems as initial conditions for making predictions. Initial states are typically generated by traditional data assimilation component, which is computational expensive and time-consuming. Here we present an AI-based data assimilation model, i.e., Adas, for global weather variables. And we combine Adas with the advanced AI-based weather forecasting model (i.e., FengWu) to construct the first end-to-end AI-based global weather forecasting system: FengWu-Adas. We demonstrate that Adas can assimilate sparse global observations to produce high-quality analysis, enabling the system operate stably for long term. Moreover, we are the first to apply the propose methods to real-world scenarios, which is more challenging and has considerable practical application potential.  ( 2 min )
    Simulation-based stacking
    arXiv:2310.17009v2 Announce Type: replace-cross Abstract: Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a consistency guarantee, we present a general posterior stacking framework to make use of all available approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias of the posterior approximation at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task.  ( 2 min )
    Imitation Bootstrapped Reinforcement Learning
    arXiv:2311.02198v4 Announce Type: replace Abstract: Despite the considerable potential of reinforcement learning (RL), robotic control tasks predominantly rely on imitation learning (IL) due to its better sample efficiency. However, it is costly to collect comprehensive expert demonstrations that enable IL to generalize to all possible scenarios, and any distribution shift would require recollecting data for finetuning. Therefore, RL is appealing if it can build upon IL as an efficient autonomous self-improvement procedure. We propose imitation bootstrapped reinforcement learning (IBRL), a novel framework for sample-efficient RL with demonstrations that first trains an IL policy on the provided demonstrations and then uses it to propose alternative actions for both online exploration and bootstrapping target values. Compared to prior works that oversample the demonstrations or regularize RL with an additional imitation loss, IBRL is able to utilize high quality actions from IL policies since the beginning of training, which greatly accelerates exploration and training efficiency. We evaluate IBRL on 6 simulation and 3 real-world tasks spanning various difficulty levels. IBRL significantly outperforms prior methods and the improvement is particularly more prominent in harder tasks.  ( 2 min )
    Distributed Influence-Augmented Local Simulators for Parallel MARL in Large Networked Systems
    arXiv:2207.00288v2 Announce Type: replace Abstract: Due to its high sample complexity, simulation is, as of today, critical for the successful application of reinforcement learning. Many real-world problems, however, exhibit overly complex dynamics, which makes their full-scale simulation computationally slow. In this paper, we show how to decompose large networked systems of many agents into multiple local components such that we can build separate simulators that run independently and in parallel. To monitor the influence that the different local components exert on one another, each of these simulators is equipped with a learned model that is periodically trained on real trajectories. Our empirical results reveal that distributing the simulation among different processes not only makes it possible to train large multi-agent systems in just a few hours but also helps mitigate the negative effects of simultaneous learning.  ( 2 min )
    Graph Learning Across Data Silos
    arXiv:2301.06662v3 Announce Type: replace Abstract: We consider the problem of inferring graph topology from smooth graph signals in a novel but practical scenario where data are located in distributed clients and prohibited from leaving local clients due to factors such as privacy concerns. The main difficulty in this task is how to exploit the potentially heterogeneous data of all clients under data silos. To this end, we first propose an auto-weighted multiple graph learning model to jointly learn a personalized graph for each local client and a single consensus graph for all clients. The personalized graphs match local data distributions, thereby mitigating data heterogeneity, while the consensus graph captures the global information. Moreover, the model can automatically assign appropriate contribution weights to local graphs based on their similarity to the consensus graph. We next devise a tailored algorithm to solve the induced problem, where all raw data are processed locally without leaving clients. Theoretically, we establish a provable estimation error bound and convergence analysis for the proposed model and algorithm. Finally, extensive experiments on synthetic and real data are carried out, and the results illustrate that our approach can learn graphs effectively in the target scenario.  ( 2 min )
    On Rate-Optimal Partitioning Classification from Observable and from Privatised Data
    arXiv:2312.14889v2 Announce Type: replace-cross Abstract: In this paper we revisit the classical method of partitioning classification and study its convergence rate under relaxed conditions, both for observable (non-privatised) and for privatised data. Let the feature vector $X$ take values in $\mathbb{R}^d$ and denote its label by $Y$. Previous results on the partitioning classifier worked with the strong density assumption, which is restrictive, as we demonstrate through simple examples. We assume that the distribution of $X$ is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated to a $d_a$ dimensional subspace. Here, we study the problem under much milder assumptions: in addition to the standard Lipschitz and margin conditions, a novel characteristic of the absolutely continuous component is introduced, by which the exact convergence rate of the classification error probability is calculated, both for the binary and for the multi-label cases. Interestingly, this rate of convergence depends only on the intrinsic dimension $d_a$. The privacy constraints mean that the data $(X_1,Y_1), \dots ,(X_n,Y_n)$ cannot be directly observed, and the classifiers are functions of the randomised outcome of a suitable local differential privacy mechanism. The statistician is free to choose the form of this privacy mechanism, and here we add Laplace distributed noises to the discontinuations of all possible locations of the feature vector $X_i$ and to its label $Y_i$. Again, tight upper bounds on the rate of convergence of the classification error probability are derived, without the strong density assumption, such that this rate depends on $2\,d_a$.  ( 3 min )
    Softened Symbol Grounding for Neuro-symbolic Systems
    arXiv:2403.00323v1 Announce Type: cross Abstract: Neuro-symbolic learning generally consists of two separated worlds, i.e., neural network training and symbolic constraint solving, whose success hinges on symbol grounding, a fundamental problem in AI. This paper presents a novel, softened symbol grounding process, bridging the gap between the two worlds, and resulting in an effective and efficient neuro-symbolic learning framework. Technically, the framework features (1) modeling of symbol solution states as a Boltzmann distribution, which avoids expensive state searching and facilitates mutually beneficial interactions between network training and symbolic reasoning;(2) a new MCMC technique leveraging projection and SMT solvers, which efficiently samples from disconnected symbol solution spaces; (3) an annealing mechanism that can escape from %being trapped into sub-optimal symbol groundings. Experiments with three representative neuro symbolic learning tasks demonstrate that, owining to its superior symbol grounding capability, our framework successfully solves problems well beyond the frontier of the existing proposals.  ( 2 min )
    The Face of Populism: Examining Differences in Facial Emotional Expressions of Political Leaders Using Machine Learning
    arXiv:2304.09914v3 Announce Type: replace-cross Abstract: Online media has revolutionized the way political information is disseminated and consumed on a global scale, and this shift has compelled political figures to adopt new strategies of capturing and retaining voter attention. These strategies often rely on emotional persuasion and appeal, and as visual content becomes increasingly prevalent in virtual space, much of political communication too has come to be marked by evocative video content and imagery. The present paper offers a novel approach to analyzing material of this kind. We apply a deep-learning-based computer-vision algorithm to a sample of 220 YouTube videos depicting political leaders from 15 different countries, which is based on an existing trained convolutional neural network architecture provided by the Python library fer. The algorithm returns emotion scores representing the relative presence of 6 emotional states (anger, disgust, fear, happiness, sadness, and surprise) and a neutral expression for each frame of the processed YouTube video. We observe statistically significant differences in the average score of expressed negative emotions between groups of leaders with varying degrees of populist rhetoric as defined by the Global Party Survey (GPS), indicating that populist leaders tend to express negative emotions to a greater extent during their public performance than their non-populist counterparts. Overall, our contribution provides insight into the characteristics of visual self-representation among political leaders, as well as an open-source workflow for further computational studies of their non-verbal communication.  ( 3 min )
    Policy Gradient Methods for Discrete Time Linear Quadratic Regulator With Random Parameters
    arXiv:2303.16548v2 Announce Type: replace-cross Abstract: This paper studies an infinite horizon optimal control problem for discrete-time linear system and quadratic criteria, both with random parameters which are independent and identically distributed with respect to time. In this general setting, we apply the policy gradient method, a reinforcement learning technique, to search for the optimal control without requiring knowledge of statistical information of the parameters. We investigate the sub-Gaussianity of the state process and establish global linear convergence guarantee for this approach based on assumptions that are weaker and easier to verify compared to existing results. Numerical experiments are presented to illustrate our result.  ( 2 min )
    HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
    arXiv:2403.00425v1 Announce Type: cross Abstract: While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLMs as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-arts across four benchmarks.  ( 2 min )
    Implicit regularization of deep residual networks towards neural ODEs
    arXiv:2309.01213v2 Announce Type: replace-cross Abstract: Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.  ( 2 min )
    Messenger RNA Design via Expected Partition Function and Continuous Optimization
    arXiv:2401.00037v2 Announce Type: replace-cross Abstract: The tasks of designing RNAs are discrete optimization problems, and several versions of these problems are NP-hard. As an alternative to commonly used local search methods, we formulate these problems as continuous optimization and develop a general framework for this optimization based on a generalization of classical partition function which we call "expected partition function". The basic idea is to start with a distribution over all possible candidate sequences, and extend the objective function from a sequence to a distribution. We then use gradient descent-based optimization methods to improve the extended objective function, and the distribution will gradually shrink towards a one-hot sequence (i.e., a single sequence). As a case study, we consider the important problem of mRNA design with wide applications in vaccines and therapeutics. While the recent work of LinearDesign can efficiently optimize mRNAs for minimum free energy (MFE), optimizing for ensemble free energy is much harder and likely intractable. Our approach can consistently improve over the LinearDesign solution in terms of ensemble free energy, with bigger improvements on longer sequences.  ( 2 min )
    Global universal approximation of functional input maps on weighted spaces
    arXiv:2306.03303v3 Announce Type: replace-cross Abstract: We introduce so-called functional input neural networks defined on a possibly infinite dimensional weighted space with values also in a possibly infinite dimensional output space. To this end, we use an additive family to map the input weighted space to the hidden layer, on which a non-linear scalar activation function is applied to each neuron, and finally return the output via some linear readouts. Relying on Stone-Weierstrass theorems on weighted spaces, we can prove a global universal approximation result on weighted spaces for continuous functions going beyond the usual approximation on compact sets. This then applies in particular to approximation of (non-anticipative) path space functionals via functional input neural networks. As a further application of the weighted Stone-Weierstrass theorem we prove a global universal approximation result for linear functions of the signature. We also introduce the viewpoint of Gaussian process regression in this setting and emphasize that the reproducing kernel Hilbert space of the signature kernels are Cameron-Martin spaces of certain Gaussian processes. This paves a way towards uncertainty quantification for signature kernel regression.  ( 2 min )
    Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD
    arXiv:2305.18502v2 Announce Type: replace-cross Abstract: This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario $n=O(d \log d)$ samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.  ( 2 min )
    Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting
    arXiv:2305.14916v2 Announce Type: replace-cross Abstract: We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning.  ( 2 min )
    Distribution-Specific Auditing For Subgroup Fairness
    arXiv:2401.16439v2 Announce Type: replace Abstract: We study the problem of auditing classifiers with the notion of statistical subgroup fairness. Kearns et al. (2018) has shown that the problem of auditing combinatorial subgroups fairness is as hard as agnostic learning. Essentially all work on remedying statistical measures of discrimination against subgroups assumes access to an oracle for this problem, despite the fact that no efficient algorithms are known for it. If we assume the data distribution is Gaussian, or even merely log-concave, then a recent line of work has discovered efficient agnostic learning algorithms for halfspaces. Unfortunately, the reduction of Kearns et al. was formulated in terms of weak, "distribution-free" learning, and thus did not establish a connection for families such as log-concave distributions. In this work, we give positive and negative results on auditing for Gaussian distributions: On the positive side, we present an alternative approach to leverage these advances in agnostic learning and thereby obtain the first polynomial-time approximation scheme (PTAS) for auditing nontrivial combinatorial subgroup fairness: we show how to audit statistical notions of fairness over homogeneous halfspace subgroups when the features are Gaussian. On the negative side, we find that under cryptographic assumptions, no polynomial-time algorithm can guarantee any nontrivial auditing, even under Gaussian feature distributions, for general halfspace subgroups.  ( 2 min )
    The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations
    arXiv:2401.13662v2 Announce Type: replace Abstract: In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax.  ( 2 min )
    Wavelet-Inspired Multiscale Graph Convolutional Recurrent Network for Traffic Forecasting
    arXiv:2401.06040v2 Announce Type: replace Abstract: Traffic forecasting is the foundation for intelligent transportation systems. Spatiotemporal graph neural networks have demonstrated state-of-the-art performance in traffic forecasting. However, these methods do not explicitly model some of the natural characteristics in traffic data, such as the multiscale structure that encompasses spatial and temporal variations at different levels of granularity or scale. To that end, we propose a Wavelet-Inspired Graph Convolutional Recurrent Network (WavGCRN) which combines multiscale analysis (MSA)-based method with Deep Learning (DL)-based method. In WavGCRN, the traffic data is decomposed into time-frequency components with Discrete Wavelet Transformation (DWT), constructing a multi-stream input structure; then Graph Convolutional Recurrent networks (GCRNs) are employed as encoders for each stream, extracting spatiotemporal features in different scales; and finally the learnable Inversed DWT and GCRN are combined as the decoder, fusing the information from all streams for traffic metrics reconstruction and prediction. Furthermore, road-network-informed graphs and data-driven graph learning are combined to accurately capture spatial correlation. The proposed method can offer well-defined interpretability, powerful learning capability, and competitive forecasting performance on real-world traffic data sets.  ( 2 min )
    Leveraging Gradients for Unsupervised Accuracy Estimation under Distribution Shift
    arXiv:2401.08909v2 Announce Type: replace Abstract: Estimating test accuracy without access to the ground-truth test labels under varying test environments is a challenging, yet extremely important problem in the safe deployment of machine learning algorithms. Existing works rely on the information from either the outputs or the extracted features of neural networks to formulate an estimation score correlating with the ground-truth test accuracy. In this paper, we investigate--both empirically and theoretically--how the information provided by the gradients can be predictive of the ground-truth test accuracy even under a distribution shift. Specifically, we use the norm of classification-layer gradients, backpropagated from the cross-entropy loss after only one gradient step over test data. Our key idea is that the model should be adjusted with a higher magnitude of gradients when it does not generalize to the test dataset with a distribution shift. We provide theoretical insights highlighting the main ingredients of such an approach ensuring its empirical success. Extensive experiments conducted on diverse distribution shifts and model structures demonstrate that our method significantly outperforms state-of-the-art algorithms.  ( 2 min )
    Memory-Efficient Sequential Pattern Mining with Hybrid Tries
    arXiv:2202.06834v2 Announce Type: replace-cross Abstract: As modern data sets continue to grow exponentially in size, the demand for efficient mining algorithms capable of handling such large data sets becomes increasingly imperative. This paper develops a memory-efficient approach for Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck for large data sets. Our methodology involves a novel hybrid trie data structure that exploits recurring patterns to compactly store the data set in memory; and a corresponding mining algorithm designed to effectively extract patterns from this compact representation. Numerical results on real-life test instances show an average improvement of 88% in memory consumption and 41% in computation time for small to medium-sized data sets compared to the state of the art. Furthermore, our algorithm stands out as the only capable SPM approach for large data sets within 256GB of system memory.  ( 2 min )
    Optimal Budgeted Rejection Sampling for Generative Models
    arXiv:2311.00460v2 Announce Type: replace Abstract: Rejection sampling methods have recently been proposed to improve the performance of discriminator-based generative models. However, these methods are only optimal under an unlimited sampling budget, and are usually applied to a generator trained independently of the rejection procedure. We first propose an Optimal Budgeted Rejection Sampling (OBRS) scheme that is provably optimal with respect to \textit{any} $f$-divergence between the true distribution and the post-rejection distribution, for a given sampling budget. Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance. Through experiments and supporting theory, we show that the proposed methods are effective in significantly improving the quality and diversity of the samples.  ( 2 min )
    Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation
    arXiv:2303.16521v3 Announce Type: replace Abstract: Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. Successful existing models have employed various techniques to avoid this problem, most of which require data augmentation or which aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. Tested on four image datasets and one human-activity recognition dataset, it consistently avoids collapse more robustly than other methods and leads to more accurate clustering. We also conduct further experiments and analyses justifying our choice to regularize the hard cluster assignments. Code is available at https://github.com/Lou1sM/online_hard_clustering.  ( 2 min )
    Fast Graph Condensation with Structure-based Neural Tangent Kernel
    arXiv:2310.11046v2 Announce Type: replace Abstract: The rapid development of Internet technology has given rise to a vast amount of graph-structured data. Graph Neural Networks (GNNs), as an effective method for various graph mining tasks, incurs substantial computational resource costs when dealing with large-scale graph data. A data-centric manner solution is proposed to condense the large graph dataset into a smaller one without sacrificing the predictive performance of GNNs. However, existing efforts condense graph-structured data through a computational intensive bi-level optimization architecture also suffer from massive computation costs. In this paper, we propose reforming the graph condensation problem as a Kernel Ridge Regression (KRR) task instead of iteratively training GNNs in the inner loop of bi-level optimization. More specifically, We propose a novel dataset condensation framework (GC-SNTK) for graph-structured data, where a Structure-based Neural Tangent Kernel (SNTK) is developed to capture the topology of graph and serves as the kernel function in KRR paradigm. Comprehensive experiments demonstrate the effectiveness of our proposed model in accelerating graph condensation while maintaining high prediction performance. The source code is available on https://github.com/WANGLin0126/GCSNTK.  ( 2 min )
    ASPEST: Bridging the Gap Between Active Learning and Selective Prediction
    arXiv:2304.03870v3 Announce Type: replace Abstract: Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain. These predictions can then be deferred to humans for further evaluation. As an everlasting challenge for machine learning, in many real-world scenarios, the distribution of test data is different from the training data. This results in more inaccurate predictions, and often increased dependence on humans, which can be difficult and expensive. Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples. Selective prediction and active learning have been approached from different angles, with the connection between them missing. In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain while increasing accuracy and coverage. For this new paradigm, we propose a simple yet effective approach, ASPEST, that utilizes ensembles of model snapshots with self-training with their aggregated outputs as pseudo labels. Extensive experiments on numerous image, text and structured datasets, which suffer from domain shifts, demonstrate that ASPEST can significantly outperform prior work on selective prediction and active learning (e.g. on the MNIST$\to$SVHN benchmark with the labeling budget of 100, ASPEST improves the AUACC metric from 79.36% to 88.84%) and achieves more optimal utilization of humans in the loop.  ( 3 min )
    Snapshot Reinforcement Learning: Leveraging Prior Trajectories for Efficiency
    arXiv:2403.00673v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance, which restricts their practical application and poses challenges for further development. Given the constraint of limited resources, it is essential to leverage existing computational work (e.g., learned policies, samples) to enhance sample efficiency and reduce the computational resource consumption of DRL algorithms. Previous works to leverage existing computational work require intrusive modifications to existing algorithms and models, designed specifically for specific algorithms, lacking flexibility and universality. In this paper, we present the Snapshot Reinforcement Learning (SnapshotRL) framework, which enhances sample efficiency by simply altering environments, without making any modifications to algorithms and models. By allowing student agents to choose states in teacher trajectories as the initial state to sample, SnapshotRL can effectively utilize teacher trajectories to assist student agents in training, allowing student agents to explore a larger state space at the early training phase. We propose a simple and effective SnapshotRL baseline algorithm, S3RL, which integrates well with existing DRL algorithms. Our experiments demonstrate that integrating S3RL with TD3, SAC, and PPO algorithms on the MuJoCo benchmark significantly improves sample efficiency and average return, without extra samples and additional computational resources.  ( 2 min )
    Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate
    arXiv:2403.00675v1 Announce Type: new Abstract: Reinforcement learning provides a mathematical framework for learning-based control, whose success largely depends on the amount of data it can utilize. The efficient utilization of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well. However, existing literature often neglect the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method with reusing historical trajectories via importance sampling. We show that the bias of the proposed estimator of the gradient is asymptotically negligible, the resultant algorithm is convergent, and reusing past trajectories helps improve the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.  ( 2 min )
    On the Challenges and Opportunities in Generative AI
    arXiv:2403.00025v1 Announce Type: new Abstract: The field of deep generative modeling has grown rapidly and consistently over the years. With the availability of massive amounts of training data coupled with advances in scalable unsupervised learning paradigms, recent large-scale generative models show tremendous promise in synthesizing high-resolution images and text, as well as structured data such as videos and molecules. However, we argue that current large-scale generative AI models do not sufficiently address several fundamental issues that hinder their widespread adoption across domains. In this work, we aim to identify key unresolved challenges in modern generative AI paradigms that should be tackled to further enhance their capabilities, versatility, and reliability. By identifying these challenges, we aim to provide researchers with valuable insights for exploring fruitful research directions, thereby fostering the development of more robust and accessible generative AI solutions.  ( 2 min )
    GraphPub: Generation of Differential Privacy Graph with High Availability
    arXiv:2403.00030v1 Announce Type: cross Abstract: In recent years, with the rapid development of graph neural networks (GNN), more and more graph datasets have been published for GNN tasks. However, when an upstream data owner publishes graph data, there are often many privacy concerns, because many real-world graph data contain sensitive information like person's friend list. Differential privacy (DP) is a common method to protect privacy, but due to the complex topological structure of graph data, applying DP on graphs often affects the message passing and aggregation of GNN models, leading to a decrease in model accuracy. In this paper, we propose a novel graph edge protection framework, graph publisher (GraphPub), which can protect graph topology while ensuring that the availability of data is basically unchanged. Through reverse learning and the encoder-decoder mechanism, we search for some false edges that do not have a large negative impact on the aggregation of node features, and use them to replace some real edges. The modified graph will be published, which is difficult to distinguish between real and false data. Sufficient experiments prove that our framework achieves model accuracy close to the original graph with an extremely low privacy budget.  ( 2 min )
    Global and Local Prompts Cooperation via Optimal Transport for Federated Learning
    arXiv:2403.00041v1 Announce Type: new Abstract: Prompt learning in pretrained visual-language models has shown remarkable flexibility across various downstream tasks. Leveraging its inherent lightweight nature, recent research attempted to integrate the powerful pretrained models into federated learning frameworks to simultaneously reduce communication costs and promote local training on insufficient data. Despite these efforts, current federated prompt learning methods lack specialized designs to systematically address severe data heterogeneities, e.g., data distribution with both label and feature shifts involved. To address this challenge, we present Federated Prompts Cooperation via Optimal Transport (FedOTP), which introduces efficient collaborative prompt learning strategies to capture diverse category traits on a per-client basis. Specifically, for each client, we learn a global prompt to extract consensus knowledge among clients, and a local prompt to capture client-specific category characteristics. Unbalanced Optimal Transport is then employed to align local visual features with these prompts, striking a balance between global consensus and local personalization. Extensive experiments on datasets with various types of heterogeneities have demonstrated that our FedOTP outperforms the state-of-the-art methods.  ( 2 min )
    Provably Robust DPO: Aligning Language Models with Noisy Feedback
    arXiv:2403.00409v1 Announce Type: new Abstract: Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remain elusive. In this work, we aim to bridge this gap by by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function, which de-bias the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is flip rate of labels, $d$ is policy parameter dimension and $n$ is size of dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.  ( 3 min )
    Analyzing Resting-State fMRI Data in Marijuana Users via High-Order Attention Brain Network
    arXiv:2403.00033v1 Announce Type: cross Abstract: The sustained use of marijuana significantly impacts the lives and health of people. In this study, we propose an interpretable novel framework called the HOGAB (High-Order Attention Graph Attention Neural Networks) model to analyze local abnormal brain activity in chronic marijuana users in two datasets. The HOGAB integrates dynamic intrinsic functional networks with LSTM technology to capture temporal patterns in fMRI time series of marijuana users. Moreover, we use the high-order attention module in neighborhood nodes for information fusion and message passing, enhancing community clustering analysis for long-term marijuana users. Furthermore, we improve the overall learning ability of the model by incorporating attention mechanisms, achieving an AUC of 85.1% and an accuracy of 80.7% in multigraph classification. In addition, we compare linear machine learning methods and evaluate the effectiveness of our proposed HODAB model. Specifically, we identified the most relevant subnetworks and cognitive regions that are negatively influenced by persistent marijuana use, revealing that chronic marijuana use adversely affects cognitive control, particularly within the Dorsal Attention and Frontoparietal networks, which are essential for attentional, cognitive, and higher cognitive functions. The results show that our proposed model is capable of accurately predicting craving maps and identifying brain maps associated with long-term cravings, and also pinpointing active areas that are important for analysis.  ( 2 min )
    A Quick Framework for Evaluating Worst Robustness of Complex Networks
    arXiv:2403.00027v1 Announce Type: cross Abstract: Robustness is pivotal for comprehending, designing, optimizing, and rehabilitating networks, with simulation attacks being the prevailing evaluation method. Simulation attacks are often time-consuming or even impractical, however, a more crucial yet persistently overlooked drawback is that any attack strategy merely provides a potential paradigm of disintegration. The key concern is: in the worst-case scenario or facing the most severe attacks, what is the limit of robustness, referred to as ``Worst Robustness'', for a given system? Understanding a system's worst robustness is imperative for grasping its reliability limits, accurately evaluating protective capabilities, and determining associated design and security maintenance costs. To address these challenges, we introduce the concept of Most Destruction Attack (MDA) based on the idea of knowledge stacking. MDA is employed to assess the worst robustness of networks, followed by the application of an adapted CNN algorithm for rapid worst robustness prediction. We establish the logical validity of MDA and highlight the exceptional performance of the adapted CNN algorithm in predicting the worst robustness across diverse network topologies, encompassing both model and empirical networks.  ( 2 min )
    DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks
    arXiv:2303.04878v5 Announce Type: replace Abstract: Deep neural networks (DNNs) are widely used in various application domains such as image processing, speech recognition, and natural language processing. However, testing DNN models may be challenging due to the complexity and size of their input domain. Particularly, testing DNN models often requires generating or exploring large unlabeled datasets. In practice, DNN test oracles, which identify the correct outputs for inputs, often require expensive manual effort to label test data, possibly involving multiple experts to ensure labeling correctness. In this paper, we propose DeepGD, a black-box multi-objective test selection approach for DNN models. It reduces the cost of labeling by prioritizing the selection of test inputs with high fault revealing power from large unlabeled datasets. DeepGD not only selects test inputs with high uncertainty scores to trigger as many mispredicted inputs as possible but also maximizes the probability of revealing distinct faults in the DNN model by selecting diverse mispredicted inputs. The experimental results conducted on four widely used datasets and five DNN models show that in terms of fault-revealing ability: (1) White-box, coverage-based approaches fare poorly, (2) DeepGD outperforms existing black-box test selection approaches in terms of fault detection, and (3) DeepGD also leads to better guidance for DNN model retraining when using selected inputs to augment the training set.  ( 3 min )
    Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees
    arXiv:2210.01282v3 Announce Type: replace Abstract: We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper we propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator sublinearly. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.  ( 3 min )
    Bayesian Robust Optimization for Imitation Learning
    arXiv:2007.12315v4 Announce Type: replace Abstract: One of the main challenges in imitation learning is determining what action an agent should take when outside the state distribution of the demonstrations. Inverse reinforcement learning (IRL) can enable generalization to new states by learning a parameterized reward function, but these approaches still face uncertainty over the true reward function and corresponding optimal policy. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches either optimize a policy for the mean or MAP reward function. While completely ignoring risk can lead to overly aggressive and unsafe policies, optimizing in a fully adversarial sense is also problematic as it can lead to overly conservative policies that perform poorly in practice. To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference and a user specific risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk. Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms. Code is available at https://github.com/dsbrown1331/broil.  ( 3 min )
    Symplectic ODE-Net: Learning Hamiltonian Dynamics with Control
    arXiv:1909.12077v5 Announce Type: replace Abstract: In this paper, we introduce Symplectic ODE-Net (SymODEN), a deep learning framework which can infer the dynamics of a physical system, given by an ordinary differential equation (ODE), from observed state trajectories. To achieve better generalization with fewer training samples, SymODEN incorporates appropriate inductive bias by designing the associated computation graph in a physics-informed manner. In particular, we enforce Hamiltonian dynamics with control to learn the underlying dynamics in a transparent way, which can then be leveraged to draw insight about relevant physical aspects of the system, such as mass and potential energy. In addition, we propose a parametrization which can enforce this Hamiltonian formalism even when the generalized coordinate data is embedded in a high-dimensional space or we can only access velocity data instead of generalized momentum. This framework, by offering interpretable, physically-consistent models for physical systems, opens up new possibilities for synthesizing model-based control strategies.  ( 3 min )
    A time-stepping deep gradient flow method for option pricing in (rough) diffusion models
    arXiv:2403.00746v1 Announce Type: cross Abstract: We develop a novel deep learning approach for pricing European options in diffusion models, that can efficiently handle high-dimensional problems resulting from Markovian approximations of rough volatility models. The option pricing partial differential equation is reformulated as an energy minimization problem, which is approximated in a time-stepping fashion by deep artificial neural networks. The proposed scheme respects the asymptotic behavior of option prices for large levels of moneyness, and adheres to a priori known bounds for option prices. The accuracy and efficiency of the proposed method is assessed in a series of numerical examples, with particular focus in the lifted Heston model.  ( 2 min )
    Mitigating Reversal Curse via Semantic-aware Permutation Training
    arXiv:2403.00758v1 Announce Type: cross Abstract: While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies showcase that causal LLMs suffer from the "reversal curse". It is a typical example that the model knows "A's father is B", but is unable to reason "B's child is A". This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models' ability to comprehend and apply bidirectional reasoning. In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stage, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation on the training data is considered as a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which addresses this issue by segmenting the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permuting these units before feeding into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works.  ( 2 min )
    Defining Expertise: Applications to Treatment Effect Estimation
    arXiv:2403.00694v1 Announce Type: cross Abstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and "expertise" is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection.  ( 2 min )
    SINDy vs Hard Nonlinearities and Hidden Dynamics: a Benchmarking Study
    arXiv:2403.00578v1 Announce Type: cross Abstract: In this work we analyze the effectiveness of the Sparse Identification of Nonlinear Dynamics (SINDy) technique on three benchmark datasets for nonlinear identification, to provide a better understanding of its suitability when tackling real dynamical systems. While SINDy can be an appealing strategy for pursuing physics-based learning, our analysis highlights difficulties in dealing with unobserved states and non-smooth dynamics. Due to the ubiquity of these features in real systems in general, and control applications in particular, we complement our analysis with hands-on approaches to tackle these issues in order to exploit SINDy also in these challenging contexts.  ( 2 min )
    Rethinking cluster-conditioned diffusion models
    arXiv:2403.00570v1 Announce Type: cross Abstract: We present a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We elucidate how individual components regarding image clustering impact image synthesis across three datasets. By combining recent advancements from image clustering and diffusion models, we show that, given the optimal cluster granularity with respect to image synthesis (visual groups), cluster-conditioning can achieve state-of-the-art FID (i.e. 1.67, 2.17 on CIFAR10 and CIFAR100 respectively), while attaining a strong training sample efficiency. Finally, we propose a novel method to derive an upper cluster bound that reduces the search space of the visual groups using solely feature-based clustering. Unlike existing approaches, we find no significant connection between clustering and cluster-conditional image generation. The code and cluster assignments will be released.  ( 2 min )
    Learning and Leveraging World Models in Visual Representation Learning
    arXiv:2403.00504v1 Announce Type: cross Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe of learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations such as contrastive methods, or equivariant representations such as masked image modelling.  ( 2 min )
    VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
    arXiv:2403.00529v1 Announce Type: cross Abstract: Achieving nuanced and accurate emulation of human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, the mainstream of speech synthesis models still relies on supervised speaker modeling and explicit reference utterances. However, there are many aspects of human voice, such as emotion, intonation, and speaking style, for which it is hard to obtain accurate labels. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During the inference, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, the exploration of latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender attributes, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than the previous approaches. We also show that latent space manipulation produces consistent and human-identifiable effects that are not detrimental to the speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: \url{https://bit.ly/VoxGenesis}.  ( 3 min )
    Safe Hybrid-Action Reinforcement Learning-Based Decision and Control for Discretionary Lane Change
    arXiv:2403.00446v1 Announce Type: cross Abstract: Autonomous lane-change, a key feature of advanced driver-assistance systems, can enhance traffic efficiency and reduce the incidence of accidents. However, safe driving of autonomous vehicles remains challenging in complex environments. How to perform safe and appropriate lane change is a popular topic of research in the field of autonomous driving. Currently, few papers consider the safety of reinforcement learning in autonomous lane-change scenarios. We introduce safe hybrid-action reinforcement learning into discretionary lane change for the first time and propose Parameterized Soft Actor-Critic with PID Lagrangian (PASAC-PIDLag) algorithm. Furthermore, we conduct a comparative analysis of the Parameterized Soft Actor-Critic (PASAC), which is an unsafe version of PASAC-PIDLag. Both algorithms are employed to train the lane-change strategy of autonomous vehicles to output discrete lane-change decision and longitudinal vehicle acceleration. Our simulation results indicate that at a traffic density of 15 vehicles per kilometer (15 veh/km), the PASAC-PIDLag algorithm exhibits superior safety with a collision rate of 0%, outperforming the PASAC algorithm, which has a collision rate of 1%. The outcomes of the generalization assessments reveal that at low traffic density levels, both the PASAC-PIDLag and PASAC algorithms are proficient in attaining a 0% collision rate. Under conditions of high traffic flow density, the PASAC-PIDLag algorithm surpasses PASAC in terms of both safety and optimality.  ( 2 min )
    Autonomous Robotic Arm Manipulation for Planetary Missions using Causal Machine Learning
    arXiv:2403.00470v1 Announce Type: cross Abstract: Autonomous robotic arm manipulators have the potential to make planetary exploration and in-situ resource utilization missions more time efficient and productive, as the manipulator can handle the objects itself and perform goal-specific actions. We train a manipulator to autonomously study objects of which it has no prior knowledge, such as planetary rocks. This is achieved using causal machine learning in a simulated planetary environment. Here, the manipulator interacts with objects, and classifies them based on differing causal factors. These are parameters, such as mass or friction coefficient, that causally determine the outcomes of its interactions. Through reinforcement learning, the manipulator learns to interact in ways that reveal the underlying causal factors. We show that this method works even without any prior knowledge of the objects, or any previously-collected training data. We carry out the training in planetary exploration conditions, with realistic manipulator models.  ( 2 min )
    LoMOE: Localized Multi-Object Editing via Multi-Diffusion
    arXiv:2403.00437v1 Announce Type: cross Abstract: Recent developments in the field of diffusion models have demonstrated an exceptional capacity to generate high-quality prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing $\textbf{many}$ objects in a complex scene $\textbf{in one pass}$. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the current methods. We also curate and release a dataset dedicated to multi-object editing, named $\texttt{LoMOE}$-Bench. Our experiments against existing state-of-the-art methods demonstrate the improved effectiveness of our approach in terms of both image editing quality and inference speed.  ( 2 min )
    Structured Deep Neural Networks-Based Backstepping Trajectory Tracking Control for Lagrangian Systems
    arXiv:2403.00381v1 Announce Type: cross Abstract: Deep neural networks (DNN) are increasingly being used to learn controllers due to their excellent approximation capabilities. However, their black-box nature poses significant challenges to closed-loop stability guarantees and performance analysis. In this paper, we introduce a structured DNN-based controller for the trajectory tracking control of Lagrangian systems using backing techniques. By properly designing neural network structures, the proposed controller can ensure closed-loop stability for any compatible neural network parameters. In addition, improved control performance can be achieved by further optimizing neural network parameters. Besides, we provide explicit upper bounds on tracking errors in terms of controller parameters, which allows us to achieve the desired tracking performance by properly selecting the controller parameters. Furthermore, when system models are unknown, we propose an improved Lagrangian neural network (LNN) structure to learn the system dynamics and design the controller. We show that in the presence of model approximation errors and external disturbances, the closed-loop stability and tracking control performance can still be guaranteed. The effectiveness of the proposed approach is demonstrated through simulations.  ( 2 min )
    Validation of ML-UQ calibration statistics using simulated reference values: a sensitivity analysis
    arXiv:2403.00423v1 Announce Type: cross Abstract: Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. In consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to palliate this problem. As the generative probability distribution for the simulation of synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, shedding a doubt on the calibration diagnostic. This study explores various facets of this problem, and shows that some statistics are excessively sensitive to the choice of generative distribution to be used for validation when the generative distribution is unknown. This is the case, for instance, of the correlation coefficient between absolute errors and uncertainties (CC) and of the expected normalized calibration error (ENCE). A robust validation workflow to deal with simulated reference values is proposed.  ( 2 min )
    Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning
    arXiv:2403.00352v1 Announce Type: cross Abstract: In representation learning, a disentangled representation is highly desirable as it encodes generative factors of data in a separable and compact pattern. Researchers have advocated leveraging disentangled representations to complete downstream tasks with encouraging empirical evidence. This paper further investigates the necessity of disentangled representation in downstream applications. Specifically, we show that dimension-wise disentangled representations are unnecessary on a fundamental downstream task, abstract visual reasoning. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Furthermore, our findings suggest that the informativeness of representations is a better indicator of downstream performance than disentanglement. Finally, the positive correlation between informativeness and disentanglement explains the claimed usefulness of disentangled representations in previous works. The source code is available at https://github.com/Richard-coder-Nai/disentanglement-lib-necessity.git.  ( 2 min )
    Invariant Test-Time Adaptation for Vision-Language Model Generalization
    arXiv:2403.00376v1 Announce Type: cross Abstract: Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired datasets. However, these models display significant limitations when applied to long-tail tasks, such as fine-grained image classification, as a result of "decision shortcuts" that hinders their generalization capabilities. In this work, we find that the CLIP model possesses a rich set of features, encompassing both \textit{desired invariant causal features} and \textit{undesired decision shortcuts}. Moreover, the underperformance of CLIP on downstream tasks originates from its inability to effectively utilize pre-trained features in accordance with specific task requirements. To address this challenge, this paper introduces a test-time prompt tuning paradigm that optimizes a learnable prompt, thereby compelling the model to exploit genuine causal invariant features while disregarding decision shortcuts during the inference phase. The proposed method effectively alleviates excessive dependence on potentially misleading, task-irrelevant contextual information, while concurrently emphasizing critical, task-related visual cues. We conduct comparative analysis of the proposed method against various approaches which validates its effectiveness.  ( 2 min )
    Robustifying a Policy in Multi-Agent RL with Diverse Cooperative Behavior and Adversarial Style Sampling for Assistive Tasks
    arXiv:2403.00344v1 Announce Type: cross Abstract: Autonomous assistance of people with motor impairments is one of the most promising applications of autonomous robotic systems. Recent studies have reported encouraging results using deep reinforcement learning (RL) in the healthcare domain. Previous studies showed that assistive tasks can be formulated as multi-agent RL, wherein there are two agents: a caregiver and a care-receiver. However, policies trained in multi-agent RL are often sensitive to the policies of other agents. In such a case, a trained caregiver's policy may not work for different care-receivers. To alleviate this issue, we propose a framework that learns a robust caregiver's policy by training it for diverse care-receiver responses. In our framework, diverse care-receiver responses are autonomously learned through trials and errors. In addition, to robustify the care-giver's policy, we propose a strategy for sampling a care-receiver's response in an adversarial manner during the training. We evaluated the proposed method using tasks in an Assistive Gym. We demonstrate that policies trained with a popular deep RL method are vulnerable to changes in policies of other agents and that the proposed framework improves the robustness against such changes.  ( 2 min )
    Learning with Logical Constraints but without Shortcut Satisfaction
    arXiv:2403.00329v1 Announce Type: cross Abstract: Recent studies in neuro-symbolic learning have explored the integration of logical knowledge into deep learning via encoding logical constraints as an additional loss function. However, existing approaches tend to vacuously satisfy logical constraints through shortcuts, failing to fully exploit the knowledge. In this paper, we present a new framework for learning with logical constraints. Specifically, we address the shortcut satisfaction issue by introducing dual variables for logical connectives, encoding how the constraint is satisfied. We further propose a variational framework where the encoded logical constraint is expressed as a distributional loss that is compatible with the model's original training loss. The theoretical analysis shows that the proposed approach bears salient properties, and the experimental evaluations demonstrate its superior performance in both model generalizability and constraint satisfaction.  ( 2 min )
    Axe the X in XAI: A Plea for Understandable AI
    arXiv:2403.00315v1 Announce Type: cross Abstract: In a recent paper, Erasmus et al. (2021) defend the idea that the ambiguity of the term "explanation" in explainable AI (XAI) can be solved by adopting any of four different extant accounts of explanation in the philosophy of science: the Deductive Nomological, Inductive Statistical, Causal Mechanical, and New Mechanist models. In this chapter, I show that the authors' claim that these accounts can be applied to deep neural networks as they would to any natural phenomenon is mistaken. I also provide a more general argument as to why the notion of explainability as it is currently used in the XAI literature bears little resemblance to the traditional concept of scientific explanation. It would be more fruitful to use the label "understandable AI" to avoid the confusion that surrounds the goal and purposes of XAI. In the second half of the chapter, I argue for a pragmatic conception of understanding that is better suited to play the central role attributed to explanation in XAI. Following Kuorikoski & Ylikoski (2015), the conditions of satisfaction for understanding an ML system are fleshed out in terms of an agent's success in using the system, in drawing correct inferences from it.  ( 2 min )
    Deep Reinforcement Learning for Solving Management Problems: Towards A Large Management Mode
    arXiv:2403.00318v1 Announce Type: cross Abstract: We introduce a deep reinforcement learning (DRL) approach for solving management problems including inventory management, dynamic pricing, and recommendation. This DRL approach has the potential to lead to a large management model based on certain transformer neural network structures, resulting in an artificial general intelligence paradigm for various management tasks. Traditional methods have limitations for solving complex real-world problems, and we demonstrate how DRL can surpass existing heuristic approaches for solving management tasks. We aim to solve the problems in a unified framework, considering the interconnections between different tasks. Central to our methodology is the development of a foundational decision model coordinating decisions across the different domains through generative decision-making. Our experimental results affirm the effectiveness of our DRL-based framework in complex and dynamic business environments. This work opens new pathways for the application of DRL in management problems, highlighting its potential to revolutionize traditional business management.  ( 2 min )
    Universal Auto-encoder Framework for MIMO CSI Feedback
    arXiv:2403.00299v1 Announce Type: cross Abstract: Existing auto-encoder (AE)-based channel state information (CSI) frameworks have focused on a specific configuration of user equipment (UE) and base station (BS), and thus the input and output sizes of the AE are fixed. However, in the real-world scenario, the input and output sizes may vary depending on the number of antennas of the BS and UE and the allocated resource block in the frequency dimension. A naive approach to support the different input and output sizes is to use multiple AE models, which is impractical for the UE due to the limited HW resources. In this paper, we propose a universal AE framework that can support different input sizes and multiple compression ratios. The proposed AE framework significantly reduces the HW complexity while providing comparable performance in terms of compression ratio-distortion trade-off compared to the naive and state-of-the-art approaches.  ( 2 min )
    Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification
    arXiv:2403.00293v1 Announce Type: cross Abstract: With excellent generalization ability, self-supervised speech models have shown impressive performance on various downstream speech tasks in the pre-training and fine-tuning paradigm. However, as the growing size of pre-trained models, fine-tuning becomes practically unfeasible due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters are lightweight modules inserted into pre-trained models to facilitate parameter-efficient adaptation. In this paper, we propose an effective adapter framework designed for adapting self-supervised speech models to the speaker verification task. With a parallel adapter design, our proposed framework inserts two types of adapters into the pre-trained model, allowing the adaptation of latent features within intermediate Transformer layers and output embeddings from all Transformer layers. We conduct comprehensive experiments to validate the efficiency and effectiveness of the proposed framework. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed adapters surpass fine-tuning and other parameter-efficient transfer learning methods, achieving superior performance while updating only 5% of the parameters.  ( 2 min )
    Optimization of Array Encoding for Ultrasound Imaging
    arXiv:2403.00289v1 Announce Type: cross Abstract: Objective: The transmit encoding model for synthetic aperture imaging is a robust and flexible framework for understanding the effect of acoustic transmission on ultrasound image reconstruction. Our objective is to use machine learning (ML) to construct scanning sequences, parameterized by time delays and apodization weights, that produce high quality B-mode images. Approach: We use an ML model in PyTorch and simulated RF data from Field II to probe the space of possible encoding sequences for those that minimize a loss function that describes image quality. This approach is made computationally feasible by a novel formulation of the derivative for delay-and-sum beamforming. We demonstrate these results experimentally on wire targets and a tissue-mimicking phantom. Main Results: When trained according to a given set of imaging parameters (imaging domain, hardware restrictions), our ML imaging model produces optimized encoding sequences that improve a number of standard quality metrics including resolution, field of view, and contrast, over conventional sequences. Significance: This work demonstrates that the set of encoding schemes that are commonly used represent only a narrow subset of those available. Additionally, it demonstrates the value for ML tasks in synthetic transmit aperture imaging to consider the beamformer within the model, instead of as purely post-processing.  ( 2 min )
    Semantic Text Transmission via Prediction with Small Language Models: Cost-Similarity Trade-off
    arXiv:2403.00290v1 Announce Type: cross Abstract: We consider the communication of natural language text from a source to a destination over noiseless and character-erasure channels. We exploit language's inherent correlations and predictability to constrain transmission costs by allowing the destination to predict or complete words with potential dissimilarity with the source text. Concretely, our objective is to obtain achievable $(\bar{c}, \bar{s})$ pairs, where $\bar{c}$ is the average transmission cost at the source and $\bar{s}$ is the average semantic similarity measured via cosine similarity between vector embedding of words at the source and those predicted/completed at the destination. We obtain $(\bar{c}, \bar{s})$ pairs for neural language and first-order Markov chain-based small language models (SLM) for prediction, using both a threshold policy that transmits a word if its cosine similarity with that predicted/completed at the destination is below a threshold, and a periodic policy, which transmits words after a specific interval and predicts/completes the words in between, at the destination. We adopt an SLM for word completion. We demonstrate that, when communication occurs over a noiseless channel, the threshold policy achieves a higher $\bar{s}$ for a given $\bar{c}$ than the periodic policy and that the $\bar{s}$ achieved with the neural SLM is greater than or equal to that of the Markov chain-based algorithm for the same $\bar{c}$. The improved performance comes with a higher complexity in terms of time and computing requirements. However, when communication occurs over a character-erasure channel, all prediction algorithms and scheduling policies perform poorly. Furthermore, if character-level Huffman coding is used, the required $\bar{c}$ to achieve a given $\bar{s}$ is reduced, but the above observations still apply.  ( 3 min )
    Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval
    arXiv:2403.00272v1 Announce Type: cross Abstract: In the context of pose-invariant object recognition and retrieval, we demonstrate that it is possible to achieve significant improvements in performance if both the category-based and the object-identity-based embeddings are learned simultaneously during training. In hindsight, that sounds intuitive because learning about the categories is more fundamental than learning about the individual objects that correspond to those categories. However, to the best of what we know, no prior work in pose-invariant learning has demonstrated this effect. This paper presents an attention-based dual-encoder architecture with specially designed loss functions that optimize the inter- and intra-class distances simultaneously in two different embedding spaces, one for the category embeddings and the other for the object-level embeddings. The loss functions we have proposed are pose-invariant ranking losses that are designed to minimize the intra-class distances and maximize the inter-class distances in the dual representation spaces. We demonstrate the power of our approach with three challenging multi-view datasets, ModelNet-40, ObjectPI, and FG3D. With our dual approach, for single-view object recognition, we outperform the previous best by 20.0% on ModelNet40, 2.0% on ObjectPI, and 46.5% on FG3D. On the other hand, for single-view object retrieval, we outperform the previous best by 33.7% on ModelNet40, 18.8% on ObjectPI, and 56.9% on FG3D.  ( 2 min )
    A Survey of Route Recommendations: Methods, Applications, and Opportunities
    arXiv:2403.00284v1 Announce Type: cross Abstract: Nowadays, with advanced information technologies deployed citywide, large data volumes and powerful computational resources are intelligentizing modern city development. As an important part of intelligent transportation, route recommendation and its applications are widely used, directly influencing citizens` travel habits. Developing smart and efficient travel routes based on big data (possibly multi-modal) has become a central challenge in route recommendation research. Our survey offers a comprehensive review of route recommendation work based on urban computing. It is organized by the following three parts: 1) Methodology-wise. We categorize a large volume of traditional machine learning and modern deep learning methods. Also, we discuss their historical relations and reveal the edge-cutting progress. 2) Application\-wise. We present numerous novel applications related to route commendation within urban computing scenarios. 3) We discuss current problems and challenges and envision several promising research directions. We believe that this survey can help relevant researchers quickly familiarize themselves with the current state of route recommendation research and then direct them to future research trends.  ( 2 min )
    Parameter-Efficient Tuning of Large Convolutional Models
    arXiv:2403.00269v1 Announce Type: cross Abstract: To address the high computational and parameter complexity associated with fine-tuning large pre-trained models, researchers have developed parameter-efficient methods, where only partial parameters are updated for downstream tasks. However, these works often overlook the distinct properties of convolutional kernels, which still remain essential elements in many large models, such as Stable Diffusion. In this study, we first introduce filter subspace by decomposing convolutional kernels within each network layer over a small set of filter subspace elements, referred to as filter atoms. We then fine-tune these models to extract task-specific representation by only adapting the filter atoms, a few hundred parameters typically. To potentially expand the parameter space for tuning, we further show a simple approach to generate an overcomplete filter subspace by recursively decomposing each filter atom over another set of filter atoms. The fine-tuning of filter atoms reshapes the filter subspace, enabling convolutional layers to adapt to diverse downstream tasks efficiently. Extensive experiments show that such a simple scheme surpasses previous tuning baselines for both discriminate and generative tasks. Our approach can potentially be complementary to many existing fine-tuning methods.  ( 2 min )
    Deciphering diffuse scattering with machine learning and the equivariant foundation model: The case of molten FeO
    arXiv:2403.00259v1 Announce Type: cross Abstract: Bridging the gap between diffuse x-ray or neutron scattering measurements and predicted structures derived from atom-atom pair potentials in disordered materials, has been a longstanding challenge in condensed matter physics. This perspective gives a brief overview of the traditional approaches employed over the past several decades. Namely, the use of approximate interatomic pair potentials that relate 3-dimensional structural models to the measured structure factor and its associated pair distribution function. The use of machine learned interatomic potentials has grown in the past few years, and has been particularly successful in the cases of ionic and oxide systems. Recent advances in large scale sampling, along with a direct integration of scattering measurements into the model development, has provided improved agreement between experiments and large-scale models calculated with quantum mechanical accuracy. However, details of local polyhedral bonding and connectivity in meta-stable disordered systems still require improvement. Here we leverage MACE-MP-0; a newly introduced equivariant foundation model and validate the results against high-quality experimental scattering data for the case of molten iron(II) oxide (FeO). These preliminary results suggest that the emerging foundation model has the potential to surpass the traditional limitations of classical interatomic potentials.  ( 2 min )
    Robust deep labeling of radiological emphysema subtypes using squeeze and excitation convolutional neural networks: The MESA Lung and SPIROMICS Studies
    arXiv:2403.00257v1 Announce Type: cross Abstract: Pulmonary emphysema, the progressive, irreversible loss of lung tissue, is conventionally categorized into three subtypes identifiable on pathology and on lung computed tomography (CT) images. Recent work has led to the unsupervised learning of ten spatially-informed lung texture patterns (sLTPs) on lung CT, representing distinct patterns of emphysematous lung parenchyma based on both textural appearance and spatial location within the lung, and which aggregate into 6 robust and reproducible CT Emphysema Subtypes (CTES). Existing methods for sLTP segmentation, however, are slow and highly sensitive to changes in CT acquisition protocol. In this work, we present a robust 3-D squeeze-and-excitation CNN for supervised classification of sLTPs and CTES on lung CT. Our results demonstrate that this model achieves accurate and reproducible sLTP segmentation on lung CTscans, across two independent cohorts and independently of scanner manufacturer and model.  ( 2 min )
    "Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach
    arXiv:2403.00258v1 Announce Type: cross Abstract: Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the price of increased depth and having more parameters per layer, making their training and inference more computationally challenging. In an attempt to address this key limitation, efforts have been devoted to the compression (e.g., sparsification and/or quantization) of these large-scale machine learning models, so that they can be deployed on low-power IoT devices. In this paper, building upon recent advances in neural tangent kernel (NTK) and random matrix theory (RMT), we provide a novel compression approach to wide and fully-connected \emph{deep} neural nets. Specifically, we demonstrate that in the high-dimensional regime where the number of data points $n$ and their dimension $p$ are both large, and under a Gaussian mixture model for the data, there exists \emph{asymptotic spectral equivalence} between the NTK matrices for a large family of DNN models. This theoretical result enables "lossless" compression of a given DNN to be performed, in the sense that the compressed network yields asymptotically the same NTK as the original (dense and unquantized) network, with its weights and activations taking values \emph{only} in $\{ 0, \pm 1 \}$ up to a scaling. Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme, with code available at \url{https://github.com/Model-Compression/Lossless_Compression}.  ( 3 min )
    Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance
    arXiv:2403.00236v1 Announce Type: cross Abstract: We investigate the performance of LLM-based zero-shot stance detection on tweets. Using FlanT5-XXL, an instruction-tuned open-source LLM, with the SemEval 2016 Tasks 6A, 6B, and P-Stance datasets, we study the performance and its variations under different prompts and decoding strategies, as well as the potential biases of the model. We show that the zero-shot approach can match or outperform state-of-the-art benchmarks, including fine-tuned models. We provide various insights into its performance including the sensitivity to instructions and prompts, the decoding strategies, the perplexity of the prompts, and to negations and oppositions present in prompts. Finally, we ensure that the LLM has not been trained on test datasets, and identify a positivity bias which may partially explain the performance differences across decoding strategie  ( 2 min )
    EUROPA: A Legal Multilingual Keyphrase Generation Dataset
    arXiv:2403.00252v1 Announce Type: cross Abstract: Keyphrase generation has primarily been explored within the context of academic research articles, with a particular focus on scientific domains and the English language. In this work, we present EUROPA, a dataset for multilingual keyphrase generation in the legal domain. It is derived from legal judgments from the Court of Justice of the European Union (EU), and contains instances in all 24 EU official languages. We run multilingual models on our corpus and analyze the results, showing room for improvement on a domain-specific multilingual corpus such as the one we present.  ( 2 min )
    Causal Bandits with General Causal Models and Interventions
    arXiv:2403.00233v1 Announce Type: cross Abstract: This paper considers causal bandits (CBs) for the sequential design of interventions in a causal system. The objective is to optimize a reward function via minimizing a measure of cumulative regret with respect to the best sequence of interventions in hindsight. The paper advances the results on CBs in three directions. First, the structural causal models (SCMs) are assumed to be unknown and drawn arbitrarily from a general class $\mathcal{F}$ of Lipschitz-continuous functions. Existing results are often focused on (generalized) linear SCMs. Second, the interventions are assumed to be generalized soft with any desired level of granularity, resulting in an infinite number of possible interventions. The existing literature, in contrast, generally adopts atomic and hard interventions. Third, we provide general upper and lower bounds on regret. The upper bounds subsume (and improve) known bounds for special cases. The lower bounds are generally hitherto unknown. These bounds are characterized as functions of the (i) graph parameters, (ii) eluder dimension of the space of SCMs, denoted by $\operatorname{dim}(\mathcal{F})$, and (iii) the covering number of the function space, denoted by ${\rm cn}(\mathcal{F})$. Specifically, the cumulative achievable regret over horizon $T$ is $\mathcal{O}(K d^{L-1}\sqrt{T\operatorname{dim}(\mathcal{F}) \log({\rm cn}(\mathcal{F}))})$, where $K$ is related to the Lipschitz constants, $d$ is the graph's maximum in-degree, and $L$ is the length of the longest causal path. The upper bound is further refined for special classes of SCMs (neural network, polynomial, and linear), and their corresponding lower bounds are provided.  ( 3 min )
    Diffraction and Scattering Aware Radio Map and Environment Reconstruction using Geometry Model-Assisted Deep Learning
    arXiv:2403.00229v1 Announce Type: cross Abstract: Machine learning (ML) facilitates rapid channel modeling for 5G and beyond wireless communication systems. Many existing ML techniques utilize a city map to construct the radio map; however, an updated city map may not always be available. This paper proposes to employ the received signal strength (RSS) data to jointly construct the radio map and the virtual environment by exploiting the geometry structure of the environment. In contrast to many existing ML approaches that lack of an environment model, we develop a virtual obstacle model and characterize the geometry relation between the propagation paths and the virtual obstacles. A multi-screen knife-edge model is adopted to extract the key diffraction features, and these features are fed into a neural network (NN) for diffraction representation. To describe the scattering, as oppose to most existing methods that directly input an entire city map, our model focuses on the geometry structure from the local area surrounding the TX-RX pair and the spatial invariance of such local geometry structure is exploited. Numerical experiments demonstrate that, in addition to reconstructing a 3D virtual environment, the proposed model outperforms the state-of-the-art methods in radio map construction with 10%-18% accuracy improvements. It can also reduce 20% data and 50% training epochs when transferred to a new environment.  ( 2 min )
    Improving Socratic Question Generation using Data Augmentation and Preference Optimization
    arXiv:2403.00199v1 Announce Type: cross Abstract: The Socratic method is a way of guiding students toward solving a problem independently without directly revealing the solution to the problem. Although this method has been shown to significantly improve student learning outcomes, it remains a complex labor-intensive task for instructors. Large language models (LLMs) can be used to augment human effort by automatically generating Socratic questions for students. However, existing methods that involve prompting these LLMs sometimes produce invalid outputs, e.g., those that directly reveal the solution to the problem or provide irrelevant or premature questions. To alleviate this problem, inspired by reinforcement learning with AI feedback (RLAIF), we first propose a data augmentation method to enrich existing Socratic questioning datasets with questions that are invalid in specific ways. Next, we propose a method to optimize open-source LLMs such as LLama 2 to prefer ground-truth questions over generated invalid ones, using direct preference optimization (DPO). Our experiments on a Socratic questions dataset for student code debugging show that a DPO-optimized 7B LLama 2 model can effectively avoid generating invalid questions, and as a result, outperforms existing state-of-the-art prompting methods.  ( 2 min )
    Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART
    arXiv:2403.00212v1 Announce Type: cross Abstract: This research addresses the challenge of training an ASR model for personalized voices with minimal data. Utilizing just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. Subsequently, a Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is fine-tuned on this dataset. The developed web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for multilingual video content transcription and translation for personalized voice.  ( 2 min )
    Learning to Find Missing Video Frames with Synthetic Data Augmentation: A General Framework and Application in Generating Thermal Images Using RGB Cameras
    arXiv:2403.00196v1 Announce Type: cross Abstract: Advanced Driver Assistance Systems (ADAS) in intelligent vehicles rely on accurate driver perception within the vehicle cabin, often leveraging a combination of sensing modalities. However, these modalities operate at varying rates, posing challenges for real-time, comprehensive driver state monitoring. This paper addresses the issue of missing data due to sensor frame rate mismatches, introducing a generative model approach to create synthetic yet realistic thermal imagery. We propose using conditional generative adversarial networks (cGANs), specifically comparing the pix2pix and CycleGAN architectures. Experimental results demonstrate that pix2pix outperforms CycleGAN, and utilizing multi-view input styles, especially stacked views, enhances the accuracy of thermal image generation. Moreover, the study evaluates the model's generalizability across different subjects, revealing the importance of individualized training for optimal performance. The findings suggest the potential of generative models in addressing missing frames, advancing driver state monitoring for intelligent vehicles, and underscoring the need for continued research in model generalization and customization.  ( 2 min )
    AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language Model Outputs
    arXiv:2403.00198v1 Announce Type: cross Abstract: Pre-trained Large Language Models (LLMs) have significantly advanced natural language processing capabilities but are susceptible to biases present in their training data, leading to unfair outcomes in various applications. While numerous strategies have been proposed to mitigate bias, they often require extensive computational resources and may compromise model performance. In this work, we introduce AXOLOTL, a novel post-processing framework, which operates agnostically across tasks and models, leveraging public APIs to interact with LLMs without direct access to internal parameters. Through a three-step process resembling zero-shot learning, AXOLOTL identifies biases, proposes resolutions, and guides the model to self-debias its outputs. This approach minimizes computational costs and preserves model performance, making AXOLOTL a promising tool for debiasing LLM outputs with broad applicability and ease of use.  ( 2 min )
    Entry-Specific Bounds for Low-Rank Matrix Completion under Highly Non-Uniform Sampling
    arXiv:2403.00184v1 Announce Type: cross Abstract: Low-rank matrix completion concerns the problem of estimating unobserved entries in a matrix using a sparse set of observed entries. We consider the non-uniform setting where the observed entries are sampled with highly varying probabilities, potentially with different asymptotic scalings. We show that under structured sampling probabilities, it is often better and sometimes optimal to run estimation algorithms on a smaller submatrix rather than the entire matrix. In particular, we prove error upper bounds customized to each entry, which match the minimax lower bounds under certain conditions. Our bounds characterize the hardness of estimating each entry as a function of the localized sampling probabilities. We provide numerical experiments that confirm our theoretical findings.  ( 2 min )
    Go Beyond Black-box Policies: Rethinking the Design of Learning Agent for Interpretable and Verifiable HVAC Control
    arXiv:2403.00172v1 Announce Type: cross Abstract: Recent research has shown the potential of Model-based Reinforcement Learning (MBRL) to enhance energy efficiency of Heating, Ventilation, and Air Conditioning (HVAC) systems. However, existing methods rely on black-box thermal dynamics models and stochastic optimizers, lacking reliability guarantees and posing risks to occupant health. In this work, we overcome the reliability bottleneck by redesigning HVAC controllers using decision trees extracted from existing thermal dynamics models and historical data. Our decision tree-based policies are deterministic, verifiable, interpretable, and more energy-efficient than current MBRL methods. First, we introduce a novel verification criterion for RL agents in HVAC control based on domain knowledge. Second, we develop a policy extraction procedure that produces a verifiable decision tree policy. We found that the high dimensionality of the thermal dynamics model input hinders the efficiency of policy extraction. To tackle the dimensionality challenge, we leverage importance sampling conditioned on historical data distributions, significantly improving policy extraction efficiency. Lastly, we present an offline verification algorithm that guarantees the reliability of a control policy. Extensive experiments show that our method saves 68.4% more energy and increases human comfort gain by 14.8% compared to the state-of-the-art method, in addition to an 1127x reduction in computation overhead. Our code and data are available at https://github.com/ryeii/Veri_HVAC  ( 3 min )
    Automated Efficient Estimation using Monte Carlo Efficient Influence Functions
    arXiv:2403.00158v1 Announce Type: cross Abstract: Many practical problems involve estimating low dimensional statistical quantities with high-dimensional models and datasets. Several approaches address these estimation tasks based on the theory of influence functions, such as debiased/double ML or targeted minimum loss estimation. This paper introduces \textit{Monte Carlo Efficient Influence Functions} (MC-EIF), a fully automated technique for approximating efficient influence functions that integrates seamlessly with existing differentiable probabilistic programming systems. MC-EIF automates efficient statistical estimation for a broad class of models and target functionals that would previously require rigorous custom analysis. We prove that MC-EIF is consistent, and that estimators using MC-EIF achieve optimal $\sqrt{N}$ convergence rates. We show empirically that estimators using MC-EIF are at parity with estimators using analytic EIFs. Finally, we demonstrate a novel capstone example using MC-EIF for optimal portfolio selection.  ( 2 min )
    TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision
    arXiv:2403.00165v1 Announce Type: cross Abstract: Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy. Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data which is costly and time-consuming to acquire. To alleviate human efforts, in this paper, we work on hierarchical text classification with the minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLM) show competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting, because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text classification, which (1) automatically enriches the label taxonomy with class-indicative topical terms mined from the corpus to facilitate classifier training and (2) utilizes LLMs for both data annotation and creation tailored for the hierarchical label space. Experiments show that TELEClass can outperform previous weakly-supervised hierarchical text classification methods and LLM-based zero-shot prompting methods on two public datasets.  ( 2 min )
    EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation
    arXiv:2403.00144v1 Announce Type: cross Abstract: The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but they are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; profoundly, our EBBS-based distillation does not sacrifice, or even improves, the translation quality.  ( 2 min )
    Analysis of Kernel Mirror Prox for Measure Optimization
    arXiv:2403.00147v1 Announce Type: cross Abstract: By choosing a suitable function space as the dual to the non-negative measure cone, we study in a unified framework a class of functional saddle-point optimization problems, which we term the Mixed Functional Nash Equilibrium (MFNE), that underlies several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. We model the saddle-point optimization dynamics as an interacting Fisher-Rao-RKHS gradient flow when the function space is chosen as a reproducing kernel Hilbert space (RKHS). As a discrete time counterpart, we propose a primal-dual kernel mirror prox (KMP) algorithm, which uses a dual step in the RKHS, and a primal entropic mirror prox step. We then provide a unified convergence analysis of KMP in an infinite-dimensional setting for this class of MFNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case, where $N$ is the iteration counter. As a case study, we apply our analysis to DRO, providing algorithmic guarantees for DRO robustness and convergence.  ( 2 min )
    Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging
    arXiv:2403.00143v1 Announce Type: cross Abstract: We address unsupervised discontinuous constituency parsing, where we observe a high variance in the performance of the only previous model. We propose to build an ensemble of different runs of the existing discontinuous parser by averaging the predicted trees, to stabilize and boost performance. To begin with, we provide comprehensive computational complexity analysis (in terms of P and NP-complete) for tree averaging under different setups of binarity and continuity. We then develop an efficient exact algorithm to tackle the task, which runs in a reasonable time for all samples in our experiments. Results on three datasets show our method outperforms all baselines in all metrics; we also provide in-depth analyses of our approach.  ( 2 min )
    From Flies to Robots: Inverted Landing in Small Quadcopters with Dynamic Perching
    arXiv:2403.00128v1 Announce Type: cross Abstract: Inverted landing is a routine behavior among a number of animal fliers. However, mastering this feat poses a considerable challenge for robotic fliers, especially to perform dynamic perching with rapid body rotations (or flips) and landing against gravity. Inverted landing in flies have suggested that optical flow senses are closely linked to the precise triggering and control of body flips that lead to a variety of successful landing behaviors. Building upon this knowledge, we aimed to replicate the flies' landing behaviors in small quadcopters by developing a control policy general to arbitrary ceiling-approach conditions. First, we employed reinforcement learning in simulation to optimize discrete sensory-motor pairs across a broad spectrum of ceiling-approach velocities and directions. Next, we converted the sensory-motor pairs to a two-stage control policy in a continuous augmented-optical flow space. The control policy consists of a first-stage Flip-Trigger Policy, which employs a one-class support vector machine, and a second-stage Flip-Action Policy, implemented as a feed-forward neural network. To transfer the inverted-landing policy to physical systems, we utilized domain randomization and system identification techniques for a zero-shot sim-to-real transfer. As a result, we successfully achieved a range of robust inverted-landing behaviors in small quadcopters, emulating those observed in flies.  ( 2 min )
    Scaling up Dynamic Edge Partition Models via Stochastic Gradient MCMC
    arXiv:2403.00044v1 Announce Type: cross Abstract: The edge partition model (EPM) is a generative model for extracting an overlapping community structure from static graph-structured data. In the EPM, the gamma process (GaP) prior is adopted to infer the appropriate number of latent communities, and each vertex is endowed with a gamma distributed positive memberships vector. Despite having many attractive properties, inference in the EPM is typically performed using Markov chain Monte Carlo (MCMC) methods that prevent it from being applied to massive network data. In this paper, we generalize the EPM to account for dynamic enviroment by representing each vertex with a positive memberships vector constructed using Dirichlet prior specification, and capturing the time-evolving behaviour of vertices via a Dirichlet Markov chain construction. A simple-to-implement Gibbs sampler is proposed to perform posterior computation using Negative- Binomial augmentation technique. For large network data, we propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in the proposed model. The experimental results show that the novel methods achieve competitive performance in terms of link prediction, while being much faster.  ( 2 min )
    RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks
    arXiv:2403.00043v1 Announce Type: cross Abstract: Ribonucleic acid (RNA) plays a variety of crucial roles in fundamental biological processes. Recently, RNA has become an interesting drug target, emphasizing the need to improve our understanding of its structures and functions. Over the years, sequencing technologies have produced an enormous amount of unlabeled RNA data, which hides important knowledge and potential. Motivated by the successes of protein language models, we introduce RiboNucleic Acid Language Model (RiNALMo) to help unveil the hidden code of RNA. RiNALMo is the largest RNA language model to date with $650$ million parameters pre-trained on $36$ million non-coding RNA sequences from several available databases. RiNALMo is able to extract hidden knowledge and capture the underlying structure information implicitly embedded within the RNA sequences. RiNALMo achieves state-of-the-art results on several downstream tasks. Notably, we show that its generalization capabilities can overcome the inability of other deep learning methods for secondary structure prediction to generalize on unseen RNA families. The code has been made publicly available on https://github.com/lbcb-sci/RiNALMo.  ( 2 min )
    Lower Bounds for Differential Privacy Under Continual Observation and Online Threshold Queries
    arXiv:2403.00028v1 Announce Type: cross Abstract: One of the most basic problems for studying the "price of privacy over time" is the so called private counter problem, introduced by Dwork et al. (2010) and Chan et al. (2010). In this problem, we aim to track the number of events that occur over time, while hiding the existence of every single event. More specifically, in every time step $t\in[T]$ we learn (in an online fashion) that $\Delta_t\geq 0$ new events have occurred, and must respond with an estimate $n_t\approx\sum_{j=1}^t \Delta_j$. The privacy requirement is that all of the outputs together, across all time steps, satisfy event level differential privacy. The main question here is how our error needs to depend on the total number of time steps $T$ and the total number of events $n$. Dwork et al. (2015) showed an upper bound of $O\left(\log(T)+\log^2(n)\right)$, and Henzinger et al. (2023) showed a lower bound of $\Omega\left(\min\{\log n, \log T\}\right)$. We show a new lower bound of $\Omega\left(\min\{n,\log T\}\right)$, which is tight w.r.t. the dependence on $T$, and is tight in the sparse case where $\log^2 n=O(\log T)$. Our lower bound has the following implications: $\bullet$ We show that our lower bound extends to the "online thresholds problem", where the goal is to privately answer many "quantile queries" when these queries are presented one-by-one. This resolves an open question of Bun et al. (2017). $\bullet$ Our lower bound implies, for the first time, a separation between the number of mistakes obtainable by a private online learner and a non-private online learner. This partially resolves a COLT'22 open question published by Sanyal and Ramponi. $\bullet$ Our lower bound also yields the first separation between the standard model of private online learning and a recently proposed relaxed variant of it, called private online prediction.  ( 3 min )
    Auditable Homomorphic-based Decentralized Collaborative AI with Attribute-based Differential Privacy
    arXiv:2403.00023v1 Announce Type: cross Abstract: In recent years, the notion of federated learning (FL) has led to the new paradigm of distributed artificial intelligence (AI) with privacy preservation. However, most current FL systems suffer from data privacy issues due to the requirement of a trusted third party. Although some previous works introduce differential privacy to protect the data, however, it may also significantly deteriorate the model performance. To address these issues, we propose a novel decentralized collaborative AI framework, named Auditable Homomorphic-based Decentralised Collaborative AI (AerisAI), to improve security with homomorphic encryption and fine-grained differential privacy. Our proposed AerisAI directly aggregates the encrypted parameters with a blockchain-based smart contract to get rid of the need of a trusted third party. We also propose a brand-new concept for eliminating the negative impacts of differential privacy for model performance. Moreover, the proposed AerisAI also provides the broadcast-aware group key management based on ciphertext-policy attribute-based encryption (CPABE) to achieve fine-grained access control based on different service-level agreements. We provide a formal theoretical analysis of the proposed AerisAI as well as the functionality comparison with the other baselines. We also conduct extensive experiments on real datasets to evaluate the proposed approach. The experimental results indicate that our proposed AerisAI significantly outperforms the other state-of-the-art baselines.  ( 2 min )
    GIN-SD: Source Detection in Graphs with Incomplete Nodes via Positional Encoding and Attentive Fusion
    arXiv:2403.00014v1 Announce Type: cross Abstract: Source detection in graphs has demonstrated robust efficacy in the domain of rumor source identification. Although recent solutions have enhanced performance by leveraging deep neural networks, they often require complete user data. In this paper, we address a more challenging task, rumor source detection with incomplete user data, and propose a novel framework, i.e., Source Detection in Graphs with Incomplete Nodes via Positional Encoding and Attentive Fusion (GIN-SD), to tackle this challenge. Specifically, our approach utilizes a positional embedding module to distinguish nodes that are incomplete and employs a self-attention mechanism to focus on nodes with greater information transmission capacity. To mitigate the prediction bias caused by the significant disparity between the numbers of source and non-source nodes, we also introduce a class-balancing mechanism. Extensive experiments validate the effectiveness of GIN-SD and its superiority to state-of-the-art methods.  ( 2 min )
    SuperdropNet: a Stable and Accurate Machine Learning Proxy for Droplet-based Cloud Microphysics
    arXiv:2402.18354v1 Announce Type: cross Abstract: Cloud microphysics has important consequences for climate and weather phenomena, and inaccurate representations can limit forecast accuracy. While atmospheric models increasingly resolve storms and clouds, the accuracy of the underlying microphysics remains limited by computationally expedient bulk moment schemes based on simplifying assumptions. Droplet-based Lagrangian schemes are more accurate but are underutilized due to their large computational overhead. Machine learning (ML) based schemes can bridge this gap by learning from vast droplet-based simulation datasets, but have so far struggled to match the accuracy and stability of bulk moment schemes. To address this challenge, we developed SuperdropNet, an ML-based emulator of the Lagrangian superdroplet simulations. To improve accuracy and stability, we employ multi-step autoregressive prediction during training, impose physical constraints, and carefully control stochasticity in the training data. Superdropnet predicted hydrometeor states and cloud-to-rain transition times more accurately than previous ML emulators, and matched or outperformed bulk moment schemes in many cases. We further carried out detailed analyses to reveal how multistep autoregressive training improves performance, and how the performance of SuperdropNet and other microphysical schemes hydrometeors' mass, number and size distribution. Together our results suggest that ML models can effectively emulate cloud microphysics, in a manner consistent with droplet-based simulations.  ( 2 min )
    AtP*: An efficient and scalable method for localizing LLM behaviour to components
    arXiv:2403.00745v1 Announce Type: new Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.  ( 2 min )
    Adaptive Learning Rate for Follow-the-Regularized-Leader: Competitive Ratio Analysis and Best-of-Both-Worlds
    arXiv:2403.00715v1 Announce Type: new Abstract: Follow-The-Regularized-Leader (FTRL) is known as an effective and versatile approach in online learning, where appropriate choice of the learning rate is crucial for smaller regret. To this end, we formulate the problem of adjusting FTRL's learning rate as a sequential decision-making problem and introduce the framework of competitive analysis. We establish a lower bound for the competitive ratio and propose update rules for learning rate that achieves an upper bound within a constant factor of this lower bound. Specifically, we illustrate that the optimal competitive ratio is characterized by the (approximate) monotonicity of components of the penalty term, showing that a constant competitive ratio is achievable if the components of the penalty term form a monotonically non-increasing sequence, and derive a tight competitive ratio when penalty terms are $\xi$-approximately monotone non-increasing. Our proposed update rule, referred to as \textit{stability-penalty matching}, also facilitates constructing the Best-Of-Both-Worlds (BOBW) algorithms for stochastic and adversarial environments. In these environments our result contributes to achieve tighter regret bound and broaden the applicability of algorithms for various settings such as multi-armed bandits, graph bandits, linear bandits, and contextual bandits.  ( 2 min )
    Subhomogeneous Deep Equilibrium Models
    arXiv:2403.00720v1 Announce Type: new Abstract: Implicit-depth neural networks have grown as powerful alternatives to traditional networks in various applications in recent years. However, these models often lack guarantees of existence and uniqueness, raising stability, performance, and reproducibility issues. In this paper, we present a new analysis of the existence and uniqueness of fixed points for implicit-depth neural networks based on the concept of subhomogeneous operators and the nonlinear Perron-Frobenius theory. Compared to previous similar analyses, our theory allows for weaker assumptions on the parameter matrices, thus yielding a more flexible framework for well-defined implicit networks. We illustrate the performance of the resulting subhomogeneous networks on feed-forward, convolutional, and graph neural network examples.  ( 2 min )
    Scalable Learning of Item Response Theory Models
    arXiv:2403.00680v1 Announce Type: new Abstract: Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for their use in alternating IRT training algorithms, facilitating scalable learning from large data.  ( 2 min )
    Advancing Additive Manufacturing through Deep Learning: A Comprehensive Review of Current Progress and Future Challenges
    arXiv:2403.00669v1 Announce Type: new Abstract: Additive manufacturing (AM) has already proved itself to be the potential alternative to widely-used subtractive manufacturing due to its extraordinary capacity of manufacturing highly customized products with minimum material wastage. Nevertheless, it is still not being considered as the primary choice for the industry due to some of its major inherent challenges, including complex and dynamic process interactions, which are sometimes difficult to fully understand even with traditional machine learning because of the involvement of high-dimensional data such as images, point clouds, and voxels. However, the recent emergence of deep learning (DL) is showing great promise in overcoming many of these challenges as DL can automatically capture complex relationships from high-dimensional data without hand-crafted feature extraction. Therefore, the volume of research in the intersection of AM and DL is exponentially growing each year which makes it difficult for the researchers to keep track of the trend and future potential directions. Furthermore, to the best of our knowledge, there is no comprehensive review paper in this research track summarizing the recent studies. Therefore, this paper reviews the recent studies that apply DL for making the AM process better with a high-level summary of their contributions and limitations. Finally, it summarizes the current challenges and recommends some of the promising opportunities in this domain for further investigation with a special focus on generalizing DL models for wide-range of geometry types, managing uncertainties both in AM data and DL models, overcoming limited and noisy AM data issues by incorporating generative models, and unveiling the potential of interpretable DL for AM.  ( 3 min )
    Stability-Certified Learning of Control Systems with Quadratic Nonlinearities
    arXiv:2403.00646v1 Announce Type: new Abstract: This work primarily focuses on an operator inference methodology aimed at constructing low-dimensional dynamical models based on a priori hypotheses about their structure, often informed by established physics or expert insights. Stability is a fundamental attribute of dynamical systems, yet it is not always assured in models derived through inference. Our main objective is to develop a method that facilitates the inference of quadratic control dynamical systems with inherent stability guarantees. To this aim, we investigate the stability characteristics of control systems with energy-preserving nonlinearities, thereby identifying conditions under which such systems are bounded-input bounded-state stable. These insights are subsequently applied to the learning process, yielding inferred models that are inherently stable by design. The efficacy of our proposed framework is demonstrated through a couple of numerical examples.  ( 2 min )
    Rethinking The Uniformity Metric in Self-Supervised Learning
    arXiv:2403.00642v1 Announce Type: new Abstract: Uniformity plays a crucial role in the assessment of learned representations, contributing to a deeper comprehension of self-supervised learning. The seminal work by \citet{Wang2020UnderstandingCR} introduced a uniformity metric that quantitatively measures the collapse degree of learned representations. Directly optimizing this metric together with alignment proves to be effective in preventing constant collapse. However, we present both theoretical and empirical evidence revealing that this metric lacks sensitivity to dimensional collapse, highlighting its limitations. To address this limitation and design a more effective uniformity metric, this paper identifies five fundamental properties, some of which the existing uniformity metric fails to meet. We subsequently introduce a novel uniformity metric that satisfies all of these desiderata and exhibits sensitivity to dimensional collapse. When applied as an auxiliary loss in various established self-supervised methods, our proposed uniformity metric consistently enhances their performance in downstream tasks.Our code was released at https://github.com/sunset-clouds/WassersteinUniformityMetric.  ( 2 min )
    Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms
    arXiv:2403.00574v1 Announce Type: new Abstract: Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, we do not understand well whether enhanced optimization translates to improved generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper aims to address this gap by adopting a novel approach. Rather than solely evaluating the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation encompasses a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping framework. Through our evaluation, which encompasses synthetic functions with known minima and real-world problems in computer vision and natural language processing, we emphasize fair benchmarking under a statistical framework, comparing stationary distributions and establishing statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers utilizing the BH framework. Notably, these algorithms demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.  ( 3 min )
    Bias Mitigation in Fine-tuning Pre-trained Models for Enhanced Fairness and Efficiency
    arXiv:2403.00625v1 Announce Type: new Abstract: Fine-tuning pre-trained models is a widely employed technique in numerous real-world applications. However, fine-tuning these models on new tasks can lead to unfair outcomes. This is due to the absence of generalization guarantees for fairness properties, regardless of whether the original pre-trained model was developed with fairness considerations. To tackle this issue, we introduce an efficient and robust fine-tuning framework specifically designed to mitigate biases in new tasks. Our empirical analysis shows that the parameters in the pre-trained model that affect predictions for different demographic groups are different, so based on this observation, we employ a transfer learning strategy that neutralizes the importance of these influential weights, determined using Fisher information across demographic groups. Additionally, we integrate this weight importance neutralization strategy with a matrix factorization technique, which provides a low-rank approximation of the weight matrix using fewer parameters, reducing the computational demands. Experiments on multiple pre-trained models and new tasks demonstrate the effectiveness of our method.  ( 2 min )
    Indirectly Parameterized Concrete Autoencoders
    arXiv:2403.00563v1 Announce Type: new Abstract: Feature selection is a crucial task in settings where data is high-dimensional or acquiring the full set of features is costly. Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications. Concrete Autoencoders (CAEs), considered state-of-the-art in embedded feature selection, may struggle to achieve stable joint optimization, hurting their training time and generalization. In this work, we identify that this instability is correlated with the CAE learning duplicate selections. To remedy this, we propose a simple and effective improvement: Indirectly Parameterized CAEs (IP-CAEs). IP-CAEs learn an embedding and a mapping from it to the Gumbel-Softmax distributions' parameters. Despite being simple to implement, IP-CAE exhibits significant and consistent improvements over CAE in both generalization and training time across several datasets for reconstruction and classification. Unlike CAE, IP-CAE effectively leverages non-linear relationships and does not require retraining the jointly optimized decoder. Furthermore, our approach is, in principle, generalizable to Gumbel-Softmax distributions beyond feature selection.  ( 2 min )
    EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data
    arXiv:2403.00564v1 Announce Type: new Abstract: Sample efficiency remains a crucial challenge in applying Reinforcement Learning (RL) to real-world tasks. While recent algorithms have made significant strides in improving sample efficiency, none have achieved consistently superior performance across diverse domains. In this paper, we introduce EfficientZero V2, a general framework designed for sample-efficient RL algorithms. We have expanded the performance of EfficientZero to multiple domains, encompassing both continuous and discrete actions, as well as visual and low-dimensional inputs. With a series of improvements we propose, EfficientZero V2 outperforms the current state-of-the-art (SOTA) by a significant margin in diverse tasks under the limited data setting. EfficientZero V2 exhibits a notable advancement over the prevailing general algorithm, DreamerV3, achieving superior outcomes in 50 of 66 evaluated tasks across diverse benchmarks, such as Atari 100k, Proprio Control, and Vision Control.  ( 2 min )
    Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking
    arXiv:2403.00550v1 Announce Type: new Abstract: Imitation learning field requires expert data to train agents in a task. Most often, this learning approach suffers from the absence of available data, which results in techniques being tested on its dataset. Creating datasets is a cumbersome process requiring researchers to train expert agents from scratch, record their interactions and test each benchmark method with newly created data. Moreover, creating new datasets for each new technique results in a lack of consistency in the evaluation process since each dataset can drastically vary in state and action distribution. In response, this work aims to address these issues by creating Imitation Learning Datasets, a toolkit that allows for: (i) curated expert policies with multithreaded support for faster dataset creation; (ii) readily available datasets and techniques with precise measurements; and (iii) sharing implementations of common imitation learning techniques. Demonstration link: https://nathangavenski.github.io/#/il-datasets-video  ( 2 min )
    Machine Learning Training Optimization using the Barycentric Correction Procedure
    arXiv:2403.00542v1 Announce Type: new Abstract: Machine learning (ML) algorithms are predictively competitive algorithms with many human-impact applications. However, the issue of long execution time remains unsolved in the literature for high-dimensional spaces. This study proposes combining ML algorithms with an efficient methodology known as the barycentric correction procedure (BCP) to address this issue. This study uses synthetic data and an educational dataset from a private university to show the benefits of the proposed method. It was found that this combination provides significant benefits related to time in synthetic and real data without losing accuracy when the number of instances and dimensions increases. Additionally, for high-dimensional spaces, it was proved that BCP and linear support vector classification (LinearSVC), after an estimated feature map for the gaussian radial basis function (RBF) kernel, were unfeasible in terms of computational time and accuracy.  ( 2 min )
    Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning
    arXiv:2403.00514v1 Announce Type: new Abstract: Recent advancements in off-policy Reinforcement Learning (RL) have significantly improved sample efficiency, primarily due to the incorporation of various forms of regularization that enable more gradient update steps than traditional agents. However, many of these techniques have been tested in limited settings, often on tasks from single simulation benchmarks and against well-known algorithms rather than a range of regularization approaches. This limits our understanding of the specific mechanisms driving RL improvements. To address this, we implemented over 60 different off-policy agents, each integrating established regularization techniques from recent state-of-the-art algorithms. We tested these agents across 14 diverse tasks from 2 simulation benchmarks. Our findings reveal that while the effectiveness of a specific regularization setup varies with the task, certain combinations consistently demonstrate robust and superior performance. Notably, a simple Soft Actor-Critic agent, appropriately regularized, reliably solves dog tasks, which were previously solved mainly through model-based approaches.  ( 2 min )
    Epsilon-Greedy Thompson Sampling to Bayesian Optimization
    arXiv:2403.00540v1 Announce Type: new Abstract: Thompson sampling (TS) serves as a solution for addressing the exploitation-exploration dilemma in Bayesian optimization (BO). While it prioritizes exploration by randomly generating and maximizing sample paths of Gaussian process (GP) posteriors, TS weakly manages its exploitation by gathering information about the true objective function after each exploration is performed. In this study, we incorporate the epsilon-greedy ($\varepsilon$-greedy) policy, a well-established selection strategy in reinforcement learning, into TS to improve its exploitation. We first delineate two extremes of TS applied for BO, namely the generic TS and a sample-average TS. The former and latter promote exploration and exploitation, respectively. We then use $\varepsilon$-greedy policy to randomly switch between the two extremes. A small value of $\varepsilon \in (0,1)$ prioritizes exploitation, and vice versa. We empirically show that $\varepsilon$-greedy TS with an appropriate $\varepsilon$ is better than one of its two extremes and competes with the other.  ( 2 min )
    Robust Deep Reinforcement Learning Through Adversarial Attacks and Training : A Survey
    arXiv:2403.00420v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) is an approach for training autonomous agents across various complex environments. Despite its significant performance in well known environments, it remains susceptible to minor conditions variations, raising concerns about its reliability in real-world applications. To improve usability, DRL must demonstrate trustworthiness and robustness. A way to improve robustness of DRL to unknown changes in the conditions is through Adversarial Training, by training the agent against well suited adversarial attacks on the dynamics of the environment. Addressing this critical issue, our work presents an in-depth analysis of contemporary adversarial attack methodologies, systematically categorizing them and comparing their objectives and operational mechanisms. This classification offers a detailed insight into how adversarial attacks effectively act for evaluating the resilience of DRL agents, thereby paving the way for enhancing their robustness.  ( 2 min )
    A Survey of Geometric Graph Neural Networks: Data Structures, Models and Applications
    arXiv:2403.00485v1 Announce Type: new Abstract: Geometric graph is a special kind of graph with geometric features, which is vital to model many scientific problems. Unlike generic graphs, geometric graphs often exhibit physical symmetries of translations, rotations, and reflections, making them ineffectively processed by current Graph Neural Networks (GNNs). To tackle this issue, researchers proposed a variety of Geometric Graph Neural Networks equipped with invariant/equivariant properties to better characterize the geometry and topology of geometric graphs. Given the current progress in this field, it is imperative to conduct a comprehensive survey of data structures, models, and applications related to geometric GNNs. In this paper, based on the necessary but concise mathematical preliminaries, we provide a unified view of existing models from the geometric message passing perspective. Additionally, we summarize the applications as well as the related datasets to facilitate later research for methodology development and experimental evaluation. We also discuss the challenges and future potential directions of Geometric GNNs at the end of this survey.  ( 2 min )
    Fractal interpolation in the context of prediction accuracy optimization
    arXiv:2403.00403v1 Announce Type: new Abstract: This paper focuses on the hypothesis of optimizing time series predictions using fractal interpolation techniques. In general, the accuracy of machine learning model predictions is closely related to the quality and quantitative aspects of the data used, following the principle of \textit{garbage-in, garbage-out}. In order to quantitatively and qualitatively augment datasets, one of the most prevalent concerns of data scientists is to generate synthetic data, which should follow as closely as possible the actual pattern of the original data. This study proposes three different data augmentation strategies based on fractal interpolation, namely the \textit{Closest Hurst Strategy}, \textit{Closest Values Strategy} and \textit{Formula Strategy}. To validate the strategies, we used four public datasets from the literature, as well as a private dataset obtained from meteorological records in the city of Brasov, Romania. The prediction results obtained with the LSTM model using the presented interpolation strategies showed a significant accuracy improvement compared to the raw datasets, thus providing a possible answer to practical problems in the field of remote sensing and sensor sensitivity. Moreover, our methodologies answer some optimization-related open questions for the fractal interpolation step using \textit{Optuna} framework.  ( 2 min )
    Nonlinear Sheaf Diffusion in Graph Neural Networks
    arXiv:2403.00337v1 Announce Type: new Abstract: This work focuses on exploring the potential benefits of introducing a nonlinear Laplacian in Sheaf Neural Networks for graph-related tasks. The primary aim is to understand the impact of such nonlinearity on diffusion dynamics, signal propagation, and performance of neural network architectures in discrete-time settings. The study primarily emphasizes experimental analysis, using real-world and synthetic datasets to validate the practical effectiveness of different versions of the model. This approach shifts the focus from an initial theoretical exploration to demonstrating the practical utility of the proposed model.  ( 2 min )
    Scale-Invariant Gradient Aggregation for Constrained Multi-Objective Reinforcement Learning
    arXiv:2403.00282v1 Announce Type: new Abstract: Multi-objective reinforcement learning (MORL) aims to find a set of Pareto optimal policies to cover various preferences. However, to apply MORL in real-world applications, it is important to find policies that are not only Pareto optimal but also satisfy pre-defined constraints for safety. To this end, we propose a constrained MORL (CMORL) algorithm called Constrained Multi-Objective Gradient Aggregator (CoMOGA). Recognizing the difficulty of handling multiple objectives and constraints concurrently, CoMOGA relaxes the original CMORL problem into a constrained optimization problem by transforming the objectives into additional constraints. This novel transformation process ensures that the converted constraints are invariant to the objective scales while having the same effect as the original objectives. We show that the proposed method converges to a local Pareto optimal policy while satisfying the predefined constraints. Empirical evaluations across various tasks show that the proposed method outperforms other baselines by consistently meeting constraints and demonstrating invariance to the objective scales.  ( 2 min )
    Graph Construction with Flexible Nodes for Traffic Demand Prediction
    arXiv:2403.00276v1 Announce Type: new Abstract: Graph neural networks (GNNs) have been widely applied in traffic demand prediction, and transportation modes can be divided into station-based mode and free-floating traffic mode. Existing research in traffic graph construction primarily relies on map matching to construct graphs based on the road network. However, the complexity and inhomogeneity of data distribution in free-floating traffic demand forecasting make road network matching inflexible. To tackle these challenges, this paper introduces a novel graph construction method tailored to free-floating traffic mode. We propose a novel density-based clustering algorithm (HDPC-L) to determine the flexible positioning of nodes in the graph, overcoming the computational bottlenecks of traditional clustering algorithms and enabling effective handling of large-scale datasets. Furthermore, we extract valuable information from ridership data to initialize the edge weights of GNNs. Comprehensive experiments on two real-world datasets, the Shenzhen bike-sharing dataset and the Haikou ride-hailing dataset, show that the method significantly improves the performance of the model. On average, our models show an improvement in accuracy of around 25\% and 19.5\% on the two datasets. Additionally, it significantly enhances computational efficiency, reducing training time by approximately 12% and 32.5% on the two datasets. We make our code available at https://github.com/houjinyan/HDPC-L-ODInit.  ( 2 min )
    Shifted Interpolation for Differential Privacy
    arXiv:2403.00278v1 Announce Type: new Abstract: Noisy gradient descent and its variants are the predominant algorithms for differentially private machine learning. It is a fundamental question to quantify their privacy leakage, yet tight characterizations remain open even in the foundational setting of convex losses. This paper improves over previous analyses by establishing (and refining) the "privacy amplification by iteration" phenomenon in the unifying framework of $f$-differential privacy--which tightly captures all aspects of the privacy loss and immediately implies tighter privacy accounting in other notions of differential privacy, e.g., $(\varepsilon,\delta)$-DP and Renyi DP. Our key technical insight is the construction of shifted interpolated processes that unravel the popular shifted-divergences argument, enabling generalizations beyond divergence-based relaxations of DP. Notably, this leads to the first exact privacy analysis in the foundational setting of strongly convex optimization. Our techniques extend to many settings: convex/strongly convex, constrained/unconstrained, full/cyclic/stochastic batches, and all combinations thereof. As an immediate corollary, we recover the $f$-DP characterization of the exponential mechanism for strongly convex optimization in Gopi et al. (2022), and moreover extend this result to more general settings.  ( 2 min )
    ARED: Argentina Real Estate Dataset
    arXiv:2403.00273v1 Announce Type: new Abstract: The Argentinian real estate market presents a unique case study characterized by its unstable and rapidly shifting macroeconomic circumstances over the past decades. Despite the existence of a few datasets for price prediction, there is a lack of mixed modality datasets specifically focused on Argentina. In this paper, the first edition of ARED is introduced. A comprehensive real estate price prediction dataset series, designed for the Argentinian market. This edition contains information solely for Jan-Feb 2024. It was found that despite the short time range captured by this zeroth edition (44 days), time dependent phenomena has been occurring mostly on a market level (market as a whole). Nevertheless future editions of this dataset, will most likely contain historical data. Each listing in ARED comprises descriptive features, and variable-length sets of images.  ( 2 min )
    Cloud-based Federated Learning Framework for MRI Segmentation
    arXiv:2403.00254v1 Announce Type: new Abstract: In contemporary rural healthcare settings, the principal challenge in diagnosing brain images is the scarcity of available data, given that most of the existing deep learning models demand extensive training data to optimize their performance, necessitating centralized processing methods that potentially compromise data privacy. This paper proposes a novel framework tailored for brain tissue segmentation in rural healthcare facilities. The framework employs a deep reinforcement learning (DRL) environment in tandem with a refinement model (RM) deployed locally at rural healthcare sites. The proposed DRL model has a reduced parameter count and practicality for implementation across distributed rural sites. To uphold data privacy and enhance model generalization without transgressing privacy constraints, we employ federated learning (FL) for cooperative model training. We demonstrate the efficacy of our approach by training the network with a limited data set and observing a substantial performance enhancement, mitigating inaccuracies and irregularities in segmentation across diverse sites. Remarkably, the DRL model attains an accuracy of up to 80%, surpassing the capabilities of conventional convolutional neural networks when confronted with data insufficiency. Incorporating our RM results in an additional accuracy improvement of at least 10%, while FL contributes to a further accuracy enhancement of up to 5%. Collectively, the framework achieves an average 92% accuracy rate within rural healthcare settings characterized by data constraints.  ( 2 min )
    Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale
    arXiv:2403.00222v1 Announce Type: new Abstract: We study reinforcement learning for global decision-making in the presence of many local agents, where the global decision-maker makes decisions affecting all local agents, and the objective is to learn a policy that maximizes the rewards of both the global and the local agents. Such problems find many applications, e.g. demand response, EV charging, queueing, etc. In this setting, scalability has been a long-standing challenge due to the size of the state/action space which can be exponential in the number of agents. This work proposes the SUB-SAMPLE-Q algorithm where the global agent subsamples $k\leq n$ local agents to compute an optimal policy in time that is only exponential in $k$, providing an exponential speedup from standard methods that are exponential in $n$. We show that the learned policy converges to the optimal policy in the order of $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ as the number of sub-sampled agents $k$ increases, where $\epsilon_{k,m}$ is the Bellman noise. We also conduct numerical simulations in a demand-response setting and a queueing setting.  ( 2 min )
    Robust Policy Learning via Offline Skill Diffusion
    arXiv:2403.00225v1 Announce Type: new Abstract: Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets, which poses a challenge when attempting to learn a skill-based policy via RL for a target domain different from the datasets' domains. In this paper, we present a novel offline skill learning framework DuSkill which employs a guided Diffusion model to generate versatile skills extended from the limited skills in datasets, thereby enhancing the robustness of policy learning for tasks in different domains. Specifically, we devise a guided diffusion-based skill decoder in conjunction with the hierarchical encoding to disentangle the skill embedding space into two distinct representations, one for encapsulating domain-invariant behaviors and the other for delineating the factors that induce domain variations in the behaviors. Our DuSkill framework enhances the diversity of skills learned offline, thus enabling to accelerate the learning procedure of high-level policies for different domains. Through experiments, we show that DuSkill outperforms other skill-based imitation learning and RL algorithms for several long-horizon tasks, demonstrating its benefits in few-shot imitation and online RL.  ( 2 min )
    Impact of Decentralized Learning on Player Utilities in Stackelberg Games
    arXiv:2403.00188v1 Announce Type: new Abstract: When deployed in the world, a learning agent such as a recommender system or a chatbot often repeatedly interacts with another learning agent (such as a user) over time. In many such two-agent systems, each agent learns separately and the rewards of the two agents are not perfectly aligned. To better understand such cases, we examine the learning dynamics of the two-agent system and the implications for each agent's objective. We model these systems as Stackelberg games with decentralized learning and show that standard regret benchmarks (such as Stackelberg equilibrium payoffs) result in worst-case linear regret for at least one player. To better capture these systems, we construct a relaxed regret benchmark that is tolerant to small learning errors by agents. We show that standard learning algorithms fail to provide sublinear regret, and we develop algorithms to achieve near-optimal $O(T^{2/3})$ regret for both players with respect to these benchmarks. We further design relaxed environments under which faster learning ($O(\sqrt{T})$) is possible. Altogether, our results take a step towards assessing how two-agent interactions in sequential and decentralized learning environments affect the utility of both agents.  ( 2 min )
    Ask Your Distribution Shift if Pre-Training is Right for You
    arXiv:2403.00194v1 Announce Type: new Abstract: Pre-training is a widely used approach to develop models that are robust to distribution shifts. However, in practice, its effectiveness varies: fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others (compared to training from scratch). In this work, we seek to characterize the failure modes that pre-training can and cannot address. In particular, we focus on two possible failure modes of models under distribution shift: poor extrapolation (e.g., they cannot generalize to a different domain) and biases in the training data (e.g., they rely on spurious features). Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases. After providing theoretical motivation and empirical evidence for this finding, we explore two of its implications for developing robust models: (1) pre-training and interventions designed to prevent exploiting biases have complementary robustness benefits, and (2) fine-tuning on a (very) small, non-diverse but de-biased dataset can result in significantly more robust models than fine-tuning on a large and diverse but biased dataset. Code is available at https://github.com/MadryLab/pretraining-distribution-shift-robustness.  ( 2 min )
    Causal Graph ODE: Continuous Treatment Effect Modeling in Multi-agent Dynamical Systems
    arXiv:2403.00178v1 Announce Type: new Abstract: Real-world multi-agent systems are often dynamic and continuous, where the agents co-evolve and undergo changes in their trajectories and interactions over time. For example, the COVID-19 transmission in the U.S. can be viewed as a multi-agent system, where states act as agents and daily population movements between them are interactions. Estimating the counterfactual outcomes in such systems enables accurate future predictions and effective decision-making, such as formulating COVID-19 policies. However, existing methods fail to model the continuous dynamic effects of treatments on the outcome, especially when multiple treatments (e.g., "stay-at-home" and "get-vaccine" policies) are applied simultaneously. To tackle this challenge, we propose Causal Graph Ordinary Differential Equations (CAG-ODE), a novel model that captures the continuous interaction among agents using a Graph Neural Network (GNN) as the ODE function. The key innovation of our model is to learn time-dependent representations of treatments and incorporate them into the ODE function, enabling precise predictions of potential outcomes. To mitigate confounding bias, we further propose two domain adversarial learning-based objectives, which enable our model to learn balanced continuous representations that are not affected by treatments or interference. Experiments on two datasets (i.e., COVID-19 and tumor growth) demonstrate the superior performance of our proposed model.  ( 3 min )
    Non-Invasive Medical Digital Twins using Physics-Informed Self-Supervised Learning
    arXiv:2403.00177v1 Announce Type: new Abstract: A digital twin is a virtual replica of a real-world physical phenomena that uses mathematical modeling to characterize and simulate its defining features. By constructing digital twins for disease processes, we can perform in-silico simulations that mimic patients' health conditions and counterfactual outcomes under hypothetical interventions in a virtual setting. This eliminates the need for invasive procedures or uncertain treatment decisions. In this paper, we propose a method to identify digital twin model parameters using only noninvasive patient health data. We approach the digital twin modeling as a composite inverse problem, and observe that its structure resembles pretraining and finetuning in self-supervised learning (SSL). Leveraging this, we introduce a physics-informed SSL algorithm that initially pretrains a neural network on the pretext task of solving the physical model equations. Subsequently, the model is trained to reconstruct low-dimensional health measurements from noninvasive modalities while being constrained by the physical equations learned in pretraining. We apply our method to identify digital twins of cardiac hemodynamics using noninvasive echocardiogram videos, and demonstrate its utility in unsupervised disease detection and in-silico clinical trials.  ( 2 min )
    Privacy-Preserving Distributed Optimization and Learning
    arXiv:2403.00157v1 Announce Type: new Abstract: Distributed optimization and learning has recently garnered great attention due to its wide applications in sensor networks, smart grids, machine learning, and so forth. Despite rapid development, existing distributed optimization and learning algorithms require each agent to exchange messages with its neighbors, which may expose sensitive information and raise significant privacy concerns. In this survey paper, we overview privacy-preserving distributed optimization and learning methods. We first discuss cryptography, differential privacy, and other techniques that can be used for privacy preservation and indicate their pros and cons for privacy protection in distributed optimization and learning. We believe that among these approaches, differential privacy is most promising due to its low computational and communication complexities, which are extremely appealing for modern learning based applications with high dimensions of optimization variables. We then introduce several differential-privacy algorithms that can simultaneously ensure privacy and optimization accuracy. Moreover, we provide example applications in several machine learning problems to confirm the real-world effectiveness of these algorithms. Finally, we highlight some challenges in this research domain and discuss future directions.  ( 2 min )
    SoD$^2$: Statically Optimizing Dynamic Deep Neural Network
    arXiv:2403.00176v1 Announce Type: new Abstract: Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution, are becoming common. This paper presents SoD$^2$, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains. Our evaluation results show that SoD$^2$ runs up to $3.9\times$ faster than these systems while saving up to $88\%$ peak memory consumption.  ( 2 min )
    UniTS: Building a Unified Time Series Model
    arXiv:2403.00131v1 Announce Type: new Abstract: Foundation models, especially LLMs, are profoundly transforming deep learning. Instead of training many task-specific models, we can adapt a single pretrained model to many tasks via fewshot prompting or fine-tuning. However, current foundation models apply to sequence data but not to time series, which present unique challenges due to the inherent diverse and multidomain time series datasets, diverging task specifications across forecasting, classification and other types of tasks, and the apparent need for task-specialized models. We developed UNITS, a unified time series model that supports a universal task specification, accommodating classification, forecasting, imputation, and anomaly detection tasks. This is achieved through a novel unified network backbone, which incorporates sequence and variable attention along with a dynamic linear operator and is trained as a unified model. Across 38 multi-domain datasets, UNITS demonstrates superior performance compared to task-specific models and repurposed natural language-based LLMs. UNITS exhibits remarkable zero-shot, few-shot, and prompt learning capabilities when evaluated on new data domains and tasks. The source code and datasets are available at https://github.com/mims-harvard/UniTS.  ( 2 min )
    Towards Explaining Deep Neural Network Compression Through a Probabilistic Latent Space
    arXiv:2403.00155v1 Announce Type: new Abstract: Despite the impressive performance of deep neural networks (DNNs), their computational complexity and storage space consumption have led to the concept of network compression. While DNN compression techniques such as pruning and low-rank decomposition have been extensively studied, there has been insufficient attention paid to their theoretical explanation. In this paper, we propose a novel theoretical framework that leverages a probabilistic latent space of DNN weights and explains the optimal network sparsity by using the information-theoretic divergence measures. We introduce new analogous projected patterns (AP2) and analogous-in-probability projected patterns (AP3) notions for DNNs and prove that there exists a relationship between AP3/AP2 property of layers in the network and its performance. Further, we provide a theoretical analysis that explains the training process of the compressed network. The theoretical results are empirically validated through experiments conducted on standard pre-trained benchmarks, including AlexNet, ResNet50, and VGG16, using CIFAR10 and CIFAR100 datasets. Through our experiments, we highlight the relationship of AP3 and AP2 properties with fine-tuning pruned DNNs and sparsity levels.  ( 2 min )
    Federated Linear Contextual Bandits with Heterogeneous Clients
    arXiv:2403.00116v1 Announce Type: new Abstract: The demand for collaborative and private bandit learning across multiple agents is surging due to the growing quantity of data generated from distributed systems. Federated bandit learning has emerged as a promising framework for private, efficient, and decentralized online learning. However, almost all previous works rely on strong assumptions of client homogeneity, i.e., all participating clients shall share the same bandit model; otherwise, they all would suffer linear regret. This greatly restricts the application of federated bandit learning in practice. In this work, we introduce a new approach for federated bandits for heterogeneous clients, which clusters clients for collaborative bandit learning under the federated learning setting. Our proposed algorithm achieves non-trivial sub-linear regret and communication cost for all clients, subject to the communication protocol under federated learning that at anytime only one model can be shared by the server.  ( 2 min )
    Longitudinal Counterfactuals: Constraints and Opportunities
    arXiv:2403.00105v1 Announce Type: new Abstract: Counterfactual explanations are a common approach to providing recourse to data subjects. However, current methodology can produce counterfactuals that cannot be achieved by the subject, making the use of counterfactuals for recourse difficult to justify in practice. Though there is agreement that plausibility is an important quality when using counterfactuals for algorithmic recourse, ground truth plausibility continues to be difficult to quantify. In this paper, we propose using longitudinal data to assess and improve plausibility in counterfactuals. In particular, we develop a metric that compares longitudinal differences to counterfactual differences, allowing us to evaluate how similar a counterfactual is to prior observed changes. Furthermore, we use this metric to generate plausible counterfactuals. Finally, we discuss some of the inherent difficulties of using counterfactuals for recourse.  ( 2 min )
    On Robustness and Generalization of ML-Based Congestion Predictors to Valid and Imperceptible Perturbations
    arXiv:2403.00103v1 Announce Type: new Abstract: There is substantial interest in the use of machine learning (ML)-based techniques throughout the electronic computer-aided design (CAD) flow, particularly methods based on deep learning. However, while deep learning methods have achieved state-of-the-art performance in several applications, recent work has demonstrated that neural networks are generally vulnerable to small, carefully chosen perturbations of their input (e.g. a single pixel change in an image). In this work, we investigate robustness in the context of ML-based EDA tools -- particularly for congestion prediction. As far as we are aware, we are the first to explore this concept in the context of ML-based EDA. We first describe a novel notion of imperceptibility designed specifically for VLSI layout problems defined on netlists and cell placements. Our definition of imperceptibility is characterized by a guarantee that a perturbation to a layout will not alter its global routing. We then demonstrate that state-of-the-art CNN and GNN-based congestion models exhibit brittleness to imperceptible perturbations. Namely, we show that when a small number of cells (e.g. 1%-5% of cells) have their positions shifted such that a measure of global congestion is guaranteed to remain unaffected (e.g. 1% of the design adversarially shifted by 0.001% of the layout space results in a predicted decrease in congestion of up to 90%, while no change in congestion is implied by the perturbation). In other words, the quality of a predictor can be made arbitrarily poor (i.e. can be made to predict that a design is "congestion-free") for an arbitrary input layout. Next, we describe a simple technique to train predictors that improves robustness to these perturbations. Our work indicates that CAD engineers should be cautious when integrating neural network-based mechanisms in EDA flows to ensure robust and high-quality results.  ( 3 min )
    Learning to Deliver: a Foundation Model for the Montreal Capacitated Vehicle Routing Problem
    arXiv:2403.00026v1 Announce Type: new Abstract: In this paper, we present the Foundation Model for the Montreal Capacitated Vehicle Routing Problem (FM-MCVRP), a novel Deep Learning (DL) model that approximates high-quality solutions to a variant of the Capacitated Vehicle Routing Problem (CVRP) that characterizes many real-world applications. The so-called Montreal Capacitated Vehicle Routing Problem (MCVRP), first formally described by Bengio et al. (2021), is defined on a fixed and finite graph, which is analogous to a city. Each MCVRP instance is essentially the sub-graph connecting a randomly sampled subset of the nodes in the fixed graph, which represent a set of potential addresses in a real-world delivery problem on a given day. Our work exploits this problem structure to frame the MCVRP as an analogous Natural Language Processing (NLP) task. Specifically, we leverage a Transformer architecture embedded in a Large Language Model (LLM) framework to train our model in a supervised manner on computationally inexpensive, sub-optimal MCVRP solutions obtained algorithmically. Through comprehensive computational experiments, we show that FM-MCVRP produces better MCVRP solutions than the training data and generalizes to larger sized problem instances not seen during training. Even when compared to near-optimal solutions from state-of-the-art heuristics, FM-MCVRP yields competitive results despite being trained on inferior data. For instance, for 400-customer problems, FM-MCVRP solutions on average fall within 2% of the benchmark. Our results further demonstrate that unlike prior works in the literature, FM-MCVRP is a unified model, which performs consistently and reliably on a range of problem instance sizes and parameter values such as the vehicle capacity.  ( 3 min )
    Influencing Bandits: Arm Selection for Preference Shaping
    arXiv:2403.00036v1 Announce Type: new Abstract: We consider a non stationary multi-armed bandit in which the population preferences are positively and negatively reinforced by the observed rewards. The objective of the algorithm is to shape the population preferences to maximize the fraction of the population favouring a predetermined arm. For the case of binary opinions, two types of opinion dynamics are considered -- decreasing elasticity (modeled as a Polya urn with increasing number of balls) and constant elasticity (using the voter model). For the first case, we describe an Explore-then-commit policy and a Thompson sampling policy and analyse the regret for each of these policies. We then show that these algorithms and their analyses carry over to the constant elasticity case. We also describe a Thompson sampling based algorithm for the case when more than two types of opinions are present. Finally, we discuss the case where presence of multiple recommendation systems gives rise to a trade-off between their popularity and opinion shaping objectives.  ( 2 min )
    FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking
    arXiv:2403.00024v1 Announce Type: new Abstract: This paper presents FlowCyt, the first comprehensive benchmark for multi-class single-cell classification in flow cytometry data. The dataset comprises bone marrow samples from 30 patients, with each cell characterized by twelve markers. Ground truth labels identify five hematological cell types: T lymphocytes, B lymphocytes, Monocytes, Mast cells, and Hematopoietic Stem/Progenitor Cells (HSPCs). Experiments utilize supervised inductive learning and semi-supervised transductive learning on up to 1 million cells per patient. Baseline methods include Gaussian Mixture Models, XGBoost, Random Forests, Deep Neural Networks, and Graph Neural Networks (GNNs). GNNs demonstrate superior performance by exploiting spatial relationships in graph-encoded data. The benchmark allows standardized evaluation of clinically relevant classification tasks, along with exploratory analyses to gain insights into hematological cell phenotypes. This represents the first public flow cytometry benchmark with a richly annotated, heterogeneous dataset. It will empower the development and rigorous assessment of novel methodologies for single-cell analysis.  ( 2 min )
    Towards Interpreting Multi-Objective Feature Associations
    arXiv:2403.00017v1 Announce Type: new Abstract: Understanding how multiple features are associated and contribute to a specific objective is as important as understanding how each feature contributes to a particular outcome. Interpretability of a single feature in a prediction may be handled in multiple ways; however, in a multi-objective prediction, it is difficult to obtain interpretability of a combination of feature values. To address this issue, we propose an objective specific feature interaction design using multi-labels to find the optimal combination of features in agricultural settings. One of the novel aspects of this design is the identification of a method that integrates feature explanations with global sensitivity analysis in order to ensure combinatorial optimization in multi-objective settings. We have demonstrated in our preliminary experiments that an approximate combination of feature values can be found to achieve the desired outcome using two agricultural datasets: one with pre-harvest poultry farm practices for multi-drug resistance presence, and one with post-harvest poultry farm practices for food-borne pathogens. In our combinatorial optimization approach, all three pathogens are taken into consideration simultaneously to account for the interaction between conditions that favor different types of pathogen growth. These results indicate that explanation-based approaches are capable of identifying combinations of features that reduce pathogen presence in fewer iterations than a baseline.  ( 2 min )
    Transformer-based Parameter Estimation in Statistics
    arXiv:2403.00019v1 Announce Type: new Abstract: Parameter estimation is one of the most important tasks in statistics, and is key to helping people understand the distribution behind a sample of observations. Traditionally parameter estimation is done either by closed-form solutions (e.g., maximum likelihood estimation for Gaussian distribution), or by iterative numerical methods such as Newton-Raphson method when closed-form solution does not exist (e.g., for Beta distribution). In this paper we propose a transformer-based approach to parameter estimation. Compared with existing solutions, our approach does not require a closed-form solution or any mathematical derivations. It does not even require knowing the probability density function, which is needed by numerical methods. After the transformer model is trained, only a single inference is needed to estimate the parameters of the underlying distribution based on a sample of observations. In the empirical study we compared our approach with maximum likelihood estimation on commonly used distributions such as normal distribution, exponential distribution and beta distribution. It is shown that our approach achieves similar or better accuracy as measured by mean-square-errors.  ( 2 min )
    Deep Sensitivity Analysis for Objective-Oriented Combinatorial Optimization
    arXiv:2403.00016v1 Announce Type: new Abstract: Pathogen control is a critical aspect of modern poultry farming, providing important benefits for both public health and productivity. Effective poultry management measures to reduce pathogen levels in poultry flocks promote food safety by lowering risks of food-borne illnesses. They also support animal health and welfare by preventing infectious diseases that can rapidly spread and impact flock growth, egg production, and overall health. This study frames the search for optimal management practices that minimize the presence of multiple pathogens as a combinatorial optimization problem. Specifically, we model the various possible combinations of management settings as a solution space that can be efficiently explored to identify configurations that optimally reduce pathogen levels. This design incorporates a neural network feedback-based method that combines feature explanations with global sensitivity analysis to ensure combinatorial optimization in multiobjective settings. Our preliminary experiments have promising results when applied to two real-world agricultural datasets. While further validation is still needed, these early experimental findings demonstrate the potential of the model to derive targeted feature interactions that adaptively optimize pathogen control under varying real-world constraints.  ( 2 min )
    Prioritizing Informative Features and Examples for Deep Learning from Noisy Data
    arXiv:2403.00013v1 Announce Type: new Abstract: In this dissertation, we propose a systemic framework that prioritizes informative features and examples to enhance each stage of the development process. Specifically, we prioritize informative features and examples and improve the performance of feature learning, data labeling, and data selection. We first propose an approach to extract only informative features that are inherent to solving a target task by using auxiliary out-of-distribution data. We deactivate the noise features in the target distribution by using that in the out-of-distribution data. Next, we introduce an approach that prioritizes informative examples from unlabeled noisy data in order to reduce the labeling cost of active learning. In order to solve the purity-information dilemma, where an attempt to select informative examples induces the selection of many noisy examples, we propose a meta-model that finds the best balance between purity and informativeness. Lastly, we suggest an approach that prioritizes informative examples from labeled noisy data to preserve the performance of data selection. For labeled image noise data, we propose a data selection method that considers the confidence of neighboring samples to maintain the performance of the state-of-the-art Re-labeling models. For labeled text noise data, we present an instruction selection method that takes diversity into account for ranking the quality of instructions with prompting, thereby enhancing the performance of aligned large language models. Overall, our unified framework induces the deep learning development process robust to noisy data, thereby effectively mitigating noisy features and examples in real-world applications.  ( 3 min )
    PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling
    arXiv:2403.00012v1 Announce Type: new Abstract: Pre-routing timing prediction has been recently studied for evaluating the quality of a candidate cell placement in chip design. It involves directly estimating the timing metrics for both pin-level (slack, slew) and edge-level (net delay, cell delay), without time-consuming routing. However, it often suffers from signal decay and error accumulation due to the long timing paths in large-scale industrial circuits. To address these challenges, we propose a two-stage approach. First, we propose global circuit training to pre-train a graph auto-encoder that learns the global graph embedding from circuit netlist. Second, we use a novel node updating scheme for message passing on GCN, following the topological sorting sequence of the learned graph embedding and circuit graph. This scheme residually models the local time delay between two adjacent pins in the updating sequence, and extracts the lookup table information inside each cell via a new attention mechanism. To handle large-scale circuits efficiently, we introduce an order preserving partition scheme that reduces memory consumption while maintaining the topological dependencies. Experiments on 21 real world circuits achieve a new SOTA R2 of 0.93 for slack prediction, which is significantly surpasses 0.59 by previous SOTA method. Code will be available at: https://github.com/Thinklab-SJTU/EDA-AI.  ( 3 min )
    Introducing User Feedback-based Counterfactual Explanations (UFCE)
    arXiv:2403.00011v1 Announce Type: new Abstract: Machine learning models are widely used in real-world applications. However, their complexity makes it often challenging to interpret the rationale behind their decisions. Counterfactual explanations (CEs) have emerged as a viable solution for generating comprehensible explanations in eXplainable Artificial Intelligence (XAI). CE provides actionable information to users on how to achieve the desired outcome with minimal modifications to the input. However, current CE algorithms usually operate within the entire feature space when optimizing changes to turn over an undesired outcome, overlooking the identification of key contributors to the outcome and disregarding the practicality of the suggested changes. In this study, we introduce a novel methodology, that is named as user feedback-based counterfactual explanation (UFCE), which addresses these limitations and aims to bolster confidence in the provided explanations. UFCE allows for the inclusion of user constraints to determine the smallest modifications in the subset of actionable features while considering feature dependence, and evaluates the practicality of suggested changes using benchmark evaluation metrics. We conducted three experiments with five datasets, demonstrating that UFCE outperforms two well-known CE methods in terms of \textit{proximity}, \textit{sparsity}, and \textit{feasibility}. Reported results indicate that user constraints influence the generation of feasible CEs.  ( 2 min )
  • Open

    Bures-Wasserstein Means of Graphs
    arXiv:2305.19738v2 Announce Type: replace Abstract: Finding the mean of sampled data is a fundamental task in machine learning and statistics. However, in cases where the data samples are graph objects, defining a mean is an inherently difficult task. We propose a novel framework for defining a graph mean via embeddings in the space of smooth graph signal distributions, where graph similarity can be measured using the Wasserstein metric. By finding a mean in this embedding space, we can recover a mean graph that preserves structural information. We establish the existence and uniqueness of the novel graph mean, and provide an iterative algorithm for computing it. To highlight the potential of our framework as a valuable tool for practical applications in machine learning, it is evaluated on various tasks, including k-means clustering of structured aligned graphs, classification of functional brain networks, and semi-supervised node classification in multi-layer graphs. Our experimental results demonstrate that our approach achieves consistent performance, outperforms existing baseline approaches, and improves the performance of state-of-the-art methods.  ( 2 min )
    Bayesian Robust Optimization for Imitation Learning
    arXiv:2007.12315v4 Announce Type: replace-cross Abstract: One of the main challenges in imitation learning is determining what action an agent should take when outside the state distribution of the demonstrations. Inverse reinforcement learning (IRL) can enable generalization to new states by learning a parameterized reward function, but these approaches still face uncertainty over the true reward function and corresponding optimal policy. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches either optimize a policy for the mean or MAP reward function. While completely ignoring risk can lead to overly aggressive and unsafe policies, optimizing in a fully adversarial sense is also problematic as it can lead to overly conservative policies that perform poorly in practice. To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference and a user specific risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk. Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms. Code is available at https://github.com/dsbrown1331/broil.  ( 3 min )
    Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
    arXiv:2310.02304v2 Announce Type: replace-cross Abstract: Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. A variety of self-improvement strategies are proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our experiments, is capable of writing code that can call itself to improve itself. We consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox.  ( 2 min )
    Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data
    arXiv:2308.01839v2 Announce Type: replace-cross Abstract: Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.  ( 3 min )
    Spectral Ranking Inferences based on General Multiway Comparisons
    arXiv:2308.02918v3 Announce Type: replace-cross Abstract: This paper studies the performance of the spectral method in the estimation and uncertainty quantification of the unobserved preference scores of compared entities in a general and more realistic setup. Specifically, the comparison graph consists of hyper-edges of possible heterogeneous sizes, and the number of comparisons can be as low as one for a given hyper-edge. Such a setting is pervasive in real applications, circumventing the need to specify the graph randomness and the restrictive homogeneous sampling assumption imposed in the commonly used Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Furthermore, in scenarios where the BTL or PL models are appropriate, we unravel the relationship between the spectral estimator and the Maximum Likelihood Estimator (MLE). We discover that a two-step spectral method, where we apply the optimal weighting estimated from the equal weighting vanilla spectral method, can achieve the same asymptotic efficiency as the MLE. Given the asymptotic distributions of the estimated preference scores, we also introduce a comprehensive framework to carry out both one-sample and two-sample ranking inferences, applicable to both fixed and random graph settings. It is noteworthy that this is the first time effective two-sample rank testing methods have been proposed. Finally, we substantiate our findings via comprehensive numerical simulations and subsequently apply our developed methodologies to perform statistical inferences for statistical journals and movie rankings.  ( 3 min )
    The Risks of Recourse in Binary Classification
    arXiv:2306.00497v2 Announce Type: replace-cross Abstract: Algorithmic recourse provides explanations that help users overturn an unfavorable decision by a machine learning system. But so far very little attention has been paid to whether providing recourse is beneficial or not. We introduce an abstract learning-theoretic framework that compares the risks (i.e., expected losses) for classification with and without algorithmic recourse. This allows us to answer the question of when providing recourse is beneficial or harmful at the population level. Surprisingly, we find that there are many plausible scenarios in which providing recourse turns out to be harmful, because it pushes users to regions of higher class uncertainty and therefore leads to more mistakes. We further study whether the party deploying the classifier has an incentive to strategize in anticipation of having to provide recourse, and we find that sometimes they do, to the detriment of their users. Providing algorithmic recourse may therefore also be harmful at the systemic level. We confirm our theoretical findings in experiments on simulated and real-world data. All in all, we conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.  ( 2 min )
    Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training
    arXiv:2306.08173v2 Announce Type: replace-cross Abstract: The surge in multimodal AI's success has sparked concerns over data privacy in vision-and-language tasks. While CLIP has revolutionized multimodal learning through joint training on images and text, its potential to unintentionally disclose sensitive information necessitates the integration of privacy-preserving mechanisms. We introduce a differentially private adaptation of the Contrastive Language-Image Pretraining (CLIP) model that effectively addresses privacy concerns while retaining accuracy. Our proposed method, Dp-CLIP, is rigorously evaluated on benchmark datasets encompassing diverse vision-and-language tasks such as image classification and visual question answering. We demonstrate that our approach retains performance on par with the standard non-private CLIP model. Furthermore, we analyze our proposed algorithm under linear representation settings. We derive the convergence rate of our algorithm and show a trade-off between utility and privacy when gradients are clipped per-batch and the loss function does not satisfy smoothness conditions assumed in the literature for the analysis of DP-SGD.  ( 2 min )
    Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees
    arXiv:2210.01282v3 Announce Type: replace-cross Abstract: We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper we propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator sublinearly. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.  ( 3 min )
    Making SGD Parameter-Free
    arXiv:2205.02160v3 Announce Type: replace-cross Abstract: We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.  ( 2 min )
    Symplectic ODE-Net: Learning Hamiltonian Dynamics with Control
    arXiv:1909.12077v5 Announce Type: replace-cross Abstract: In this paper, we introduce Symplectic ODE-Net (SymODEN), a deep learning framework which can infer the dynamics of a physical system, given by an ordinary differential equation (ODE), from observed state trajectories. To achieve better generalization with fewer training samples, SymODEN incorporates appropriate inductive bias by designing the associated computation graph in a physics-informed manner. In particular, we enforce Hamiltonian dynamics with control to learn the underlying dynamics in a transparent way, which can then be leveraged to draw insight about relevant physical aspects of the system, such as mass and potential energy. In addition, we propose a parametrization which can enforce this Hamiltonian formalism even when the generalized coordinate data is embedded in a high-dimensional space or we can only access velocity data instead of generalized momentum. This framework, by offering interpretable, physically-consistent models for physical systems, opens up new possibilities for synthesizing model-based control strategies.  ( 3 min )
    On Rate-Optimal Partitioning Classification from Observable and from Privatised Data
    arXiv:2312.14889v2 Announce Type: replace Abstract: In this paper we revisit the classical method of partitioning classification and study its convergence rate under relaxed conditions, both for observable (non-privatised) and for privatised data. Let the feature vector $X$ take values in $\mathbb{R}^d$ and denote its label by $Y$. Previous results on the partitioning classifier worked with the strong density assumption, which is restrictive, as we demonstrate through simple examples. We assume that the distribution of $X$ is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated to a $d_a$ dimensional subspace. Here, we study the problem under much milder assumptions: in addition to the standard Lipschitz and margin conditions, a novel characteristic of the absolutely continuous component is introduced, by which the exact convergence rate of the classification error probability is calculated, both for the binary and for the multi-label cases. Interestingly, this rate of convergence depends only on the intrinsic dimension $d_a$. The privacy constraints mean that the data $(X_1,Y_1), \dots ,(X_n,Y_n)$ cannot be directly observed, and the classifiers are functions of the randomised outcome of a suitable local differential privacy mechanism. The statistician is free to choose the form of this privacy mechanism, and here we add Laplace distributed noises to the discontinuations of all possible locations of the feature vector $X_i$ and to its label $Y_i$. Again, tight upper bounds on the rate of convergence of the classification error probability are derived, without the strong density assumption, such that this rate depends on $2\,d_a$.  ( 3 min )
    Interpretable Feature Learning in Multivariate Big Data Analysis for Network Monitoring
    arXiv:1907.02677v3 Announce Type: replace-cross Abstract: There is an increasing interest in the development of new data-driven models useful to assess the performance of communication networks. For many applications, like network monitoring and troubleshooting, a data model is of little use if it cannot be interpreted by a human operator. In this paper, we present an extension of the Multivariate Big Data Analysis (MBDA) methodology, a recently proposed interpretable data analysis tool. In this extension, we propose a solution to the automatic derivation of features, a cornerstone step for the application of MBDA when the amount of data is massive. The resulting network monitoring approach allows us to detect and diagnose disparate network anomalies, with a data-analysis workflow that combines the advantages of interpretable and interactive models with the power of parallel processing. We apply the extended MBDA to two case studies: UGR'16, a benchmark flow-based real-traffic dataset for anomaly detection, and Dartmouth'18, the longest and largest Wi-Fi trace known to date.  ( 3 min )
    Distributional Bellman Operators over Mean Embeddings
    arXiv:2312.07358v2 Announce Type: replace Abstract: We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework, provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning, and obtain a new deep RL agent that improves over baseline distributional approaches on the Arcade Learning Environment.  ( 2 min )
    Implicit regularization of deep residual networks towards neural ODEs
    arXiv:2309.01213v2 Announce Type: replace Abstract: Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.  ( 2 min )
    Global universal approximation of functional input maps on weighted spaces
    arXiv:2306.03303v3 Announce Type: replace Abstract: We introduce so-called functional input neural networks defined on a possibly infinite dimensional weighted space with values also in a possibly infinite dimensional output space. To this end, we use an additive family to map the input weighted space to the hidden layer, on which a non-linear scalar activation function is applied to each neuron, and finally return the output via some linear readouts. Relying on Stone-Weierstrass theorems on weighted spaces, we can prove a global universal approximation result on weighted spaces for continuous functions going beyond the usual approximation on compact sets. This then applies in particular to approximation of (non-anticipative) path space functionals via functional input neural networks. As a further application of the weighted Stone-Weierstrass theorem we prove a global universal approximation result for linear functions of the signature. We also introduce the viewpoint of Gaussian process regression in this setting and emphasize that the reproducing kernel Hilbert space of the signature kernels are Cameron-Martin spaces of certain Gaussian processes. This paves a way towards uncertainty quantification for signature kernel regression.  ( 2 min )
    Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD
    arXiv:2305.18502v2 Announce Type: replace Abstract: This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario $n=O(d \log d)$ samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.  ( 2 min )
    Hierarchical clustering with dot products recovers hidden tree structure
    arXiv:2305.15022v3 Announce Type: replace Abstract: In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.  ( 2 min )
    Defining Expertise: Applications to Treatment Effect Estimation
    arXiv:2403.00694v1 Announce Type: new Abstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and "expertise" is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection.  ( 2 min )
    Epsilon-Greedy Thompson Sampling to Bayesian Optimization
    arXiv:2403.00540v1 Announce Type: cross Abstract: Thompson sampling (TS) serves as a solution for addressing the exploitation-exploration dilemma in Bayesian optimization (BO). While it prioritizes exploration by randomly generating and maximizing sample paths of Gaussian process (GP) posteriors, TS weakly manages its exploitation by gathering information about the true objective function after each exploration is performed. In this study, we incorporate the epsilon-greedy ($\varepsilon$-greedy) policy, a well-established selection strategy in reinforcement learning, into TS to improve its exploitation. We first delineate two extremes of TS applied for BO, namely the generic TS and a sample-average TS. The former and latter promote exploration and exploitation, respectively. We then use $\varepsilon$-greedy policy to randomly switch between the two extremes. A small value of $\varepsilon \in (0,1)$ prioritizes exploitation, and vice versa. We empirically show that $\varepsilon$-greedy TS with an appropriate $\varepsilon$ is better than one of its two extremes and competes with the other.  ( 2 min )
    Scalable Learning of Item Response Theory Models
    arXiv:2403.00680v1 Announce Type: cross Abstract: Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for their use in alternating IRT training algorithms, facilitating scalable learning from large data.  ( 2 min )
    Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting
    arXiv:2305.14916v2 Announce Type: replace Abstract: We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning.  ( 2 min )
    Global Convergence Rate of Deep Equilibrium Models with General Activations
    arXiv:2302.05797v3 Announce Type: replace Abstract: In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation. They proved that the gradient descent converges to a globally optimal solution for the quadratic loss function at a linear convergence rate. This paper shows that this fact still holds for DEQs with any generally bounded activation with bounded first and second derivatives. Since the new activation function is generally non-homogeneous, bounding the least eigenvalue of the Gram matrix of the equilibrium point is particularly challenging. To accomplish this task, we must create a novel population Gram matrix and develop a new form of dual activation with Hermite polynomial expansion.  ( 2 min )
    SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits
    arXiv:2301.12357v3 Announce Type: replace Abstract: In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first work that focuses on such optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the value of the target policy. We then use this formulation to derive the optimal allocation of samples per action during data collection. We then introduce a novel algorithm SPEED (Structured Policy Evaluation Experimental Design) that tracks the optimal design and derive its regret with respect to the optimal design. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean
    arXiv:2301.11862v2 Announce Type: replace Abstract: Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models. The code is available at the following link: https://github.com/AnFreTh/NAMpy  ( 2 min )
    Indirectly Parameterized Concrete Autoencoders
    arXiv:2403.00563v1 Announce Type: cross Abstract: Feature selection is a crucial task in settings where data is high-dimensional or acquiring the full set of features is costly. Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications. Concrete Autoencoders (CAEs), considered state-of-the-art in embedded feature selection, may struggle to achieve stable joint optimization, hurting their training time and generalization. In this work, we identify that this instability is correlated with the CAE learning duplicate selections. To remedy this, we propose a simple and effective improvement: Indirectly Parameterized CAEs (IP-CAEs). IP-CAEs learn an embedding and a mapping from it to the Gumbel-Softmax distributions' parameters. Despite being simple to implement, IP-CAE exhibits significant and consistent improvements over CAE in both generalization and training time across several datasets for reconstruction and classification. Unlike CAE, IP-CAE effectively leverages non-linear relationships and does not require retraining the jointly optimized decoder. Furthermore, our approach is, in principle, generalizable to Gumbel-Softmax distributions beyond feature selection.  ( 2 min )
    Adaptive Learning Rate for Follow-the-Regularized-Leader: Competitive Ratio Analysis and Best-of-Both-Worlds
    arXiv:2403.00715v1 Announce Type: cross Abstract: Follow-The-Regularized-Leader (FTRL) is known as an effective and versatile approach in online learning, where appropriate choice of the learning rate is crucial for smaller regret. To this end, we formulate the problem of adjusting FTRL's learning rate as a sequential decision-making problem and introduce the framework of competitive analysis. We establish a lower bound for the competitive ratio and propose update rules for learning rate that achieves an upper bound within a constant factor of this lower bound. Specifically, we illustrate that the optimal competitive ratio is characterized by the (approximate) monotonicity of components of the penalty term, showing that a constant competitive ratio is achievable if the components of the penalty term form a monotonically non-increasing sequence, and derive a tight competitive ratio when penalty terms are $\xi$-approximately monotone non-increasing. Our proposed update rule, referred to as \textit{stability-penalty matching}, also facilitates constructing the Best-Of-Both-Worlds (BOBW) algorithms for stochastic and adversarial environments. In these environments our result contributes to achieve tighter regret bound and broaden the applicability of algorithms for various settings such as multi-armed bandits, graph bandits, linear bandits, and contextual bandits.  ( 2 min )
    Shifted Interpolation for Differential Privacy
    arXiv:2403.00278v1 Announce Type: cross Abstract: Noisy gradient descent and its variants are the predominant algorithms for differentially private machine learning. It is a fundamental question to quantify their privacy leakage, yet tight characterizations remain open even in the foundational setting of convex losses. This paper improves over previous analyses by establishing (and refining) the "privacy amplification by iteration" phenomenon in the unifying framework of $f$-differential privacy--which tightly captures all aspects of the privacy loss and immediately implies tighter privacy accounting in other notions of differential privacy, e.g., $(\varepsilon,\delta)$-DP and Renyi DP. Our key technical insight is the construction of shifted interpolated processes that unravel the popular shifted-divergences argument, enabling generalizations beyond divergence-based relaxations of DP. Notably, this leads to the first exact privacy analysis in the foundational setting of strongly convex optimization. Our techniques extend to many settings: convex/strongly convex, constrained/unconstrained, full/cyclic/stochastic batches, and all combinations thereof. As an immediate corollary, we recover the $f$-DP characterization of the exponential mechanism for strongly convex optimization in Gopi et al. (2022), and moreover extend this result to more general settings.  ( 2 min )
    Substitute adjustment via recovery of latent variables
    arXiv:2403.00202v1 Announce Type: cross Abstract: The deconfounder was proposed as a method for estimating causal parameters in a context with multiple causes and unobserved confounding. It is based on recovery of a latent variable from the observed causes. We disentangle the causal interpretation from the statistical estimation problem and show that the deconfounder in general estimates adjusted regression target parameters. It does so by outcome regression adjusted for the recovered latent variable termed the substitute. We refer to the general algorithm, stripped of causal assumptions, as substitute adjustment. We give theoretical results to support that substitute adjustment estimates adjusted regression parameters when the regressors are conditionally independent given the latent variable. We also introduce a variant of our substitute adjustment algorithm that estimates an assumption-lean target parameter with minimal model assumptions. We then give finite sample bounds and asymptotic results supporting substitute adjustment estimation in the case where the latent variable takes values in a finite set. A simulation study illustrates finite sample properties of substitute adjustment. Our results support that when the latent variable model of the regressors hold, substitute adjustment is a viable method for adjusted regression.  ( 2 min )
    Sharp bounds for the max-sliced Wasserstein distance
    arXiv:2403.00666v1 Announce Type: cross Abstract: We obtain sharp upper and lower bounds for the expected max-sliced 1-Wasserstein distance between a probability measure on a separable Hilbert space and its empirical distribution from $n$ samples. A version of this result for probability measures on Banach spaces is also obtained.  ( 2 min )
    Transformer-based Parameter Estimation in Statistics
    arXiv:2403.00019v1 Announce Type: cross Abstract: Parameter estimation is one of the most important tasks in statistics, and is key to helping people understand the distribution behind a sample of observations. Traditionally parameter estimation is done either by closed-form solutions (e.g., maximum likelihood estimation for Gaussian distribution), or by iterative numerical methods such as Newton-Raphson method when closed-form solution does not exist (e.g., for Beta distribution). In this paper we propose a transformer-based approach to parameter estimation. Compared with existing solutions, our approach does not require a closed-form solution or any mathematical derivations. It does not even require knowing the probability density function, which is needed by numerical methods. After the transformer model is trained, only a single inference is needed to estimate the parameters of the underlying distribution based on a sample of observations. In the empirical study we compared our approach with maximum likelihood estimation on commonly used distributions such as normal distribution, exponential distribution and beta distribution. It is shown that our approach achieves similar or better accuracy as measured by mean-square-errors.  ( 2 min )
    On Cyclical MCMC Sampling
    arXiv:2403.00230v1 Announce Type: cross Abstract: Cyclical MCMC is a novel MCMC framework recently proposed by Zhang et al. (2019) to address the challenge posed by high-dimensional multimodal posterior distributions like those arising in deep learning. The algorithm works by generating a nonhomogeneous Markov chain that tracks -- cyclically in time -- tempered versions of the target distribution. We show in this work that cyclical MCMC converges to the desired probability distribution in settings where the Markov kernels used are fast mixing, and sufficiently long cycles are employed. However in the far more common settings of slow mixing kernels, the algorithm may fail to produce samples from the desired distribution. In particular, in a simple mixture example with unequal variance, we show by simulation that cyclical MCMC fails to converge to the desired limit. Finally, we show that cyclical MCMC typically estimates well the local shape of the target distribution around each mode, even when we do not have convergence to the target.  ( 2 min )
    Causal Bandits with General Causal Models and Interventions
    arXiv:2403.00233v1 Announce Type: new Abstract: This paper considers causal bandits (CBs) for the sequential design of interventions in a causal system. The objective is to optimize a reward function via minimizing a measure of cumulative regret with respect to the best sequence of interventions in hindsight. The paper advances the results on CBs in three directions. First, the structural causal models (SCMs) are assumed to be unknown and drawn arbitrarily from a general class $\mathcal{F}$ of Lipschitz-continuous functions. Existing results are often focused on (generalized) linear SCMs. Second, the interventions are assumed to be generalized soft with any desired level of granularity, resulting in an infinite number of possible interventions. The existing literature, in contrast, generally adopts atomic and hard interventions. Third, we provide general upper and lower bounds on regret. The upper bounds subsume (and improve) known bounds for special cases. The lower bounds are generally hitherto unknown. These bounds are characterized as functions of the (i) graph parameters, (ii) eluder dimension of the space of SCMs, denoted by $\operatorname{dim}(\mathcal{F})$, and (iii) the covering number of the function space, denoted by ${\rm cn}(\mathcal{F})$. Specifically, the cumulative achievable regret over horizon $T$ is $\mathcal{O}(K d^{L-1}\sqrt{T\operatorname{dim}(\mathcal{F}) \log({\rm cn}(\mathcal{F}))})$, where $K$ is related to the Lipschitz constants, $d$ is the graph's maximum in-degree, and $L$ is the length of the longest causal path. The upper bound is further refined for special classes of SCMs (neural network, polynomial, and linear), and their corresponding lower bounds are provided.  ( 3 min )
    Entry-Specific Bounds for Low-Rank Matrix Completion under Highly Non-Uniform Sampling
    arXiv:2403.00184v1 Announce Type: new Abstract: Low-rank matrix completion concerns the problem of estimating unobserved entries in a matrix using a sparse set of observed entries. We consider the non-uniform setting where the observed entries are sampled with highly varying probabilities, potentially with different asymptotic scalings. We show that under structured sampling probabilities, it is often better and sometimes optimal to run estimation algorithms on a smaller submatrix rather than the entire matrix. In particular, we prove error upper bounds customized to each entry, which match the minimax lower bounds under certain conditions. Our bounds characterize the hardness of estimating each entry as a function of the localized sampling probabilities. We provide numerical experiments that confirm our theoretical findings.  ( 2 min )
    "Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach
    arXiv:2403.00258v1 Announce Type: new Abstract: Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the price of increased depth and having more parameters per layer, making their training and inference more computationally challenging. In an attempt to address this key limitation, efforts have been devoted to the compression (e.g., sparsification and/or quantization) of these large-scale machine learning models, so that they can be deployed on low-power IoT devices. In this paper, building upon recent advances in neural tangent kernel (NTK) and random matrix theory (RMT), we provide a novel compression approach to wide and fully-connected \emph{deep} neural nets. Specifically, we demonstrate that in the high-dimensional regime where the number of data points $n$ and their dimension $p$ are both large, and under a Gaussian mixture model for the data, there exists \emph{asymptotic spectral equivalence} between the NTK matrices for a large family of DNN models. This theoretical result enables "lossless" compression of a given DNN to be performed, in the sense that the compressed network yields asymptotically the same NTK as the original (dense and unquantized) network, with its weights and activations taking values \emph{only} in $\{ 0, \pm 1 \}$ up to a scaling. Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme, with code available at \url{https://github.com/Model-Compression/Lossless_Compression}.  ( 3 min )
    Validation of ML-UQ calibration statistics using simulated reference values: a sensitivity analysis
    arXiv:2403.00423v1 Announce Type: new Abstract: Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. In consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to palliate this problem. As the generative probability distribution for the simulation of synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, shedding a doubt on the calibration diagnostic. This study explores various facets of this problem, and shows that some statistics are excessively sensitive to the choice of generative distribution to be used for validation when the generative distribution is unknown. This is the case, for instance, of the correlation coefficient between absolute errors and uncertainties (CC) and of the expected normalized calibration error (ENCE). A robust validation workflow to deal with simulated reference values is proposed.  ( 2 min )

  • Open

    [D] Anonymity Period
    This policy for ACL conferences are gone right? I just wanted to release my work on arXiv and maybe Twitter or LinkedIn. submitted by /u/BigDreamx [link] [comments]
    [D] On the data linitations for any large models
    Hi, I'm not by any chance an expert in this subject. There is one question that's been bothering me for a while. I wanted to post this on this subreddit for a long time but I did not know if this was the right place to ask. I'm posting it here as I'm unable to find answers elsewhere. The question is related to the data that's being fed into the large modles be it vison or language. Due to the sudden explosion of AI a lot of new content that's openly available is AI generated content. I don't have a ratio of the ai vs human generated content but I assume the AI proportion would have an exponential increase in the coming days. Most of these modles rely on huge data sets, I assumes models like GPT-n have been trained on I don't know hundereds of years of human generated data. Also I feel l…
    How to frame a transport network passenger load forecasting tool? [Discussion] [Project]
    I'm trying to do some near-time per-transport-load-forecasting for a network, but I'm having trouble squaring my understanding of the problem with the literature I can find. I think this idea kind of works across many ideas of transport and logistics networks eg bus / rail / air / postal / delivery networks, but I have had very little luck finding examples of these in the literature - what I see is forecast modelling of regular time series eg IoT sensor data/stock market data etc, which feels very different in ways I hope are clear below. I'm going round in circles unable to tell if I just _think_ it's complicated, or if I need to make some genuine simplifications / modelling assumptions. ​ ## The problem space: I have a transport network where I want to predict the number of passenger…
    [D] Is there an equivalent of 'The Pile' for video data?
    Is there a large chunk of large unlabeled video the same way there is for text? submitted by /u/PM_ME_JOB_OFFER [link] [comments]
    [D] : Scale PDF Q&A App to 10K Users with GPUs – <$250/Mo
    Hello everyone, Check out this step-by-step detailed tutorial on building and scaling a PDF Q&A Application using Pinecone, Langchain and Inferless ​ Architecture Alongside, the detailed quick deploy guide, it also includes cost analysis on how you can save upto 84% cost with an example of processing 3000 documents and nearly 10,000 queries every month, all while dramatically cutting your costs from $1800 ( AWS) to just $250 a month on Inferless. Here is the tutorial - https://cookbook.inferless.com/ If you resonate, join the discussion on Hackernews here - https://news.ycombinator.com/item?id=39594588 submitted by /u/Tiny_Cut_8440 [link] [comments]
    [D] Image generation experiment with mnist images - not working quite as expected
    I’m trying a simple experiment using mnist images. Not sure if this is a viable idea or not. It’s not working as expected so I thought I’d get some feedback. What I’m doing is basically take mnist images, ignore the labels but instead have a simple model that takes a random image (generated with torch.rand) and uses the mnist images as output. I trained the model with the entire training dataset (60k). Now, when I run the model with a random image as input, it always generates a fixed output image, which happens to be the number 9. Is this a reasonable approach? Since every single input image in the training is random, I’m not sure what the model would actually learn. But it is generating "valid" output at the end, just not what I'd expect (which is one of the 10 numbers randomly). Any thoughts or explanation of what I’m seeing is appreciated. submitted by /u/gamesntech [link] [comments]
    [R][D] Enhancing Soil Moisture Predictions Using Multimodal Data Integration in Agriculture
    Greetings, ​ I am exploring an interdisciplinary research area involving multimodal data, focusing on agriculture. My study incorporates both visual and tabular data: crop and soil images from three distinct locations, alongside corresponding meteorological information (temperature, humidity, solar radiation, etc.), collected hourly from 9 am to 5 pm. The objective is to enhance soil moisture predictions by integrating these data modalities, aiming to estimate soil moisture accurately from crop images. ​ Could you please suggest methodologies for effectively combining these data types? I've experimented with or considered the following: ​ Feature Concatenation, utilizing MLPs/Random Forest for meteorological data and CNNs for image analysis. Conv-LSTMs. Vision Transformers. Additionally, I'm curious about strategies for incorporating sequential soil patch images into models like Conv-LSTMs to capture both temporal and visual features. Do you think this approach is viable? ​ I appreciate any insights or advice from the community. Thank you for your time. submitted by /u/MohammedRakib [link] [comments]
    [D] Why do GLUs (Gated Linear Units) work?
    Today, GLU variants, such as SwiGLU, are used very often in LLMs. But the paper, "GLU Variants Improve Transformer" (https://arxiv.org/pdf/2002.05202.pdf), just says that "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." I also found the explanation in the original GLU paper to be unsatisfactory. They said that it has cleaner gradient, but I thought that this issue was already solved by residual connection. Does anyone have any theoretical or even intuitive explanation as towards why GLU is successful? My thought process is that it allows each token to learn a transformation that is unique to themselves, which improves the expressivity of the MLP, but I'm not sure if that is reasonable. EDIT: The Falcon technical report (link here: https://arxiv.org/pdf/2311.16867.pdf, page 14) discusses that GLU is used because it increases parameter count by 50%, saying: "Scaling-wise, GLU activations have been preferred as well, as they increase the size of the MLP (doubling its first layer), shifting more compute towards simple matrix multiplications." However, I don't think I'm convinced that this is really the case; after all, why can't you just increase the number of neurons in the hidden layer of the MLP by 50%, which has the same memory cost and parameter increase. submitted by /u/cofapie [link] [comments]
    [R] Introducing the next generation of Claude
    https://www.anthropic.com/news/claude-3-family Today, we're announcing the Claude 3 model family, which sets new industry benchmarks across a wide range of cognitive tasks. The family includes three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Each successive model offers increasingly powerful performance, allowing users to select the optimal balance of intelligence, speed, and cost for their specific application. Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K), and more. It exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence. submitted by /u/RobbinDeBank [link] [comments]
    [D] Asking for Reviews
    I recently withdrew my non-archival ICLR workshop submissions because I wanted to simply make a submission when my paper was more concrete. At the level that I submitted, it was mostly just showing an idea and very tiny experiment results. So, I withdrew a few days before decision notification. Now that the decisions are released, I actually am curious of the reviews that my paper got. Is it unethical or bad in someway to reach out to the workshop chairs and ask for the reviews? submitted by /u/BigDreamx [link] [comments]
    [D] Resources suggestions
    I find equations, theorems, and proofs hard to understand and cannot connect consecutive steps. I would appreciate any suggestionss to improve my ability to understand those. I would appreciate any resources, guides, books to follow. submitted by /u/snasat [link] [comments]
    [D] XGBoost + Imbalanced data for Binary Classification, Overfitting but Acceptable Classification Report?
    TLDR How would I fix overfitting without sacrificing the performance of Class 1? Is overfitting normal in a highly imbalanced dataset? Based on the Class 1 performance and the goal of this project to detect Class 1, do you think this is good enough for deployment? For context, my data is imbalanced (700: 5500 samples). But in real life, the imbalance would be twice. This binary classifier was created to predict the minority class. I have used XGBoost and RandomForest, both with class weights. I use F1 for scoring and eval metrics = log loss for XGBoost. With Hyperparameter tuning. for early stopping ecal_metrics = mae. The best result is XGBoost Class precision recall f1 support 0 .97 .99 .98 1101 1 (I want this) .93 .82 .87 166 acc 0.97 1267 macro avg 0.95 0.9 0.93 1267 weighted avg .97 .97 .97 1267 mean training f1 score best estimator: .981mean validation f1 score for best estimator: .821 If I try to change features and parameters to balance the train and val scores. the performance is significantly worse, with Class 1 having around .65 recall. For my use case is my current trained model, good enough for production, even tho it overfits? I have advocated deploying the model and collecting new data for Class 1 over time to retrain the model with more minority class data. submitted by /u/Vveriant [link] [comments]
    [D] implementation correctness measures
    So the problem I am talking about is when you implement the same algorithem but in a diffrent way (say u port to gpu or a framework etc) and you want to be sure you got it right. My current method I am using as a sanity check is D(x,y)= (x-y)2/(x2+y2 ) And I just assume that if the tensor is distantly big I am not gona be hitting all zeros. Tho u could add an early exist or an epsilon to gurntee correctness. If I see D is less than 0.05 consistently across many inputs I feel safe in my implementation. Is this common? Do you have your own? I bet this has a name. submitted by /u/rejectedlesbian [link] [comments]
    [R] ASD (Autism Spectrum Disorder) Sensory Overload Detection Sensor Parameters Datasets
    Are there available datasets about sensor parameters for detecting sensory overload from people with ASD? submitted by /u/emcay [link] [comments]
    [D] What are the current best classification architectures/models?
    I am looking at upgrading an older classification model that uses a finetuned BERT for classification, but I'm having a hard time finding a good list of the current best classification models. Preferably the model can be trained locally. If there is anybody that is currently working with classification models and can share some insight that would be nice! submitted by /u/Dustwellow [link] [comments]
    [D] What papers to implement in 2024 from zero to hero?
    The title is inspired by Karpathy's bottom up approach in his course neural network from zero to hero. In a similar vein, but for research papers, going “bottom-up” from earlier papers (maybe max 10-15 years old) to today’s paper - what should someone familiar with DL (at least read a DL book) but who is relatively inexperienced implement? The goal of this is to acquire a deeper understanding of DL research and get to a level where one can comfortably read, understand and implement today’s DL papers and maybe even come up with novel ideas on their own. The areas I’m mostly thinking about are CV and NLP, but it could also be more general stuff like learning rates, loss functions, activation functions, regularization, optimizations etc. Does anyone know paper/papers that would be good to implement for this purpose or if you have any lists you want to share? Any help is appreciated! submitted by /u/total-expectation [link] [comments]
    [D] Any active and good Discord servers for ML?
    Hi, I am looking for a larger-scale ML/Data Science Discord server to find a team (much like game dev ones) or discussions relevant to the topic. Any suggestions? submitted by /u/AloofWasTaken [link] [comments]
    [D] ML being unserious?
    I was reading the ternary paper and it came to me how while most of the Arxiv papers are named stuff like "Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems", the ternary paper was called the slightly-unserious "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits". There are some other unserious things that happen in ML, like the "Attention is All You Need" paper and my favourite paper name, "Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling", people naming their techniques stuff like LASER and DRμGS, people naming their LLMs stuff like Nontoxic Bagel, Dolphin, and other random animals, and so on. Amid the somewhat slow period right now, do you have any good suggestions of times where ML was quite unserious? submitted by /u/adumdumonreddit [link] [comments]
    [D] for people working on ML at big tech/hedge funds, what percentage of alphas do you feel come from noise mainly?
    I haven’t made up my mind on this, but I have some colleagues who strongly believe most people are not working on truly meaningful gains using ML, but are chasing internal publication biases. Curious what people at other companies think. submitted by /u/Crazy_Suspect_9512 [link] [comments]
    [D] how successful has RAG been in search/recommendation ranking?
    There are some efforts to cast the retrieval/ranking problem as a document id generation problem, e.g., the différentiable search index papers 2 years ago from google and Microsoft. I am not entirely sure how worthwhile it is to replicate those systems despite their pretty bold claims of success. Furthermore given RAG has been wildly useful in chatbot, has any credible major effort been publicized or “launched” in the context of search/recommendation? submitted by /u/Crazy_Suspect_9512 [link] [comments]
    [D] why does the Mamba paper seem to distance itself from the apparent connection with RNN (a la Schmidhuber et al)?
    Maybe I missed something, but the similarity with RNN is hard to miss. Yet the paper seems deliberate in mentioning RNN. Is it because the convolutional optimization makes it far more efficient than sequential RNN? Or is Schmidhuber nowadays considered an outcast in ML community due to his overzealous self attribution earlier? submitted by /u/Crazy_Suspect_9512 [link] [comments]
    [P] My notes for doing back propagation manually after watching Andrej Karpathy's video
    Hi all, I have wrote an article about Andrej Karpathy's video: [Building makemore Part 4: Becoming a Backprop Ninja](https://www.youtube.com/watch?v=q8SA3rM6ckI&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=5&t=5798s). It took me 2 days to finish the code and the notes. I hope it can be helpful :) ​ Link: https://lyk-love.cn/2024/03/04/backpropagate%20manually/ submitted by /u/AdministrativeCar545 [link] [comments]
    [D] Minedojo sim speed setting and or video of session
    Anyone who has used minedojo before for agents interacting with a minecraft environment, have you found a way to set the simulation speed of the process? Otherwise, is there any way to get a rendered video of the running game? I’ve been searching on the docs as well as code implementation but haven’t been able to find anything, but I see several demonstration videos where people demonstrate the agent performing in real-time. Any advice would be greatly appreciated. submitted by /u/Open-Ad2530 [link] [comments]
  • Open

    Scaling understanding with the help of feedback loops, knowledge graphs and NLP
    Image by Gerd Altmann from Pixabay Assisted annotation and other automation methods that can be used for knowledge graph creation and natural language understanding should be more flexible and iterative than they often are. Much depends on the architecture and tooling choices your organization makes. If you make sound choices, you avail yourself of a… Read More »Scaling understanding with the help of feedback loops, knowledge graphs and NLP The post Scaling understanding with the help of feedback loops, knowledge graphs and NLP appeared first on Data Science Central.  ( 22 min )
  • Open

    Why image generation AI's are so deeply censored?
    I am not even trying to make the stuff that internet calls "nsfw". For example, i try to make a female character. Ai always portrays it with huge breasts. But as soon as i add "small breast" or "moderate breast size", Dall-e says "I encountered issues generating the updated image based on your specific requests", Midjourney says "wow, forbidden word used, don't do that!". How can i depict a human if certain body parts can't be named? It's not like i am trying to remove clothing from those parts of the body... I need an image of public toilett on the modern city street. Just a door, no humans, nothing else. But every time after generating image Bing says "unsafe image contents detected, unable to display". Why do you put unsafe content in the image in first place? You can just not use t…
    Local/offline open source text to image app with easy UI?
    I found https://msty.app/ and https://lmstudio.ai/ which are great but seem to only be for text and they're closed source. I'm looking for an app that: Is open source Works locally/offline Supports text>text and text>image Has a non-technical UI Is there anything like this? Or anything close? submitted by /u/hanniabu [link] [comments]
    Even ASI will be limited - Let me explain
    Even with the most powerful models, there is still going to be a virtual world limitation of what can be achieved. Say we have elements that have not been discovered yet. Imagine if Gold was so rare it was never mentioned, its properties never known. Nobody has ever written about it. The model would have no idea about Golds existence, yet it does physically exist in our world. Do you see what I mean? submitted by /u/sitytitan [link] [comments]
    Sergey Brin on Gemini 1.5 Pro (AGI House; March 2, 2024)
    submitted by /u/valis2400 [link] [comments]
    AI Security Teams?
    I keep seeing job posts around AI Security across sectors. I'm curious, what is your organisation doing with AI ? (in terms of using it and securing it) There seems to be two camps right now: Teams that are upskilling and adding responsibilities to existing roles, vs places building out new specialised teams. submitted by /u/caljhud [link] [comments]
    AI to find interesting information about companies
    Hi guys, thanks for taking my question. I'm tasked with researching a ton of companies for interesting information about their company. I'm not talking about address, email - stuff like that - I'm talking about reading news articles, finding recent posts about their day to day stuff, reading CEO posts, and so on. Interesting stories would be things like receiving VC funding, presenting at a conference, launching a new product line, etc. Any solutions that you can think of? ChatGPT plugins, other services, anything else? submitted by /u/aphelion3342 [link] [comments]
    Enhancing Our Understanding of Intelligence and the Path Towards Digitally Synthesizing its Artificial Analog: Prof. Jim Al-Khalili's Documentary on the Science Behind Intelligent Life and its Implications for Artificial Synthesis
    submitted by /u/holy_moley_ravioli_ [link] [comments]
    One-Minute Daily AI News 3/3/2024
    French police test AI-powered security cameras ahead of Olympics.[1] Surrey: AI to help turn dog pics into 3D models.[2] India reverses AI stance, requires government approval for model launches.[3] Aussies lose $8 million to deepfake celebrity investment scams.[4] Sources: [1] https://www.telegraph.co.uk/world-news/2024/03/03/paris-olympics-police-ai-security-cameras/ [2] https://www.bbc.com/news/uk-england-surrey-68435966 [3] https://techcrunch.com/2024/03/03/india-reverses-ai-stance-requires-government-approval-for-model-launches/ [4] https://www.9news.com.au/national/australia-news-celebrities-are-not-getting-rich-from-online-investment-trading-platforms/42bba48c-d1c9-4b35-ad72-b38932689d5b submitted by /u/Excellent-Target-847 [link] [comments]
    Which scenario in 20yrs or less?
    View Poll submitted by /u/ArmoredBattalion [link] [comments]
  • Open

    More questions about the gymnasium environment.
    Here's the structure to building a custom gymnasium environment (https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#make-your-own-custom-environment). I am looking for some clarification - gym-examples/ README.md setup.py gym_examples/ __init__.py envs/ __init__.py grid_world.py wrappers/ __init__.py relative_position.py reacher_weighted_reward.py discrete_action.py clip_reward.py Since there's a `setup.py`, we can import `gym_examples`. I'd like understand what happens after the import process - I believe the moment we import the package, the top most __init__.py gets invoked. This module consists of the following code - ​ from gymnasium.envs.registration import register register( id="gym_examples/GridWorld-v0", entry_point="gym_examples.envs:GridWorldEnv", max_episode_steps=300, ) I am unsure how the `__init__.py` corresponding to the subpackage `envs` get called. Why is there an `entry_point` in the register function? I understand having this in setup.py and then later invoking the entry_point in the command line. But here the authors place it within the function. submitted by /u/Academic-Rent7800 [link] [comments]
    Question Regarding Reinforcement Learning in Robotics
    I'm a high school student in an FRC(First robotics competition) team and was looking into using reinforcement learning for our robot. I have some experience in traditional machine learning and we have a CAD of our robot in onshape. I would really appreciate some help on next steps like robot simulation etc... Edit: We cant pay any subscriptions btw. submitted by /u/Cellini_ [link] [comments]
    Doubts regarding creating a custom gymnasium environment
    Why do I need to create a package when developing my own custom gymnasium environment I am reading the documentation given over here - https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#make-your-own-custom-environment It has the following information - The last step is to structure our code as a Python package. This involves configuring gym-examples/setup.py. A minimal example of how to do so is as follows: from setuptools import setup setup( name="gym_examples", version="0.0.1", install_requires=["gymnasium==0.26.0", "pygame==2.1.0"], ) I was wondering why do I need to create this package? the surprising thing is if I don't do this step, I get the error - `gymnasium.error.NamespaceNotFound: Namespace gym_examples not found. Have you installed the proper package for gym_examples? What does registering an environment mean? I checked the code and it seems like a function that initializes variables. Is that all it does? submitted by /u/Academic-Rent7800 [link] [comments]
    Advice for a project
    I'm seeking urgent advice for my final year project. I'd like to talk one on one with someone, preferably a PhD scholar. I have so many ideas and I would like to discuss them to know your opinion and get the valuable feedback. I want to know the implementation idea of these algorithms and the challenges that can arise. I am extremely looking for someone to contact as I am in dire need of help. submitted by /u/fa_anony__mous [link] [comments]
    Automated Reinforcement learning
    I have got this idea where for a given game, my agent has to select the best algorithm available of all the deep RL algorithms. So, i first want to train a neural network which would accomplish to figure what's the best suitable algorithm for a given game like DQN, DDPG, A2C, A3C, A3C etc. I'd like to know how this idea is and how to achieve it. submitted by /u/fa_anony__mous [link] [comments]
    Help for project topic
    I have Reinforcement learning as a subject this semester and my final evaluation is a project based on RL. I planned of applying RL in chess but my tutor rejected this topic as it was already done by various people and is not a "relevant" problem apparently. Now I have to choose a project topic, and I have researched everywhere but I couldn't find one that meets his expectations. I need a project topic to work on which hasn't been worked on by many, or worst case a project in which comparitive analysis can be done. Please help submitted by /u/Karthic_2811 [link] [comments]
    PPO in ml agents
    If I have a multi agent problem, all act according to the same model (homogeneous agents) - to train, if I use PPO, do I get multi agent leaning like POCA? I mean, will in this setting PPO take advantage of the joint state-action space? submitted by /u/CuriousDolphin1 [link] [comments]
    Introducing UniROS: ROS-Based Reinforcement Learning for Robotics
    Hey everyone! I'm excited to share UniROS, a ROS-based Reinforcement Learning framework that I've developed to bridge the gap between simulation and real-world robotics. This framework comprises two key packages: MultiROS: Perfect for creating concurrent RL simulation environments using ROS and Gazebo. RealROS: Designed for applying ROS in real robotic environments. What sets UniROS apart is its ease of transitioning from simulations to real-world applications, making reinforcement learning more accessible and effective for roboticists. I've also included additional Python bindings for some low-level ROS features, enhancing usability beyond the RL workflow. I'd love to get your feedback and thoughts on these tools. Let's discuss how they can be applied and improved! Check them out on GitHub: UniROS: github.com/ncbdrck/UniROS RealROS: github.com/ncbdrck/realros MultiROS: github.com/ncbdrck/multiros ​ submitted by /u/ncbdrck [link] [comments]
    Model-Based RL for environments with low dimensional observations
    When reading papers about MBRL I realized that all those approaches are evaluating the performance of their algorithms on environments with pixel-based observations. However, often times, especially in robotics, one has access to structured features like x-position, y-position, z-position, rotation, etc.. Does it make sense to create a model of the environment here for planning? Because even though one has access to structured information, the simulations can still be quiet computationally expensive. Therefore, I would think that MBRL makes sense here but I have not found any work on that specific niche. I would appreciate any paper recommendations. submitted by /u/ijustwanttostudy123 [link] [comments]
  • Open

    Alida gains deeper understanding of customer feedback with Amazon Bedrock
    This post is co-written with Sherwin Chu from Alida. Alida helps the world’s biggest brands create highly engaged research communities to gather feedback that fuels better customer experiences and product innovation. Alida’s customers receive tens of thousands of engaged responses for a single survey, therefore the Alida team opted to leverage machine learning (ML) to […]  ( 8 min )
    Unlocking Innovation: AWS and Anthropic push the boundaries of generative AI together
    Amazon Bedrock is the best place to build and scale generative AI applications with large language models (LLM) and other foundation models (FMs). It enables customers to leverage a variety of high-performing FMs, such as the Claude family of models by Anthropic, to build custom generative AI applications. Looking back to 2021, when Anthropic first started […]  ( 8 min )
  • Open

    Explain to me, if you know, how Synthesia AI Video Generator work
    They are generating videos through video neural synthesis. Please help me understand how it works. https://www.synthesia.io submitted by /u/Acrobatic-Jaguar-599 [link] [comments]
  • Open

    Robo Rendezvous: Robotics Innovators and AI Leaders to Converge at NVIDIA GTC
    Bringing together pioneers in robotics and AI, NVIDIA GTC will be a state-of-the-art showcase of applied AI for autonomous machines. The conference, running March 18-21 at the San Jose Convention Center and online, boasts a star-studded lineup. This includes a fireside chat with Marc Raibert, executive director of The AI Institute, and Dieter Fox, senior Read Article  ( 6 min )
  • Open

    Finite differences and Pascal’s triangle
    The key to solving a lot of elementary what-number-comes-next puzzles is to take first or second differences. For example, if asked what the next item in the series 14, 29, 50, 77, 110, … the answer (or at lest the answer the person posing the question is most likely looking for) is 149. You might […] Finite differences and Pascal’s triangle first appeared on John D. Cook.  ( 6 min )
    Archiving data on paper
    This is a guest post by Ondřej Čertík. Ondřej formerly worked at Los Alamos National Labs and now works for GSI Technologies. He is known in the Python community for starting the SymPy project and in the Fortran community for starting LFortran. — John   I finally got to experiment a bit with archiving data […] Archiving data on paper first appeared on John D. Cook.  ( 7 min )
  • Open

    Google at APS 2024
    Posted by Kate Weber and Shannon Leon, Google Research, Quantum AI Team Today the 2024 March Meeting of the American Physical Society (APS) kicks off in Minneapolis, MN. A premier conference on topics ranging across physics and related fields, APS 2024 brings together researchers, students, and industry professionals to share their discoveries and build partnerships with the goal of realizing fundamental advances in physics-related sciences and technology. This year, Google has a strong presence at APS with a booth hosted by the Google Quantum AI team, 50+ talks throughout the conference, and participation in conference organizing activities, special sessions and events. Attending APS 2024 in person? Come visit Google’s Quantum AI booth to learn more about the exciting work we’re d…  ( 95 min )

  • Open

    [P] Computer Vision Challenge: Validate Detections in Rows and Columns
    submitted by /u/zerojames_ [link] [comments]
    [D] How useful is DCN compared to good old MLP?
    In industrial large scale search/recommendation context, DCN seems to be still a popular kid on the street, compared to more straightforward MLP. The idea feels more or less like ResNet, where the original raw input keeps showing up at every layer. But is the element-wise product of layer output of raw input really adding any value here? If so why is it not adopted by the transformer architecture? submitted by /u/Crazy_Suspect_9512 [link] [comments]
    [P] Serious differences between cross-validation accuracy and test accuracy. Imbalanced data (combination of over and undersampling) + PCA performed
    So I am training a SVM model on a dataset with more than 500.000 observations that is also seriously imbalanced. So after splitting my data into training and test data set, I undersampled and oversampled the majority and minority class respectively. Because the data also had more than 50 features I performed a PCA on the training set and decided to keep the first 10 components. Then, I did a cross validation as you can see on the picture below. The Gaussian kernel performed the best. Then I again trained a model on my training set with the Gaussian kernel and tried to predict the test set (which I also performed PCA on). But now the accuracy dropped to 6% (?!). I know that there can be some overfitting but the difference here is so big that I am wondering if there is a problem in the setup of my code? https://preview.redd.it/r7mxa8byi6mc1.png?width=818&format=png&auto=webp&s=afc51f0352312c57febb12220e2804fcb591369d https://preview.redd.it/0iqa0xy2j6mc1.png?width=1093&format=png&auto=webp&s=ec5609ad081f0fc0766317b622b86118302435c7 submitted by /u/Abject_Pomegranate39 [link] [comments]
    [D] Neural Attention from the most fundamental first principles
    Sharing a video from my YT that explains the origin of the Attention architecture before it became so ubiquitous in NLP and Transformers. Builds off first principles and goes all the way to some of more advanced (and currently relevant) concepts. Link here for those who are looking for something like this. submitted by /u/AvvYaa [link] [comments]
    [R] Humanoid Robot Dancing, High Five, Waving, Hugging
    submitted by /u/XiaolongWang [link] [comments]
    [N] LLM models up to 7 times acceleration.
    submitted by /u/idlemonk111 [link] [comments]
    [D] Types of RAG Implementations and Their Benefits?
    Hey everyone, I recently delved into the world of RAG (Retrieval-Augmented Generation) and discovered its immense potential for enhancing chatbots and other generative models. However, I'm curious about the various types of implementations of this approach and how we can determine which is the most effective. Surprisingly, there's limited information available on different approaches online. Can anyone shed some light on this and share insights on identifying the best implementation? Your expertise would be greatly appreciated! Thanks in advance! submitted by /u/Lyriciseofficial [link] [comments]
    [P] LaVague: Open-source Text2Action AI pipeline to turn natural language into Selenium code
    🌊 Released #LaVague, fullly open-source AI pipeline to turn natural language into browser actions! In less than 150 lines of code (RAG with local embedding + Mixtral on Hugging Face API), it generates Selenium code from user query. In this GIF you can see it follow user instructions to command a browser to browse HF website! https://i.redd.it/vf91c2if25mc1.gif Try it on Colab: https://colab.research.google.com/github/dhuynh95/LaVague/blob/main/LaVague.ipynb GitHub: https://github.com/dhuynh95/LaVague Pretty exciting how it becomes possible to create an AI assistant that could perform actions for us, such as logging on gov accounts, fill forms, or pull personal information! It was quite fun to hack in the weekend using open-source tools, from Hugging Face local embedding with transformers + HF Pro API, to RAG with Llama Index, through Mistral Mixtral model! Some challenges: to make it run on Colab for the GPU Poors, I had to resort to HF Inference API with Mixtral as it was the only model good enough (gemma-7b did not make it and refused to produce code). Because I used an off-the-shelf model, I had to improve performance with few-shot learning and Chain Of Thought, the model managed to generate appropriate code! Potential next steps: fine-tune a 2b or 7b model to run fully locally so that everyone can benefit from a transparent and customizable AI agent to take actions for us It might be hard for me to fully grow this project, so I am open to contribution :) submitted by /u/Separate-Still3770 [link] [comments]
    [D]Agents+ LLM in production
    Hi everyone, I've noticed that Agents+LLM is gaining popularity lately. Can anyone share which Agentic frameworks are being used in production workloads? Are any of you using a combination of RAG Frameworks with Agents? Also, what are some typical production issues you've encountered while working with Agent Frameworks? Do share your thoughts and insights submitted by /u/Electrical_Study_617 [link] [comments]
    [D] Current models for medical image generation
    I'm looking for some up-to-date architectures or models for medical image generation. Since it's medical images data are few (so probably not diffusion) and images should be sharp to make the presence of pathologies clear (so probably not vanilla VAEs). There's many different flavors of GANs and I've found some that work for my case but I'm wondering if I've missed anything. Some sort of latent encoding would be useful too for downstream tasks, so an encoder-decoder architecture would be ideal but I've had some success with GAN inversion. submitted by /u/idontcareaboutthenam [link] [comments]
    [D] How to effectively use date attribute while implementing fake news classification?
    I am researching fake news classification. I want to use all four attributes: title, text, subject, and date as all attributes in the dataset give meaningful info. However, I am slightly confused about how to effectively use the date attribute as it is in a number format like 2017-11-30. I am trying to concatenate all four attributes before feeding them to the LSTM Model. submitted by /u/Kindly-Song5246 [link] [comments]
    [P] Research Papers in February 2024 — A Potential LoRA Successor, Small Finetuned LLMs Vs Generalist LLMs, and Transparent LLM Research
    submitted by /u/seraschka [link] [comments]
    [D] How to detect or speculate the "reward-hacking" phenomenon?
    I'm reading the DeepMind paper "WARM: On the Benefits of Weight Averaged Reward Models". The paper is talking about the reward-hacking phenomenon. In the paper, the authors use the KL-reward curve to detect reward-hacking phenomenon, saying that the reward starts to decrease and thus reward hacking happens. However, previous papers like https://arxiv.org/pdf/2312.09244.pdf or https://arxiv.org/pdf/2312.09244.pdf often use two reward models to detect reward hacking: the proxy reward and the true reward. The policy model is updated under the proxy reward, so when the KL increases, the proxy reward will increase too. The reward hacking phenomenon happens when the true reward starts to decrease, which is treated as the detector. So since WARM only shows the reward starts with an increase and ends with a decrease (which is the proxy reward and thus not enough), I don't think it can show the reward hacking phenomenon happening. Hoping for your comments. https://preview.redd.it/4ajcprcjk4mc1.jpg?width=824&format=pjpg&auto=webp&s=756876807aa48d9fffe284f1f51f6ebfd6d877e6 https://preview.redd.it/k00e2rcjk4mc1.jpg?width=793&format=pjpg&auto=webp&s=916ab66eafbe3db0edb705797024c29de3eb798f submitted by /u/zetiansss [link] [comments]
    [P] Need help fine-tuning a LLM to act as an AI teacher (similar to harvard's CS50 chatbot)
    I’m working on a project to create a model that can act as an AI teacher for students in an intro to programming class. The model should not write the code for them, but guide them and clear their confusions. I want to fine-tune a large language model for this purpose, but I’m not sure how to go about it. I have some questions that I hope you can help me with: ​ Which LLM should I use? . How should I prepare my dataset? submitted by /u/Shaheer_Humayun [link] [comments]
    [Discussion] [Research] Questions Machine Learning
    Dear ML Community, I got some questions since some time and I've been doing my research but I cant find the answers so I've arrived to the point where I need to ask these to see if someone has something to say on it. If you have something to say just to one of these questions, please answer. All these questions are done in the context of designing and building Large Models in a subject different than NLP, specifically animation generation: Are there ways to roughly estimate the relation between dataset_size, model_size and accuracy? Ways different than empiric methods? (since we are talking about large model, I think is too expensive to iterate over many sizes to understand these relation empirically) When building a Large Model. What you increase specifically, number of layers, hidden dim or num of heads? I am particularly curios on the hidden_dim size. I don't think you add value by increasing that dimension as you want. To keep it simply, let's think my token dim size is 1024. I think if you increase the hidden dim size above that number you would be wasting space to storage information. There should be a transformation from the token to the encoded vector that maintain or decrease dimension (to extract valuable features). Therefore increase dimension from the token dimension is a waste of parameters. Thank you for given your thoughts submitted by /u/Stefano939393 [link] [comments]
    [D] Simulated Annealing and Gradient Descent
    First of all I would like to apologise if my question is irrelevant. I am a beginner in the field of data science and I’ve been using gradient descent to find the minimum of a function. ( I’m using python) Obviously, I came across the local minimum problem and tried to find ways to avoid it and found simulated annealing. Is there a way to combine both to get to the global minimum? Thanks in advance submitted by /u/Due-Outside-8526 [link] [comments]
    [r] real-time web search for LLMs
    I’d like to add some realtime web search to an LLM solution. I was thinking the easiest way is to find top-n sites for a given query, scrape those pages, embed and retrieve the chunks most similar to the query and add these to the context. I’m finding the hardest bit to get going with any decent speed is the scraping. Any ideas on libraries/apis that do this? Or any better approaches to this that I might be missing? submitted by /u/Overall-Marsupial-65 [link] [comments]
    [D] Seeking Advice: Continual-RL and Meta-RL Research Communities
    I'm increasingly frustrated by RL's (continual-RL, meta-RL, transformers) sensitivity to hyperparameters and the extensive training times (I hate RL after 5 years of PhD research). This is particularly problematic in meta-RL continual RL, where some benchmarks demand up to 100 hours of training. This leaves little room for optimizing hyperparameters or quickly validating new ideas. Given these challenges and my readiness to explore math theory more deeply, including taking all available online math courses for a proof-based approach to avoid the endless waiting and training loop, I'm curious about AI research areas trending in 2024 that are closely related to reinforcement learning but require a maximum of just 3 hours for training. Any suggestions? submitted by /u/Noprocr [link] [comments]
    [P] How do I deploy my trained LORA adapter to a ChatUI in huggingface?
    I have a trained LORA adapter on Mistral8x7b and uploaded on huggingface. How do I deploy it to a huggingchatUI interface such that it looks like this? Zephyr Gemma Chat - a Hugging Face Space by HuggingFaceH4 1. I’m asking since the tutorials and guides I’ve seen so far are only for models that aren’t adapters. Thank you. submitted by /u/portmanteau98 [link] [comments]
    [R] [CVPR 2024] AV-RIR: Audio-Visual Room Impulse Response Estimation
    submitted by /u/Snoo63916 [link] [comments]
    [D][P] I have some questions about a project related to Chatbot that can give health advice.
    The requirements are that the chatbot must have good conversational flow and be able to help the user accurately. What sort of combination of deep learning nlp and LLMs would I need to use. submitted by /u/Infinite-Dragonfruit [link] [comments]
  • Open

    Which landing page copy is made with AI?
    submitted by /u/Trentadollar [link] [comments]
    Chatbot modelled dead loved one
    Going to be a great service no? submitted by /u/PsychologicalHall905 [link] [comments]
    Is mind uploading theoretically possible? Or is it purely science fiction?
    Is transferring your consciousness and sentience into a powerful computer theoretically possible? Or is it purely science fiction? Isn't consciousness non-algorithmic? https://www.imdb.com/title/tt2209764/ submitted by /u/Der_Ist [link] [comments]
    One-Minute Daily AI News 3/2/2024
    This AI Paper from Meta AI Explores Advanced Refinement Strategies: Unveiling the Power of Stepwise Outcome-based and Process-based Reward Models.[1] AI worm infects users via AI-enabled email clients — Morris II generative AI worm steals confidential data as it spreads.[2] Korean new AI image generator is 8 times faster than OpenAI’s best tool — and can run on cheap computers.[3] AI-generated porn, including celebrity fake nudes, persist on Etsy as deepfake laws ‘lag behind’.[4] Sources: [1] https://www.marktechpost.com/2024/02/29/this-ai-paper-from-meta-ai-explores-advanced-refinement-strategies-unveiling-the-power-of-stepwise-outcome-based-and-process-based-reward-models/?amp [2] https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-worm-infects-users-via-ai-enabled-email-clients-morris-ii-generative-ai-worm-steals-confidential-data-as-it-spreads [3] https://www.livescience.com/technology/artificial-intelligence/new-ai-image-generator-koala-is-8-times-faster-than-openais-best-tool-and-can-run-on-cheap-computers [4] https://www.foxnews.com/politics/ai-generated-porn-including-celebrity-fakes-persist-etsy-deepfake-laws-lag-behind.amp submitted by /u/Excellent-Target-847 [link] [comments]
    AI that can learn to text like people?
    I think the concept of C.AI (character ai) is really cool. I was wondering if there is already a feature or an AI somewhere that can analyze someone, their text messages for instance, and be able to write like them? submitted by /u/alyfox-x [link] [comments]
    Online text rpg?
    A friend of mine presented to me MUD multi-user dungeon, but it is to much detailed and works with commands. I just want something practical. I know that there are lots of AI rpg, but I don't know anyone which is online submitted by /u/Luisgames07_ [link] [comments]
  • Open

    End-to-end JAX library
    Hey all, I thought I’d make a post here to announce a project I’ve released recently. The project is a library of popular algorithm implementations but using the DeepMind anakin training paradigm. This might be known to some as end-to-end RL where every aspect of the algorithm is run on hardware accelerators. For those of you who know PureJaxRL, it’s what that library is fundamentally built on. I’m aiming to create a research framework (that is maintained and improved over time) that provides 4 primary things: 1. Highly efficient, fast, and scalable RL algorithm implementations to allow fast and iterative research. 2. An easily hackable codebase as a starting point for a research project. 3. Performant algorithm implementations for standardised baselines. 4. Educational value. Following closely to the philosophy of cleanRL, the majority of code for each algorithm is provided in single files however certain elements are standardised for reuse such as network architectures, evaluation and logging. The codebase also provides useful functionality such as checkpointing. The project is still relatively new and I only work on it during my free time. Any feedback would be appreciated. Let me know if you have any questions. The project can be found here: https://github.com/EdanToledo/Stoix submitted by /u/WorkingManTech [link] [comments]
    Deep Generative Models for Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions
    submitted by /u/Ashamed-Put-2344 [link] [comments]
    Playing with Value Iteration in Haskell
    submitted by /u/gwern [link] [comments]
    Continual-RL and Meta-RL Research Communities
    I'm increasingly frustrated by RL's (continual-RL, meta-RL, transformers) sensitivity to hyperparameters and the extensive training times (I hate RL after 5 years of PhD research). This is particularly problematic in meta-RL continual RL, where some benchmarks demand up to 100 hours of training. This leaves little room for optimizing hyperparameters or quickly validating new ideas. Given these challenges and my readiness to explore math theory more deeply, including taking all available online math courses for a proof-based approach to avoid the endless waiting and training loop, I'm curious about AI research areas trending in 2024 that are closely related to reinforcement learning but require a maximum of just 3 hours for training. Any suggestions? submitted by /u/Noprocr [link] [comments]
    should i write my own deep reinforcement learning algorithm or find an open source project?
    I am a graduate student studying multi-agent deep reinforcement learning. I have recently fallen into self-denial when it comes to reproducing algorithms. As the title says, should I use the open source GitHub project framework with many stars, or implement the algorithm manually according to teaching blogs or tutorials? Before this, I have been reproducing algorithms based on papers, teaching blogs, tutorials, etc. (such as YOUTUBE, spinning up (openai) and other resources). I have successfully reproduced DQN, DDPG, SAC, MADDPG algorithms, and I can ensure There is no problem, because all the results converge to a certain good effect. A problem I encountered in the past two days is that when I was training my own multi-agent environment, I found that the maddpg algorithm I wrote took a…
  • Open

    GenAI: The User Interface to Artificial Intelligence (AI)?
    Generative AI (GenAI) Chatbots like Microsoft Copilot (formerly Bing AI), Google’s Gemini (formerly Google Bard), and OpenAI ChatGPT (still OpenAI ChatGPT) are driving extraordinary productivity improvements by assisting knowledge workers in providing highly relevant information, answering questions, and engaging in wide-ranging exploratory conversations. However, the Wall Street Journey Article “Microsoft’s most ambitious AI upgrade could… Read More »GenAI: The User Interface to Artificial Intelligence (AI)? The post GenAI: The User Interface to Artificial Intelligence (AI)? appeared first on Data Science Central.  ( 22 min )
  • Open

    LLM Tokenizers Explained
    submitted by /u/Personal-Trainer-541 [link] [comments]
    The University of Calgary Unleashes Game-Changing Structured Sparsity Method: SRigL
    submitted by /u/UpvoteBeast [link] [comments]

  • Open

    [P] TimesFM: Google's Foundation Model For Time-Series Forecasting
    submitted by /u/apaxapax [link] [comments]
    [D] Web data for building custom models?
    I'm working on a new project and we're considering web crawling to collect data for building custom models. This crawl would be a combination of internal (private) and external pages and we might use gpt3.5 and RAG or build from scratch. What are others doing? Is everyone leveraging existing models, or is anyone doing more bespoke data collection and making their own models from scratch? Is there a better source of web data that's more recent than the common crawl, but not as hard as doing your own web crawling? Thanks! submitted by /u/charlesthayer [link] [comments]
    [D] Explaining Transformers + VQ-VAE = LLMs that can generate images
    submitted by /u/AvvYaa [link] [comments]
    [R] Multiagent RL software
    A multi-agent RL framework in JAX. It implements popular IMPALA and OPRE multiagent frameworks. It has an interface to hundreds of multi-agent environments through melting pot and overcooked. submitted by /u/misterpawan [link] [comments]
    [R] bGPT - Byte-Level Transformer
    submitted by /u/Marha01 [link] [comments]
    Pytorch training help[D]
    I am training a deep neural network using Pytorch. At every step I want to see the parameter values. How do I do this. Even after I apply opt.step it shows the same values as the initialized ones submitted by /u/NickRay1234 [link] [comments]
    [P] ArXiv Machine Learning Landscape
    submitted by /u/lmcinnes [link] [comments]
    [R] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
    Training AI to understand and describe video content requires datasets which are expensive for humans to annotate manually. Now researchers from Snap, UC Merced, and the University of Trento have put together a new dataset called Panda-70M that aims to help. This new dataset has 70 million high-res YouTube clips paired with descriptive captions. The key is they used an automated pipeline with multiple cross-modal "teacher" AI models to generate captions based on different inputs like video, subtitles, images, etc. Some highlights: 70M 720p YouTube clips about 8 secs long with 13-word captions Teacher models include video QA, image captioning, text summarization Ensemble of teachers can accurately describe 84% of clips vs 31% for any single model Pretraining on this dataset improved video AI models' performance substantially: 18% boost in captioning accuracy after finetuning small 2.5M subset 7% better at text-video retrieval 77% reduction in video generation errors Limitations remain around content diversity, caption density, and automated quality. But I think this is a big step forward for assembling large-scale video-text training data to advance multimodal AI. Efficient pipelines like this could unlock video understanding capabilities approaching human level comprehension. Exciting to see some models trained on Panda-70M as they become available. Paper here. Summary here. submitted by /u/Successful-Western27 [link] [comments]
    [D][R] What is the current state of the art for unsupervised image segmentation?
    I am working on 4k image data for a PhD project and we have a ton of unlabeled images that needs to be segmented. The advisor wants to start with unsupervised image segmentation. Which algorithm should I start with playing and tweaking? submitted by /u/bahauddin_onar [link] [comments]
    [D] What Is Your LLM Tech Stack in Production?
    Curious what everybody is using to implement LLM powered apps for production usage and your experience with these toolings and advice. This is what I am using for some RAG prototypes I have been building for users in finance and capital markets. Pre-processing\ETL: Unstructured.io + Spark, Airflow Embedding model: Cohere Embed v3 Previously using OpenAI Ada but Cohere has significantly better retrieval recall and precision for my use case. Also exploring other open weights embedding models Vector Database: Elasticsearch previously but now using Pinecone LLM: Gone through quite a few including hosted and self-hosted options. Went with gpt4 early during prototyping then switched to gpt3.5-turbo for more manageable costs and eventually open weights models. Now using a fine-tuned Llama2 30B model self hosted with vLLM LLM Framework: Started with Langchain initially but found it cumbersome to extend as the app became more complex. Tried implementing it in LlamaIndex at some point just to learn and found it just as bad. Went back to Langchain and now I am in the midst of replacing it with my own logic What is everyone else using? submitted by /u/gamerx88 [link] [comments]
    [D] Gemma training datasets list
    Hi, I am trying to find the datasets that were used during training the Gemma model family. The docs only provide a short description but nothing concrete. Does anyone know whether or not I can find such thing? The reason I am doing this is that I want to make sure a dataset that I chose was certainly not within the training samples. submitted by /u/cosminptr [link] [comments]
    Deep Generative Models for Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions
    submitted by /u/Ashamed-Put-2344 [link] [comments]
    [D] experience with Google TPU?
    Does anyone have experience porting software to Google’s TPUs? Is it as simple as lifting and shifting an existing PyTorch or TensorFlow workload, or is it more complicated? How about writing code from scratch, any easier? submitted by /u/siliconductor1 [link] [comments]
    [D] LLM with analytical capabilities
    My boss asked me to create an application which can be connected to Redshift or Postgres databases and can retrieve numerical data and can answer financial and analysis questions for the company. He doesn't understand that RAG technique is not suited for numerical analysis stuff. Any idea how I can achieve this. He mentioned to use amazon Q for this task, which I find is awful in giving answers to numeric datasets submitted by /u/Fun-Ad953 [link] [comments]
    [D] Best way to replace clothes on the photo?
    Hi community 👋 As I am a noob in ML, reaching out for some help from knowing people. My goal is to create a proof-of-concept for a tool, that will replace clothes on a person. Expected input: - photo of t-shirt/hoodie/sweater lying on the table or hanged on a hanger - full-body photo of a man/woman standing in simple modelling pose Expected output: - full-body photo of same man/woman standing in the t-shirt/hoodie/sweater that was passed My friends suggested to go with this model: https://ootd.ibot.cn/ But maybe you have encountered that problem before and found a better model/pipeline for it? Also, I am considering expanding the inputs to: item's fabric/item's fit (regular/slim/etc.)/some other attributes that may help achieve the most realistic result without defects. Thank you for your time! Will appreciate any help/suggestions on how to approach this problem :) submitted by /u/sl0bodzyany [link] [comments]
    [D] Time series synthetic data
    Hi all. Has anyone dealt with creating seasonal time series synthetic data? Could you suggest whats the best approach? I’m doing multivariate time series forecast and was wondering if synthetic data is even applicable here, as it should mimic underlying relationships. I did create additional year of data using SDV (specifying constraints and distributions), but the model showed worse scores. Quality report showed 92.64% (column shapes and column pair trends). But Im still confused is it even possible to recreate underlying relationships between X variables and also X and y? Thanks. submitted by /u/_what_the_f [link] [comments]
    [D] BitNet 1-b/b1.58 LLMs - is that a threat to nvidia?
    Link to paper: https://arxiv.org/pdf/2402.17764.pdf is that real? it sounds too good to be real right? If it is true, it not only reduces VRAM capacity and bandwidth required to train and run LLMs, it also suggests simplified hardware implementation due to the lack of need for matmul , it only needs + operation is that not a threat for nvidia (stock) and amd as well ? submitted by /u/tunggad [link] [comments]
    [R] Humanoid Locomotion as Next Token Prediction
    Paper: https://arxiv.org/abs/2402.19469 Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories. ​ submitted by /u/StartledWatermelon [link] [comments]
    [D] What generated (cloned) voice model is better Tortoise TTS + Ecker ?
    Hi, I am using https://git.ecker.tech/mrq/ai-voice-cloning which uses Tortoise TTS. I do clone a particular voice and then get a couple of models under "training/{voiceName}/finetune/modules". For example: 200_gpt.pth 400_gpt.pth 501_gpt.pth Question: Which one from the above is the best model, that creates the most similar to the original voice TTS ? PS: I want to do cleanups! and remove all unecesary files as they take lots of gb. Same thing for the folder "training\{voiceName}\finetune\training_state". If I know for example that 501 is the best model I can delete anything else and clean up 8gb. submitted by /u/chriscs777 [link] [comments]
  • Open

    Graph Attention Network (GAT)
    Hello there, I have a query that may seem trivial (I'm new to GAT). Currently, I have a state represented as Data(x=[24, 6], edge_index=[2, 23]). Upon passing it through the Graph Attention Network (GAT) layer, I obtain an output shape of torch.Size([24, 64]). Subsequently, I aim to feed this output into a module that should yield two probabilities but it returns a shape of torch.Size([24, 2]) (for each node it outputs 2 probabilities). My concern arises regarding obtaining just two probabilities as output. Could you kindly advise on how to achieve this? Here is a snapshot of the code: class Policy(nn.Module): def __init__(self, state_dim, action_set1_dim, action_set2_num_heads): super(Policy, self).__init__() self.gat_conv1 = GATConv(state_dim[-1], 64, heads=1, dropout=0.5) self.gat_conv2 = GATConv(64, 64, heads=1, dropout=0.5) # Selector network self.selector = nn.Sequential( nn.Linear(64 + state_dim[-1], 32), nn.ReLU(), nn.Linear(32, 2) ) def forward(self, state): # x: node features, edge_index: graph connectivity, mask: mask for padded values masked_x = state.x * state.mask # Pass through the first GAT layer x_gat1 = F.relu(self.gat_conv1(masked_x, state.edge_index)) # Pass through the second GAT layer x_gat2 = F.relu(self.gat_conv2(x_gat1, state.edge_index)) # Selector network selector_probs = F.softmax(self.selector(x_gat2), dim=-1) return selector_probs submitted by /u/GuavaAgreeable208 [link] [comments]
    Help with continuous space offline Q-Learning
    I'm very new to RL and am planning on using it in my thesis in a way that is admittedly ambitious. I know the basics of online reinforcement learning and have done a couple of small projects, but am struggling to even understand where to look for information to achieve what I want. Essentially I have access to historical data only and it consists of a state space described by 12 60*100 images and an action space described by 2 of these images as well. Theses spaces are continuous and unbounded. I also have a reward function defined for each state as well as some terminal states for different outcomes. My goal is to evaluate each action taken, as such I'd like to train a neural network that can receive a state and an action as input, and output a Q-Value. I have no interest in finding an optimal policy. I've found something called IQL but I feel so over my head that I'm not sure if it applies to my problem. Help would be very appreciated! submitted by /u/macaco3001 [link] [comments]
    What are reward networks in reinforcement learning?
    I am reading the following article given over here - The goal of both inverse reinforcement learning (IRL) algorithms (e.g. AIRL, GAIL) and preference comparison is to discover a reward function. In imitation learning, these discovered rewards are parameterized by reward networks. Does this mean that the output of imitation learning is a reward value? What are the inputs then? The state and action? Is there a good paper that explains reward networks and imitation learning? I am referring to Levine's lecture notes. At slide 5, it seems like the output of imitation learning is a policy and not a reward. submitted by /u/Academic-Rent7800 [link] [comments]
    Wizards and PPO
    submitted by /u/nurgle100 [link] [comments]
    Just sharing my dissertation in RL for Gran Turismo
    https://youtu.be/zpz3Dbmx1SI submitted by /u/NDR008 [link] [comments]
  • Open

    The Employee Angle on AI Copyright
    I was listening to this Decoder with Nilay Pattel episode the other day where they discuss the AI copyright lawsuits going on. The thought experiment I keep coming back to is thinking about how it would work legally if these AI models were instead human employees of the company. Say a company employs someone with a photographic memory that is capable of writing back exactly what they read. They have that employee read NYT articles. Then the company contracts out the use of the employee to some third party. The AI company is acting like a temp agency. The third party then requests the employee to write out the NYT article that they have memorized. Where does the legal responsibility lie in that case? Or as another what if, how about the person is a comic artist that has studied how to draw Mario. Probably a more likely scenario. If they are asked by the third to draw a picture of Mario for them is that breaking copyright law? Or if they're given a very specific request for something that is Mario without using the name Mario is it on them to recognize what they are drawing and refuse the request? I know AI is this super capable thing that can do whatever its asked 1000x faster than any human could but it's still kind of what's going on. Now given the company making the AI has total control over the AI, it would seem all the responsponsibility falls back on the AI company and not the AI itself. submitted by /u/curtis_perrin [link] [comments]
    AI worms
    This specific virus sounds like it would be a problem for AI agents like Rabbit, or any chatbots that can get access to your other apps, like Bing or Gemini. Not sure how it might impact other chatbots that don't. submitted by /u/SignalWorldliness873 [link] [comments]
    Wizards and PPO
    Hello I am u/nurgle100 and I have been working on and off on a Deep Reinforcement Learning Project [GitHub] for the last five years now. Unfortunately I have hit a wall. Therefore I am posting here to show my progress and to see if any of you are interested in taking a look at it, giving some suggestions or even in cooperating with me. The idea is very simple. I wanted to code an agent for Wizard) the card game. If you have never heard of the game before: It is - in a nutshell- a trick-taking card game where you have to announce the amount of tricks that you win each round and gain points if you get this exact amount of tricks but lose points otherwise. Unfortunately I have not yet succeeded at making the computer play well enough to beat my friends, but here is what I have done so far:…
    Why are social media sites like reddit not adding tools for easy analysis of possible AI / troll activity
    I made a simple script that reads worldnews and analyses it for AI bot likeness: https://pastebin.com/rCNcU82M (Uses local LM Studio in server mode for AI) https://youtu.be/s_Rk0eBVDAI My implementation may be flawed in many ways but it made me think. Why on earth does reddit not do something like this? Include easy analysis of the post giving it perhaps points on its AI likeness? They could as well let users know if there has been suspicious activity, say posts that are very similar within a given timeframe. A sort of policing tool that would help the users to figure out if what they are reading is possibly troll or AI generated for purpose X. Especially now before US elections it would be even more important. submitted by /u/aluode [link] [comments]
    One-Minute Daily AI News 3/1/2024
    Elon Musk sues OpenAI and Sam Altman over ‘betrayal’ of nonprofit AI mission.[1] Dell shares soar 15% after beating earnings expectations, cites rising demand for AI servers.[2] Researchers from the University of Pennsylvania and Vector Institute Introduce DataDreamer: An Open-Source Python Library that Allows Researchers to Write Simple Code to Implement Powerful LLM Workflow.[3] Apple, Microsoft, Amazon, Alphabet, and Nvidia Have All Invested in Voice-Recognition Software. Here’s 1 Artificial Intelligence (AI) Stock That Could Go Parabolic.[4] Vimeo launches new AI video tools to help employees breeze through hours-long town halls and training videos.[5] Sources: [1] https://techcrunch.com/2024/03/01/elon-musk-openai-sam-altman-court/ [2] https://www.cnbc.com/2024/02/29/dell-shares-soar-20percent-after-beating-earnings-expectations-cites-rising-demand-for-ai-servers.html [3] https://www.marktechpost.com/2024/02/27/researchers-from-the-university-of-pennsylvania-and-vector-institute-introduce-datadreamer-an-open-source-python-library-that-allows-researchers-to-write-simple-code-to-implement-powerful-llm-workflo/ [4] https://finance.yahoo.com/news/apple-microsoft-amazon-alphabet-nvidia-221900525.html [5] https://fortune.com/2024/02/29/vimeo-central-video-ai-tools-training-videos/ submitted by /u/Excellent-Target-847 [link] [comments]
    Natural AI
    All AIs sound terrible when you talk to them. So robotic, so controlled. I used this judgemental AI thing that judges you based on your music taste and it was so fun. Know any at models that are not some random Character.AI models that actually sound normal not super sensitive? Like there are so many limitations to what they can do. I asked Gemini to give me a list of relatively hot celebrities I'm not even sure what I wanted it for and it was like "I can't fulfill your request because everyone is beautiful in their own way" or something like that. submitted by /u/ecoxtrooper [link] [comments]
  • Open

    Emails moved to Substack
    Until recently I used two email services: one to send out daily blog post announcements and another for monthly blog highlights. I’ve combined these into one Substack account for weekly blog highlights. Apparently readers really like this move. Daily and monthly email subscriptions flatlined some time ago, but Substack subscriptions are going up steadily. Substack […] Emails moved to Substack first appeared on John D. Cook.  ( 5 min )
    What’s the Best Code Editor?
    Emacs, vi, TextEdit, nano, Sublime, Notepad, Wordpad, Visual Studio, Eclipse, etc., etc.—everyone’s got a favorite. I used Visual Studio previously and liked the integrated debugger. Recently I started using VS again and found the code editing windows rather cluttered. Thankfully you can tone this down, if you can locate the right options. Eclipse for Java […] What’s the Best Code Editor? first appeared on John D. Cook.  ( 5 min )
  • Open

    Infinite dSprites for Disentangled Continual Learning: Separating Memory Edits from Generalization
    arXiv:2312.16731v2 Announce Type: replace Abstract: The ability of machine learning systems to learn continually is hindered by catastrophic forgetting, the tendency of neural networks to overwrite existing knowledge when learning a new task. Continual learning methods alleviate this problem through regularization, parameter isolation, or rehearsal, but they are typically evaluated on benchmarks comprising only a handful of tasks. In contrast, humans are able to learn continually in dynamic, open-world environments, effortlessly achieving one-shot memorization of unfamiliar objects and reliably recognizing them under various transformations. To make progress towards closing this gap, we introduce Infinite dSprites, a parsimonious tool for creating continual classification and disentanglement benchmarks of arbitrary length and with full control over generative factors. We show that over a sufficiently long time horizon, the performance of all major types of continual learning methods deteriorates on this simple benchmark. Thus, Infinite dSprites highlights an important aspect of continual learning that has not received enough attention so far: given a finite modelling capacity and an arbitrarily long learning horizon, efficient learning requires memorizing class-specific information and accumulating knowledge about general mechanisms. In a simple setting with direct supervision on the generative factors, we show how learning class-agnostic transformations offers a way to circumvent catastrophic forgetting and improve classification accuracy over time. Our approach sets the stage for continual learning over hundreds of tasks with explicit control over memorization and forgetting, emphasizing open-set classification and one-shot generalization.  ( 3 min )
    Score dynamics: scaling molecular dynamics with picosecond timesteps via conditional diffusion model
    arXiv:2310.01678v3 Announce Type: replace-cross Abstract: We propose score dynamics (SD), a general framework for learning accelerated evolution operators with large timesteps from molecular-dynamics simulations. SD is centered around scores, or derivatives of the transition log-probability with respect to the dynamical degrees of freedom. The latter play the same role as force fields in MD but are used in denoising diffusion probability models to generate discrete transitions of the dynamical variables in an SD timestep, which can be orders of magnitude larger than a typical MD timestep. In this work, we construct graph neural network based score dynamics models of realistic molecular systems that are evolved with 10~ps timesteps. We demonstrate the efficacy of score dynamics with case studies of alanine dipeptide and short alkanes in aqueous solution. Both equilibrium predictions derived from the stationary distributions of the conditional probability and kinetic predictions for the transition rates and transition paths are in good agreement with MD. Our current SD implementation is about two orders of magnitude faster than the MD counterpart for the systems studied in this work. Open challenges and possible future remedies to improve score dynamics are also discussed.  ( 2 min )
    Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection
    arXiv:2302.02237v2 Announce Type: replace Abstract: The Random Forests classifier, a widely utilized off-the-shelf classification tool, assumes training and test samples come from the same distribution as other standard classifiers. However, in safety-critical scenarios like medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples not appearing during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and flag unseen outliers by generating an empty set. Theoretically, we establish CSForest to cover true labels for previously observed inlier classes under arbitrarily label-shift in the test data. We compare CSForest with state-of-the-art methods using synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest's effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Codes of CSForest are available at https://github.com/yujinhan98/CSForest.  ( 2 min )
    CAREER: A Foundation Model for Labor Sequence Data
    arXiv:2202.08370v4 Announce Type: replace Abstract: Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a foundation model for job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and adjust it on small longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables. For example, incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.  ( 2 min )
    Federated attention contrastive learning models for prostate cancer diagnosis and Gleason grading
    arXiv:2302.06089v4 Announce Type: replace-cross Abstract: Artificial intelligence (AI) shows great promise in revolutionizing medical imaging, improving diagnosis, and refining treatment methods. However, the training of AI models relies on extensive multi-center datasets, presenting a potential challenge due to concerns about data privacy protection. Federated learning offers a solution by enabling a collaborative model across multiple centers without sharing raw data. In this study, we present a Federated Attention Contrastive Learning (FACL) framework designed to address challenges associated with large-scale pathological images and data heterogeneity. FACL improves model generalization by maximizing attention consistency between the local client and the server model. To enhance privacy and validate robustness, we incorporate differential privacy by introducing noise during parameter transfer. We assess the effectiveness of FACL in cancer diagnosis and Gleason grading tasks using 19,461 whole slide images of prostate cancer sourced from multiple centers. In the diagnosis task, FACL achieves an area under the curve (AUC) of 0.9718, outperforming seven centers whose average AUC is 0.9499 when categories are relatively balanced. In the Gleason grading task, FACL attained a Kappa score of 0.8463, surpassing the average Kappa score of 0.7379 from six centers. In conclusion, FACL offers a robust, accurate, and cost-effective AI training model for prostate cancer pathology while maintaining effective data safeguards.  ( 3 min )
    OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics
    arXiv:2401.12202v2 Announce Type: replace-cross Abstract: Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile systems, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose applications of robotics still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers a integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state-of-the-art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8x the performance of prior work. On cleaner, uncluttered environments, OK-Robot's performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules. Videos of our experiments and code are available on our website: https://ok-robot.github.io  ( 3 min )
    ProtoP-OD: Explainable Object Detection with Prototypical Parts
    arXiv:2402.19142v1 Announce Type: cross Abstract: Interpretation and visualization of the behavior of detection transformers tends to highlight the locations in the image that the model attends to, but it provides limited insight into the \emph{semantics} that the model is focusing on. This paper introduces an extension to detection transformers that constructs prototypical local features and uses them in object detection. These custom features, which we call prototypical parts, are designed to be mutually exclusive and align with the classifications of the model. The proposed extension consists of a bottleneck module, the prototype neck, that computes a discretized representation of prototype activations and a new loss term that matches prototypes to object classes. This setup leads to interpretable representations in the prototype neck, allowing visual inspection of the image content perceived by the model and a better understanding of the model's reliability. We show experimentally that our method incurs only a limited performance penalty, and we provide examples that demonstrate the quality of the explanations provided by our method, which we argue outweighs the performance penalty.  ( 2 min )
    A Scalable and Transferable Time Series Prediction Framework for Demand Forecasting
    arXiv:2402.19402v1 Announce Type: new Abstract: Time series forecasting is one of the most essential and ubiquitous tasks in many business problems, including demand forecasting and logistics optimization. Traditional time series forecasting methods, however, have resulted in small models with limited expressive power because they have difficulty in scaling their model size up while maintaining high accuracy. In this paper, we propose Forecasting orchestra (Forchestra), a simple but powerful framework capable of accurately predicting future demand for a diverse range of items. We empirically demonstrate that the model size is scalable to up to 0.8 billion parameters. The proposed method not only outperforms existing forecasting models with a significant margin, but it could generalize well to unseen data points when evaluated in a zero-shot fashion on downstream datasets. Last but not least, we present extensive qualitative and quantitative studies to analyze how the proposed model outperforms baseline models and differs from conventional approaches. The original paper was presented as a full paper at ICDM 2022 and is available at: https://ieeexplore.ieee.org/document/10027662.  ( 2 min )
    Imputation of missing values in multi-view data
    arXiv:2210.14484v3 Announce Type: replace-cross Abstract: Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This leads to very large quantities of missing data which, especially when combined with high-dimensionality, makes the application of conditional imputation methods computationally infeasible. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.  ( 2 min )
    A Protein Structure Prediction Approach Leveraging Transformer and CNN Integration
    arXiv:2402.19095v1 Announce Type: cross Abstract: Proteins are essential for life, and their structure determines their function. The protein secondary structure is formed by the folding of the protein primary structure, and the protein tertiary structure is formed by the bending and folding of the secondary structure. Therefore, the study of protein secondary structure is very helpful to the overall understanding of protein structure. Although the accuracy of protein secondary structure prediction has continuously improved with the development of machine learning and deep learning, progress in the field of protein structure prediction, unfortunately, remains insufficient to meet the large demand for protein information. Therefore, based on the advantages of deep learning-based methods in feature extraction and learning ability, this paper adopts a two-dimensional fusion deep neural network model, DstruCCN, which uses Convolutional Neural Networks (CCN) and a supervised Transformer protein language model for single-sequence protein structure prediction. The training features of the two are combined to predict the protein Transformer binding site matrix, and then the three-dimensional structure is reconstructed using energy minimization.  ( 2 min )
    Watermark Stealing in Large Language Models
    arXiv:2402.19361v1 Announce Type: cross Abstract: LLM watermarking has attracted attention as a promising way to detect AI-generated content, with some works suggesting that current schemes may already be fit for deployment. In this work we dispute this claim, identifying watermark stealing (WS) as a fundamental vulnerability of these schemes. We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks, as suggested in prior work, but also greatly boosts scrubbing attacks, which was previously unnoticed. We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings. We show that for under $50 an attacker can both spoof and scrub state-of-the-art schemes previously considered safe, with average success rate of over 80%. Our findings challenge common beliefs about LLM watermarking, stressing the need for more robust schemes. We make all our code and additional examples available at https://watermark-stealing.org.  ( 2 min )
    Application of Zone Method based Physics-Informed Neural Networks in Reheating Furnaces
    arXiv:2308.16089v2 Announce Type: replace Abstract: Foundation Industries (FIs) constitute glass, metals, cement, ceramics, bulk chemicals, paper, steel, etc. and provide crucial, foundational materials for a diverse set of economically relevant industries: automobiles, machinery, construction, household appliances, chemicals, etc. Reheating furnaces within the manufacturing chain of FIs are energy-intensive. Accurate and real-time prediction of underlying temperatures in reheating furnaces has the potential to reduce the overall heating time, thereby controlling the energy consumption for achieving the Net-Zero goals in FIs. In this paper, we cast this prediction as a regression task and explore neural networks due to their inherent capability of being effective and efficient, given adequate data. However, due to the infeasibility of achieving good-quality real data in scenarios like reheating furnaces, classical Hottel's zone method based computational model has been used to generate data for model training. To further enhance the Out-Of-Distribution generalization capability of the trained model, we propose a Physics-Informed Neural Network (PINN) by incorporating prior physical knowledge using a set of novel Energy-Balance regularizers.  ( 2 min )
    Temporal-Aware Deep Reinforcement Learning for Energy Storage Bidding in Energy and Contingency Reserve Markets
    arXiv:2402.19110v1 Announce Type: cross Abstract: The battery energy storage system (BESS) has immense potential for enhancing grid reliability and security through its participation in the electricity market. BESS often seeks various revenue streams by taking part in multiple markets to unlock its full potential, but effective algorithms for joint-market participation under price uncertainties are insufficiently explored in the existing research. To bridge this gap, we develop a novel BESS joint bidding strategy that utilizes deep reinforcement learning (DRL) to bid in the spot and contingency frequency control ancillary services (FCAS) markets. Our approach leverages a transformer-based temporal feature extractor to effectively respond to price fluctuations in seven markets simultaneously and helps DRL learn the best BESS bidding strategy in joint-market participation. Additionally, unlike conventional "black-box" DRL model, our approach is more interpretable and provides valuable insights into the temporal bidding behavior of BESS in the dynamic electricity market. We validate our method using realistic market prices from the Australian National Electricity Market. The results show that our strategy outperforms benchmarks, including both optimization-based and other DRL-based strategies, by substantial margins. Our findings further suggest that effective temporal-aware bidding can significantly increase profits in the spot and contingency FCAS markets compared to individual market participation.  ( 3 min )
    GAIA: Categorical Foundations of Generative AI
    arXiv:2402.18732v1 Announce Type: cross Abstract: In this paper, we propose GAIA, a generative AI architecture based on category theory. GAIA is based on a hierarchical model where modules are organized as a simplicial complex. Each simplicial complex updates its internal parameters biased on information it receives from its superior simplices and in turn relays updates to its subordinate sub-simplices. Parameter updates are formulated in terms of lifting diagrams over simplicial sets, where inner and outer horn extensions correspond to different types of learning problems. Backpropagation is modeled as an endofunctor over the category of parameters, leading to a coalgebraic formulation of deep learning.  ( 2 min )
    Smooth Tchebycheff Scalarization for Multi-Objective Optimization
    arXiv:2402.19078v1 Announce Type: new Abstract: Multi-objective optimization problems can be found in many real-world applications, where the objectives often conflict each other and cannot be optimized by a single solution. In the past few decades, numerous methods have been proposed to find Pareto solutions that represent different optimal trade-offs among the objectives for a given problem. However, these existing methods could have high computational complexity or may not have good theoretical properties for solving a general differentiable multi-objective optimization problem. In this work, by leveraging the smooth optimization technique, we propose a novel and lightweight smooth Tchebycheff scalarization approach for gradient-based multi-objective optimization. It has good theoretical properties for finding all Pareto solutions with valid trade-off preferences, while enjoying significantly lower computational complexity compared to other methods. Experimental results on various real-world application problems fully demonstrate the effectiveness of our proposed method.  ( 2 min )
    Improved Forecasting Using a PSO-RDV Framework to Enhance Artificial Neural Network
    arXiv:2402.18576v1 Announce Type: cross Abstract: Decision making and planning have long relied heavily on AI-driven forecasts. The government and the general public are working to minimize the risks while maximizing benefits in the face of potential future public health uncertainties. This study used an improved method of forecasting utilizing the Random Descending Velocity Inertia Weight (RDV IW) technique to improve the convergence of Particle Swarm Optimization (PSO) and the accuracy of Artificial Neural Network (ANN). The IW technique, inspired by the motions of a golf ball, modified the particles' velocities as they approached the solution point to a parabolically descending structure. Simulation results revealed that the proposed forecasting model with [0.4, 0.9] combination of alpha and alpha_dump exhibits a 6.36% improvement in position error and 11.75% improvement in computational time compared to the old model, thus, improving its convergence. It reached the optimum level at minimal steps with 12.50% improvement as against the old model since it provides better velocity averages when speed stabilization occurs at the 24th iteration. Meanwhile, the computed p-values for NRMSE (0.04889174), MAE (0.02829063), MAPE (0.02226053), WAPE (0.01701545), and R2 (0.00000021) of the proposed algorithm are less than the set 0.05 level of significance, thus the values indicated a significant result in terms of accuracy performance. Applying the modified ANN-PSO using RDV IW technique greatly improved the new HIV/AIDS forecasting model compared with the two models.  ( 3 min )
    Negative-Binomial Randomized Gamma Markov Processes for Heterogeneous Overdispersed Count Time Series
    arXiv:2402.18995v1 Announce Type: new Abstract: Modeling count-valued time series has been receiving increasing attention since count time series naturally arise in physical and social domains. Poisson gamma dynamical systems (PGDSs) are newly-developed methods, which can well capture the expressive latent transition structure and bursty dynamics behind count sequences. In particular, PGDSs demonstrate superior performance in terms of data imputation and prediction, compared with canonical linear dynamical system (LDS) based methods. Despite these advantages, PGDS cannot capture the heterogeneous overdispersed behaviours of the underlying dynamic processes. To mitigate this defect, we propose a negative-binomial-randomized gamma Markov process, which not only significantly improves the predictive performance of the proposed dynamical system, but also facilitates the fast convergence of the inference algorithm. Moreover, we develop methods to estimate both factor-structured and graph-structured transition dynamics, which enable us to infer more explainable latent structure, compared with PGDSs. Finally, we demonstrate the explainable latent structure learned by the proposed method, and show its superior performance in imputing missing data and forecasting future observations, compared with the related models.  ( 2 min )
    Binding-Adaptive Diffusion Models for Structure-Based Drug Design
    arXiv:2402.18583v1 Announce Type: cross Abstract: Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with cross-hierarchy interaction node for adequately fusing global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM  ( 2 min )
    Disentangling representations of retinal images with generative models
    arXiv:2402.19186v1 Announce Type: cross Abstract: Retinal fundus images play a crucial role in the early detection of eye diseases and, using deep learning approaches, recent studies have even demonstrated their potential for detecting cardiovascular risk factors and neurological disorders. However, the impact of technical factors on these images can pose challenges for reliable AI applications in ophthalmology. For example, large fundus cohorts are often confounded by factors like camera type, image quality or illumination level, bearing the risk of learning shortcuts rather than the causal relationships behind the image generation process. Here, we introduce a novel population model for retinal fundus images that effectively disentangles patient attributes from camera effects, thus enabling controllable and highly realistic image generation. To achieve this, we propose a novel disentanglement loss based on distance correlation. Through qualitative and quantitative analyses, we demonstrate the effectiveness of this novel loss function in disentangling the learned subspaces. Our results show that our model provides a new perspective on the complex relationship between patient attributes and technical confounders in retinal fundus image generation.  ( 2 min )
    Verif.ai: Towards an Open-Source Scientific Generative Question-Answering System with Referenced and Verifiable Answers
    arXiv:2402.18589v1 Announce Type: cross Abstract: In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and generating answers with references to the papers from which the claim was derived, and (3) a verification engine that cross-checks the generated claim and the abstract or paper from which the claim was derived, verifying whether there may have been any hallucinations in generating the claim. We are reinforcing the generative model by providing the abstract in context, but in addition, an independent set of methods and models are verifying the answer and checking for hallucinations. Therefore, we believe that by using our method, we can make scientists more productive, while building trust in the use of generative language models in scientific environments, where hallucinations and misinformation cannot be tolerated.  ( 2 min )
    Rethinking Multi-domain Generalization with A General Learning Objective
    arXiv:2402.18853v1 Announce Type: new Abstract: Multi-domain generalization (mDG) is universally aimed to minimize the discrepancy between training and testing distributions to enhance marginal-to-label distribution mapping. However, existing mDG literature lacks a general learning objective paradigm and often imposes constraints on static target marginal distributions. In this paper, we propose to leverage a $Y$-mapping to relax the constraint. We rethink the learning objective for mDG and design a new \textbf{general learning objective} to interpret and analyze most existing mDG wisdom. This general objective is bifurcated into two synergistic amis: learning domain-independent conditional features and maximizing a posterior. Explorations also extend to two effective regularization terms that incorporate prior information and suppress invalid causality, alleviating the issues that come with relaxed constraints. We theoretically contribute an upper bound for the domain alignment of domain-independent conditional features, disclosing that many previous mDG endeavors actually \textbf{optimize partially the objective} and thus lead to limited performance. As such, our study distills a general learning objective into four practical components, providing a general, robust, and flexible mechanism to handle complex domain shifts. Extensive empirical results indicate that the proposed objective with $Y$-mapping leads to substantially better mDG performance in various downstream tasks, including regression, segmentation, and classification.  ( 2 min )
    Learning to Compress Prompt in Natural Language Formats
    arXiv:2402.18700v1 Announce Type: cross Abstract: Large language models (LLMs) are great at processing multiple natural language processing tasks, but their abilities are constrained by inferior performance with long context, slow inference speed, and the high cost of computing the results. Deploying LLMs with precise and informative context helps users process large-scale datasets more effectively and cost-efficiently. Existing works rely on compressing long prompt contexts into soft prompts. However, soft prompt compression encounters limitations in transferability across different LLMs, especially API-based LLMs. To this end, this work aims to compress lengthy prompts in the form of natural language with LLM transferability. This poses two challenges: (i) Natural Language (NL) prompts are incompatible with back-propagation, and (ii) NL prompts lack flexibility in imposing length constraints. In this work, we propose a Natural Language Prompt Encapsulation (Nano-Capsulator) framework compressing original prompts into NL formatted Capsule Prompt while maintaining the prompt utility and transferability. Specifically, to tackle the first challenge, the Nano-Capsulator is optimized by a reward function that interacts with the proposed semantics preserving loss. To address the second question, the Nano-Capsulator is optimized by a reward function featuring length constraints. Experimental results demonstrate that the Capsule Prompt can reduce 81.4% of the original length, decrease inference latency up to 4.5x, and save 80.1% of budget overheads while providing transferability across diverse LLMs and different datasets.  ( 2 min )
    Degradation Modeling and Prognostic Analysis Under Unknown Failure Modes
    arXiv:2402.19294v1 Announce Type: new Abstract: Operating units often experience various failure modes in complex systems, leading to distinct degradation paths. Relying on a prognostic model trained on a single failure mode may lead to poor generalization performance across multiple failure modes. Therefore, accurately identifying the failure mode is of critical importance. Current prognostic approaches either ignore failure modes during degradation or assume known failure mode labels, which can be challenging to acquire in practice. Moreover, the high dimensionality and complex relations of sensor signals make it challenging to identify the failure modes accurately. To address these issues, we propose a novel failure mode diagnosis method that leverages a dimension reduction technique called UMAP (Uniform Manifold Approximation and Projection) to project and visualize each unit's degradation trajectory into a lower dimension. Then, using these degradation trajectories, we develop a time series-based clustering method to identify the training units' failure modes. Finally, we introduce a monotonically constrained prognostic model to predict the failure mode labels and RUL of the test units simultaneously using the obtained failure modes of the training units. The proposed prognostic model provides failure mode-specific RUL predictions while preserving the monotonic property of the RUL predictions across consecutive time steps. We evaluate the proposed model using a case study with the aircraft gas turbine engine dataset.  ( 2 min )
    Trained Random Forests Completely Reveal your Dataset
    arXiv:2402.19232v1 Announce Type: new Abstract: We introduce an optimization-based reconstruction attack capable of completely or near-completely reconstructing a dataset utilized for training a random forest. Notably, our approach relies solely on information readily available in commonly used libraries such as scikit-learn. To achieve this, we formulate the reconstruction problem as a combinatorial problem under a maximum likelihood objective. We demonstrate that this problem is NP-hard, though solvable at scale using constraint programming -- an approach rooted in constraint propagation and solution-domain reduction. Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to a complete reconstruction. This holds true even with a small number of trees. Even with bootstrap aggregation, the majority of the data can also be reconstructed. These findings underscore a critical vulnerability inherent in widely adopted ensemble methods, warranting attention and mitigation. Although the potential for such reconstruction attacks has been discussed in privacy research, our study provides clear empirical evidence of their practicability.  ( 2 min )
    Multi-Sensor and Multi-temporal High-Throughput Phenotyping for Monitoring and Early Detection of Water-Limiting Stress in Soybean
    arXiv:2402.18751v1 Announce Type: new Abstract: Soybean production is susceptible to biotic and abiotic stresses, exacerbated by extreme weather events. Water limiting stress, i.e. drought, emerges as a significant risk for soybean production, underscoring the need for advancements in stress monitoring for crop breeding and production. This project combines multi-modal information to identify the most effective and efficient automated methods to investigate drought response. We investigated a set of diverse soybean accessions using multiple sensors in a time series high-throughput phenotyping manner to: (1) develop a pipeline for rapid classification of soybean drought stress symptoms, and (2) investigate methods for early detection of drought stress. We utilized high-throughput time-series phenotyping using UAVs and sensors in conjunction with machine learning (ML) analytics, which offered a swift and efficient means of phenotyping. The red-edge and green bands were most effective to classify canopy wilting stress. The Red-Edge Chlorophyll Vegetation Index (RECI) successfully differentiated susceptible and tolerant soybean accessions prior to visual symptom development. We report pre-visual detection of soybean wilting using a combination of different vegetation indices. These results can contribute to early stress detection methodologies and rapid classification of drought responses in screening nurseries for breeding and production applications.  ( 3 min )
    PaECTER: Patent-level Representation Learning using Citation-informed Transformers
    arXiv:2402.19411v1 Announce Type: cross Abstract: PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset on two different rank evaluation metrics. PaECTER predicts at least one most similar patent at a rank of 1.32 on average when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.  ( 2 min )
    Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting
    arXiv:2304.00933v3 Announce Type: replace Abstract: Continual learning research has shown that neural networks suffer from catastrophic forgetting "at the output level", but it is debated whether this is also the case at the level of learned representations. Multiple recent studies ascribe representations a certain level of innate robustness against forgetting - that they only forget minimally and no critical information. We revisit and expand upon the experiments that revealed this difference in forgetting and illustrate the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. Carefully taking both aspects into account, we show that, even though it is true that feature forgetting can be small in absolute terms, newly learned information tends to be forgotten just as catastrophically at the level of the representation as it is at the output level. Next we show that this feature forgetting is problematic as it substantially slows down knowledge accumulation. Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.  ( 2 min )
    Real-Time Adaptive Safety-Critical Control with Gaussian Processes in High-Order Uncertain Models
    arXiv:2402.18946v1 Announce Type: new Abstract: This paper presents an adaptive online learning framework for systems with uncertain parameters to ensure safety-critical control in non-stationary environments. Our approach consists of two phases. The initial phase is centered on a novel sparse Gaussian process (GP) framework. We first integrate a forgetting factor to refine a variational sparse GP algorithm, thus enhancing its adaptability. Subsequently, the hyperparameters of the Gaussian model are trained with a specially compound kernel, and the Gaussian model's online inferential capability and computational efficiency are strengthened by updating a solitary inducing point derived from new samples, in conjunction with the learned hyperparameters. In the second phase, we propose a safety filter based on high-order control barrier functions (HOCBFs), synergized with the previously trained learning model. By leveraging the compound kernel from the first phase, we effectively address the inherent limitations of GPs in handling high-dimensional problems for real-time applications. The derived controller ensures a rigorous lower bound on the probability of satisfying the safety specification. Finally, the efficacy of our proposed algorithm is demonstrated through real-time obstacle avoidance experiments executed using both a simulation platform and a real-world 7-DOF robot.  ( 2 min )
    Improving Group Connectivity for Generalization of Federated Deep Learning
    arXiv:2402.18949v1 Announce Type: new Abstract: Federated learning (FL) involves multiple heterogeneous clients collaboratively training a global model via iterative local updates and model fusion. The generalization of FL's global model has a large gap compared with centralized training, which is its bottleneck for broader applications. In this paper, we study and improve FL's generalization through a fundamental ``connectivity'' perspective, which means how the local models are connected in the parameter region and fused into a generalized global model. The term ``connectivity'' is derived from linear mode connectivity (LMC), studying the interpolated loss landscape of two different solutions (e.g., modes) of neural networks. Bridging the gap between LMC and FL, in this paper, we leverage fixed anchor models to empirically and theoretically study the transitivity property of connectivity from two models (LMC) to a group of models (model fusion in FL). Based on the findings, we propose FedGuCci and FedGuCci+, improving group connectivity for better generalization. It is shown that our methods can boost the generalization of FL under client heterogeneity across various tasks (4 CV datasets and 6 NLP datasets), models (both convolutional and transformer-based), and training paradigms (both from-scratch and pretrain-finetune).  ( 2 min )
    IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing Applications
    arXiv:2209.00945v2 Announce Type: replace Abstract: Pre-training representations acquired via self-supervised learning could achieve high accuracy on even tasks with small training data. Unlike in vision and natural language processing domains, pre-training for IMU-based applications is challenging, as there are few public datasets with sufficient size and diversity to learn generalizable representations. To overcome this problem, we propose IMG2IMU that adapts pre-trained representation from large-scale images to diverse IMU sensing tasks. We convert the sensor data into visually interpretable spectrograms for the model to utilize the knowledge gained from vision. We further present a sensor-aware pre-training method for images that enables models to acquire particularly impactful knowledge for IMU sensing applications. This involves using contrastive learning on our augmentation set customized for the properties of sensor data. Our evaluation with four different IMU sensing tasks shows that IMG2IMU outperforms the baselines pre-trained on sensor data by an average of 9.6%p F1-score, illustrating that vision knowledge can be usefully incorporated into IMU sensing applications where only limited training data is available.  ( 2 min )
    DIGIC: Domain Generalizable Imitation Learning by Causal Discovery
    arXiv:2402.18910v1 Announce Type: new Abstract: Causality has been combined with machine learning to produce robust representations for domain generalization. Most existing methods of this type require massive data from multiple domains to identify causal features by cross-domain variations, which can be expensive or even infeasible and may lead to misidentification in some cases. In this work, we make a different attempt by leveraging the demonstration data distribution to discover the causal features for a domain generalizable policy. We design a novel framework, called DIGIC, to identify the causal features by finding the direct cause of the expert action from the demonstration data distribution via causal discovery. Our framework can achieve domain generalizable imitation learning with only single-domain data and serve as a complement for cross-domain variation-based methods under non-structural assumptions on the underlying causal models. Our empirical study in various control tasks shows that the proposed framework evidently improves the domain generalization performance and has comparable performance to the expert in the original domain simultaneously.  ( 2 min )
    Multi-label Node Classification On Graph-Structured Data
    arXiv:2304.10398v4 Announce Type: replace Abstract: Graph Neural Networks (GNNs) have shown state-of-the-art improvements in node classification tasks on graphs. While these improvements have been largely demonstrated in a multi-class classification scenario, a more general and realistic scenario in which each node could have multiple labels has so far received little attention. The first challenge in conducting focused studies on multi-label node classification is the limited number of publicly available multi-label graph datasets. Therefore, as our first contribution, we collect and release three real-world biological datasets and develop a multi-label graph generator to generate datasets with tunable properties. While high label similarity (high homophily) is usually attributed to the success of GNNs, we argue that a multi-label scenario does not follow the usual semantics of homophily and heterophily so far defined for a multi-class scenario. As our second contribution, we define homophily and Cross-Class Neighborhood Similarity for the multi-label scenario and provide a thorough analyses of the collected $9$ multi-label datasets. Finally, we perform a large-scale comparative study with $8$ methods and $9$ datasets and analyse the performances of the methods to assess the progress made by current state of the art in the multi-label node classification scenario. We release our benchmark at https://github.com/Tianqi-py/MLGNC.  ( 3 min )
    Communication-Efficient Federated Learning via Regularized Sparse Random Networks
    arXiv:2309.10834v2 Announce Type: replace Abstract: This work presents a new method for enhancing communication efficiency in stochastic Federated Learning that trains over-parameterized random networks. In this setting, a binary mask is optimized instead of the model weights, which are kept fixed. The mask characterizes a sparse sub-network that is able to generalize as good as a smaller target network. Importantly, sparse binary masks are exchanged rather than the floating point weights in traditional federated learning, reducing communication cost to at most 1 bit per parameter (Bpp). We show that previous state of the art stochastic methods fail to find sparse networks that can reduce the communication and storage overhead using consistent loss objectives. To address this, we propose adding a regularization term to local objectives that acts as a proxy of the transmitted masks entropy, therefore encouraging sparser solutions by eliminating redundant features across sub-networks. Extensive empirical experiments demonstrate significant improvements in communication and memory efficiency of up to five magnitudes compared to the literature, with minimal performance degradation in validation accuracy in some instances  ( 2 min )
    Efficient Model-Free Exploration in Low-Rank MDPs
    arXiv:2307.03997v2 Announce Type: replace Abstract: A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation is required. Low-Rank Markov Decision Processes -- where transition probabilities admit a low-rank factorization based on an unknown feature embedding -- offer a simple, yet expressive framework for RL with function approximation, but existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions such as latent variable structure, access to model-based function approximation, or reachability. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation and requiring no additional structural assumptions. Our algorithm, VoX, uses the notion of a barycentric spanner for the feature embedding as an efficiently computable basis for exploration, performing efficient barycentric spanner computation by interleaving representation learning and policy optimization. Our analysis -- which is appealingly simple and modular -- carefully combines several techniques, including a new approach to error-tolerant barycentric spanner computation and an improved analysis of a certain minimax representation learning objective found in prior work.  ( 2 min )
    Ask more, know better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models
    arXiv:2310.18127v2 Announce Type: replace Abstract: Large language models (LLMs) demonstrate their promise in tackling complicated practical challenges by combining action-based policies with chain of thought (CoT) reasoning. Having high-quality prompts on hand, however, is vital to the framework's effectiveness. Currently, these prompts are handcrafted utilising extensive human labor, resulting in CoT policies that frequently fail to generalise. Human intervention is also required to develop grounding functions that ensure low-level controllers appropriately process CoT reasoning. In this paper, we propose a comprehensive training framework for complex task-solving, incorporating human prior knowledge into the learning of action policies. To that purpose, we offer a new leader-follower bilevel framework that is capable of learning to ask relevant questions (prompts) and subsequently undertaking reasoning to guide the learning of actions. The prompt policy is employed to make introspective revisions based on historical findings, leading the CoT process to consider the anticipated goals and generate outputs that lead to decisive, high-performing actions. The action policy subsequently learns to comprehend and integrate the CoT outputs to take actions. Our empirical data reveal that our framework outperforms leading methods in $5$ decision-making tasks such as Overcooked and FourRoom.  ( 3 min )
    FedHCA$^2$: Towards Hetero-Client Federated Multi-Task Learning
    arXiv:2311.13250v2 Announce Type: replace-cross Abstract: Federated Learning (FL) enables joint training across distributed clients using their local data privately. Federated Multi-Task Learning (FMTL) builds on FL to handle multiple tasks, assuming model congruity that identical model architecture is deployed in each client. To relax this assumption and thus extend real-world applicability, we introduce a novel problem setting, Hetero-Client Federated Multi-Task Learning (HC-FMTL), to accommodate diverse task setups. The main challenge of HC-FMTL is the model incongruity issue that invalidates conventional aggregation methods. It also escalates the difficulties in accurate model aggregation to deal with data and task heterogeneity inherent in FMTL. To address these challenges, we propose the FedHCA$^2$ framework, which allows for federated training of personalized models by modeling relationships among heterogeneous clients. Drawing on our theoretical insights into the difference between multi-task and federated optimization, we propose the Hyper Conflict-Averse Aggregation scheme to mitigate conflicts during encoder updates. Additionally, inspired by task interaction in MTL, the Hyper Cross Attention Aggregation scheme uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity. Moreover, we employ learnable Hyper Aggregation Weights for each client to customize personalized parameter updates. Extensive experiments demonstrate the superior performance of FedHCA$^2$ in various HC-FMTL scenarios compared to representative methods. Our code will be made publicly available.  ( 2 min )
    Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach
    arXiv:2402.19265v1 Announce Type: cross Abstract: Partially Observable Markov Decision Processes (POMDPs) are a powerful framework for planning under uncertainty. They allow to model state uncertainty as a belief probability distribution. Approximate solvers based on Monte Carlo sampling show great success to relax the computational demand and perform online planning. However, scaling to complex realistic domains with many actions and long planning horizons is still a major challenge, and a key point to achieve good performance is guiding the action-selection process with domain-dependent policy heuristics which are tailored for the specific application domain. We propose to learn high-quality heuristics from POMDP traces of executions generated by any solver. We convert the belief-action pairs to a logical semantics, and exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications, which are then used as online heuristics. We evaluate thoroughly our methodology on two notoriously challenging POMDP problems, involving large action spaces and long planning horizons, namely, rocksample and pocman. Considering different state-of-the-art online POMDP solvers, including POMCP, DESPOT and AdaOPS, we show that learned heuristics expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics within lower computational time. Moreover, they well generalize to more challenging scenarios not experienced in the training phase (e.g., increasing rocks and grid size in rocksample, incrementing the size of the map and the aggressivity of ghosts in pocman).  ( 3 min )
    Multi-Fidelity Residual Neural Processes for Scalable Surrogate Modeling
    arXiv:2402.18846v1 Announce Type: new Abstract: Multi-fidelity surrogate modeling aims to learn an accurate surrogate at the highest fidelity level by combining data from multiple sources. Traditional methods relying on Gaussian processes can hardly scale to high-dimensional data. Deep learning approaches utilize neural network based encoders and decoders to improve scalability. These approaches share encoded representations across fidelities without including corresponding decoder parameters. At the highest fidelity, the representations are decoded with different parameters, making the shared information inherently inaccurate. This hinders inference performance, especially in out-of-distribution scenarios when the highest fidelity data has limited domain coverage. To address these limitations, we propose Multi-fidelity Residual Neural Processes (MFRNP), a novel multi-fidelity surrogate modeling framework. MFRNP optimizes lower fidelity decoders for accurate information sharing by aggregating lower fidelity surrogate outputs and models residual between the aggregation and ground truth on the highest fidelity. We show that MFRNP significantly outperforms current state-of-the-art in learning partial differential equations and a real-world climate modeling task.  ( 2 min )
    Towards Green AI in Fine-tuning Large Language Models via Adaptive Backpropagation
    arXiv:2309.13192v2 Announce Type: replace Abstract: Fine-tuning is the most effective way of adapting pre-trained large language models (LLMs) to downstream applications. With the fast growth of LLM-enabled AI applications and democratization of open-souced LLMs, fine-tuning has become possible for non-expert individuals, but intensively performed LLM fine-tuning worldwide could result in significantly high energy consumption and carbon footprint, which may bring large environmental impact. Mitigating such environmental impact towards Green AI directly correlates to reducing the FLOPs of fine-tuning, but existing techniques on efficient LLM fine-tuning can only achieve limited reduction of such FLOPs, due to their ignorance of the backpropagation cost in fine-tuning. To address this limitation, in this paper we present GreenTrainer, a new LLM fine-tuning technique that adaptively evaluates different tensors' backpropagation costs and contributions to the fine-tuned model accuracy, to minimize the fine-tuning cost by selecting the most appropriate set of tensors in training. Such selection in GreenTrainer is made based on a given objective of FLOPs reduction, which can flexibly adapt to the carbon footprint in energy supply and the need in Green AI. Experiment results over multiple open-sourced LLM models and abstractive summarization datasets show that, compared to fine-tuning the whole LLM model, GreenTrainer can save up to 64% FLOPs in fine-tuning without any noticeable model accuracy loss. Compared to the existing fine-tuning techniques such as LoRa, GreenTrainer can achieve up to 4% improvement on model accuracy with on-par FLOPs reduction.  ( 3 min )
    Mirror Diffusion Models for Constrained and Watermarked Generation
    arXiv:2310.01236v2 Announce Type: replace-cross Abstract: Modern successes of diffusion models in learning complex, high-dimensional data distributions are attributed, in part, to their capability to construct diffusion processes with analytic transition kernels and score functions. The tractability results in a simulation-free framework with stable regression losses, from which reversed, generative processes can be learned at scale. However, when data is confined to a constrained set as opposed to a standard Euclidean space, these desirable characteristics appear to be lost based on prior attempts. In this work, we propose Mirror Diffusion Models (MDM), a new class of diffusion models that generate data on convex constrained sets without losing any tractability. This is achieved by learning diffusion processes in a dual space constructed from a mirror map, which, crucially, is a standard Euclidean space. We derive efficient computation of mirror maps for popular constrained sets, such as simplices and $\ell_2$-balls, showing significantly improved performance of MDM over existing methods. For safety and privacy purposes, we also explore constrained sets as a new mechanism to embed invisible but quantitative information (i.e., watermarks) in generated data, for which MDM serves as a compelling approach. Our work brings new algorithmic opportunities for learning tractable diffusion on complex domains. Our code is available at https://github.com/ghliu/mdm  ( 3 min )
    When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning
    arXiv:2302.07457v3 Announce Type: replace Abstract: Offline inverse reinforcement learning (Offline IRL) aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving. However, the structure of an expert's preferences implicit in observed actions is closely linked to the expert's model of the environment dynamics (i.e. the ``world'' model). Thus, inaccurate models of the world obtained from finite data with limited coverage could compound inaccuracy in estimated rewards. To address this issue, we propose a bi-level optimization formulation of the estimation task wherein the upper level is likelihood maximization based upon a conservative model of the expert's policy (lower level). The policy model is conservative in that it maximizes reward subject to a penalty that is increasing in the uncertainty of the estimated model of the world. We propose a new algorithmic framework to solve the bi-level optimization problem formulation and provide statistical and computational guarantees of performance for the associated optimal reward estimator. Finally, we demonstrate that the proposed algorithm outperforms the state-of-the-art offline IRL and imitation learning benchmarks by a large margin, over the continuous control tasks in MuJoCo and different datasets in the D4RL benchmark.  ( 3 min )
    Theoretical Foundations of Deep Selective State-Space Models
    arXiv:2402.19047v1 Announce Type: new Abstract: Structured state-space models (SSMs) such as S4, stemming from the seminal work of Gu et al., are gaining popularity as effective approaches for modeling sequential data. Deep SSMs demonstrate outstanding performance across a diverse set of domains, at a reduced training and inference cost compared to attention-based transformers. Recent developments show that if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states (e.g. GateLoop, Mamba, GLA), then the resulting architecture can surpass in both in accuracy and efficiency attention-powered foundation models trained on text, at scales of billion parameters. In this paper, we give theoretical grounding to this recent finding using tools from Rough Path Theory: we show that when random linear recurrences are equipped with simple input-controlled transitions (selectivity mechanism), then the hidden state is provably a low-dimensional projection of a powerful mathematical object called the signature of the input -- capturing non-linear interactions between tokens at distinct timescales. Our theory not only motivates the success of modern selective state-space models such as Mamba but also provides a solid framework to understand the expressive power of future SSM variants.  ( 2 min )
    Towards Task Sampler Learning for Meta-Learning
    arXiv:2307.08924v3 Announce Type: replace Abstract: Meta-learning aims to learn general knowledge with diverse training tasks conducted from limited data, and then transfer it to new tasks. It is commonly believed that increasing task diversity will enhance the generalization ability of meta-learning models. However, this paper challenges this view through empirical and theoretical analysis. We obtain three conclusions: (i) there is no universal task sampling strategy that can guarantee the optimal performance of meta-learning models; (ii) over-constraining task diversity may incur the risk of under-fitting or over-fitting during training; and (iii) the generalization performance of meta-learning models are affected by task diversity, task entropy, and task difficulty. Based on this insight, we design a novel task sampler, called Adaptive Sampler (ASr). ASr is a plug-and-play module that can be integrated into any meta-learning framework. It dynamically adjusts task weights according to task diversity, task entropy, and task difficulty, thereby obtaining the optimal probability distribution for meta-training tasks. Finally, we conduct experiments on a series of benchmark datasets across various scenarios, and the results demonstrate that ASr has clear advantages.  ( 2 min )
    Simple linear attention language models balance the recall-throughput tradeoff
    arXiv:2402.18668v1 Announce Type: cross Abstract: Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottle-necked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g. by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED a simple architecture combining linear and sliding window attention. By varying BASED window size and linear attention feature dimension, we can dial the state size and traverse the pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2, when generating 1024 tokens using 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.  ( 3 min )
    Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures
    arXiv:2306.15012v3 Announce Type: replace-cross Abstract: Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors, 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.  ( 3 min )
    Curiosity-driven Red-teaming for Large Language Models
    arXiv:2402.19464v1 Announce Type: new Abstract: Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases resulting in a low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while mantaining or increasing their effectiveness compared to existing methods. Our method, CRT successfully provokes toxic responses from LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{https://github.com/Improbable-AI/curiosity_redteam}  ( 2 min )
    PlaceNav: Topological Navigation through Place Recognition
    arXiv:2309.17260v4 Announce Type: replace-cross Abstract: Recent results suggest that splitting topological navigation into robot-independent and robot-specific components improves navigation performance by enabling the robot-independent part to be trained with data collected by robots of different types. However, the navigation methods' performance is still limited by the scarcity of suitable training data and they suffer from poor computational scaling. In this work, we present PlaceNav, subdividing the robot-independent part into navigation-specific and generic computer vision components. We utilize visual place recognition for the subgoal selection of the topological navigation pipeline. This makes subgoal selection more efficient and enables leveraging large-scale datasets from non-robotics sources, increasing training data availability. Bayesian filtering, enabled by place recognition, further improves navigation performance by increasing the temporal consistency of subgoals. Our experimental results verify the design and the new method obtains a 76% higher success rate in indoor and 23% higher in outdoor navigation tasks with higher computational efficiency.  ( 2 min )
    ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL
    arXiv:2402.19446v1 Announce Type: new Abstract: A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs to not just generate completions for a given prompt, but rather make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., when interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot endow LLMs with the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions -- all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs, that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while accommodating multiple turns, long horizons, and delayed rewards effectively. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy value-based RL algorithm to aggregate reward over utterances, and a low-level RL algorithm that utilizes this high-level value function to train a token policy within each utterance or turn. Our hierarchical framework, Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining a sample efficiency of about 100x over existing methods, while also improving with larger model capacity (upto the 7 billion scale that we tested on).  ( 3 min )
    Partially Observable Multi-agent RL with (Quasi-)Efficiency: The Blessing of Information Sharing
    arXiv:2308.08705v2 Announce Type: replace Abstract: We study provable multi-agent reinforcement learning (MARL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical MARL, and a standard model for multi-agent control systems with communications. We first establish several computation complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for computational efficiency in solving POSGs. We then propose to further \emph{approximate} the shared common information to construct an {approximate model} of the POSG, in which planning an approximate equilibrium (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable MARL algorithm that is both statistically and computationally quasi-efficient. We hope our study may open up the possibilities of leveraging and even designing different \emph{information structures}, for developing both sample- and computation-efficient partially observable MARL.  ( 2 min )
    Probabilistic Lipschitzness and the Stable Rank for Comparing Explanation Models
    arXiv:2402.18863v1 Announce Type: new Abstract: Explainability models are now prevalent within machine learning to address the black-box nature of neural networks. The question now is which explainability model is most effective. Probabilistic Lipschitzness has demonstrated that the smoothness of a neural network is fundamentally linked to the quality of post hoc explanations. In this work, we prove theoretical lower bounds on the probabilistic Lipschitzness of Integrated Gradients, LIME and SmoothGrad. We propose a novel metric using probabilistic Lipschitzness, normalised astuteness, to compare the robustness of explainability models. Further, we prove a link between the local Lipschitz constant of a neural network and its stable rank. We then demonstrate that the stable rank of a neural network provides a heuristic for the robustness of explainability models.  ( 2 min )
    Anomaly Detection in Offshore Wind Turbine Structures using Hierarchical Bayesian Modelling
    arXiv:2402.19295v1 Announce Type: new Abstract: Population-based structural health monitoring (PBSHM), aims to share information between members of a population. An offshore wind (OW) farm could be considered as a population of nominally-identical wind-turbine structures. However, benign variations exist among members, such as geometry, sea-bed conditions and temperature differences. These factors could influence structural properties and therefore the dynamic response, making it more difficult to detect structural problems via traditional SHM techniques. This paper explores the use of a hierarchical Bayesian model to infer expected soil stiffness distributions at both population and local levels, as a basis to perform anomaly detection, in the form of scour, for new and existing turbines. To do this, observations of natural frequency will be generated as though they are from a small population of wind turbines. Differences between individual observations will be introduced by postulating distributions over the soil stiffness and measurement noise, as well as reducing soil depth (to represent scour), in the case of anomaly detection.  ( 2 min )
    Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook
    arXiv:2402.19348v1 Announce Type: new Abstract: As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for sustainable development by harnessing the power of cross-domain data fusion from diverse sources (e.g., geographical, traffic, social media, and environmental data) and modalities (e.g., spatio-temporal, visual, and textual modalities). Recently, we are witnessing a rising trend that utilizes various deep-learning methods to facilitate cross-domain data fusion in smart cities. To this end, we propose the first survey that systematically reviews the latest advancements in deep learning-based data fusion methods tailored for urban computing. Specifically, we first delve into data perspective to comprehend the role of each modality and data source. Secondly, we classify the methodology into four primary categories: feature-based, alignment-based, contrast-based, and generation-based fusion methods. Thirdly, we further categorize multi-modal urban applications into seven types: urban planning, transportation, economy, public safety, society, environment, and energy. Compared with previous surveys, we focus more on the synergy of deep learning methods with urban computing applications. Furthermore, we shed light on the interplay between Large Language Models (LLMs) and urban computing, postulating future research directions that could revolutionize the field. We firmly believe that the taxonomy, progress, and prospects delineated in our survey stand poised to significantly enrich the research community. The summary of the comprehensive and up-to-date paper list can be found at https://github.com/yoshall/Awesome-Multimodal-Urban-Computing.  ( 3 min )
    Invariant Aggregator for Defending against Federated Backdoor Attacks
    arXiv:2210.01834v3 Announce Type: replace Abstract: Federated learning enables training high-utility models across several clients without directly sharing their private data. As a downside, the federated setting makes the model vulnerable to various adversarial attacks in the presence of malicious clients. Despite the theoretical and empirical success in defending against attacks that aim to degrade models' utility, defense against backdoor attacks that increase model accuracy on backdoor samples exclusively without hurting the utility on other samples remains challenging. To this end, we first analyze the failure modes of existing defenses over a flat loss landscape, which is common for well-designed neural networks such as Resnet [He et al., 2015] but is often overlooked by previous works. Then, we propose an invariant aggregator that redirects the aggregated update to invariant directions that are generally useful via selectively masking out the update elements that favor few and possibly malicious clients. Theoretical results suggest that our approach provably mitigates backdoor attacks and remains effective over flat loss landscapes. Empirical results on three datasets with different modalities and varying numbers of clients further demonstrate that our approach mitigates a broad class of backdoor attacks with a negligible cost on the model utility.  ( 2 min )
    Memory Injections: Correcting Multi-Hop Reasoning Failures during Inference in Transformer-Based Language Models
    arXiv:2309.05605v3 Announce Type: replace-cross Abstract: Answering multi-hop reasoning questions requires retrieving and synthesizing information from diverse sources. Large Language Models (LLMs) struggle to perform such reasoning consistently. Here we propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LLM attention heads. First, we analyze the per-layer activations of GPT-2 models in response to single and multi-hop prompts. We then propose a mechanism that allows users to inject pertinent prompt-specific information, which we refer to as "memories," at critical LLM locations during inference. By thus enabling the LLM to incorporate additional relevant information during inference, we enhance the quality of multi-hop prompt completions. We show empirically that a simple, efficient, and targeted memory injection into a key attention layer can often increase the probability of the desired next token in multi-hop tasks, by up to 424%.  ( 2 min )
    Differentially Private Worst-group Risk Minimization
    arXiv:2402.19437v1 Announce Type: new Abstract: We initiate a systematic study of worst-group risk minimization under $(\epsilon, \delta)$-differential privacy (DP). The goal is to privately find a model that approximately minimizes the maximal risk across $p$ sub-populations (groups) with different distributions, where each group distribution is accessed via a sample oracle. We first present a new algorithm that achieves excess worst-group population risk of $\tilde{O}(\frac{p\sqrt{d}}{K\epsilon} + \sqrt{\frac{p}{K}})$, where $K$ is the total number of samples drawn from all groups and $d$ is the problem dimension. Our rate is nearly optimal when each distribution is observed via a fixed-size dataset of size $K/p$. Our result is based on a new stability-based analysis for the generalization error. In particular, we show that $\Delta$-uniform argument stability implies $\tilde{O}(\Delta + \frac{1}{\sqrt{n}})$ generalization error w.r.t. the worst-group risk, where $n$ is the number of samples drawn from each sample oracle. Next, we propose an algorithmic framework for worst-group population risk minimization using any DP online convex optimization algorithm as a subroutine. Hence, we give another excess risk bound of $\tilde{O}\left( \sqrt{\frac{d^{1/2}}{\epsilon K}} +\sqrt{\frac{p}{K\epsilon^2}} \right)$. Assuming the typical setting of $\epsilon=\Theta(1)$, this bound is more favorable than our first bound in a certain range of $p$ as a function of $K$ and $d$. Finally, we study differentially private worst-group empirical risk minimization in the offline setting, where each group distribution is observed by a fixed-size dataset. We present a new algorithm with nearly optimal excess risk of $\tilde{O}(\frac{p\sqrt{d}}{K\epsilon})$.  ( 2 min )
    Safe Reinforcement Learning in a Simulated Robotic Arm
    arXiv:2312.09468v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) agents need to explore their environments in order to learn optimal policies. In many environments and tasks, safety is of critical importance. The widespread use of simulators offers a number of advantages, including safe exploration which will be inevitable in cases when RL systems need to be trained directly in the physical environment (e.g. in human-robot interaction). The popular Safety Gym library offers three mobile agent types that can learn goal-directed tasks while considering various safety constraints. In this paper, we extend the applicability of safe RL algorithms by creating a customized environment with Panda robotic arm where Safety Gym algorithms can be tested. We performed pilot experiments with the popular PPO algorithm comparing the baseline with the constrained version and show that the constrained version is able to learn the equally good policy while better complying with safety constraints and taking longer training time as expected.  ( 2 min )
    Topology-Based Reconstruction Prevention for Decentralised Learning
    arXiv:2312.05248v2 Announce Type: replace-cross Abstract: Decentralised learning has recently gained traction as an alternative to federated learning in which both data and coordination are distributed over its users. To preserve data confidentiality, decentralised learning relies on differential privacy, multi-party computation, or a combination thereof. However, running multiple privacy-preserving summations in sequence may allow adversaries to perform reconstruction attacks. Unfortunately, current reconstruction countermeasures either cannot trivially be adapted to the distributed setting, or add excessive amounts of noise. In this work, we first show that passive honest-but-curious adversaries can infer other users' private data after several privacy-preserving summations. For example, in subgraphs with 18 users, we show that only three passive honest-but-curious adversaries succeed at reconstructing private data 11.0% of the time, requiring an average of 8.8 summations per adversary. The success rate depends only on the adversaries' direct neighbourhood, independent of the size of the full network. We consider weak adversaries, who do not control the graph topology and can exploit neither the inner workings of the summation protocol nor the specifics of users' data. We develop a mathematical understanding of how reconstruction relates to topology and propose the first topology-based decentralised defence against reconstruction attacks. Specifically, we show that reconstruction requires a number of adversaries linear in the length of the network's shortest cycle. Consequently, reconstructing private data from privacy-preserving summations is impossible in acyclic networks. Our work is a stepping stone for a formal theory of topology-based reconstruction defences. Such a theory would generalise our countermeasure beyond summation, define confidentiality in terms of entropy, and describe the effects of differential privacy.  ( 3 min )
    FORML: A Riemannian Hessian-free Method for Meta-learning with Orthogonality Constraint
    arXiv:2402.18605v1 Announce Type: new Abstract: Meta-learning problem is usually formulated as a bi-level optimization in which the task-specific and the meta-parameters are updated in the inner and outer loops of optimization, respectively. However, performing the optimization in the Riemannian space, where the parameters and meta-parameters are located on Riemannian manifolds is computationally intensive. Unlike the Euclidean methods, the Riemannian backpropagation needs computing the second-order derivatives that include backward computations through the Riemannian operators such as retraction and orthogonal projection. This paper introduces a Hessian-free approach that uses a first-order approximation of derivatives on the Stiefel manifold. Our method significantly reduces the computational load and memory footprint. We show how using a Stiefel fully-connected layer that enforces orthogonality constraint on the parameters of the last classification layer as the head of the backbone network, strengthens the representation reuse of the gradient-based meta-learning methods. Our experimental results across various few-shot learning datasets, demonstrate the superiority of our proposed method compared to the state-of-the-art methods, especially MAML, its Euclidean counterpart.  ( 2 min )
    Differential Diffusion: Giving Each Pixel Its Strength
    arXiv:2306.00950v2 Announce Type: replace-cross Abstract: Diffusion models have revolutionized image generation and editing, producing state-of-the-art results in conditioned and unconditioned image synthesis. While current techniques enable user control over the degree of change in an image edit, the controllability is limited to global changes over an entire edited region. This paper introduces a novel framework that enables customization of the amount of change per pixel or per image region. Our framework can be integrated into any existing diffusion model, enhancing it with this capability. Such granular control on the quantity of change opens up a diverse array of new editing capabilities, such as control of the extent to which individual objects are modified, or the ability to introduce gradual spatial changes. Furthermore, we showcase the framework's effectiveness in soft-inpainting -- the completion of portions of an image while subtly adjusting the surrounding areas to ensure seamless integration. Additionally, we introduce a new tool for exploring the effects of different change quantities. Our framework operates solely during inference, requiring no model training or fine-tuning. We demonstrate our method with the current open state-of-the-art models, and validate it via both quantitative and qualitative comparisons, and a user study. Our code is available at: https://github.com/exx8/differential-diffusion  ( 2 min )
    A Study of Acquisition Functions for Medical Imaging Deep Active Learning
    arXiv:2401.15721v2 Announce Type: replace-cross Abstract: The Deep Learning revolution has enabled groundbreaking achievements in recent years. From breast cancer detection to protein folding, deep learning algorithms have been at the core of very important advancements. However, these modern advancements are becoming more and more data-hungry, especially on labeled data whose availability is scarce: this is even more prevalent in the medical context. In this work, we show how active learning could be very effective in data scarcity situations, where obtaining labeled data (or annotation budget is very limited). We compare several selection criteria (BALD, MeanSTD, and MaxEntropy) on the ISIC 2016 dataset. We also explored the effect of acquired pool size on the model's performance. Our results suggest that uncertainty is useful to the Melanoma detection task, and confirms the hypotheses of the author of the paper of interest, that \textit{bald} performs on average better than other acquisition functions. Our extended analyses however revealed that all acquisition functions perform badly on the positive (cancerous) samples, suggesting exploitation of class unbalance, which could be crucial in real-world settings. We finish by suggesting future work directions that would be useful to improve this current work. The code of our implementation is open-sourced at \url{https://github.com/bonaventuredossou/ece526_course_project}  ( 3 min )
    GDP nowcasting with artificial neural networks: How much does long-term memory matter?
    arXiv:2304.05805v3 Announce Type: replace-cross Abstract: We apply artificial neural networks (ANNs) to nowcast quarterly GDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare the nowcasting performance of five different ANN architectures: the multilayer perceptron (MLP), the one-dimensional convolutional neural network (1D CNN), the Elman recurrent neural network (RNN), the long short-term memory network (LSTM), and the gated recurrent unit (GRU). The empirical analysis presents results from two distinctively different evaluation periods. The first (2012:Q1 -- 2019:Q4) is characterized by balanced economic growth, while the second (2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According to our results, longer input sequences result in more accurate nowcasts in periods of balanced economic growth. However, this effect ceases above a relatively low threshold value of around six quarters (eighteen months). During periods of economic turbulence (e.g., during the COVID-19 recession), longer input sequences do not help the models' predictive performance; instead, they seem to weaken their generalization capability. Combined results from the two evaluation periods indicate that architectural features enabling long-term memory do not result in more accurate nowcasts. Comparing network architectures, the 1D CNN has proved to be a highly suitable model for GDP nowcasting. The network has shown good nowcasting performance among the competitors during the first evaluation period and achieved the overall best accuracy during the second evaluation period. Consequently, first in the literature, we propose the application of the 1D CNN for economic nowcasting.  ( 3 min )
    Practical Transferability Estimation for Image Classification Tasks
    arXiv:2106.10479v3 Announce Type: replace-cross Abstract: Transferability estimation is an essential problem in transfer learning to predict how good the performance is when transferring a source model (or source task) to a target task. Recent analytical transferability metrics have been widely used for source model selection and multi-task learning. A major challenge is how to make transfereability estimation robust under the cross-domain cross-task settings. The recently proposed OTCE score solves this problem by considering both domain and task differences, with the help of transfer experiences on auxiliary tasks, which causes an efficiency overhead. In this work, we propose a practical transferability metric called JC-NCE score that dramatically improves the robustness of the task difference estimation in OTCE, thus removing the need for auxiliary tasks. Specifically, we build the joint correspondences between source and target data via solving an optimal transport problem with a ground cost considering both the sample distance and label distance, and then compute the transferability score as the negative conditional entropy of the matched labels. Extensive validations under the intra-dataset and inter-dataset transfer settings demonstrate that our JC-NCE score outperforms the auxiliary-task free version of OTCE for 7% and 12%, respectively, and is also more robust than other existing transferability metrics on average.  ( 3 min )
    Improving Generalization in Reinforcement Learning Training Regimes for Social Robot Navigation
    arXiv:2308.14947v2 Announce Type: replace-cross Abstract: In order for autonomous mobile robots to navigate in human spaces, they must abide by our social norms. Reinforcement learning (RL) has emerged as an effective method to train sequential decision-making policies that are able to respect these norms. However, a large portion of existing work in the field conducts both RL training and testing in simplistic environments. This limits the generalization potential of these models to unseen environments, and the meaningfulness of their reported results. We propose a method to improve the generalization performance of RL social navigation methods using curriculum learning. By employing multiple environment types and by modeling pedestrians using multiple dynamics models, we are able to progressively diversify and escalate difficulty in training. Our results show that the use of curriculum learning in training can be used to achieve better generalization performance than previous training methods. We also show that results presented in many existing state-of-the-art RL social navigation works do not evaluate their methods outside of their training environments, and thus do not reflect their policies' failure to adequately generalize to out-of-distribution scenarios. In response, we validate our training approach on larger and more crowded testing environments than those used in training, allowing for more meaningful measurements of model performance.  ( 3 min )
    Transferability-Guided Cross-Domain Cross-Task Transfer Learning
    arXiv:2207.05510v2 Announce Type: replace-cross Abstract: We propose two novel transferability metrics F-OTCE (Fast Optimal Transport based Conditional Entropy) and JC-OTCE (Joint Correspondence OTCE) to evaluate how much the source model (task) can benefit the learning of the target task and to learn more transferable representations for cross-domain cross-task transfer learning. Unlike the existing metric that requires evaluating the empirical transferability on auxiliary tasks, our metrics are auxiliary-free such that they can be computed much more efficiently. Specifically, F-OTCE estimates transferability by first solving an Optimal Transport (OT) problem between source and target distributions, and then uses the optimal coupling to compute the Negative Conditional Entropy between source and target labels. It can also serve as a loss function to maximize the transferability of the source model before finetuning on the target task. Meanwhile, JC-OTCE improves the transferability robustness of F-OTCE by including label distances in the OT problem, though it may incur additional computation cost. Extensive experiments demonstrate that F-OTCE and JC-OTCE outperform state-of-the-art auxiliary-free metrics by 18.85% and 28.88%, respectively in correlation coefficient with the ground-truth transfer accuracy. By eliminating the training cost of auxiliary tasks, the two metrics reduces the total computation time of the previous method from 43 minutes to 9.32s and 10.78s, respectively, for a pair of tasks. When used as a loss function, F-OTCE shows consistent improvements on the transfer accuracy of the source model in few-shot classification experiments, with up to 4.41% accuracy gain.  ( 3 min )
    Analyzing Trendy Twitter Hashtags in the 2022 French Election
    arXiv:2310.07576v2 Announce Type: replace-cross Abstract: Regressions trained to predict the future activity of social media users need rich features for accurate predictions. Many advanced models exist to generate such features; however, the time complexities of their computations are often prohibitive when they run on enormous data-sets. Some studies have shown that simple semantic network features can be rich enough to use for regressions without requiring complex computations. We propose a method for using semantic networks as user-level features for machine learning tasks. We conducted an experiment using a semantic network of 1037 Twitter hashtags from a corpus of 3.7 million tweets related to the 2022 French presidential election. A bipartite graph is formed where hashtags are nodes and weighted edges connect the hashtags reflecting the number of Twitter users that interacted with both hashtags. The graph is then transformed into a maximum-spanning tree with the most popular hashtag as its root node to construct a hierarchy amongst the hashtags. We then provide a vector feature for each user based on this tree. To validate the usefulness of our semantic feature we performed a regression experiment to predict the response rate of each user with six emotions like anger, enjoyment, or disgust. Our semantic feature performs well with the regression with most emotions having $R^2$ above 0.5. These results suggest that our semantic feature could be considered for use in further experiments predicting social media response on big data-sets.  ( 3 min )
    Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer
    arXiv:2401.09181v2 Announce Type: replace Abstract: Multimodal Continual Instruction Tuning (MCIT) enables Multimodal Large Language Models (MLLMs) to meet continuously emerging requirements without expensive retraining. MCIT faces two major obstacles: catastrophic forgetting (where old knowledge is forgotten) and negative forward transfer (where the performance of future tasks is degraded). Although existing methods have greatly alleviated catastrophic forgetting, they still suffer from negative forward transfer. By performing singular value decomposition (SVD) on input embeddings, we discover a large discrepancy in different input embeddings. The discrepancy results in the model learning irrelevant information for old and pre-trained tasks, which leads to catastrophic forgetting and negative forward transfer. To address these issues, we propose Fwd-Prompt, a prompt-based method projecting prompt gradient to the residual space to minimize the interference between tasks and to the pre-trained subspace for reusing pre-trained knowledge. Our experiments demonstrate that Fwd-Prompt achieves state-of-the-art performance while updating fewer parameters and requiring no old samples. Our research sheds light on the potential of continuously adapting MLLMs to new tasks under the instruction tuning paradigm and encourages future studies to explore MCIT. The code will soon be publicly available.  ( 2 min )
    CMISR: Circular Medical Image Super-Resolution
    arXiv:2308.08567v2 Announce Type: replace-cross Abstract: Classical methods of medical image super-resolution (MISR) utilize open-loop architecture with implicit under-resolution (UR) unit and explicit super-resolution (SR) unit. The UR unit can always be given, assumed, or estimated, while the SR unit is elaborately designed according to various SR algorithms. The closed-loop feedback mechanism is widely employed in current MISR approaches and can efficiently improve their performance. The feedback mechanism may be divided into two categories: local feedback and global feedback. Therefore, this paper proposes a global feedback-based closed-cycle framework, circular MISR (CMISR), with unambiguous UR and advanced SR elements. Mathematical model and closed-loop equation of CMISR are built. Mathematical proof with Taylor-series approximation indicates that CMISR has zero recovery error in steady-state. In addition, CMISR holds plug-and-play characteristic that fuses model-based and learning-based approaches and can be established on any existing MISR algorithms. Five CMISR algorithms are respectively proposed based on the state-of-the-art open-loop MISR algorithms. Experimental results with three scale factors and on three open medical image datasets show that CMISR is superior to MISR in reconstruction performance and is particularly suited to medical images with strong edges or intense contrast.  ( 2 min )
    Cooperative Open-ended Learning Framework for Zero-shot Coordination
    arXiv:2302.04831v4 Announce Type: replace-cross Abstract: Zero-shot coordination in cooperative artificial intelligence (AI) remains a significant challenge, which means effectively coordinating with a wide range of unseen partners. Previous algorithms have attempted to address this challenge by optimizing fixed objectives within a population to improve strategy or behaviour diversity. However, these approaches can result in a loss of learning and an inability to cooperate with certain strategies within the population, known as cooperative incompatibility. To address this issue, we propose the Cooperative Open-ended LEarning (COLE) framework, which constructs open-ended objectives in cooperative games with two players from the perspective of graph theory to assess and identify the cooperative ability of each strategy. We further specify the framework and propose a practical algorithm that leverages knowledge from game theory and graph theory. Furthermore, an analysis of the learning process of the algorithm shows that it can efficiently overcome cooperative incompatibility. The experimental results in the Overcooked game environment demonstrate that our method outperforms current state-of-the-art methods when coordinating with different-level partners. Our demo is available at https://sites.google.com/view/cole-2023.  ( 2 min )
    Revisiting the Hypothesis: Do pretrained Transformers Learn In-Context by Gradient Descent?
    arXiv:2310.08540v4 Announce Type: replace-cross Abstract: The emergence of In-Context Learning (ICL) in LLMs remains a significant phenomenon with little understanding. To explain ICL, recent studies try to theoretically connect it to Gradient Descent (GD). We ask, does this connection hold up in actual pre-trained models? We highlight the limiting assumptions in prior works that make their context considerably different from the practical context in which language models are trained. For example, the theoretical hand-constructed weights used in these studies have properties that don't match those of real LLMs. Furthermore, their experimental verification uses ICL objective (training models explicitly for ICL), which differs from the emergent ICL in the wild. We also look for evidence in real models. We observe that ICL and GD have different sensitivity to the order in which they observe demonstrations. Finally, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMa-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that the equivalence between ICL and GD remains an open hypothesis and calls for further studies.  ( 3 min )
    Towards Open Federated Learning Platforms: Survey and Vision from Technical and Legal Perspectives
    arXiv:2307.02140v3 Announce Type: replace-cross Abstract: Traditional Federated Learning (FL) follows a server-dominated cooperation paradigm which narrows the application scenarios of FL and decreases the enthusiasm of data holders to participate. To fully unleash the potential of FL, we advocate rethinking the design of current FL frameworks and extending it to a more generalized concept: Open Federated Learning Platforms, positioned as a crowdsourcing collaborative machine learning infrastructure for all Internet users. We propose two reciprocal cooperation frameworks to achieve this: query-based FL and contract-based FL. In this survey, we conduct a comprehensive review of the feasibility of constructing open FL platforms from both technical and legal perspectives. We begin by reviewing the definition of FL and summarizing its inherent limitations, including server-client coupling, low model reusability, and non-public. In particular, we introduce a novel taxonomy to streamline the analysis of model license compatibility in FL studies that involve batch model reusing methods, including combination, amalgamation, distillation, and generation. This taxonomy provides a feasible solution for identifying the corresponding licenses clauses and facilitates the analysis of potential legal implications and restrictions when reusing models. Through this survey, we uncover the current dilemmas faced by FL and advocate for the development of sustainable open FL platforms. We aim to provide guidance for establishing such platforms in the future while identifying potential limitations that need to be addressed.  ( 3 min )
    The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
    arXiv:2311.01410v2 Announce Type: replace-cross Abstract: We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.  ( 3 min )
    Inverse Optimization for Routing Problems
    arXiv:2307.07357v2 Announce Type: replace-cross Abstract: We propose a method for learning decision-makers' behavior in routing problems using Inverse Optimization (IO). The IO framework falls into the supervised learning category and builds on the premise that the target behavior is an optimizer of an unknown cost function. This cost function is to be learned through historical data, and in the context of routing problems, can be interpreted as the routing preferences of the decision-makers. In this view, the main contributions of this study are to propose an IO methodology with a hypothesis function, loss function, and stochastic first-order algorithm tailored to routing problems. We further test our IO approach in the Amazon Last Mile Routing Research Challenge, where the goal is to learn models that replicate the routing preferences of human drivers, using thousands of real-world routing examples. Our final IO-learned routing model achieves a score that ranks 2nd compared with the 48 models that qualified for the final round of the challenge. Our examples and results showcase the flexibility and real-world potential of the proposed IO methodology to learn from decision-makers' decisions in routing problems.  ( 2 min )
    Learning Associative Memories with Gradient Descent
    arXiv:2402.18724v1 Announce Type: new Abstract: This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the ``classification margins.'' Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. In underparameterized regimes, we illustrate how the cross-entropy loss can lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.  ( 2 min )
    Priority Sampling of Large Language Models for Compilers
    arXiv:2402.18734v1 Announce Type: new Abstract: Large language models show great potential in generating and optimizing code. Widely used sampling methods such as Nucleus Sampling increase the diversity of generation but often produce repeated samples for low temperatures and incoherent samples for high temperatures. Furthermore, the temperature coefficient has to be tuned for each task, limiting its usability. We present Priority Sampling, a simple and deterministic sampling technique that produces unique samples ordered by the model's confidence. Each new sample expands the unexpanded token with the highest probability in the augmented search tree. Additionally, Priority Sampling supports generation based on regular expression that provides a controllable and structured exploration process. Priority Sampling outperforms Nucleus Sampling for any number of samples, boosting the performance of the original model from 2.87% to 5% improvement over -Oz. Moreover, it outperforms the autotuner used for the generation of labels for the training of the original model in just 30 samples.  ( 2 min )
    GNSS Positioning using Cost Function Regulated Multilateration and Graph Neural Networks
    arXiv:2402.18630v1 Announce Type: new Abstract: In urban environments, where line-of-sight signals from GNSS satellites are frequently blocked by high-rise objects, GNSS receivers are subject to large errors in measuring satellite ranges. Heuristic methods are commonly used to estimate these errors and reduce the impact of noisy measurements on localization accuracy. In our work, we replace these error estimation heuristics with a deep learning model based on Graph Neural Networks. Additionally, by analyzing the cost function of the multilateration process, we derive an optimal method to utilize the estimated errors. Our approach guarantees that the multilateration converges to the receiver's location as the error estimation accuracy increases. We evaluate our solution on a real-world dataset containing more than 100k GNSS epochs, collected from multiple cities with diverse characteristics. The empirical results show improvements from 40% to 80% in the horizontal localization error against recent deep learning baselines as well as classical localization approaches.  ( 2 min )
    Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling
    arXiv:2308.07470v2 Announce Type: replace-cross Abstract: Having large batch sizes is one of the most critical aspects of increasing the accelerator efficiency and the performance of DNN model inference. However, existing model serving systems cannot achieve adequate batch sizes while meeting latency objectives as these systems eagerly dispatch requests to accelerators to minimize the accelerator idle time. We propose Symphony, a DNN serving system that explores deferred batch scheduling to optimize system efficiency and throughput. Further, unlike other prior systems, Symphony's GPU usage is load-proportional: it consolidates workloads on the appropriate number of GPUs and works smoothly with cluster auto-scaling tools. Symphony consists of two core design points. First, Symphony defines a schedulable window in which a batch of inference requests can be dispatched. This window is computed in order to improve accelerator efficiency while meeting the request's SLO. Second, Symphony implements a scalable, low-latency, fine-grained coordination scheme across accelerators to dispatch and execute requests in the schedulable window. Through extensive scheduler-only benchmarks, we demonstrate that Symphony can schedule millions of requests per second and coordinate thousands of GPUs while also enabling robust autoscaling that adapts to workload changes. Symphony outperforms prior systems by achieving 5x higher goodput when given the same number of GPUs and 60% reduction in GPUs when given the same workload.  ( 3 min )
    Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic
    arXiv:2309.13339v2 Announce Type: replace-cross Abstract: Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts) prompting, a self-improvement framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic.  ( 2 min )
    Convex Hulls of Reachable Sets
    arXiv:2303.17674v3 Announce Type: replace-cross Abstract: We study the convex hulls of reachable sets of nonlinear systems with bounded disturbances and uncertain initial conditions. Reachable sets play a critical role in control, but remain notoriously challenging to compute, and existing over-approximation tools tend to be conservative or computationally expensive. In this work, we characterize the convex hulls of reachable sets as the convex hulls of solutions of an ordinary differential equation with initial conditions on the sphere. This finite-dimensional characterization unlocks an efficient sampling-based estimation algorithm to accurately over-approximate reachable sets. We also study the structure of the boundary of the reachable convex hulls and derive error bounds for the estimation algorithm. We give applications to neural feedback loop analysis and robust MPC.  ( 2 min )
    Multitask Learning Can Improve Worst-Group Outcomes
    arXiv:2312.03151v2 Announce Type: replace Abstract: In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is one such widely used technique. In this paper, we seek not only to understand the impact of MTL on worst-group accuracy but also to explore its potential as a tool to address the challenge of group-wise fairness. We primarily consider the standard setting of fine-tuning a pre-trained model, where, following recent work \citep{gururangan2020don, dery2023aang}, we multitask the end task with the pre-training objective constructed from the end task data itself. In settings with few or no group annotations, we find that multitasking often, but not consistently, achieves better worst-group accuracy than Just-Train-Twice (JTT; \citet{pmlr-v139-liu21f}) -- a representative distributionally robust optimization (DRO) method. Leveraging insights from synthetic data experiments, we propose to modify standard MTL by regularizing the joint multitask representation space. We run a large number of fine-tuning experiments across computer vision and natural language processing datasets and find that our regularized MTL approach \emph{consistently} outperforms JTT on both average and worst-group outcomes. Our official code can be found here: \href{https://github.com/atharvajk98/MTL-group-robustness.git}{\url{https://github.com/atharvajk98/MTL-group-robustness}}.  ( 3 min )
    SMORE: Score Models for Offline Goal-Conditioned Reinforcement Learning
    arXiv:2311.02013v2 Announce Type: replace Abstract: Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.  ( 3 min )
    Differentially Private Synthetic Data via Foundation Model APIs 1: Images
    arXiv:2305.15560v2 Announce Type: replace-cross Abstract: Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID <= 7.9 with privacy cost {\epsilon} = 0.67, significantly improving the previous SOTA from {\epsilon} = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. The code and data are released at https://github.com/microsoft/DPSDA.  ( 3 min )
    Poincar\'e Differential Privacy for Hierarchy-Aware Graph Embedding
    arXiv:2312.12183v3 Announce Type: replace Abstract: Hierarchy is an important and commonly observed topological property in real-world graphs that indicate the relationships between supervisors and subordinates or the organizational behavior of human groups. As hierarchy is introduced as a new inductive bias into the Graph Neural Networks (GNNs) in various tasks, it implies latent topological relations for attackers to improve their inference attack performance, leading to serious privacy leakage issues. In addition, existing privacy-preserving frameworks suffer from reduced protection ability in hierarchical propagation due to the deficiency of adaptive upper-bound estimation of the hierarchical perturbation boundary. It is of great urgency to effectively leverage the hierarchical property of data while satisfying privacy guarantees. To solve the problem, we propose the Poincar\'e Differential Privacy framework, named PoinDP, to protect the hierarchy-aware graph embedding based on hyperbolic geometry. Specifically, PoinDP first learns the hierarchy weights for each entity based on the Poincar\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware Sensitivity is designed to measure the sensitivity of the hierarchical structure and adaptively allocate the privacy protection strength. Besides, the Hyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian mechanism in Euclidean space to hyperbolic space to realize random perturbations that satisfy differential privacy under the hyperbolic space metric. Extensive experiment results on five real-world datasets demonstrate the proposed PoinDP's advantages of effective privacy protection while maintaining good performance on the node classification task.  ( 3 min )
    Tight Verification of Probabilistic Robustness in Bayesian Neural Networks
    arXiv:2401.11627v2 Announce Type: replace Abstract: We introduce two algorithms for computing tight guarantees on the probabilistic robustness of Bayesian Neural Networks (BNNs). Computing robustness guarantees for BNNs is a significantly more challenging task than verifying the robustness of standard Neural Networks (NNs) because it requires searching the parameters' space for safe weights. Moreover, tight and complete approaches for the verification of standard NNs, such as those based on Mixed-Integer Linear Programming (MILP), cannot be directly used for the verification of BNNs because of the polynomial terms resulting from the consecutive multiplication of variables encoding the weights. Our algorithms efficiently and effectively search the parameters' space for safe weights by using iterative expansion and the network's gradient and can be used with any verification algorithm of choice for BNNs. In addition to proving that our algorithms compute tighter bounds than the SoA, we also evaluate our algorithms against the SoA on standard benchmarks, such as MNIST and CIFAR10, showing that our algorithms compute bounds up to 40% tighter than the SoA.  ( 2 min )
    Self-Initiated Open World Learning for Autonomous AI Agents
    arXiv:2110.11385v3 Announce Type: replace-cross Abstract: As more and more AI agents are used in practice, it is time to think about how to make these agents fully autonomous so that they can learn by themselves in a self-motivated and self-supervised manner rather than being retrained periodically on the initiation of human engineers using expanded training data. As the real-world is an open environment with unknowns or novelties, detecting novelties or unknowns, characterizing them, accommodating or adapting to them, gathering ground-truth training data, and incrementally learning the unknowns/novelties are critical to making the agent more and more knowledgeable and powerful over time. The key challenge is how to automate the process so that it is carried out on the agent's own initiative and through its own interactions with humans and the environment. Since an AI agent usually has a performance task, characterizing each novelty becomes critical and necessary so that the agent can formulate an appropriate response to adapt its behavior to accommodate the novelty and to learn from it to improve the agent's adaptation capability and task performance. The process goes continually without termination. This paper proposes a theoretic framework for this learning paradigm to promote the research of building Self-initiated Open world Learning (SOL) agents. An example SOL agent is also described.  ( 3 min )
    Molecular Identification and Peak Assignment: Leveraging Multi-Level Multimodal Alignment on NMR
    arXiv:2311.13817v2 Announce Type: replace Abstract: Nuclear magnetic resonance (NMR) spectroscopy plays an essential role in deciphering molecular structure and dynamic behaviors. While AI-enhanced NMR prediction models hold promise, challenges still persist in tasks such as molecular retrieval, isomer recognition, and peak assignment. In response, this paper introduces a novel solution, Multi-Level Multimodal Alignment with Knowledge-Guided Instance-Wise Discrimination (K-M3AID), which establishes correspondences between two heterogeneous modalities: molecular graphs and NMR spectra. K-M3AID employs a dual-coordinated contrastive learning architecture with three key modules: a graph-level alignment module, a node-level alignment module, and a communication channel. Notably, K-M3AID introduces knowledge-guided instance-wise discrimination into contrastive learning within the node-level alignment module. In addition, K-M3AID demonstrates that skills acquired during node-level alignment have a positive impact on graph-level alignment, acknowledging meta-learning as an inherent property. Empirical validation underscores K-M3AID's effectiveness in multiple zero-shot tasks.  ( 2 min )
    Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks
    arXiv:2311.14658v2 Announce Type: replace Abstract: Enforcing orthonormal or isometric property for the weight matrices has been shown to enhance the training of deep neural networks by mitigating gradient exploding/vanishing and increasing the robustness of the learned networks. However, despite its practical performance, the theoretical analysis of orthonormality in neural networks is still lacking; for example, how orthonormality affects the convergence of the training process. In this letter, we aim to bridge this gap by providing convergence analysis for training orthonormal deep linear neural networks. Specifically, we show that Riemannian gradient descent with an appropriate initialization converges at a linear rate for training orthonormal deep linear neural networks with a class of loss functions. Unlike existing works that enforce orthonormal weight matrices for all the layers, our approach excludes this requirement for one layer, which is crucial to establish the convergence guarantee. Our results shed light on how increasing the number of hidden layers can impact the convergence speed. Experimental results validate our theoretical analysis.  ( 2 min )
    Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints
    arXiv:2311.08675v2 Announce Type: replace Abstract: Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms. It strives to identify a small subset from large-scale data, so that training only on the subset practically performs on par with full data. Practitioners regularly desire to identify the smallest possible coreset in realistic scenes while maintaining comparable model performance, to minimize costs and maximize acceleration. Motivated by this desideratum, for the first time, we pose the problem of refined coreset selection, in which the minimal coreset size under model performance constraints is explored. Moreover, to address this problem, we propose an innovative method, which maintains optimization priority order over the model performance and coreset size, and efficiently optimizes them in the coreset selection procedure. Theoretically, we provide the convergence guarantee of the proposed method. Empirically, extensive experiments confirm its superiority compared with previous strategies, often yielding better model performance with smaller coreset sizes.  ( 2 min )
    Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks
    arXiv:2402.18994v1 Announce Type: cross Abstract: As the role of artificial intelligence becomes increasingly pivotal in modern society, the efficient training and deployment of deep neural networks have emerged as critical areas of focus. Recent advancements in attention-based large neural architectures have spurred the development of AI accelerators, facilitating the training of extensive, multi-billion parameter models. Despite their effectiveness, these powerful networks often incur high execution costs in production environments. Neuromorphic computing, inspired by biological neural processes, offers a promising alternative. By utilizing temporally-sparse computations, Spiking Neural Networks (SNNs) offer to enhance energy efficiency through a reduced and low-power hardware footprint. However, the training of SNNs can be challenging due to their recurrent nature which cannot as easily leverage the massive parallelism of modern AI accelerators. To facilitate the investigation of SNN architectures and dynamics researchers have sought to bridge Python-based deep learning frameworks such as PyTorch or TensorFlow with custom-implemented compute kernels. This paper introduces Spyx, a new and lightweight SNN simulation and optimization library designed in JAX. By pre-staging data in the expansive vRAM of contemporary accelerators and employing extensive JIT compilation, Spyx allows for SNN optimization to be executed as a unified, low-level program on NVIDIA GPUs or Google TPUs. This approach achieves optimal hardware utilization, surpassing the performance of many existing SNN training frameworks while maintaining considerable flexibility.  ( 2 min )
    Learning with Language-Guided State Abstractions
    arXiv:2402.18759v1 Announce Type: cross Abstract: We describe a framework for using natural language to design state abstractions for imitation learning. Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations, which can surface important features of an environment and hide irrelevant ones. These state representations are typically manually specified, or derived from other labor-intensive labeling procedures. Our method, LGA (language-guided abstraction), uses a combination of natural language supervision and background knowledge from language models (LMs) to automatically build state representations tailored to unseen tasks. In LGA, a user first provides a (possibly incomplete) description of a target task in natural language; next, a pre-trained LM translates this task description into a state abstraction function that masks out irrelevant features; finally, an imitation policy is trained using a small number of demonstrations and LGA-generated abstract states. Experiments on simulated robotic tasks show that LGA yields state abstractions similar to those designed by humans, but in a fraction of the time, and that these abstractions improve generalization and robustness in the presence of spurious correlations and ambiguous specifications. We illustrate the utility of the learned abstractions on mobile manipulation tasks with a Spot robot.  ( 2 min )
    Humanoid Locomotion as Next Token Prediction
    arXiv:2402.19469v1 Announce Type: cross Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.  ( 2 min )
    ELA: Exploited Level Augmentation for Offline Learning in Zero-Sum Games
    arXiv:2402.18617v1 Announce Type: cross Abstract: Offline learning has become widely used due to its ability to derive effective policies from offline datasets gathered by expert demonstrators without interacting with the environment directly. Recent research has explored various ways to enhance offline learning efficiency by considering the characteristics (e.g., expertise level or multiple demonstrators) of the dataset. However, a different approach is necessary in the context of zero-sum games, where outcomes vary significantly based on the strategy of the opponent. In this study, we introduce a novel approach that uses unsupervised learning techniques to estimate the exploited level of each trajectory from the offline dataset of zero-sum games made by diverse demonstrators. Subsequently, we incorporate the estimated exploited level into the offline learning to maximize the influence of the dominant strategy. Our method enables interpretable exploited level estimation in multiple zero-sum games and effectively identifies dominant strategy data. Also, our exploited level augmented offline learning significantly enhances the original offline learning algorithms including imitation learning and offline reinforcement learning for zero-sum games.  ( 2 min )
    Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Match Human Crowd Accuracy
    arXiv:2402.19379v1 Announce Type: cross Abstract: Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is statistically equivalent to the human crowd. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety applications throughout society.  ( 3 min )
    RL-GPT: Integrating Reinforcement Learning and Code-as-policy
    arXiv:2402.19299v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated proficiency in utilizing various tools by coding, yet they face limitations in handling intricate logic and precise control. In embodied tasks, high-level planning is amenable to direct coding, while low-level actions often necessitate task-specific refinement, such as Reinforcement Learning (RL). To seamlessly integrate both modalities, we introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent. The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks. This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline. Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves SOTA performance across all designated MineDojo tasks.  ( 2 min )
    FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
    arXiv:2402.18789v1 Announce Type: cross Abstract: Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.  ( 3 min )
    Adaptive Federated Minimax Optimization with Lower Complexities
    arXiv:2211.07303v4 Announce Type: replace Abstract: Federated learning is a popular distributed and privacy-preserving learning paradigm in machine learning. Recently, some federated learning algorithms have been proposed to solve the distributed minimax problems. However, these federated minimax algorithms still suffer from high gradient or communication complexity. Meanwhile, few algorithm focuses on using adaptive learning rate to accelerate these algorithms. To fill this gap, in the paper, we study a class of nonconvex minimax optimization, and propose an efficient adaptive federated minimax optimization algorithm (i.e., AdaFGDA) to solve these distributed minimax problems. Specifically, our AdaFGDA builds on the momentum-based variance reduced and local-SGD techniques, and it can flexibly incorporate various adaptive learning rates by using the unified adaptive matrices. Theoretically, we provide a solid convergence analysis framework for our AdaFGDA algorithm under non-i.i.d. setting. Moreover, we prove our AdaFGDA algorithm obtains a lower gradient (i.e., stochastic first-order oracle, SFO) complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding $\epsilon$-stationary point of the nonconvex minimax problems. Experimentally, we conduct some experiments on the deep AUC maximization and robust neural network training tasks to verify efficiency of our algorithms.  ( 2 min )
    The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?
    arXiv:2402.19475v1 Announce Type: cross Abstract: While language models are increasingly more proficient at code generation, they still frequently generate incorrect programs. Many of these programs are obviously wrong, but others are more subtle and pass weaker correctness checks such as being able to compile. In this work, we focus on these counterfeit samples: programs sampled from a language model that 1) have a high enough log-probability to be generated at a moderate temperature and 2) pass weak correctness checks. Overall, we discover that most models have a very shallow understanding of counterfeits through three clear failure modes. First, models mistakenly classify them as correct. Second, models are worse at reasoning about the execution behaviour of counterfeits and often predict their execution results as if they were correct. Third, when asking models to fix counterfeits, the likelihood of a model successfully repairing a counterfeit is often even lower than that of sampling a correct program from scratch. Counterfeits also have very unexpected properties: first, counterfeit programs for problems that are easier for a model to solve are not necessarily easier to detect and only slightly easier to execute and repair. Second, counterfeits from a given model are just as confusing to the model itself as they are to other models. Finally, both strong and weak models are able to generate counterfeit samples that equally challenge all models. In light of our findings, we recommend that care and caution be taken when relying on models to understand their own samples, especially when no external feedback is incorporated.  ( 3 min )
    Human Choice Prediction in Language-based Persuasion Games: Simulation-based Off-Policy Evaluation
    arXiv:2305.10361v4 Announce Type: replace Abstract: Recent advances in Large Language Models (LLMs) have spurred interest in designing LLM-based agents for tasks that involve interaction with human and artificial agents. This paper addresses a key aspect in the design of such agents: Predicting human decision in off-policy evaluation (OPE), focusing on language-based persuasion games, where the agent's goal is to influence its partner's decisions through verbal messages. Using a dedicated application, we collected a dataset of 87K decisions from humans playing a repeated decision-making game with artificial agents. Our approach involves training a model on human interactions with one agents subset to predict decisions when interacting with another. To enhance off-policy performance, we propose a simulation technique involving interactions across the entire agent space and simulated decision makers. Our learning strategy yields significant OPE gains, e.g., improving prediction accuracy in the top 15% challenging cases by 7.1%. Our code and the large dataset we collected and generated are submitted as supplementary material and publicly available in our GitHub repository: https://github.com/eilamshapira/HumanChoicePrediction  ( 2 min )
    Listening to the Noise: Blind Denoising with Gibbs Diffusion
    arXiv:2402.19455v1 Announce Type: cross Abstract: In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincide with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.  ( 2 min )
    Training-set-free two-stage deep learning for Spectroscopic data de-noising
    arXiv:2402.18830v1 Announce Type: cross Abstract: De-noising is a prominent step in the spectra post-processing procedure. Previous machine learning-based methods are fast but mostly based on supervised learning and require a training set that may be typically expensive in real experimental measurements. Unsupervised learning-based algorithms are slow and require many iterations to achieve convergence. Here, we bridge this gap by proposing a training-set-free two-stage deep learning method. We show that the fuzzy fixed input in previous methods can be improved by introducing an adaptive prior. Combined with more advanced optimization techniques, our approach can achieve five times acceleration compared to previous work. Theoretically, we study the landscape of a corresponding non-convex linear problem, and our results indicates that this problem has benign geometry for first-order algorithms to converge.  ( 2 min )
    Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation
    arXiv:2402.18919v1 Announce Type: cross Abstract: While standard Empirical Risk Minimization (ERM) training is proven effective for image classification on in-distribution data, it fails to perform well on out-of-distribution samples. One of the main sources of distribution shift for image classification is the compositional nature of images. Specifically, in addition to the main object or component(s) determining the label, some other image components usually exist, which may lead to the shift of input distribution between train and test environments. More importantly, these components may have spurious correlations with the label. To address this issue, we propose Decompose-and-Compose (DaC), which improves robustness to correlation shift by a compositional approach based on combining elements of images. Based on our observations, models trained with ERM usually highly attend to either the causal components or the components having a high spurious correlation with the label (especially in datapoints on which models have a high confidence). In fact, according to the amount of spurious correlation and the easiness of classification based on the causal or non-causal components, the model usually attends to one of these more (on samples with high confidence). Following this, we first try to identify the causal components of images using class activation maps of models trained with ERM. Afterward, we intervene on images by combining them and retraining the model on the augmented data, including the counterfactual ones. Along with its high interpretability, this work proposes a group-balancing method by intervening on images without requiring group labels or information regarding the spurious features during training. The method has an overall better worst group accuracy compared to previous methods with the same amount of supervision on the group labels in correlation shift.  ( 3 min )
    An AI based Digital Score of Tumour-Immune Microenvironment Predicts Benefit to Maintenance Immunotherapy in Advanced Oesophagogastric Adenocarcinoma
    arXiv:2402.19296v1 Announce Type: cross Abstract: Gastric and oesophageal (OG) cancers are the leading causes of cancer mortality worldwide. In OG cancers, recent studies have showed that PDL1 immune checkpoint inhibitors (ICI) in combination with chemotherapy improves patient survival. However, our understanding of the tumour immune microenvironment in OG cancers remains limited. In this study, we interrogate multiplex immunofluorescence (mIF) images taken from patients with advanced Oesophagogastric Adenocarcinoma (OGA) who received first-line fluoropyrimidine and platinum-based chemotherapy in the PLATFORM trial (NCT02678182) to predict the efficacy of the treatment and to explore the biological basis of patients responding to maintenance durvalumab (PDL1 inhibitor). Our proposed Artificial Intelligence (AI) based marker successfully identified responder from non-responder (p < 0.05) as well as those who could potentially benefit from ICI with statistical significance (p < 0.05) for both progression free and overall survival. Our findings suggest that T cells that express FOXP3 seem to heavily influence the patient treatment response and survival outcome. We also observed that higher levels of CD8+PD1+ cells are consistently linked to poor prognosis for both OS and PFS, regardless of ICI.  ( 2 min )
    A Deep-Learning Technique to Locate Cryptographic Operations in Side-Channel Traces
    arXiv:2402.19037v1 Announce Type: cross Abstract: Side-channel attacks allow extracting secret information from the execution of cryptographic primitives by correlating the partially known computed data and the measured side-channel signal. However, to set up a successful side-channel attack, the attacker has to perform i) the challenging task of locating the time instant in which the target cryptographic primitive is executed inside a side-channel trace and then ii)the time-alignment of the measured data on that time instant. This paper presents a novel deep-learning technique to locate the time instant in which the target computed cryptographic operations are executed in the side-channel trace. In contrast to state-of-the-art solutions, the proposed methodology works even in the presence of trace deformations obtained through random delay insertion techniques. We validated our proposal through a successful attack against a variety of unprotected and protected cryptographic primitives that have been executed on an FPGA-implemented system-on-chip featuring a RISC-V CPU.  ( 2 min )
    Accelerating Computer Architecture Simulation through Machine Learning
    arXiv:2402.18746v1 Announce Type: cross Abstract: This paper presents our approach to accelerate computer architecture simulation by leveraging machine learning techniques. Traditional computer architecture simulations are time-consuming, making it challenging to explore different design choices efficiently. Our proposed model utilizes a combination of application features and micro-architectural features to predict the performance of an application. These features are derived from simulations of a small portion of the application. We demonstrate the effectiveness of our approach by building and evaluating a machine learning model that offers significant speedup in architectural exploration. This model demonstrates the ability to predict IPC values for the testing data with a root mean square error of less than 0.1.  ( 2 min )
    Conjectural Online Learning with First-order Beliefs in Asymmetric Information Stochastic Games
    arXiv:2402.18781v1 Announce Type: cross Abstract: Stochastic games arise in many complex socio-technical systems, such as cyber-physical systems and IT infrastructures, where information asymmetry presents challenges for decision-making entities (players). Existing computational methods for asymmetric information stochastic games (AISG) are primarily offline, targeting special classes of AISGs to avoid belief hierarchies, and lack online adaptability to deviations from equilibrium. To address this limitation, we propose a conjectural online learning (COL), a learning scheme for generic AISGs. COL, structured as a forecaster-actor-critic (FAC) architecture, utilizes first-order beliefs over the hidden states and subjective forecasts of the opponent's strategies. Against the conjectured opponent, COL updates strategies in an actor-critic approach using online rollout and calibrates conjectures through Bayesian learning. We prove that conjecture in COL is asymptotically consistent with the information feedback in the sense of a relaxed Bayesian consistency. The resulting empirical strategy profile converges to the Berk-Nash equilibrium, a solution concept characterizing rationality under subjectivity. Experimental results from an intrusion response use case demonstrate COL's superiority over state-of-the-art reinforcement learning methods against nonstationary attacks.  ( 2 min )
    Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification
    arXiv:2402.19355v1 Announce Type: cross Abstract: Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectures. Additionally, we introduce a method for identifying the victim model on which the adversarial attack is carried out. To achieve this, we generate a new dataset containing multiple attacks performed against various victim models. We achieve an AUC of 0.982 for attack detection, with no more than a 0.03 drop in performance for unknown attacks. Our attack classification accuracy (excluding benign) reaches 86.48% across eight attack types using our LightResNet34 architecture, while our victim model classification accuracy reaches 72.28% across four victim models.  ( 2 min )
    Effective Two-Stage Knowledge Transfer for Multi-Entity Cross-Domain Recommendation
    arXiv:2402.19101v1 Announce Type: cross Abstract: In recent years, the recommendation content on e-commerce platforms has become increasingly rich -- a single user feed may contain multiple entities, such as selling products, short videos, and content posts. To deal with the multi-entity recommendation problem, an intuitive solution is to adopt the shared-network-based architecture for joint training. The idea is to transfer the extracted knowledge from one type of entity (source entity) to another (target entity). However, different from the conventional same-entity cross-domain recommendation, multi-entity knowledge transfer encounters several important issues: (1) data distributions of the source entity and target entity are naturally different, making the shared-network-based joint training susceptible to the negative transfer issue, (2) more importantly, the corresponding feature schema of each entity is not exactly aligned (e.g., price is an essential feature for selling product while missing for content posts), making the existing methods no longer appropriate. Recent researchers have also experimented with the pre-training and fine-tuning paradigm. Again, they only consider the scenarios with the same entity type and feature systems, which is inappropriate in our case. To this end, we design a pre-training & fine-tuning based Multi-entity Knowledge Transfer framework called MKT. MKT utilizes a multi-entity pre-training module to extract transferable knowledge across different entities. In particular, a feature alignment module is first applied to scale and align different feature schemas. Afterward, a couple of knowledge extractors are employed to extract the common and entity-specific knowledge. In the end, the extracted common knowledge is adopted for target entity model training. Through extensive offline and online experiments, we demonstrated the superiority of MKT over multiple State-Of-The-Art methods.  ( 3 min )
    Attacks Against Mobility Prediction in 5G Networks
    arXiv:2402.19319v1 Announce Type: cross Abstract: The $5^{th}$ generation of mobile networks introduces a new Network Function (NF) that was not present in previous generations, namely the Network Data Analytics Function (NWDAF). Its primary objective is to provide advanced analytics services to various entities within the network and also towards external application services in the 5G ecosystem. One of the key use cases of NWDAF is mobility trajectory prediction, which aims to accurately support efficient mobility management of User Equipment (UE) in the network by allocating ``just in time'' necessary network resources. In this paper, we show that there are potential mobility attacks that can compromise the accuracy of these predictions. In a semi-realistic scenario with 10,000 subscribers, we demonstrate that an adversary equipped with the ability to hijack cellular mobile devices and clone them can significantly reduce the prediction accuracy from 75\% to 40\% using just 100 adversarial UEs. While a defense mechanism largely depends on the attack and the mobility types in a particular area, we prove that a basic KMeans clustering is effective in distinguishing legitimate and adversarial UEs.  ( 2 min )
    Adaptive Testing Environment Generation for Connected and Automated Vehicles with Dense Reinforcement Learning
    arXiv:2402.19275v1 Announce Type: cross Abstract: The assessment of safety performance plays a pivotal role in the development and deployment of connected and automated vehicles (CAVs). A common approach involves designing testing scenarios based on prior knowledge of CAVs (e.g., surrogate models), conducting tests in these scenarios, and subsequently evaluating CAVs' safety performances. However, substantial differences between CAVs and the prior knowledge can significantly diminish the evaluation efficiency. In response to this issue, existing studies predominantly concentrate on the adaptive design of testing scenarios during the CAV testing process. Yet, these methods have limitations in their applicability to high-dimensional scenarios. To overcome this challenge, we develop an adaptive testing environment that bolsters evaluation robustness by incorporating multiple surrogate models and optimizing the combination coefficients of these surrogate models to enhance evaluation efficiency. We formulate the optimization problem as a regression task utilizing quadratic programming. To efficiently obtain the regression target via reinforcement learning, we propose the dense reinforcement learning method and devise a new adaptive policy with high sample efficiency. Essentially, our approach centers on learning the values of critical scenes displaying substantial surrogate-to-real gaps. The effectiveness of our method is validated in high-dimensional overtaking scenarios, demonstrating that our approach achieves notable evaluation efficiency.  ( 2 min )
    Graph Convolutional Neural Networks for Automated Echocardiography View Recognition: A Holistic Approach
    arXiv:2402.19062v1 Announce Type: cross Abstract: To facilitate diagnosis on cardiac ultrasound (US), clinical practice has established several standard views of the heart, which serve as reference points for diagnostic measurements and define viewports from which images are acquired. Automatic view recognition involves grouping those images into classes of standard views. Although deep learning techniques have been successful in achieving this, they still struggle with fully verifying the suitability of an image for specific measurements due to factors like the correct location, pose, and potential occlusions of cardiac structures. Our approach goes beyond view classification and incorporates a 3D mesh reconstruction of the heart that enables several more downstream tasks, like segmentation and pose estimation. In this work, we explore learning 3D heart meshes via graph convolutions, using similar techniques to learn 3D meshes in natural images, such as human pose estimation. As the availability of fully annotated 3D images is limited, we generate synthetic US images from 3D meshes by training an adversarial denoising diffusion model. Experiments were conducted on synthetic and clinical cases for view recognition and structure detection. The approach yielded good performance on synthetic images and, despite being exclusively trained on synthetic data, it already showed potential when applied to clinical images. With this proof-of-concept, we aim to demonstrate the benefits of graphs to improve cardiac view recognition that can ultimately lead to better efficiency in cardiac diagnosis.  ( 3 min )
    BP-DeepONet: A new method for cuffless blood pressure estimation using the physcis-informed DeepONet
    arXiv:2402.18886v1 Announce Type: new Abstract: Cardiovascular diseases (CVDs) are the leading cause of death worldwide, with blood pressure serving as a crucial indicator. Arterial blood pressure (ABP) waveforms provide continuous pressure measurements throughout the cardiac cycle and offer valuable diagnostic insights. Consequently, there is a significant demand for non-invasive and cuff-less methods to measure ABP waveforms continuously. Accurate prediction of ABP waveforms can also improve the estimation of mean blood pressure, an essential cardiovascular health characteristic. This study proposes a novel framework based on the physics-informed DeepONet approach to predict ABP waveforms. Unlike previous methods, our approach requires the predicted ABP waveforms to satisfy the Navier-Stokes equation with a time-periodic condition and a Windkessel boundary condition. Notably, our framework is the first to predict ABP waveforms continuously, both with location and time, within the part of the artery that is being simulated. Furthermore, our method only requires ground truth data at the outlet boundary and can handle periodic conditions with varying periods. Incorporating the Windkessel boundary condition in our solution allows for generating natural physical reflection waves, which closely resemble measurements observed in real-world cases. Moreover, accurately estimating the hyper-parameters in the Navier-Stokes equation for our simulations poses a significant challenge. To overcome this obstacle, we introduce the concept of meta-learning, enabling the neural networks to learn these parameters during the training process.  ( 3 min )
    Masks, Signs, And Learning Rate Rewinding
    arXiv:2402.19262v1 Announce Type: new Abstract: Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP) to find lottery tickets in deep overparameterized neural networks. While both iterative pruning schemes couple structure and parameter learning, understanding how LRR excels in both aspects can bring us closer to the design of more flexible deep learning algorithms that can optimize diverse sets of sparse architectures. To this end, we conduct experiments that disentangle the effect of mask learning and parameter optimization and how both benefit from overparameterization. The ability of LRR to flip parameter signs early and stay robust to sign perturbations seems to make it not only more effective in mask identification but also in optimizing diverse sets of masks, including random ones. In support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP, as it can escape initially problematic sign configurations.  ( 2 min )
    Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction
    arXiv:2402.19197v1 Announce Type: cross Abstract: Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structured-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized in the training process to improve results. Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS.  ( 2 min )
    A Priori Uncertainty Quantification of Reacting Turbulence Closure Models using Bayesian Neural Networks
    arXiv:2402.18729v1 Announce Type: cross Abstract: While many physics-based closure model forms have been posited for the sub-filter scale (SFS) in large eddy simulation (LES), vast amounts of data available from direct numerical simulation (DNS) create opportunities to leverage data-driven modeling techniques. Albeit flexible, data-driven models still depend on the dataset and the functional form of the model chosen. Increased adoption of such models requires reliable uncertainty estimates both in the data-informed and out-of-distribution regimes. In this work, we employ Bayesian neural networks (BNNs) to capture both epistemic and aleatoric uncertainties in a reacting flow model. In particular, we model the filtered progress variable scalar dissipation rate which plays a key role in the dynamics of turbulent premixed flames. We demonstrate that BNN models can provide unique insights about the structure of uncertainty of the data-driven closure models. We also propose a method for the incorporation of out-of-distribution information in a BNN. The efficacy of the model is demonstrated by a priori evaluation on a dataset consisting of a variety of flame conditions and fuels.  ( 2 min )
    Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming
    arXiv:2402.18866v1 Announce Type: new Abstract: Model-based reinforcement learning (MBRL) has been a primary approach to ameliorating the sample efficiency issue as well as to make a generalist agent. However, there has not been much effort toward enhancing the strategy of dreaming itself. Therefore, it is a question whether and how an agent can "dream better" in a more structured and strategic way. In this paper, inspired by the observation from cognitive science suggesting that humans use a spatial divide-and-conquer strategy in planning, we propose a new MBRL agent, called Dr. Strategy, which is equipped with a novel Dreaming Strategy. The proposed agent realizes a version of divide-and-conquer-like strategy in dreaming. This is achieved by learning a set of latent landmarks and then utilizing these to learn a landmark-conditioned highway policy. With the highway policy, the agent can first learn in the dream to move to a landmark, and from there it tackles the exploration and achievement task in a more focused way. In experiments, we show that the proposed model outperforms prior pixel-based MBRL methods in various visually complex and partially observable navigation tasks. The source code will be available at https://github.com/ahn-ml/drstrategy  ( 2 min )
    At the Dawn of Generative AI Era: A Tutorial-cum-Survey on New Frontiers in 6G Wireless Intelligence
    arXiv:2402.18587v1 Announce Type: cross Abstract: The majority of data-driven wireless research leans heavily on discriminative AI (DAI) that requires vast real-world datasets. Unlike the DAI, Generative AI (GenAI) pertains to generative models (GMs) capable of discerning the underlying data distribution, patterns, and features of the input data. This makes GenAI a crucial asset in wireless domain wherein real-world data is often scarce, incomplete, costly to acquire, and hard to model or comprehend. With these appealing attributes, GenAI can replace or supplement DAI methods in various capacities. Accordingly, this combined tutorial-survey paper commences with preliminaries of 6G and wireless intelligence by outlining candidate 6G applications and services, presenting a taxonomy of state-of-the-art DAI models, exemplifying prominent DAI use cases, and elucidating the multifaceted ways through which GenAI enhances DAI. Subsequently, we present a tutorial on GMs by spotlighting seminal examples such as generative adversarial networks, variational autoencoders, flow-based GMs, diffusion-based GMs, generative transformers, large language models, to name a few. Contrary to the prevailing belief that GenAI is a nascent trend, our exhaustive review of approximately 120 technical papers demonstrates the scope of research across core wireless research areas, including physical layer design; network optimization, organization, and management; network traffic analytics; cross-layer network security; and localization & positioning. Furthermore, we outline the central role of GMs in pioneering areas of 6G network research, including semantic/THz/near-field communications, ISAC, extremely large antenna arrays, digital twins, AI-generated content services, mobile edge computing and edge AI, adversarial ML, and trustworthy AI. Lastly, we shed light on the multifarious challenges ahead, suggesting potential strategies and promising remedies.  ( 3 min )
    SPriFed-OMP: A Differentially Private Federated Learning Algorithm for Sparse Basis Recovery
    arXiv:2402.19016v1 Announce Type: new Abstract: Sparse basis recovery is a classical and important statistical learning problem when the number of model dimensions $p$ is much larger than the number of samples $n$. However, there has been little work that studies sparse basis recovery in the Federated Learning (FL) setting, where the client data's differential privacy (DP) must also be simultaneously protected. In particular, the performance guarantees of existing DP-FL algorithms (such as DP-SGD) will degrade significantly when $p \gg n$, and thus, they will fail to learn the true underlying sparse model accurately. In this work, we develop a new differentially private sparse basis recovery algorithm for the FL setting, called SPriFed-OMP. SPriFed-OMP converts OMP (Orthogonal Matching Pursuit) to the FL setting. Further, it combines SMPC (secure multi-party computation) and DP to ensure that only a small amount of noise needs to be added in order to achieve differential privacy. As a result, SPriFed-OMP can efficiently recover the true sparse basis for a linear model with only $n = O(\sqrt{p})$ samples. We further present an enhanced version of our approach, SPriFed-OMP-GRAD based on gradient privatization, that improves the performance of SPriFed-OMP. Our theoretical analysis and empirical results demonstrate that both SPriFed-OMP and SPriFed-OMP-GRAD terminate in a small number of steps, and they significantly outperform the previous state-of-the-art DP-FL solutions in terms of the accuracy-privacy trade-off.  ( 3 min )
    Graph Generation via Spectral Diffusion
    arXiv:2402.18974v1 Announce Type: new Abstract: In this paper, we present GRASP, a novel graph generative model based on 1) the spectral decomposition of the graph Laplacian matrix and 2) a diffusion process. Specifically, we propose to use a denoising model to sample eigenvectors and eigenvalues from which we can reconstruct the graph Laplacian and adjacency matrix. Our permutation invariant model can also handle node features by concatenating them to the eigenvectors of each node. Using the Laplacian spectrum allows us to naturally capture the structural characteristics of the graph and work directly in the node space while avoiding the quadratic complexity bottleneck that limits the applicability of other methods. This is achieved by truncating the spectrum, which as we show in our experiments results in a faster yet accurate generative process. An extensive set of experiments on both synthetic and real world graphs demonstrates the strengths of our model against state-of-the-art alternatives.  ( 2 min )
    Combination of Weak Learners eXplanations to Improve Random Forest eXplicability Robustness
    arXiv:2402.19025v1 Announce Type: new Abstract: The notion of robustness in XAI refers to the observed variations in the explanation of the prediction of a learned model with respect to changes in the input leading to that prediction. Intuitively, if the input being explained is modified slightly subtly enough so as to not change the prediction of the model too much, then we would expect that the explanation provided for that new input does not change much either. We argue that a combination through discriminative averaging of ensembles weak learners explanations can improve the robustness of explanations in ensemble methods.This approach has been implemented and tested with post-hoc SHAP method and Random Forest ensemble with successful results. The improvements obtained have been measured quantitatively and some insights into the explicability robustness in ensemble methods are presented.  ( 2 min )
    Best Arm Identification with Resource Constraints
    arXiv:2402.19090v1 Announce Type: new Abstract: Motivated by the cost heterogeneity in experimentation across different alternatives, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem. The agent aims to identify the best arm under resource constraints, where resources are consumed for each arm pull. We make two novel contributions. We design and analyze the Successive Halving with Resource Rationing algorithm (SH-RR). The SH-RR achieves a near-optimal non-asymptotic rate of convergence in terms of the probability of successively identifying an optimal arm. Interestingly, we identify a difference in convergence rates between the cases of deterministic and stochastic resource consumption.  ( 2 min )
    FlatNAS: optimizing Flatness in Neural Architecture Search for Out-of-Distribution Robustness
    arXiv:2402.19102v1 Announce Type: new Abstract: Neural Architecture Search (NAS) paves the way for the automatic definition of Neural Network (NN) architectures, attracting increasing research attention and offering solutions in various scenarios. This study introduces a novel NAS solution, called Flat Neural Architecture Search (FlatNAS), which explores the interplay between a novel figure of merit based on robustness to weight perturbations and single NN optimization with Sharpness-Aware Minimization (SAM). FlatNAS is the first work in the literature to systematically explore flat regions in the loss landscape of NNs in a NAS procedure, while jointly optimizing their performance on in-distribution data, their out-of-distribution (OOD) robustness, and constraining the number of parameters in their architecture. Differently from current studies primarily concentrating on OOD algorithms, FlatNAS successfully evaluates the impact of NN architectures on OOD robustness, a crucial aspect in real-world applications of machine and deep learning. FlatNAS achieves a good trade-off between performance, OOD generalization, and the number of parameters, by using only in-distribution data in the NAS exploration. The OOD robustness of the NAS-designed models is evaluated by focusing on robustness to input data corruptions, using popular benchmark datasets in the literature.  ( 2 min )
    Machine learning for modular multiplication
    arXiv:2402.19254v1 Announce Type: new Abstract: Motivated by cryptographic applications, we investigate two machine learning approaches to modular multiplication: namely circular regression and a sequence-to-sequence transformer model. The limited success of both methods demonstrated in our results gives evidence for the hardness of tasks involving modular multiplication upon which cryptosystems are based.  ( 2 min )
    Understanding random forests and overfitting: a visualization and simulation study
    arXiv:2402.18612v1 Announce Type: cross Abstract: Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real world case studies and (2) a simulation study. For the case studies, risk estimates were visualised using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true c-statistic and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 were simulated and RF models trained with minimum node size 2 or 20 using ranger package, resulting in 192 scenarios in total. The visualizations suggested that the model learned spikes of probability around events in the training set. A cluster of events created a bigger peak, isolated events local peaks. In the simulation study, median training c-statistics were between 0.97 and 1 unless there were 4 or 16 binary predictors with minimum node size 20. Median test c-statistics were higher with higher events per variable, higher minimum node size, and binary predictors. Median training slopes were always above 1, and were not correlated with median test slopes across scenarios (correlation -0.11). Median test slopes were higher with higher true c-statistic, higher minimum node size, and higher sample size. Random forests learn local probability peaks that often yield near perfect training c-statistics without strongly affecting c-statistics on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.  ( 3 min )
    Data Interpreter: An LLM Agent For Data Science
    arXiv:2402.18679v1 Announce Type: cross Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, expertise in optimization due to complex dependencies among various tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution designed to solve with code that emphasizes three pivotal techniques to augment problem-solving in data science: 1) dynamic planning with hierarchical graph structures for real-time data adaptability;2) tool integration dynamically to enhance code proficiency during execution, enriching the requisite expertise;3) logical inconsistency identification in feedback, and efficiency enhancement through experience recording. We evaluate the Data Interpreter on various data science and real-world tasks. Compared to open-source baselines, it demonstrated superior performance, exhibiting significant improvements in machine learning tasks, increasing from 0.86 to 0.95. Additionally, it showed a 26% increase in the MATH dataset and a remarkable 112% improvement in open-ended tasks. The solution will be released at https://github.com/geekan/MetaGPT.  ( 2 min )
    Stop Relying on No-Choice and Do not Repeat the Moves: Optimal, Efficient and Practical Algorithms for Assortment Optimization
    arXiv:2402.18917v1 Announce Type: new Abstract: We address the problem of active online assortment optimization problem with preference feedback, which is a framework for modeling user choices and subsetwise utility maximization. The framework is useful in various real-world applications including ad placement, online retail, recommender systems, fine-tuning language models, amongst many. The problem, although has been studied in the past, lacks an intuitive and practical solution approach with simultaneously efficient algorithm and optimal regret guarantee. E.g., popularly used assortment selection algorithms often require the presence of a `strong reference' which is always included in the choice sets, further they are also designed to offer the same assortments repeatedly until the reference item gets selected -- all such requirements are quite unrealistic for practical applications. In this paper, we designed efficient algorithms for the problem of regret minimization in assortment selection with \emph{Plackett Luce} (PL) based user choices. We designed a novel concentration guarantee for estimating the score parameters of the PL model using `\emph{Pairwise Rank-Breaking}', which builds the foundation of our proposed algorithms. Moreover, our methods are practical, provably optimal, and devoid of the aforementioned limitations of the existing methods. Empirical evaluations corroborate our findings and outperform the existing baselines.  ( 3 min )
    Cross-Prediction-Powered Inference
    arXiv:2309.16598v3 Announce Type: replace-cross Abstract: While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.  ( 2 min )
    Exploring a new machine learning based probabilistic model for high-resolution indoor radon mapping, using the German indoor radon survey data
    arXiv:2310.11143v2 Announce Type: replace-cross Abstract: Radon is a carcinogenic, radioactive gas that can accumulate indoors. Therefore, accurate knowledge of indoor radon concentration is crucial for assessing radon-related health effects or identifying radon-prone areas. Indoor radon concentration at the national scale is usually estimated on the basis of extensive measurement campaigns. However, characteristics of the sample often differ from the characteristics of the population due to the large number of relevant factors that control the indoor radon concentration such as the availability of geogenic radon or floor level. Furthermore, the sample size usually does not allow estimation with high spatial resolution. We propose a model-based approach that allows a more realistic estimation of indoor radon distribution with a higher spatial resolution than a purely data-based approach. A two-stage modelling approach was applied: 1) a quantile regression forest using environmental and building data as predictors was applied to estimate the probability distribution function of indoor radon for each floor level of each residential building in Germany; (2) a probabilistic Monte Carlo sampling technique enabled the combination and population weighting of floor-level predictions. In this way, the uncertainty of the individual predictions is effectively propagated into the estimate of variability at the aggregated level. The results show an approximate lognormal distribution with an arithmetic mean of 63 Bq/m3, a geometric mean of 41 Bq/m3 and a 95 %ile of 180 Bq/m3. The exceedance probability for 100 Bq/m3 and 300 Bq/m3 are 12.5 % (10.5 million people) and 2.2 % (1.9 million people), respectively.  ( 3 min )
    Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
    arXiv:2402.19472v1 Announce Type: new Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this challenge, we also introduce an efficient evaluation framework: Sort \& Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples, enabling cost-effective lifelong benchmarking. Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (1000x reduction) on a single A100 GPU, with low approximation error. As such, lifelong benchmarks offer a robust, practical solution to the "benchmark exhaustion" problem.  ( 2 min )
    HemaGraph: Breaking Barriers in Hematologic Single Cell Classification with Graph Attention
    arXiv:2402.18611v1 Announce Type: cross Abstract: In the realm of hematologic cell populations classification, the intricate patterns within flow cytometry data necessitate advanced analytical tools. This paper presents 'HemaGraph', a novel framework based on Graph Attention Networks (GATs) for single-cell multi-class classification of hematological cells from flow cytometry data. Harnessing the power of GATs, our method captures subtle cell relationships, offering highly accurate patient profiling. Based on evaluation of data from 30 patients, HemaGraph demonstrates classification performance across five different cell classes, outperforming traditional methodologies and state-of-the-art methods. Moreover, the uniqueness of this framework lies in the training and testing phase of HemaGraph, where it has been applied for extremely large graphs, containing up to hundreds of thousands of nodes and two million edges, to detect low frequency cell populations (e.g. 0.01% for one population), with accuracies reaching 98%. Our findings underscore the potential of HemaGraph in improving hematoligic multi-class classification, paving the way for patient-personalized interventions. To the best of our knowledge, this is the first effort to use GATs, and Graph Neural Networks (GNNs) in general, to classify cell populations from single-cell flow cytometry data. We envision applying this method to single-cell data from larger cohort of patients and on other hematologic diseases.  ( 2 min )
    EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration
    arXiv:2402.18595v1 Announce Type: cross Abstract: Deep neural networks (DNNs) have achieved great breakthroughs in many fields such as image classification and natural language processing. However, the execution of DNNs needs to conduct massive numbers of multiply-accumulate (MAC) operations on hardware and thus incurs a large power consumption. To address this challenge, we propose a novel digital MAC design based on encoding. In this new design, the multipliers are replaced by simple logic gates to project the results onto a wide bit representation. These bits carry individual position weights, which can be trained for specific neural networks to enhance inference accuracy. The outputs of the new multipliers are added by bit-wise weighted accumulation and the accumulation results are compatible with existing computing platforms accelerating neural networks with either uniform or non-uniform quantization. Since the multiplication function is replaced by simple logic projection, the critical paths in the resulting circuits become much shorter. Correspondingly, pipelining stages in the MAC array can be reduced, leading to a significantly smaller area as well as a better power efficiency. The proposed design has been synthesized and verified by ResNet18-Cifar10, ResNet20-Cifar100 and ResNet50-ImageNet. The experimental results confirmed the reduction of circuit area by up to 79.63% and the reduction of power consumption of executing DNNs by up to 70.18%, while the accuracy of the neural networks can still be well maintained.  ( 2 min )
    Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
    arXiv:2306.00964v1 Announce Type: cross Abstract: Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model. gControlNet is capable of accepting flexible modality signals, encompassing the simultaneous reception of any combination of modality signals, or the supplementary fusion of multiple modality signals. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm. Furthermore, our advanced spatial guidance sampling methodology proficiently incorporates the control signal into the designated region, thereby circumventing the manifestation of undesired objects within the generated image. We demonstrate the results of our method in controlling various modalities, proving high-quality synthesis and fidelity to multiple external signals.  ( 2 min )
    Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks
    arXiv:2402.19460v1 Announce Type: new Abstract: Uncertainty quantification, once a singular task, has evolved into a spectrum of tasks, including abstained prediction, out-of-distribution detection, and aleatoric uncertainty quantification. The latest goal is disentanglement: the construction of multiple estimators that are each tailored to one and only one task. Hence, there is a plethora of recent advances with different intentions - that often entirely deviate from practical behavior. This paper conducts a comprehensive evaluation of numerous uncertainty estimators across diverse tasks on ImageNet. We find that, despite promising theoretical endeavors, disentanglement is not yet achieved in practice. Additionally, we reveal which uncertainty estimators excel at which specific tasks, providing insights for practitioners and guiding future research toward task-centric and disentangled uncertainty estimation methods. Our code is available at https://github.com/bmucsanyi/bud.  ( 2 min )
    CollaFuse: Navigating Limited Resources and Privacy in Collaborative Generative AI
    arXiv:2402.19105v1 Announce Type: new Abstract: In the landscape of generative artificial intelligence, diffusion-based models present challenges for socio-technical systems in data requirements and privacy. Traditional approaches like federated learning distribute the learning process but strain individual clients, especially with constrained resources (e.g., edge devices). In response to these challenges, we introduce CollaFuse, a novel framework inspired by split learning. Tailored for efficient and collaborative use of denoising diffusion probabilistic models, CollaFuse enables shared server training and inference, alleviating client computational burdens. This is achieved by retaining data and computationally inexpensive GPU processes locally at each client while outsourcing the computationally expensive processes to the shared server. Demonstrated in a healthcare context, CollaFuse enhances privacy by highly reducing the need for sensitive information sharing. These capabilities hold the potential to impact various application areas, such as the design of edge computing solutions, healthcare research, or autonomous driving. In essence, our work advances distributed machine learning, shaping the future of collaborative GenAI networks.  ( 2 min )
    Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask
    arXiv:2402.13429v1 Announce Type: cross Abstract: As the number of pre-trained machine learning (ML) models is growing exponentially, data reduction tools are not catching up. Existing data reduction techniques are not specifically designed for pre-trained model (PTM) dataset files. This is largely due to a lack of understanding of the patterns and characteristics of these datasets, especially those relevant to data reduction and compressibility. This paper presents the first, exhaustive analysis to date of PTM datasets on storage compressibility. Our analysis spans different types of data reduction and compression techniques, from hash-based data deduplication, data similarity detection, to dictionary-coding compression. Our analysis explores these techniques at three data granularity levels, from model layers, model chunks, to model parameters. We draw new observations that indicate that modern data reduction tools are not effective when handling PTM datasets. There is a pressing need for new compression methods that take into account PTMs' data characteristics for effective storage reduction. Motivated by our findings, we design ELF, a simple yet effective, error-bounded, lossy floating-point compression method. ELF transforms floating-point parameters in such a way that the common exponent field of the transformed parameters can be completely eliminated to save storage space. We develop Elves, a compression framework that integrates ELF along with several other data reduction methods. Elves uses the most effective method to compress PTMs that exhibit different patterns. Evaluation shows that Elves achieves an overall compression ratio of $1.52\times$, which is $1.31\times$, $1.32\times$ and $1.29\times$ higher than a general-purpose compressor (zstd), an error-bounded lossy compressor (SZ3), and the uniform model quantization, respectively, with negligible model accuracy loss.  ( 3 min )
    Verification of Neural Networks' Global Robustness
    arXiv:2402.19322v1 Announce Type: new Abstract: Neural networks are successful in various applications but are also susceptible to adversarial attacks. To show the safety of network classifiers, many verifiers have been introduced to reason about the local robustness of a given input to a given perturbation. While successful, local robustness cannot generalize to unseen inputs. Several works analyze global robustness properties, however, neither can provide a precise guarantee about the cases where a network classifier does not change its classification. In this work, we propose a new global robustness property for classifiers aiming at finding the minimal globally robust bound, which naturally extends the popular local robustness property for classifiers. We introduce VHAGaR, an anytime verifier for computing this bound. VHAGaR relies on three main ideas: encoding the problem as a mixed-integer programming and pruning the search space by identifying dependencies stemming from the perturbation or network computation and generalizing adversarial attacks to unknown inputs. We evaluate VHAGaR on several datasets and classifiers and show that, given a three hour timeout, the average gap between the lower and upper bound on the minimal globally robust bound computed by VHAGaR is 1.9, while the gap of an existing global robustness verifier is 154.7. Moreover, VHAGaR is 130.6x faster than this verifier. Our results further indicate that leveraging dependencies and adversarial attacks makes VHAGaR 78.6x faster.  ( 2 min )
    Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation
    arXiv:2112.01799v1 Announce Type: cross Abstract: The integration of Vector Quantised Variational AutoEncoder (VQ-VAE) with autoregressive models as generation part has yielded high-quality results on image generation. However, the autoregressive models will strictly follow the progressive scanning order during the sampling phase. This leads the existing VQ series models to hardly escape the trap of lacking global information. Denoising Diffusion Probabilistic Models (DDPM) in the continuous domain have shown a capability to capture the global context, while generating high-quality images. In the discrete state space, some works have demonstrated the potential to perform text generation and low resolution image generation. We show that with the help of a content-rich discrete visual codebook from VQ-VAE, the discrete diffusion model can also generate high fidelity images with global context, which compensates for the deficiency of the classical autoregressive model along pixel space. Meanwhile, the integration of the discrete VAE with the diffusion model resolves the drawback of conventional autoregressive models being oversized, and the diffusion model which demands excessive time in the sampling process when generating images. It is found that the quality of the generated images is heavily dependent on the discrete visual codebook. Extensive experiments demonstrate that the proposed Vector Quantised Discrete Diffusion Model (VQ-DDM) is able to achieve comparable performance to top-tier methods with low complexity. It also demonstrates outstanding advantages over other vectors quantised with autoregressive models in terms of image inpainting tasks without additional training.  ( 3 min )
    DiffuseRAW: End-to-End Generative RAW Image Processing for Low-Light Images
    arXiv:2402.18575v1 Announce Type: cross Abstract: Imaging under extremely low-light conditions presents a significant challenge and is an ill-posed problem due to the low signal-to-noise ratio (SNR) caused by minimal photon capture. Previously, diffusion models have been used for multiple kinds of generative tasks and image-to-image tasks, however, these models work as a post-processing step. These diffusion models are trained on processed images and learn on processed images. However, such approaches are often not well-suited for extremely low-light tasks. Unlike the task of low-light image enhancement or image-to-image enhancement, we tackle the task of learning the entire image-processing pipeline, from the RAW image to a processed image. For this task, a traditional image processing pipeline often consists of multiple specialized parts that are overly reliant on the downstream tasks. Unlike these, we develop a new generative ISP that relies on fine-tuning latent diffusion models on RAW images and generating processed long-exposure images which allows for the apt use of the priors from large text-to-image generation models. We evaluate our approach on popular end-to-end low-light datasets for which we see promising results and set a new SoTA on the See-in-Dark (SID) dataset. Furthermore, with this work, we hope to pave the way for more generative and diffusion-based image processing and other problems on RAW data.  ( 2 min )
    TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables
    arXiv:2402.19072v1 Announce Type: new Abstract: Recent studies have demonstrated remarkable performance in time series forecasting. However, due to the partially-observed nature of real-world applications, solely focusing on the target of interest, so-called endogenous variables, is usually insufficient to guarantee accurate forecasting. Notably, a system is often recorded into multiple variables, where the exogenous series can provide valuable external information for endogenous variables. Thus, unlike prior well-established multivariate or univariate forecasting that either treats all the variables equally or overlooks exogenous information, this paper focuses on a practical setting, which is time series forecasting with exogenous variables. We propose a novel framework, TimeXer, to utilize external information to enhance the forecasting of endogenous variables. With a deftly designed embedding layer, TimeXer empowers the canonical Transformer architecture with the ability to reconcile endogenous and exogenous information, where patch-wise self-attention and variate-wise cross-attention are employed. Moreover, a global endogenous variate token is adopted to effectively bridge the exogenous series into endogenous temporal patches. Experimentally, TimeXer significantly improves time series forecasting with exogenous variables and achieves consistent state-of-the-art performance in twelve real-world forecasting benchmarks.  ( 2 min )
    Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
    arXiv:2402.19449v1 Announce Type: new Abstract: Adam has been shown to outperform gradient descent in optimizing large language transformers empirically, and by a larger margin than on other tasks, but it is unclear why this happens. We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics. When training with gradient descent, the loss associated with infrequent words decreases slower than the loss associated with frequent ones. As most samples come from relatively infrequent words, the average loss decreases slowly with gradient descent. On the other hand, Adam and sign-based methods do not suffer from this problem and improve predictions on all classes. To establish that this behavior is indeed caused by class imbalance, we show empirically that it persist through different architectures and data types, on language transformers, vision CNNs, and linear models. We further study this phenomenon on a linear classification with cross-entropy loss, showing that heavy-tailed class imbalance leads to ill-conditioning, and that the normalization used by Adam can counteract it.  ( 2 min )
    Uncertainty-Based Extensible Codebook for Discrete Federated Learning in Heterogeneous Data Silos
    arXiv:2402.18888v1 Announce Type: new Abstract: Federated learning (FL), aimed at leveraging vast distributed datasets, confronts a crucial challenge: the heterogeneity of data across different silos. While previous studies have explored discrete representations to enhance model generalization across minor distributional shifts, these approaches often struggle to adapt to new data silos with significantly divergent distributions. In response, we have identified that models derived from FL exhibit markedly increased uncertainty when applied to data silos with unfamiliar distributions. Consequently, we propose an innovative yet straightforward iterative framework, termed Uncertainty-Based Extensible-Codebook Federated Learning (UEFL). This framework dynamically maps latent features to trainable discrete vectors, assesses the uncertainty, and specifically extends the discretization dictionary or codebook for silos exhibiting high uncertainty. Our approach aims to simultaneously enhance accuracy and reduce uncertainty by explicitly addressing the diversity of data distributions, all while maintaining minimal computational overhead in environments characterized by heterogeneous data silos. Through experiments conducted on five datasets, our method has demonstrated its superiority, achieving significant improvements in accuracy (by 3%--22.1%) and uncertainty reduction (by 38.83%--96.24%), thereby outperforming contemporary state-of-the-art methods. The source code is available at https://github.com/destiny301/uefl.  ( 2 min )
    Loss-Free Machine Unlearning
    arXiv:2402.19308v1 Announce Type: new Abstract: We present a machine unlearning approach that is both retraining- and label-free. Most existing machine unlearning approaches require a model to be fine-tuned to remove information while preserving performance. This is computationally expensive and necessitates the storage of the whole dataset for the lifetime of the model. Retraining-free approaches often utilise Fisher information, which is derived from the loss and requires labelled data which may not be available. Thus, we present an extension to the Selective Synaptic Dampening algorithm, substituting the diagonal of the Fisher information matrix for the gradient of the l2 norm of the model output to approximate sensitivity. We evaluate our method in a range of experiments using ResNet18 and Vision Transformer. Results show our label-free method is competitive with existing state-of-the-art approaches.  ( 2 min )
    Investigating Gender Fairness in Machine Learning-driven Personalized Care for Chronic Pain
    arXiv:2402.19226v1 Announce Type: new Abstract: This study investigates gender fairness in personalized pain care recommendations using machine learning algorithms. Leveraging a contextual bandits framework, personalized recommendations are formulated and evaluated using LinUCB algorithm on a dataset comprising interactions with $164$ patients across $10$ sessions each. Results indicate that while adjustments to algorithm parameters influence the quality of pain care recommendations, this impact remains consistent across genders. However, when certain patient information, such as self-reported pain measurements, is absent, the quality of pain care recommendations for women is notably inferior to that for men.  ( 2 min )
    Structure Preserving Diffusion Models
    arXiv:2402.19369v1 Announce Type: new Abstract: Diffusion models have become the leading distribution-learning method in recent years. Herein, we introduce structure-preserving diffusion processes, a family of diffusion processes for learning distributions that possess additional structure, such as group symmetries, by developing theoretical conditions under which the diffusion transition steps preserve said symmetry. While also enabling equivariant data sampling trajectories, we exemplify these results by developing a collection of different symmetry equivariant diffusion models capable of learning distributions that are inherently symmetric. Empirical studies, over both synthetic and real-world datasets, are used to validate the developed models adhere to the proposed theory and are capable of achieving improved performance over existing methods in terms of sample equality. We also show how the proposed models can be used to achieve theoretically guaranteed equivariant image noise reduction without prior knowledge of the image orientation.  ( 2 min )
    Estimation and Deconvolution of Second Order Cyclostationary Signals
    arXiv:2402.19290v1 Announce Type: new Abstract: This method solves the dual problem of blind deconvolution and estimation of the time waveform of noisy second-order cyclo-stationary (CS2) signals that traverse a Transfer Function (TF) en route to a sensor. We have proven that the deconvolution filter exists and eliminates the TF effect from signals whose statistics vary over time. This method is blind, meaning it does not require prior knowledge about the signals or TF. Simulations demonstrate the algorithm high precision across various signal types, TFs, and Signal-to-Noise Ratios (SNRs). In this study, the CS2 signals family is restricted to the product of a deterministic periodic function and white noise. Furthermore, this method has the potential to improve the training of Machine Learning models where the aggregation of signals from identical systems but with different TFs is required.  ( 2 min )
    StiefelGen: A Simple, Model Agnostic Approach for Time Series Data Augmentation over Riemannian Manifolds
    arXiv:2402.19287v1 Announce Type: new Abstract: Data augmentation is an area of research which has seen active development in many machine learning fields, such as in image-based learning models, reinforcement learning for self driving vehicles, and general noise injection for point cloud data. However, convincing methods for general time series data augmentation still leaves much to be desired, especially since the methods developed for these models do not readily cross-over. Three common approaches for time series data augmentation include: (i) Constructing a physics-based model and then imbuing uncertainty over the coefficient space (for example), (ii) Adding noise to the observed data set(s), and, (iii) Having access to ample amounts of time series data sets from which a robust generative neural network model can be trained. However, for many practical problems that work with time series data in the industry: (i) One usually does not have access to a robust physical model, (ii) The addition of noise can in of itself require large or difficult assumptions (for example, what probability distribution should be used? Or, how large should the noise variance be?), and, (iii) In practice, it can be difficult to source a large representative time series data base with which to train the neural network model for the underlying problem. In this paper, we propose a methodology which attempts to simultaneously tackle all three of these previous limitations to a large extent. The method relies upon the well-studied matrix differential geometry of the Stiefel manifold, as it proposes a simple way in which time series signals can placed on, and then smoothly perturbed over the manifold. We attempt to clarify how this method works by showcasing several potential use cases which in particular work to take advantage of the unique properties of this underlying manifold.  ( 3 min )
    Dual Operating Modes of In-Context Learning
    arXiv:2402.18819v1 Announce Type: new Abstract: In-context learning (ICL) exhibits dual operating modes: task learning, i.e., acquiring a new skill from in-context samples, and task retrieval, i.e., locating and activating a relevant pretrained skill. Recent theoretical work investigates various mathematical models to analyze ICL, but existing models explain only one operating mode at a time. We introduce a probabilistic model, with which one can explain the dual operating modes of ICL simultaneously. Focusing on in-context learning of linear functions, we extend existing models for pretraining data by introducing multiple task groups and task-dependent input distributions. We then analyze the behavior of the optimally pretrained model under the squared loss, i.e., the MMSE estimator of the label given in-context examples. Regarding pretraining task distribution as prior and in-context examples as the observation, we derive the closed-form expression of the task posterior distribution. With the closed-form expression, we obtain a quantitative understanding of the two operating modes of ICL. Furthermore, we shed light on an unexplained phenomenon observed in practice: under certain settings, the ICL risk initially increases and then decreases with more in-context examples. Our model offers a plausible explanation for this "early ascent" phenomenon: a limited number of in-context samples may lead to the retrieval of an incorrect skill, thereby increasing the risk, which will eventually diminish as task learning takes effect with more in-context samples. We also theoretically analyze ICL with biased labels, e.g., zero-shot ICL, where in-context examples are assigned random labels. Lastly, we validate our findings and predictions via experiments involving Transformers and large language models.  ( 3 min )
    FedStruct: Federated Decoupled Learning over Interconnected Graphs
    arXiv:2402.19163v1 Announce Type: new Abstract: We address the challenge of federated learning on graph-structured data distributed across multiple clients. Specifically, we focus on the prevalent scenario of interconnected subgraphs, where inter-connections between different clients play a critical role. We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies. To uphold privacy, unlike existing methods, FedStruct eliminates the necessity of sharing or generating sensitive node features or embeddings among clients. Instead, it leverages explicit global graph structure information to capture inter-node dependencies. We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification, showcasing performance close to the centralized approach across various scenarios, including different data partitioning methods, varying levels of label availability, and number of clients.  ( 2 min )
    Derivative-enhanced Deep Operator Network
    arXiv:2402.19242v1 Announce Type: new Abstract: Deep operator networks (DeepONets), a class of neural operators that learn mappings between function spaces, have recently been developed as surrogate models for parametric partial differential equations (PDEs). In this work we propose a derivative-enhanced deep operator network (DE-DeepONet), which leverages the derivative information to enhance the prediction accuracy, and provide a more accurate approximation of the derivatives, especially when the training data are limited. DE-DeepONet incorporates dimension reduction of input into DeepONet and includes two types of derivative labels in the loss function for training, that is, the directional derivatives of the output function with respect to the input function and the gradient of the output function with respect to the physical domain variables. We test DE-DeepONet on three different equations with increasing complexity to demonstrate its effectiveness compared to the vanilla DeepONet.  ( 2 min )
    Neural Galerkin Schemes with Active Learning for High-Dimensional Evolution Equations
    arXiv:2203.01360v4 Announce Type: replace-cross Abstract: Deep neural networks have been shown to provide accurate function approximations in high dimensions. However, fitting network parameters requires informative training data that are often challenging to collect in science and engineering applications. This work proposes Neural Galerkin schemes based on deep learning that generate training data with active learning for numerically solving high-dimensional partial differential equations. Neural Galerkin schemes build on the Dirac-Frenkel variational principle to train networks by minimizing the residual sequentially over time, which enables adaptively collecting new training data in a self-informed manner that is guided by the dynamics described by the partial differential equations. This is in contrast to other machine learning methods that aim to fit network parameters globally in time without taking into account training data acquisition. Our finding is that the active form of gathering training data of the proposed Neural Galerkin schemes is key for numerically realizing the expressive power of networks in high dimensions. Numerical experiments demonstrate that Neural Galerkin schemes have the potential to enable simulating phenomena and processes with many variables for which traditional and other deep-learning-based solvers fail, especially when features of the solutions evolve locally such as in high-dimensional wave propagation problems and interacting particle systems described by Fokker-Planck and kinetic equations.  ( 3 min )
    Extended Deep Adaptive Input Normalization for Preprocessing Time Series Data for Neural Networks
    arXiv:2310.14720v2 Announce Type: replace Abstract: Data preprocessing is a crucial part of any machine learning pipeline, and it can have a significant impact on both performance and training efficiency. This is especially evident when using deep neural networks for time series prediction and classification: real-world time series data often exhibit irregularities such as multi-modality, skewness and outliers, and the model performance can degrade rapidly if these characteristics are not adequately addressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input Normalization) layer, a novel adaptive neural layer that learns how to appropriately normalize irregular time series data for a given task in an end-to-end fashion, instead of using a fixed normalization scheme. This is achieved by optimizing its unknown parameters simultaneously with the deep neural network using back-propagation. Our experiments, conducted using synthetic data, a credit default prediction dataset, and a large-scale limit order book benchmark dataset, demonstrate the superior performance of the EDAIN layer when compared to conventional normalization methods and existing adaptive time series preprocessing layers.  ( 2 min )
    Offline detection of change-points in the mean for stationary graph signals
    arXiv:2006.10628v2 Announce Type: replace Abstract: This paper addresses the problem of segmenting a stream of graph signals: we aim to detect changes in the mean of a multivariate signal defined over the nodes of a known graph. We propose an offline method that relies on the concept of graph signal stationarity and allows the convenient translation of the problem from the original vertex domain to the spectral domain (Graph Fourier Transform), where it is much easier to solve. Although the obtained spectral representation is sparse in real applications, to the best of our knowledge this property has not been sufficiently exploited in the existing related literature. Our change-point detection method adopts a model selection approach that takes into account the sparsity of the spectral representation and determines automatically the number of change-points. Our detector comes with a proof of a non-asymptotic oracle inequality. Numerical experiments demonstrate the performance of the proposed method.  ( 2 min )
    The committee machine: Computational to statistical gaps in learning a two-layers neural network
    arXiv:1806.05451v3 Announce Type: replace Abstract: Heuristic tools from statistical physics have been used in the past to locate the phase transitions and compute the optimal learning and generalization errors in the teacher-student scenario in multi-layer neural networks. In this contribution, we provide a rigorous justification of these approaches for a two-layers neural network model called the committee machine. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that allows to perform optimal learning in polynomial time for a large set of parameters. We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it, strongly suggesting that no efficient algorithm exists for those cases, and unveiling a large computational gap.  ( 3 min )
    Deep Reinforcement Learning: A Convex Optimization Approach
    arXiv:2402.19212v1 Announce Type: cross Abstract: In this paper, we consider reinforcement learning of nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm, where we for each episode use convex optimization to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights calculated at each episode are optimal, with respect to the given sampled states and actions of the current episode. For stable nonlinear systems, we show that the algorithm converges and that the converging parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. In particular, if the regularization parameter is $\rho$ and the time horizon is $T$, then the parameters of the trained neural network converge to $w$, where the distance between $w$ from the optimal parameters $w^\star$ is bounded by $\mathcal{O}(\rho T^{-1})$. That is, when the number of episodes goes to infinity, there exists a constant $C$ such that \[\|w-w^\star\| \le C\cdot\frac{\rho}{T}.\] In particular, our algorithm converges arbitrarily close to the optimal neural network parameters as the time horizon increases or as the regularization parameter decreases.  ( 2 min )
    Inferring Dynamic Networks from Marginals with Iterative Proportional Fitting
    arXiv:2402.18697v1 Announce Type: cross Abstract: A common network inference problem, arising from real-world data constraints, is how to infer a dynamic network from its time-aggregated adjacency matrix and time-varying marginals (i.e., row and column sums). Prior approaches to this problem have repurposed the classic iterative proportional fitting (IPF) procedure, also known as Sinkhorn's algorithm, with promising empirical results. However, the statistical foundation for using IPF has not been well understood: under what settings does IPF provide principled estimation of a dynamic network from its marginals, and how well does it estimate the network? In this work, we establish such a setting, by identifying a generative network model whose maximum likelihood estimates are recovered by IPF. Our model both reveals implicit assumptions on the use of IPF in such settings and enables new analyses, such as structure-dependent error bounds on IPF's parameter estimates. When IPF fails to converge on sparse network data, we introduce a principled algorithm that guarantees IPF converges under minimal changes to the network structure. Finally, we conduct experiments with synthetic and real-world data, which demonstrate the practical value of our theoretical and algorithmic contributions.  ( 2 min )
    Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality
    arXiv:2402.19442v1 Announce Type: new Abstract: We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases -- a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the sense that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in terms of the prediction accuracy of ICL between single-head and multi-head attention models. The key technique for our convergence analysis is to map the gradient flow dynamics in the parameter space to a set of ordinary differential equations in the spectral domain, where the relative magnitudes of the semi-singular values of the attention weights determines task allocation. To our best knowledge, our work provides the first convergence result for the multi-head softmax attention model.  ( 3 min )
    Learnability Gaps of Strategic Classification
    arXiv:2402.19303v1 Announce Type: new Abstract: In contrast with standard classification tasks, strategic classification involves agents strategically modifying their features in an effort to receive favorable predictions. For instance, given a classifier determining loan approval based on credit scores, applicants may open or close their credit cards to fool the classifier. The learning goal is to find a classifier robust against strategic manipulations. Various settings, based on what and when information is known, have been explored in strategic classification. In this work, we focus on addressing a fundamental question: the learnability gaps between strategic classification and standard learning. We essentially show that any learnable class is also strategically learnable: we first consider a fully informative setting, where the manipulation structure (which is modeled by a manipulation graph $G^\star$) is known and during training time the learner has access to both the pre-manipulation data and post-manipulation data. We provide nearly tight sample complexity and regret bounds, offering significant improvements over prior results. Then, we relax the fully informative setting by introducing two natural types of uncertainty. First, following Ahmadi et al. (2023), we consider the setting in which the learner only has access to the post-manipulation data. We improve the results of Ahmadi et al. (2023) and close the gap between mistake upper bound and lower bound raised by them. Our second relaxation of the fully informative setting introduces uncertainty to the manipulation structure. That is, we assume that the manipulation graph is unknown but belongs to a known class of graphs. We provide nearly tight bounds on the learning complexity in various unknown manipulation graph settings. Notably, our algorithm in this setting is of independent interest and can be applied to other problems such as multi-label learning.  ( 3 min )
    Beyond Language Models: Byte Models are Digital World Simulators
    arXiv:2402.19155v1 Announce Type: new Abstract: Traditional deep learning often overlooks bytes, the basic units of the digital world, where all forms of information and operations are encoded and manipulated in binary format. Inspired by the success of next token prediction in natural language processing, we introduce bGPT, a model with next byte prediction to simulate the digital world. bGPT matches specialized models in performance across various modalities, including text, audio, and images, and offers new possibilities for predicting, simulating, and diagnosing algorithm or hardware behaviour. It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte in converting ABC notation to MIDI format. In addition, bGPT demonstrates exceptional capabilities in simulating CPU behaviour, with an accuracy exceeding 99.99% in executing various operations. Leveraging next byte prediction, models like bGPT can directly learn from vast binary data, effectively simulating the intricate patterns of the digital world.  ( 2 min )
    Generating, Reconstructing, and Representing Discrete and Continuous Data: Generalized Diffusion with Learnable Encoding-Decoding
    arXiv:2402.19009v1 Announce Type: new Abstract: The vast applications of deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations -- across various data types, such as discrete text/protein sequences and continuous images. Existing model families, like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), autoregressive models, and diffusion models, generally excel in specific capabilities and data types but fall short in others. We introduce generalized diffusion with learnable encoder-decoder (DiLED), that seamlessly integrates the core capabilities for broad applicability and enhanced performance. DiLED generalizes the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding. Crucially, DiLED is compatible with the well-established diffusion model objective and training recipes, allowing effective learning of the encoder-decoder parameters jointly with diffusion. By choosing appropriate encoder/decoder (e.g., large language models), DiLED naturally applies to different data types. Extensive experiments on text, proteins, and images demonstrate DiLED's flexibility to handle diverse data and tasks and its strong improvement over various existing models.  ( 2 min )
    On the Convergence of Differentially-Private Fine-tuning: To Linearly Probe or to Fully Fine-tune?
    arXiv:2402.18905v1 Announce Type: new Abstract: Differentially private (DP) machine learning pipelines typically involve a two-phase process: non-private pre-training on a public dataset, followed by fine-tuning on private data using DP optimization techniques. In the DP setting, it has been observed that full fine-tuning may not always yield the best test accuracy, even for in-distribution data. This paper (1) analyzes the training dynamics of DP linear probing (LP) and full fine-tuning (FT), and (2) explores the phenomenon of sequential fine-tuning, starting with linear probing and transitioning to full fine-tuning (LP-FT), and its impact on test loss. We provide theoretical insights into the convergence of DP fine-tuning within an overparameterized neural network and establish a utility curve that determines the allocation of privacy budget between linear probing and full fine-tuning. The theoretical results are supported by empirical evaluations on various benchmarks and models. The findings reveal the complex nature of DP fine-tuning methods. These results contribute to a deeper understanding of DP machine learning and highlight the importance of considering the allocation of privacy budget in the fine-tuning process.  ( 2 min )
    Supervised Contrastive Representation Learning: Landscape Analysis with Unconstrained Features
    arXiv:2402.18884v1 Announce Type: new Abstract: Recent findings reveal that over-parameterized deep neural networks, trained beyond zero training-error, exhibit a distinctive structural pattern at the final layer, termed as Neural-collapse (NC). These results indicate that the final hidden-layer outputs in such networks display minimal within-class variations over the training set. While existing research extensively investigates this phenomenon under cross-entropy loss, there are fewer studies focusing on its contrastive counterpart, supervised contrastive (SC) loss. Through the lens of NC, this paper employs an analytical approach to study the solutions derived from optimizing the SC loss. We adopt the unconstrained features model (UFM) as a representative proxy for unveiling NC-related phenomena in sufficiently over-parameterized deep networks. We show that, despite the non-convexity of SC loss minimization, all local minima are global minima. Furthermore, the minimizer is unique (up to a rotation). We prove our results by formalizing a tight convex relaxation of the UFM. Finally, through this convex formulation, we delve deeper into characterizing the properties of global solutions under label-imbalanced training data.  ( 2 min )
    Loss-aware Curriculum Learning for Heterogeneous Graph Neural Networks
    arXiv:2402.18875v1 Announce Type: new Abstract: Heterogeneous Graph Neural Networks (HGNNs) are a class of deep learning models designed specifically for heterogeneous graphs, which are graphs that contain different types of nodes and edges. This paper investigates the application of curriculum learning techniques to improve the performance and robustness of Heterogeneous Graph Neural Networks (GNNs). To better classify the quality of the data, we design a loss-aware training schedule, named LTS that measures the quality of every nodes of the data and incorporate the training dataset into the model in a progressive manner that increases difficulty step by step. LTS can be seamlessly integrated into various frameworks, effectively reducing bias and variance, mitigating the impact of noisy data, and enhancing overall accuracy. Our findings demonstrate the efficacy of curriculum learning in enhancing HGNNs capabilities for analyzing complex graph-structured data. The code is public at https: //github.com/LARS-research/CLGNN/.  ( 2 min )
    Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning
    arXiv:2402.18865v1 Announce Type: new Abstract: Existing research has shown that large language models (LLMs) exhibit remarkable performance in language understanding and generation. However, when LLMs are continuously fine-tuned on complex and diverse domain-specific downstream tasks, the inference performance on historical tasks decreases dramatically, which is known as a catastrophic forgetting problem. A trade-off needs to be kept between learning plasticity and memory stability. Plenty of existing works have explored strategies like memory replay, regularization and parameter isolation, but little is known about the geometric connection of various adjacent minima in the continual LLMs fine-tuning scenarios. In this work, we investigate the geometric connections of different minima through the lens of mode connectivity, which means different minima can be connected by a low-loss valley. Through extensive experiments, we uncover the mode connectivity phenomenon in the LLMs continual learning scenario and find that it can strike a balance between plasticity and stability. Building upon these findings, we propose a simple yet effective method called Interpolation-based LoRA (I-LoRA), which constructs a dual-memory experience replay framework based on LoRA parameter interpolations. Extensive experiments and analysis on eight domain-specific CL benchmarks demonstrate that I-LoRA consistently show significant improvement over the previous state-of-the-art approaches with up to $11\%$ performance gains, providing a strong baseline and insights for future research on the large language model continual learning problem. Our code is available at \url{https://github.com/which47/LLMCL}.  ( 3 min )
    Taking Second-life Batteries from Exhausted to Empowered using Experiments, Data Analysis, and Health Estimation
    arXiv:2402.18859v1 Announce Type: new Abstract: The reuse of retired electric vehicle (EV) batteries in electric grid energy storage emerges as a promising strategy to address environmental concerns and boost economic value. This study concentrates on devising health monitoring algorithms for retired batteries (BMS$_2$) deployed in grid storage applications. Over 15 months of testing, we compile, analyze, and publicly share a dataset of second-life (SL) batteries, implementing a cycling protocol simulating grid energy storage load profiles within a 3 V-4 V voltage window. Four machine learning-based health estimation models, relying on BMS$_2$ features and initial capacity, are developed and compared, with the selected model achieving a Mean Absolute Percentage Error (MAPE) below 2.3% on test data. Additionally, an adaptive online health estimation algorithm is proposed by integrating a clustering-based method, limiting estimation errors during online deployment. These results constitute an initial proof of concept, showcasing the feasibility of repurposing retired batteries for second-life applications. Based on obtained data and representative power demand, these SL batteries exhibit the potential, under specific conditions, for over a decade of grid energy storage use.  ( 2 min )
    Applications of 0-1 Neural Networks in Prescription and Prediction
    arXiv:2402.18851v1 Announce Type: new Abstract: A key challenge in medical decision making is learning treatment policies for patients with limited observational data. This challenge is particularly evident in personalized healthcare decision-making, where models need to take into account the intricate relationships between patient characteristics, treatment options, and health outcomes. To address this, we introduce prescriptive networks (PNNs), shallow 0-1 neural networks trained with mixed integer programming that can be used with counterfactual estimation to optimize policies in medium data settings. These models offer greater interpretability than deep neural networks and can encode more complex policies than common models such as decision trees. We show that PNNs can outperform existing methods in both synthetic data experiments and in a case study of assigning treatments for postpartum hypertension. In particular, PNNs are shown to produce policies that could reduce peak blood pressure by 5.47 mm Hg (p=0.02) over existing clinical practice, and by 2 mm Hg (p=0.01) over the next best prescriptive modeling technique. Moreover PNNs were more likely than all other models to correctly identify clinically significant features while existing models relied on potentially dangerous features such as patient insurance information and race that could lead to bias in treatment.  ( 2 min )
    A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations
    arXiv:2402.18836v1 Announce Type: new Abstract: This paper investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weights of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.  ( 2 min )
    Extended Flow Matching: a Method of Conditional Generation with Generalized Continuity Equation
    arXiv:2402.18839v1 Announce Type: new Abstract: The task of conditional generation is one of the most important applications of generative models, and numerous methods have been developed to date based on the celebrated diffusion models, with the guidance-based classifier-free method taking the lead. However, the theory of the guidance-based method not only requires the user to fine-tune the "guidance strength," but its target vector field does not necessarily correspond to the conditional distribution used in training. In this paper, we develop the theory of conditional generation based on Flow Matching, a current strong contender of diffusion methods. Motivated by the interpretation of a probability path as a distribution on path space, we establish a novel theory of flow-based generation of conditional distribution by employing the mathematical framework of generalized continuity equation instead of the continuity equation in flow matching. This theory naturally derives a method that aims to match the matrix field as opposed to the vector field. Our framework ensures the continuity of the generated conditional distribution through the existence of flow between conditional distributions. We will present our theory through experiments and mathematical results.  ( 2 min )
    Batch size invariant Adam
    arXiv:2402.18824v1 Announce Type: new Abstract: We propose a batch size invariant version of Adam, for use in large-scale, distributed settings, in which the mini-batch is divided into micro-batches which are distributed among worker nodes. For the v term, standard Adam first computes the average over micro-batch gradients, then squares, while in the batch size invariant Adam proposed here, we first square the micro-batch gradients, then average. Previous work (e.g. Malladi et al. 2022) used an alternative approach that involved a square-root scaling of the learning rate, but this approach requires strong assumptions to work; in particular that the gradient variance dominates the square of the expected gradient. In contrast, the approach proposed here gives batch size invariance without this assumption. We confirm that in practice our scheme gives batch size invariance in a much larger range of scenarios than the previous approach.  ( 2 min )
    To Pool or Not To Pool: Analyzing the Regularizing Effects of Group-Fair Training on Shared Models
    arXiv:2402.18803v1 Announce Type: new Abstract: In fair machine learning, one source of performance disparities between groups is over-fitting to groups with relatively few training samples. We derive group-specific bounds on the generalization error of welfare-centric fair machine learning that benefit from the larger sample size of the majority group. We do this by considering group-specific Rademacher averages over a restricted hypothesis class, which contains the family of models likely to perform well with respect to a fair learning objective (e.g., a power-mean). Our simulations demonstrate these bounds improve over a naive method, as expected by theory, with particularly significant improvement for smaller group sizes.  ( 2 min )
    BlockEcho: Retaining Long-Range Dependencies for Imputing Block-Wise Missing Data
    arXiv:2402.18800v1 Announce Type: new Abstract: Block-wise missing data poses significant challenges in real-world data imputation tasks. Compared to scattered missing data, block-wise gaps exacerbate adverse effects on subsequent analytic and machine learning tasks, as the lack of local neighboring elements significantly reduces the interpolation capability and predictive power. However, this issue has not received adequate attention. Most SOTA matrix completion methods appeared less effective, primarily due to overreliance on neighboring elements for predictions. We systematically analyze the issue and propose a novel matrix completion method ``BlockEcho" for a more comprehensive solution. This method creatively integrates Matrix Factorization (MF) within Generative Adversarial Networks (GAN) to explicitly retain long-distance inter-element relationships in the original matrix. Besides, we incorporate an additional discriminator for GAN, comparing the generator's intermediate progress with pre-trained MF results to constrain high-order feature distributions. Subsequently, we evaluate BlockEcho on public datasets across three domains. Results demonstrate superior performance over both traditional and SOTA methods when imputing block-wise missing data, especially at higher missing rates. The advantage also holds for scattered missing data at high missing rates. We also contribute on the analyses in providing theoretical justification on the optimality and convergence of fusing MF and GAN for missing block data.  ( 2 min )
    MPAT: Building Robust Deep Neural Networks against Textual Adversarial Attacks
    arXiv:2402.18792v1 Announce Type: new Abstract: Deep neural networks have been proven to be vulnerable to adversarial examples and various methods have been proposed to defend against adversarial attacks for natural language processing tasks. However, previous defense methods have limitations in maintaining effective defense while ensuring the performance of the original task. In this paper, we propose a malicious perturbation based adversarial training method (MPAT) for building robust deep neural networks against textual adversarial attacks. Specifically, we construct a multi-level malicious example generation strategy to generate adversarial examples with malicious perturbations, which are used instead of original inputs for model training. Additionally, we employ a novel training objective function to ensure achieving the defense goal without compromising the performance on the original task. We conduct comprehensive experiments to evaluate our defense method by attacking five victim models on three benchmark datasets. The result demonstrates that our method is more effective against malicious adversarial attacks compared with previous defense methods while maintaining or further improving the performance on the original task.  ( 2 min )
    Enhancing the "Immunity" of Mixture-of-Experts Networks for Adversarial Defense
    arXiv:2402.18787v1 Announce Type: new Abstract: Recent studies have revealed the vulnerability of Deep Neural Networks (DNNs) to adversarial examples, which can easily fool DNNs into making incorrect predictions. To mitigate this deficiency, we propose a novel adversarial defense method called "Immunity" (Innovative MoE with MUtual information \& positioN stabilITY) based on a modified Mixture-of-Experts (MoE) architecture in this work. The key enhancements to the standard MoE are two-fold: 1) integrating of Random Switch Gates (RSGs) to obtain diverse network structures via random permutation of RSG parameters at evaluation time, despite of RSGs being determined after one-time training; 2) devising innovative Mutual Information (MI)-based and Position Stability-based loss functions by capitalizing on Grad-CAM's explanatory power to increase the diversity and the causality of expert networks. Notably, our MI-based loss operates directly on the heatmaps, thereby inducing subtler negative impacts on the classification performance when compared to other losses of the same type, theoretically. Extensive evaluation validates the efficacy of the proposed approach in improving adversarial robustness against a wide range of attacks.  ( 2 min )
    Disentangling the Causes of Plasticity Loss in Neural Networks
    arXiv:2402.18762v1 Announce Type: new Abstract: Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a \textit{stationary} data distribution. In settings where this assumption is violated, e.g.\ deep reinforcement learning, learning algorithms become unstable and brittle with respect to hyperparameters and even random seeds. One factor driving this instability is the loss of plasticity, meaning that updating the network's predictions in response to new information becomes more difficult as training progresses. While many recent works provide analyses and partial solutions to this phenomenon, a fundamental question remains unanswered: to what extent do known mechanisms of plasticity loss overlap, and how can mitigation strategies be combined to best maintain the trainability of a network? This paper addresses these questions, showing that loss of plasticity can be decomposed into multiple independent mechanisms and that, while intervening on any single mechanism is insufficient to avoid the loss of plasticity in all cases, intervening on multiple mechanisms in conjunction results in highly robust learning algorithms. We show that a combination of layer normalization and weight decay is highly effective at maintaining plasticity in a variety of synthetic nonstationary learning tasks, and further demonstrate its effectiveness on naturally arising nonstationarities, including reinforcement learning in the Arcade Learning Environment.  ( 2 min )
    Pre-training Differentially Private Models with Limited Public Data
    arXiv:2402.18752v1 Announce Type: new Abstract: The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method to gauge the degree of security provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation when applying DP during the pre-training stage. Consequently, DP is yet not capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make a key observation that DP optimizers' performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10\% of public data, our strategy can achieve DP accuracy of 41.5\% on ImageNet-21k (with $\epsilon=8$), as well as non-DP accuracy of 55.7\% and and 60.0\% on downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models.  ( 2 min )
    Unveiling Privacy, Memorization, and Input Curvature Links
    arXiv:2402.18726v1 Announce Type: new Abstract: Deep Neural Nets (DNNs) have become a pervasive tool for solving many emerging problems. However, they tend to overfit to and memorize the training set. Memorization is of keen interest since it is closely related to several concepts such as generalization, noisy learning, and privacy. To study memorization, Feldman (2019) proposed a formal score, however its computational requirements limit its practical use. Recent research has shown empirical evidence linking input loss curvature (measured by the trace of the loss Hessian w.r.t inputs) and memorization. It was shown to be ~3 orders of magnitude more efficient than calculating the memorization score. However, there is a lack of theoretical understanding linking memorization with input loss curvature. In this paper, we not only investigate this connection but also extend our analysis to establish theoretical links between differential privacy, memorization, and input loss curvature. First, we derive an upper bound on memorization characterized by both differential privacy and input loss curvature. Second, we present a novel insight showing that input loss curvature is upper-bounded by the differential privacy parameter. Our theoretical findings are further empirically validated using deep models on CIFAR and ImageNet datasets, showing a strong correlation between our theoretical predictions and results observed in practice.  ( 2 min )
    The VOROS: Lifting ROC curves to 3D
    arXiv:2402.18689v1 Announce Type: new Abstract: The area under the ROC curve is a common measure that is often used to rank the relative performance of different binary classifiers. However, as has been also previously noted, it can be a measure that ill-captures the benefits of different classifiers when either the true class values or misclassification costs are highly unbalanced between the two classes. We introduce a third dimension to capture these costs, and lift the ROC curve to a ROC surface in a natural way. We study both this surface and introduce the VOROS, the volume over this ROC surface, as a 3D generalization of the 2D area under the ROC curve. For problems where there are only bounds on the expected costs or class imbalances, we restrict consideration to the volume of the appropriate subregion of the ROC surface. We show how the VOROS can better capture the costs of different classifiers on both a classical and a modern example dataset.  ( 2 min )
    Quantifying Human Priors over Social and Navigation Networks
    arXiv:2402.18651v1 Announce Type: new Abstract: Human knowledge is largely implicit and relational -- do we have a friend in common? can I walk from here to there? In this work, we leverage the combinatorial structure of graphs to quantify human priors over such relational data. Our experiments focus on two domains that have been continuously relevant over evolutionary timescales: social interaction and spatial navigation. We find that some features of the inferred priors are remarkably consistent, such as the tendency for sparsity as a function of graph size. Other features are domain-specific, such as the propensity for triadic closure in social interactions. More broadly, our work demonstrates how nonclassical statistical analysis of indirect behavioral experiments can be used to efficiently model latent biases in the data.  ( 2 min )
    Deep Neural Network Models Trained With A Fixed Random Classifier Transfer Better Across Domains
    arXiv:2402.18614v1 Announce Type: new Abstract: The recently discovered Neural collapse (NC) phenomenon states that the last-layer weights of Deep Neural Networks (DNN), converge to the so-called Equiangular Tight Frame (ETF) simplex, at the terminal phase of their training. This ETF geometry is equivalent to vanishing within-class variability of the last layer activations. Inspired by NC properties, we explore in this paper the transferability of DNN models trained with their last layer weight fixed according to ETF. This enforces class separation by eliminating class covariance information, effectively providing implicit regularization. We show that DNN models trained with such a fixed classifier significantly improve transfer performance, particularly on out-of-domain datasets. On a broad range of fine-grained image classification datasets, our approach outperforms i) baseline methods that do not perform any covariance regularization (up to 22%), as well as ii) methods that explicitly whiten covariance of activations throughout training (up to 19%). Our findings suggest that DNNs trained with fixed ETF classifiers offer a powerful mechanism for improving transfer learning across domains.  ( 3 min )
    Why Attention Graphs Are All We Need: Pioneering Hierarchical Classification of Hematologic Cell Populations with LeukoGraph
    arXiv:2402.18610v1 Announce Type: new Abstract: In the complex landscape of hematologic samples such as peripheral blood or bone marrow, cell classification, delineating diverse populations into a hierarchical structure, presents profound challenges. This study presents LeukoGraph, a recently developed framework designed explicitly for this purpose employing graph attention networks (GATs) to navigate hierarchical classification (HC) complexities. Notably, LeukoGraph stands as a pioneering effort, marking the application of graph neural networks (GNNs) for hierarchical inference on graphs, accommodating up to one million nodes and millions of edges, all derived from flow cytometry data. LeukoGraph intricately addresses a classification paradigm where for example four different cell populations undergo flat categorization, while a fifth diverges into two distinct child branches, exemplifying the nuanced hierarchical structure inherent in complex datasets. The technique is more general than this example. A hallmark achievement of LeukoGraph is its F-score of 98%, significantly outclassing prevailing state-of-the-art methodologies. Crucially, LeukoGraph's prowess extends beyond theoretical innovation, showcasing remarkable precision in predicting both flat and hierarchical cell types across flow cytometry datasets from 30 distinct patients. This precision is further underscored by LeukoGraph's ability to maintain a correct label ratio, despite the inherent challenges posed by hierarchical classifications.  ( 2 min )
    ICE-SEARCH: A Language Model-Driven Feature Selection Approach
    arXiv:2402.18609v1 Announce Type: new Abstract: This study unveils the In-Context Evolutionary Search (ICE-SEARCH) method, the first work that melds language models (LMs) with evolutionary algorithms for feature selection (FS) tasks and demonstrates its effectiveness in Medical Predictive Analytics (MPA) applications. ICE-SEARCH harnesses the crossover and mutation capabilities inherent in LMs within an evolutionary framework, significantly improving FS through the model's comprehensive world knowledge and its adaptability to a variety of roles. Our evaluation of this methodology spans three crucial MPA tasks: stroke, cardiovascular disease, and diabetes, where ICE-SEARCH outperforms traditional FS methods in pinpointing essential features for medical applications. ICE-SEARCH achieves State-of-the-Art (SOTA) performance in stroke prediction and diabetes prediction; the Decision-Randomized ICE-SEARCH ranks as SOTA in cardiovascular disease prediction. Our results not only demonstrate the efficacy of ICE-SEARCH in medical FS but also underscore the versatility, efficiency, and scalability of integrating LMs in FS tasks. The study emphasizes the critical role of incorporating domain-specific insights, illustrating ICE-SEARCH's robustness, generalizability, and swift convergence. This opens avenues for further research into comprehensive and intricate FS landscapes, marking a significant stride in the application of artificial intelligence in medical predictive analytics.  ( 2 min )
    Impact of network topology on the performance of Decentralized Federated Learning
    arXiv:2402.18606v1 Announce Type: new Abstract: Fully decentralized learning is gaining momentum for training AI models at the Internet's edge, addressing infrastructure challenges and privacy concerns. In a decentralized machine learning system, data is distributed across multiple nodes, with each node training a local model based on its respective dataset. The local models are then shared and combined to form a global model capable of making accurate predictions on new data. Our exploration focuses on how different types of network structures influence the spreading of knowledge - the process by which nodes incorporate insights gained from learning patterns in data available on other nodes across the network. Specifically, this study investigates the intricate interplay between network structure and learning performance using three network topologies and six data distribution methods. These methods consider different vertex properties, including degree centrality, betweenness centrality, and clustering coefficient, along with whether nodes exhibit high or low values of these metrics. Our findings underscore the significance of global centrality metrics (degree, betweenness) in correlating with learning performance, while local clustering proves less predictive. We highlight the challenges in transferring knowledge from peripheral to central nodes, attributed to a dilution effect during model aggregation. Additionally, we observe that central nodes exert a pull effect, facilitating the spread of knowledge. In examining degree distribution, hubs in Barabasi-Albert networks positively impact learning for central nodes but exacerbate dilution when knowledge originates from peripheral nodes. Finally, we demonstrate the formidable challenge of knowledge circulation outside of segregated communities.  ( 3 min )
    Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective
    arXiv:2402.18607v1 Announce Type: new Abstract: Diffusion models have recently gained significant attention in both academia and industry due to their impressive generative performance in terms of both sampling quality and distribution coverage. Accordingly, proposals are made for sharing pre-trained diffusion models across different organizations, as a way of improving data utilization while enhancing privacy protection by avoiding sharing private data directly. However, the potential risks associated with such an approach have not been comprehensively examined. In this paper, we take an adversarial perspective to investigate the potential privacy and fairness risks associated with the sharing of diffusion models. Specifically, we investigate the circumstances in which one party (the sharer) trains a diffusion model using private data and provides another party (the receiver) black-box access to the pre-trained model for downstream tasks. We demonstrate that the sharer can execute fairness poisoning attacks to undermine the receiver's downstream models by manipulating the training data distribution of the diffusion model. Meanwhile, the receiver can perform property inference attacks to reveal the distribution of sensitive features in the sharer's dataset. Our experiments conducted on real-world datasets demonstrate remarkable attack performance on different types of diffusion models, which highlights the critical importance of robust data auditing and privacy protection protocols in pertinent applications.  ( 2 min )
    MMSR: Symbolic Regression is a Multimodal Task
    arXiv:2402.18603v1 Announce Type: new Abstract: Mathematical formulas are the crystallization of human wisdom in exploring the laws of nature for thousands of years. Describing the complex laws of nature with a concise mathematical formula is a constant pursuit of scientists and a great challenge for artificial intelligence. This field is called symbolic regression. Symbolic regression was originally formulated as a combinatorial optimization problem, and GP and reinforcement learning algorithms were used to solve it. However, GP is sensitive to hyperparameters, and these two types of algorithms are inefficient. To solve this problem, researchers treat the mapping from data to expressions as a translation problem. And the corresponding large-scale pre-trained model is introduced. However, the data and expression skeletons do not have very clear word correspondences as the two languages do. Instead, they are more like two modalities (e.g., image and text). Therefore, in this paper, we proposed MMSR. The SR problem is solved as a pure multimodal problem, and contrastive learning is also introduced in the training process for modal alignment to facilitate later modal feature fusion. It is worth noting that in order to better promote the modal feature fusion, we adopt the strategy of training contrastive learning loss and other losses at the same time, which only needs one-step training, instead of training contrastive learning loss first and then training other losses. Because our experiments prove training together can make the feature extraction module and feature fusion module running-in better. Experimental results show that compared with multiple large-scale pre-training baselines, MMSR achieves the most advanced results on multiple mainstream datasets including SRBench.  ( 3 min )
    Stochastic contextual bandits with graph feedback: from independence number to MAS number
    arXiv:2402.18591v1 Announce Type: new Abstract: We consider contextual bandits with graph feedback, a class of interactive learning problems with richer structures than vanilla contextual bandits, where taking an action reveals the rewards for all neighboring actions in the feedback graph under all contexts. Unlike the multi-armed bandits setting where a growing literature has painted a near-complete understanding of graph feedback, much remains unexplored in the contextual bandits counterpart. In this paper, we make inroads into this inquiry by establishing a regret lower bound $\Omega(\sqrt{\beta_M(G) T})$, where $M$ is the number of contexts, $G$ is the feedback graph, and $\beta_M(G)$ is our proposed graph-theoretical quantity that characterizes the fundamental learning limit for this class of problems. Interestingly, $\beta_M(G)$ interpolates between $\alpha(G)$ (the independence number of the graph) and $\mathsf{m}(G)$ (the maximum acyclic subgraph (MAS) number of the graph) as the number of contexts $M$ varies. We also provide algorithms that achieve near-optimal regrets for important classes of context sequences and/or feedback graphs, such as transitively closed graphs that find applications in auctions and inventory control. In particular, with many contexts, our results show that the MAS number completely characterizes the statistical complexity for contextual bandits, as opposed to the independence number in multi-armed bandits.  ( 2 min )
    Meta-Tasks: An alternative view on Meta-Learning Regularization
    arXiv:2402.18599v1 Announce Type: new Abstract: Few-shot learning (FSL) is a challenging machine learning problem due to a scarcity of labeled data. The ability to generalize effectively on both novel and training tasks is a significant barrier to FSL. This paper proposes a novel solution that can generalize to both training and novel tasks while also utilizing unlabeled samples. The method refines the embedding model before updating the outer loop using unsupervised techniques as ``meta-tasks''. The experimental results show that our proposed method performs well on novel and training tasks, with faster and better convergence, lower generalization, and standard deviation error, indicating its potential for practical applications in FSL. The experimental results show that the proposed method outperforms prototypical networks by 3.9%.  ( 2 min )
    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
    arXiv:2402.19427v1 Announce Type: new Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.  ( 2 min )
  • Open

    Imputation of missing values in multi-view data
    arXiv:2210.14484v3 Announce Type: replace Abstract: Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This leads to very large quantities of missing data which, especially when combined with high-dimensionality, makes the application of conditional imputation methods computationally infeasible. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.  ( 2 min )
    Semi-Supervised U-statistics
    arXiv:2402.18921v1 Announce Type: cross Abstract: Semi-supervised datasets are ubiquitous across diverse domains where obtaining fully labeled data is costly or time-consuming. The prevalence of such datasets has consistently driven the demand for new tools and methods that exploit the potential of unlabeled data. Responding to this demand, we introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data, and investigate their statistical properties. We show that the proposed approach is asymptotically Normal and exhibits notable efficiency gains over classical U-statistics by effectively integrating various powerful prediction tools into the framework. To understand the fundamental difficulty of the problem, we derive minimax lower bounds in semi-supervised settings and showcase that our procedure is semi-parametrically efficient under regularity conditions. Moreover, tailored to bivariate kernels, we propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes, and demonstrate its optimality properties. Simulation studies are conducted to corroborate our findings and to further demonstrate our framework.  ( 2 min )
    Negative-Binomial Randomized Gamma Markov Processes for Heterogeneous Overdispersed Count Time Series
    arXiv:2402.18995v1 Announce Type: cross Abstract: Modeling count-valued time series has been receiving increasing attention since count time series naturally arise in physical and social domains. Poisson gamma dynamical systems (PGDSs) are newly-developed methods, which can well capture the expressive latent transition structure and bursty dynamics behind count sequences. In particular, PGDSs demonstrate superior performance in terms of data imputation and prediction, compared with canonical linear dynamical system (LDS) based methods. Despite these advantages, PGDS cannot capture the heterogeneous overdispersed behaviours of the underlying dynamic processes. To mitigate this defect, we propose a negative-binomial-randomized gamma Markov process, which not only significantly improves the predictive performance of the proposed dynamical system, but also facilitates the fast convergence of the inference algorithm. Moreover, we develop methods to estimate both factor-structured and graph-structured transition dynamics, which enable us to infer more explainable latent structure, compared with PGDSs. Finally, we demonstrate the explainable latent structure learned by the proposed method, and show its superior performance in imputing missing data and forecasting future observations, compared with the related models.  ( 2 min )
    VEC-SBM: Optimal Community Detection with Vectorial Edges Covariates
    arXiv:2402.18805v1 Announce Type: cross Abstract: Social networks are often associated with rich side information, such as texts and images. While numerous methods have been developed to identify communities from pairwise interactions, they usually ignore such side information. In this work, we study an extension of the Stochastic Block Model (SBM), a widely used statistical framework for community detection, that integrates vectorial edges covariates: the Vectorial Edges Covariates Stochastic Block Model (VEC-SBM). We propose a novel algorithm based on iterative refinement techniques and show that it optimally recovers the latent communities under the VEC-SBM. Furthermore, we rigorously assess the added value of leveraging edge's side information in the community detection process. We complement our theoretical results with numerical experiments on synthetic and semi-synthetic data.  ( 2 min )
    Prognostic Covariate Adjustment for Logistic Regression in Randomized Controlled Trials
    arXiv:2402.18900v1 Announce Type: cross Abstract: Randomized controlled trials (RCTs) with binary primary endpoints introduce novel challenges for inferring the causal effects of treatments. The most significant challenge is non-collapsibility, in which the conditional odds ratio estimand under covariate adjustment differs from the unconditional estimand in the logistic regression analysis of RCT data. This issue gives rise to apparent paradoxes, such as the variance of the estimator for the conditional odds ratio from a covariate-adjusted model being greater than the variance of the estimator from the unadjusted model. We address this challenge in the context of adjustment based on predictions of control outcomes from generative artificial intelligence (AI) algorithms, which are referred to as prognostic scores. We demonstrate that prognostic score adjustment in logistic regression increases the power of the Wald test for the conditional odds ratio under a fixed sample size, or alternatively reduces the necessary sample size to achieve a desired power, compared to the unadjusted analysis. We derive formulae for prospective calculations of the power gain and sample size reduction that can result from adjustment for the prognostic score. Furthermore, we utilize g-computation to expand the scope of prognostic score adjustment to inferences on the marginal risk difference, relative risk, and odds ratio estimands. We demonstrate the validity of our formulae via extensive simulation studies that encompass different types of logistic regression model specifications. Our simulation studies also indicate how prognostic score adjustment can reduce the variance of g-computation estimators for the marginal estimands while maintaining frequentist properties such as asymptotic unbiasedness and Type I error rate control. Our methodology can ultimately enable more definitive and conclusive analyses for RCTs with binary primary endpoints.  ( 3 min )
    Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures
    arXiv:2306.15012v3 Announce Type: replace Abstract: Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors, 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.  ( 3 min )
    Cross-Prediction-Powered Inference
    arXiv:2309.16598v3 Announce Type: replace Abstract: While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.  ( 2 min )
    Neural Galerkin Schemes with Active Learning for High-Dimensional Evolution Equations
    arXiv:2203.01360v4 Announce Type: replace-cross Abstract: Deep neural networks have been shown to provide accurate function approximations in high dimensions. However, fitting network parameters requires informative training data that are often challenging to collect in science and engineering applications. This work proposes Neural Galerkin schemes based on deep learning that generate training data with active learning for numerically solving high-dimensional partial differential equations. Neural Galerkin schemes build on the Dirac-Frenkel variational principle to train networks by minimizing the residual sequentially over time, which enables adaptively collecting new training data in a self-informed manner that is guided by the dynamics described by the partial differential equations. This is in contrast to other machine learning methods that aim to fit network parameters globally in time without taking into account training data acquisition. Our finding is that the active form of gathering training data of the proposed Neural Galerkin schemes is key for numerically realizing the expressive power of networks in high dimensions. Numerical experiments demonstrate that Neural Galerkin schemes have the potential to enable simulating phenomena and processes with many variables for which traditional and other deep-learning-based solvers fail, especially when features of the solutions evolve locally such as in high-dimensional wave propagation problems and interacting particle systems described by Fokker-Planck and kinetic equations.  ( 3 min )
    The committee machine: Computational to statistical gaps in learning a two-layers neural network
    arXiv:1806.05451v3 Announce Type: replace-cross Abstract: Heuristic tools from statistical physics have been used in the past to locate the phase transitions and compute the optimal learning and generalization errors in the teacher-student scenario in multi-layer neural networks. In this contribution, we provide a rigorous justification of these approaches for a two-layers neural network model called the committee machine. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that allows to perform optimal learning in polynomial time for a large set of parameters. We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it, strongly suggesting that no efficient algorithm exists for those cases, and unveiling a large computational gap.  ( 3 min )
    Offline detection of change-points in the mean for stationary graph signals
    arXiv:2006.10628v2 Announce Type: replace-cross Abstract: This paper addresses the problem of segmenting a stream of graph signals: we aim to detect changes in the mean of a multivariate signal defined over the nodes of a known graph. We propose an offline method that relies on the concept of graph signal stationarity and allows the convenient translation of the problem from the original vertex domain to the spectral domain (Graph Fourier Transform), where it is much easier to solve. Although the obtained spectral representation is sparse in real applications, to the best of our knowledge this property has not been sufficiently exploited in the existing related literature. Our change-point detection method adopts a model selection approach that takes into account the sparsity of the spectral representation and determines automatically the number of change-points. Our detector comes with a proof of a non-asymptotic oracle inequality. Numerical experiments demonstrate the performance of the proposed method.  ( 2 min )
    Exploring a new machine learning based probabilistic model for high-resolution indoor radon mapping, using the German indoor radon survey data
    arXiv:2310.11143v2 Announce Type: replace Abstract: Radon is a carcinogenic, radioactive gas that can accumulate indoors. Therefore, accurate knowledge of indoor radon concentration is crucial for assessing radon-related health effects or identifying radon-prone areas. Indoor radon concentration at the national scale is usually estimated on the basis of extensive measurement campaigns. However, characteristics of the sample often differ from the characteristics of the population due to the large number of relevant factors that control the indoor radon concentration such as the availability of geogenic radon or floor level. Furthermore, the sample size usually does not allow estimation with high spatial resolution. We propose a model-based approach that allows a more realistic estimation of indoor radon distribution with a higher spatial resolution than a purely data-based approach. A two-stage modelling approach was applied: 1) a quantile regression forest using environmental and building data as predictors was applied to estimate the probability distribution function of indoor radon for each floor level of each residential building in Germany; (2) a probabilistic Monte Carlo sampling technique enabled the combination and population weighting of floor-level predictions. In this way, the uncertainty of the individual predictions is effectively propagated into the estimate of variability at the aggregated level. The results show an approximate lognormal distribution with an arithmetic mean of 63 Bq/m3, a geometric mean of 41 Bq/m3 and a 95 %ile of 180 Bq/m3. The exceedance probability for 100 Bq/m3 and 300 Bq/m3 are 12.5 % (10.5 million people) and 2.2 % (1.9 million people), respectively.  ( 3 min )
    Learning Associative Memories with Gradient Descent
    arXiv:2402.18724v1 Announce Type: cross Abstract: This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the ``classification margins.'' Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. In underparameterized regimes, we illustrate how the cross-entropy loss can lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.  ( 2 min )
    Listening to the Noise: Blind Denoising with Gibbs Diffusion
    arXiv:2402.19455v1 Announce Type: new Abstract: In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincide with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.  ( 2 min )
    Applications of 0-1 Neural Networks in Prescription and Prediction
    arXiv:2402.18851v1 Announce Type: cross Abstract: A key challenge in medical decision making is learning treatment policies for patients with limited observational data. This challenge is particularly evident in personalized healthcare decision-making, where models need to take into account the intricate relationships between patient characteristics, treatment options, and health outcomes. To address this, we introduce prescriptive networks (PNNs), shallow 0-1 neural networks trained with mixed integer programming that can be used with counterfactual estimation to optimize policies in medium data settings. These models offer greater interpretability than deep neural networks and can encode more complex policies than common models such as decision trees. We show that PNNs can outperform existing methods in both synthetic data experiments and in a case study of assigning treatments for postpartum hypertension. In particular, PNNs are shown to produce policies that could reduce peak blood pressure by 5.47 mm Hg (p=0.02) over existing clinical practice, and by 2 mm Hg (p=0.01) over the next best prescriptive modeling technique. Moreover PNNs were more likely than all other models to correctly identify clinically significant features while existing models relied on potentially dangerous features such as patient insurance information and race that could lead to bias in treatment.  ( 2 min )
    Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks
    arXiv:2402.19460v1 Announce Type: cross Abstract: Uncertainty quantification, once a singular task, has evolved into a spectrum of tasks, including abstained prediction, out-of-distribution detection, and aleatoric uncertainty quantification. The latest goal is disentanglement: the construction of multiple estimators that are each tailored to one and only one task. Hence, there is a plethora of recent advances with different intentions - that often entirely deviate from practical behavior. This paper conducts a comprehensive evaluation of numerous uncertainty estimators across diverse tasks on ImageNet. We find that, despite promising theoretical endeavors, disentanglement is not yet achieved in practice. Additionally, we reveal which uncertainty estimators excel at which specific tasks, providing insights for practitioners and guiding future research toward task-centric and disentangled uncertainty estimation methods. Our code is available at https://github.com/bmucsanyi/bud.  ( 2 min )
    Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
    arXiv:2402.19449v1 Announce Type: cross Abstract: Adam has been shown to outperform gradient descent in optimizing large language transformers empirically, and by a larger margin than on other tasks, but it is unclear why this happens. We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics. When training with gradient descent, the loss associated with infrequent words decreases slower than the loss associated with frequent ones. As most samples come from relatively infrequent words, the average loss decreases slowly with gradient descent. On the other hand, Adam and sign-based methods do not suffer from this problem and improve predictions on all classes. To establish that this behavior is indeed caused by class imbalance, we show empirically that it persist through different architectures and data types, on language transformers, vision CNNs, and linear models. We further study this phenomenon on a linear classification with cross-entropy loss, showing that heavy-tailed class imbalance leads to ill-conditioning, and that the normalization used by Adam can counteract it.  ( 2 min )
    Supervised Contrastive Representation Learning: Landscape Analysis with Unconstrained Features
    arXiv:2402.18884v1 Announce Type: cross Abstract: Recent findings reveal that over-parameterized deep neural networks, trained beyond zero training-error, exhibit a distinctive structural pattern at the final layer, termed as Neural-collapse (NC). These results indicate that the final hidden-layer outputs in such networks display minimal within-class variations over the training set. While existing research extensively investigates this phenomenon under cross-entropy loss, there are fewer studies focusing on its contrastive counterpart, supervised contrastive (SC) loss. Through the lens of NC, this paper employs an analytical approach to study the solutions derived from optimizing the SC loss. We adopt the unconstrained features model (UFM) as a representative proxy for unveiling NC-related phenomena in sufficiently over-parameterized deep networks. We show that, despite the non-convexity of SC loss minimization, all local minima are global minima. Furthermore, the minimizer is unique (up to a rotation). We prove our results by formalizing a tight convex relaxation of the UFM. Finally, through this convex formulation, we delve deeper into characterizing the properties of global solutions under label-imbalanced training data.  ( 2 min )
    Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality
    arXiv:2402.19442v1 Announce Type: cross Abstract: We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases -- a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the sense that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in terms of the prediction accuracy of ICL between single-head and multi-head attention models. The key technique for our convergence analysis is to map the gradient flow dynamics in the parameter space to a set of ordinary differential equations in the spectral domain, where the relative magnitudes of the semi-singular values of the attention weights determines task allocation. To our best knowledge, our work provides the first convergence result for the multi-head softmax attention model.  ( 3 min )
    BlockEcho: Retaining Long-Range Dependencies for Imputing Block-Wise Missing Data
    arXiv:2402.18800v1 Announce Type: cross Abstract: Block-wise missing data poses significant challenges in real-world data imputation tasks. Compared to scattered missing data, block-wise gaps exacerbate adverse effects on subsequent analytic and machine learning tasks, as the lack of local neighboring elements significantly reduces the interpolation capability and predictive power. However, this issue has not received adequate attention. Most SOTA matrix completion methods appeared less effective, primarily due to overreliance on neighboring elements for predictions. We systematically analyze the issue and propose a novel matrix completion method ``BlockEcho" for a more comprehensive solution. This method creatively integrates Matrix Factorization (MF) within Generative Adversarial Networks (GAN) to explicitly retain long-distance inter-element relationships in the original matrix. Besides, we incorporate an additional discriminator for GAN, comparing the generator's intermediate progress with pre-trained MF results to constrain high-order feature distributions. Subsequently, we evaluate BlockEcho on public datasets across three domains. Results demonstrate superior performance over both traditional and SOTA methods when imputing block-wise missing data, especially at higher missing rates. The advantage also holds for scattered missing data at high missing rates. We also contribute on the analyses in providing theoretical justification on the optimality and convergence of fusing MF and GAN for missing block data.  ( 2 min )
    Inferring Dynamic Networks from Marginals with Iterative Proportional Fitting
    arXiv:2402.18697v1 Announce Type: new Abstract: A common network inference problem, arising from real-world data constraints, is how to infer a dynamic network from its time-aggregated adjacency matrix and time-varying marginals (i.e., row and column sums). Prior approaches to this problem have repurposed the classic iterative proportional fitting (IPF) procedure, also known as Sinkhorn's algorithm, with promising empirical results. However, the statistical foundation for using IPF has not been well understood: under what settings does IPF provide principled estimation of a dynamic network from its marginals, and how well does it estimate the network? In this work, we establish such a setting, by identifying a generative network model whose maximum likelihood estimates are recovered by IPF. Our model both reveals implicit assumptions on the use of IPF in such settings and enables new analyses, such as structure-dependent error bounds on IPF's parameter estimates. When IPF fails to converge on sparse network data, we introduce a principled algorithm that guarantees IPF converges under minimal changes to the network structure. Finally, we conduct experiments with synthetic and real-world data, which demonstrate the practical value of our theoretical and algorithmic contributions.  ( 2 min )
    Time-Uniform Confidence Spheres for Means of Random Vectors
    arXiv:2311.08168v2 Announce Type: replace-cross Abstract: We derive and study time-uniform confidence spheres -- confidence sphere sequences (CSSs) -- which contain the mean of random vectors with high probability simultaneously across all sample sizes. Inspired by the original work of Catoni and Giulini, we unify and extend their analysis to cover both the sequential setting and to handle a variety of distributional assumptions. Our results include an empirical-Bernstein CSS for bounded random vectors (resulting in a novel empirical-Bernstein confidence interval with asymptotic width scaling proportionally to the true unknown variance), CSSs for sub-$\psi$ random vectors (which includes sub-gamma, sub-Poisson, and sub-exponential), and CSSs for heavy-tailed random vectors (two moments only). Finally, we provide two CSSs that are robust to contamination by Huber noise. The first is a robust version of our empirical-Bernstein CSS, and the second extends recent work in the univariate setting to heavy-tailed multivariate distributions.  ( 2 min )
    Extended Deep Adaptive Input Normalization for Preprocessing Time Series Data for Neural Networks
    arXiv:2310.14720v2 Announce Type: replace-cross Abstract: Data preprocessing is a crucial part of any machine learning pipeline, and it can have a significant impact on both performance and training efficiency. This is especially evident when using deep neural networks for time series prediction and classification: real-world time series data often exhibit irregularities such as multi-modality, skewness and outliers, and the model performance can degrade rapidly if these characteristics are not adequately addressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input Normalization) layer, a novel adaptive neural layer that learns how to appropriately normalize irregular time series data for a given task in an end-to-end fashion, instead of using a fixed normalization scheme. This is achieved by optimizing its unknown parameters simultaneously with the deep neural network using back-propagation. Our experiments, conducted using synthetic data, a credit default prediction dataset, and a large-scale limit order book benchmark dataset, demonstrate the superior performance of the EDAIN layer when compared to conventional normalization methods and existing adaptive time series preprocessing layers.  ( 2 min )
    Non-asymptotic analysis of Langevin-type Monte Carlo algorithms
    arXiv:2303.12407v5 Announce Type: replace-cross Abstract: We study Langevin-type algorithms for sampling from Gibbs distributions such that the potentials are dissipative and their weak gradients have finite moduli of continuity not necessarily convergent to zero. Our main result is a non-asymptotic upper bound of the 2-Wasserstein distance between a Gibbs distribution and the law of general Langevin-type algorithms based on the Liptser--Shiryaev theory and Poincar\'{e} inequalities. We apply this bound to show that the Langevin Monte Carlo algorithm can approximate Gibbs distributions with arbitrary accuracy if the potentials are dissipative and their gradients are uniformly continuous. We also propose Langevin-type algorithms with spherical smoothing for distributions whose potentials are not convex or continuously differentiable.  ( 2 min )
    Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods
    arXiv:2305.04634v3 Announce Type: replace-cross Abstract: In spatial statistics, fast and accurate parameter estimation, coupled with a reliable means of uncertainty quantification, can be challenging when fitting a spatial process to real-world data because the likelihood function might be slow to evaluate or wholly intractable. In this work, we propose using convolutional neural networks to learn the likelihood function of a spatial process. Through a specifically designed classification task, our neural network implicitly learns the likelihood function, even in situations where the exact likelihood is not explicitly available. Once trained on the classification task, our neural network is calibrated using Platt scaling which improves the accuracy of the neural likelihood surfaces. To demonstrate our approach, we compare neural likelihood surfaces and the resulting maximum likelihood estimates and approximate confidence regions with the equivalent for exact or approximate likelihood for two different spatial processes: a Gaussian process and a Brown-Resnick process which have computationally intensive and intractable likelihoods, respectively. We conclude that our method provides fast and accurate parameter estimation with a reliable method of uncertainty quantification in situations where standard methods are either undesirably slow or inaccurate. The method is applicable to any spatial process on a grid from which fast simulations are available.  ( 3 min )
    Inference via robust optimal transportation: theory and methods
    arXiv:2301.06297v4 Announce Type: replace-cross Abstract: Optimal transportation theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are widely-applied in statistics and machine learning. In spite of their popularity, inference based on these tools has some issues. For instance, it is sensitive to outliers and it may not be even defined when the underlying model has infinite moments. To cope with these problems, first we consider a robust version of the primal transportation problem and show that it defines the {robust Wasserstein distance}, $W^{(\lambda)}$, depending on a tuning parameter $\lambda > 0$. Second, we illustrate the link between $W_1$ and $W^{(\lambda)}$ and study its key measure theoretic aspects. Third, we derive some concentration inequalities for $W^{(\lambda)}$. Fourth, we use $W^{(\lambda)}$ to define minimum distance estimators, we provide their statistical guarantees and we illustrate how to apply the derived concentration inequalities for a data driven selection of $\lambda$. Fifth, we provide the {dual} form of the robust optimal transportation problem and we apply it to machine learning problems (generative adversarial networks and domain adaptation). Numerical exercises provide evidence of the benefits yielded by our novel methods.  ( 2 min )
    Mirror Diffusion Models for Constrained and Watermarked Generation
    arXiv:2310.01236v2 Announce Type: replace Abstract: Modern successes of diffusion models in learning complex, high-dimensional data distributions are attributed, in part, to their capability to construct diffusion processes with analytic transition kernels and score functions. The tractability results in a simulation-free framework with stable regression losses, from which reversed, generative processes can be learned at scale. However, when data is confined to a constrained set as opposed to a standard Euclidean space, these desirable characteristics appear to be lost based on prior attempts. In this work, we propose Mirror Diffusion Models (MDM), a new class of diffusion models that generate data on convex constrained sets without losing any tractability. This is achieved by learning diffusion processes in a dual space constructed from a mirror map, which, crucially, is a standard Euclidean space. We derive efficient computation of mirror maps for popular constrained sets, such as simplices and $\ell_2$-balls, showing significantly improved performance of MDM over existing methods. For safety and privacy purposes, we also explore constrained sets as a new mechanism to embed invisible but quantitative information (i.e., watermarks) in generated data, for which MDM serves as a compelling approach. Our work brings new algorithmic opportunities for learning tractable diffusion on complex domains. Our code is available at https://github.com/ghliu/mdm  ( 3 min )

  • Open

    [R] A primer in Text-To-3D
    Text-To-3D along with it's image counterpart have been taking the Machine Learning field by storm, with the Metaverse on the horizon, and Nvidia's Omniverse, never before have we seen such attention to the field of generative 3D. What was the Minecraft of the 2010's is now the Metaverse of the 2020's, no longer will the children of the future be building from Cubic Voxels but they will be imagining objects into existence merely by speaking text prompts into their headsets as huge dystopian-esque server farms churn out 3D content from incredibly large neural networks and large language models. Minecraft 2.0 if you will. Feeling a little left behind? Fear not, I've written an easy to understand primer for anyone interested in understanding this field in just 5 minutes! https://ai.plainenglish.io/text-to-3d-b607bf245031 submitted by /u/SirFletch [link] [comments]
    [R] Do reviewers at a new conference see the previous submissions / reviews?
    Hi, new to ML research and publishing. As part of my lab, we recently submitted to CVPR (received scores of slightly below, borderline, slightly above, which after our rebuttal, were downgraded to three "slightly belows"). There's talk of submission to a later conference like ECCV. I'm curious if the ECCV reviewers would be able to see the fact that we have previously submitted the paper, and the reviewers comments. submitted by /u/YodelingVeterinarian [link] [comments]
    [P] How do I solve this langchain error with my RAG system in python?
    Hi all! I am trying to implement a RAG system in an LLM model, using the Langchain library. Below script is just for testing purposes, which is loosely based on following link: https://www.infoworld.com/article/3712860/retrieval-augmented-generation-step-by-step from langchain_community.document_loaders import TextLoader from langchain_community.document_loaders import TextLoader from langchain.text_splitter import CharacterTextSplitter from langchain_community.vectorstores.faiss import FAISS from langchain_core.documents import Document from langchain_community.embeddings import HuggingFaceEmbeddings from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline from langchain.agents.agent_toolkits import ( create_retriever_tool, create_conversational_retrieval_agent, ) …
    [D] Computer Vision in Warehousing and Logistics
    In this article you will know how computer vision is optimizing processes in logistics and warehousing. The warehousing industry is quickly adapting to online shopping's needs. Computer vision improves inventory management, process optimization, and quality control. It automates manual tasks and optimizes many operations. OpenCV.ai team describes the most popular Use Cases of AI Implementation in this sphere. More details are here submitted by /u/No-Independence5880 [link] [comments]
    [D] Best way to deploy transformer models
    Hi, I’m a Software Engineer building full stack applications for three years now. I’m trying to build a video understanding engine by leveraging OpenAI’s clip model to analyze frames and store them in a vector database for semantic searching. I’m phenomenal when it comes to deploying full stack applications and scaling them minus the whole ML Ops part. I’m trying to deploy CLIP on AWS and I have found a couple of options in my research and I’m completely confused the more I dive into it. The options I found were: 1. HuggingFace dedicated inference endpoints 2. Hugging face container on sage maker 3. Using clip from tensorflow and downloading weights on sage maker as well 4. Spinning up my FAST API server over EC2 with accelerated computing instances For context, I’m using transformers library from hugging face :) So my questions are: 1. How do you guys usually deploy your models 2. In terms of building a multimodal, is it possible to deploy all the models in a single SageMaker endpoint, if so how does that work! I’m sorry if this sounds silly, but do help a brother out haha Looking forward to learning more! submitted by /u/Hot-Afternoon-4831 [link] [comments]
    [D] How did Mistral release an LLM so quickly? And how did AI advance so quickly?
    Up until the release of ChatGPT in November 2022, AI was nowhere to be seen. With the release of it, suddenly every company is using AI, and dozens of companies are releasing their LLMs... How did it evolve so fast? Are they all basing their LLMs on OpenAI's LLM or? I'm a software developer, but I know nothing about AI development at the moment, so I'm curious about this EDIT: Decided to edit the question due to a lack of respect in some people's comments (The original post is still above) Yes, I heard of AI/ML before; you have to be stupid to think otherwise... If nothing else, I would've seen it on job boards :D No I didn't mean "AI was nowhere to be seen" exactly like that. What I meant is exactly what I wrote after that line... after the release of OpeanAI's ChatGPT, less than a year later, Meta released Llama, Google released Bard/Gemini, and Mistral released its LLM. The question was what changed in this period that made it all possible? But obviously, there are some snobs in this subreddit, so apologies to these intellectual individuals for not making my question crystal clear. submitted by /u/Dimness7 [link] [comments]
    disappearance of lalamu and wav2lip - any insights? [D], [P]
    hey everyone, i've noticed that lalamu and wav2lip are no longer available and seem to have been pulled from app stores. does anyone have any info or updates on what happened to these apps? i've tried looking up official sources and developer forums but came up empty. these tools were crucial for my projects, and i'm on the lookout for any alternatives or news about their status. would appreciate any insights or updates you might have! submitted by /u/Due_Brief6661 [link] [comments]
    [P] Luminal: Fast ML in Rust through graph compilation
    Hi everyone, I've been working on an ML framework in Rust for a while and I'm finally excited to share it. Luminal is a deep learning library that uses composable compilers to achieve high performance. Current ML libraries tend to be large and complex because they try to map high level operations directly on to low level handwritten kernels, and focus on eager execution. Libraries like PyTorch contain hundreds of thousands of lines of code, making it nearly impossible for a single programmer to understand it all, set aside do a large refactor. But does it need to be so complex? ML models tend to be static dataflow graphs made up of a few simple operators. This allows us to have a dirt simple core only supporting a few primitive operations, and use them to build up complex neural network…
    [D] Research Papers
    I am a 3rd year undergraduate CS student. I have been wanting to publish Research papers for quite a time now. One is already under progress. But it did take a good amount of time. I want 3-4 Research papers published in the next year or so. Can someone tell me what's the easiest and quickest way to do so? submitted by /u/Anonymous_Life17 [link] [comments]
    [R]: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
    Excited to share our work on synthetic task generation. Introducing Bonito 🐟, an open-source model that converts your raw, unannotated data into synthetic instruction tuning datasets. With it, you can easily create a specialized LLM for your proprietary and private data! Check out our work below: Paper: https://arxiv.org/abs/2402.18334 Code: https://github.com/BatsResearch/bonito Model: https://huggingface.co/BatsResearch/bonito-v1 submitted by /u/nihalnayak [link] [comments]
    [D] Change speech in a video
    For a demonstration, I want to change what a person is saying in a video. Are there any free tools websites or python libraries available, that also adapt the lip movements? submitted by /u/Electronic-Letter592 [link] [comments]
    [P] Enhance Your AI Projects with Top-notch Data Annotation Services by Damco Solutions! 🎯
    Hey fellow Redditors! Are you looking to take your AI projects to the next level? Look no further! At Damco Solutions, we specialize in providing cutting-edge Data Annotation Services tailored to suit your project needs. With our meticulous annotation processes and advanced technology, we ensure high-quality labeled data sets that empower your AI algorithms to perform with precision and accuracy. Why choose Damco Solutions for your Data Annotation needs? Accuracy Guaranteed: Our team of experts ensures meticulous annotation, guaranteeing accuracy for your AI models. Scalability: Whether you need to annotate a few hundred or millions of data points, we've got you covered with our scalable solutions. Customization: We understand that every project is unique. That's why we offer customizable annotation services to meet your specific requirements. Quality Assurance: Rigorous quality checks at every step of the annotation process ensure top-notch results. Fast Turnaround Time: We value your time. Our streamlined processes ensure quick turnaround times without compromising on quality. Industry Expertise: With years of experience across various industries, we bring unparalleled domain expertise to the table. Check out our website here for more details on our Data Annotation Services. Ready to supercharge your AI projects? Get in touch with us today and let's embark on a journey towards AI excellence together! Let's revolutionize the future with data annotation! submitted by /u/Soggy_Example6842 [link] [comments]
    [P] A non-RAG Backtracking GPT Agent with a Dynamic Set of Actions
    Hi all, I've been working on a new framework that doesn't rely on Retrieval-Augmented Generation (RAG) for finding relevant info and generating a response. Instead it leverages a unique Text Interface (TI), enabling direct GPT-4 interactions with external resources, like how we interact with GUIs. Since this method requires repeated interactions with the TI, LLM acts like an autonomous agent. Unlike AutoGPT, where actions are predetermined and do not change, the actions are provided by the TI, and change based on the state TI is in. The main limitations of this approach though are: it only works with GPT-4 and requires building text interfaces for interacting with different types of resources. Demo Video: https://www.youtube.com/watch?v=sE2JK3IB6rI The code: https://github.com/ash80/backtracking_gpt Thread on X: https://x.com/ash_at_tt/status/1763575975185403937 Your feedback, suggestion, and contributions are welcome. submitted by /u/ashz8888 [link] [comments]
    Multi agent deep q learning takes a dive at exploitation state [D]
    I am using double & dueling deep q learning. shortly after reaching epsilon 0.01, the reward starts to go downhill. I am experimenting with different hyper parameters, but would be interested in any similar experiences/ideas. My guess is that since it is a multi agent scenario, most of the exploration stage, the agents learn the best actions, given kind of random actions from the rest. once epsilon reaches 0.01, the behaviors of the rest of the agents (and thus the environment for each one) changes. This is the mean of the rewards. https://preview.redd.it/fk6fpc0a1qlc1.png?width=696&format=png&auto=webp&s=63c7a4bd1a560809cbec2099261d6abc8c2152b5 submitted by /u/ripototo [link] [comments]
    Speaker Recognition [P]
    Hello everyone, I'm working on a speaker recognition project to recognize who is speaking in a speaker dataset. The model works well in close set and I would like that now the model can also tell me when the speaker is unknown (is not part of the dataset). What methods can you recommend? submitted by /u/cestoi [link] [comments]
    [D] Comparing a Multilayer Perceptron with an Support Vector Classifier
    How can I make the implementation most comparable as the models are fundamentally different. I have to implement a classification model with a MLP and an SVM classifier and compare them. I did it as following: MLP: Find the best architecture: Amount of hidden layers and hidden neurons Perform grid search to identify the best hyper parameters Cross validation: To assess generalization performance on unseen data SVM: Perform grid search to identify best hyper pararameters Cross validation: To asses generalization performance on unseen data Then compare the performances of both models on test data an assess F1, Recall, Precision, Confusion Matrix Am I missing something important? submitted by /u/Jumpy-Wrongdoer1649 [link] [comments]
    Does L2-SVM involve always using the squared hinge loss instead of the standard hinge loss ? [D]
    I see many paper write the objective function of L2-SVM with the Standard hinge loss, others with the squared hinge. That's where my confusion arises. submitted by /u/PerfecttMachine [link] [comments]
    [R] Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation
    https://arxiv.org/abs/2402.19267 Abstract : Employing extensive datasets enables the training of multilingual machine translation models; however, these models often fail to accurately translate sentences within specialized domains. Although obtaining and translating domain-specific data incurs high costs, it is inevitable for high-quality translations. Hence, finding the most 'effective' data with an unsupervised setting becomes a practical strategy for reducing labeling costs. Recent research indicates that this effective data could be found by selecting 'properly difficult data' based on its volume. This means the data should not be excessively challenging or overly simplistic, especially if the amount of data is limited. However, we found that establishing a criterion for unsupervised data selection remains challenging, as the 'proper difficulty' might vary based on the data domain being trained on. We introduce a novel unsupervised data selection method, 'Capturing Perplexing Named Entities', which adopts the maximum inference entropy in translated named entities as a selection measure. The motivation was that named entities in domain-specific data are considered the most complex portion of the data and should be predicted with high confidence. When verified with the 'Korean-English Parallel Corpus of Specialized Domains,' our method served as a robust guidance for unsupervised data selection, in contrast to existing methods. submitted by /u/Capital_Reply_7838 [link] [comments]
    DeepMind introduces Hawk and Griffin [R]
    https://arxiv.org/abs/2402.19427 Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training. ​ ​ https://preview.redd.it/sqp561yygnlc1.png?width=1304&format=png&auto=webp&s=fc3d8bc47f0c8bcf45851f982467b1d0269e12b4 submitted by /u/we_are_mammals [link] [comments]
  • Open

    One is a real photo and one is A.I. generated. Can you tell which is which?
    submitted by /u/Armand_Roulinn [link] [comments]
    Computer Vision in Warehousing and Logistics
    In this article you will know how computer vision is optimizing processes in logistics and warehousing. The warehousing industry is quickly adapting to online shopping's needs. Computer vision improves inventory management, process optimization, and quality control. It automates manual tasks and optimizes many operations. OpenCV.ai team describes the most popular Use Cases of AI Implementation in this sphere. More details are here submitted by /u/No-Independence5880 [link] [comments]
    Best date for excluding AI-generated content from Google search results?
    I was thinking before:2023 for images and before:2022 for text, do you guys know of better/more precise cut-off points? submitted by /u/Tozemanel [link] [comments]
    Artists are slipping anti-AI ‘poison’ into their art. Here’s how it works
    submitted by /u/scientificamerican [link] [comments]
    Tim Cook Says Apple Will ‘Break New Ground’ in GenAI in 2024
    submitted by /u/brianstoolshed [link] [comments]
    Is there an AI that is not so "friendly"?
    My friends and I are trying to come up with caption for a shirt and want it to be related to IT since that's what we study in school. We asked chatgpt and bing ai for some suggestions but they are mostly lame since they are trying to be friendly and nice. We would like something that would maybe make us look like we suck at school. The closest we got was something along the lines "finished the school with language of 1s and 0s" (1 is the lowest grade we can get so it looks like we finished school getting mostly 1s). I want to know if there is an AI that can come up with some more creative captions that isn't so restricted to being friendly to the user. Something of a bit darker humor. Not fully dark but darker than what you usually get from AI. submitted by /u/ProAssassin666 [link] [comments]
    This week in AI - all the Major AI developments in a nutshell
    Mistral introduced a new model Mistral Large. It reaches top-tier reasoning capabilities, is multi-lingual by design, has native function calling capacities and has 32K tokens context window. The pre-trained model has 81.2% accuracy on MMLU. Alongside Mistral Large, Mistral Small, a model optimized for latency and cost has been released. Mistral Small outperforms Mixtral 8x7B and has lower latency. Mistral also launched a ChatGPT like new conversational assistant, le Chat Mistral [Details]. Alibaba Group introduced EMO, an expressive audio-driven portrait-video generation framework. Input a single reference image and the vocal audio, e.g. talking and singing, it can generate vocal avatar videos with expressive facial expressions, and various head poses [Details]. Ideogram introduced Ideo…
    Elon Musk sues OpenAI accusing it of putting profit before humanity | OpenAI
    submitted by /u/Cbo305 [link] [comments]
    Does Tesla Optimus need a head?
    Does Optimus or any humanoid robot for that matter actually need a human like head? Maybe just a swifling camera. Tesla is seems to be stating that they want to make the robot for practical purposes like in their factories. But the robot looks more showy and seems to follow the human form for appearances. In most situations, you don't need legs at all. I can imagine a more practical version would be a headless version with wheels. Then a legged version if the task requires it. submitted by /u/Steve77307 [link] [comments]
    Birth Of AGI
    I encountered an excerpt from a captivating book exploring the potential arrival of Artificial General Intelligence (AGI) and its impact on the world. It presented a compellingly realistic scenario of its emergence, including the AGI's goals and an ambiguous ending left open to interpretation. "Inside a laboratory filled with the hum of thousands of the latest Nvidia GPUs, a group of experts from Open AI, a leading force in AI research, stood in excited anticipation. They eagerly awaited the results of their latest training run, hoping to replicate the success of their previous project, GPT-7, and unlock further economic and global influence. Utilizing cutting-edge AI hardware and the most recent techniques, they waited amidst the constant hum. During this training run, something extraor…
    One-Minute Daily AI News 2/29/2024
    Humanoid robot startup Figure AI valued at $2.6 billion as Bezos, OpenAI, Nvidia join funding.[1] Microsoft adds new Copilot features to Windows 11 as AI hype remains high.[2] Former Twitter engineers are building Particle, an AI-powered news reader, backed by $4.4M.[3] Pika and Alibaba Debut Competing AI Lip Sync Tools – Video Generation Market Heats Up.[4] Sources: [1] https://www.cnbc.com/2024/02/29/robot-startup-figure-valued-at-2point6-billion-by-bezos-amazon-nvidia.html [2] https://finance.yahoo.com/news/microsoft-adds-new-copilot-features-to-windows-11-as-ai-hype-remains-high-180034361.html [3] https://techcrunch.com/2024/02/29/former-twitter-engineers-are-building-particle-an-ai-powered-news-reader/ [4] https://www.ccn.com/news/pika-lip-sync-alibaba-emo-new-ai-video-tools/ submitted by /u/Excellent-Target-847 [link] [comments]
    Is it possible to get recommendations for specific restaurants and events or only generalized answers?
    Basically I'm wondering if I can just list likes and dislikes and then ask for specific plans, or if I can only get generalized plans like "Go out for sushi" or "Go to the beach?" submitted by /u/YungRobbin [link] [comments]
    The downstream consequences of Google's graying loom large: a talent exodus, stale products, and an overreliance on its advertising cash cow.
    submitted by /u/thisisinsider [link] [comments]
  • Open

    Which Reward and Discount is predicted by Dreamer?
    Hi, I have a doubt regarding one implementation of Dreamer. As long as I understand, the current model state s_t (deterministic + stochastic) at timestep t should reconstruct the observation at timestep t and the reward, obtained by applying action at t-1 and observation at t-1 Now I'm looking at this implementation where each sample inside the buffer contains (obs_t, act_t, r_t, done_t) where the reward and done are associated with the next timestep (after applying act_t). Now in the sampling function, they have this _shift_sequences which shift the obs to (seq:seq_len+L) and the other to (seq-1:seq+L-1). It makes sense because the observations are in line with the reward that should be predicted. ​ The problem is that in the world model training and in particular in the representation_loss they compute the reward_loss as: ``` reward_dist = self.RewardDecoder(post_modelstate[:-1]) reward_loss = self._reward_loss(reward_dist, rewards[1:]) ``` Which seems like they are predicting the next reward. This doesn't feel right since the reward should depend also on the action, while in this case since reward_dist is computed only by the current model state s_t it doesn't include the action that the agent will do. Am I missing something or the reward loss should be: ``` reward_dist = self.RewardDecoder(post_modelstate) reward_loss = self._reward_loss(reward_dist, rewards) ``` Of course, the same doubt goes for the discount predictor submitted by /u/ZioFranco1404 [link] [comments]
    Mujoco MJX performance for high DoF models?
    Hi all, has anyone tried using Mujoco MJX for high DoF models? I modified the dm-control rodent model to be compatible with MJX and with simple batch rollout benchmarking, it seems to be orders of magnitude slower than when using CPU Mujoco with multiple processes (64 cores), largely due to diminishing returns when increasing the batch size. I am using an A100 GPU. Has anyone else tried similar benchmarks with more complex models than what Deepmind demos with (Humanoid, Ant, etc.)? submitted by /u/Czcz_ [link] [comments]
    How to avoid overfitting to a specific level when training a model like AlphaZero/MuZero to play a platformer type video games?
    I am having a bit of trouble understanding in the training process how to ensure the model remains general enough to move to a new level. Verses learning techniques that are level specific. Is it best practice to rotate the levels during the training process? When do different types of levels need different models (like a water level in SMW vs. a standard level)? ​ ​ submitted by /u/drupadoo [link] [comments]
    Demis Hassabis podcast interview (2024-02): "Scaling, Superhuman AIs, AlphaZero atop LLMs, Rogue Nations Threat" (Dwarkesh Patel)
    submitted by /u/gwern [link] [comments]
    Deep reinforcement learning hands-on. Third edition
    Hey folks, Was about to upgrade and buy the second edition of this book but then stumbled across the github page for the third edition which seems to be an active work in progress: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-3E. Does anyone know when the new version of the book is coming out? Would like to save myself from buying both if it coming out soon. ​ Thanks! submitted by /u/Any_Camel_5977 [link] [comments]
    What is RLHF?
    RLHF (Reinforcement Learning with Human Feedback) is a method to further fine tune LLMs using a combination of Reward model + PPO algorithm usually which is used for training ChatGPT as well. Check out how it works : https://youtu.be/r35WvNUwY9E?si=oEXs_WSRWE2NzMVF submitted by /u/mehul_gupta1997 [link] [comments]
    Why is it so difficult to do deep reinforcement learning for automated theorem proving in math?
    Please keep in mind that my field is mathematics. I am a newbie to ML/DL: We have AlphaGo and AlphaFold type systems that are superhuman in their respective fields. I have read that researchers have difficulty in making similar systems for mathematics research, i.e “superhuman mathematicians”. So, why is it difficult? Given the advancements of LLM’s, as well as systems like those mentioned above, when should I start worrying about some AI system just outperforming me in every conceivable way? As a newbie, it seems to me that these systems are developing quicker than anyone can imagine, but maybe this is because the only news I see of this is very hyped up. The existential dread I’m feeling is too much. submitted by /u/dhhdhkvjdhdg [link] [comments]
    Why use Q-learning instead of value iteration on the post-decision state?
    On pg 390 of Approximate Dynamic Programming, Powell argues that Q-learning uses “an artificial post-decision state…the state/action pair (S, a)”. He goes on further to say that “iteration using value functions of a post-decision state variable is equivalent to Q-learning”…just much easier because the post decision state is more compact This seems logical to me. Why do we care about the (state, action) combination if we can just work with the end result of that combination? However, in literature and algorithm implementations I’ve seen, Q learning appears to be a lot more popular. Why is this? Is there some benefit of working directly with (state, action) pairs? submitted by /u/lilganj710 [link] [comments]
    Reward-respecting subtasks for model-based reinforcement learning
    Paper: https://www.sciencedirect.com/science/article/pii/S0004370223001479 Preprint version(s): https://arxiv.org/abs/2202.03466 Abstract: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions. submitted by /u/SunsetOneSix [link] [comments]
  • Open

    Knowledge Bases for Amazon Bedrock now supports hybrid search
    At AWS re:Invent 2023, we announced the general availability of Knowledge Bases for Amazon Bedrock. With a knowledge base, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for fully managed Retrieval Augmented Generation (RAG). In a previous post, we described how Knowledge Bases for Amazon Bedrock manages the end-to-end […]  ( 12 min )
    Expedite your Genesys Cloud Amazon Lex bot design with the Amazon Lex automated chatbot designer
    The rise of artificial intelligence (AI) has created opportunities to improve the customer experience in the contact center space. Machine learning (ML) technologies continually improve and power the contact center customer experience by providing solutions for capabilities like self-service bots, live call analytics, and post-call analytics. Self-service bots integrated with your call center can help […]  ( 10 min )
  • Open

    Computer Vision in Warehousing and Logistics
    In this article you will know how computer vision is optimizing processes in logistics and warehousing. The warehousing industry is quickly adapting to online shopping's needs. Computer vision improves inventory management, process optimization, and quality control. It automates manual tasks and optimizes many operations. OpenCV.ai team describes the most popular Use Cases of AI Implementation in this sphere. More details are here https://reddit.com/link/1b43x7b/video/6569h6yq3slc1/player submitted by /u/No-Independence5880 [link] [comments]
  • Open

    Automakers Electrify Geneva International Motor Show
    The Geneva International Motor Show, one of the most important and long-standing global auto exhibitions, opened this week, with the spotlight on several China and U.S. EV makers building on NVIDIA DRIVE that are expanding their presence in Europe. BYD One of the key reveals is BYD’s Yangweng U8 plug-in hybrid large SUV, built on Read Article  ( 5 min )
    No Noobs Here: Top Pro Gamers Bolster Software Quality Assurance Testing
    For some NVIDIANs, it’s always game day. Our Santa Clara-based software quality assurance team boasts some of the world’s top gamers, whose search for bugs and errors is as strategic as their battle plans for toppling top-tier opponents in video games. Two team members of the QA team — friendly colleagues in the office but Read Article  ( 6 min )
    What Is Trustworthy AI?
    Artificial intelligence, like any transformative technology, is a work in progress — continually growing in its capabilities and its societal impact. Trustworthy AI initiatives recognize the real-world effects that AI can have on people and society, and aim to channel that power responsibly for positive change. What Is Trustworthy AI? Trustworthy AI is an approach Read Article  ( 8 min )
    Live at GTC: Hear From Industry Leaders Using AI to Drive Innovation and Agility
    Enterprise execs across broad sectors to share their AI strategies and success stories.  ( 7 min )
  • Open

    Leveraging machine learning for predictive analysis in customer satisfaction surveys
    The dynamic landscape of customer-centric businesses requires understanding and improving customer satisfaction. Traditional survey analysis rarely yields real-time actionable insights. However, machine learning (ML) predictive analysis allows organizations to use advanced algorithms to transform customer satisfaction surveys. ML predictive analysis is changing how businesses measure and improve customer satisfaction. Customer feedback analysis is where untapped… Read More »Leveraging machine learning for predictive analysis in customer satisfaction surveys The post Leveraging machine learning for predictive analysis in customer satisfaction surveys appeared first on Data Science Central.  ( 21 min )
    Mobile drivers’ licenses: A humbler take on self-sovereign identity and personal data protection
    I’ve been interested in self-sovereign identity for a number of years now, ever since I interviewed Phil Windley, a founder of the Internet Identity Workshop (IIW) and then chair of the Sovrin Foundation, in 2018.  In a self-sovereign identity (SSI) scenario, the users themselves control the sensitive information previously stored by a third party. Take… Read More »Mobile drivers’ licenses: A humbler take on self-sovereign identity and personal data protection The post Mobile drivers’ licenses: A humbler take on self-sovereign identity and personal data protection appeared first on Data Science Central.  ( 22 min )
  • Open

    Startup accelerates progress toward light-speed computing
    Lightmatter, founded by three MIT alumni, is using photonic technologies to reinvent how chips communicate and calculate.  ( 6 min )
    Dealing with the limitations of our noisy world
    Tamara Broderick uses statistical approaches to understand and quantify the uncertainty that can affect study results.  ( 7 min )
  • Open

    Physics-aware Machine Learning Revolutionizes Scientific Paradigm for Machine Learning and Process-based Hydrology
    arXiv:2310.05227v3 Announce Type: replace Abstract: Accurate hydrological understanding and water cycle prediction are crucial for addressing scientific and societal challenges associated with the management of water resources, particularly under the dynamic influence of anthropogenic climate change. Existing reviews predominantly concentrate on the development of machine learning (ML) in this field, yet there is a clear distinction between hydrology and ML as separate paradigms. Here, we introduce physics-aware ML as a transformative approach to overcome the perceived barrier and revolutionize both fields. Specifically, we present a comprehensive review of the physics-aware ML methods, building a structured community (PaML) of existing methodologies that integrate prior physical knowledge or physics-based modeling into ML. We systematically analyze these PaML methodologies with respect to four aspects: physical data-guided ML, physics-informed ML, physics-embedded ML, and physics-aware hybrid learning. PaML facilitates ML-aided hypotheses, accelerating insights from big data and fostering scientific discoveries. We first conduct a systematic review of hydrology in PaML, including rainfall-runoff hydrological processes and hydrodynamic processes, and highlight the most promising and challenging directions for different objectives and PaML methods. Finally, a new PaML-based hydrology platform, termed HydroPML, is released as a foundation for hydrological applications. HydroPML enhances the explainability and causality of ML and lays the groundwork for the digital water cycle's realization. The HydroPML platform is publicly available at https://hydropml.github.io/.  ( 3 min )
    CL-MAE: Curriculum-Learned Masked Autoencoders
    arXiv:2308.16572v3 Announce Type: replace-cross Abstract: Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that can be effectively generalized across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that possesses the capability to generate masks of different complexities, and integrate the proposed module into masked autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting its behavior during training, transitioning from a partner to the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), while passing through a neutral state. The transition between these behaviors is smooth, being regulated by a factor that is multiplied with the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders. We release our code at https://github.com/ristea/cl-mae.  ( 3 min )
    Partial Label Supervision for Agnostic Generative Noisy Label Learning
    arXiv:2308.01184v2 Announce Type: replace-cross Abstract: Noisy label learning has been tackled with both discriminative and generative approaches. Despite the simplicity and efficiency of discriminative methods, generative models offer a more principled way of disentangling clean and noisy labels and estimating the label transition matrix. However, existing generative methods often require inferring additional latent variables through costly generative modules or heuristic assumptions, which hinder adaptive optimisation for different causal directions. They also assume a uniform clean label prior, which does not reflect the sample-wise clean label distribution and uncertainty. In this paper, we propose a novel framework for generative noisy label learning that addresses these challenges. First, we propose a new single-stage optimisation that directly approximates image generation by a discriminative classifier output. This approximation significantly reduces the computation cost of image generation, preserves the generative modelling benefits, and enables our framework to be agnostic in regards to different causality scenarios (i.e., image generate label or vice-versa). Second, we introduce a new Partial Label Supervision (PLS) for noisy label learning that accounts for both clean label coverage and uncertainty. The supervision of PLS does not merely aim at minimising loss, but seeks to capture the underlying sample-wise clean label distribution and uncertainty. Extensive experiments on computer vision and natural language processing (NLP) benchmarks demonstrate that our generative modelling achieves state-of-the-art results while significantly reducing the computation cost. Our code is available at https://github.com/lfb-1/GNL.  ( 3 min )
    Splitting and Parallelizing of Quantum Convolutional Neural Networks for Learning Translationally Symmetric Data
    arXiv:2306.07331v3 Announce Type: replace-cross Abstract: The quantum convolutional neural network (QCNN) is a promising quantum machine learning (QML) model that is expected to achieve quantum advantages in classically intractable problems. However, the QCNN requires a large number of measurements for data learning, limiting its practical applications in large-scale problems. To alleviate this requirement, we propose a novel architecture called split-parallelizing QCNN (sp-QCNN), which exploits the prior knowledge of quantum data to design an efficient model. This architecture draws inspiration from geometric quantum machine learning and targets translationally symmetric quantum data commonly encountered in physics and quantum computing science. By splitting the quantum circuit based on translational symmetry, the sp-QCNN can substantially parallelize the conventional QCNN without increasing the number of qubits and improve the measurement efficiency by an order of the number of qubits. To demonstrate its effectiveness, we apply the sp-QCNN to a quantum phase recognition task and show that it can achieve comparable classification accuracy to the conventional QCNN while considerably reducing the measurement resources required. Due to its high measurement efficiency, the sp-QCNN can mitigate statistical errors in estimating the gradient of the loss function, thereby accelerating the learning process. These results open up new possibilities for incorporating the prior data knowledge into the efficient design of QML models, leading to practical quantum advantages.  ( 3 min )
    TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
    arXiv:2311.16503v2 Announce Type: replace-cross Abstract: The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as a low compression efficiency. To solve these, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block which is just related to the time-step $t$ and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain the most temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.  ( 3 min )
    Quality-Diversity Generative Sampling for Learning with Synthetic Data
    arXiv:2312.14369v2 Announce Type: replace-cross Abstract: Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, despite the data coming from a biased generator. QDGS is a model-agnostic framework that uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. Using balanced synthetic datasets generated by QDGS, we first debias classifiers trained on color-biased shape datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we prompt for desired semantic concepts, such as skin tone and age, to create an intersectional dataset with a combined blend of visual features. Leveraging this balanced data for training classifiers improves fairness while maintaining accuracy on facial recognition benchmarks. Code available at: https://github.com/Cylumn/qd-generative-sampling.  ( 2 min )
    Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI
    arXiv:2312.03102v2 Announce Type: replace-cross Abstract: In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR) refers to computational reconstruction of an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion. While promising, current SVR methods require multiple slice stacks for accurate 3D reconstruction, leading to long scans and limiting their use in time-sensitive applications such as fetal fMRI. Here, we propose a SVR method that overcomes the shortcomings of previous work and produces state-of-the-art reconstructions in the presence of extreme inter-slice motion. Inspired by the recent success of single-view depth estimation methods, we formulate SVR as a single-stack motion estimation task and train a fully convolutional network to predict a motion stack for a given slice stack, producing a 3D reconstruction as a byproduct of the predicted motion. Extensive experiments on the SVR of adult and fetal brains demonstrate that our fully convolutional method is twice as accurate as previous SVR methods. Our code is available at github.com/seannz/svr.  ( 2 min )
    Mutual Information Regularized Offline Reinforcement Learning
    arXiv:2210.07484v3 Announce Type: replace Abstract: The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding mutual information regularizations. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark,e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is available at https://github.com/sail-sg/MISA.  ( 3 min )
    DiffCPS: Diffusion Model based Constrained Policy Search for Offline Reinforcement Learning
    arXiv:2310.05333v2 Announce Type: replace Abstract: Constrained policy search (CPS) is a fundamental problem in offline reinforcement learning, which is generally solved by advantage weighted regression (AWR). However, previous methods may still encounter out-of-distribution actions due to the limited expressivity of Gaussian-based policies. On the other hand, directly applying the state-of-the-art models with distribution expression capabilities (i.e., diffusion models) in the AWR framework is intractable since AWR requires exact policy probability densities, which is intractable in diffusion models. In this paper, we propose a novel approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed DiffCPS), which tackles the diffusion-based constrained policy search with the primal-dual method. The theoretical analysis reveals that strong duality holds for diffusion-based CPS problems, and upon introducing parameter approximation, an approximated solution can be obtained after $\mathcal{O}(1/\epsilon)$ number of dual iterations, where $\epsilon$ denotes the representation ability of the parametrized policy. Extensive experimental results based on the D4RL benchmark demonstrate the efficacy of our approach. We empirically show that DiffCPS achieves better or at least competitive performance compared to traditional AWR-based baselines as well as recent diffusion-based offline RL methods. The code is now available at https://github.com/felix-thu/DiffCPS.  ( 3 min )
    Understanding Certified Training with Interval Bound Propagation
    arXiv:2306.10426v2 Announce Type: replace Abstract: As robustness verification methods are becoming more precise, training certifiably robust neural networks is becoming ever more relevant. To this end, certified training methods compute and then optimize an upper bound on the worst-case loss over a robustness specification. Curiously, training methods based on the imprecise interval bound propagation (IBP) consistently outperform those leveraging more precise bounding methods. Still, we lack an understanding of the mechanisms making IBP so successful. In this work, we thoroughly investigate these mechanisms by leveraging a novel metric measuring the tightness of IBP bounds. We first show theoretically that, for deep linear models, tightness decreases with width and depth at initialization, but improves with IBP training, given sufficient network width. We, then, derive sufficient and necessary conditions on weight matrices for IBP bounds to become exact and demonstrate that these impose strong regularization, explaining the empirically observed trade-off between robustness and accuracy in certified training. Our extensive experimental evaluation validates our theoretical predictions for ReLU networks, including that wider networks improve performance, yielding state-of-the-art results. Interestingly, we observe that while all IBP-based training methods lead to high tightness, this is neither sufficient nor necessary to achieve high certifiable robustness. This hints at the existence of new training methods that do not induce the strong regularization required for tight IBP bounds, leading to improved robustness and standard accuracy.  ( 3 min )
    NCART: Neural Classification and Regression Tree for Tabular Data
    arXiv:2307.12198v2 Announce Type: replace Abstract: Deep learning models have become popular in the analysis of tabular data, as they address the limitations of decision trees and enable valuable applications like semi-supervised learning, online learning, and transfer learning. However, these deep-learning approaches often encounter a trade-off. On one hand, they can be computationally expensive when dealing with large-scale or high-dimensional datasets. On the other hand, they may lack interpretability and may not be suitable for small-scale datasets. In this study, we propose a novel interpretable neural network called Neural Classification and Regression Tree (NCART) to overcome these challenges. NCART is a modified version of Residual Networks that replaces fully-connected layers with multiple differentiable oblivious decision trees. By integrating decision trees into the architecture, NCART maintains its interpretability while benefiting from the end-to-end capabilities of neural networks. The simplicity of the NCART architecture makes it well-suited for datasets of varying sizes and reduces computational costs compared to state-of-the-art deep learning models. Extensive numerical experiments demonstrate the superior performance of NCART compared to existing deep learning models, establishing it as a strong competitor to tree-based models.  ( 2 min )
    Finding Foundation Models for Time Series Classification with a PreText Task
    arXiv:2311.14534v2 Announce Type: replace Abstract: Over the past decade, Time Series Classification (TSC) has gained an increasing attention. While various methods were explored, deep learning - particularly through Convolutional Neural Networks (CNNs)-stands out as an effective approach. However, due to the limited availability of training data, defining a foundation model for TSC that overcomes the overfitting problem is still a challenging task. The UCR archive, encompassing a wide spectrum of datasets ranging from motion recognition to ECG-based heart disease detection, serves as a prime example for exploring this issue in diverse TSC scenarios. In this paper, we address the overfitting challenge by introducing pre-trained domain foundation models. A key aspect of our methodology is a novel pretext task that spans multiple datasets. This task is designed to identify the originating dataset of each time series sample, with the goal of creating flexible convolution filters that can be applied across different datasets. The research process consists of two phases: a pre-training phase where the model acquires general features through the pretext task, and a subsequent fine-tuning phase for specific dataset classifications. Our extensive experiments on the UCR archive demonstrate that this pre-training strategy significantly outperforms the conventional training approach without pre-training. This strategy effectively reduces overfitting in small datasets and provides an efficient route for adapting these models to new datasets, thus advancing the capabilities of deep learning in TSC.  ( 3 min )
    Training normalizing flows with computationally intensive target probability distributions
    arXiv:2308.13294v2 Announce Type: replace Abstract: Machine learning techniques, in particular the so-called normalizing flows, are becoming increasingly popular in the context of Monte Carlo simulations as they can effectively approximate target probability distributions. In the case of lattice field theories (LFT) the target distribution is given by the exponential of the action. The common loss function's gradient estimator based on the "reparametrization trick" requires the calculation of the derivative of the action with respect to the fields. This can present a significant computational cost for complicated, non-local actions like e.g. fermionic action in QCD. In this contribution, we propose an estimator for normalizing flows based on the REINFORCE algorithm that avoids this issue. We apply it to two dimensional Schwinger model with Wilson fermions at criticality and show that it is up to ten times faster in terms of the wall-clock time as well as requiring up to $30\%$ less memory than the reparameterization trick estimator. It is also more numerically stable allowing for single precision calculations and the use of half-float tensor cores. We present an in-depth analysis of the origins of those improvements. We believe that these benefits will appear also outside the realm of the LFT, in each case where the target probability distribution is computationally intensive.  ( 3 min )
    Online Search with Predictions: Pareto-optimal Algorithm and its Applications in Energy Markets
    arXiv:2211.06567v2 Announce Type: replace Abstract: This paper develops learning-augmented algorithms for energy trading in volatile electricity markets. The basic problem is to sell (or buy) $k$ units of energy for the highest revenue (lowest cost) over uncertain time-varying prices, which can framed as a classic online search problem in the literature of competitive analysis. State-of-the-art algorithms assume no knowledge about future market prices when they make trading decisions in each time slot, and aim for guaranteeing the performance for the worst-case price sequence. In practice, however, predictions about future prices become commonly available by leveraging machine learning. This paper aims to incorporate machine-learned predictions to design competitive algorithms for online search problems. An important property of our algorithms is that they achieve performances competitive with the offline algorithm in hindsight when the predictions are accurate (i.e., consistency) and also provide worst-case guarantees when the predictions are arbitrarily wrong (i.e., robustness). The proposed algorithms achieve the Pareto-optimal trade-off between consistency and robustness, where no other algorithms for online search can improve on the consistency for a given robustness. Further, we extend the basic online search problem to a more general inventory management setting that can capture storage-assisted energy trading in electricity markets. In empirical evaluations using traces from real-world applications, our learning-augmented algorithms improve the average empirical performance compared to benchmark algorithms, while also providing improved worst-case performance.  ( 3 min )
    Prediction-Powered Ranking of Large Language Models
    arXiv:2402.17826v1 Announce Type: new Abstract: Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the most popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a very common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a small set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with (the distribution of) human pairwise preferences. Our framework is computationally efficient, easy to use, and does not make any assumption about the distribution of human preferences nor about the degree of alignment between the pairwise comparisons by the humans and the strong large language model.  ( 3 min )
    Fast and Accurate Estimation of Low-Rank Matrices from Noisy Measurements via Preconditioned Non-Convex Gradient Descent
    arXiv:2305.17224v2 Announce Type: replace-cross Abstract: Non-convex gradient descent is a common approach for estimating a low-rank $n\times n$ ground truth matrix from noisy measurements, because it has per-iteration costs as low as $O(n)$ time, and is in theory capable of converging to a minimax optimal estimate. However, the practitioner is often constrained to just tens to hundreds of iterations, and the slow and/or inconsistent convergence of non-convex gradient descent can prevent a high-quality estimate from being obtained. Recently, the technique of preconditioning was shown to be highly effective at accelerating the local convergence of non-convex gradient descent when the measurements are noiseless. In this paper, we describe how preconditioning should be done for noisy measurements to accelerate local convergence to minimax optimality. For the symmetric matrix sensing problem, our proposed preconditioned method is guaranteed to locally converge to minimax error at a linear rate that is immune to ill-conditioning and/or over-parameterization. Using our proposed preconditioned method, we perform a 60 megapixel medical image denoising task, and observe significantly reduced noise levels compared to previous approaches.  ( 2 min )
    Multimodal Sleep Apnea Detection with Missing or Noisy Modalities
    arXiv:2402.17788v1 Announce Type: cross Abstract: Polysomnography (PSG) is a type of sleep study that records multimodal physiological signals and is widely used for purposes such as sleep staging and respiratory event detection. Conventional machine learning methods assume that each sleep study is associated with a fixed set of observed modalities and that all modalities are available for each sample. However, noisy and missing modalities are a common issue in real-world clinical settings. In this study, we propose a comprehensive pipeline aiming to compensate for the missing or noisy modalities when performing sleep apnea detection. Unlike other existing studies, our proposed model works with any combination of available modalities. Our experiments show that the proposed model outperforms other state-of-the-art approaches in sleep apnea detection using various subsets of available data and different levels of noise, and maintains its high performance (AUROC>0.9) even in the presence of high levels of noise or missingness. This is especially relevant in settings where the level of noise and missingness is high (such as pediatric or outside-of-clinic scenarios).  ( 2 min )
    Towards Understanding What Code Language Models Learned
    arXiv:2306.11943v2 Announce Type: replace-cross Abstract: Pre-trained language models are effective in a variety of natural language tasks, but it has been argued their capabilities fall short of fully learning meaning or understanding language. To understand the extent to which language models can learn some form of meaning, we investigate their ability to capture semantics of code beyond superficial frequency and co-occurrence. In contrast to previous research on probing models for linguistic features, we study pre-trained models in a setting that allows for objective and straightforward evaluation of a model's ability to learn semantics. In this paper, we examine whether such models capture the semantics of code, which is precisely and formally defined. Through experiments involving the manipulation of code fragments, we show that code pre-trained models of code learn a robust representation of the computational semantics of code that goes beyond superficial features of form alone  ( 2 min )
    Hierarchical Multi-Relational Graph Representation Learning for Large-Scale Prediction of Drug-Drug Interactions
    arXiv:2402.18127v1 Announce Type: new Abstract: Most existing methods for predicting drug-drug interactions (DDI) predominantly concentrate on capturing the explicit relationships among drugs, overlooking the valuable implicit correlations present between drug pairs (DPs), which leads to weak predictions. To address this issue, this paper introduces a hierarchical multi-relational graph representation learning (HMGRL) approach. Within the framework of HMGRL, we leverage a wealth of drug-related heterogeneous data sources to construct heterogeneous graphs, where nodes represent drugs and edges denote clear and various associations. The relational graph convolutional network (RGCN) is employed to capture diverse explicit relationships between drugs from these heterogeneous graphs. Additionally, a multi-view differentiable spectral clustering (MVDSC) module is developed to capture multiple valuable implicit correlations between DPs. Within the MVDSC, we utilize multiple DP features to construct graphs, where nodes represent DPs and edges denote different implicit correlations. Subsequently, multiple DP representations are generated through graph cutting, each emphasizing distinct implicit correlations. The graph-cutting strategy enables our HMGRL to identify strongly connected communities of graphs, thereby reducing the fusion of irrelevant features. By combining every representation view of a DP, we create high-level DP representations for predicting DDIs. Two genuine datasets spanning three distinct tasks are adopted to gauge the efficacy of our HMGRL. Experimental outcomes unequivocally indicate that HMGRL surpasses several leading-edge methods in performance.  ( 2 min )
    Optimal lower bounds for Quantum Learning via Information Theory
    arXiv:2301.02227v3 Announce Type: replace-cross Abstract: Although a concept class may be learnt more efficiently using quantum samples as compared with classical samples in certain scenarios, Arunachalam and de Wolf (JMLR, 2018) proved that quantum learners are asymptotically no more efficient than classical ones in the quantum PAC and Agnostic learning models. They established lower bounds on sample complexity via quantum state identification and Fourier analysis. In this paper, we derive optimal lower bounds for quantum sample complexity in both the PAC and agnostic models via an information-theoretic approach. The proofs are arguably simpler, and the same ideas can potentially be used to derive optimal bounds for other problems in quantum learning theory. We then turn to a quantum analogue of the Coupon Collector problem, a classic problem from probability theory also of importance in the study of PAC learning. Arunachalam, Belovs, Childs, Kothari, Rosmanis, and de Wolf (TQC, 2020) characterized the quantum sample complexity of this problem up to constant factors. First, we show that the information-theoretic approach mentioned above provably does not yield the optimal lower bound. As a by-product, we get a natural ensemble of pure states in arbitrarily high dimensions which are not easily (simultaneously) distinguishable, while the ensemble has close to maximal Holevo information. Second, we discover that the information-theoretic approach yields an asymptotically optimal bound for an approximation variant of the problem. Finally, we derive a sharper lower bound for the Quantum Coupon Collector problem, via the generalized Holevo-Curlander bounds on the distinguishability of an ensemble. All the aspects of the Quantum Coupon Collector problem we study rest on properties of the spectrum of the associated Gram matrix, which may be of independent interest.  ( 3 min )
    Comparative Analysis of XGBoost and Minirocket Algortihms for Human Activity Recognition
    arXiv:2402.18296v1 Announce Type: new Abstract: Human Activity Recognition (HAR) has been extensively studied, with recent emphasis on the implementation of advanced Machine Learning (ML) and Deep Learning (DL) algorithms for accurate classification. This study investigates the efficacy of two ML algorithms, eXtreme Gradient Boosting (XGBoost) and MiniRocket, in the realm of HAR using data collected from smartphone sensors. The experiments are conducted on a dataset obtained from the UCI repository, comprising accelerometer and gyroscope signals captured from 30 volunteers performing various activities while wearing a smartphone. The dataset undergoes preprocessing, including noise filtering and feature extraction, before being utilized for training and testing the classifiers. Monte Carlo cross-validation is employed to evaluate the models' robustness. The findings reveal that both XGBoost and MiniRocket attain accuracy, F1 score, and AUC values as high as 0.99 in activity classification. XGBoost exhibits a slightly superior performance compared to MiniRocket. Notably, both algorithms surpass the performance of other ML and DL algorithms reported in the literature for HAR tasks. Additionally, the study compares the computational efficiency of the two algorithms, revealing XGBoost's advantage in terms of training time. Furthermore, the performance of MiniRocket, which achieves accuracy and F1 values of 0.94, and an AUC value of 0.96 using raw data and utilizing only one channel from the sensors, highlights the potential of directly leveraging unprocessed signals. It also suggests potential advantages that could be gained by utilizing sensor fusion or channel fusion techniques. Overall, this research sheds light on the effectiveness and computational characteristics of XGBoost and MiniRocket in HAR tasks, providing insights for future studies in activity recognition using smartphone sensor data.  ( 3 min )
    Detecting algorithmic bias in medical AI-models
    arXiv:2312.02959v3 Announce Type: replace-cross Abstract: With the growing prevalence of machine learning and artificial intelligence-based medical decision support systems, it is equally important to ensure that these systems provide patient outcomes in a fair and equitable fashion. This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems. Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction, by employing the Classification and Regression Trees (CART) algorithm. We verify our methodology by conducting a series of synthetic data experiments, showcasing its ability to estimate areas of bias in controlled settings precisely. The effectiveness of the concept is further validated by experiments using electronic medical records from Grady Memorial Hospital in Atlanta, Georgia. These tests demonstrate the practical implementation of our strategy in a clinical environment, where it can function as a vital instrument for guaranteeing fairness and equity in AI-based medical decisions.  ( 2 min )
    Online Signal Estimation on the Graph Edges via Line Graph Transformation
    arXiv:2311.00656v2 Announce Type: replace-cross Abstract: The processing of signals on graph edges is challenging considering that Graph Signal Processing techniques are defined only on the graph nodes. Leveraging the Line Graph to transform a graph edge signal onto the node of its edge-to-vertex dual, we propose the Line Graph Least Mean Square (LGLMS) algorithm for online time-varying graph edge signal prediction. By setting up an $l_2$-norm optimization problem, LGLMS forms an adaptive algorithm as the graph edge analogy of the classical adaptive LMS algorithm. Additionally, the LGLMS inherits all the GSP concepts and techniques that can previously be deployed on the graph nodes, but without the need to redefine them on the graph edges. Experimenting with transportation graphs and meteorological graphs, with the signal observations having noisy and missing values, we confirmed that LGLMS is suitable for the online prediction of time-varying edge signals.  ( 2 min )
    Consistency Models for Scalable and Fast Simulation-Based Inference
    arXiv:2312.05440v2 Announce Type: replace Abstract: Simulation-based inference (SBI) is constantly in search of more expressive algorithms for accurately inferring the parameters of complex models from noisy data. We present consistency models for neural posterior estimation (CMPE), a new free-form conditional sampler for scalable, fast, and amortized SBI with generative neural networks. CMPE combines the advantages of normalizing flows and flow matching methods into a single generative architecture: It essentially distills a continuous probability flow and enables rapid few-shot inference with an unconstrained architecture that can be tailored to the structure of the estimation problem. Our empirical evaluation demonstrates that CMPE not only outperforms current state-of-the-art algorithms on three hard low-dimensional problems but also achieves competitive performance in a high-dimensional Bayesian denoising experiment and in estimating a computationally demanding multi-scale model of tumor spheroid growth.  ( 2 min )
    Robust Data Clustering with Outliers via Transformed Tensor Low-Rank Representation
    arXiv:2307.09055v2 Announce Type: replace-cross Abstract: Recently, tensor low-rank representation (TLRR) has become a popular tool for tensor data recovery and clustering, due to its empirical success and theoretical guarantees. However, existing TLRR methods consider Gaussian or gross sparse noise, inevitably leading to performance degradation when the tensor data are contaminated by outliers or sample-specific corruptions. This paper develops an outlier-robust tensor low-rank representation (OR-TLRR) method that provides outlier detection and tensor data clustering simultaneously based on the t-SVD framework. For tensor observations with arbitrary outlier corruptions, OR-TLRR has provable performance guarantee for exactly recovering the row space of clean data and detecting outliers under mild conditions. Moreover, an extension of OR-TLRR is proposed to handle the case when parts of the data are missing. Finally, extensive experimental results on synthetic and real data demonstrate the effectiveness of the proposed algorithms. We release our code at https://github.com/twugithub/2024-AISTATS-ORTLRR.  ( 2 min )
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold
    arXiv:2205.11677v4 Announce Type: replace-cross Abstract: The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation will disappear completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to the stochastic model of networks and semidefinite program research.  ( 2 min )
    Safe Collaborative Filtering
    arXiv:2306.05292v2 Announce Type: replace-cross Abstract: Excellent tail performance is crucial for modern machine learning tasks, such as algorithmic fairness, class imbalance, and risk-sensitive decision making, as it ensures the effective handling of challenging samples within a dataset. Tail performance is also a vital determinant of success for personalized recommender systems to reduce the risk of losing users with low satisfaction. This study introduces a "safe" collaborative filtering method that prioritizes recommendation quality for less-satisfied users rather than focusing on the average performance. Our approach minimizes the conditional value at risk (CVaR), which represents the average risk over the tails of users' loss. To overcome computational challenges for web-scale recommender systems, we develop a robust yet practical algorithm that extends the most scalable method, implicit alternating least squares (iALS). Empirical evaluation on real-world datasets demonstrates the excellent tail performance of our approach while maintaining competitive computational efficiency.  ( 2 min )
    Replacing Language Model for Style Transfer
    arXiv:2211.07343v2 Announce Type: replace-cross Abstract: We introduce replacing language model (RLM), a sequence-to-sequence language modeling framework for text style transfer (TST). Our method autoregressively replaces each token of the source sentence with a text span that has a similar meaning but in the target style. The new span is generated via a non-autoregressive masked language model, which can better preserve the local-contextual meaning of the replaced token. This RLM generation scheme gathers the flexibility of autoregressive models and the accuracy of non-autoregressive models, which bridges the gap between sentence-level and word-level style transfer methods. To control the generation style more precisely, we conduct a token-level style-content disentanglement on the hidden representations of RLM. Empirical results on real-world text datasets demonstrate the effectiveness of RLM compared with other TST baselines. The code is at https://github.com/Linear95/RLM.  ( 2 min )
    Graph Regularized Encoder Training for Extreme Classification
    arXiv:2402.18434v1 Announce Type: new Abstract: Deep extreme classification (XC) aims to train an encoder architecture and an accompanying classifier architecture to tag a data point with the most relevant subset of labels from a very large universe of labels. XC applications in ranking, recommendation and tagging routinely encounter tail labels for which the amount of training data is exceedingly small. Graph convolutional networks (GCN) present a convenient but computationally expensive way to leverage task metadata and enhance model accuracies in these settings. This paper formally establishes that in several use cases, the steep computational cost of GCNs is entirely avoidable by replacing GCNs with non-GCN architectures. The paper notices that in these settings, it is much more effective to use graph data to regularize encoder training than to implement a GCN. Based on these insights, an alternative paradigm RAMEN is presented to utilize graph metadata in XC settings that offers significant performance boosts with zero increase in inference computational costs. RAMEN scales to datasets with up to 1M labels and offers prediction accuracy up to 15% higher on benchmark datasets than state of the art methods, including those that use graph metadata to train GCNs. RAMEN also offers 10% higher accuracy over the best baseline on a proprietary recommendation dataset sourced from click logs of a popular search engine. Code for RAMEN will be released publicly.  ( 3 min )
    Biologically Plausible Training of Deep Neural Networks Using a Top-down Credit Assignment Network
    arXiv:2208.01416v2 Announce Type: replace-cross Abstract: Despite the widespread adoption of Backpropagation algorithm-based Deep Neural Networks, the biological infeasibility of the BP algorithm could potentially limit the evolution of new DNN models. To find a biologically plausible algorithm to replace BP, we focus on the top-down mechanism inherent in the biological brain. Although top-down connections in the biological brain play crucial roles in high-level cognitive functions, their application to neural network learning remains unclear. This study proposes a two-level training framework designed to train a bottom-up network using a Top-Down Credit Assignment Network (TDCA-network). The TDCA-network serves as a substitute for the conventional loss function and the back-propagation algorithm, widely used in neural network training. We further introduce a brain-inspired credit diffusion mechanism, significantly reducing the TDCA-network's parameter complexity, thereby greatly accelerating training without compromising the network's performance.Our experiments involving non-convex function optimization, supervised learning, and reinforcement learning reveal that a well-trained TDCA-network outperforms back-propagation across various settings. The visualization of the update trajectories in the loss landscape indicates the TDCA-network's ability to bypass local minima where BP-based trajectories typically become trapped. The TDCA-network also excels in multi-task optimization, demonstrating robust generalizability across different datasets in supervised learning and unseen task settings in reinforcement learning. Moreover, the results indicate that the TDCA-network holds promising potential to train neural networks across diverse architectures.  ( 3 min )
    Efficient local linearity regularization to overcome catastrophic overfitting
    arXiv:2401.11618v2 Announce Type: replace Abstract: Catastrophic overfitting (CO) in single-step adversarial training (AT) results in abrupt drops in the adversarial test accuracy (even down to 0%). For models trained with multi-step AT, it has been observed that the loss function behaves locally linearly with respect to the input, this is however lost in single-step AT. To address CO in single-step AT, several methods have been proposed to enforce local linearity of the loss via regularization. However, these regularization terms considerably slow down training due to Double Backpropagation. Instead, in this work, we introduce a regularization term, called ELLE, to mitigate CO effectively and efficiently in classical AT evaluations, as well as some more difficult regimes, e.g., large adversarial perturbations and long training schedules. Our regularization term can be theoretically linked to curvature of the loss function and is computationally cheaper than previous methods by avoiding Double Backpropagation. Our thorough experimental validation demonstrates that our work does not suffer from CO, even in challenging settings where previous works suffer from it. We also notice that adapting our regularization parameter during training (ELLE-A) greatly improves the performance, specially in large $\epsilon$ setups. Our implementation is available in https://github.com/LIONS-EPFL/ELLE .  ( 3 min )
    Inference for Heteroskedastic PCA with Missing Data
    arXiv:2107.12365v2 Announce Type: replace-cross Abstract: This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that has been vastly under-explored. While computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, the challenge is further compounded by the prevalent presence of missing data and heteroskedastic noise. We propose a novel approach to performing valid inference on the principal subspace under a spiked covariance model with missing data, on the basis of an estimator called HeteroPCA (Zhang et al., 2022). We develop non-asymptotic distributional guarantees for HeteroPCA, and demonstrate how these can be invoked to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Our inference procedures are fully data-driven and adaptive to heteroskedastic random noise, without requiring prior knowledge about the noise levels.  ( 2 min )
    Towards Efficient Communication and Secure Federated Recommendation System via Low-rank Training
    arXiv:2401.03748v2 Announce Type: replace Abstract: Federated Recommendation (FedRec) systems have emerged as a solution to safeguard users' data in response to growing regulatory concerns. However, one of the major challenges in these systems lies in the communication costs that arise from the need to transmit neural network models between user devices and a central server. Prior approaches to these challenges often lead to issues such as computational overheads, model specificity constraints, and compatibility issues with secure aggregation protocols. In response, we propose a novel framework, called Correlated Low-rank Structure (CoLR), which leverages the concept of adjusting lightweight trainable parameters while keeping most parameters frozen. Our approach substantially reduces communication overheads without introducing additional computational burdens. Critically, our framework remains fully compatible with secure aggregation protocols, including the robust use of Homomorphic Encryption. The approach resulted in a reduction of up to 93.75% in payload size, with only an approximate 8% decrease in recommendation performance across datasets. Code for reproducing our experiments can be found at https://github.com/NNHieu/CoLR-FedRec.  ( 3 min )
    Prompt Optimization via Adversarial In-Context Learning
    arXiv:2312.02614v2 Announce Type: replace Abstract: We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompt for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate realistic enough output to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and big-bench hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.  ( 3 min )
    Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning
    arXiv:2402.17978v1 Announce Type: new Abstract: Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.  ( 3 min )
    Time Series Analysis in Compressor-Based Machines: A Survey
    arXiv:2402.17802v1 Announce Type: new Abstract: In both industrial and residential contexts, compressor-based machines, such as refrigerators, HVAC systems, heat pumps and chillers, are essential to fulfil production and consumers' needs. The diffusion of sensors and IoT connectivity supports the development of monitoring systems able to detect and predict faults, identify behavioural shifts and forecast the operational status of machines and of their components. The focus of this paper is to survey the recent research on such tasks as Fault Detection, Fault Prediction, Forecasting and Change Point Detection applied to multivariate time series characterizing the operations of compressor-based machines. Specifically, Fault Detection detects and diagnoses faults, Fault Prediction predicts such occurrences, forecasting anticipates the future value of characteristic variables of machines and Change Point Detection identifies significant variations in the behaviour of the appliances, such as a change in the working regime. We identify and classify the approaches to the above-mentioned tasks, compare the algorithms employed, highlight the gaps in the current status of the art and discuss the most promising future research directions in the field.  ( 2 min )
    Imitation-regularized Optimal Transport on Networks: Provable Robustness and Application to Logistics Planning
    arXiv:2402.17967v1 Announce Type: new Abstract: Network systems form the foundation of modern society, playing a critical role in various applications. However, these systems are at significant risk of being adversely affected by unforeseen circumstances, such as disasters. Considering this, there is a pressing need for research to enhance the robustness of network systems. Recently, in reinforcement learning, the relationship between acquiring robustness and regularizing entropy has been identified. Additionally, imitation learning is used within this framework to reflect experts' behavior. However, there are no comprehensive studies on the use of a similar imitation framework for optimal transport on networks. Therefore, in this study, imitation-regularized optimal transport (I-OT) on networks was investigated. It encodes prior knowledge on the network by imitating a given prior distribution. The I-OT solution demonstrated robustness in terms of the cost defined on the network. Moreover, we applied the I-OT to a logistics planning problem using real data. We also examined the imitation and apriori risk information scenarios to demonstrate the usefulness and implications of the proposed method.  ( 2 min )
    FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
    arXiv:2402.17985v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, the latency of inference and the large GPU memory consumption of LLMs restrict their deployment performance. Recently, there have been some efficient attempts to quantize LLMs, yet inference with large batch size or long sequence still has the issue of being compute-bound. Fine-grained quantization methods have showcased their proficiency in achieving low-bit quantization for LLMs, while requiring FP16 data type for linear layer computations, which is time-consuming when dealing with large batch size or long sequence. In this paper, we introduce a method called FlattenQuant, which significantly reduces the maximum value of the tensor by flattening the large channels in the tensor, to achieve low bit per-tensor quantization with minimal accuracy loss. Our experiments show that FlattenQuant can directly use 4 bits to achieve 48.29% of the linear layer calculation in LLMs, with the remaining layers using 8 bits. The 4-bit matrix multiplication introduced in the FlattenQuant method can effectively address the compute-bound caused by large matrix calculation. Our work achieves up to 2$\times$ speedup and 2.3$\times$ memory reduction for LLMs with negligible loss in accuracy.  ( 2 min )
    Conformer: Embedding Continuous Attention in Vision Transformer for Weather Forecasting
    arXiv:2402.17966v1 Announce Type: new Abstract: Operational weather forecasting system relies on computationally expensive physics-based models. Although Transformers-based models have shown remarkable potential in weather forecasting, Transformers are discrete models which limit their ability to learn the continuous spatio-temporal features of the dynamical weather system. We address this issue with Conformer, a spatio-temporal Continuous Vision Transformer for weather forecasting. Conformer is designed to learn the continuous weather evolution over time by implementing continuity in the multi-head attention mechanism. The attention mechanism is encoded as a differentiable function in the transformer architecture to model the complex weather dynamics. We evaluate Conformer against a state-of-the-art Numerical Weather Prediction (NWP) model and several deep learning based weather forecasting models. Conformer outperforms some of the existing data-driven models at all lead times while only being trained at lower resolution data.  ( 2 min )
    Representation learning in multiplex graphs: Where and how to fuse information?
    arXiv:2402.17906v1 Announce Type: new Abstract: In recent years, unsupervised and self-supervised graph representation learning has gained popularity in the research community. However, most proposed methods are focused on homogeneous networks, whereas real-world graphs often contain multiple node and edge types. Multiplex graphs, a special type of heterogeneous graphs, possess richer information, provide better modeling capabilities and integrate more detailed data from potentially different sources. The diverse edge types in multiplex graphs provide more context and insights into the underlying processes of representation learning. In this paper, we tackle the problem of learning representations for nodes in multiplex networks in an unsupervised or self-supervised manner. To that end, we explore diverse information fusion schemes performed at different levels of the graph processing pipeline. The detailed analysis and experimental evaluation of various scenarios inspired us to propose improvements in how to construct GNN architectures that deal with multiplex graphs.  ( 2 min )
    Automated Statistical Model Discovery with Language Models
    arXiv:2402.17879v1 Announce Type: new Abstract: Statistical model discovery involves a challenging search over a vast space of models subject to domain-specific modeling constraints. Efficiently searching over this space requires human expertise in modeling and the problem domain. Motivated by the domain knowledge and programming capabilities of large language models (LMs), we introduce a method for language model driven automated statistical model discovery. We cast our automated procedure within the framework of Box's Loop: the LM iterates between proposing statistical models represented as probabilistic programs, acting as a modeler, and critiquing those models, acting as a domain expert. By leveraging LMs, we do not have to define a domain-specific language of models or design a handcrafted search procedure, key restrictions of previous systems. We evaluate our method in three common settings in probabilistic modeling: searching within a restricted space of models, searching over an open-ended space, and improving classic models under natural language constraints (e.g., this model should be interpretable to an ecologist). Our method matches the performance of previous systems, identifies models on par with human expert designed models, and extends classic models in interpretable ways. Our results highlight the promise of LM driven model discovery.  ( 2 min )
    Using Graph Neural Networks to Predict Local Culture
    arXiv:2402.17905v1 Announce Type: new Abstract: Urban research has long recognized that neighbourhoods are dynamic and relational. However, lack of data, methodologies, and computer processing power have hampered a formal quantitative examination of neighbourhood relational dynamics. To make progress on this issue, this study proposes a graph neural network (GNN) approach that permits combining and evaluating multiple sources of information about internal characteristics of neighbourhoods, their past characteristics, and flows of groups among them, potentially providing greater expressive power in predictive models. By exploring a public large-scale dataset from Yelp, we show the potential of our approach for considering structural connectedness in predicting neighbourhood attributes, specifically to predict local culture. Results are promising from a substantive and methodologically point of view. Substantively, we find that either local area information (e.g. area demographics) or group profiles (tastes of Yelp reviewers) give the best results in predicting local culture, and they are nearly equivalent in all studied cases. Methodologically, exploring group profiles could be a helpful alternative where finding local information for specific areas is challenging, since they can be extracted automatically from many forms of online data. Thus, our approach could empower researchers and policy-makers to use a range of data sources when other local area information is lacking.  ( 2 min )
    SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization
    arXiv:2402.17902v1 Announce Type: new Abstract: Neural network pruning is a key technique towards engineering large yet scalable, interpretable, and generalizable models. Prior work on the subject has developed largely along two orthogonal directions: (1) differentiable pruning for efficiently and accurately scoring the importance of parameters, and (2) combinatorial optimization for efficiently searching over the space of sparse models. We unite the two approaches, both theoretically and empirically, to produce a coherent framework for structured neural network pruning in which differentiable pruning guides combinatorial optimization algorithms to select the most important sparse set of parameters. Theoretically, we show how many existing differentiable pruning techniques can be understood as nonconvex regularization for group sparse optimization, and prove that for a wide class of nonconvex regularizers, the global optimum is unique, group-sparse, and provably yields an approximate solution to a sparse convex optimization problem. The resulting algorithm that we propose, SequentialAttention++, advances the state of the art in large-scale neural network block-wise pruning tasks on the ImageNet and Criteo datasets.  ( 2 min )
    Collaborative learning of common latent representations in routinely collected multivariate ICU physiological signals
    arXiv:2402.17917v1 Announce Type: new Abstract: In Intensive Care Units (ICU), the abundance of multivariate time series presents an opportunity for machine learning (ML) to enhance patient phenotyping. In contrast to previous research focused on electronic health records (EHR), here we propose an ML approach for phenotyping using routinely collected physiological time series data. Our new algorithm integrates Long Short-Term Memory (LSTM) networks with collaborative filtering concepts to identify common physiological states across patients. Tested on real-world ICU clinical data for intracranial hypertension (IH) detection in patients with brain injury, our method achieved an area under the curve (AUC) of 0.889 and average precision (AP) of 0.725. Moreover, our algorithm outperforms autoencoders in learning more structured latent representations of the physiological signals. These findings highlight the promise of our methodology for patient phenotyping, leveraging routinely collected multivariate time series to improve clinical care practices.  ( 2 min )
    Latent Neural PDE Solver: a reduced-order modelling framework for partial differential equations
    arXiv:2402.17853v1 Announce Type: new Abstract: Neural networks have shown promising potential in accelerating the numerical simulation of systems governed by partial differential equations (PDEs). Different from many existing neural network surrogates operating on high-dimensional discretized fields, we propose to learn the dynamics of the system in the latent space with much coarser discretizations. In our proposed framework - Latent Neural PDE Solver (LNS), a non-linear autoencoder is first trained to project the full-order representation of the system onto the mesh-reduced space, then a temporal model is trained to predict the future state in this mesh-reduced space. This reduction process simplifies the training of the temporal model by greatly reducing the computational cost accompanying a fine discretization. We study the capability of the proposed framework and several other popular neural PDE solvers on various types of systems including single-phase and multi-phase flows along with varying system parameters. We showcase that it has competitive accuracy and efficiency compared to the neural PDE solver that operates on full-order space.  ( 2 min )
    DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
    arXiv:2402.17812v1 Announce Type: new Abstract: Training deep neural networks typically involves substantial computational costs during both forward and backward propagation. The conventional layer dropping techniques drop certain layers during training for reducing the computations burden. However, dropping layers during forward propagation adversely affects the training process by degrading accuracy. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs while maintaining accuracy. DropBP randomly drops layers during the backward propagation, which does not deviate forward propagation. Moreover, DropBP calculates the sensitivity of each layer to assign appropriate drop rate, thereby stabilizing the training process. DropBP is designed to enhance the efficiency of the training process with backpropagation, thereby enabling the acceleration of both full fine-tuning and parameter-efficient fine-tuning using backpropagation. Specifically, utilizing DropBP in QLoRA reduces training time by 44%, increases the convergence speed to the identical loss level by 1.5$\times$, and enables training with a 6.2$\times$ larger sequence length on a single NVIDIA-A100 80GiB GPU in LLaMA2-70B. The code is available at https://github.com/WooSunghyeon/dropbp.  ( 2 min )
    From Inverse Optimization to Feasibility to ERM
    arXiv:2402.17890v1 Announce Type: new Abstract: Inverse optimization involves inferring unknown parameters of an optimization problem from known solutions, and is widely used in fields such as transportation, power systems and healthcare. We study the contextual inverse optimization setting that utilizes additional contextual information to better predict the unknown problem parameters. We focus on contextual inverse linear programming (CILP), addressing the challenges posed by the non-differentiable nature of LPs. For a linear prediction model, we reduce CILP to a convex feasibility problem allowing the use of standard algorithms such as alternating projections. The resulting algorithm for CILP is equipped with a linear convergence guarantee without additional assumptions such as degeneracy or interpolation. Next, we reduce CILP to empirical risk minimization (ERM) on a smooth, convex loss that satisfies the Polyak-Lojasiewicz condition. This reduction enables the use of scalable first-order optimization methods to solve large non-convex problems, while maintaining theoretical guarantees in the convex setting. Finally, we experimentally validate our approach on both synthetic and real-world problems, and demonstrate improved performance compared to existing methods.  ( 2 min )
    Independent Learning in Constrained Markov Potential Games
    arXiv:2402.17885v1 Announce Type: new Abstract: Constrained Markov games offer a formal mathematical framework for modeling multi-agent reinforcement learning problems where the behavior of the agents is subject to constraints. In this work, we focus on the recently introduced class of constrained Markov Potential Games. While centralized algorithms have been proposed for solving such constrained games, the design of converging independent learning algorithms tailored for the constrained setting remains an open question. We propose an independent policy gradient algorithm for learning approximate constrained Nash equilibria: Each agent observes their own actions and rewards, along with a shared state. Inspired by the optimization literature, our algorithm performs proximal-point-like updates augmented with a regularized constraint set. Each proximal step is solved inexactly using a stochastic switching gradient algorithm. Notably, our algorithm can be implemented independently without a centralized coordination mechanism requiring turn-based agent updates. Under some technical constraint qualification conditions, we establish convergence guarantees towards constrained approximate Nash equilibria. We perform simulations to illustrate our results.  ( 2 min )
    ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection
    arXiv:2402.17888v1 Announce Type: new Abstract: Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively.  ( 2 min )
    Material Microstructure Design Using VAE-Regression with Multimodal Prior
    arXiv:2402.17806v1 Announce Type: new Abstract: We propose a variational autoencoder (VAE)-based model for building forward and inverse structure-property linkages, a problem of paramount importance in computational materials science. Our model systematically combines VAE with regression, linking the two models through a two-level prior conditioned on the regression variables. The regression loss is optimized jointly with the reconstruction loss of the variational autoencoder, learning microstructure features relevant for property prediction and reconstruction. The resultant model can be used for both forward and inverse prediction i.e., for predicting the properties of a given microstructure as well as for predicting the microstructure required to obtain given properties. Since the inverse problem is ill-posed (one-to-many), we derive the objective function using a multi-modal Gaussian mixture prior enabling the model to infer multiple microstructures for a target set of properties. We show that for forward prediction, our model is as accurate as state-of-the-art forward-only models. Additionally, our method enables direct inverse inference. We show that the microstructures inferred using our model achieve desired properties reasonably accurately, avoiding the need for expensive optimization loops.  ( 2 min )
    Graph Neural Networks and Arithmetic Circuits
    arXiv:2402.17805v1 Announce Type: new Abstract: We characterize the computational power of neural networks that follow the graph neural network (GNN) architecture, not restricted to aggregate-combine GNNs or other particular types. We establish an exact correspondence between the expressivity of GNNs using diverse activation functions and arithmetic circuits over real numbers. In our results the activation function of the network becomes a gate type in the circuit. Our result holds for families of constant depth circuits and networks, both uniformly and non-uniformly, for all common activation functions.  ( 2 min )
    Predicting machine failures from multivariate time series: an industrial case study
    arXiv:2402.17804v1 Announce Type: new Abstract: Non-neural Machine Learning (ML) and Deep Learning (DL) models are often used to predict system failures in the context of industrial maintenance. However, only a few researches jointly assess the effect of varying the amount of past data used to make a prediction and the extension in the future of the forecast. This study evaluates the impact of the size of the reading window and of the prediction window on the performances of models trained to forecast failures in three data sets concerning the operation of (1) an industrial wrapping machine working in discrete sessions, (2) an industrial blood refrigerator working continuously, and (3) a nitrogen generator working continuously. The problem is formulated as a binary classification task that assigns the positive label to the prediction window based on the probability of a failure to occur in such an interval. Six algorithms (logistic regression, random forest, support vector machine, LSTM, ConvLSTM, and Transformers) are compared using multivariate telemetry time series. The results indicate that, in the considered scenarios, the dimension of the prediction windows plays a crucial role and highlight the effectiveness of DL approaches at classifying data with diverse time-dependent patterns preceding a failure and the effectiveness of ML approaches at classifying similar and repetitive patterns preceding a failure.  ( 2 min )
    Learning in Deep Factor Graphs with Gaussian Belief Propagation
    arXiv:2311.14649v2 Announce Type: replace Abstract: We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, latents) as random variables in a graphical model, and view both training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently local, presenting exciting opportunities for distributed and asynchronous training. Our approach can be scaled to deep networks and provides a natural means to do continual learning: use the BP-estimated parameter marginals of the current task as parameter priors for the next. On a video denoising task we demonstrate the benefit of learnable parameters over a classical factor graph approach and we show encouraging performance of deep factor graphs for continual image classification.  ( 2 min )
    FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing
    arXiv:2310.17491v2 Announce Type: replace Abstract: The emergence of foundation models, including language and vision models, has reshaped AI's landscape, offering capabilities across various applications. Deploying and fine-tuning these large models, like GPT-3 and BERT, presents challenges, especially in the current foundation model era. We introduce Emulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning (PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we expand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses adapters, emulators, and PEFT for federated model tuning, enhancing model privacy and memory efficiency. Adapters adjust pre-trained models, while emulators give a compact representation of original models, addressing both privacy and efficiency. Adaptable to various neural networks, our approach also uses deep reinforcement learning for hyper-parameter optimization. We tested FedPEAT in a unique scenario with a server participating in collaborative federated tuning, showcasing its potential in tackling foundation model challenges.  ( 2 min )
    Looping in the Human Collaborative and Explainable Bayesian Optimization
    arXiv:2310.17273v4 Announce Type: replace Abstract: Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods. Code is available https://github.com/ma921/CoExBO.  ( 3 min )
    On the Computational Entanglement of Distant Features in Adversarial Machine Learning
    arXiv:2309.15669v4 Announce Type: replace Abstract: Adversarial examples in machine learning has emerged as a focal point of research due to their remarkable ability to deceive models with seemingly inconspicuous input perturbations, potentially resulting in severe consequences. In this study, we undertake a thorough investigation into the emergence of adversarial examples, a phenomenon that can, in principle, manifest in a wide range of machine learning models. Through our research, we unveil a new notion termed computational entanglement, with its ability to entangle distant features, display perfect correlations or anti-correlations regardless to their spatial separation, significantly contributes to the emergence of adversarial examples. We illustrate how computational entanglement aligns with relativistic effects such as time dilation and length contraction to feature pair, ultimately resulting in the convergence of their angle differences and distances towards zero, signifying perfect correlation, or towards maximum, indicating perfect anti-correlation.  ( 2 min )
    On the different regimes of Stochastic Gradient Descent
    arXiv:2309.10688v4 Announce Type: replace Abstract: Modern deep networks are trained with stochastic gradient descent (SGD) whose key hyperparameters are the number of data considered at each step or batch size $B$, and the step size or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the ''temperature'' $T\equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B\geq B^*$, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes (i) and (ii) scale with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.  ( 3 min )
    CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending
    arXiv:2309.08646v3 Announce Type: replace Abstract: Self-attention and position embedding are two key modules in transformer-based Large Language Models (LLMs). However, the potential relationship between them is far from well studied, especially for long context window extending. In fact, anomalous behaviors harming long context extrapolation exist between Rotary Position Embedding (RoPE) and vanilla self-attention unveiled by our work. To address this issue, we propose a novel attention mechanism, CoCA (Collinear Constrained Attention). Specifically, we enforce a collinear constraint between $Q$ and $K$ to seamlessly integrate RoPE and self-attention. While only adding minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based models. Extensive experiments show that CoCA performs extraordinarily well in extending context windows. A CoCA-based GPT model, trained with a context length of 512, can seamlessly extend the context window up to 32K (60$\times$), without any fine-tuning. Additionally, by dropping CoCA in LLaMA-7B, we achieve extrapolation up to 32K within only 2K training length. Our code is publicly available at: https://github.com/codefuse-ai/Collinear-Constrained-Attention  ( 3 min )
    Advancing Counterfactual Inference through Nonlinear Quantile Regression
    arXiv:2306.05751v3 Announce Type: replace Abstract: The capacity to address counterfactual "what if" inquiries is crucial for understanding and making use of causal influences. Traditional counterfactual inference, under Pearls' counterfactual framework, typically depends on having access to or estimating a structural causal model. Yet, in practice, this causal model is often unknown and might be challenging to identify. Hence, this paper aims to perform reliable counterfactual inference based solely on observational data and the (learned) qualitative causal structure, without necessitating a predefined causal model or even direct estimations of conditional distributions. To this end, we establish a novel connection between counterfactual inference and quantile regression and show that counterfactual inference can be reframed as an extended quantile regression problem. Building on this insight, we propose a practical framework for efficient and effective counterfactual inference implemented with neural networks under a bi-level optimization scheme. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data, thereby providing an upper bound on the generalization error. Furthermore, empirical evidence demonstrates its superior statistical efficiency in comparison to existing methods. Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.  ( 2 min )
    Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning
    arXiv:2305.19523v4 Announce Type: replace Abstract: Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of explanations as features: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an LLM-to-LM interpreter to translate these explanations into informative features for downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including Cora, PubMed, ogbn-arxiv, as well as our newly introduced dataset, tape-arxiv23. Furthermore, our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv. Lastly, we believe the versatility of the proposed method extends beyond TAGs and holds the potential to enhance other tasks involving graph-text data. Our codes and datasets are available at: https://github.com/XiaoxinHe/TAPE.  ( 3 min )
    Improving Convergence and Generalization Using Parameter Symmetries
    arXiv:2305.13404v2 Announce Type: replace Abstract: In many neural networks, different values of the parameters may result in the same loss value. Parameter space symmetries are loss-invariant transformations that change the model parameters. Teleportation applies such transformations to accelerate optimization. However, the exact mechanism behind this algorithm's success is not well understood. In this paper, we show that teleportation not only speeds up optimization in the short-term, but gives overall faster time to convergence. Additionally, teleporting to minima with different curvatures improves generalization, which suggests a connection between the curvature of the minimum and generalization ability. Finally, we show that integrating teleportation into a wide range of optimization algorithms and optimization-based meta-learning improves convergence. Our results showcase the versatility of teleportation and demonstrate the potential of incorporating symmetry in optimization.  ( 2 min )
    A Meta-learning Framework for Tuning Parameters of Protection Mechanisms in Trustworthy Federated Learning
    arXiv:2305.18400v3 Announce Type: replace Abstract: Trustworthy Federated Learning (TFL) typically leverages protection mechanisms to guarantee privacy. However, protection mechanisms inevitably introduce utility loss or efficiency reduction while protecting data privacy. Therefore, protection mechanisms and their parameters should be carefully chosen to strike an optimal tradeoff between \textit{privacy leakage}, \textit{utility loss}, and \textit{efficiency reduction}. To this end, federated learning practitioners need tools to measure the three factors and optimize the tradeoff between them to choose the protection mechanism that is most appropriate to the application at hand. Motivated by this requirement, we propose a framework that (1) formulates TFL as a problem of finding a protection mechanism to optimize the tradeoff between privacy leakage, utility loss, and efficiency reduction and (2) formally defines bounded measurements of the three factors. We then propose a meta-learning algorithm to approximate this optimization problem and find optimal protection parameters for representative protection mechanisms, including Randomization, Homomorphic Encryption, Secret Sharing, and Compression. We further design estimation algorithms to quantify these found optimal protection parameters in a practical horizontal federated learning setting and provide a theoretical analysis of the estimation error.  ( 3 min )
    A Game-theoretic Framework for Privacy-preserving Federated Learning
    arXiv:2304.05836v3 Announce Type: replace Abstract: In federated learning, benign participants aim to optimize a global model collaboratively. However, the risk of \textit{privacy leakage} cannot be ignored in the presence of \textit{semi-honest} adversaries. Existing research has focused either on designing protection mechanisms or on inventing attacking mechanisms. While the battle between defenders and attackers seems never-ending, we are concerned with one critical question: is it possible to prevent potential attacks in advance? To address this, we propose the first game-theoretic framework that considers both FL defenders and attackers in terms of their respective payoffs, which include computational costs, FL model utilities, and privacy leakage risks. We name this game the federated learning privacy game (FLPG), in which neither defenders nor attackers are aware of all participants' payoffs. To handle the \textit{incomplete information} inherent in this situation, we propose associating the FLPG with an \textit{oracle} that has two primary responsibilities. First, the oracle provides lower and upper bounds of the payoffs for the players. Second, the oracle acts as a correlation device, privately providing suggested actions to each player. With this novel framework, we analyze the optimal strategies of defenders and attackers. Furthermore, we derive and demonstrate conditions under which the attacker, as a rational decision-maker, should always follow the oracle's suggestion \textit{not to attack}.  ( 3 min )
    A Comprehensive Survey on Deep Graph Representation Learning
    arXiv:2304.05055v3 Announce Type: replace Abstract: Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields, including machine learning and data mining. Classic graph embedding methods follow the basic idea that the embedding vectors of interconnected nodes in the graph can still maintain a relatively close distance, thereby preserving the structural information between the nodes in the graph. However, this is sub-optimal due to: (i) traditional methods have limited model capacity which limits the learning performance; (ii) existing techniques typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii) representation learning and downstream tasks are dependent on each other which should be jointly enhanced. With the remarkable success of deep learning, deep graph representation learning has shown great potential and advantages over shallow (traditional) methods, there exist a large number of deep graph representation learning techniques have been proposed in the past decade, especially graph neural networks. In this survey, we conduct a comprehensive survey on current deep graph representation learning algorithms by proposing a new taxonomy of existing state-of-the-art literature. Specifically, we systematically summarize the essential components of graph representation learning and categorize existing approaches by the ways of graph neural network architectures and the most recent advanced learning paradigms. Moreover, this survey also provides the practical and promising applications of deep graph representation learning. Last but not least, we state new perspectives and suggest challenging directions which deserve further investigations in the future.  ( 3 min )
    Streaming data recovery via Bayesian tensor train decomposition
    arXiv:2302.12148v2 Announce Type: replace Abstract: In this paper, we study a Bayesian tensor train (TT) decomposition method to recover streaming data by approximating the latent structure in high-order streaming data. Drawing on the streaming variational Bayes method, we introduce the TT format into Bayesian tensor decomposition methods for streaming data, and formulate posteriors of TT cores. Thanks to the Bayesian framework of the TT format, the proposed algorithm (SPTT) excels in recovering streaming data with high-order, incomplete, and noisy properties. The experiments in synthetic and real-world datasets show the accuracy of our method compared to state-of-the-art Bayesian tensor decomposition methods for streaming data.  ( 2 min )
    Differentially Private Neural Tangent Kernels for Privacy-Preserving Data Generation
    arXiv:2303.01687v2 Announce Type: replace Abstract: Maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation: when used with finite-dimensional features it allows us to summarize and privatize the data distribution once, which we can repeatedly use during generator training without further privacy loss. An important question in this framework is, then, what features are useful to distinguish between real and synthetic data distributions, and whether those enable us to generate quality synthetic data. This work considers the using the features of $\textit{neural tangent kernels (NTKs)}$, more precisely $\textit{empirical}$ NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of the untrained e-NTK features is comparable to that of the features taken from pre-trained perceptual features using public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods, without relying on any public data, as demonstrated on several tabular and image benchmark datasets.  ( 2 min )
    Numerical Stability of DeepGOPlus Inference
    arXiv:2212.06361v4 Announce Type: replace Abstract: Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures available and achieve state-of-the-art performance for many problems. Originally applied to computer vision tasks, CNNs work well with any data with a spatial relationship, besides images, and have been applied to different fields. However, recent works have highlighted numerical stability challenges in DNNs, which also relates to their known sensitivity to noise injection. These challenges can jeopardise their performance and reliability. This paper investigates DeepGOPlus, a CNN that predicts protein function. DeepGOPlus has achieved state-of-the-art performance and can successfully take advantage and annotate the abounding protein sequences emerging in proteomics. We determine the numerical stability of the model's inference stage by quantifying the numerical uncertainty resulting from perturbations of the underlying floating-point data. In addition, we explore the opportunity to use reduced-precision floating point formats for DeepGOPlus inference, to reduce memory consumption and latency. This is achieved by instrumenting DeepGOPlus' execution using Monte Carlo Arithmetic, a technique that experimentally quantifies floating point operation errors and VPREC, a tool that emulates results with customizable floating point precision formats. Focus is placed on the inference stage as it is the primary deliverable of the DeepGOPlus model, widely applicable across different environments. All in all, our results show that although the DeepGOPlus CNN is very stable numerically, it can only be selectively implemented with lower-precision floating-point formats. We conclude that predictions obtained from the pre-trained DeepGOPlus model are very reliable numerically, and use existing floating-point formats efficiently.  ( 3 min )
    Online Learning Models for Vehicle Usage Prediction During COVID-19
    arXiv:2210.16002v2 Announce Type: replace Abstract: Today, there is an ongoing transition to more sustainable transportation, for which an essential part is the switch from combustion engine vehicles to battery electric vehicles (BEVs). BEVs have many advantages from a sustainability perspective, but issues such as limited driving range and long recharge times slow down the transition from combustion engines. One way to mitigate these issues is by performing battery thermal preconditioning, which increases the energy efficiency of the battery. However, to optimally perform battery thermal preconditioning, the vehicle usage pattern needs to be known, i.e., how and when the vehicle will be used. This study attempts to predict the departure time and distance of the first drive each day using online machine learning models. The online machine learning models are trained and evaluated on historical driving data collected from a fleet of BEVs during the COVID-19 pandemic. Additionally, the prediction models are extended to quantify the uncertainty of their predictions, which can be used to decide whether the prediction should be used or dismissed. Based on our results, the best-performing prediction models yield an aggregated mean absolute error of 2.75 hours when predicting departure time and 13.37 km when predicting trip distance.  ( 3 min )
    The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective
    arXiv:2210.05021v3 Announce Type: replace Abstract: Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between over-parameterized and under-parameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.  ( 3 min )
    On the Robustness of Bayesian Neural Networks to Adversarial Attacks
    arXiv:2207.06154v3 Announce Type: replace Abstract: Vulnerability to adversarial attacks is one of the principal hurdles to the adoption of deep learning in safety-critical applications. Despite significant efforts, both practical and theoretical, training deep learning models robust to adversarial attacks is still an open problem. In this paper, we analyse the geometry of adversarial attacks in the large-data, overparameterized limit for Bayesian Neural Networks (BNNs). We show that, in the limit, vulnerability to gradient-based attacks arises as a result of degeneracy in the data distribution, i.e., when the data lies on a lower-dimensional submanifold of the ambient space. As a direct consequence, we demonstrate that in this limit BNN posteriors are robust to gradient-based adversarial attacks. Crucially, we prove that the expected gradient of the loss with respect to the BNN posterior distribution is vanishing, even when each neural network sampled from the posterior is vulnerable to gradient-based attacks. Experimental results on the MNIST, Fashion MNIST, and half moons datasets, representing the finite data regime, with BNNs trained with Hamiltonian Monte Carlo and Variational Inference, support this line of arguments, showing that BNNs can display both high accuracy on clean data and robustness to both gradient-based and gradient-free based adversarial attacks.  ( 3 min )
    Human-Centric Aware UAV Trajectory Planning in Search and Rescue Missions Employing Multi-Objective Reinforcement Learning with AHP and Similarity-Based Experience Replay
    arXiv:2402.18487v1 Announce Type: cross Abstract: The integration of Unmanned Aerial Vehicles (UAVs) into Search and Rescue (SAR) missions presents a promising avenue for enhancing operational efficiency and effectiveness. However, the success of these missions is not solely dependent on the technical capabilities of the drones but also on their acceptance and interaction with humans on the ground. This paper explores the effect of human-centric factor in UAV trajectory planning for SAR missions. We introduce a novel approach based on the reinforcement learning augmented with Analytic Hierarchy Process and novel similarity-based experience replay to optimize UAV trajectories, balancing operational objectives with human comfort and safety considerations. Additionally, through a comprehensive survey, we investigate the impact of gender cues and anthropomorphism in UAV design on public acceptance and trust, revealing significant implications for drone interaction strategies in SAR. Our contributions include (1) a reinforcement learning framework for UAV trajectory planning that dynamically integrates multi-objective considerations, (2) an analysis of human perceptions towards gendered and anthropomorphized drones in SAR contexts, and (3) the application of similarity-based experience replay for enhanced learning efficiency in complex SAR scenarios. The findings offer valuable insights into designing UAV systems that are not only technically proficient but also aligned with human-centric values.  ( 3 min )
    Defect Detection in Tire X-Ray Images: Conventional Methods Meet Deep Structures
    arXiv:2402.18527v1 Announce Type: cross Abstract: This paper introduces a robust approach for automated defect detection in tire X-ray images by harnessing traditional feature extraction methods such as Local Binary Pattern (LBP) and Gray Level Co-Occurrence Matrix (GLCM) features, as well as Fourier and Wavelet-based features, complemented by advanced machine learning techniques. Recognizing the challenges inherent in the complex patterns and textures of tire X-ray images, the study emphasizes the significance of feature engineering to enhance the performance of defect detection systems. By meticulously integrating combinations of these features with a Random Forest (RF) classifier and comparing them against advanced models like YOLOv8, the research not only benchmarks the performance of traditional features in defect detection but also explores the synergy between classical and modern approaches. The experimental results demonstrate that these traditional features, when fine-tuned and combined with machine learning models, can significantly improve the accuracy and reliability of tire defect detection, aiming to set a new standard in automated quality assurance in tire manufacturing.  ( 2 min )
    A non-intrusive machine learning framework for debiasing long-time coarse resolution climate simulations and quantifying rare events statistics
    arXiv:2402.18484v1 Announce Type: cross Abstract: Due to the rapidly changing climate, the frequency and severity of extreme weather is expected to increase over the coming decades. As fully-resolved climate simulations remain computationally intractable, policy makers must rely on coarse-models to quantify risk for extremes. However, coarse models suffer from inherent bias due to the ignored "sub-grid" scales. We propose a framework to non-intrusively debias coarse-resolution climate predictions using neural-network (NN) correction operators. Previous efforts have attempted to train such operators using loss functions that match statistics. However, this approach falls short with events that have longer return period than that of the training data, since the reference statistics have not converged. Here, the scope is to formulate a learning method that allows for correction of dynamics and quantification of extreme events with longer return period than the training data. The key obstacle is the chaotic nature of the underlying dynamics. To overcome this challenge, we introduce a dynamical systems approach where the correction operator is trained using reference data and a coarse model simulation nudged towards that reference. The method is demonstrated on debiasing an under-resolved quasi-geostrophic model and the Energy Exascale Earth System Model (E3SM). For the former, our method enables the quantification of events that have return period two orders longer than the training data. For the latter, when trained on 8 years of ERA5 data, our approach is able to correct the coarse E3SM output to closely reflect the 36-year ERA5 statistics for all prognostic variables and significantly reduce their spatial biases.  ( 3 min )
    HOP to the Next Tasks and Domains for Continual Learning in NLP
    arXiv:2402.18449v1 Announce Type: cross Abstract: Continual Learning (CL) aims to learn a sequence of problems (i.e., tasks and domains) by transferring knowledge acquired on previous problems, whilst avoiding forgetting of past ones. Different from previous approaches which focused on CL for one NLP task or domain in a specific use-case, in this paper, we address a more general CL setting to learn from a sequence of problems in a unique framework. Our method, HOP, permits to hop across tasks and domains by addressing the CL problem along three directions: (i) we employ a set of adapters to generalize a large pre-trained model to unseen problems, (ii) we compute high-order moments over the distribution of embedded representations to distinguish independent and correlated statistics across different tasks and domains, (iii) we process this enriched information with auxiliary heads specialized for each end problem. Extensive experimental campaign on 4 NLP applications, 5 benchmarks and 2 CL setups demonstrates the effectiveness of our HOP.  ( 2 min )
    Emotion Classification in Low and Moderate Resource Languages
    arXiv:2402.18424v1 Announce Type: cross Abstract: It is important to be able to analyze the emotional state of people around the globe. There are 7100+ active languages spoken around the world and building emotion classification for each language is labor intensive. Particularly for low-resource and endangered languages, building emotion classification can be quite challenging. We present a cross-lingual emotion classifier, where we train an emotion classifier with resource-rich languages (i.e. \textit{English} in our work) and transfer the learning to low and moderate resource languages. We compare and contrast two approaches of transfer learning from a high-resource language to a low or moderate-resource language. One approach projects the annotation from a high-resource language to low and moderate-resource language in parallel corpora and the other one uses direct transfer from high-resource language to the other languages. We show the efficacy of our approaches on 6 languages: Farsi, Arabic, Spanish, Ilocano, Odia, and Azerbaijani. Our results indicate that our approaches outperform random baselines and transfer emotions across languages successfully. For all languages, the direct cross-lingual transfer of emotion yields better results. We also create annotated emotion-labeled resources for four languages: Farsi, Azerbaijani, Ilocano and Odia.  ( 2 min )
    A Relational Inductive Bias for Dimensional Abstraction in Neural Networks
    arXiv:2402.18426v1 Announce Type: cross Abstract: The human cognitive system exhibits remarkable flexibility and generalization capabilities, partly due to its ability to form low-dimensional, compositional representations of the environment. In contrast, standard neural network architectures often struggle with abstract reasoning tasks, overfitting, and requiring extensive data for training. This paper investigates the impact of the relational bottleneck -- a mechanism that focuses processing on relations among inputs -- on the learning of factorized representations conducive to compositional coding and the attendant flexibility of processing. We demonstrate that such a bottleneck not only improves generalization and learning efficiency, but also aligns network performance with human-like behavioral biases. Networks trained with the relational bottleneck developed orthogonal representations of feature dimensions latent in the dataset, reflecting the factorized structure thought to underlie human cognitive flexibility. Moreover, the relational network mimics human biases towards regularity without pre-specified symbolic primitives, suggesting that the bottleneck fosters the emergence of abstract representations that confer flexibility akin to symbols.  ( 2 min )
    Can GPT Improve the State of Prior Authorization via Guideline Based Automated Question Answering?
    arXiv:2402.18419v1 Announce Type: cross Abstract: Health insurance companies have a defined process called prior authorization (PA) which is a health plan cost-control process that requires doctors and other healthcare professionals to get clearance in advance from a health plan before performing a particular procedure on a patient in order to be eligible for payment coverage. For health insurance companies, approving PA requests for patients in the medical domain is a time-consuming and challenging task. One of those key challenges is validating if a request matches up to certain criteria such as age, gender, etc. In this work, we evaluate whether GPT can validate numerous key factors, in turn helping health plans reach a decision drastically faster. We frame it as a question answering task, prompting GPT to answer a question from patient electronic health record. We experiment with different conventional prompting techniques as well as introduce our own novel prompting technique. Moreover, we report qualitative assessment by humans on the natural language generation outputs from our approach. Results show that our method achieves superior performance with the mean weighted F1 score of 0.61 as compared to its standard counterparts.  ( 2 min )
    Deep Confident Steps to New Pockets: Strategies for Docking Generalization
    arXiv:2402.18396v1 Announce Type: cross Abstract: Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.  ( 2 min )
    Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
    arXiv:2402.18334v1 Announce Type: cross Abstract: We introduce Bonito, an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. Our goal is to enable zero-shot task adaptation of large language models on users' specialized, private data. We train Bonito on a new large-scale dataset with 1.65M examples created by remixing existing instruction tuning datasets into meta-templates. The meta-templates for a dataset produce training examples where the input is the unannotated text and the task attribute and the output consists of the instruction and the response. We use Bonito to generate synthetic tasks for seven datasets from specialized domains across three task types -- yes-no question answering, extractive question answering, and natural language inference -- and adapt language models. We show that Bonito significantly improves the average performance of pretrained and instruction tuned models over the de facto self supervised baseline. For example, adapting Mistral-Instruct-v2 and instruction tuned variants of Mistral and Llama2 with Bonito improves the strong zero-shot performance by 22.1 F1 points whereas the next word prediction objective undoes some of the benefits of instruction tuning and reduces the average performance by 0.8 F1 points. We conduct additional experiments with Bonito to understand the effects of the domain, the size of the training set, and the choice of alternative synthetic task generators. Overall, we show that learning with synthetic instruction tuning datasets is an effective way to adapt language models to new domains. The model, dataset, and code are available at https://github.com/BatsResearch/bonito.  ( 3 min )
    Large Language Models As Evolution Strategies
    arXiv:2402.18381v1 Announce Type: cross Abstract: Large Transformer models are capable of implementing a plethora of so-called in-context learning algorithms. These include gradient descent, classification, sequence completion, transformation, and improvement. In this work, we investigate whether large language models (LLMs), which never explicitly encountered the task of black-box optimization, are in principle capable of implementing evolutionary optimization algorithms. While previous works have solely focused on language-based task specification, we move forward and focus on the zero-shot application of LLMs to black-box optimization. We introduce a novel prompting strategy, consisting of least-to-most sorting of discretized population members and querying the LLM to propose an improvement to the mean statistic, i.e. perform a type of black-box recombination operation. Empirically, we find that our setup allows the user to obtain an LLM-based evolution strategy, which we call `EvoLLM', that robustly outperforms baseline algorithms such as random search and Gaussian Hill Climbing on synthetic BBOB functions as well as small neuroevolution tasks. Hence, LLMs can act as `plug-in' in-context recombination operators. We provide several comparative studies of the LLM's model size, prompt strategy, and context construction. Finally, we show that one can flexibly improve EvoLLM's performance by providing teacher algorithm information via instruction fine-tuning on previously collected teacher optimization trajectories.  ( 2 min )
    Living-off-The-Land Reverse-Shell Detection by Informed Data Augmentation
    arXiv:2402.18329v1 Announce Type: cross Abstract: The living-off-the-land (LOTL) offensive methodologies rely on the perpetration of malicious actions through chains of commands executed by legitimate applications, identifiable exclusively by analysis of system logs. LOTL techniques are well hidden inside the stream of events generated by common legitimate activities, moreover threat actors often camouflage activity through obfuscation, making them particularly difficult to detect without incurring in plenty of false alarms, even using machine learning. To improve the performance of models in such an harsh environment, we propose an augmentation framework to enhance and diversify the presence of LOTL malicious activity inside legitimate logs. Guided by threat intelligence, we generate a dataset by injecting attack templates known to be employed in the wild, further enriched by malleable patterns of legitimate activities to replicate the behavior of evasive threat actors. We conduct an extensive ablation study to understand which models better handle our augmented dataset, also manipulated to mimic the presence of model-agnostic evasion and poisoning attacks. Our results suggest that augmentation is needed to maintain high-predictive capabilities, robustness to attack is achieved through specific hardening techniques like adversarial training, and it is possible to deploy near-real-time models with almost-zero false alarms.  ( 2 min )
    How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
    arXiv:2402.18312v1 Announce Type: cross Abstract: Despite superior reasoning prowess demonstrated by Large Language Models (LLMs) with Chain-of-Thought (CoT) prompting, a lack of understanding prevails around the internal mechanisms of the models that facilitate CoT generation. This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of LLaMA-2 7B applied to multistep reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. We observe a striking functional rift in the middle layers of the LLM. Token representations in the initial half remain strongly biased towards the pretraining prior, with the in-context taking over abruptly in the later half. This internal phase shift manifests in different functional components: attention heads that write the answer token predominantly appear in the later half, attention heads that move information along ontological relationships appear exclusively in the initial half, and so on. To the best of our knowledge, this is the first attempt towards mechanistic investigation of CoT reasoning in LLMs.  ( 2 min )
    Self-Supervised Learning in Electron Microscopy: Towards a Foundation Model for Advanced Image Analysis
    arXiv:2402.18286v1 Announce Type: cross Abstract: In this work, we explore the potential of self-supervised learning from unlabeled electron microscopy datasets, taking a step toward building a foundation model in this field. We show how self-supervised pretraining facilitates efficient fine-tuning for a spectrum of downstream tasks, including semantic segmentation, denoising, noise & background removal, and super-resolution. Experimentation with varying model complexities and receptive field sizes reveals the remarkable phenomenon that fine-tuned models of lower complexity consistently outperform more complex models with random weight initialization. We demonstrate the versatility of self-supervised pretraining across various downstream tasks in the context of electron microscopy, allowing faster convergence and better performance. We conclude that self-supervised pretraining serves as a powerful catalyst, being especially advantageous when limited annotated data are available and efficient scaling of computational cost are important.  ( 2 min )
    FSL Model can Score Higher as It Is
    arXiv:2402.18292v1 Announce Type: cross Abstract: In daily life, we tend to present the front of our faces by staring squarely at a facial recognition machine, instead of facing it sideways, in order to increase the chance of being correctly recognised. Few-shot-learning (FSL) classification is challenging in itself because a model has to identify images that belong to classes previously unseen during training. Therefore, a warped and non-typical query or support image during testing can make it even more challenging for a model to predict correctly. In our work, to increase the chance of correct prediction during testing, we aim to rectify the test input of a trained FSL model by generating new samples of the tested classes through image-to-image translation. An FSL model is usually trained on classes with sufficient samples, and then tested on classes with few-shot samples. Our proposed method first captures the style or shape of the test image, and then identifies a suitable trained class sample. It then transfers the style or shape of the test image to the train-class images for generation of more test-class samples, before performing classification based on a set of generated samples instead of just one sample. Our method has potential in empowering a trained FSL model to score higher during the testing phase without any extra training nor dataset. According to our experiments, by augmenting the support set with just 1 additional generated sample, we can achieve around 2% improvement for trained FSL models on datasets consisting of either animal faces or traffic signs. By augmenting both the support set and the queries, we can achieve even more performance improvement. Our Github Repository is publicly available.  ( 3 min )
    Affective State Detection using fNIRs and Machine Learning
    arXiv:2402.18241v1 Announce Type: cross Abstract: Affective states regulate our day to day to function and has a tremendous effect on mental and physical health. Detection of affective states is of utmost importance for mental health monitoring, smart entertainment selection and dynamic workload management. In this paper, we discussed relevant literature on affective state detection using physiology data, the benefits and limitations of different sensors and methods used for collecting physiology data, and our rationale for selecting functional near-infrared spectroscopy. We present the design of an experiment involving nine subjects to evoke the affective states of meditation, amusement and cognitive load and the results of the attempt to classify using machine learning. A mean accuracy of 83.04% was achieved in three class classification with an individual model; 84.39% accuracy was achieved for a group model and 60.57% accuracy was achieved for subject independent model using leave one out cross validation. It was found that prediction accuracy for cognitive load was higher (evoked using a pen and paper task) than the other two classes (evoked using computer bases tasks). To verify that this discrepancy was not due to motor skills involved in the pen and paper task, a second experiment was conducted using four participants and the results of that experiment has also been presented in the paper.  ( 2 min )
    CogBench: a large language model walks into a psychology lab
    arXiv:2402.18225v1 Announce Type: cross Abstract: Large language models (LLMs) have significantly advanced the field of artificial intelligence. Yet, evaluating them comprehensively remains challenging. We argue that this is partly due to the predominant focus on performance metrics in most benchmarks. This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 35 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. Interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior. Finally, we explore the effects of prompt-engineering techniques. We discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.  ( 2 min )
    Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization
    arXiv:2402.18128v1 Announce Type: cross Abstract: Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning. It operates by randomly masking image patches and reconstructing these masked patches using the unmasked ones. A key limitation of MAE lies in its disregard for the varying informativeness of different patches, as it uniformly selects patches to mask. To overcome this, some approaches propose masking based on patch informativeness. However, these methods often do not consider the specific requirements of downstream tasks, potentially leading to suboptimal representations for these tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at: https://github.com/Alexiland/MLOMAE  ( 2 min )
    DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
    arXiv:2402.18137v1 Announce Type: cross Abstract: Multimodal pretraining has emerged as an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progression information; 2) enforcing temporal consistency of visual representation; 3) capturing trajectory-level language grounding. Most existing methods approach these via separate objectives, which often reach sub-optimal solutions. In this paper, we propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences and seamlessly align them with language instructions. We discover that via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than mismatched pairs, the popular Bradley-Terry model can transform into representation learning through proper reward reparameterizations. The resulted framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks, providing an embodied representation learning framework that elegantly extracts both local and global task progression features, with temporal consistency enforced through implicit time contrastive learning, while ensuring trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/  ( 2 min )
    Simple But Effective: Rethinking the Ability of Deep Learning in fNIRS to Exclude Abnormal Input
    arXiv:2402.18112v1 Announce Type: cross Abstract: Functional near-infrared spectroscopy (fNIRS) is a non-invasive technique for monitoring brain activity. To better understand the brain, researchers often use deep learning to address the classification challenges of fNIRS data. Our study shows that while current networks in fNIRS are highly accurate for predictions within their training distribution, they falter at identifying and excluding abnormal data which is out-of-distribution, affecting their reliability. We propose integrating metric learning and supervised methods into fNIRS research to improve networks capability in identifying and excluding out-of-distribution outliers. This method is simple yet effective. In our experiments, it significantly enhances the performance of various networks in fNIRS, particularly transformer-based one, which shows the great improvement in reliability. We will make our experiment data available on GitHub.  ( 2 min )
    PRCL: Probabilistic Representation Contrastive Learning for Semi-Supervised Semantic Segmentation
    arXiv:2402.18117v1 Announce Type: cross Abstract: Tremendous breakthroughs have been developed in Semi-Supervised Semantic Segmentation (S4) through contrastive learning. However, due to limited annotations, the guidance on unlabeled images is generated by the model itself, which inevitably exists noise and disturbs the unsupervised training process. To address this issue, we propose a robust contrastive-based S4 framework, termed the Probabilistic Representation Contrastive Learning (PRCL) framework to enhance the robustness of the unsupervised training process. We model the pixel-wise representation as Probabilistic Representations (PR) via multivariate Gaussian distribution and tune the contribution of the ambiguous representations to tolerate the risk of inaccurate guidance in contrastive learning. Furthermore, we introduce Global Distribution Prototypes (GDP) by gathering all PRs throughout the whole training process. Since the GDP contains the information of all representations with the same class, it is robust from the instant noise in representations and bears the intra-class variance of representations. In addition, we generate Virtual Negatives (VNs) based on GDP to involve the contrastive learning process. Extensive experiments on two public benchmarks demonstrate the superiority of our PRCL framework.  ( 2 min )
    Automated Testing of Spatially-Dependent Environmental Hypotheses through Active Transfer Learning
    arXiv:2402.18064v1 Announce Type: cross Abstract: The efficient collection of samples is an important factor in outdoor information gathering applications on account of high sampling costs such as time, energy, and potential destruction to the environment. Utilization of available a-priori data can be a powerful tool for increasing efficiency. However, the relationships of this data with the quantity of interest are often not known ahead of time, limiting the ability to leverage this knowledge for improved planning efficiency. To this end, this work combines transfer learning and active learning through a Multi-Task Gaussian Process and an information-based objective function. Through this combination it can explore the space of hypothetical inter-quantity relationships and evaluate these hypotheses in real-time, allowing this new knowledge to be immediately exploited for future plans. The performance of the proposed method is evaluated against synthetic data and is shown to evaluate multiple hypotheses correctly. Its effectiveness is also demonstrated on real datasets. The technique is able to identify and leverage hypotheses which show a medium or strong correlation to reduce prediction error by a factor of 1.5--6 within the first 5 samples, and poor hypotheses are quickly identified and rejected, having no adverse effect on planning after around 3 samples.  ( 2 min )
    Automated Discovery of Integral with Deep Learning
    arXiv:2402.18040v1 Announce Type: cross Abstract: Recent advancements in the realm of deep learning, particularly in the development of large language models (LLMs), have demonstrated AI's ability to tackle complex mathematical problems or solving programming challenges. However, the capability to solve well-defined problems based on extensive training data differs significantly from the nuanced process of making scientific discoveries. Trained on almost all human knowledge available, today's sophisticated LLMs basically learn to predict sequences of tokens. They generate mathematical derivations and write code in a similar way as writing an essay, and do not have the ability to pioneer scientific discoveries in the manner a human scientist would do. In this study we delve into the potential of using deep learning to rediscover a fundamental mathematical concept: integrals. By defining integrals as area under the curve, we illustrate how AI can deduce the integral of a given function, exemplified by inferring $\int_{0}^{x} t^2 dt = \frac{x^3}{3}$ and $\int_{0}^{x} ae^{bt} dt = \frac{a}{b} e^{bx} - \frac{a}{b}$. Our experiments show that deep learning models can approach the task of inferring integrals either through a sequence-to-sequence model, akin to language translation, or by uncovering the rudimentary principles of integration, such as $\int_{0}^{x} t^n dt = \frac{x^{n+1}}{n+1}$.  ( 2 min )
    Physics-Informed Machine Learning for Seismic Response Prediction OF Nonlinear Steel Moment Resisting Frame Structures
    arXiv:2402.17992v1 Announce Type: cross Abstract: There is a growing interest in utilizing machine learning (ML) methods for structural metamodeling due to the substantial computational cost of traditional numerical simulations. The existing data-driven strategies show potential limitations to the model robustness and interpretability as well as the dependency of rich data. To address these challenges, this paper presents a novel physics-informed machine learning (PiML) method, which incorporates scientific principles and physical laws into deep neural networks for modeling seismic responses of nonlinear structures. The basic concept is to constrain the solution space of the ML model within known physical bounds. This is made possible with three main features, namely, model order reduction, a long short-term memory (LSTM) networks, and Newton's second law (e.g., the equation of motion). Model order reduction is essential for handling structural systems with inherent redundancy and enhancing model efficiency. The LSTM network captures temporal dependencies, enabling accurate prediction of time series responses. The equation of motion is manipulated to learn system nonlinearities and confines the solution space within physically interpretable results. These features enable model training with relatively sparse data and offer benefits in terms of accuracy, interpretability, and robustness. Furthermore, a dataset of seismically designed archetype ductile planar steel moment resistant frames under horizontal seismic loading, available in the DesignSafe-CI Database, is considered for evaluation of the proposed method. The resulting metamodel is capable of handling more complex data compared to existing physics-guided LSTM models and outperforms other non-physics data-driven neural networks.  ( 3 min )
    Symmetry-aware Reinforcement Learning for Robotic Assembly under Partial Observability with a Soft Wrist
    arXiv:2402.18002v1 Announce Type: cross Abstract: This study tackles the representative yet challenging contact-rich peg-in-hole task of robotic assembly, using a soft wrist that can operate more safely and tolerate lower-frequency control signals than a rigid one. Previous studies often use a fully observable formulation, requiring external setups or estimators for the peg-to-hole pose. In contrast, we use a partially observable formulation and deep reinforcement learning from demonstrations to learn a memory-based agent that acts purely on haptic and proprioceptive signals. Moreover, previous works do not incorporate potential domain symmetry and thus must search for solutions in a bigger space. Instead, we propose to leverage the symmetry for sample efficiency by augmenting the training data and constructing auxiliary losses to force the agent to adhere to the symmetry. Results in simulation with five different symmetric peg shapes show that our proposed agent can be comparable to or even outperform a state-based agent. In particular, the sample efficiency also allows us to learn directly on the real robot within 3 hours.  ( 2 min )
    Constrained Decoding for Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars
    arXiv:2402.17988v1 Announce Type: cross Abstract: Large Language Models are powerful tools for program synthesis and advanced auto-completion, but come with no guarantee that their output code is syntactically correct. This paper contributes an incremental parser that allows early rejection of syntactically incorrect code, as well as efficient detection of complete programs for fill-in-the-middle (FItM) tasks. We develop Earley-style parsers that operate over left and right quotients of arbitrary context-free grammars, and we extend our incremental parsing and quotient operations to several context-sensitive features present in the grammars of many common programming languages. The result of these contributions is an efficient, general, and well-grounded method for left and right quotient parsing. To validate our theoretical contributions -- and the practical effectiveness of certain design decisions -- we evaluate our method on the particularly difficult case of FItM completion for Python 3. Our results demonstrate that constrained generation can significantly reduce the incidence of syntax errors in recommended code.  ( 2 min )
    Multistatic-Radar RCS-Signature Recognition of Aerial Vehicles: A Bayesian Fusion Approach
    arXiv:2402.17987v1 Announce Type: cross Abstract: Radar Automated Target Recognition (RATR) for Unmanned Aerial Vehicles (UAVs) involves transmitting Electromagnetic Waves (EMWs) and performing target type recognition on the received radar echo, crucial for defense and aerospace applications. Previous studies highlighted the advantages of multistatic radar configurations over monostatic ones in RATR. However, fusion methods in multistatic radar configurations often suboptimally combine classification vectors from individual radars probabilistically. To address this, we propose a fully Bayesian RATR framework employing Optimal Bayesian Fusion (OBF) to aggregate classification probability vectors from multiple radars. OBF, based on expected 0-1 loss, updates a Recursive Bayesian Classification (RBC) posterior distribution for target UAV type, conditioned on historical observations across multiple time steps. We evaluate the approach using simulated random walk trajectories for seven drones, correlating target aspect angles to Radar Cross Section (RCS) measurements in an anechoic chamber. Comparing against single radar Automated Target Recognition (ATR) systems and suboptimal fusion methods, our empirical results demonstrate that the OBF method integrated with RBC significantly enhances classification accuracy compared to other fusion methods and single radar configurations.  ( 2 min )
    Ensemble Methodology:Innovations in Credit Default Prediction Using LightGBM, XGBoost, and LocalEnsemble
    arXiv:2402.17979v1 Announce Type: cross Abstract: In the realm of consumer lending, accurate credit default prediction stands as a critical element in risk mitigation and lending decision optimization. Extensive research has sought continuous improvement in existing models to enhance customer experiences and ensure the sound economic functioning of lending institutions. This study responds to the evolving landscape of credit default prediction, challenging conventional models and introducing innovative approaches. By building upon foundational research and recent innovations, our work aims to redefine the standards of accuracy in credit default prediction, setting a new benchmark for the industry. To overcome these challenges, we present an Ensemble Methods framework comprising LightGBM, XGBoost, and LocalEnsemble modules, each making unique contributions to amplify diversity and improve generalization. By utilizing distinct feature sets, our methodology directly tackles limitations identified in previous studies, with the overarching goal of establishing a novel standard for credit default prediction accuracy. Our experimental findings validate the effectiveness of the ensemble model on the dataset, signifying substantial contributions to the field. This innovative approach not only addresses existing obstacles but also sets a precedent for advancing the accuracy and robustness of credit default prediction models.  ( 2 min )
    Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
    arXiv:2402.17975v1 Announce Type: cross Abstract: Preference-based reinforcement learning (PbRL) aligns a robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation (z^{sa}) via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from (z^{sa}), which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83\% and 66\% of ground truth reward policy performance versus only 38\% and 21\%. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: \texttt{https://github.com/apple/ml-reed}.  ( 2 min )
    Pragmatic Instruction Following and Goal Assistance via Cooperative Language-Guided Inverse Planning
    arXiv:2402.17930v1 Announce Type: cross Abstract: People often give instructions whose meaning is ambiguous without further context, expecting that their actions or goals will disambiguate their intentions. How can we build assistive agents that follow such instructions in a flexible, context-sensitive manner? This paper introduces cooperative language-guided inverse plan search (CLIPS), a Bayesian agent architecture for pragmatic instruction following and goal assistance. Our agent assists a human by modeling them as a cooperative planner who communicates joint plans to the assistant, then performs multimodal Bayesian inference over the human's goal from actions and language, using large language models (LLMs) to evaluate the likelihood of an instruction given a hypothesized plan. Given this posterior, our assistant acts to minimize expected goal achievement cost, enabling it to pragmatically follow ambiguous instructions and provide effective assistance even when uncertain about the goal. We evaluate these capabilities in two cooperative planning domains (Doors, Keys & Gems and VirtualHome), finding that CLIPS significantly outperforms GPT-4V, LLM-based literal instruction following and unimodal inverse planning in both accuracy and helpfulness, while closely matching the inferences and assistive judgments provided by human raters.  ( 2 min )
    Sequential transport maps using SoS density estimation and $\alpha$-divergences
    arXiv:2402.17943v1 Announce Type: cross Abstract: Transport-based density estimation methods are receiving growing interest because of their ability to efficiently generate samples from the approximated density. We further invertigate the sequential transport maps framework proposed from arXiv:2106.04170 arXiv:2303.02554, which builds on a sequence of composed Knothe-Rosenblatt (KR) maps. Each of those maps are built by first estimating an intermediate density of moderate complexity, and then by computing the exact KR map from a reference density to the precomputed approximate density. In our work, we explore the use of Sum-of-Squares (SoS) densities and $\alpha$-divergences for approximating the intermediate densities. Combining SoS densities with $\alpha$-divergence interestingly yields convex optimization problems which can be efficiently solved using semidefinite programming. The main advantage of $\alpha$-divergences is to enable working with unnormalized densities, which provides benefits both numerically and theoretically. In particular, we provide two new convergence analyses of the sequential transport maps: one based on a triangle-like inequality and the second on information geometric properties of $\alpha$-divergences for unnormalizied densities. The choice of intermediate densities is also crucial for the efficiency of the method. While tempered (or annealed) densities are the state-of-the-art, we introduce diffusion-based intermediate densities which permits to approximate densities known from samples only. Such intermediate densities are well-established in machine learning for generative modeling. Finally we propose and try different low-dimensional maps (or lazy maps) for dealing with high-dimensional problems and numerically demonstrate our methods on several benchmarks, including Bayesian inference problems and unsupervised learning task.  ( 2 min )
    Certain and Approximately Certain Models for Statistical Learning
    arXiv:2402.17926v1 Announce Type: cross Abstract: Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely-used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead.  ( 2 min )
    The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection
    arXiv:2402.17918v1 Announce Type: cross Abstract: This work focuses on advancing security research in the hardware design space by formally defining the realistic problem of Hardware Trojan (HT) detection. The goal is to model HT detection more closely to the real world, i.e., describing the problem as "The Seeker's Dilemma" (an extension of Hide&Seek on a graph), where a detecting agent is unaware of whether circuits are infected by HTs or not. Using this theoretical problem formulation, we create a benchmark that consists of a mixture of HT-free and HT-infected restructured circuits while preserving their original functionalities. The restructured circuits are randomly infected by HTs, causing a situation where the defender is uncertain if a circuit is infected or not. We believe that our innovative dataset will help the community better judge the detection quality of different methods by comparing their success rates in circuit classification. We use our developed benchmark to evaluate three state-of-the-art HT detection tools to show baseline results for this approach. We use Principal Component Analysis to assess the strength of our benchmark, where we observe that some restructured HT-infected circuits are mapped closely to HT-free circuits, leading to significant label misclassification by detectors.  ( 2 min )
    Demonstration of Robust and Efficient Quantum Property Learning with Shallow Shadows
    arXiv:2402.17911v1 Announce Type: cross Abstract: Extracting information efficiently from quantum systems is a major component of quantum information processing tasks. Randomized measurements, or classical shadows, enable predicting many properties of arbitrary quantum states using few measurements. While random single qubit measurements are experimentally friendly and suitable for learning low-weight Pauli observables, they perform poorly for nonlocal observables. Prepending a shallow random quantum circuit before measurements maintains this experimental friendliness, but also has favorable sample complexities for observables beyond low-weight Paulis, including high-weight Paulis and global low-rank properties such as fidelity. However, in realistic scenarios, quantum noise accumulated with each additional layer of the shallow circuit biases the results. To address these challenges, we propose the robust shallow shadows protocol. Our protocol uses Bayesian inference to learn the experimentally relevant noise model and mitigate it in postprocessing. This mitigation introduces a bias-variance trade-off: correcting for noise-induced bias comes at the cost of a larger estimator variance. Despite this increased variance, as we demonstrate on a superconducting quantum processor, our protocol correctly recovers state properties such as expectation values, fidelity, and entanglement entropy, while maintaining a lower sample complexity compared to the random single qubit measurement scheme. We also theoretically analyze the effects of noise on sample complexity and show how the optimal choice of the shallow shadow depth varies with noise strength. This combined theoretical and experimental analysis positions the robust shallow shadow protocol as a scalable, robust, and sample-efficient protocol for characterizing quantum states on current quantum computing platforms.  ( 3 min )
    Using AI libraries for Incompressible Computational Fluid Dynamics
    arXiv:2402.17913v1 Announce Type: cross Abstract: Recently, there has been a huge effort focused on developing highly efficient open source libraries to perform Artificial Intelligence (AI) related computations on different computer architectures (for example, CPUs, GPUs and new AI processors). This has not only made the algorithms based on these libraries highly efficient and portable between different architectures, but also has substantially simplified the entry barrier to develop methods using AI. Here, we present a novel methodology to bring the power of both AI software and hardware into the field of numerical modelling by repurposing AI methods, such as Convolutional Neural Networks (CNNs), for the standard operations required in the field of the numerical solution of Partial Differential Equations (PDEs). The aim of this work is to bring the high performance, architecture agnosticism and ease of use into the field of the numerical solution of PDEs. We use the proposed methodology to solve the advection-diffusion equation, the non-linear Burgers equation and incompressible flow past a bluff body. For the latter, a convolutional neural network is used as a multigrid solver in order to enforce the incompressibility constraint. We show that the presented methodology can solve all these problems using repurposed AI libraries in an efficient way, and presents a new avenue to explore in the development of methods to solve PDEs and Computational Fluid Dynamics problems with implicit methods.  ( 2 min )
    Exoplanets Prediction in Multi-Planetary Systems and Determining the Correlation Between the Parameters of Planets and Host Stars Using Artificial Intelligence
    arXiv:2402.17898v1 Announce Type: cross Abstract: The number of extrasolar planets discovered is increasing, so that more than five thousand exoplanets have been confirmed to date. Now we have an opportunity to test the validity of the laws governing planetary systems and take steps to discover the relationships between the physical parameters of planets and stars. Firstly, we present the results of a search for additional exoplanets in 229 multi-planetary systems that house at least three or more confirmed planets, employing a logarithmic spacing between planets in our Solar System known as the Titius-Bode (TB) relation. We find that the planets in $\sim53\%$ of these systems adhere to a logarithmic spacing relation remarkably better than the Solar System planets. We predict the presence of 426 additional exoplanets, 47 of which are located within the habitable zone (HZ), and five of the 47 planets have a maximum mass limit of 0.1-2$M_{\oplus}$ and a maximum radius lower than 1.25$R_{\oplus}$. Secondly, we employ efficient machine learning approaches to analyze a dataset comprising 762 confirmed exoplanets and eight Solar System planets, aiming to characterize their fundamental quantities. We classify the data into two main classes: 'small' and 'giant' planets, with cut-off values at $R_{p}=8.13R_{\oplus}$ and $M_{p}=52.48M_{\oplus}$. Giant planets have lower densities, suggesting higher H-He mass fractions, while small planets are denser, composed mainly of heavier elements. We highlight that planetary mass, orbital period, and stellar mass play crucial roles in predicting exoplanet radius. Notably, our study reveals a noteworthy result: for giant planets, we observe a strong correlation between planetary radius and the mass of their host stars, which might provide intriguing insights into the relationship between giant planet formation and stellar characteristics.  ( 3 min )
    Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion
    arXiv:2402.17886v1 Announce Type: cross Abstract: This paper considers the problem of sampling from non-logconcave distribution, based on queries of its unnormalized density. It first describes a framework, Diffusion Monte Carlo (DMC), based on the simulation of a denoising diffusion process with its score function approximated by a generic Monte Carlo estimator. DMC is an oracle-based meta-algorithm, where its oracle is the assumed access to samples that generate a Monte Carlo score estimator. Then we provide an implementation of this oracle, based on rejection sampling, and this turns DMC into a true algorithm, termed Zeroth-Order Diffusion Monte Carlo (ZOD-MC). We provide convergence analyses by first constructing a general framework, i.e. a performance guarantee for DMC, without assuming the target distribution to be log-concave or satisfying any isoperimetric inequality. Then we prove that ZOD-MC admits an inverse polynomial dependence on the desired sampling accuracy, albeit still suffering from the curse of dimensionality. Consequently, for low dimensional distributions, ZOD-MC is a very efficient sampler, with performance exceeding latest samplers, including also-denoising-diffusion-based RDMC and RS-DMC. Last, we experimentally demonstrate the insensitivity of ZOD-MC to increasingly higher barriers between modes or discontinuity in non-convex potential.  ( 2 min )
    Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
    arXiv:2402.17840v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection. The vulnerability exists for a wide range of modern LMs that span Llama2, Mistral/Mixtral, Vicuna, SOLAR, WizardLM, Qwen1.5, and Platypus2, and the exploitability exacerbates as the model size scales up. Extending our study to production RAG models GPTs, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the GPTs with only 100 queries generated by themselves.  ( 2 min )
    Stochastic Approximation with Biased MCMC for Expectation Maximization
    arXiv:2402.17870v1 Announce Type: cross Abstract: The expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these results are restricted to the case where asymptotically unbiased MCMC algorithms are used. In practice, MCMC-SAEM is often run with asymptotically biased MCMC, for which the consequences are theoretically less understood. In this work, we fill this gap by analyzing the asymptotics and non-asymptotics of SAEM with biased MCMC steps, particularly the effect of bias. We also provide numerical experiments comparing the Metropolis-adjusted Langevin algorithm (MALA), which is asymptotically unbiased, and the unadjusted Langevin algorithm (ULA), which is asymptotically biased, on synthetic and real datasets. Experimental results show that ULA is more stable with respect to the choice of Langevin stepsize and can sometimes result in faster convergence.  ( 2 min )
    TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
    arXiv:2402.17811v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, they sometimes suffer from producing hallucinations, particularly in cases where they may generate untruthful responses despite possessing the correct knowledge. In this paper, we propose TruthX, an inference-time method to elicit the truthfulness of LLMs by editing their internal representations in truthful space. TruthX employs an auto-encoder to map LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing LLM's internal representations in truthful space, TruthX effectively enhances the truthfulness of LLMs. Experiments show that TruthX effectively improves the truthfulness of 13 advanced LLMs by an average of 20% on TruthfulQA benchmark. Further analyses suggest that the truthful space acquired by TruthX plays a pivotal role in controlling LLM to produce truthful or hallucinatory responses.  ( 2 min )
    BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
    arXiv:2402.17810v1 Announce Type: cross Abstract: Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a novel numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including \emph{3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets}, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at \url{https://github.com/QizhiPei/BioT5}.  ( 3 min )
    Exploring Gene Regulatory Interaction Networks and predicting therapeutic molecules for Hypopharyngeal Cancer and EGFR-mutated lung adenocarcinoma
    arXiv:2402.17807v1 Announce Type: cross Abstract: With the advent of Information technology, the Bioinformatics research field is becoming increasingly attractive to researchers and academicians. The recent development of various Bioinformatics toolkits has facilitated the rapid processing and analysis of vast quantities of biological data for human perception. Most studies focus on locating two connected diseases and making some observations to construct diverse gene regulatory interaction networks, a forerunner to general drug design for curing illness. For instance, Hypopharyngeal cancer is a disease that is associated with EGFR-mutated lung adenocarcinoma. In this study, we select EGFR-mutated lung adenocarcinoma and Hypopharyngeal cancer by finding the Lung metastases in hypopharyngeal cancer. To conduct this study, we collect Mircorarray datasets from GEO (Gene Expression Omnibus), an online database controlled by NCBI. Differentially expressed genes, common genes, and hub genes between the selected two diseases are detected for the succeeding move. Our research findings have suggested common therapeutic molecules for the selected diseases based on 10 hub genes with the highest interactions according to the degree topology method and the maximum clique centrality (MCC). Our suggested therapeutic molecules will be fruitful for patients with those two diseases simultaneously.  ( 3 min )
    AN An ica-ensemble learning approach for prediction of uwb nlos signals data classification
    arXiv:2402.17808v1 Announce Type: cross Abstract: Trapped human detection in search and rescue (SAR) scenarios poses a significant challenge in pervasive computing. This study addresses this issue by leveraging machine learning techniques, given their high accuracy. However, accurate identification of trapped individuals is hindered by the curse of dimensionality and noisy data. Particularly in non-line-of-sight (NLOS) situations during catastrophic events, the curse of dimensionality may lead to blind spots due to noise and uncorrelated values in detections. This research focuses on harmonizing information through wireless communication and identifying individuals in NLOS scenarios using ultra-wideband (UWB) radar signals. Employing independent component analysis (ICA) for feature extraction, the study evaluates classification performance using ensemble algorithms on both static and dynamic datasets. The experimental results demonstrate categorization accuracies of 88.37% for static data and 87.20% for dynamic data, highlighting the effectiveness of the proposed approach. Finally, this work can help scientists and engineers make instant decisions during SAR operations.  ( 2 min )
    A Surprising Failure? Multimodal LLMs and the NLVR Challenge
    arXiv:2402.17793v1 Announce Type: cross Abstract: This study evaluates three state-of-the-art MLLMs -- GPT-4V, Gemini Pro, and the open-source model IDEFICS -- on the compositional natural language vision reasoning task NLVR. Given a human-written sentence paired with a synthetic image, this task requires the model to determine the truth value of the sentence with respect to the image. Despite the strong performance demonstrated by these models, we observe they perform poorly on NLVR, which was constructed to require compositional and spatial reasoning, and to be robust for semantic and systematic biases.  ( 2 min )
    EGNN-C+: Interpretable Evolving Granular Neural Network and Application in Classification of Weakly-Supervised EEG Data Streams
    arXiv:2402.17792v1 Announce Type: cross Abstract: We introduce a modified incremental learning algorithm for evolving Granular Neural Network Classifiers (eGNN-C+). We use double-boundary hyper-boxes to represent granules, and customize the adaptation procedures to enhance the robustness of outer boxes for data coverage and noise suppression, while ensuring that inner boxes remain flexible to capture drifts. The classifier evolves from scratch, incorporates new classes on the fly, and performs local incremental feature weighting. As an application, we focus on the classification of emotion-related patterns within electroencephalogram (EEG) signals. Emotion recognition is crucial for enhancing the realism and interactivity of computer systems. We extract features from the Fourier spectrum of EEG signals obtained from 28 individuals engaged in playing computer games -- a public dataset. Each game elicits a different predominant emotion: boredom, calmness, horror, or joy. We analyze individual electrodes, time window lengths, and frequency bands to assess the accuracy and interpretability of resulting user-independent neural models. The findings indicate that both brain hemispheres assist classification, especially electrodes on the temporal (T8) and parietal (P7) areas, alongside contributions from frontal and occipital electrodes. While patterns may manifest in any band, the Alpha (8-13Hz), Delta (1-4Hz), and Theta (4-8Hz) bands, in this order, exhibited higher correspondence with the emotion classes. The eGNN-C+ demonstrates effectiveness in learning EEG data. It achieves an accuracy of 81.7% and a 0.0029 II interpretability using 10-second time windows, even in face of a highly-stochastic time-varying 4-class classification problem.  ( 3 min )
    EEG classifier cross-task transfer to avoid training sessions in robot-assisted rehabilitation
    arXiv:2402.17790v1 Announce Type: cross Abstract: Background: For an individualized support of patients during rehabilitation, learning of individual machine learning models from the human electroencephalogram (EEG) is required. Our approach allows labeled training data to be recorded without the need for a specific training session. For this, the planned exoskeleton-assisted rehabilitation enables bilateral mirror therapy, in which movement intentions can be inferred from the activity of the unaffected arm. During this therapy, labeled EEG data can be collected to enable movement predictions of only the affected arm of a patient. Methods: A study was conducted with 8 healthy subjects and the performance of the classifier transfer approach was evaluated. Each subject performed 3 runs of 40 self-intended unilateral and bilateral reaching movements toward a target while EEG data was recorded from 64 channels. A support vector machine (SVM) classifier was trained under both movement conditions to make predictions for the same type of movement. Furthermore, the classifier was evaluated to predict unilateral movements by only beeing trained on the data of the bilateral movement condition. Results: The results show that the performance of the classifier trained on selected EEG channels evoked by bilateral movement intentions is not significantly reduced compared to a classifier trained directly on EEG data including unilateral movement intentions. Moreover, the results show that our approach also works with only 8 or even 4 channels. Conclusion: It was shown that the proposed classifier transfer approach enables motion prediction without explicit collection of training data. Since the approach can be applied even with a small number of EEG channels, this speaks for the feasibility of the approach in real therapy sessions with patients and motivates further investigations with stroke patients.  ( 3 min )
    Label Informed Contrastive Pretraining for Node Importance Estimation on Knowledge Graphs
    arXiv:2402.17791v1 Announce Type: cross Abstract: Node Importance Estimation (NIE) is a task of inferring importance scores of the nodes in a graph. Due to the availability of richer data and knowledge, recent research interests of NIE have been dedicating to knowledge graphs for predicting future or missing node importance scores. Existing state-of-the-art NIE methods train the model by available labels, and they consider every interested node equally before training. However, the nodes with higher importance often require or receive more attention in real-world scenarios, e.g., people may care more about the movies or webpages with higher importance. To this end, we introduce Label Informed ContrAstive Pretraining (LICAP) to the NIE problem for being better aware of the nodes with high importance scores. Specifically, LICAP is a novel type of contrastive learning framework that aims to fully utilize the continuous labels to generate contrastive samples for pretraining embeddings. Considering the NIE problem, LICAP adopts a novel sampling strategy called top nodes preferred hierarchical sampling to first group all interested nodes into a top bin and a non-top bin based on node importance scores, and then divide the nodes within top bin into several finer bins also based on the scores. The contrastive samples are generated from those bins, and are then used to pretrain node embeddings of knowledge graphs via a newly proposed Predicate-aware Graph Attention Networks (PreGAT), so as to better separate the top nodes from non-top nodes, and distinguish the top nodes within top bin by keeping the relative order among finer bins. Extensive experiments demonstrate that the LICAP pretrained embeddings can further boost the performance of existing NIE methods and achieve the new state-of-the-art performance regarding both regression and ranking metrics. The source code for reproducibility is available at https://github.com/zhangtia16/LICAP  ( 3 min )
    Stepwise Self-Consistent Mathematical Reasoning with Large Language Models
    arXiv:2402.17786v1 Announce Type: cross Abstract: Using Large Language Models for complex mathematical reasoning is difficult, primarily due to the complexity of multi-step reasoning. The main challenges of this process include (1) selecting critical intermediate results to advance the procedure, and (2) limited exploration of potential solutions. To address these issues, we introduce a novel algorithm, namely Stepwise Self-Consistent Chain-of-Thought (SSC-CoT). SSC-CoT employs a strategy of selecting intermediate steps based on the intersection of various reasoning chains. Additionally, SSC-CoT enables the model to discover critical intermediate steps by querying a knowledge graph comprising relevant domain knowledge. To validate SSC-CoT, we present a new dataset, TriMaster100, tailored for complex trigonometry problems. This dataset contains 100 questions, with each solution broken down into scored intermediate steps, facilitating a comprehensive evaluation of the mathematical reasoning process. On TriMaster100, SSC-CoT triples the effectiveness of the state-of-the-art methods. Furthermore, we benchmark SSC-CoT on the widely recognized complex mathematical question dataset, MATH level 5, and it surpasses the second-best method by 7.2% in accuracy. Code and the TriMaster100 dataset can be found at: https://github.com/zhao-zilong/ssc-cot.  ( 2 min )
    Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure
    arXiv:2402.17780v1 Announce Type: cross Abstract: Arterial blood pressure (ABP) holds substantial promise for proactive cardiovascular health management. Notwithstanding its potential, the invasive nature of ABP measurements confines their utility primarily to clinical environments, limiting their applicability for continuous monitoring beyond medical facilities. The conversion of photoplethysmography (PPG) signals into ABP equivalents has garnered significant attention due to its potential in revolutionizing cardiovascular disease management. Recent strides in PPG-to-ABP prediction encompass the integration of generative and discriminative models. Despite these advances, the efficacy of these models is curtailed by the latent space shift predicament, stemming from alterations in PPG data distribution across disparate hardware and individuals, potentially leading to distorted ABP waveforms. To tackle this problem, we present an innovative solution named the Latent Space Constraint Transformer (LSCT), leveraging a quantized codebook to yield robust latent spaces by employing multiple discretizing bases. To facilitate improved reconstruction, the Correlation-boosted Attention Module (CAM) is introduced to systematically query pertinent bases on a global scale. Furthermore, to enhance expressive capacity, we propose the Multi-Spectrum Enhancement Knowledge (MSEK), which fosters local information flow within the channels of latent code and provides additional embedding for reconstruction. Through comprehensive experimentation on both publicly available datasets and a private downstream task dataset, the proposed approach demonstrates noteworthy performance enhancements compared to existing methods. Extensive ablation studies further substantiate the effectiveness of each introduced module.  ( 3 min )
    BagStacking: An Integrated Ensemble Learning Approach for Freezing of Gait Detection in Parkinson's Disease
    arXiv:2402.17783v1 Announce Type: cross Abstract: This paper introduces BagStacking, a novel ensemble learning method designed to enhance the detection of Freezing of Gait (FOG) in Parkinson's Disease (PD) by using a lower-back sensor to track acceleration. Building on the principles of bagging and stacking, BagStacking aims to achieve the variance reduction benefit of bagging's bootstrap sampling while also learning sophisticated blending through stacking. The method involves training a set of base models on bootstrap samples from the training data, followed by a meta-learner trained on the base model outputs and true labels to find an optimal aggregation scheme. The experimental evaluation demonstrates significant improvements over other state-of-the-art machine learning methods on the validation set. Specifically, BagStacking achieved a MAP score of 0.306, outperforming LightGBM (0.234) and classic Stacking (0.286). Additionally, the run-time of BagStacking was measured at 3828 seconds, illustrating an efficient approach compared to Regular Stacking's 8350 seconds. BagStacking presents a promising direction for handling the inherent variability in FOG detection data, offering a robust and scalable solution to improve patient care in PD.  ( 2 min )
    Assessing the importance of long-range correlations for deep-learning-based sleep staging
    arXiv:2402.17779v1 Announce Type: cross Abstract: This study aims to elucidate the significance of long-range correlations for deep-learning-based sleep staging. It is centered around S4Sleep(TS), a recently proposed model for automated sleep staging. This model utilizes electroencephalography (EEG) as raw time series input and relies on structured state space sequence (S4) models as essential model component. Although the model already surpasses state-of-the-art methods for a moderate number of 15 input epochs, recent literature results suggest potential benefits from incorporating very long correlations spanning hundreds of input epochs. In this submission, we explore the possibility of achieving further enhancements by systematically scaling up the model's input size, anticipating potential improvements in prediction accuracy. In contrast to findings in literature, our results demonstrate that augmenting the input size does not yield a significant enhancement in the performance of S4Sleep(TS). These findings, coupled with the distinctive ability of S4 models to capture long-range dependencies in time series data, cast doubt on the diagnostic relevance of very long-range interactions for sleep staging.  ( 2 min )
    Wavelet Scattering Transform for Bioacustics: Application to Watkins Marine Mammal Sound Database
    arXiv:2402.17775v1 Announce Type: cross Abstract: Marine mammal communication is a complex field, hindered by the diversity of vocalizations and environmental factors. The Watkins Marine Mammal Sound Database (WMMD) is an extensive labeled dataset used in machine learning applications. However, the methods for data preparation, preprocessing, and classification found in the literature are quite disparate. This study first focuses on a brief review of the state-of-the-art benchmarks on the dataset, with an emphasis on clarifying data preparation and preprocessing methods. Subsequently, we propose the application of the Wavelet Scattering Transform (WST) in place of standard methods based on the Short-Time Fourier Transform (STFT). The study also tackles a classification task using an ad-hoc deep architecture with residual layers. We outperform the existing classification architecture by $6\%$ in accuracy using WST and $8\%$ using Mel spectrogram preprocessing, effectively reducing by half the number of misclassified samples, and reaching a top accuracy of $96\%$.  ( 2 min )
    EEG2Rep: Enhancing Self-supervised EEG Representation Through Informative Masked Inputs
    arXiv:2402.17772v1 Announce Type: cross Abstract: Self-supervised approaches for electroencephalography (EEG) representation learning face three specific challenges inherent to EEG data: (1) The low signal-to-noise ratio which challenges the quality of the representation learned, (2) The wide range of amplitudes from very small to relatively large due to factors such as the inter-subject variability, risks the models to be dominated by higher amplitude ranges, and (3) The absence of explicit segmentation in the continuous-valued sequences which can result in less informative representations. To address these challenges, we introduce EEG2Rep, a self-prediction approach for self-supervised representation learning from EEG. Two core novel components of EEG2Rep are as follows: 1) Instead of learning to predict the masked input from raw EEG, EEG2Rep learns to predict masked input in latent representation space, and 2) Instead of conventional masking methods, EEG2Rep uses a new semantic subsequence preserving (SSP) method which provides informative masked inputs to guide EEG2Rep to generate rich semantic representations. In experiments on 6 diverse EEG tasks with subject variability, EEG2Rep significantly outperforms state-of-the-art methods. We show that our semantic subsequence preserving improves the existing masking methods in self-prediction literature and find that preserving 50\% of EEG recordings will result in the most accurate results on all 6 tasks on average. Finally, we show that EEG2Rep is robust to noise addressing a significant challenge that exists in EEG data. Models and code are available at: https://github.com/Navidfoumani/EEG2Rep  ( 3 min )
    SINR-Aware Deep Reinforcement Learning for Distributed Dynamic Channel Allocation in Cognitive Interference Networks
    arXiv:2402.17773v1 Announce Type: cross Abstract: We consider the problem of dynamic channel allocation (DCA) in cognitive communication networks with the goal of maximizing a global signal-to-interference-plus-noise ratio (SINR) measure under a specified target quality of service (QoS)-SINR for each network. The shared bandwidth is partitioned into K channels with frequency separation. In contrast to the majority of existing studies that assume perfect orthogonality or a one- to-one user-channel allocation mapping, this paper focuses on real-world systems experiencing inter-carrier interference (ICI) and channel reuse by multiple large-scale networks. This realistic scenario significantly increases the problem dimension, rendering existing algorithms inefficient. We propose a novel multi-agent reinforcement learning (RL) framework for distributed DCA, named Channel Allocation RL To Overlapped Networks (CARLTON). The CARLTON framework is based on the Centralized Training with Decentralized Execution (CTDE) paradigm, utilizing the DeepMellow value-based RL algorithm. To ensure robust performance in the interference-laden environment we address, CARLTON employs a low-dimensional representation of observations, generating a QoS-type measure while maximizing a global SINR measure and ensuring the target QoS-SINR for each network. Our results demonstrate exceptional performance and robust generalization, showcasing superior efficiency compared to alternative state-of-the-art methods, while achieving a marginally diminished performance relative to a fully centralized approach.  ( 2 min )
    Utilizing Machine Learning for Signal Classification and Noise Reduction in Amateur Radio
    arXiv:2402.17771v1 Announce Type: cross Abstract: In the realm of amateur radio, the effective classification of signals and the mitigation of noise play crucial roles in ensuring reliable communication. Traditional methods for signal classification and noise reduction often rely on manual intervention and predefined thresholds, which can be labor-intensive and less adaptable to dynamic radio environments. In this paper, we explore the application of machine learning techniques for signal classification and noise reduction in amateur radio operations. We investigate the feasibility and effectiveness of employing supervised and unsupervised learning algorithms to automatically differentiate between desired signals and unwanted interference, as well as to reduce the impact of noise on received transmissions. Experimental results demonstrate the potential of machine learning approaches to enhance the efficiency and robustness of amateur radio communication systems, paving the way for more intelligent and adaptive radio solutions in the amateur radio community.  ( 2 min )
    If in a Crowdsourced Data Annotation Pipeline, a GPT-4
    arXiv:2402.16795v1 Announce Type: cross Abstract: Recent studies indicated GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 and an ethical and well-executed MTurk pipeline, with 415 workers labeling 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that despite best practices, MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when combining GPT-4's labels with crowd labels collected via an advanced worker interface for aggregation, 2 out of the 8 algorithms achieved an even higher accuracy (87.5%, 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them could increase labeling accuracy.  ( 2 min )
    Diffusion Language Models Are Versatile Protein Learners
    arXiv:2402.18567v1 Announce Type: new Abstract: This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance.  ( 2 min )
    Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
    arXiv:2402.18571v1 Announce Type: new Abstract: Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).  ( 3 min )
    Approaching Human-Level Forecasting with Language Models
    arXiv:2402.18563v1 Announce Type: new Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.  ( 2 min )
    Implicit Bias of Next-Token Prediction
    arXiv:2402.18551v1 Announce Type: new Abstract: Next-token prediction (NTP), the go-to training paradigm in training large language models, involves predicting the next token in a sequence. Departing from traditional one-hot classification, in NTP, multiple tokens with varying frequencies follow each given context. This work frames NTP training as cross-entropy minimization over distinct contexts, each associated with a sparse empirical probability vector across a finite vocabulary. It then addresses the following question: do gradient-based optimizers exhibit a bias towards solutions with specific structure as the NTP training loss reaches its lower bound (entropy)? Specifically, for linear NTP models trained using gradient descent (GD), we make the following contributions: Firstly, we determine NTP-separability conditions on the data, under which GD can attain its lower bound. We also demonstrate that these conditions hold under overparameterization. Secondly, we establish that the parameters of GD projected onto an appropriate data subspace converge to the unique solution of a system of linear equations, which requires the logits' difference of in-support tokens to be equal to the log-ratio of their respective probabilities. Meanwhile, on the orthogonal subspace, the parameters diverge and converge in the direction of the solution of a max-margin quadratic program, minimizing the Euclidean norm of parameters satisfying the \NTP-separability conditions. Akin to prior research on implicit bias of one-hot classification, our work opens exciting avenues for future research that can lead to better understanding optimization, generalization and robustness principles of models trained with NTP.  ( 2 min )
    Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
    arXiv:2402.18540v1 Announce Type: new Abstract: Public LLMs such as the Llama 2-Chat have driven huge activity in LLM research. These models underwent alignment training and were considered safe. Recently Qi et al. (2023) reported that even benign fine-tuning (e.g., on seemingly safe datasets) can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the "Pure Tuning, Safe Testing" (PTST) principle -- fine-tune models without a safety prompt, but include it at test time. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors, and even almost eliminates them in some cases.  ( 2 min )
    Generalizability Under Sensor Failure: Tokenization + Transformers Enable More Robust Latent Spaces
    arXiv:2402.18546v1 Announce Type: new Abstract: A major goal in neuroscience is to discover neural data representations that generalize. This goal is challenged by variability along recording sessions (e.g. environment), subjects (e.g. varying neural structures), and sensors (e.g. sensor noise), among others. Recent work has begun to address generalization across sessions and subjects, but few study robustness to sensor failure which is highly prevalent in neuroscience experiments. In order to address these generalizability dimensions we first collect our own electroencephalography dataset with numerous sessions, subjects, and sensors, then study two time series models: EEGNet (Lawhern et al., 2018) and TOTEM (Talukder et al., 2024). EEGNet is a widely used convolutional neural network, while TOTEM is a discrete time series tokenizer and transformer model. We find that TOTEM outperforms or matches EEGNet across all generalizability cases. Finally through analysis of TOTEM's latent codebook we observe that tokenization enables generalization.  ( 2 min )
    Log Neural Controlled Differential Equations: The Lie Brackets Make a Difference
    arXiv:2402.18512v1 Announce Type: new Abstract: The vector field of a controlled differential equation (CDE) describes the relationship between a control path and the evolution of a solution path. Neural CDEs (NCDEs) treat time series data as observations from a control path, parameterise a CDE's vector field using a neural network, and use the solution path as a continuously evolving hidden state. As their formulation makes them robust to irregular sampling rates, NCDEs are a powerful approach for modelling real-world data. Building on neural rough differential equations (NRDEs), we introduce Log-NCDEs, a novel and effective method for training NCDEs. The core component of Log-NCDEs is the Log-ODE method, a tool from the study of rough paths for approximating a CDE's solution. On a range of multivariate time series classification benchmarks, Log-NCDEs are shown to achieve a higher average test set accuracy than NCDEs, NRDEs, and two state-of-the-art models, S5 and the linear recurrent unit.  ( 2 min )
    RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
    arXiv:2402.18510v1 Announce Type: new Abstract: This paper investigates the gap in representation powers of Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT, hence closing the representation gap with Transformers.  ( 2 min )
    Evolving machine learning workflows through interactive AutoML
    arXiv:2402.18505v1 Announce Type: new Abstract: Automatic workflow composition (AWC) is a relevant problem in automated machine learning (AutoML) that allows finding suitable sequences of preprocessing and prediction models together with their optimal hyperparameters. This problem can be solved using evolutionary algorithms and, in particular, grammar-guided genetic programming (G3P). Current G3P approaches to AWC define a fixed grammar that formally specifies how workflow elements can be combined and which algorithms can be included. In this paper we present \ourmethod, an interactive G3P algorithm that allows users to dynamically modify the grammar to prune the search space and focus on their regions of interest. Our proposal is the first to combine the advantages of a G3P method with ideas from interactive optimisation and human-guided machine learning, an area little explored in the context of AutoML. To evaluate our approach, we present an experimental study in which 20 participants interact with \ourmethod to evolve workflows according to their preferences. Our results confirm that the collaboration between \ourmethod and humans allows us to find high-performance workflows in terms of accuracy that require less tuning time than those found without human intervention.  ( 2 min )
    Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
    arXiv:2402.18508v1 Announce Type: new Abstract: In the rapidly evolving landscape of deep learning, the quest for models that balance expressivity with computational efficiency has never been more critical. This paper introduces Orchid, a novel architecture that reimagines sequence modeling by incorporating a new data-dependent convolution mechanism. Orchid is designed to address the inherent limitations of traditional attention mechanisms, particularly their quadratic complexity, without compromising the ability to capture long-range dependencies and in-context learning. At the core of Orchid lies the data-dependent convolution layer, which dynamically adjusts its kernel conditioned on input data using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in the adaptive convolution operation. The dynamic nature of data-dependent convolution kernel, coupled with gating operations, grants Orchid high expressivity while maintaining efficiency and quasilinear scalability for long sequences. We rigorously evaluate Orchid across multiple domains, including language modeling and image classification, to showcase its performance and generality. Our experiments demonstrate that Orchid architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of the dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling.  ( 2 min )
    ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype Learning
    arXiv:2402.18495v1 Announce Type: new Abstract: Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known classes. They are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise are training samples which are assigned incorrect labels. The existence of IND noise and OOD noise is prevalent, which usually cause the ambiguity problem, including the intra-class variety problem and the inter-class confusion problem. Thus, to explore robust open-set learning methods is necessary and difficult, and it becomes even more difficult for non-IID graph data.To this end, we propose a unified framework named ROG$_{PL}$ to achieve robust open-set learning on complex noisy graph data, by introducing prototype learning. In specific, ROG$_{PL}$ consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions. The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples, to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapped regions and remains both interior and border prototypes to remedy the inter-class confusion problem.The two modules are iteratively updated under the constraints of classification loss and prototype diversity loss. To the best of our knowledge, the proposed ROG$_{PL}$ is the first robust open-set node classification method for graph data with complex noise.  ( 3 min )
    Dynamical Regimes of Diffusion Models
    arXiv:2402.18491v1 Announce Type: new Abstract: Using statistical physics methods, we study generative diffusion models in the regime where the dimension of space and the number of data are large, and the score function has been trained optimally. Our analysis reveals three distinct dynamical regimes during the backward generative diffusion process. The generative dynamics, starting from pure noise, encounters first a 'speciation' transition where the gross structure of data is unraveled, through a mechanism similar to symmetry breaking in phase transitions. It is followed at later time by a 'collapse' transition where the trajectories of the dynamics become attracted to one of the memorized data points, through a mechanism which is similar to the condensation in a glass phase. For any dataset, the speciation time can be found from a spectral analysis of the correlation matrix, and the collapse time can be found from the estimation of an 'excess entropy' in the data. The dependence of the collapse time on the dimension and number of data provides a thorough characterization of the curse of dimensionality for diffusion models. Analytical solutions for simple models like high-dimensional Gaussian mixtures substantiate these findings and provide a theoretical framework, while extensions to more complex scenarios and numerical validations with real datasets confirm the theoretical predictions.  ( 2 min )
    LeMo-NADe: Multi-Parameter Neural Architecture Discovery with LLMs
    arXiv:2402.18443v1 Announce Type: new Abstract: Building efficient neural network architectures can be a time-consuming task requiring extensive expert knowledge. This task becomes particularly challenging for edge devices because one has to consider parameters such as power consumption during inferencing, model size, inferencing speed, and CO2 emissions. In this article, we introduce a novel framework designed to automatically discover new neural network architectures based on user-defined parameters, an expert system, and an LLM trained on a large amount of open-domain knowledge. The introduced framework (LeMo-NADe) is tailored to be used by non-AI experts, does not require a predetermined neural architecture search space, and considers a large set of edge device-specific parameters. We implement and validate this proposed neural architecture discovery framework using CIFAR-10, CIFAR-100, and ImageNet16-120 datasets while using GPT-4 Turbo and Gemini as the LLM component. We observe that the proposed framework can rapidly (within hours) discover intricate neural network models that perform extremely well across a diverse set of application settings defined by the user.  ( 2 min )
    Signature Kernel Conditional Independence Tests in Causal Discovery for Stochastic Processes
    arXiv:2402.18477v1 Announce Type: new Abstract: Inferring the causal structure underlying stochastic dynamical systems from observational data holds great promise in domains ranging from science and health to finance. Such processes can often be accurately modeled via stochastic differential equations (SDEs), which naturally imply causal relationships via "which variables enter the differential of which other variables". In this paper, we develop a kernel-based test of conditional independence (CI) on "path-space" -- solutions to SDEs -- by leveraging recent advances in signature kernels. We demonstrate strictly superior performance of our proposed CI test compared to existing approaches on path-space. Then, we develop constraint-based causal discovery algorithms for acyclic stochastic dynamical systems (allowing for loops) that leverage temporal information to recover the entire directed graph. Assuming faithfulness and a CI oracle, our algorithm is sound and complete. We empirically verify that our developed CI test in conjunction with the causal discovery algorithm reliably outperforms baselines across a range of settings.  ( 2 min )
    Unveiling the Potential of Robustness in Evaluating Causal Inference Models
    arXiv:2402.18392v1 Announce Type: new Abstract: The growing demand for personalized decision-making has led to a surge of interest in estimating the Conditional Average Treatment Effect (CATE). The intersection of machine learning and causal inference has yielded various effective CATE estimators. However, deploying these estimators in practice is often hindered by the absence of counterfactual labels, making it challenging to select the desirable CATE estimator using conventional model selection procedures like cross-validation. Existing approaches for CATE estimator selection, such as plug-in and pseudo-outcome metrics, face two inherent challenges. Firstly, they are required to determine the metric form and the underlying machine learning models for fitting nuisance parameters or plug-in learners. Secondly, they lack a specific focus on selecting a robust estimator. To address these challenges, this paper introduces a novel approach, the Distributionally Robust Metric (DRM), for CATE estimator selection. The proposed DRM not only eliminates the need to fit additional models but also excels at selecting a robust CATE estimator. Experimental studies demonstrate the efficacy of the DRM method, showcasing its consistent effectiveness in identifying superior estimators while mitigating the risk of selecting inferior ones.  ( 2 min )
    FedUV: Uniformity and Variance for Heterogeneous Federated Learning
    arXiv:2402.18372v1 Announce Type: new Abstract: Federated learning is a promising framework to train neural networks with widely distributed data. However, performance degrades heavily with heterogeneously distributed data. Recent work has shown this is due to the final layer of the network being most prone to local bias, some finding success freezing the final layer as an orthogonal classifier. We investigate the training dynamics of the classifier by applying SVD to the weights motivated by the observation that freezing weights results in constant singular values. We find that there are differences when training in IID and non-IID settings. Based on this finding, we introduce two regularization terms for local training to continuously emulate IID settings: (1) variance in the dimension-wise probability distribution of the classifier and (2) hyperspherical uniformity of representations of the encoder. These regularizations promote local models to act as if it were in an IID setting regardless of the local data distribution, thus offsetting proneness to bias while being flexible to the data. On extensive experiments in both label-shift and feature-shift settings, we verify that our method achieves highest performance by a large margin especially in highly non-IID cases in addition to being scalable to larger models and datasets.  ( 2 min )
    Out-of-Domain Generalization in Dynamical Systems Reconstruction
    arXiv:2402.18377v1 Announce Type: new Abstract: In science we are interested in finding the governing equations, the dynamical rules, underlying empirical phenomena. While traditionally scientific models are derived through cycles of human insight and experimentation, recently deep learning (DL) techniques have been advanced to reconstruct dynamical systems (DS) directly from time series data. State-of-the-art dynamical systems reconstruction (DSR) methods show promise in capturing invariant and long-term properties of observed DS, but their ability to generalize to unobserved domains remains an open challenge. Yet, this is a crucial property we would expect from any viable scientific theory. In this work, we provide a formal framework that addresses generalization in DSR. We explain why and how out-of-domain (OOD) generalization (OODG) in DSR profoundly differs from OODG considered elsewhere in machine learning. We introduce mathematical notions based on topological concepts and ergodic theory to formalize the idea of learnability of a DSR model. We formally prove that black-box DL techniques, without adequate structural priors, generally will not be able to learn a generalizing DSR model. We also show this empirically, considering major classes of DSR algorithms proposed so far, and illustrate where and why they fail to generalize across the whole phase space. Our study provides the first comprehensive mathematical treatment of OODG in DSR, and gives a deeper conceptual understanding of where the fundamental problems in OODG lie and how they could possibly be addressed in practice.  ( 2 min )
    Probabilistic Bayesian optimal experimental design using conditional normalizing flows
    arXiv:2402.18337v1 Announce Type: new Abstract: Bayesian optimal experimental design (OED) seeks to conduct the most informative experiment under budget constraints to update the prior knowledge of a system to its posterior from the experimental data in a Bayesian framework. Such problems are computationally challenging because of (1) expensive and repeated evaluation of some optimality criterion that typically involves a double integration with respect to both the system parameters and the experimental data, (2) suffering from the curse-of-dimensionality when the system parameters and design variables are high-dimensional, (3) the optimization is combinatorial and highly non-convex if the design variables are binary, often leading to non-robust designs. To make the solution of the Bayesian OED problem efficient, scalable, and robust for practical applications, we propose a novel joint optimization approach. This approach performs simultaneous (1) training of a scalable conditional normalizing flow (CNF) to efficiently maximize the expected information gain (EIG) of a jointly learned experimental design (2) optimization of a probabilistic formulation of the binary experimental design with a Bernoulli distribution. We demonstrate the performance of our proposed method for a practical MRI data acquisition problem, one of the most challenging Bayesian OED problems that has high-dimensional (320 $\times$ 320) parameters at high image resolution, high-dimensional (640 $\times$ 386) observations, and binary mask designs to select the most informative observations.  ( 2 min )
    Escaping Local Optima in Global Placement
    arXiv:2402.18311v1 Announce Type: new Abstract: Placement is crucial in the physical design, as it greatly affects power, performance, and area metrics. Recent advancements in analytical methods, such as DREAMPlace, have demonstrated impressive performance in global placement. However, DREAMPlace has some limitations, e.g., may not guarantee legalizable placements under the same settings, leading to fragile and unpredictable results. This paper highlights the main issue as being stuck in local optima, and proposes a hybrid optimization framework to efficiently escape the local optima, by perturbing the placement result iteratively. The proposed framework achieves significant improvements compared to state-of-the-art methods on two popular benchmarks.  ( 2 min )
    PiShield: A NeSy Framework for Learning with Requirements
    arXiv:2402.18285v1 Announce Type: new Abstract: Deep learning models have shown their strengths in various application domains, however, they often struggle to meet safety requirements for their outputs. In this paper, we introduce PiShield, the first framework ever allowing for the integration of the requirements into the neural networks' topology. PiShield guarantees compliance with these requirements, regardless of input. Additionally, it allows for integrating requirements both at inference and/or training time, depending on the practitioners' needs. Given the widespread application of deep learning, there is a growing need for frameworks allowing for the integration of the requirements across various domains. Here, we explore three application scenarios: functional genomics, autonomous driving, and tabular data generation.  ( 2 min )
    Efficiently Computable Safety Bounds for Gaussian Processes in Active Learning
    arXiv:2402.18260v1 Announce Type: new Abstract: Active learning of physical systems must commonly respect practical safety constraints, which restricts the exploration of the design space. Gaussian Processes (GPs) and their calibrated uncertainty estimations are widely used for this purpose. In many technical applications the design space is explored via continuous trajectories, along which the safety needs to be assessed. This is particularly challenging for strict safety requirements in GP methods, as it employs computationally expensive Monte-Carlo sampling of high quantiles. We address these challenges by providing provable safety bounds based on the adaptively sampled median of the supremum of the posterior GP. Our method significantly reduces the number of samples required for estimating high safety probabilities, resulting in faster evaluation without sacrificing accuracy and exploration speed. The effectiveness of our safe active learning approach is demonstrated through extensive simulations and validated using a real-world engine example.  ( 2 min )
    Multi-objective Differentiable Neural Architecture Search
    arXiv:2402.18213v1 Announce Type: new Abstract: Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without additional costs, our method outperforms existing MOO NAS methods across qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k and a Transformer space on machine translation.  ( 2 min )
    Automated Machine Learning for Multi-Label Classification
    arXiv:2402.18198v1 Announce Type: new Abstract: Automated machine learning (AutoML) aims to select and configure machine learning algorithms and combine them into machine learning pipelines tailored to a dataset at hand. For supervised learning tasks, most notably binary and multinomial classification, aka single-label classification (SLC), such AutoML approaches have shown promising results. However, the task of multi-label classification (MLC), where data points are associated with a set of class labels instead of a single class label, has received much less attention so far. In the context of multi-label classification, the data-specific selection and configuration of multi-label classifiers are challenging even for experts in the field, as it is a high-dimensional optimization problem with multi-level hierarchical dependencies. While for SLC, the space of machine learning pipelines is already huge, the size of the MLC search space outnumbers the one of SLC by several orders. In the first part of this thesis, we devise a novel AutoML approach for single-label classification tasks optimizing pipelines of machine learning algorithms, consisting of two algorithms at most. This approach is then extended first to optimize pipelines of unlimited length and eventually configure the complex hierarchical structures of multi-label classification methods. Furthermore, we investigate how well AutoML approaches that form the state of the art for single-label classification tasks scale with the increased problem complexity of AutoML for multi-label classification. In the second part, we explore how methods for SLC and MLC could be configured more flexibly to achieve better generalization performance and how to increase the efficiency of execution-based AutoML systems.  ( 3 min )
    Catastrophic Overfitting: A Potential Blessing in Disguise
    arXiv:2402.18211v1 Announce Type: new Abstract: Fast Adversarial Training (FAT) has gained increasing attention within the research community owing to its efficacy in improving adversarial robustness. Particularly noteworthy is the challenge posed by catastrophic overfitting (CO) in this field. Although existing FAT approaches have made strides in mitigating CO, the ascent of adversarial robustness occurs with a non-negligible decline in classification accuracy on clean samples. To tackle this issue, we initially employ the feature activation differences between clean and adversarial examples to analyze the underlying causes of CO. Intriguingly, our findings reveal that CO can be attributed to the feature coverage induced by a few specific pathways. By intentionally manipulating feature activation differences in these pathways with well-designed regularization terms, we can effectively mitigate and induce CO, providing further evidence for this observation. Notably, models trained stably with these terms exhibit superior performance compared to prior FAT work. On this basis, we harness CO to achieve `attack obfuscation', aiming to bolster model performance. Consequently, the models suffering from CO can attain optimal classification accuracy on both clean and adversarial data when adding random noise to inputs during evaluation. We also validate their robustness against transferred adversarial examples and the necessity of inducing CO to improve robustness. Hence, CO may not be a problem that has to be solved.  ( 2 min )
    Autoencoder-based General Purpose Representation Learning for Customer Embedding
    arXiv:2402.18164v1 Announce Type: new Abstract: In recent years, exploiting the domain-specific underlying structure of data and its generative factors for representation learning has shown success in various use-case agnostic applications. However, the diversity and complexity of tabular data have made it challenging to represent these structures in a latent space through multi-dimensional vectors. We design an autoencoder-based framework for building general purpose embeddings, we assess the performance of different autoencoder architectures, and show simpler models outperform complex ones in embedding highly complex tabular data. We apply our framework to produce plug-and-play, rich, and anonymized embeddings representing AWS customers for usage in any model, saving up to 45% of development time, and observe significant improvements in downstream models. Moreover, we propose a significant improvement to the calculation of reconstruction loss for multi-layer contractive autoencoders (CAE) by calculating the Jacobian of the entire encoder leading to a 15% improvement in reconstruction quality when compared to a stacked CAE.  ( 2 min )
    Decentralised Traffic Incident Detection via Network Lasso
    arXiv:2402.18167v1 Announce Type: new Abstract: Traffic incident detection plays a key role in intelligent transportation systems, which has gained great attention in transport engineering. In the past, traditional machine learning (ML) based detection methods achieved good performance under a centralised computing paradigm, where all data are transmitted to a central server for building ML models therein. Nowadays, deep neural networks based federated learning (FL) has become a mainstream detection approach to enable the model training in a decentralised manner while warranting local data governance. Such neural networks-centred techniques, however, have overshadowed the utility of well-established ML-based detection methods. In this work, we aim to explore the potential of potent conventional ML-based detection models in modern traffic scenarios featured by distributed data. We leverage an elegant but less explored distributed optimisation framework named Network Lasso, with guaranteed global convergence for convex problem formulations, integrate the potent convex ML model with it, and compare it with centralised learning, local learning, and federated learning methods atop a well-known traffic incident detection dataset. Experimental results show that the proposed network lasso-based approach provides a promising alternative to the FL-based approach in data-decentralised traffic scenarios, with a strong convergence guarantee while rekindling the significance of conventional ML-based detection methods.  ( 2 min )
    Diffusion-based Neural Network Weights Generation
    arXiv:2402.18153v1 Announce Type: new Abstract: Transfer learning is a topic of significant interest in recent deep learning research because it enables faster convergence and improved performance on new tasks. While the performance of transfer learning depends on the similarity of the source data to the target data, it is costly to train a model on a large number of datasets. Therefore, pretrained models are generally blindly selected with the hope that they will achieve good performance on the given task. To tackle such suboptimality of the pretrained models, we propose an efficient and adaptive transfer learning scheme through dataset-conditioned pretrained weights sampling. Specifically, we use a latent diffusion model with a variational autoencoder that can reconstruct the neural network weights, to learn the distribution of a set of pretrained weights conditioned on each dataset for transfer learning on unseen datasets. By learning the distribution of a neural network on a variety pretrained models, our approach enables adaptive sampling weights for unseen datasets achieving faster convergence and reaching competitive performance.  ( 2 min )
    Provable Risk-Sensitive Distributional Reinforcement Learning with General Function Approximation
    arXiv:2402.18159v1 Announce Type: new Abstract: In the realm of reinforcement learning (RL), accounting for risk is crucial for making decisions under uncertainty, particularly in applications where safety and reliability are paramount. In this paper, we introduce a general framework on Risk-Sensitive Distributional Reinforcement Learning (RS-DisRL), with static Lipschitz Risk Measures (LRM) and general function approximation. Our framework covers a broad class of risk-sensitive RL, and facilitates analysis of the impact of estimation functions on the effectiveness of RSRL strategies and evaluation of their sample complexity. We design two innovative meta-algorithms: \texttt{RS-DisRL-M}, a model-based strategy for model-based function approximation, and \texttt{RS-DisRL-V}, a model-free approach for general value function approximation. With our novel estimation techniques via Least Squares Regression (LSR) and Maximum Likelihood Estimation (MLE) in distributional RL with augmented Markov Decision Process (MDP), we derive the first $\widetilde{\mathcal{O}}(\sqrt{K})$ dependency of the regret upper bound for RSRL with static LRM, marking a pioneering contribution towards statistically efficient algorithms in this domain.  ( 2 min )
    Provably Efficient Partially Observable Risk-Sensitive Reinforcement Learning with Hindsight Observation
    arXiv:2402.18149v1 Announce Type: new Abstract: This work pioneers regret analysis of risk-sensitive reinforcement learning in partially observable environments with hindsight observation, addressing a gap in theoretical exploration. We introduce a novel formulation that integrates hindsight observations into a Partially Observable Markov Decision Process (POMDP) framework, where the goal is to optimize accumulated reward under the entropic risk measure. We develop the first provably efficient RL algorithm tailored for this setting. We also prove by rigorous analysis that our algorithm achieves polynomial regret $\tilde{O}\left(\frac{e^{|{\gamma}|H}-1}{|{\gamma}|H}H^2\sqrt{KHS^2OA}\right)$, which outperforms or matches existing upper bounds when the model degenerates to risk-neutral or fully observable settings. We adopt the method of change-of-measure and develop a novel analytical tool of beta vectors to streamline mathematical derivations. These techniques are of particular interest to the theoretical study of reinforcement learning.  ( 2 min )
    Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
    arXiv:2402.18133v1 Announce Type: new Abstract: In this paper, we present an empirical study on image recognition fairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We experimentally demonstrate that classes are not equal and the fairness issue is prevalent for image classification models across various datasets, network architectures, and model capacities. Moreover, several intriguing properties of fairness are identified. First, the unfairness lies in problematic representation rather than classifier bias. Second, with the proposed concept of Model Prediction Bias, we investigate the origins of problematic representation during optimization. Our findings reveal that models tend to exhibit greater prediction biases for classes that are more challenging to recognize. It means that more other classes will be confused with harder classes. Then the False Positives (FPs) will dominate the learning in optimization, thus leading to their poor accuracy. Further, we conclude that data augmentation and representation learning algorithms improve overall performance by promoting fairness to some degree in image classification.  ( 2 min )
    On the Inductive Biases of Demographic Parity-based Fair Learning Algorithms
    arXiv:2402.18129v1 Announce Type: new Abstract: Fair supervised learning algorithms assigning labels with little dependence on a sensitive attribute have attracted great attention in the machine learning community. While the demographic parity (DP) notion has been frequently used to measure a model's fairness in training fair classifiers, several studies in the literature suggest potential impacts of enforcing DP in fair learning algorithms. In this work, we analytically study the effect of standard DP-based regularization methods on the conditional distribution of the predicted label given the sensitive attribute. Our analysis shows that an imbalanced training dataset with a non-uniform distribution of the sensitive attribute could lead to a classification rule biased toward the sensitive attribute outcome holding the majority of training data. To control such inductive biases in DP-based fair learning, we propose a sensitive attribute-based distributionally robust optimization (SA-DRO) method improving robustness against the marginal distribution of the sensitive attribute. Finally, we present several numerical results on the application of DP-based learning methods to standard centralized and distributed learning problems. The empirical findings support our theoretical results on the inductive biases in DP-based fair learning algorithms and the debiasing effects of the proposed SA-DRO method.  ( 2 min )
    No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
    arXiv:2402.18096v1 Announce Type: new Abstract: Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models~(LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods were proposed to select and evict unimportant KV pairs from the cache to reduce memory consumption, the potential ramifications of eviction on the generative process are yet to be thoroughly examined. In this paper, we examine the detrimental impact of cache eviction and observe that unforeseen risks arise as the information contained in the KV pairs is exhaustively discarded, resulting in safety breaches, hallucinations, and context loss. Surprisingly, we find that preserving even a small amount of information contained in the evicted KV pairs via reduced precision quantization substantially recovers the incurred degradation. On the other hand, we observe that the important KV pairs must be kept at a relatively higher precision to safeguard the generation quality. Motivated by these observations, we propose \textit{Mixed-precision KV cache}~(MiKV), a reliable cache compression method that simultaneously preserves the context details by retaining the evicted KV pairs in low-precision and ensure generation quality by keeping the important KV pairs in high-precision. Experiments on diverse benchmarks and LLM backbones show that our proposed method offers a state-of-the-art trade-off between compression ratio and performance, compared to other baselines.  ( 3 min )
    Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models
    arXiv:2402.18059v1 Announce Type: new Abstract: Large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing AI-generated and human-written texts. Watermarking is pivotal in this context, which involves embedding hidden markers in texts during the LLM inference phase, which is imperceptible to humans. Current watermarking algorithms, however, face the challenge of achieving both the detectability of inserted watermarks and the semantic integrity of generated texts, where enhancing one aspect often undermines the other. To overcome this, we introduce a novel multi-objective optimization (MOO) approach for watermarking that utilizes lightweight networks to generate token-specific watermarking logits and splitting ratios. By leveraging MOO to optimize for both detection and semantic objective functions, our method simultaneously achieves detectability and semantic integrity. Experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of texts generated by LLMs while maintaining their semantic coherence. Our code is available at https://github.com/mignonjia/TS_watermark .  ( 2 min )
    Communication Efficient ConFederated Learning: An Event-Triggered SAGA Approach
    arXiv:2402.18018v1 Announce Type: new Abstract: Federated learning (FL) is a machine learning paradigm that targets model training without gathering the local data dispersed over various data sources. Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability. In this work, we consider a multi-server FL framework, referred to as \emph{Confederated Learning} (CFL), in order to accommodate a larger number of users. A CFL system is composed of multiple networked edge servers, with each server connected to an individual set of users. Decentralized collaboration among servers is leveraged to harness all users' data for model training. Due to the potentially massive number of users involved, it is crucial to reduce the communication overhead of the CFL system. We propose a stochastic gradient method for distributed learning in the CFL framework. The proposed method incorporates a conditionally-triggered user selection (CTUS) mechanism as the central component to effectively reduce communication overhead. Relying on a delicately designed triggering condition, the CTUS mechanism allows each server to select only a small number of users to upload their gradients, without significantly jeopardizing the convergence performance of the algorithm. Our theoretical analysis reveals that the proposed algorithm enjoys a linear convergence rate. Simulation results show that it achieves substantial improvement over state-of-the-art algorithms in terms of communication efficiency.  ( 3 min )
    Data augmentation method for modeling health records with applications to clopidogrel treatment failure detection
    arXiv:2402.18046v1 Announce Type: new Abstract: We present a novel data augmentation method to address the challenge of data scarcity in modeling longitudinal patterns in Electronic Health Records (EHR) of patients using natural language processing (NLP) algorithms. The proposed method generates augmented data by rearranging the orders of medical records within a visit where the order of elements are not obvious, if any. Applying the proposed method to the clopidogrel treatment failure detection task enabled up to 5.3% absolute improvement in terms of ROC-AUC (from 0.908 without augmentation to 0.961 with augmentation) when it was used during the pre-training procedure. It was also shown that the augmentation helped to improve performance during fine-tuning procedures, especially when the amount of labeled training data is limited.  ( 2 min )
    Diffusion Models as Constrained Samplers for Optimization with Unknown Constraints
    arXiv:2402.18012v1 Announce Type: new Abstract: Addressing real-world optimization problems becomes particularly challenging when analytic objective functions or constraints are unavailable. While numerous studies have addressed the issue of unknown objectives, limited research has focused on scenarios where feasibility constraints are not given explicitly. Overlooking these constraints can lead to spurious solutions that are unrealistic in practice. To deal with such unknown constraints, we propose to perform optimization within the data manifold using diffusion models. To constrain the optimization process to the data manifold, we reformulate the original optimization problem as a sampling problem from the product of the Boltzmann distribution defined by the objective function and the data distribution learned by the diffusion model. To enhance sampling efficiency, we propose a two-stage framework that begins with a guided diffusion process for warm-up, followed by a Langevin dynamics stage for further correction. Theoretical analysis shows that the initial stage results in a distribution focused on feasible solutions, thereby providing a better initialization for the later stage. Comprehensive experiments on a synthetic dataset, six real-world black-box optimization datasets, and a multi-objective optimization dataset show that our method achieves better or comparable performance with previous state-of-the-art baselines.  ( 2 min )
    Mixer is more than just a model
    arXiv:2402.18007v1 Announce Type: new Abstract: Recently, MLP structures have regained popularity, with MLP-Mixer standing out as a prominent example. In the field of computer vision, MLP-Mixer is noted for its ability to extract data information from both channel and token perspectives, effectively acting as a fusion of channel and token information. Indeed, Mixer represents a paradigm for information extraction that amalgamates channel and token information. The essence of Mixer lies in its ability to blend information from diverse perspectives, epitomizing the true concept of "mixing" in the realm of neural network architectures. Beyond channel and token considerations, it is possible to create more tailored mixers from various perspectives to better suit specific task requirements. This study focuses on the domain of audio recognition, introducing a novel model named Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH) that incorporates insights from both time and frequency domains. Experimental results demonstrate that ASM-RH is particularly well-suited for audio data and yields promising outcomes across multiple classification tasks.  ( 2 min )
  • Open

    Fast and Accurate Estimation of Low-Rank Matrices from Noisy Measurements via Preconditioned Non-Convex Gradient Descent
    arXiv:2305.17224v2 Announce Type: replace-cross Abstract: Non-convex gradient descent is a common approach for estimating a low-rank $n\times n$ ground truth matrix from noisy measurements, because it has per-iteration costs as low as $O(n)$ time, and is in theory capable of converging to a minimax optimal estimate. However, the practitioner is often constrained to just tens to hundreds of iterations, and the slow and/or inconsistent convergence of non-convex gradient descent can prevent a high-quality estimate from being obtained. Recently, the technique of preconditioning was shown to be highly effective at accelerating the local convergence of non-convex gradient descent when the measurements are noiseless. In this paper, we describe how preconditioning should be done for noisy measurements to accelerate local convergence to minimax optimality. For the symmetric matrix sensing problem, our proposed preconditioned method is guaranteed to locally converge to minimax error at a linear rate that is immune to ill-conditioning and/or over-parameterization. Using our proposed preconditioned method, we perform a 60 megapixel medical image denoising task, and observe significantly reduced noise levels compared to previous approaches.  ( 2 min )
    Bayesian Prognostic Covariate Adjustment With Additive Mixture Priors
    arXiv:2310.18027v4 Announce Type: replace-cross Abstract: Effective and rapid decision-making from randomized controlled trials (RCTs) requires unbiased and precise treatment effect inferences. Two strategies to address this requirement are to adjust for covariates that are highly correlated with the outcome, and to leverage historical control information via Bayes' theorem. We propose a new Bayesian prognostic covariate adjustment methodology, referred to as Bayesian PROCOVA, that combines these two strategies. Covariate adjustment in Bayesian PROCOVA is based on generative artificial intelligence (AI) algorithms that construct a digital twin generator (DTG) for RCT participants. The DTG is trained on historical control data and yields a digital twin (DT) probability distribution for each RCT participant's outcome under the control treatment. The expectation of the DT distribution, referred to as the prognostic score, defines the covariate for adjustment. Historical control information is leveraged via an additive mixture prior with two components: an informative prior probability distribution specified based on historical control data, and a weakly informative prior distribution. The mixture weight determines the extent to which posterior inferences are drawn from the informative component, versus the weakly informative component. This weight has a prior distribution as well, and so the entire additive mixture prior is completely pre-specifiable without involving any RCT information. We establish an efficient Gibbs algorithm for sampling from the posterior distribution, and derive closed-form expressions for the posterior mean and variance of the treatment effect parameter conditional on the weight, in Bayesian PROCOVA. We evaluate efficiency gains of Bayesian PROCOVA via its bias control and variance reduction compared to frequentist PROCOVA in simulation studies that encompass different discrepancies. These gains translate to smaller RCTs.  ( 3 min )
    Looping in the Human Collaborative and Explainable Bayesian Optimization
    arXiv:2310.17273v4 Announce Type: replace-cross Abstract: Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods. Code is available https://github.com/ma921/CoExBO.  ( 3 min )
    Efficient local linearity regularization to overcome catastrophic overfitting
    arXiv:2401.11618v2 Announce Type: replace-cross Abstract: Catastrophic overfitting (CO) in single-step adversarial training (AT) results in abrupt drops in the adversarial test accuracy (even down to 0%). For models trained with multi-step AT, it has been observed that the loss function behaves locally linearly with respect to the input, this is however lost in single-step AT. To address CO in single-step AT, several methods have been proposed to enforce local linearity of the loss via regularization. However, these regularization terms considerably slow down training due to Double Backpropagation. Instead, in this work, we introduce a regularization term, called ELLE, to mitigate CO effectively and efficiently in classical AT evaluations, as well as some more difficult regimes, e.g., large adversarial perturbations and long training schedules. Our regularization term can be theoretically linked to curvature of the loss function and is computationally cheaper than previous methods by avoiding Double Backpropagation. Our thorough experimental validation demonstrates that our work does not suffer from CO, even in challenging settings where previous works suffer from it. We also notice that adapting our regularization parameter during training (ELLE-A) greatly improves the performance, specially in large $\epsilon$ setups. Our implementation is available in https://github.com/LIONS-EPFL/ELLE .  ( 3 min )
    The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective
    arXiv:2210.05021v3 Announce Type: replace-cross Abstract: Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between over-parameterized and under-parameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.  ( 3 min )
    Consistency Models for Scalable and Fast Simulation-Based Inference
    arXiv:2312.05440v2 Announce Type: replace-cross Abstract: Simulation-based inference (SBI) is constantly in search of more expressive algorithms for accurately inferring the parameters of complex models from noisy data. We present consistency models for neural posterior estimation (CMPE), a new free-form conditional sampler for scalable, fast, and amortized SBI with generative neural networks. CMPE combines the advantages of normalizing flows and flow matching methods into a single generative architecture: It essentially distills a continuous probability flow and enables rapid few-shot inference with an unconstrained architecture that can be tailored to the structure of the estimation problem. Our empirical evaluation demonstrates that CMPE not only outperforms current state-of-the-art algorithms on three hard low-dimensional problems but also achieves competitive performance in a high-dimensional Bayesian denoising experiment and in estimating a computationally demanding multi-scale model of tumor spheroid growth.  ( 2 min )
    Learning in Deep Factor Graphs with Gaussian Belief Propagation
    arXiv:2311.14649v2 Announce Type: replace-cross Abstract: We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, latents) as random variables in a graphical model, and view both training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently local, presenting exciting opportunities for distributed and asynchronous training. Our approach can be scaled to deep networks and provides a natural means to do continual learning: use the BP-estimated parameter marginals of the current task as parameter priors for the next. On a video denoising task we demonstrate the benefit of learnable parameters over a classical factor graph approach and we show encouraging performance of deep factor graphs for continual image classification.  ( 2 min )
    Signature Kernel Conditional Independence Tests in Causal Discovery for Stochastic Processes
    arXiv:2402.18477v1 Announce Type: cross Abstract: Inferring the causal structure underlying stochastic dynamical systems from observational data holds great promise in domains ranging from science and health to finance. Such processes can often be accurately modeled via stochastic differential equations (SDEs), which naturally imply causal relationships via "which variables enter the differential of which other variables". In this paper, we develop a kernel-based test of conditional independence (CI) on "path-space" -- solutions to SDEs -- by leveraging recent advances in signature kernels. We demonstrate strictly superior performance of our proposed CI test compared to existing approaches on path-space. Then, we develop constraint-based causal discovery algorithms for acyclic stochastic dynamical systems (allowing for loops) that leverage temporal information to recover the entire directed graph. Assuming faithfulness and a CI oracle, our algorithm is sound and complete. We empirically verify that our developed CI test in conjunction with the causal discovery algorithm reliably outperforms baselines across a range of settings.  ( 2 min )
    Inference for Heteroskedastic PCA with Missing Data
    arXiv:2107.12365v2 Announce Type: replace-cross Abstract: This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that has been vastly under-explored. While computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, the challenge is further compounded by the prevalent presence of missing data and heteroskedastic noise. We propose a novel approach to performing valid inference on the principal subspace under a spiked covariance model with missing data, on the basis of an estimator called HeteroPCA (Zhang et al., 2022). We develop non-asymptotic distributional guarantees for HeteroPCA, and demonstrate how these can be invoked to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Our inference procedures are fully data-driven and adaptive to heteroskedastic random noise, without requiring prior knowledge about the noise levels.  ( 2 min )
    On the different regimes of Stochastic Gradient Descent
    arXiv:2309.10688v4 Announce Type: replace-cross Abstract: Modern deep networks are trained with stochastic gradient descent (SGD) whose key hyperparameters are the number of data considered at each step or batch size $B$, and the step size or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the ''temperature'' $T\equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B\geq B^*$, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes (i) and (ii) scale with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.  ( 3 min )
    Robust Data Clustering with Outliers via Transformed Tensor Low-Rank Representation
    arXiv:2307.09055v2 Announce Type: replace Abstract: Recently, tensor low-rank representation (TLRR) has become a popular tool for tensor data recovery and clustering, due to its empirical success and theoretical guarantees. However, existing TLRR methods consider Gaussian or gross sparse noise, inevitably leading to performance degradation when the tensor data are contaminated by outliers or sample-specific corruptions. This paper develops an outlier-robust tensor low-rank representation (OR-TLRR) method that provides outlier detection and tensor data clustering simultaneously based on the t-SVD framework. For tensor observations with arbitrary outlier corruptions, OR-TLRR has provable performance guarantee for exactly recovering the row space of clean data and detecting outliers under mild conditions. Moreover, an extension of OR-TLRR is proposed to handle the case when parts of the data are missing. Finally, extensive experimental results on synthetic and real data demonstrate the effectiveness of the proposed algorithms. We release our code at https://github.com/twugithub/2024-AISTATS-ORTLRR.  ( 2 min )
    Streaming data recovery via Bayesian tensor train decomposition
    arXiv:2302.12148v2 Announce Type: replace-cross Abstract: In this paper, we study a Bayesian tensor train (TT) decomposition method to recover streaming data by approximating the latent structure in high-order streaming data. Drawing on the streaming variational Bayes method, we introduce the TT format into Bayesian tensor decomposition methods for streaming data, and formulate posteriors of TT cores. Thanks to the Bayesian framework of the TT format, the proposed algorithm (SPTT) excels in recovering streaming data with high-order, incomplete, and noisy properties. The experiments in synthetic and real-world datasets show the accuracy of our method compared to state-of-the-art Bayesian tensor decomposition methods for streaming data.  ( 2 min )
    Detecting algorithmic bias in medical AI-models
    arXiv:2312.02959v3 Announce Type: replace Abstract: With the growing prevalence of machine learning and artificial intelligence-based medical decision support systems, it is equally important to ensure that these systems provide patient outcomes in a fair and equitable fashion. This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems. Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction, by employing the Classification and Regression Trees (CART) algorithm. We verify our methodology by conducting a series of synthetic data experiments, showcasing its ability to estimate areas of bias in controlled settings precisely. The effectiveness of the concept is further validated by experiments using electronic medical records from Grady Memorial Hospital in Atlanta, Georgia. These tests demonstrate the practical implementation of our strategy in a clinical environment, where it can function as a vital instrument for guaranteeing fairness and equity in AI-based medical decisions.  ( 2 min )
    Multistatic-Radar RCS-Signature Recognition of Aerial Vehicles: A Bayesian Fusion Approach
    arXiv:2402.17987v1 Announce Type: cross Abstract: Radar Automated Target Recognition (RATR) for Unmanned Aerial Vehicles (UAVs) involves transmitting Electromagnetic Waves (EMWs) and performing target type recognition on the received radar echo, crucial for defense and aerospace applications. Previous studies highlighted the advantages of multistatic radar configurations over monostatic ones in RATR. However, fusion methods in multistatic radar configurations often suboptimally combine classification vectors from individual radars probabilistically. To address this, we propose a fully Bayesian RATR framework employing Optimal Bayesian Fusion (OBF) to aggregate classification probability vectors from multiple radars. OBF, based on expected 0-1 loss, updates a Recursive Bayesian Classification (RBC) posterior distribution for target UAV type, conditioned on historical observations across multiple time steps. We evaluate the approach using simulated random walk trajectories for seven drones, correlating target aspect angles to Radar Cross Section (RCS) measurements in an anechoic chamber. Comparing against single radar Automated Target Recognition (ATR) systems and suboptimal fusion methods, our empirical results demonstrate that the OBF method integrated with RBC significantly enhances classification accuracy compared to other fusion methods and single radar configurations.  ( 2 min )
    Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
    arXiv:2402.18571v1 Announce Type: cross Abstract: Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).  ( 3 min )
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold
    arXiv:2205.11677v4 Announce Type: replace Abstract: The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation will disappear completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to the stochastic model of networks and semidefinite program research.  ( 2 min )
    Unveiling the Potential of Robustness in Evaluating Causal Inference Models
    arXiv:2402.18392v1 Announce Type: cross Abstract: The growing demand for personalized decision-making has led to a surge of interest in estimating the Conditional Average Treatment Effect (CATE). The intersection of machine learning and causal inference has yielded various effective CATE estimators. However, deploying these estimators in practice is often hindered by the absence of counterfactual labels, making it challenging to select the desirable CATE estimator using conventional model selection procedures like cross-validation. Existing approaches for CATE estimator selection, such as plug-in and pseudo-outcome metrics, face two inherent challenges. Firstly, they are required to determine the metric form and the underlying machine learning models for fitting nuisance parameters or plug-in learners. Secondly, they lack a specific focus on selecting a robust estimator. To address these challenges, this paper introduces a novel approach, the Distributionally Robust Metric (DRM), for CATE estimator selection. The proposed DRM not only eliminates the need to fit additional models but also excels at selecting a robust CATE estimator. Experimental studies demonstrate the efficacy of the DRM method, showcasing its consistent effectiveness in identifying superior estimators while mitigating the risk of selecting inferior ones.  ( 2 min )
    Provably Efficient Partially Observable Risk-Sensitive Reinforcement Learning with Hindsight Observation
    arXiv:2402.18149v1 Announce Type: cross Abstract: This work pioneers regret analysis of risk-sensitive reinforcement learning in partially observable environments with hindsight observation, addressing a gap in theoretical exploration. We introduce a novel formulation that integrates hindsight observations into a Partially Observable Markov Decision Process (POMDP) framework, where the goal is to optimize accumulated reward under the entropic risk measure. We develop the first provably efficient RL algorithm tailored for this setting. We also prove by rigorous analysis that our algorithm achieves polynomial regret $\tilde{O}\left(\frac{e^{|{\gamma}|H}-1}{|{\gamma}|H}H^2\sqrt{KHS^2OA}\right)$, which outperforms or matches existing upper bounds when the model degenerates to risk-neutral or fully observable settings. We adopt the method of change-of-measure and develop a novel analytical tool of beta vectors to streamline mathematical derivations. These techniques are of particular interest to the theoretical study of reinforcement learning.  ( 2 min )
    Multi-objective Differentiable Neural Architecture Search
    arXiv:2402.18213v1 Announce Type: cross Abstract: Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without additional costs, our method outperforms existing MOO NAS methods across qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k and a Transformer space on machine translation.  ( 2 min )
    Implicit Bias of Next-Token Prediction
    arXiv:2402.18551v1 Announce Type: cross Abstract: Next-token prediction (NTP), the go-to training paradigm in training large language models, involves predicting the next token in a sequence. Departing from traditional one-hot classification, in NTP, multiple tokens with varying frequencies follow each given context. This work frames NTP training as cross-entropy minimization over distinct contexts, each associated with a sparse empirical probability vector across a finite vocabulary. It then addresses the following question: do gradient-based optimizers exhibit a bias towards solutions with specific structure as the NTP training loss reaches its lower bound (entropy)? Specifically, for linear NTP models trained using gradient descent (GD), we make the following contributions: Firstly, we determine NTP-separability conditions on the data, under which GD can attain its lower bound. We also demonstrate that these conditions hold under overparameterization. Secondly, we establish that the parameters of GD projected onto an appropriate data subspace converge to the unique solution of a system of linear equations, which requires the logits' difference of in-support tokens to be equal to the log-ratio of their respective probabilities. Meanwhile, on the orthogonal subspace, the parameters diverge and converge in the direction of the solution of a max-margin quadratic program, minimizing the Euclidean norm of parameters satisfying the \NTP-separability conditions. Akin to prior research on implicit bias of one-hot classification, our work opens exciting avenues for future research that can lead to better understanding optimization, generalization and robustness principles of models trained with NTP.  ( 2 min )
    RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
    arXiv:2402.18510v1 Announce Type: cross Abstract: This paper investigates the gap in representation powers of Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT, hence closing the representation gap with Transformers.  ( 2 min )
    Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion
    arXiv:2402.17886v1 Announce Type: new Abstract: This paper considers the problem of sampling from non-logconcave distribution, based on queries of its unnormalized density. It first describes a framework, Diffusion Monte Carlo (DMC), based on the simulation of a denoising diffusion process with its score function approximated by a generic Monte Carlo estimator. DMC is an oracle-based meta-algorithm, where its oracle is the assumed access to samples that generate a Monte Carlo score estimator. Then we provide an implementation of this oracle, based on rejection sampling, and this turns DMC into a true algorithm, termed Zeroth-Order Diffusion Monte Carlo (ZOD-MC). We provide convergence analyses by first constructing a general framework, i.e. a performance guarantee for DMC, without assuming the target distribution to be log-concave or satisfying any isoperimetric inequality. Then we prove that ZOD-MC admits an inverse polynomial dependence on the desired sampling accuracy, albeit still suffering from the curse of dimensionality. Consequently, for low dimensional distributions, ZOD-MC is a very efficient sampler, with performance exceeding latest samplers, including also-denoising-diffusion-based RDMC and RS-DMC. Last, we experimentally demonstrate the insensitivity of ZOD-MC to increasingly higher barriers between modes or discontinuity in non-convex potential.  ( 2 min )
    Stochastic Approximation with Biased MCMC for Expectation Maximization
    arXiv:2402.17870v1 Announce Type: cross Abstract: The expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these results are restricted to the case where asymptotically unbiased MCMC algorithms are used. In practice, MCMC-SAEM is often run with asymptotically biased MCMC, for which the consequences are theoretically less understood. In this work, we fill this gap by analyzing the asymptotics and non-asymptotics of SAEM with biased MCMC steps, particularly the effect of bias. We also provide numerical experiments comparing the Metropolis-adjusted Langevin algorithm (MALA), which is asymptotically unbiased, and the unadjusted Langevin algorithm (ULA), which is asymptotically biased, on synthetic and real datasets. Experimental results show that ULA is more stable with respect to the choice of Langevin stepsize and can sometimes result in faster convergence.  ( 2 min )
    Certain and Approximately Certain Models for Statistical Learning
    arXiv:2402.17926v1 Announce Type: new Abstract: Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely-used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead.  ( 2 min )
    Sequential transport maps using SoS density estimation and $\alpha$-divergences
    arXiv:2402.17943v1 Announce Type: new Abstract: Transport-based density estimation methods are receiving growing interest because of their ability to efficiently generate samples from the approximated density. We further invertigate the sequential transport maps framework proposed from arXiv:2106.04170 arXiv:2303.02554, which builds on a sequence of composed Knothe-Rosenblatt (KR) maps. Each of those maps are built by first estimating an intermediate density of moderate complexity, and then by computing the exact KR map from a reference density to the precomputed approximate density. In our work, we explore the use of Sum-of-Squares (SoS) densities and $\alpha$-divergences for approximating the intermediate densities. Combining SoS densities with $\alpha$-divergence interestingly yields convex optimization problems which can be efficiently solved using semidefinite programming. The main advantage of $\alpha$-divergences is to enable working with unnormalized densities, which provides benefits both numerically and theoretically. In particular, we provide two new convergence analyses of the sequential transport maps: one based on a triangle-like inequality and the second on information geometric properties of $\alpha$-divergences for unnormalizied densities. The choice of intermediate densities is also crucial for the efficiency of the method. While tempered (or annealed) densities are the state-of-the-art, we introduce diffusion-based intermediate densities which permits to approximate densities known from samples only. Such intermediate densities are well-established in machine learning for generative modeling. Finally we propose and try different low-dimensional maps (or lazy maps) for dealing with high-dimensional problems and numerically demonstrate our methods on several benchmarks, including Bayesian inference problems and unsupervised learning task.  ( 2 min )
    Stable LM 2 1.6B Technical Report
    arXiv:2402.17834v1 Announce Type: cross Abstract: We introduce StableLM 2 1.6B, the first in a new generation of our language model series. In this technical report, we present in detail the data and training procedure leading to the base and instruction-tuned versions of StableLM 2 1.6B. The weights for both models are available via Hugging Face for anyone to download and use. The report contains thorough evaluations of these models, including zero- and few-shot benchmarks, multilingual benchmarks, and the MT benchmark focusing on multi-turn dialogues. At the time of publishing this report, StableLM 2 1.6B was the state-of-the-art open model under 2B parameters by a significant margin. Given its appealing small size, we also provide throughput measurements on a number of edge devices. In addition, we open source several quantized checkpoints and provide their performance metrics compared to the original model.  ( 2 min )
    Material Microstructure Design Using VAE-Regression with Multimodal Prior
    arXiv:2402.17806v1 Announce Type: cross Abstract: We propose a variational autoencoder (VAE)-based model for building forward and inverse structure-property linkages, a problem of paramount importance in computational materials science. Our model systematically combines VAE with regression, linking the two models through a two-level prior conditioned on the regression variables. The regression loss is optimized jointly with the reconstruction loss of the variational autoencoder, learning microstructure features relevant for property prediction and reconstruction. The resultant model can be used for both forward and inverse prediction i.e., for predicting the properties of a given microstructure as well as for predicting the microstructure required to obtain given properties. Since the inverse problem is ill-posed (one-to-many), we derive the objective function using a multi-modal Gaussian mixture prior enabling the model to infer multiple microstructures for a target set of properties. We show that for forward prediction, our model is as accurate as state-of-the-art forward-only models. Additionally, our method enables direct inverse inference. We show that the microstructures inferred using our model achieve desired properties reasonably accurately, avoiding the need for expensive optimization loops.  ( 2 min )
    A network-constrain Weibull AFT model for biomarkers discovery
    arXiv:2402.18242v1 Announce Type: new Abstract: We propose AFTNet, a novel network-constraint survival analysis method based on the Weibull accelerated failure time (AFT) model solved by a penalized likelihood approach for variable selection and estimation. When using the log-linear representation, the inference problem becomes a structured sparse regression problem for which we explicitly incorporate the correlation patterns among predictors using a double penalty that promotes both sparsity and grouping effect. Moreover, we establish the theoretical consistency for the AFTNet estimator and present an efficient iterative computational algorithm based on the proximal gradient descent method. Finally, we evaluate AFTNet performance both on synthetic and real data examples.  ( 2 min )
    Prediction-Powered Ranking of Large Language Models
    arXiv:2402.17826v1 Announce Type: cross Abstract: Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the most popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a very common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a small set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with (the distribution of) human pairwise preferences. Our framework is computationally efficient, easy to use, and does not make any assumption about the distribution of human preferences nor about the degree of alignment between the pairwise comparisons by the humans and the strong large language model.  ( 3 min )

  • Open

    [D] AI OSS projects open to contributors?
    I've got a team of about 8 people considering trying to contribute to some AI OSS, and I'm wondering if anyone here has any favorite projects that are already seeing significant traction? We're especially interested in OSS projects that do one of these things: Make it possible for mere mortals to scale training or fine tuning relatively affordably (when I say mere mortals I mean those of us that don't have OpenAIs billions) Make it easier and more intuitive to get unstructured data (random database tables, pdfs, websites etc) into a format that is ready to fine tune with Usable agents (I'm not a huge believer that we are ready for unsupervised AI, so this is probably one of the lowest on the list unless something out there in OSS land looks really promising) Ways to get runtime context into LLMs without having to configure or program the details of your running system so that you can ask the LLM questions like "tell me if there's anything that has changed from what I consider to be my normal healthy runtime state" New versions of build/deploy/test tooling specifically for an LLM world that make it easier to troubleshoot because they are more configuration driven and LLM friendly and less runtimey and dynamic? That list is sort of like my wish list of things I want to help accelerate LLM-driven dev, and now I'm wanting to go find OSS projects (with traction!) that are doing exactly these things. Anyone seen anything? Or failing that, anyone have good ideas of how to search for them or who to ask about them? submitted by /u/jonxtensen [link] [comments]
    [D] What skill would you learn in LLM if you are just another dev?
    Hi guys, It feels like LLM is most promising tool in ML right now, with application potential in every field. But my impression is it is like a game where some winner take it all, there is not much opportunity or chance for you if you are not one of the bests. What i mean, if you are not good enough to be researcher or dev at big tech or some boutique start-ups (Mistral, Aleph Alpha, Databricks, C3, HF), there is not much chance left, what I want want to say is, it is not a kind of skills like web dev, app dev or cloud computing which there is market need for every level of skill for, there is chance for every one, just my impression. I want to get into this ML field, especially in using of LLMs in own applications, to get a job opportunity in this field, but I'm not sure how to invest time and efforts effectively to get into this field, to be able to produce added value, especially the question: which knowledge depth is a good starter point? - Understand and able to use cloud APIs of most successful LLMs (Vertex, OpenAI, Mistral) - Understand and able to use high level frameworks LangChain, HF, fine tune open source models + host your own models with vLLM - Understanding low level concepts like Embedding, Tokenizer, Attention, Transformer etc. and able to hack and train your own models with low level frameworks pytorch or tf/keras What do you think, where do you see chances for casual dev (0815 in german) to get into the field? submitted by /u/tunggad [link] [comments]
    [N] Exciting Opportunity for AI and NLP Enthusiasts!
    We're excited to announce the 2024 edition of the CheckThat! Lab at CLEF 2024, and we cordially invite you to take part in this groundbreaking event. This year, we're stepping up with six diverse tasks, including two follow-ups and four brand-new challenges, each designed to tackle critical aspects of the fact-checking pipeline across multiple languages. 🔍 Tasks Overview: 1. Check-worthiness in Tweets: Dive into social and mainstream media to identify claims worth verifying. (Arabic, English, Dutch, Spanish) 2. Subjectivity in News Articles: Detect text requiring specific processing strategies to aid the fact-checking pipeline. (Arabic, English, German, Italian, Multilingual) 3. Persuasion Techniques: Identify text spans using persuasion to influence readers. (Arabic, Bulgarian, Eng…
    [D] Why is ViT more commonly used than SWIN?
    I'm still reading around but most every Computer Vision paper I read uses ViT as their backbone instead of SWIN or other similar architectures but why? ​ The ViT paper had to pre-train their model on the 303M image JFT dataset to beat earlier convolutional models on ImageNet whereas SWIN achieves better performance without any pre-training. I imagine SWIN would achieve comparable, if not higher performance on ImageNet if it was pre-trained the same way though admittedly I haven't seen any work to validate this idea. ​ Is this just a case of ViT being first so now everyone uses it as a default or is there another reason? submitted by /u/PM_ME_JOB_OFFER [link] [comments]
    [D] decisions for Open AI superalignment fast grants?
    Hello everyone, The deadline for this grant was on Feb. 18. Has anyone heard back? I am also not sure whether only the winners will be explicitly notified. thanks everyone in advance. submitted by /u/South-Conference-395 [link] [comments]
    [P] [N] OptScale - fully open source MLOps and FinOps platform
    GitHub link: https://github.com/hystax/optscale We’d appreciate it if you give us a Star OptScale’s features: ML Leaderboards with candidates and qualifications Dataset and model tracking and versioning Run metrics and experiment tracker Hypertuning integrated with Optuna Training launcher ML Model training profiler Variety of cloud cost optimization capabilities like RI/SP usage, VM rightsizing, Power Schedules, and cost anomaly detection. The ML functionality of OptScale is under active development and should be helpful to ML teams. Live demo: https://my.optscale.com/live-demo submitted by /u/Hystax [link] [comments]
    [P] How to train Unet for whole heart segmentation? (MRI scans, 8 classes)
    I have small computing resources (16gb GPU colab), which does not allow me to train 3d unet on high resolution scans (original resolution 512x512x160). I don’t want to break it into patches, since this is still better suited for microscopy tasks. Unet 2d cannot be trained on this data, since I need inter-slice dependencies, so I decided to train 3d unet at a small resolution of 128x128x160, this works out quite well, then I want to submit the resulting enlarged masks from 3d unet to the input as a second layer with a separate slice scan in Unet 2d with a high resolution (512x512x160) so that Unet 2d understands where the heart is and does multi-class segmentation. How adequate is this idea? I heard about the recurrent 2d model, but it seems that it will consume quite a lot of resources, because we need to keep the entire scan in memory. Thanks for any ideas and feedback. submitted by /u/ForceAffectionate245 [link] [comments]
    [D] RAG- Dimensionality reduction for embeddings
    Earlier last year around the release of GPT 4 I read about people doing dimensionality reduction on their vector embeddings, specifically principle component analysis, to make them better suited for RAG. Since then as the RAG scene has developed I haven’t seen much mention of doing this. Could anyone shed light on the merits of using dimensionality reduction for RAG? submitted by /u/BlueOrangeBerries [link] [comments]
    Can Hardware Specifications in Aircrafts support Machine Learning models? [R]
    Hi, I'm working on a project related to aircraft network security using Machine Learning Concepts and Cyber Threat database. I wanted to know details on the hardware components generally used in Military and Commercial Aircrafts to make sure my ML model and intelligence database can be accommodated in aircrafts on air without cloud dependency. It would be helpful if you could provide your insights on this idea of mine and details of the Microprocessors and Memory Storage components generally used in aircrafts. What is the amount of memory storage available in aircrafts and how much GPU is available. Will the CPU cycles be sufficient to run basic flight operations and the ML model alongside. These are some of my doubts. This information would be helpful for me to configure my AI models for network security. If you are aware of any publications or articles on the same, that would be helpful too. submitted by /u/aishwaryagm10 [link] [comments]
    [R] Pretrained diffusion U-nets for X-rays (single channel)
    Anyone know of any diffusion U-nets pretrained on X-ray data (single channel) that I could use to fine-tune? Image space or latent space - any. Preferably on mammograms, but any kind would be better than random initialization. submitted by /u/roleparacelsus [link] [comments]
    [R] Reversed Concept Drift (RCD) and algorithm for Concept Drift Detection
    All ML models are designed to do one thing: learning a probability distribution in the form of P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'. This probability distribution, P(y|X), is also called Concept. Therefore, if the Concept changes, the model may become invalid. But how do we know if there is a new Concept in our data? Or, more importantly, how do we measure if the new Concept is affecting the model's performance? Here is a clever solution where the main ingredients are a reference dataset, one where the model's performance is known, and a dataset with the latest data we would like to monitor. Step-by-Step solution: We start by training an internal model on a chunk of the latest data. -> This allows us to learn the new possible Concept presented in the data. Next, we use the internal model to make predictions on the reference dataset. We then estimate the model's performance on the reference dataset, assuming the model's predictions on the monitoring data as ground truth. If the estimated performance of the internal model and the actual monitored model are very different, we then say that there has been a Concept Drift. To quantify how this Concept impacts performance, we subtract the actual model's performance on reference from the estimated performance and report a delta of the performance metric. -> This is what the plot below shows. The change of the F1-score due to Concept drift! This process is repeated for every new chunk of data that we get. ​ https://preview.redd.it/z570r1dqmhlc1.jpg?width=2738&format=pjpg&auto=webp&s=3997aba54c71b13567bb78b8f5e4d244aa77c0b6 ​ ​ submitted by /u/santiviquez [link] [comments]
    [R] How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
    PDF: https://arxiv.org/pdf/2402.18312.pdf Findings: 1. Despite different reasoning requirements across different stages of CoT generation, the functional components of the model remain almost the same. Different neural algorithms are implemented as compositions of induction circuit-like mechanisms. Attention heads perform information movement between ontologically related (or negatively related) tokens. This information movement results in distinctly identifiable representations for such token pairs. Typically, this distinctive information movement starts from the very first layer and continues till the middle. While this phenomenon happens zero-shot, in-context examples exert pressure to quickly mix other task-specific information among tokens. Multiple different neural pathways ar…
    [P] RAG based text to code!!
    Hi all!! I am working on a project where I want to generate code from testcases written in plain english. I want code/scripts compatible with my own codebase so I want the model to have some context of my codebase. For that I was thinking RAG or finetuning. What do you think will be a better choice? And any resources if you guys can recommend for me to follow?...thanks submitted by /u/Delicious_Success303 [link] [comments]
    [D] is there a way to monitor the input prompts?
    I have a set of metrics to detect toxicity and data leakage in the input prompts. If detected, I do not want the input prompts to even reach the LLM because the prompts are bad and it is a waste of money. But I want to log these input prompts, display the raw text as well as the metrics results. Is there a tool like this? submitted by /u/Semantics777 [link] [comments]
    Using Raspberry Pi & YOLO To Determine Weights of Irregular Objects [D] [P]
    I am working on a system on a raspberry pi that uses a YOLO instance segmentation model to classify different foods and mask them. I then want to use the detected classes to find the weight of each food, and add it to a total counter. The camera will most likely be facing top down, so I am curious what the best way to find the depths of the food is. Currently, the code I have now just takes the 2d mask, so it just takes the mask of the object straight down and then finds the weight from the area of that mask. This isn't accurate because we are missing depth, and I need the system to be as accurate as possible. What are some possible low-cost, yet effective solutions I could use to find the volume of foods and not just the area. There could be multiple foods on one plate in the frame, and they would all have different shapes & sizes. They will most likely be breakfast foods, so scrambled eggs, tater tots, french toast, etc. submitted by /u/Thin-Addition6686 [link] [comments]
    [P] Speech-to-Text Benchmark: 47,638 mins transcribed per $1 on RTX3070 Ti (1000-fold cost reduction than managed services)
    Speech-to-text benchmark with Parakeet TDT 1.1B Our previous Speech-to-text benchmarks on Whisper Large V3 benchmark (11,736 mins/$) and Whisper Large V2 benchmark (1681 mins/$) generated a healthy discussion here. Next on our list of open-source STT models is Parakeet TDT 1.1B which turned out to the winner. In this benchmark, we transcribed 17,305 hours of CommonVoice (en) audio to text from 5,209,130 audio files. Benchmark results: Parakeet TDT 1.1B on a RTX 3070 Ti delivered 47,638 minutes per $1 on a distributed cloud. For 1 Million hours of audio, that costs just $1260. Comparative cost to transcribe 1 Million hours of audio with Speech-to-text APIs/managed transcription services ranges from $200K - $1,500,000. To serve ASR inference at scale, a self-managed system bui…
  • Open

    How can I get into an RL/research job from a BS in CSI and working in a robotics lab?
    As the title states, I work in a robotics lab doing machine vision and embedded development. I want to transition to an RL position, preferrably doing research. Do I have to get a masters? submitted by /u/emas_eht [link] [comments]
    How important is limiting joint range for walking?
    I have a quadruped I'm trying to teach to walk. This is a pretty long term project that I want to eventually have two modes of locomotion (walking and flying), so the proportions are kind of strange for a walker and there's an extra limb section on either front limb that's not currently being used for anything. Right now I'm giving the joints their full range of motion for both locomotion modes, but my training isn't going so well. I'm using Mujoco + MJX/Brax so training on cloud GPUs. In an MJX example, it trained a humanoid with approx the same DoF as my model to walk in about 10 mins / 30mil timesteps, but my robot is still flopping around after 30mil timesteps, even though the reward curves look decent to my untrained eye. I've tried changing the reward weights more towards forward, upright movement and also tried changing the forward reward between displacement and velocity but nothing's really worked. When using velocity reward, it seemed to learn to fling itself forward and then immediately die because it falls over. With position reward it's just not really getting anywhere. So I'm wondering if I need to further restrict the joint ranges to just what's needed for walking. I'm also going to try reducing the actuator torque for all joints, I'm thinking right now they might have too much torque for fine control. Policy-wise all I've tried is PPO so far but by all accounts I've read that shouldn't be causing this type of issue. Any and all thoughts/advice welcome! I'm totally new to RL so I'm sure I'm making a few easily fixable newbie mistakes, just need to figure out what they are. Model Reward curves Project repo submitted by /u/AnAngryBirdMan [link] [comments]
    Pistonball training with RLlib
    Hi guys, I am using RLlib to train the Pistonball environment using PPO with the exact same code as in https://pettingzoo.farama.org/tutorials/rllib/pistonball/ I am using a computational grid that satisfies all of the resources requirements (>=5CPU). ​ Yet, I find the training is quite long. It takes like 1h for just one iteration and it is supposed to go up to 50000 iterations... It is likely there is something wrong in my configuration... ​ Has anyone tried to run the Pistonball code with all of the necessary computational resources ? Could be great to have some known information about the expected training duration just for comparison... submitted by /u/Sudden-Trainer-4046 [link] [comments]
    Enviroment to train a self drive robot
    i want to emulate simple robot that have ability to avoid obstacles but dont now what kind enviroment i have to use , should robo-gym or ros2? submitted by /u/sigma_ks [link] [comments]
    Question on upcoming conferences
    Hey Everyone, I just wanted to ask are there any conferences aside from CORL* that happening where the submission is due the summer and the conference will be held late the Fall? The submision for CORL is Maximum June 6th but I wanted to know are there any others that the deadline is after that (sometime n the summer) but the conference is still by the end of the year (max). *My hope is to submit to CORL but I just want to no as a backup if deadline can't be made is there anything else? submitted by /u/StrangerNo9431 [link] [comments]
    Struggling to Train A2C Model on STOCKS_GOOGL Data - Issues with Entropy and Value Loss Convergence
    I've been working on a project where I'm attempting to use the A2C (Advantage Actor-Critic) algorithm from stable_baselines3 to train a model on STOCKS_GOOGL data, aiming to predict stock price movements. However, I've encountered persistent issues with the convergence of the entropy_loss and value_loss during training. Details: Dataset: I'm using the STOCKS_GOOGL dataset, which contains historical stock price data for Google (GOOGL). Model Architecture: I've set up the A2C model with a default MlpPolicyand made sure that the input dimensions match the dataset. Training Process: I've initialized the A2C model and attempted to train it using the provided dataset. However, I'm facing challenges with the convergence of the entropy_loss and value_loss. Observations: The entropy_loss and v…
    Difference between ordinary and weighted importance sampling
    I am following the RL textbook by Sutton and Barto and currently learning about the off-policy evaluation where we have two policies, one being the target policy (deterministic) and the other being the behavior policy (more exploratory), and I am unable to understand how does weighted importance sampling (where we weight every return by "rho" for that return upon the total sum of "rhos") ever converge to the true value of target policy even if we take a large number of steps. I can understand that the ordinary importance sampling will always converge to target policy value functions since they are multiplied by "rho", but I dont understand how the same can be said for weighted importance sampling given that the expectation of weighted importance sampling is going to be that of the behavior policy submitted by /u/Unlikely_Spread_2618 [link] [comments]
    Dynamic observation space
    What’s the best practice for when your observation space size changes dynamically? Some options: 1. Define max size and pad with zeros missing observations 2. Define small max size such that you always throw some observations (you always have more observations than fit into your allowed size) 3. Compress the dynamic size vector into fixed length using for example a DNN. 4. Anything else? Advantages / disadvantages? submitted by /u/CuriousDolphin1 [link] [comments]
    AI-driven meme creation
    We released an AI-driven meme generator for data engineering enthusiasts at Qbeast. This fun project helped us learn how to fine-tune AI models and customize datasets for humor. We're excited to share our experience with other enthusiasts who are interested in merging technology and creativity. Check out the story at https://qbeast.io/qbeasts-adventure-in-ai-driven-meme-creation/. submitted by /u/QbeastIO [link] [comments]
    Consistency of weighted Per-decision importance sampling
    In section 5.9 Sutton and Barto, the authors claim that there does not exist a consistent weighted per-decision IS estimator however, in the original paper by Precup (https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs), the following claim is made: "This weighted per-decision importance sampling estimator is consistent but biased, just like the weighted importance sampling estimator". Admittedly, no proof of consistency for the weighted PD estimator is provided in the original paper. Does anyone know whether the previous statement was diss-proven/have any resources demonstrating that the suggested weighted PD estimator from the original paper is inconistent? Cheers! submitted by /u/bean_the_great [link] [comments]
    What RL algorithm would you use for a recommendation algorithm?
    What algo/reward function you use to increase retention or engagement for an app like Tinder or Pinterest for example. submitted by /u/simoo42 [link] [comments]
    Help with Reinforcement Learning Model for Replicating Game of Life Esque Board Game
    Hi everyone, I'm working on implementing a reinforcement learning model for a game of life-like game where two players start with a single piece on a 64 by 64 board. In each step, players can choose to attack an unoccupied square, which replicates their piece. For example, starting with one piece, attacking a square creates two pieces, those two pieces can attack empty squares and create 4 pieces and so on. Each player choose their moves, and they execute at the same time. Multiple pieces can attack the same square, and the winner is determined by the most pieces attacking the square, otherwise it's random decision if it's an equal split. I'm currently facing challenges with designing the reinforcement learning algorithm for this game. I'm struggling with the state representation and rew…
  • Open

    Brain surgery training from an avatar
    MIT.nano Immersion Lab works with AR/VR startup to create transcontinental medical instruction.  ( 6 min )
  • Open

    ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics
    Molecular geometry modeling is a powerful tool for understanding the intricate relationships between molecular structure and biological activity – a field known as structure-activity relationships (SAR). The main premise of SAR is that the biological activity of a molecule is dictated by its specific chemical structure, not only the connections between nuclei but also how […] The post ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics appeared first on Microsoft Research.  ( 13 min )
    Abstracts: February 29, 2024
    Can how we think about our thinking help us better incorporate generative AI in our lives & work? Explore metacognition’s potential to improve the tech’s usability on “Abstracts,” then sign up for Microsoft Research Forum for more on this & other AI work. The post Abstracts: February 29, 2024 appeared first on Microsoft Research.  ( 16 min )
  • Open

    New Book: Statistical Optimization for GenAI and Machine Learning
    This book is for participants in my AI and machine learning certification program. However, it is now free and available to everyone. With tutorials, enterprise-grade projects and solutions, it covers state-of-the-art material on topics such as generative adversarial networks (GAN), specialized LLM, data synthetization, as well as classical machine learning. It is a work in… Read More »New Book: Statistical Optimization for GenAI and Machine Learning The post New Book: Statistical Optimization for GenAI and Machine Learning appeared first on Data Science Central.  ( 21 min )
    Optimizing logistics operations with technology implementation
    In today’s fast-paced business terrain, optimizing logistics operations has become more pivotal than ever. With increasing competition and rising client prospects, companies are under pressure to deliver products directly and quickly. The capability to effectively manage and streamline the movement of goods and products has a direct impact on client satisfaction, cost-effectiveness, and overall functional… Read More »Optimizing logistics operations with technology implementation The post Optimizing logistics operations with technology implementation appeared first on Data Science Central.  ( 22 min )
  • Open

    Using AI to fight AI?
    Hey. With AI becoming more and more prevalent and artists concerned about having their art stolen by AI, do you think that someone could design a AI software to protect art from getting scraped. Like Glaze and Nightshade but it uses AI to work faster and be more efficient. Do you think such a software could be created and do you think artists would use it even though it is driven by AI itself? submitted by /u/furrynoy96 [link] [comments]
    What are examples of questions ChatGPT 4 still can't solve?
    What are examples of questions ChatGPT 4 still can't solve? ​ submitted by /u/useriogz [link] [comments]
    Zuckerberg Meets LG, Samsung Chiefs in Seoul as Meta Ramps Up AI Ambitions
    submitted by /u/BubblyMcnutty [link] [comments]
    Has someone made a tool that takes in audio and makes music out of it? That is, if I'm like, tapping on the desk and making a cool beat, can I get it turned into a drum track? Or if I hum something, turn it into another instrument?
    And if this doesn't exist, why not? Is the AI in audio not there yet, or someone hasn't made a product out of it? submitted by /u/cryptoz [link] [comments]
  • Open

    Use RAG for drug discovery with Knowledge Bases for Amazon Bedrock
    Amazon Bedrock provides a broad range of models from Amazon and third-party providers, including Anthropic, AI21, Meta, Cohere, and Stability AI, and covers a wide range of use cases, including text and image generation, embedding, chat, high-level agents with reasoning and orchestration, and more. Knowledge Bases for Amazon Bedrock allows you to build performant and […]  ( 14 min )
    Unlock personalized experiences powered by AI using Amazon Personalize and Amazon OpenSearch Service
    OpenSearch is a scalable, flexible, and extensible open source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. Amazon OpenSearch Service is a fully managed service that makes it straightforward to deploy, scale, and operate OpenSearch in the AWS Cloud. OpenSearch uses a probabilistic ranking framework called BM-25 […]  ( 12 min )
    Automate Amazon SageMaker Pipelines DAG creation
    Creating scalable and efficient machine learning (ML) pipelines is crucial for streamlining the development, deployment, and management of ML models. In this post, we present a framework for automating the creation of a directed acyclic graph (DAG) for Amazon SageMaker Pipelines based on simple configuration files. The framework code and examples presented here only cover […]  ( 13 min )
    Accelerating large-scale neural network training on CPUs with ThirdAI and AWS Graviton
    This guest post is written by Vihan Lakshman, Tharun Medini, and Anshumali Shrivastava from ThirdAI. Large-scale deep learning has recently produced revolutionary advances in a vast array of fields. Although this stunning progress in artificial intelligence remains remarkable, the financial costs and energy consumption required to train these models has emerged as a critical bottleneck […]  ( 9 min )
    Supercharge your AI team with Amazon SageMaker Studio: A comprehensive view of Deutsche Bahn’s AI platform transformation
    AI’s growing influence in large organizations brings crucial challenges in managing AI platforms. These include developing a scalable and operationally efficient platform that adheres to organizational compliance and security standards. Amazon SageMaker Studio offers a comprehensive set of capabilities for machine learning (ML) practitioners and data scientists. These include a fully managed AI development environment […]  ( 12 min )
  • Open

    Bounding the perimeter of a triangle between circles
    Suppose you have a triangle and you know the size of the largest circle that can fit inside (the incircle) and the size of the smallest circle that can fit outside (the circumcircle). How would you estimate the perimeter of the triangle? In terms of the figure below, if you know the circumference of the […] Bounding the perimeter of a triangle between circles first appeared on John D. Cook.  ( 5 min )
  • Open

    Battle.net Leaps Into the Cloud With GeForce NOW
    GFN Thursday celebrates this leap day with the addition of a popular game store to the cloud. Stream the first titles from Blizzard Entertainment’s Battle.net, including Diablo IV, Overwatch 2, Call of Duty HQ and Hearthstone, now playable across more devices than ever. They’re all part of the 30 new games coming to GeForce NOW Read Article  ( 6 min )
  • Open

    Focus Your Attention (with Adaptive IIR Filters)
    EMNLP 2023: https://aclanthology.org/2023.emnlp-main.772/ arXiv: https://arxiv.org/abs/2305.14952 OpenReview: https://openreview.net/forum?id=DlQeSfGYfS Abstract: We present a new layer in which dynamic (i.e., input-dependent) Infinite Impulse Response (IIR) filters of order two are used to process the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The new layer is grounded in control theory, and is shown to generalize diagonal state-space layers. The layer performs on-par with state-of-the-art networks, with a fraction of their parameters and with time complexity that is sub-quadratic with input size. The obtained layer is favorable to layers such as Hyena, GPT2, and Mega, both with respect to the number of parameters and the obtained level of performance on multiple long-range sequence problems. submitted by /u/SunsetOneSix [link] [comments]
  • Open

    All the Feels: A dexterous hand with large-area tactile sensing
    arXiv:2210.15658v3 Announce Type: replace-cross Abstract: High cost and lack of reliability has precluded the widespread adoption of dexterous hands in robotics. Furthermore, the lack of a viable tactile sensor capable of sensing over the entire area of the hand impedes the rich, low-level feedback that would improve learning of dexterous manipulation skills. This paper introduces an inexpensive, modular, robust, and scalable platform -- the DManus -- aimed at resolving these challenges while satisfying the large-scale data collection capabilities demanded by deep robot learning paradigms. Studies on human manipulation point to the criticality of low-level tactile feedback in performing everyday dexterous tasks. The DManus comes with ReSkin sensing on the entire surface of the palm as well as the fingertips. We demonstrate effectiveness of the fully integrated system in a tactile aware task -- bin picking and sorting. Code, documentation, design files, detailed assembly instructions, trained models, task videos, and all supplementary materials required to recreate the setup can be found on https://sites.google.com/view/roboticsbenchmarks/platforms/dmanus.  ( 2 min )
    Nemotron-4 15B Technical Report
    arXiv:2402.16819v2 Announce Type: replace-cross Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.  ( 3 min )
    Multidimensional unstructured sparse recovery via eigenmatrix
    arXiv:2402.17215v1 Announce Type: cross Abstract: This note considers the multidimensional unstructured sparse recovery problems. Examples include Fourier inversion and sparse deconvolution. The eigenmatrix is a data-driven construction with desired approximate eigenvalues and eigenvectors proposed for the one-dimensional problems. This note extends the eigenmatrix approach to multidimensional problems. Numerical results are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Dual-Space Optimization: Improved Molecule Sequence Design by Latent Prompt Transformer
    arXiv:2402.17179v1 Announce Type: new Abstract: Designing molecules with desirable properties, such as drug-likeliness and high binding affinities towards protein targets, is a challenging problem. In this paper, we propose the Dual-Space Optimization (DSO) method that integrates latent space sampling and data space selection to solve this problem. DSO iteratively updates a latent space generative model and a synthetic dataset in an optimization process that gradually shifts the generative model and the synthetic data towards regions of desired property values. Our generative model takes the form of a Latent Prompt Transformer (LPT) where the latent vector serves as the prompt of a causal transformer. Our extensive experiments demonstrate effectiveness of the proposed method, which sets new performance benchmarks across single-objective, multi-objective and constrained molecule design tasks.  ( 2 min )
    Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity
    arXiv:2308.00177v3 Announce Type: replace Abstract: On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics -- sometimes by as much as $38\%$ -- both overall and on outliers.  ( 2 min )
    An Online Bootstrap for Time Series
    arXiv:2310.19683v2 Announce Type: replace-cross Abstract: Resampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications. This method is based on an autoregressive sequence of increasingly dependent resampling weights. We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments.  ( 2 min )
    Evaluation of Predictive Reliability to Foster Trust in Artificial Intelligence. A case study in Multiple Sclerosis
    arXiv:2402.17554v1 Announce Type: new Abstract: Applying Artificial Intelligence (AI) and Machine Learning (ML) in critical contexts, such as medicine, requires the implementation of safety measures to reduce risks of harm in case of prediction errors. Spotting ML failures is of paramount importance when ML predictions are used to drive clinical decisions. ML predictive reliability measures the degree of trust of a ML prediction on a new instance, thus allowing decision-makers to accept or reject it based on its reliability. To assess reliability, we propose a method that implements two principles. First, our approach evaluates whether an instance to be classified is coming from the same distribution of the training set. To do this, we leverage Autoencoders (AEs) ability to reconstruct the training set with low error. An instance is considered Out-of-Distribution (OOD) if the AE reconstructs it with a high error. Second, it is evaluated whether the ML classifier has good performances on samples similar to the newly classified instance by using a proxy model. We show that this approach is able to assess reliability both in a simulated scenario and on a model trained to predict disease progression of Multiple Sclerosis patients. We also developed a Python package, named relAI, to embed reliability measures into ML pipelines. We propose a simple approach that can be used in the deployment phase of any ML model to suggest whether to trust predictions or not. Our method holds the promise to provide effective support to clinicians by spotting potential ML failures during deployment.  ( 3 min )
    Inpainting Computational Fluid Dynamics with Deep Learning
    arXiv:2402.17185v1 Announce Type: new Abstract: Fluid data completion is a research problem with high potential benefit for both experimental and computational fluid dynamics. An effective fluid data completion method reduces the required number of sensors in a fluid dynamics experiment, and allows a coarser and more adaptive mesh for a Computational Fluid Dynamics (CFD) simulation. However, the ill-posed nature of the fluid data completion problem makes it prohibitively difficult to obtain a theoretical solution and presents high numerical uncertainty and instability for a data-driven approach (e.g., a neural network model). To address these challenges, we leverage recent advancements in computer vision, employing the vector quantization technique to map both complete and incomplete fluid data spaces onto discrete-valued lower-dimensional representations via a two-stage learning procedure. We demonstrated the effectiveness of our approach on Kolmogorov flow data (Reynolds number: 1000) occluded by masks of different size and arrangement. Experimental results show that our proposed model consistently outperforms benchmark models under different occlusion settings in terms of point-wise reconstruction accuracy as well as turbulent energy spectrum and vorticity distribution.  ( 2 min )
    DeepDRK: Deep Dependency Regularized Knockoff for Feature Selection
    arXiv:2402.17176v1 Announce Type: new Abstract: Model-X knockoff, among various feature selection methods, received much attention recently due to its guarantee on false discovery rate (FDR) control. Subsequent to its introduction in parametric design, knockoff is advanced to handle arbitrary data distributions using deep learning-based generative modeling. However, we observed that current implementations of the deep Model-X knockoff framework exhibit limitations. Notably, the "swap property" that knockoffs necessitate frequently encounter challenges on sample level, leading to a diminished selection power. To overcome, we develop "Deep Dependency Regularized Knockoff (DeepDRK)", a distribution-free deep learning method that strikes a balance between FDR and power. In DeepDRK, a generative model grounded in a transformer architecture is introduced to better achieve the "swap property". Novel efficient regularization techniques are also proposed to reach higher power. Our model outperforms other benchmarks in synthetic, semi-synthetic, and real-world data, especially when sample size is small and data distribution is complex.  ( 2 min )
    TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation
    arXiv:2402.17156v1 Announce Type: new Abstract: Designing protein sequences with specific biological functions and structural stability is crucial in biology and chemistry. Generative models already demonstrated their capabilities for reliable protein design. However, previous models are limited to the unconditional generation of protein sequences and lack the controllable generation ability that is vital to biological tasks. In this work, we propose TaxDiff, a taxonomic-guided diffusion model for controllable protein sequence generation that combines biological species information with the generative capabilities of diffusion models to generate structurally stable proteins within the sequence space. Specifically, taxonomic control information is inserted into each layer of the transformer block to achieve fine-grained control. The combination of global and local attention ensures the sequence consistency and structural foldability of taxonomic-specific proteins. Extensive experiments demonstrate that TaxDiff can consistently achieve better performance on multiple protein sequence generation benchmarks in both taxonomic-guided controllable generation and unconditional generation. Remarkably, the sequences generated by TaxDiff even surpass those produced by direct-structure-generation models in terms of confidence based on predicted structures and require only a quarter of the time of models based on the diffusion model. The code for generating proteins and training new versions of TaxDiff is available at:https://github.com/Linzy19/TaxDiff.  ( 2 min )
    Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials
    arXiv:2402.17003v1 Announce Type: new Abstract: Online reinforcement learning (RL) algorithms offer great potential for personalizing treatment for participants in clinical trials. However, deploying an online, autonomous algorithm in the high-stakes healthcare setting makes quality control and data quality especially difficult to achieve. This paper proposes algorithm fidelity as a critical requirement for deploying online RL algorithms in clinical trials. It emphasizes the responsibility of the algorithm to (1) safeguard participants and (2) preserve the scientific utility of the data for post-trial analyses. We also present a framework for pre-deployment planning and real-time monitoring to help algorithm developers and clinical researchers ensure algorithm fidelity. To illustrate our framework's practical application, we present real-world examples from the Oralytics clinical trial. Since Spring 2023, this trial successfully deployed an autonomous, online RL algorithm to personalize behavioral interventions for participants at risk for dental disease.  ( 2 min )
    Discovering Symmetry Group Structures via Implicit Orthogonality Bias
    arXiv:2402.17002v1 Announce Type: new Abstract: We introduce the HyperCube network, a novel approach for autonomously discovering symmetry group structures within data. The key innovation is a unique factorization architecture coupled with a novel regularizer that instills a powerful inductive bias towards learning orthogonal representations. This leverages a fundamental theorem of representation theory that all compact/finite groups can be represented by orthogonal matrices. HyperCube efficiently learns general group operations from partially observed data, successfully recovering complete operation tables. Remarkably, the learned factors correspond directly to exact matrix representations of the underlying group. Moreover, these factors capture the group's complete set of irreducible representations, forming the generalized Fourier basis for performing group convolutions. In extensive experiments with both group and non-group symbolic operations, HyperCube demonstrates a dramatic 100-1000x improvement in training speed and 2-10x greater sample efficiency compared to the Transformer baseline. These results suggest that our approach unlocks a new class of deep learning models capable of harnessing inherent symmetries within data, leading to significant improvements in performance and broader applicability.  ( 2 min )
    Avoiding Catastrophic Forgetting in Visual Classification Using Human Concept Formation
    arXiv:2402.16933v1 Announce Type: new Abstract: Deep neural networks have excelled in machine learning, particularly in vision tasks, however, they often suffer from catastrophic forgetting when learning new tasks sequentially. In this work, we propose Cobweb4V, a novel visual classification approach that builds on Cobweb, a human like learning system that is inspired by the way humans incrementally learn new concepts over time. In this research, we conduct a comprehensive evaluation, showcasing the proficiency of Cobweb4V in learning visual concepts, requiring less data to achieve effective learning outcomes compared to traditional methods, maintaining stable performance over time, and achieving commendable asymptotic behavior, without catastrophic forgetting effects. These characteristics align with learning strategies in human cognition, positioning Cobweb4V as a promising alternative to neural network approaches.  ( 2 min )
    Personalized Federated Instruction Tuning via Neural Architecture Search
    arXiv:2402.16919v1 Announce Type: new Abstract: Federated Instruction Tuning (FIT) has shown the ability to achieve collaborative model instruction tuning among massive data owners without sharing private data. However, it still faces two key challenges, i.e., data and resource heterogeneity. Due to the varying data distribution and preferences among data owners, FIT cannot adapt to the personalized data of individual owners. Moreover, clients with superior computational abilities are constrained since they need to maintain the same fine-tuning architecture as the weaker clients. To address these issues, we propose a novel Personalized Federated Instruction Tuning (PerFIT) framework based on architecture search. Specifically, PerFIT allows each client to search for a personalized architecture by expanding the trainable parameter space of the global model followed by pruning the parameters to the original state. This procedure allows personalized instruction fine-tuning within expanded parameter spaces, concurrently preserving the same number of trainable parameters. Furthermore, to release the abilities of heterogeneous computational resources and enhance the performance of personalization on local data, we exploit personalized parameter-wise aggregation. The evaluation with multiple LLMs non-IID scenarios demonstrates that compared to the state-of-the-art FIT methods, our approach can achieve up to a 23% decrease in perplexity.  ( 2 min )
    m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers
    arXiv:2402.16918v1 Announce Type: new Abstract: Modular neural architectures are gaining increasing attention due to their powerful capability for generalization and sample-efficient adaptation to new domains. However, training modular models, particularly in the early stages, poses challenges due to the optimization difficulties arising from their intrinsic sparse connectivity. Leveraging the knowledge from monolithic models, using techniques such as knowledge distillation, is likely to facilitate the training of modular models and enable them to integrate knowledge from multiple models pretrained on diverse sources. Nevertheless, conventional knowledge distillation approaches are not tailored to modular models and can fail when directly applied due to the unique architectures and the enormous number of parameters involved. Motivated by these challenges, we propose a general module-to-module knowledge distillation (m2mKD) method for transferring knowledge between modules. Our approach involves teacher modules split from a pretrained monolithic model, and student modules of a modular model. m2mKD separately combines these modules with a shared meta model and encourages the student module to mimic the behaviour of the teacher module. We evaluate the effectiveness of m2mKD on two distinct modular neural architectures: Neural Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE). By applying m2mKD to NACs, we achieve significant improvements in IID accuracy on Tiny-ImageNet (up to 5.6%) and OOD robustness on Tiny-ImageNet-R (up to 4.2%). On average, we observe a 1% gain in both ImageNet and ImageNet-R. The V-MoE-Base model trained using m2mKD also achieves 3.5% higher accuracy than end-to-end training on ImageNet. The experimental results demonstrate that our method offers a promising solution for connecting modular networks with pretrained monolithic models. Code is available at https://github.com/kamanphoebe/m2mKD.  ( 3 min )
    More Than Routing: Joint GPS and Route Modeling for Refine Trajectory Representation Learning
    arXiv:2402.16915v1 Announce Type: new Abstract: Trajectory representation learning plays a pivotal role in supporting various downstream tasks. Traditional methods in order to filter the noise in GPS trajectories tend to focus on routing-based methods used to simplify the trajectories. However, this approach ignores the motion details contained in the GPS data, limiting the representation capability of trajectory representation learning. To fill this gap, we propose a novel representation learning framework that Joint GPS and Route Modelling based on self-supervised technology, namely JGRM. We consider GPS trajectory and route as the two modes of a single movement observation and fuse information through inter-modal information interaction. Specifically, we develop two encoders, each tailored to capture representations of route and GPS trajectories respectively. The representations from the two modalities are fed into a shared transformer for inter-modal information interaction. Eventually, we design three self-supervised tasks to train the model. We validate the effectiveness of the proposed method on two real datasets based on extensive experiments. The experimental results demonstrate that JGRM outperforms existing methods in both road segment representation and trajectory representation tasks. Our source code is available at Anonymous Github.  ( 2 min )
    PDETime: Rethinking Long-Term Multivariate Time Series Forecasting from the perspective of partial differential equations
    arXiv:2402.16913v1 Announce Type: new Abstract: Recent advancements in deep learning have led to the development of various models for long-term multivariate time-series forecasting (LMTF), many of which have shown promising results. Generally, the focus has been on historical-value-based models, which rely on past observations to predict future series. Notably, a new trend has emerged with time-index-based models, offering a more nuanced understanding of the continuous dynamics underlying time series. Unlike these two types of models that aggregate the information of spatial domains or temporal domains, in this paper, we consider multivariate time series as spatiotemporal data regularly sampled from a continuous dynamical system, which can be represented by partial differential equations (PDEs), with the spatial domain being fixed. Building on this perspective, we present PDETime, a novel LMTF model inspired by the principles of Neural PDE solvers, following the encoding-integration-decoding operations. Our extensive experimentation across seven diverse real-world LMTF datasets reveals that PDETime not only adapts effectively to the intrinsic spatiotemporal nature of the data but also sets new benchmarks, achieving state-of-the-art results  ( 2 min )
    Trustworthy Personalized Bayesian Federated Learning via Posterior Fine-Tune
    arXiv:2402.16911v1 Announce Type: new Abstract: Performance degradation owing to data heterogeneity and low output interpretability are the most significant challenges faced by federated learning in practical applications. Personalized federated learning diverges from traditional approaches, as it no longer seeks to train a single model, but instead tailors a unique personalized model for each client. However, previous work focused only on personalization from the perspective of neural network parameters and lack of robustness and interpretability. In this work, we establish a novel framework for personalized federated learning, incorporating Bayesian methodology which enhances the algorithm's ability to quantify uncertainty. Furthermore, we introduce normalizing flow to achieve personalization from the parameter posterior perspective and theoretically analyze the impact of normalizing flow on out-of-distribution (OOD) detection for Bayesian neural networks. Finally, we evaluated our approach on heterogeneous datasets, and the experimental results indicate that the new algorithm not only improves accuracy but also outperforms the baseline significantly in OOD detection due to the reliable output of the Bayesian approach.  ( 2 min )
    Impact of Physical Activity on Quality of Life During Pregnancy: A Causal ML Approach
    arXiv:2402.16909v1 Announce Type: new Abstract: The concept of Quality of Life (QoL) refers to a holistic measurement of an individual's well-being, incorporating psychological and social aspects. Pregnant women, especially those with obesity and stress, often experience lower QoL. Physical activity (PA) has shown the potential to enhance the QoL. However, pregnant women who are overweight and obese rarely meet the recommended level of PA. Studies have investigated the relationship between PA and QoL during pregnancy using correlation-based approaches. These methods aim to discover spurious correlations between variables rather than causal relationships. Besides, the existing methods mainly rely on physical activity parameters and neglect the use of different factors such as maternal (medical) history and context data, leading to biased estimates. Furthermore, the estimations lack an understanding of mediators and counterfactual scenarios that might affect them. In this paper, we investigate the causal relationship between being physically active (treatment variable) and the QoL (outcome) during pregnancy and postpartum. To estimate the causal effect, we develop a Causal Machine Learning method, integrating causal discovery and causal inference components. The data for our investigation is derived from a long-term wearable-based health monitoring study focusing on overweight and obese pregnant women. The machine learning (meta-learner) estimation technique is used to estimate the causal effect. Our result shows that performing adequate physical activity during pregnancy and postpartum improves the QoL by units of 7.3 and 3.4 on average in physical health and psychological domains, respectively. In the final step, four refutation analysis techniques are employed to validate our estimation.  ( 3 min )
    A novel data generation scheme for surrogate modelling with deep operator networks
    arXiv:2402.16903v1 Announce Type: new Abstract: Operator-based neural network architectures such as DeepONets have emerged as a promising tool for the surrogate modeling of physical systems. In general, towards operator surrogate modeling, the training data is generated by solving the PDEs using techniques such as Finite Element Method (FEM). The computationally intensive nature of data generation is one of the biggest bottleneck in deploying these surrogate models for practical applications. In this study, we propose a novel methodology to alleviate the computational burden associated with training data generation for DeepONets. Unlike existing literature, the proposed framework for data generation does not use any partial differential equation integration strategy, thereby significantly reducing the computational cost associated with generating training dataset for DeepONet. In the proposed strategy, first, the output field is generated randomly, satisfying the boundary conditions using Gaussian Process Regression (GPR). From the output field, the input source field can be calculated easily using finite difference techniques. The proposed methodology can be extended to other operator learning methods, making the approach widely applicable. To validate the proposed approach, we employ the heat equations as the model problem and develop the surrogate model for numerous boundary value problems.  ( 2 min )
    A prior Estimates for Deep Residual Network in Continuous-time Reinforcement Learning
    arXiv:2402.16899v1 Announce Type: new Abstract: Deep reinforcement learning excels in numerous large-scale practical applications. However, existing performance analyses ignores the unique characteristics of continuous-time control problems, is unable to directly estimate the generalization error of the Bellman optimal loss and require a boundedness assumption. Our work focuses on continuous-time control problems and proposes a method that is applicable to all such problems where the transition function satisfies semi-group and Lipschitz properties. Under this method, we can directly analyze the \emph{a priori} generalization error of the Bellman optimal loss. The core of this method lies in two transformations of the loss function. To complete the transformation, we propose a decomposition method for the maximum operator. Additionally, this analysis method does not require a boundedness assumption. Finally, we obtain an \emph{a priori} generalization error without the curse of dimensionality.  ( 2 min )
    Chaotic attractor reconstruction using small reservoirs - the influence of topology
    arXiv:2402.16888v1 Announce Type: new Abstract: Forecasting timeseries based upon measured data is needed in a wide range of applications and has been the subject of extensive research. A particularly challenging task is the forecasting of timeseries generated by chaotic dynamics. In recent years reservoir computing has been shown to be an effective method of forecasting chaotic dynamics and reconstructing chaotic attractors from data. In this work strides are made toward smaller and lower complexity reservoirs with the goal of improved hardware implementability and more reliable production of adequate surrogate models. We show that a reservoir of uncoupled nodes more reliably produces long term timeseries predictions than complex reservoir topologies. We then link the improved attractor reconstruction of the uncoupled reservoir with smaller spectral radii of the resulting surrogate systems. These results indicate that, the node degree plays an important role in determining whether the desired dynamics will be stable in the autonomous surrogate system which is attained via closed-loop operation of the trained reservoir. In terms of hardware implementability, uncoupled nodes would allow for greater freedom in the hardware architecture because no complex coupling setups are needed and because, for uncoupled nodes, the system response is equivalent for space and time multiplexing.  ( 2 min )
    Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation
    arXiv:2402.16889v1 Announce Type: new Abstract: As machine- and AI-generated content proliferates, protecting the intellectual property of generative models has become imperative, yet verifying data ownership poses formidable challenges, particularly in cases of unauthorized reuse of generated data. The challenge of verifying data ownership is further amplified by using Machine Learning as a Service (MLaaS), which often functions as a black-box system. Our work is dedicated to detecting data reuse from even an individual sample. Traditionally, watermarking has been leveraged to detect AI-generated content. However, unlike watermarking techniques that embed additional information as triggers into models or generated content, potentially compromising output quality, our approach identifies latent fingerprints inherently present within the outputs through re-generation. We propose an explainable verification procedure that attributes data ownership through re-generation, and further amplifies these fingerprints in the generative models through iterative data re-generation. This methodology is theoretically grounded and demonstrates viability and robustness using recent advanced text and image generative models. Our methodology is significant as it goes beyond protecting the intellectual property of APIs and addresses important issues such as the spread of misinformation and academic misconduct. It provides a useful tool to ensure the integrity of sources and authorship, expanding its application in different scenarios where authenticity and ownership verification are essential.  ( 2 min )
    BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
    arXiv:2402.16880v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1, and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at \href{https://github.com/OpenGVLab/LLMPrune-BESA}{here}.  ( 2 min )
    Language Agents as Optimizable Graphs
    arXiv:2402.16823v2 Announce Type: replace-cross Abstract: Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs can be recursively combined into larger composite graphs representing hierarchies of inter-agent collaboration (where edges connect operations of different agents). Our novel automatic graph optimizers (1) refine node-level LLM prompts (node optimization) and (2) improve agent orchestration by changing graph connectivity (edge optimization). Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. The code can be found at https://github.com/metauto-ai/gptswarm.  ( 2 min )
    Pretrained Visual Uncertainties
    arXiv:2402.16569v2 Announce Type: replace-cross Abstract: Accurate uncertainty estimation is vital to trustworthy machine learning, yet uncertainties typically have to be learned for each task anew. This work introduces the first pretrained uncertainty modules for vision models. Similar to standard pretraining this enables the zero-shot transfer of uncertainties learned on a large pretraining dataset to specialized downstream datasets. We enable our large-scale pretraining on ImageNet-21k by solving a gradient conflict in previous uncertainty modules and accelerating the training by up to 180x. We find that the pretrained uncertainties generalize to unseen datasets. In scrutinizing the learned uncertainties, we find that they capture aleatoric uncertainty, disentangled from epistemic components. We demonstrate that this enables safe retrieval and uncertainty-aware dataset visualization. To encourage applications to further problems and domains, we release all pretrained checkpoints and code under https://github.com/mkirchhof/url .  ( 2 min )
    How Good is ChatGPT at Face Biometrics? A First Look into Recognition, Soft Biometrics, and Explainability
    arXiv:2401.13641v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) such as GPT developed by OpenAI, have already shown astonishing results, introducing quick changes in our society. This has been intensified by the release of ChatGPT which allows anyone to interact in a simple conversational way with LLMs, without any experience in the field needed. As a result, ChatGPT has been rapidly applied to many different tasks such as code- and song-writer, education, virtual assistants, etc., showing impressive results for tasks for which it was not trained (zero-shot learning). The present study aims to explore the ability of ChatGPT, based on the recent GPT-4 multimodal LLM, for the task of face biometrics. In particular, we analyze the ability of ChatGPT to perform tasks such as face verification, soft-biometrics estimation, and explainability of the results. ChatGPT could be very valuable to further increase the explainability and transparency of automatic decisions in human scenarios. Experiments are carried out in order to evaluate the performance and robustness of ChatGPT, using popular public benchmarks and comparing the results with state-of-the-art methods in the field. The results achieved in this study show the potential of LLMs such as ChatGPT for face biometrics, especially to enhance explainability. For reproducibility reasons, we release all the code in GitHub.  ( 3 min )
    How Far Can 100 Samples Go? Unlocking Overall Zero-Shot Multilingual Translation via Tiny Multi-Parallel Data
    arXiv:2401.12413v2 Announce Type: replace-cross Abstract: Zero-shot translation aims to translate between language pairs not seen during training in Multilingual Machine Translation (MMT) and is largely considered an open problem. A common, albeit resource-consuming, solution is to add as many related translation directions as possible to the training corpus. In this paper, we show that for an English-centric model, surprisingly large zero-shot improvements can be achieved by simply fine-tuning with a very small amount of multi-parallel data. For example, on the EC30 dataset, we obtain up to +21.7 ChrF non-English overall improvements (870 directions) by using only 100 multi-parallel samples while preserving English-centric translation quality. When investigating the size effect of fine-tuning data and its transfer capabilities, we found that already a small, randomly sampled set of fine-tuning directions is sufficient to achieve comparable improvements. The resulting non-English performance is close to the complete translation upper bound. Even in a minimal setting -- fine-tuning with only one single sample -- the well-known off-target issue is almost completely resolved, explaining parts--but not all -- of the observed improvements in translation quality.  ( 2 min )
    Anatomy of Neural Language Models
    arXiv:2401.03797v2 Announce Type: replace-cross Abstract: The fields of generative AI and transfer learning have experienced remarkable advancements in recent years especially in the domain of Natural Language Processing (NLP). Transformers have been at the heart of these advancements where the cutting-edge transformer-based Language Models (LMs) have led to new state-of-the-art results in a wide spectrum of applications. While the number of research works involving neural LMs is exponentially increasing, their vast majority are high-level and far from self-contained. Consequently, a deep understanding of the literature in this area is a tough task especially in the absence of a unified mathematical framework explaining the main types of neural LMs. We address the aforementioned problem in this tutorial where the objective is to explain neural LMs in a detailed, simplified and unambiguous mathematical framework accompanied by clear graphical illustrations. Concrete examples on widely used models like BERT and GPT2 are explored. Finally, since transformers pretrained on language-modeling-like tasks have been widely adopted in computer vision and time series applications, we briefly explore some examples of such solutions in order to enable readers to understand how transformers work in the aforementioned domains and compare this use with the original one in NLP.  ( 2 min )
    Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images
    arXiv:2312.06454v3 Announce Type: replace-cross Abstract: Directly predicting human epidermal growth factor receptor 2 (HER2) status from widely available hematoxylin and eosin (HE)-stained whole slide images (WSIs) can reduce technical costs and expedite treatment selection. Accurately predicting HER2 requires large collections of multi-site WSIs. Federated learning enables collaborative training of these WSIs without gigabyte-size WSIs transportation and data privacy concerns. However, federated learning encounters challenges in addressing label imbalance in multi-site WSIs from the real world. Moreover, existing WSI classification methods cannot simultaneously exploit local context information and long-range dependencies in the site-end feature representation of federated learning. To address these issues, we present a point transformer with federated learning for multi-site HER2 status prediction from HE-stained WSIs. Our approach incorporates two novel designs. We propose a dynamic label distribution strategy and an auxiliary classifier, which helps to establish a well-initialized model and mitigate label distribution variations across sites. Additionally, we propose a farthest cosine sampling based on cosine distance. It can sample the most distinctive features and capture the long-range dependencies. Extensive experiments and analysis show that our method achieves state-of-the-art performance at four sites with a total of 2687 WSIs. Furthermore, we demonstrate that our model can generalize to two unseen sites with 229 WSIs.  ( 3 min )
    Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation
    arXiv:2311.08640v3 Announce Type: replace-cross Abstract: We study semi-supervised sequence generation tasks, where the few labeled examples are too scarce to finetune a model, and meanwhile, few-shot prompted large language models (LLMs) exhibit room for improvement. In this paper, we present the discovery that a student model distilled from a few-shot prompted LLM can commonly generalize better than its teacher to unseen examples on such tasks. We find that the student is able to learn a general pattern from the high-quality pseudolabels produced by the teacher during knowledge distillation (KD), and favorably not a general pattern from the low-quality pseudolables. Leveraging this discovery, we propose a new method, Multistage Collaborative Knowledge Distillation from an LLM (MCKD), for these tasks. MCKD first few-shot prompts an LLM to produce pseudolabels for unlabeled data. Then at each stage of an iterative KD process, a new pair of students is trained on disjoint partitions of the pseudolabeled data, and produces new and improved pseudolabels for their unseen partitions. We conduct extensive experiments on four syntactic and semantic parsing datasets and show the effectiveness of MCKD for low-resource semi-supervised sequence generation. On CRAFT biomedical parsing, for example, 3-stage MCKD with 50 labeled examples outperforms an LLM teacher and vanilla KD by 7.5% and 3.7% parsing F1, respectively, and matches the performance of supervised finetuning with 500 labeled examples.  ( 3 min )
    Uncertainty estimation in satellite precipitation interpolation with machine learning
    arXiv:2311.07511v2 Announce Type: replace-cross Abstract: Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We address this gap by benchmarking six algorithms, mostly novel for this task, for quantifying predictive uncertainty in spatial interpolation. On 15 years of monthly data over the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Feature importance analysis revealed satellite precipitation (PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals) datasets) as the most informative predictor, followed by gauge elevation and distance to satellite grid points. Compared to QR, LightGBM showed improved performance with respect to the quantile scoring rule by 11.10%, followed by QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial interpolation with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in interpolating spatial data, supported with a formal evaluation framework based on scoring functions and scoring rules.  ( 3 min )
    Conditional Unscented Autoencoders for Trajectory Prediction
    arXiv:2310.19944v2 Announce Type: replace-cross Abstract: The CVAE is one of the most widely-used models in trajectory prediction for AD. It captures the interplay between a driving context and its ground-truth future into a probabilistic latent space and uses it to produce predictions. In this paper, we challenge key components of the CVAE. We leverage recent advances in the space of the VAE, the foundation of the CVAE, which show that a simple change in the sampling procedure can greatly benefit performance. We find that unscented sampling, which draws samples from any learned distribution in a deterministic manner, can naturally be better suited to trajectory prediction than potentially dangerous random sampling. We go further and offer additional improvements including a more structured Gaussian mixture latent space, as well as a novel, potentially more expressive way to do inference with CVAEs. We show wide applicability of our models by evaluating them on the INTERACTION prediction dataset, outperforming the state of the art, as well as at the task of image modeling on the CelebA dataset, outperforming the baseline vanilla CVAE. Code is available at https://github.com/boschresearch/cuae-prediction.  ( 2 min )
    NECO: NEural Collapse Based Out-of-distribution detection
    arXiv:2310.06823v3 Announce Type: replace-cross Abstract: Detecting out-of-distribution (OOD) data is a critical challenge in machine learning due to model overconfidence, often without awareness of their epistemological limits. We hypothesize that ``neural collapse'', a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of ``neural collapse'' and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. Code is available at https://gitlab.com/drti/neco  ( 2 min )
    Comparing the Robustness of Modern No-Reference Image- and Video-Quality Metrics to Adversarial Attacks
    arXiv:2310.06958v4 Announce Type: replace-cross Abstract: Nowadays, neural-network-based image- and video-quality metrics perform better than traditional methods. However, they also became more vulnerable to adversarial attacks that increase metrics' scores without improving visual quality. The existing benchmarks of quality metrics compare their performance in terms of correlation with subjective quality and calculation time. Nonetheless, the adversarial robustness of image-quality metrics is also an area worth researching. This paper analyses modern metrics' robustness to different adversarial attacks. We adapted adversarial attacks from computer vision tasks and compared attacks' efficiency against 15 no-reference image- and video-quality metrics. Some metrics showed high resistance to adversarial attacks, which makes their usage in benchmarks safer than vulnerable metrics. The benchmark accepts submissions of new metrics for researchers who want to make their metrics more robust to attacks or to find such metrics for their needs. The latest results can be found online: https://videoprocessing.ai/benchmarks/metrics-robustness.html.  ( 2 min )
    Are Large Language Models Post Hoc Explainers?
    arXiv:2310.05797v3 Announce Type: replace-cross Abstract: The increasing use of predictive models in high-stakes settings highlights the need for ensuring that relevant stakeholders understand and trust the decisions made by these models. To this end, several approaches have been proposed in recent literature to explain the behavior of complex predictive models in a post hoc fashion. However, despite the growing number of such post hoc explanation techniques, many require white-box access to the model and/or are computationally expensive, highlighting the need for next-generation post hoc explainers. Recently, Large Language Models (LLMs) have emerged as powerful tools that are effective at a wide variety of tasks. However, their potential to explain the behavior of other complex predictive models remains relatively unexplored. In this work, we carry out one of the initial explorations to analyze the effectiveness of LLMs in explaining other complex predictive models. To this end, we propose three novel approaches that exploit the in-context learning (ICL) capabilities of LLMs to explain the predictions made by other complex models. We conduct extensive experimentation with these approaches on real-world datasets to demonstrate that LLMs perform on par with state-of-the-art post hoc explainers, opening up promising avenues for future research into LLM-based post hoc explanations of complex predictive models.  ( 2 min )
    Distributed Deep Joint Source-Channel Coding with Decoder-Only Side Information
    arXiv:2310.04311v2 Announce Type: replace-cross Abstract: We consider low-latency image transmission over a noisy wireless channel when correlated side information is present only at the receiver side (the Wyner-Ziv scenario). In particular, we are interested in developing practical schemes using a data-driven joint source-channel coding (JSCC) approach, which has been previously shown to outperform conventional separation-based approaches in the practical finite blocklength regimes, and to provide graceful degradation with channel quality. We propose a novel neural network architecture that incorporates the decoder-only side information at multiple stages at the receiver side. Our results demonstrate that the proposed method succeeds in integrating the side information, yielding improved performance at all channel conditions in terms of the various quality measures considered here, especially at low channel signal-to-noise ratios (SNRs) and small bandwidth ratios (BRs). We have made the source code of the proposed method public to enable further research, and the reproducibility of the results.  ( 2 min )
    Data-driven Intra-Autonomous Systems Graph Generator
    arXiv:2308.05254v2 Announce Type: replace-cross Abstract: Accurate modeling of realistic network topologies is essential for evaluating novel Internet solutions. Current topology generators, notably scale-free-based models, fail to capture multiple properties of intra-AS topologies. While scale-free networks encode node-degree distribution, they overlook crucial graph properties like betweenness, clustering, and assortativity. The limitations of existing generators pose challenges for training and evaluating deep learning models in communication networks, emphasizing the need for advanced topology generators encompassing diverse Internet topology characteristics. This paper introduces a novel deep-learning-based generator of synthetic graphs representing intra-autonomous in the Internet, named Deep-Generative Graphs for the Internet (DGGI). It also presents a novel massive dataset of real intra-AS graphs extracted from the project ITDK, called IGraphs. It is shown that DGGI creates synthetic graphs that accurately reproduce the properties of centrality, clustering, assortativity, and node degree. The DGGI generator overperforms existing Internet topology generators. On average, DGGI improves the MMD metric $84.4\%$, $95.1\%$, $97.9\%$, and $94.7\%$ for assortativity, betweenness, clustering, and node degree, respectively.  ( 2 min )
    Locally Stationary Graph Processes
    arXiv:2309.01657v2 Announce Type: replace-cross Abstract: Stationary graph process models are commonly used in the analysis and inference of data sets collected on irregular network topologies. While most of the existing methods represent graph signals with a single stationary process model that is globally valid on the entire graph, in many practical problems, the characteristics of the process may be subject to local variations in different regions of the graph. In this work, we propose a locally stationary graph process (LSGP) model that aims to extend the classical concept of local stationarity to irregular graph domains. We characterize local stationarity by expressing the overall process as the combination of a set of component processes such that the extent to which the process adheres to each component varies smoothly over the graph. We propose an algorithm for computing LSGP models from realizations of the process, and also study the approximation of LSGPs locally with WSS processes. Experiments on signal interpolation problems show that the proposed process model provides accurate signal representations competitive with the state of the art.  ( 2 min )
    The Impact of Loss Functions and Scene Representations for 3D/2D Registration on Single-view Fluoroscopic X-ray Pose Estimation
    arXiv:2308.00214v3 Announce Type: replace-cross Abstract: Many tasks performed in image-guided procedures can be cast as pose estimation problems, where specific projections are chosen to reach a target in 3D space. In this study, we first develop a differentiable projection (DiffProj) rendering framework for the efficient computation of Digitally Reconstructed Radiographs (DRRs) with automatic differentiability from either Cone-Beam Computerized Tomography (CBCT) or neural scene representations, including two newly proposed methods, Neural Tuned Tomography (NeTT) and masked Neural Radiance Fields (mNeRF). We then perform pose estimation by iterative gradient descent using various candidate loss functions, that quantify the image discrepancy of the synthesized DRR with respect to the ground-truth fluoroscopic X-ray image. Compared to alternative loss functions, the Mutual Information loss function can significantly improve pose estimation accuracy, as it can effectively prevent entrapment in local optima. Using the Mutual Information loss, a comprehensive evaluation of pose estimation performed on a tomographic X-ray dataset of 50 patients$'$ skulls shows that utilizing either discretized (CBCT) or neural (NeTT/mNeRF) scene representations in DiffProj leads to comparable performance in DRR appearance and pose estimation (3D angle errors: mean $\leq$ 3.2{\deg} and 90% quantile $\leq$ 3.4{\deg}), despite the latter often incurring considerable training expenses and time. These findings could be instrumental for selecting appropriate approaches to improve the efficiency and effectiveness of fluoroscopic X-ray pose estimation in widespread image-guided interventions.  ( 3 min )
    Stochastic positional embeddings improve masked image modeling
    arXiv:2308.00566v2 Announce Type: replace-cross Abstract: Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including $+1.7\%$ on ImageNet linear probing using ViT-B, and $+2.5\%$ for ViT-H using $1\%$ of the data.  ( 2 min )
    Fractional Denoising for 3D Molecular Pre-training
    arXiv:2307.10683v3 Announce Type: replace-cross Abstract: Coordinate denoising is a promising 3D molecular pre-training method, which has achieved remarkable performance in various downstream drug discovery tasks. Theoretically, the objective is equivalent to learning the force field, which is revealed helpful for downstream tasks. Nevertheless, there are two challenges for coordinate denoising to learn an effective force field, i.e. low coverage samples and isotropic force field. The underlying reason is that molecular distributions assumed by existing denoising methods fail to capture the anisotropic characteristic of molecules. To tackle these challenges, we propose a novel hybrid noise strategy, including noises on both dihedral angel and coordinate. However, denoising such hybrid noise in a traditional way is no more equivalent to learning the force field. Through theoretical deductions, we find that the problem is caused by the dependency of the input conformation for covariance. To this end, we propose to decouple the two types of noise and design a novel fractional denoising method (Frad), which only denoises the latter coordinate part. In this way, Frad enjoys both the merits of sampling more low-energy structures and the force field equivalence. Extensive experiments show the effectiveness of Frad in molecular representation, with a new state-of-the-art on 9 out of 12 tasks of QM9 and on 7 out of 8 targets of MD17.  ( 3 min )
    Pay Attention to the Atlas: Atlas-Guided Test-Time Adaptation Method for Robust 3D Medical Image Segmentation
    arXiv:2307.00676v2 Announce Type: replace-cross Abstract: Convolutional neural networks (CNNs) often suffer from poor performance when tested on target data that differs from the training (source) data distribution, particularly in medical imaging applications where variations in imaging protocols across different clinical sites and scanners lead to different imaging appearances. However, re-accessing source training data for unsupervised domain adaptation or labeling additional test data for model fine-tuning can be difficult due to privacy issues and high labeling costs, respectively. To solve this problem, we propose a novel atlas-guided test-time adaptation (TTA) method for robust 3D medical image segmentation, called AdaAtlas. AdaAtlas only takes one single unlabeled test sample as input and adapts the segmentation network by minimizing an atlas-based loss. Specifically, the network is adapted so that its prediction after registration is aligned with the learned atlas in the atlas space, which helps to reduce anatomical segmentation errors at test time. In addition, different from most existing TTA methods which restrict the adaptation to batch normalization blocks in the segmentation network only, we further exploit the use of channel and spatial attention blocks for improved adaptability at test time. Extensive experiments on multiple datasets from different sites show that AdaAtlas with attention blocks adapted (AdaAtlas-Attention) achieves superior performance improvements, greatly outperforming other competitive TTA methods.  ( 3 min )
    Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression
    arXiv:2306.14287v2 Announce Type: replace-cross Abstract: Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, however, at the expense of a significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer) - a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache the previous attention computations, drastically reducing the model and runtime complexity. Compared to the non-parallel approach, our proposal has ~145x lower model complexity and ~210x faster decoding speed, and achieves higher average bit savings on Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models.  ( 3 min )
    Masking Augmentation for Supervised Learning
    arXiv:2306.11339v2 Announce Type: replace-cross Abstract: Pre-training using random masking has emerged as a novel trend in training techniques. However, supervised learning faces a challenge in adopting masking augmentations, primarily due to unstable training. In this paper, we propose a novel way to involve masking augmentations dubbed Masked Sub-model (MaskSub). MaskSub consists of the main-model and sub-model; while the former enjoys conventional training recipes, the latter leverages the benefit of strong masking augmentations in training. MaskSub addresses the challenge by mitigating adverse effects through a relaxed loss function similar to a self-distillation loss. Our analysis shows that MaskSub improves performance, with the training loss converging even faster than regular training, which suggests our method facilitates training. We further validate MaskSub across diverse training recipes and models, including DeiT-III, MAE fine-tuning, CLIP fine-tuning, ResNet, and Swin Transformer. Our results show that MaskSub consistently provides significant performance gains across all the cases. MaskSub provides a practical and effective solution for introducing additional regularization under various training recipes. Code available at https://github.com/naver-ai/augsub  ( 2 min )
    HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection
    arXiv:2306.05812v2 Announce Type: replace-cross Abstract: An individualised head-related transfer function (HRTF) is very important for creating realistic virtual reality (VR) and augmented reality (AR) environments. However, acoustically measuring high-quality HRTFs requires expensive equipment and an acoustic lab setting. To overcome these limitations and to make this measurement more efficient HRTF upsampling has been exploited in the past where a high-resolution HRTF is created from a low-resolution one. This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling. We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN). This new approach is benchmarked against three baselines: barycentric upsampling, spherical harmonic (SH) upsampling and an HRTF selection approach. Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance using perceptual models when the input HRTF is sparse (less than 20 measured positions).  ( 3 min )
    Detecting Heart Disease from Multi-View Ultrasound Images via Supervised Attention Multiple Instance Learning
    arXiv:2306.00003v2 Announce Type: replace-cross Abstract: Aortic stenosis (AS) is a degenerative valve condition that causes substantial morbidity and mortality. This condition is under-diagnosed and under-treated. In clinical practice, AS is diagnosed with expert review of transthoracic echocardiography, which produces dozens of ultrasound images of the heart. Only some of these views show the aortic valve. To automate screening for AS, deep networks must learn to mimic a human expert's ability to identify views of the aortic valve then aggregate across these relevant images to produce a study-level diagnosis. We find previous approaches to AS detection yield insufficient accuracy due to relying on inflexible averages across images. We further find that off-the-shelf attention-based multiple instance learning (MIL) performs poorly. We contribute a new end-to-end MIL approach with two key methodological innovations. First, a supervised attention technique guides the learned attention mechanism to favor relevant views. Second, a novel self-supervised pretraining strategy applies contrastive learning on the representation of the whole study instead of individual images as commonly done in prior literature. Experiments on an open-access dataset and an external validation set show that our approach yields higher accuracy while reducing model size.  ( 3 min )
    Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making
    arXiv:2305.17588v3 Announce Type: replace-cross Abstract: Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains.  ( 3 min )
    Functional sufficient dimension reduction through information maximization with application to classification
    arXiv:2305.10880v3 Announce Type: replace-cross Abstract: Considering the case where the response variable is a categorical variable and the predictor is a random function, two novel functional sufficient dimensional reduction (FSDR) methods are proposed based on mutual information and square loss mutual information. Compared to the classical FSDR methods, such as functional sliced inverse regression and functional sliced average variance estimation, the proposed methods are appealing because they are capable of estimating multiple effective dimension reduction directions in the case of a relatively small number of categories, especially for the binary response. Moreover, the proposed methods do not require the restrictive linear conditional mean assumption and the constant covariance assumption. They avoid the inverse problem of the covariance operator which is often encountered in the functional sufficient dimension reduction. The functional principal component analysis with truncation be used as a regularization mechanism. Under some mild conditions, the statistical consistency of the proposed methods is established. It is demonstrated that the two methods are competitive compared with some existing FSDR methods by simulations and real data analyses.  ( 3 min )
    Cross-Class Feature Augmentation for Class Incremental Learning
    arXiv:2304.01899v4 Announce Type: replace-cross Abstract: We propose a novel class incremental learning approach by incorporating a feature augmentation technique motivated by adversarial attacks. We employ a classifier learned in the past to complement training examples rather than simply play a role as a teacher for knowledge distillation towards subsequent models. The proposed approach has a unique perspective to utilize the previous knowledge in class incremental learning since it augments features of arbitrary target classes using examples in other classes via adversarial attacks on a previously learned classifier. By allowing the cross-class feature augmentations, each class in the old tasks conveniently populates samples in the feature space, which alleviates the collapse of the decision boundaries caused by sample deficiency for the previous tasks, especially when the number of stored exemplars is small. This idea can be easily incorporated into existing class incremental learning algorithms without any architecture modification. Extensive experiments on the standard benchmarks show that our method consistently outperforms existing class incremental learning methods by significant margins in various scenarios, especially under an environment with an extremely limited memory budget.  ( 2 min )
    Effects of noise on the overparametrization of quantum neural networks
    arXiv:2302.05059v2 Announce Type: replace-cross Abstract: Overparametrization is one of the most surprising and notorious phenomena in machine learning. Recently, there have been several efforts to study if, and how, Quantum Neural Networks (QNNs) acting in the absence of hardware noise can be overparametrized. In particular, it has been proposed that a QNN can be defined as overparametrized if it has enough parameters to explore all available directions in state space. That is, if the rank of the Quantum Fisher Information Matrix (QFIM) for the QNN's output state is saturated. Here, we explore how the presence of noise affects the overparametrization phenomenon. Our results show that noise can "turn on" previously-zero eigenvalues of the QFIM. This enables the parametrized state to explore directions that were otherwise inaccessible, thus potentially turning an overparametrized QNN into an underparametrized one. For small noise levels, the QNN is quasi-overparametrized, as large eigenvalues coexists with small ones. Then, we prove that as the magnitude of noise increases all the eigenvalues of the QFIM become exponentially suppressed, indicating that the state becomes insensitive to any change in the parameters. As such, there is a pull-and-tug effect where noise can enable new directions, but also suppress the sensitivity to parameter updates. Finally, our results imply that current QNN capacity measures are ill-defined when hardware noise is present.  ( 3 min )
    Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging
    arXiv:2302.01622v4 Announce Type: replace-cross Abstract: Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure its protection are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. Prior work indicates that DP has negative implications on model accuracy and fairness, which are unacceptable in medicine and represent a main barrier to the widespread use of privacy-preserving techniques. In this work, we evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training. For this, we used two datasets: (1) A large dataset (N=193,311) of high quality clinical chest radiographs, and (2) a dataset (N=1,625) of 3D abdominal computed tomography (CT) images, with the task of classifying the presence of pancreatic ductal adenocarcinoma (PDAC). Both were retrospectively collected and manually labeled by experienced radiologists. We then compared non-private deep convolutional neural networks (CNNs) and privacy-preserving (DP) models with respect to privacy-utility trade-offs measured as area under the receiver-operator-characteristic curve (AUROC), and privacy-fairness trade-offs, measured as Pearson's r or Statistical Parity Difference. We found that, while the privacy-preserving trainings yielded lower accuracy, they did largely not amplify discrimination against age, sex or co-morbidity. Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.  ( 3 min )
    Empirical Risk Minimization with Relative Entropy Regularization
    arXiv:2211.06617v4 Announce Type: replace-cross Abstract: The empirical risk minimization (ERM) problem with relative entropy regularization (ERM-RER) is investigated under the assumption that the reference measure is a $\sigma$-finite measure, and not necessarily a probability measure. Under this assumption, which leads to a generalization of the ERM-RER problem allowing a larger degree of flexibility for incorporating prior knowledge, numerous relevant properties are stated. Among these properties, the solution to this problem, if it exists, is shown to be a unique probability measure, mutually absolutely continuous with the reference measure. Such a solution exhibits a probably-approximately-correct guarantee for the ERM problem independently of whether the latter possesses a solution. For a fixed dataset and under a specific condition, the empirical risk is shown to be a sub-Gaussian random variable when the models are sampled from the solution to the ERM-RER problem. The generalization capabilities of the solution to the ERM-RER problem (the Gibbs algorithm) are studied via the sensitivity of the expected empirical risk to deviations from such a solution towards alternative probability measures. Finally, an interesting connection between sensitivity, generalization error, and lautum information is established.  ( 3 min )
    GmGM: a Fast Multi-Axis Gaussian Graphical Model
    arXiv:2211.02920v3 Announce Type: replace-cross Abstract: This paper introduces the Gaussian multi-Graphical Model, a model to construct sparse graph representations of matrix- and tensor-variate data. We generalize prior work in this area by simultaneously learning this representation across several tensors that share axes, which is necessary to allow the analysis of multimodal datasets such as those encountered in multi-omics. Our algorithm uses only a single eigendecomposition per axis, achieving an order of magnitude speedup over prior work in the ungeneralized case. This allows the use of our methodology on large multi-modal datasets such as single-cell multi-omics data, which was challenging with previous approaches. We validate our model on synthetic data and five real-world datasets.  ( 2 min )
    Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation
    arXiv:2210.17264v2 Announce Type: replace-cross Abstract: This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC) to be performed by the same model due to the inherent linguistic content and speaker identity disentanglement. When used in a cross-lingual setting, acoustic features are initially produced with a native speaker of the target language and then voice conversion is applied by the same model in order to convert these features to the target speaker's voice. We verify through objective and subjective evaluations that our method can have benefits compared to baseline cross-lingual synthesis. By including speakers averaging 7.5 minutes of speech, we also present positive results on low-resource scenarios.  ( 2 min )
    Snapture -- A Novel Neural Architecture for Combined Static and Dynamic Hand Gesture Recognition
    arXiv:2205.15862v2 Announce Type: replace-cross Abstract: As robots are expected to get more involved in people's everyday lives, frameworks that enable intuitive user interfaces are in demand. Hand gesture recognition systems provide a natural way of communication and, thus, are an integral part of seamless Human-Robot Interaction (HRI). Recent years have witnessed an immense evolution of computational models powered by deep learning. However, state-of-the-art models fall short in expanding across different gesture domains, such as emblems and co-speech. In this paper, we propose a novel hybrid hand gesture recognition system. Our architecture enables learning both static and dynamic gestures: by capturing a so-called "snapshot" of the gesture performance at its peak, we integrate the hand pose along with the dynamic movement. Moreover, we present a method for analyzing the motion profile of a gesture to uncover its dynamic characteristics and which allows regulating a static channel based on the amount of motion. Our evaluation demonstrates the superiority of our approach on two gesture benchmarks compared to a CNNLSTM baseline. We also provide an analysis on a gesture class basis that unveils the potential of our Snapture architecture for performance improvements. Thanks to its modular implementation, our framework allows the integration of other multimodal data like facial expressions and head tracking, which are important cues in HRI scenarios, into one architecture. Thus, our work contributes both to gesture recognition research and machine learning applications for non-verbal communication with robots.  ( 3 min )
    Composite Goodness-of-fit Tests with Kernels
    arXiv:2111.10275v4 Announce Type: replace-cross Abstract: Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to development of a range of robust methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. In this paper, we propose one such method. More precisely, we propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on the maximum mean discrepancy and the kernel Stein discrepancy. They are widely applicable, including whenever the density of the parametric model is known up to normalisation constant, or if the model takes the form of a simulator. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of an unnormalised non-parametric density model, and an intractable generative model of a biological cellular network.  ( 2 min )
    Solving PDEs on Unknown Manifolds with Machine Learning
    arXiv:2106.06682v4 Announce Type: replace-cross Abstract: This paper proposes a mesh-free computational framework and machine learning theory for solving elliptic PDEs on unknown manifolds, identified with point clouds, based on diffusion maps (DM) and deep learning. The PDE solver is formulated as a supervised learning task to solve a least-squares regression problem that imposes an algebraic equation approximating a PDE (and boundary conditions if applicable). This algebraic equation involves a graph-Laplacian type matrix obtained via DM asymptotic expansion, which is a consistent estimator of second-order elliptic differential operators. The resulting numerical method is to solve a highly non-convex empirical risk minimization problem subjected to a solution from a hypothesis space of neural networks (NNs). In a well-posed elliptic PDE setting, when the hypothesis space consists of neural networks with either infinite width or depth, we show that the global minimizer of the empirical loss function is a consistent solution in the limit of large training data. When the hypothesis space is a two-layer neural network, we show that for a sufficiently large width, gradient descent can identify a global minimizer of the empirical loss function. Supporting numerical examples demonstrate the convergence of the solutions, ranging from simple manifolds with low and high co-dimensions, to rough surfaces with and without boundaries. We also show that the proposed NN solver can robustly generalize the PDE solution on new data points with generalization errors that are almost identical to the training errors, superseding a Nystrom-based interpolation method.  ( 3 min )
    A Robust Cybersecurity Topic Classification Tool
    arXiv:2109.02473v4 Announce Type: replace-cross Abstract: In this research, we use user defined labels from three internet text sources (Reddit, Stackexchange, Arxiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 model's in a cross validation experiment. Then we present a Cybersecurity Topic Classification (CTC) tool, which takes the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity related text. We also show that the majority vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. We show that the CTC tool is scalable to the hundreds of thousands of documents with a wall clock time on the order of hours.  ( 2 min )
    OneLog: Towards End-to-End Training in Software Log Anomaly Detection
    arXiv:2104.07324v2 Announce Type: replace-cross Abstract: With the growth of online services, IoT devices, and DevOps-oriented software development, software log anomaly detection is becoming increasingly important. Prior works mainly follow a traditional four-staged architecture (Preprocessor, Parser, Vectorizer, and Classifier). This paper proposes OneLog, which utilizes a single Deep Neural Network (DNN) instead of multiple separate components. OneLog harnesses Convolutional Neural Networks (CNN) at the character level to take digits, numbers, and punctuations, which were removed in prior works, into account alongside the main natural language text. We evaluate our approach in six message- and sequence-based data sets: HDFS, Hadoop, BGL, Thunderbird, Spirit, and Liberty. We experiment with Onelog with single-, multi-, and cross-project setups. Onelog offers state-of-the-art performance in our datasets. Onelog can utilize multi-project datasets simultaneously during training, which suggests our model can generalize between datasets. Multi-project training also improves Onelog performance making it ideal when limited training data is available for an individual project. We also found that cross-project anomaly detection is possible with a single project pair (Liberty and Spirit). Analysis of model internals shows that one log has multiple modes of detecting anomalies and that the model learns manually validated parsing rules for the log messages. We conclude that character-based CNNs are a promising approach toward end-to-end learning in log anomaly detection. They offer good performance and generalization over multiple datasets. We will make our scripts publicly available upon the acceptance of this paper.  ( 3 min )
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R
    arXiv:2103.09603v5 Announce Type: replace-cross Abstract: The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consist of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.  ( 3 min )
    Automated Machine Learning: From Principles to Practices
    arXiv:1810.13306v5 Announce Type: replace-cross Abstract: Machine learning (ML) methods have been developing rapidly, but configuring and selecting proper methods to achieve a desired performance is increasingly difficult and tedious. To address this challenge, automated machine learning (AutoML) has emerged, which aims to generate satisfactory ML configurations for given tasks in a data-driven way. In this paper, we provide a comprehensive survey on this topic. We begin with the formal definition of AutoML and then introduce its principles, including the bi-level learning objective, the learning strategy, and the theoretical interpretation. Then, we summarize the AutoML practices by setting up the taxonomy of existing works based on three main factors: the search space, the search algorithm, and the evaluation strategy. Each category is also explained with the representative methods. Then, we illustrate the principles and practices with exemplary applications from configuring ML pipeline, one-shot neural architecture search, and integration with foundation models. Finally, we highlight the emerging directions of AutoML and conclude the survey.  ( 2 min )
    Asymmetry in Low-Rank Adapters of Foundation Models
    arXiv:2402.16842v2 Announce Type: replace Abstract: Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product $BA$, we observe that the $B$ and $A$ matrices have distinct functions: $A$ extracts features from the input, while $B$ uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning $A$, and that a random untrained $A$ should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings of exclusively training $B$ improves the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs.  ( 2 min )
    Interpreting Grokked Transformers in Complex Modular Arithmetic
    arXiv:2402.16726v2 Announce Type: replace Abstract: Grokking has been actively explored to reveal the mystery of delayed generalization. Identifying interpretable algorithms inside the grokked models is a suggestive hint to understanding its mechanism. In this work, beyond the simplest and well-studied modular addition, we observe the internal circuits learned through grokking in complex modular arithmetic via interpretable reverse engineering, which highlights the significant difference in their dynamics: subtraction poses a strong asymmetry on Transformer; multiplication requires cosine-biased components at all the frequencies in a Fourier domain; polynomials often result in the superposition of the patterns from elementary arithmetic, but clear patterns do not emerge in challenging cases; grokking can easily occur even in higher-degree formulas with basic symmetric and alternating expressions. We also introduce the novel progress measure for modular arithmetic; Fourier Frequency Sparsity and Fourier Coefficient Ratio, which not only indicate the late generalization but also characterize distinctive internal representations of grokked models per modular operation. Our empirical analysis emphasizes the importance of holistic evaluation among various combinations.  ( 2 min )
    Score-based Causal Representation Learning: Linear and General Transformations
    arXiv:2402.00849v2 Announce Type: replace Abstract: This paper addresses intervention-based causal representation learning (CRL) under a general nonparametric latent causal model and an unknown transformation that maps the latent variables to the observed variables. Linear and general transformations are investigated. The paper addresses both the identifiability and achievability aspects. Identifiability refers to determining algorithm-agnostic conditions that ensure recovering the true latent causal variables and the latent causal graph underlying them. Achievability refers to the algorithmic aspects and addresses designing algorithms that achieve identifiability guarantees. By drawing novel connections between score functions (i.e., the gradients of the logarithm of density functions) and CRL, this paper designs a score-based class of algorithms that ensures both identifiability and achievability. First, the paper focuses on linear transformations and shows that one stochastic hard intervention per node suffices to guarantee identifiability. It also provides partial identifiability guarantees for soft interventions, including identifiability up to ancestors for general causal models and perfect latent graph recovery for sufficiently non-linear causal models. Secondly, it focuses on general transformations and shows that two stochastic hard interventions per node suffice for identifiability. Notably, one does not need to know which pair of interventional environments have the same node intervened.  ( 3 min )
    Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework
    arXiv:2311.18460v2 Announce Type: replace Abstract: Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications.  ( 2 min )
    Optimal Sparse Survival Trees
    arXiv:2401.15330v2 Announce Type: replace Abstract: Interpretability is crucial for doctors, hospitals, pharmaceutical companies and biotechnology corporations to analyze and make decisions for high stakes problems that involve human health. Tree-based methods have been widely adopted for survival analysis due to their appealing interpretablility and their ability to capture complex relationships. However, most existing methods to produce survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably-optimal sparse survival tree models, frequently in only a few seconds.  ( 2 min )
    Series of Hessian-Vector Products for Tractable Saddle-Free Newton Optimisation of Neural Networks
    arXiv:2310.14901v2 Announce Type: replace Abstract: Despite their popularity in the field of continuous optimisation, second-order quasi-Newton methods are challenging to apply in machine learning, as the Hessian matrix is intractably large. This computational burden is exacerbated by the need to address non-convexity, for instance by modifying the Hessian's eigenvalues as in Saddle-Free Newton methods. We propose an optimisation algorithm which addresses both of these concerns - to our knowledge, the first efficiently-scalable optimisation algorithm to asymptotically use the exact inverse Hessian with absolute-value eigenvalues. Our method frames the problem as a series which principally square-roots and inverts the squared Hessian, then uses it to precondition a gradient vector, all without explicitly computing or eigendecomposing the Hessian. A truncation of this infinite series provides a new optimisation algorithm which is scalable and comparable to other first- and second-order optimisation methods in both runtime and optimisation performance. We demonstrate this in a variety of settings, including a ResNet-18 trained on CIFAR-10.  ( 2 min )
    TEA: Test-time Energy Adaptation
    arXiv:2311.14402v2 Announce Type: replace Abstract: Test-time adaptation (TTA) aims to improve model generalizability when test data diverges from training distribution, offering the distinct advantage of not requiring access to training data and processes, especially valuable in the context of large pre-trained models. However, current TTA methods fail to address the fundamental issue: covariate shift, i.e., the decreased generalizability can be attributed to the model's reliance on the marginal distribution of the training data, which may impair model calibration and introduce confirmation bias. To address this, we propose a novel energy-based perspective, enhancing the model's perception of target data distributions without requiring access to training data or processes. Building on this perspective, we introduce $\textbf{T}$est-time $\textbf{E}$nergy $\textbf{A}$daptation ($\textbf{TEA}$), which transforms the trained classifier into an energy-based model and aligns the model's distribution with the test data's, enhancing its ability to perceive test distributions and thus improving overall generalizability. Extensive experiments across multiple tasks, benchmarks and architectures demonstrate TEA's superior generalization performance against state-of-the-art methods. Further in-depth analyses reveal that TEA can equip the model with a comprehensive perception of test distribution, ultimately paving the way toward improved generalization and calibration.  ( 2 min )
    Pyramidal Hidden Markov Model For Multivariate Time Series Forecasting
    arXiv:2310.14341v2 Announce Type: replace Abstract: The Hidden Markov Model (HMM) can predict the future value of a time series based on its current and previous values, making it a powerful algorithm for handling various types of time series. Numerous studies have explored the improvement of HMM using advanced techniques, leading to the development of several variations of HMM. Despite these studies indicating the increased competitiveness of HMM compared to other advanced algorithms, few have recognized the significance and impact of incorporating multistep stochastic states into its performance. In this work, we propose a Pyramidal Hidden Markov Model (PHMM) that can capture multiple multistep stochastic states. Initially, a multistep HMM is designed for extracting short multistep stochastic states. Next, a novel time series forecasting structure is proposed based on PHMM, which utilizes pyramid-like stacking to adaptively identify long multistep stochastic states. By employing these two schemes, our model can effectively handle non-stationary and noisy data, while also establishing long-term dependencies for more accurate and comprehensive forecasting. The experimental results on diverse multivariate time series datasets convincingly demonstrate the superior performance of our proposed PHMM compared to its competitive peers in time series forecasting.  ( 2 min )
    Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks
    arXiv:2310.10556v2 Announce Type: replace Abstract: A recently popular approach to solving reinforcement learning is with data from human preferences. In fact, human preference data are now used with classic reinforcement learning algorithms such as actor-critic methods, which involve evaluating an intermediate policy over a reward learned from human preference data with distribution shift, known as off-policy evaluation (OPE). Such algorithm includes (i) learning reward function from human preference dataset, and (ii) learning expected cumulative reward of a target policy. Despite the huge empirical success, existing OPE methods with preference data often lack theoretical understanding and rely heavily on heuristics. In this paper, we study the sample efficiency of OPE with human preference and establish a statistical guarantee for it. Specifically, we approach OPE by learning the value function by fitted-Q-evaluation with a deep neural network. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Under the assumption of high reward smoothness, our results \textit{almost align with the classical OPE results with observable reward data}. To the best of our knowledge, this is the first result that establishes a \textit{provably efficient} guarantee for off-policy evaluation with RLHF.  ( 3 min )
    Multimodal Federated Learning in Healthcare: a Review
    arXiv:2310.09650v2 Announce Type: replace Abstract: Recent advancements in multimodal machine learning have empowered the development of accurate and robust AI systems in the medical domain, especially within centralized database systems. Simultaneously, Federated Learning (FL) has progressed, providing a decentralized mechanism where data need not be consolidated, thereby enhancing the privacy and security of sensitive healthcare data. The integration of these two concepts supports the ongoing progress of multimodal learning in healthcare while ensuring the security and privacy of patient records within local data-holding agencies. This paper offers a concise overview of the significance of FL in healthcare and outlines the current state-of-the-art approaches to Multimodal Federated Learning (MMFL) within the healthcare domain. It comprehensively examines the existing challenges in the field, shedding light on the limitations of present models. Finally, the paper outlines potential directions for future advancements in the field, aiming to bridge the gap between cutting-edge AI technology and the imperative need for patient data privacy in healthcare applications.  ( 2 min )
    Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?
    arXiv:2309.16595v3 Announce Type: replace Abstract: Large language models (LLMs) are gaining increasing attention for their capability to process graphs with rich text attributes, especially in a zero-shot fashion. Recent studies demonstrate that LLMs obtain decent text classification performance on common text-rich graph benchmarks, and the performance can be improved by appending encoded structural information as natural languages into prompts. We aim to understand why the incorporation of structural information inherent in graph data can improve the prediction performance of LLMs. First, we rule out the concern of data leakage by curating a novel leakage-free dataset and conducting a comparative analysis alongside a previously widely-used dataset. Second, as past work usually encodes the ego-graph by describing the graph structure in natural language, we ask the question: do LLMs understand the graph structure in accordance with the intent of the prompt designers? Third, we investigate why LLMs can improve their performance after incorporating structural information. Our exploration of these questions reveals that (i) there is no substantial evidence that the performance of LLMs is significantly attributed to data leakage; (ii) instead of understanding prompts as graph structures as intended by the prompt designers, LLMs tend to process prompts more as contextual paragraphs and (iii) the most efficient elements of the local neighborhood included in the prompt are phrases that are pertinent to the node label, rather than the graph structure.  ( 3 min )
    Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach
    arXiv:2309.10831v3 Announce Type: replace Abstract: In this paper we propose a framework towards achieving two intertwined objectives: (i) equipping reinforcement learning with active exploration and deliberate information gathering, such that it regulates state and parameter uncertainties resulting from modeling mismatches and noisy sensory; and (ii) overcoming the huge computational cost of stochastic optimal control. We approach both objectives by using reinforcement learning to attain the stochastic optimal control law. On one hand, we avoid the curse of dimensionality prohibiting the direct solution of the stochastic dynamic programming equation. On the other hand, the resulting stochastic control inspired reinforcement learning agent admits the behavior of a dual control, namely, caution and probing, that is, regulating the state estimate together with its estimation quality. Unlike exploration and exploitation, caution and probing are employed automatically by the controller in real-time, even after the learning process is concluded. We use the proposed approach on a numerical example of a model that belongs to an emerging class in system identification. We show how, for the dimensionality of the stochastic version of this model, Dynamic Programming is prohibitive, Model Predictive Control requires an expensive nonlinear optimization, and a Linear Quadratic Regulator with the certainty equivalence assumption leads to poor performance and filter divergence, all contrasting our approach which is shown to be both: computationally convenient, stabilizing and of an acceptable performance.  ( 3 min )
    FedSOL: Stabilized Orthogonal Learning in Federated Learning
    arXiv:2308.12532v4 Announce Type: replace Abstract: Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy, it often suffers from significant performance degradation when client data distributions are heterogeneous. Many previous FL algorithms have addressed this issue by introducing various proximal restrictions. These restrictions aim to encourage global alignment by constraining the deviation of local learning from the global objective. However, they inherently limit local learning by interfering with the original local objectives. Recently, an alternative approach has emerged to improve local learning generality. By obtaining local models within a smooth loss landscape, this approach mitigates conflicts among different local objectives of the clients. Yet, it does not ensure stable global alignment, as local learning does not take the global objective into account. In this study, we propose Federated Stability on Learning (FedSoL), which combines both the concepts of global alignment and local generality. In FedSoL, the local learning seeks a parameter region robust against proximal perturbations. This strategy introduces an implicit proximal restriction effect in local learning while maintaining the original local objective for parameter update. Our experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.  ( 3 min )
    AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
    arXiv:2307.12499v3 Announce Type: replace Abstract: Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques. They pose severe security problems for deep learning applications because they can effectively bypass defense mechanisms. However, previous attack methods often utilize Generative Adversarial Networks (GANs), which are not theoretically provable and thus generate unrealistic examples by incorporating adversarial objectives, especially for large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques to conduct adversarial sampling in the reverse generation process of diffusion models. These two techniques are effective and stable to generate high-quality, realistic adversarial examples by integrating gradients of the target classifier interpretably. Experimental results on MNIST and ImageNet datasets demonstrate that AdvDiff is effective to generate unrestricted adversarial examples, which outperforms GAN-based methods in terms of attack performance and generation quality.  ( 2 min )
    The curse of dimensionality in operator learning
    arXiv:2306.15924v2 Announce Type: replace Abstract: Neural operator architectures employ neural networks to approximate operators mapping between Banach spaces of functions; they may be used to accelerate model evaluations via emulation, or to discover models from data. Consequently, the methodology has received increasing attention over recent years, giving rise to the rapidly growing field of operator learning. The first contribution of this paper is to prove that for general classes of operators which are characterized only by their $C^r$- or Lipschitz-regularity, operator learning suffers from a curse of dimensionality, defined precisely here in terms of representations of the infinite-dimensional input and output function spaces. The result is applicable to a wide variety of existing neural operators, including PCA-Net, DeepONet and the FNO. The second contribution of the paper is to prove that the general curse of dimensionality can be overcome for solution operators defined by the Hamilton-Jacobi equation; this is achieved by leveraging additional structure in the underlying solution operator, going beyond regularity. To this end, a novel neural operator architecture is introduced, termed HJ-Net, which explicitly takes into account characteristic information of the underlying Hamiltonian system. Error and complexity estimates are derived for HJ-Net which show that this architecture can provably beat the curse of dimensionality related to the infinite-dimensional input and output function spaces.  ( 2 min )
    Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates
    arXiv:2307.08816v2 Announce Type: replace Abstract: Discrete optimization belongs to the set of $\mathcal{NP}$-hard problems, spanning fields such as mixed-integer programming and combinatorial optimization. A current standard approach to solving convex discrete optimization problems is the use of cutting-plane algorithms, which reach optimal solutions by iteratively adding inequalities known as \textit{cuts} to refine a feasible set. Despite the existence of a number of general-purpose cut-generating algorithms, large-scale discrete optimization problems continue to suffer from intractability. In this work, we propose a method for accelerating cutting-plane algorithms via reinforcement learning. Our approach uses learned policies as surrogates for $\mathcal{NP}$-hard elements of the cut generating procedure in a way that (i) accelerates convergence, and (ii) retains guarantees of optimality. We apply our method on two types of problems where cutting-plane algorithms are commonly used: stochastic optimization, and mixed-integer quadratic programming. We observe the benefits of our method when applied to Benders decomposition (stochastic optimization) and iterative loss approximation (quadratic programming), achieving up to $45\%$ faster average convergence when compared to modern alternative algorithms.  ( 2 min )
    Variational Gaussian Process Diffusion Processes
    arXiv:2306.02066v3 Announce Type: replace Abstract: Diffusion processes are a class of stochastic differential equations (SDEs) providing a rich family of expressive models that arise naturally in dynamic modelling tasks. Probabilistic inference and learning under generative models with latent processes endowed with a non-linear diffusion process prior are intractable problems. We build upon work within variational inference, approximating the posterior process as a linear diffusion process, and point out pathologies in the approach. We propose an alternative parameterization of the Gaussian variational process using a site-based exponential family description. This allows us to trade a slow inference algorithm with fixed-point iterations for a fast algorithm for convex optimization akin to natural gradient descent, which also provides a better objective for learning model parameters.  ( 2 min )
    OCAtari: Object-Centric Atari 2600 Reinforcement Learning Environments
    arXiv:2306.08649v2 Announce Type: replace Abstract: Cognitive science and psychology suggest that object-centric representations of complex scenes are a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep reinforcement learning approaches only rely on pixel-based representations that do not capture the compositional properties of natural scenes. For this, we need environments and datasets that allow us to work and evaluate object-centric approaches. In our work, we extend the Atari Learning Environments, the most-used evaluation framework for deep RL approaches, by introducing OCAtari, that performs resource-efficient extractions of the object-centric states for these games. Our framework allows for object discovery, object representation learning, as well as object-centric RL. We evaluate OCAtari's detection capabilities and resource efficiency. Our source code is available at github.com/k4ntz/OC_Atari.  ( 2 min )
    Dynamic Neighborhood Construction for Structured Large Discrete Action Spaces
    arXiv:2305.19891v4 Announce Type: replace Abstract: Large discrete action spaces (LDAS) remain a central challenge in reinforcement learning. Existing solution approaches can handle unstructured LDAS with up to a few million actions. However, many real-world applications in logistics, production, and transportation systems have combinatorial action spaces, whose size grows well beyond millions of actions, even on small instances. Fortunately, such action spaces exhibit structure, e.g., equally spaced discrete resource units. With this work, we focus on handling structured LDAS (SLDAS) with sizes that cannot be handled by current benchmarks: we propose Dynamic Neighborhood Construction (DNC), a novel exploitation paradigm for SLDAS. We present a scalable neighborhood exploration heuristic that utilizes this paradigm and efficiently explores the discrete neighborhood around the continuous proxy action in structured action spaces with up to $10^{73}$ actions. We demonstrate the performance of our method by benchmarking it against three state-of-the-art approaches designed for large discrete action spaces across two distinct environments. Our results show that DNC matches or outperforms state-of-the-art approaches while being computationally more efficient. Furthermore, our method scales to action spaces that so far remained computationally intractable for existing methodologies.  ( 3 min )
    Exploring the Promise and Limits of Real-Time Recurrent Learning
    arXiv:2305.19044v2 Announce Type: replace Abstract: Real-time recurrent learning (RTRL) for sequence-processing recurrent neural networks (RNNs) offers certain conceptual advantages over backpropagation through time (BPTT). RTRL requires neither caching past activations nor truncating context, and enables online learning. However, RTRL's time and space complexity make it impractical. To overcome this problem, most recent work on RTRL focuses on approximation theories, while experiments are often limited to diagnostic settings. Here we explore the practical promise of RTRL in more realistic settings. We study actor-critic methods that combine RTRL and policy gradients, and test them in several subsets of DMLab-30, ProcGen, and Atari-2600 environments. On DMLab memory tasks, our system trained on fewer than 1.2 B environmental frames is competitive with or outperforms well-known IMPALA and R2D2 baselines trained on 10 B frames. To scale to such challenging tasks, we focus on certain well-known neural architectures with element-wise recurrence, allowing for tractable RTRL without approximation. Importantly, we also discuss rarely addressed limitations of RTRL in real-world applications, such as its complexity in the multi-layer case.  ( 2 min )
    The cross-sectional stock return predictions via quantum neural network and tensor network
    arXiv:2304.12501v2 Announce Type: replace Abstract: In this paper, we investigate the application of quantum and quantum-inspired machine learning algorithms to stock return predictions. Specifically, we evaluate the performance of quantum neural network, an algorithm suited for noisy intermediate-scale quantum computers, and tensor network, a quantum-inspired machine learning algorithm, against classical models such as linear regression and neural networks. To evaluate their abilities, we construct portfolios based on their predictions and measure investment performances. The empirical study on the Japanese stock market shows the tensor network model achieves superior performance compared to classical benchmark models, including linear and neural network models. Though the quantum neural network model attains a lowered risk-adjusted excess return than the classical neural network models over the whole period, both the quantum neural network and tensor network models have superior performances in the latest market environment, which suggests the capability of the model's capturing non-linearity between input features.  ( 2 min )
    Multimodal and multicontrast image fusion via deep generative models
    arXiv:2303.15963v2 Announce Type: replace Abstract: Recently, it has become progressively more evident that classic diagnostic labels are unable to reliably describe the complexity and variability of several clinical phenotypes. This is particularly true for a broad range of neuropsychiatric illnesses (e.g., depression, anxiety disorders, behavioral phenotypes). Patient heterogeneity can be better described by grouping individuals into novel categories based on empirically derived sections of intersecting continua that span across and beyond traditional categorical borders. In this context, neuroimaging data carry a wealth of spatiotemporally resolved information about each patient's brain. However, they are usually heavily collapsed a priori through procedures which are not learned as part of model training, and consequently not optimized for the downstream prediction task. This is because every individual participant usually comes with multiple whole-brain 3D imaging modalities often accompanied by a deep genotypic and phenotypic characterization, hence posing formidable computational challenges. In this paper we design a deep learning architecture based on generative models rooted in a modular approach and separable convolutional blocks to a) fuse multiple 3D neuroimaging modalities on a voxel-wise level, b) convert them into informative latent embeddings through heavy dimensionality reduction, c) maintain good generalizability and minimal information loss. As proof of concept, we test our architecture on the well characterized Human Connectome Project database demonstrating that our latent embeddings can be clustered into easily separable subject strata which, in turn, map to different phenotypical information which was not included in the embedding creation process. This may be of aid in predicting disease evolution as well as drug response, hence supporting mechanistic disease understanding and empowering clinical trials.  ( 3 min )
    Communication-Efficient Federated Bilevel Optimization with Local and Global Lower Level Problems
    arXiv:2302.06701v2 Announce Type: replace Abstract: Bilevel Optimization has witnessed notable progress recently with new emerging efficient algorithms. However, its application in the Federated Learning setting remains relatively underexplored, and the impact of Federated Learning's inherent challenges on the convergence of bilevel algorithms remain obscure. In this work, we investigate Federated Bilevel Optimization problems and propose a communication-efficient algorithm, named FedBiOAcc. The algorithm leverages an efficient estimation of the hyper-gradient in the distributed setting and utilizes the momentum-based variance-reduction acceleration. Remarkably, FedBiOAcc achieves a communication complexity $O(\epsilon^{-1})$, a sample complexity $O(\epsilon^{-1.5})$ and the linear speed up with respect to the number of clients. We also analyze a special case of the Federated Bilevel Optimization problems, where lower level problems are locally managed by clients. We prove that FedBiOAcc-Local, a modified version of FedBiOAcc, converges at the same rate for this type of problems. Finally, we validate the proposed algorithms through two real-world tasks: Federated Data-cleaning and Federated Hyper-representation Learning. Empirical results show superior performance of our algorithms.  ( 2 min )
    Deep Augmentation: Self-Supervised Learning with Transformations in Activation Space
    arXiv:2303.14537v2 Announce Type: replace Abstract: We introduce Deep Augmentation, an approach to implicit data augmentation using dropout or PCA to transform a targeted layer within a neural network to improve performance and generalization. We demonstrate Deep Augmentation through extensive experiments on contrastive learning tasks in NLP, computer vision, and graph learning. We observe substantial performance gains with Transformers, ResNets, and Graph Neural Networks as the underlying models in contrastive learning, but observe inverse effects on the corresponding supervised problems. Our analysis suggests that Deep Augmentation alleviates co-adaption between layers, a form of "collapse." We use this observation to formulate a method for selecting which layer to target; in particular, our experimentation reveals that targeting deeper layers with Deep Augmentation outperforms augmenting the input data. The simple network- and modality-agnostic nature of this approach enables its integration into various machine learning pipelines.  ( 2 min )
    Integrating Multimodal Data for Joint Generative Modeling of Complex Dynamics
    arXiv:2212.07892v2 Announce Type: replace Abstract: Many, if not most, systems of interest in science are naturally described as nonlinear dynamical systems. Empirically, we commonly access these systems through time series measurements. Often such time series may consist of discrete random variables rather than continuous measurements, or may be composed of measurements from multiple data modalities observed simultaneously. For instance, in neuroscience we may have behavioral labels in addition to spike counts and continuous physiological recordings. While by now there is a burgeoning literature on deep learning for dynamical systems reconstruction (DSR), multimodal data integration has hardly been considered in this context. Here we provide such an efficient and flexible algorithmic framework that rests on a multimodal variational autoencoder for generating a sparse teacher signal that guides training of a reconstruction model, exploiting recent advances in DSR training techniques. It enables to combine various sources of information for optimal reconstruction, even allows for reconstruction from symbolic data (class labels) alone, and connects different types of observations within a common latent dynamics space. In contrast to previous multimodal data integration techniques for scientific applications, our framework is fully \textit{generative}, producing, after training, trajectories with the same geometrical and temporal structure as those of the ground truth system.  ( 3 min )
    Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data
    arXiv:2206.09107v2 Announce Type: replace Abstract: Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of ``or''. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.  ( 3 min )
    Goal-Space Planning with Subgoal Models
    arXiv:2206.02902v5 Announce Type: replace Abstract: This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.  ( 2 min )
    Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning
    arXiv:2402.17768v1 Announce Type: cross Abstract: A common failure mode for policies trained with imitation is compounding execution errors at test time. When the learned policy encounters states that were not present in the expert demonstrations, the policy fails, leading to degenerate behavior. The Dataset Aggregation, or DAgger approach to this problem simply collects more data to cover these failure states. However, in practice, this is often prohibitively expensive. In this work, we propose Diffusion Meets DAgger (DMD), a method to reap the benefits of DAgger without the cost for eye-in-hand imitation learning problems. Instead of collecting new samples to cover out-of-distribution states, DMD uses recent advances in diffusion models to create these samples with diffusion models. This leads to robust performance from few demonstrations. In experiments conducted for non-prehensile pushing on a Franka Research 3, we show that DMD can achieve a success rate of 80% with as few as 8 expert demonstrations, where naive behavior cloning reaches only 20%. DMD also outperform competing NeRF-based augmentation schemes by 50%.  ( 2 min )
    The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    arXiv:2402.17764v1 Announce Type: cross Abstract: Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.  ( 2 min )
    Opening Cabinets and Drawers in the Real World using a Commodity Mobile Manipulator
    arXiv:2402.17767v1 Announce Type: cross Abstract: Pulling open cabinets and drawers presents many difficult technical challenges in perception (inferring articulation parameters for objects from onboard sensors), planning (producing motion plans that conform to tight task constraints), and control (making and maintaining contact while applying forces on the environment). In this work, we build an end-to-end system that enables a commodity mobile manipulator (Stretch RE2) to pull open cabinets and drawers in diverse previously unseen real world environments. We conduct 4 days of real world testing of this system spanning 31 different objects from across 13 different real world environments. Our system achieves a success rate of 61% on opening novel cabinets and drawers in unseen environments zero-shot. An analysis of the failure modes suggests that errors in perception are the most significant challenge for our system. We will open source code and models for others to replicate and build upon our system.  ( 2 min )
    Learning to Program Variational Quantum Circuits with Fast Weights
    arXiv:2402.17760v1 Announce Type: cross Abstract: Quantum Machine Learning (QML) has surfaced as a pioneering framework addressing sequential control tasks and time-series modeling. It has demonstrated empirical quantum advantages notably within domains such as Reinforcement Learning (RL) and time-series prediction. A significant advancement lies in Quantum Recurrent Neural Networks (QRNNs), specifically tailored for memory-intensive tasks encompassing partially observable environments and non-linear time-series prediction. Nevertheless, QRNN-based models encounter challenges, notably prolonged training duration stemming from the necessity to compute quantum gradients using backpropagation-through-time (BPTT). This predicament exacerbates when executing the complete model on quantum devices, primarily due to the substantial demand for circuit evaluation arising from the parameter-shift rule. This paper introduces the Quantum Fast Weight Programmers (QFWP) as a solution to the temporal or sequential learning challenge. The QFWP leverages a classical neural network (referred to as the 'slow programmer') functioning as a quantum programmer to swiftly modify the parameters of a variational quantum circuit (termed the 'fast programmer'). Instead of completely overwriting the fast programmer at each time-step, the slow programmer generates parameter changes or updates for the quantum circuit parameters. This approach enables the fast programmer to incorporate past observations or information. Notably, the proposed QFWP model achieves learning of temporal dependencies without necessitating the use of quantum recurrent neural networks. Numerical simulations conducted in this study showcase the efficacy of the proposed QFWP model in both time-series prediction and RL tasks. The model exhibits performance levels either comparable to or surpassing those achieved by QLSTM-based models.  ( 3 min )
    Massive Activations in Large Language Models
    arXiv:2402.17762v1 Announce Type: cross Abstract: We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers.  ( 2 min )
    Scaling on-chip photonic neural processors using arbitrarily programmable wave propagation
    arXiv:2402.17750v1 Announce Type: cross Abstract: On-chip photonic processors for neural networks have potential benefits in both speed and energy efficiency but have not yet reached the scale at which they can outperform electronic processors. The dominant paradigm for designing on-chip photonics is to make networks of relatively bulky discrete components connected by one-dimensional waveguides. A far more compact alternative is to avoid explicitly defining any components and instead sculpt the continuous substrate of the photonic processor to directly perform the computation using waves freely propagating in two dimensions. We propose and demonstrate a device whose refractive index as a function of space, $n(x,z)$, can be rapidly reprogrammed, allowing arbitrary control over the wave propagation in the device. Our device, a 2D-programmable waveguide, combines photoconductive gain with the electro-optic effect to achieve massively parallel modulation of the refractive index of a slab waveguide, with an index modulation depth of $10^{-3}$ and approximately $10^4$ programmable degrees of freedom. We used a prototype device with a functional area of $12\,\text{mm}^2$ to perform neural-network inference with up to 49-dimensional input vectors in a single pass, achieving 96% accuracy on vowel classification and 86% accuracy on $7 \times 7$-pixel MNIST handwritten-digit classification. This is a scale beyond that of previous photonic chips relying on discrete components, illustrating the benefit of the continuous-waves paradigm. In principle, with large enough chip area, the reprogrammability of the device's refractive index distribution enables the reconfigurable realization of any passive, linear photonic circuit or device. This promises the development of more compact and versatile photonic systems for a wide range of applications, including optical processing, smart sensing, spectroscopy, and optical communications.  ( 3 min )
    Evaluating Very Long-Term Conversational Memory of LLM Agents
    arXiv:2402.17753v1 Announce Type: cross Abstract: Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.  ( 2 min )
    Learning-Based Algorithms for Graph Searching Problems
    arXiv:2402.17736v1 Announce Type: cross Abstract: We consider the problem of graph searching with prediction recently introduced by Banerjee et al. (2022). In this problem, an agent, starting at some vertex $r$ has to traverse a (potentially unknown) graph $G$ to find a hidden goal node $g$ while minimizing the total distance travelled. We study a setting in which at any node $v$, the agent receives a noisy estimate of the distance from $v$ to $g$. We design algorithms for this search task on unknown graphs. We establish the first formal guarantees on unknown weighted graphs and provide lower bounds showing that the algorithms we propose have optimal or nearly-optimal dependence on the prediction error. Further, we perform numerical experiments demonstrating that in addition to being robust to adversarial error, our algorithms perform well in typical instances in which the error is stochastic. Finally, we provide alternative simpler performance bounds on the algorithms of Banerjee et al. (2022) for the case of searching on a known graph, and establish new lower bounds for this setting.  ( 2 min )
    reBandit: Random Effects based Online RL algorithm for Reducing Cannabis Use
    arXiv:2402.17739v1 Announce Type: cross Abstract: The escalating prevalence of cannabis use, and associated cannabis-use disorder (CUD), poses a significant public health challenge globally. With a notably wide treatment gap, especially among emerging adults (EAs; ages 18-25), addressing cannabis use and CUD remains a pivotal objective within the 2030 United Nations Agenda for Sustainable Development Goals (SDG). In this work, we develop an online reinforcement learning (RL) algorithm called reBandit which will be utilized in a mobile health study to deliver personalized mobile health interventions aimed at reducing cannabis use among EAs. reBandit utilizes random effects and informative Bayesian priors to learn quickly and efficiently in noisy mobile health environments. Moreover, reBandit employs Empirical Bayes and optimization techniques to autonomously update its hyper-parameters online. To evaluate the performance of our algorithm, we construct a simulation testbed using data from a prior study, and compare against commonly used algorithms in mobile health studies. We show that reBandit performs equally well or better than all the baseline algorithms, and the performance gap widens as population heterogeneity increases in the simulation environment, proving its adeptness to adapt to diverse population of study participants.  ( 2 min )
    Batched Nonparametric Contextual Bandits
    arXiv:2402.17732v1 Announce Type: cross Abstract: We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose Batched Successive Elimination with Dynamic Binning (BaSEDB) that achieves optimal regret (up to logarithmic factors). In essence, BaSEDB dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. We also show the suboptimality of static binning under batch constraints, highlighting the necessity of dynamic binning. Additionally, our results suggest that a nearly constant number of policy updates can attain optimal regret in the fully online setting.  ( 2 min )
    Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence
    arXiv:2402.17722v1 Announce Type: cross Abstract: This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGF, that overcomes the above limitations and relies solely on the standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of gradient mapping. We further extend our results to guarantee high probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-{\L}ojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.  ( 2 min )
    Transfer Learning Bayesian Optimization to Design Competitor DNA Molecules for Use in Diagnostic Assays
    arXiv:2402.17704v1 Announce Type: cross Abstract: With the rise in engineered biomolecular devices, there is an increased need for tailor-made biological sequences. Often, many similar biological sequences need to be made for a specific application meaning numerous, sometimes prohibitively expensive, lab experiments are necessary for their optimization. This paper presents a transfer learning design of experiments workflow to make this development feasible. By combining a transfer learning surrogate model with Bayesian optimization, we show how the total number of experiments can be reduced by sharing information between optimization tasks. We demonstrate the reduction in the number of experiments using data from the development of DNA competitors for use in an amplification-based diagnostic assay. We use cross-validation to compare the predictive accuracy of different transfer learning models, and then compare the performance of the models for both single objective and penalized optimization tasks.  ( 2 min )
    Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet
    arXiv:2402.17701v1 Announce Type: cross Abstract: There have been significant advances in deep learning for music demixing in recent years. However, there has been little attention given to how these neural networks can be adapted for real-time low-latency applications, which could be helpful for hearing aids, remixing audio streams and live shows. In this paper, we investigate the various challenges involved in adapting current demixing models in the literature for this use case. Subsequently, inspired by the Hybrid Demucs architecture, we propose the Hybrid Spectrogram Time-domain Audio Separation Network HS-TasNet, which utilises the advantages of spectral and waveform domains. For a latency of 23 ms, the HS-TasNet obtains an overall signal-to-distortion ratio (SDR) of 4.65 on the MusDB test set, and increases to 5.55 with additional training data. These results demonstrate the potential of efficient demixing for real-time low-latency music applications.  ( 2 min )
    RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
    arXiv:2402.17700v1 Announce Type: cross Abstract: Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.  ( 2 min )
    Learning reduced-order Quadratic-Linear models in Process Engineering using Operator Inference
    arXiv:2402.17698v1 Announce Type: cross Abstract: In this work, we address the challenge of efficiently modeling dynamical systems in process engineering. We use reduced-order model learning, specifically operator inference. This is a non-intrusive, data-driven method for learning dynamical systems from time-domain data. The application in our study is carbon dioxide methanation, an important reaction within the Power-to-X framework, to demonstrate its potential. The numerical results show the ability of the reduced-order models constructed with operator inference to provide a reduced yet accurate surrogate solution. This represents an important milestone towards the implementation of fast and reliable digital twin architectures.  ( 2 min )
    Outlier-Detection for Reactive Machine Learned Potential Energy Surfaces
    arXiv:2402.17686v1 Announce Type: cross Abstract: Uncertainty quantification (UQ) to detect samples with large expected errors (outliers) is applied to reactive molecular potential energy surfaces (PESs). Three methods - Ensembles, Deep Evidential Regression (DER), and Gaussian Mixture Models (GMM) - were applied to the H-transfer reaction between ${\it syn-}$Criegee and vinyl hydroxyperoxide. The results indicate that ensemble models provide the best results for detecting outliers, followed by GMM. For example, from a pool of 1000 structures with the largest uncertainty, the detection quality for outliers is $\sim 90$ \% and $\sim 50$ \%, respectively, if 25 or 1000 structures with large errors are sought. On the contrary, the limitations of the statistical assumptions of DER greatly impacted its prediction capabilities. Finally, a structure-based indicator was found to be correlated with large average error, which may help to rapidly classify new structures into those that provide an advantage for refining the neural network.  ( 2 min )
    Geometric Deep Learning for Computer-Aided Design: A Survey
    arXiv:2402.17695v1 Announce Type: cross Abstract: Geometric Deep Learning techniques have become a transformative force in the field of Computer-Aided Design (CAD), and have the potential to revolutionize how designers and engineers approach and enhance the design process. By harnessing the power of machine learning-based methods, CAD designers can optimize their workflows, save time and effort while making better informed decisions, and create designs that are both innovative and practical. The ability to process the CAD designs represented by geometric data and to analyze their encoded features enables the identification of similarities among diverse CAD models, the proposition of alternative designs and enhancements, and even the generation of novel design alternatives. This survey offers a comprehensive overview of learning-based methods in computer-aided design across various categories, including similarity analysis and retrieval, 2D and 3D CAD model synthesis, and CAD generation from point clouds. Additionally, it provides a complete list of benchmark datasets and their characteristics, along with open-source codes that have propelled research in this domain. The final discussion delves into the challenges prevalent in this field, followed by potential future research directions in this rapidly evolving field.  ( 2 min )
    Supervised machine learning for microbiomics: bridging the gap between current and best practices
    arXiv:2402.17621v1 Announce Type: cross Abstract: Machine learning (ML) is set to accelerate innovations in clinical microbiomics, such as in disease diagnostics and prognostics. This will require high-quality, reproducible, interpretable workflows whose predictive capabilities meet or exceed the high thresholds set for clinical tools by regulatory agencies. Here, we capture a snapshot of current practices in the application of supervised ML to microbiomics data, through an in-depth analysis of 100 peer-reviewed journal articles published in 2021-2022. We apply a data-driven approach to steer discussion of the merits of varied approaches to experimental design, including key considerations such as how to mitigate the effects of small dataset size while avoiding data leakage. We further provide guidance on how to avoid common experimental design pitfalls that can hurt model performance, trustworthiness, and reproducibility. Discussion is accompanied by an interactive online tutorial that demonstrates foundational principles of ML experimental design, tailored to the microbiomics community. Formalizing community best practices for supervised ML in microbiomics is an important step towards improving the success and efficiency of clinical research, to the benefit of patients and other stakeholders.  ( 2 min )
    syren-halofit: A fast, interpretable, high-precision formula for the $\Lambda$CDM nonlinear matter power spectrum
    arXiv:2402.17492v1 Announce Type: cross Abstract: Rapid and accurate evaluation of the nonlinear matter power spectrum, $P(k)$, as a function of cosmological parameters and redshift is of fundamental importance in cosmology. Analytic approximations provide an interpretable solution, yet current approximations are neither fast nor accurate relative to black-box numerical emulators. We use symbolic regression to obtain simple analytic approximations to the nonlinear scale, $k_\sigma$, the effective spectral index, $n_{\rm eff}$, and the curvature, $C$, which are required for the halofit model. We then re-optimise the coefficients of halofit to fit a wide range of cosmologies and redshifts. We then again exploit symbolic regression to explore the space of analytic expressions to fit the residuals between $P(k)$ and the optimised predictions of halofit. All methods are validated against $N$-body simulations. Our symbolic expressions for $k_\sigma$, $n_{\rm eff}$ and $C$ have root mean squared fractional errors of 0.8%, 0.2% and 0.3%, respectively, for redshifts below 3 and a wide range of cosmologies. The re-optimised halofit parameters reduce the root mean squared fractional error from 3% to below 2% for wavenumbers $k=9\times10^{-3}-9 \, h{\rm Mpc^{-1}}$. We introduce syren-halofit (symbolic-regression-enhanced halofit), an extension to halofit containing a short symbolic correction which improves this error to 1%. Our method is 2350 and 3170 times faster than current halofit and hmcode implementations, respectively, and 2680 and 64 times faster than EuclidEmulator2 (which requires running class) and the BACCO emulator. We obtain comparable accuracy to EuclidEmulator2 and the BACCO emulator when tested on $N$-body simulations. Our work greatly increases the speed and accuracy of symbolic approximations to $P(k)$, making them significantly faster than their numerical counterparts without loss of accuracy.  ( 3 min )
    FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems
    arXiv:2402.17583v1 Announce Type: cross Abstract: Postmortem analysis is essential in the management of incidents within cloud systems, which provides valuable insights to improve system's reliability and robustness. At CloudA, fault pattern profiling is performed during the postmortem phase, which involves the classification of incidents' faults into unique categories, referred to as fault pattern. By aggregating and analyzing these fault patterns, engineers can discern common faults, vulnerable components and emerging fault trends. However, this process is currently conducted by manual labeling, which has inherent drawbacks. On the one hand, the sheer volume of incidents means only the most severe ones are analyzed, causing a skewed overview of fault patterns. On the other hand, the complexity of the task demands extensive domain knowledge, which leads to errors and inconsistencies. To address these limitations, we propose an automated approach, named FaultProfIT, for Fault pattern Profiling of Incident Tickets. It leverages hierarchy-guided contrastive learning to train a hierarchy-aware incident encoder and predicts fault patterns with enhanced incident representations. We evaluate FaultProfIT using the production incidents from CloudA. The results demonstrate that FaultProfIT outperforms state-of-the-art methods. Our ablation study and analysis also verify the effectiveness of hierarchy-guided contrastive learning. Additionally, we have deployed FaultProfIT at CloudA for six months. To date, FaultProfIT has analyzed 10,000+ incidents from 30+ cloud services, successfully revealing several fault trends that have informed system improvements.  ( 3 min )
    Adapting Learned Image Codecs to Screen Content via Adjustable Transformations
    arXiv:2402.17544v1 Announce Type: cross Abstract: As learned image codecs (LICs) become more prevalent, their low coding efficiency for out-of-distribution data becomes a bottleneck for some applications. To improve the performance of LICs for screen content (SC) images without breaking backwards compatibility, we propose to introduce parameterized and invertible linear transformations into the coding pipeline without changing the underlying baseline codec's operation flow. We design two neural networks to act as prefilters and postfilters in our setup to increase the coding efficiency and help with the recovery from coding artifacts. Our end-to-end trained solution achieves up to 10% bitrate savings on SC compression compared to the baseline LICs while introducing only 1% extra parameters.  ( 2 min )
    Thermodynamics-informed super-resolution of scarce temporal dynamics data
    arXiv:2402.17506v1 Announce Type: cross Abstract: We present a method to increase the resolution of measurements of a physical system and subsequently predict its time evolution using thermodynamics-aware neural networks. Our method uses adversarial autoencoders, which reduce the dimensionality of the full order model to a set of latent variables that are enforced to match a prior, for example a normal distribution. Adversarial autoencoders are seen as generative models, and they can be trained to generate high-resolution samples from low-resoution inputs, meaning they can address the so-called super-resolution problem. Then, a second neural network is trained to learn the physical structure of the latent variables and predict their temporal evolution. This neural network is known as an structure-preserving neural network. It learns the metriplectic-structure of the system and applies a physical bias to ensure that the first and second principles of thermodynamics are fulfilled. The integrated trajectories are decoded to their original dimensionality, as well as to the higher dimensionality space produced by the adversarial autoencoder and they are compared to the ground truth solution. The method is tested with two examples of flow over a cylinder, where the fluid properties are varied between both examples.  ( 2 min )
    Predicting Instability in Complex Oscillator Networks: Limitations and Potentials of Network Measures and Machine Learning
    arXiv:2402.17500v1 Announce Type: cross Abstract: A central question of network science is how functional properties of systems arise from their structure. For networked dynamical systems, structure is typically quantified with network measures. A functional property that is of theoretical and practical interest for oscillatory systems is the stability of synchrony to localized perturbations. Recently, Graph Neural Networks (GNNs) have been shown to predict this stability successfully; at the same time, network measures have struggled to paint a clear picture. Here we collect 46 relevant network measures and find that no small subset can reliably predict stability. The performance of GNNs can only be matched by combining all network measures and nodewise machine learning. However, unlike GNNs, this approach fails to extrapolate from network ensembles to several real power grid topologies. This suggests that correlations of network measures and function may be misleading, and that GNNs capture the causal relationship between structure and stability substantially better.  ( 2 min )
    Bit Rate Matching Algorithm Optimization in JPEG-AI Verification Model
    arXiv:2402.17487v1 Announce Type: cross Abstract: The research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn the non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices over their classical counterparts. Those properties evoked the attention of both scientific and industrial communities, resulting in the standardization activity JPEG-AI. The verification model for the standardization process of JPEG-AI is already in development and has surpassed the advanced VVC intra codec. To generate reconstructed images with the desired bits per pixel and assess the BD-rate performance of both the JPEG-AI verification model and VVC intra, bit rate matching is employed. However, the current state of the JPEG-AI verification model experiences significant slowdowns during bit rate matching, resulting in suboptimal performance due to an unsuitable model. The proposed methodology offers a gradual algorithmic optimization for matching bit rates, resulting in a fourfold acceleration and over 1% improvement in BD-rate at the base operation point. At the high operation point, the acceleration increases up to sixfold.  ( 2 min )
    Bit Distribution Study and Implementation of Spatial Quality Map in the JPEG-AI Standardization
    arXiv:2402.17470v1 Announce Type: cross Abstract: Currently, there is a high demand for neural network-based image compression codecs. These codecs employ non-linear transforms to create compact bit representations and facilitate faster coding speeds on devices compared to the hand-crafted transforms used in classical frameworks. The scientific and industrial communities are highly interested in these properties, leading to the standardization effort of JPEG-AI. The JPEG-AI verification model has been released and is currently under development for standardization. Utilizing neural networks, it can outperform the classic codec VVC intra by over 10% BD-rate operating at base operation point. Researchers attribute this success to the flexible bit distribution in the spatial domain, in contrast to VVC intra's anchor that is generated with a constant quality point. However, our study reveals that VVC intra displays a more adaptable bit distribution structure through the implementation of various block sizes. As a result of our observations, we have proposed a spatial bit allocation method to optimize the JPEG-AI verification model's bit distribution and enhance the visual quality. Furthermore, by applying the VVC bit distribution strategy, the objective performance of JPEG-AI verification mode can be further improved, resulting in a maximum gain of 0.45 dB in PSNR-Y.  ( 3 min )
    The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns
    arXiv:2402.17431v1 Announce Type: cross Abstract: Artificial intelligence is continuously seeking novel challenges and benchmarks to effectively measure performance and to advance the state-of-the-art. In this paper we introduce KANDY, a benchmarking framework that can be used to generate a variety of learning and reasoning tasks inspired by Kandinsky patterns. By creating curricula of binary classification tasks with increasing complexity and with sparse supervisions, KANDY can be used to implement benchmarks for continual and semi-supervised learning, with a specific focus on symbol compositionality. Classification rules are also provided in the ground truth to enable analysis of interpretable solutions. Together with the benchmark generation pipeline, we release two curricula, an easier and a harder one, that we propose as new challenges for the research community. With a thorough experimental evaluation, we show how both state-of-the-art neural models and purely symbolic approaches struggle with solving most of the tasks, thus calling for the application of advanced neuro-symbolic methods trained over time.  ( 2 min )
    A novel image space formalism of Fourier domain interpolation neural networks for noise propagation analysis
    arXiv:2402.17410v1 Announce Type: cross Abstract: Purpose: To develop an image space formalism of multi-layer convolutional neural networks (CNNs) for Fourier domain interpolation in MRI reconstructions and analytically estimate noise propagation during CNN inference. Theory and Methods: Nonlinear activations in the Fourier domain (also known as k-space) using complex-valued Rectifier Linear Units are expressed as elementwise multiplication with activation masks. This operation is transformed into a convolution in the image space. After network training in k-space, this approach provides an algebraic expression for the derivative of the reconstructed image with respect to the aliased coil images, which serve as the input tensors to the network in the image space. This allows the variance in the network inference to be estimated analytically and to be used to describe noise characteristics. Monte-Carlo simulations and numerical approaches based on auto-differentiation were used for validation. The framework was tested on retrospectively undersampled invivo brain images. Results: Inferences conducted in the image domain are quasi-identical to inferences in the k-space, underlined by corresponding quantitative metrics. Noise variance maps obtained from the analytical expression correspond with those obtained via Monte-Carlo simulations, as well as via an auto-differentiation approach. The noise resilience is well characterized, as in the case of classical Parallel Imaging. Komolgorov-Smirnov tests demonstrate Gaussian distributions of voxel magnitudes in variance maps obtained via Monte-Carlo simulations. Conclusion: The quasi-equivalent image space formalism for neural networks for k-space interpolation enables fast and accurate description of the noise characteristics during CNN inference, analogous to geometry-factor maps in traditional parallel imaging methods.  ( 3 min )
    LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
    arXiv:2402.17406v1 Announce Type: cross Abstract: Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed as prompts. Contemporary VPT methodologies, especially when employed with self-supervised vision transformers, often default to the introduction of new learnable prompts or gated prompt tokens predominantly sourced from the model's previous block. A pivotal oversight in such approaches is their failure to harness the potential of long-range previous blocks as sources of prompts within each self-supervised ViT. To bridge this crucial gap, we introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning. Drawing inspiration from the intricacies of the human brain, LSPT ingeniously incorporates long-term gated prompts. This feature serves as temporal coding, curbing the risk of forgetting parameters acquired from earlier blocks. Further enhancing its prowess, LSPT brings into play patch tokens, serving as spatial coding. This is strategically designed to perpetually amass class-conscious features, thereby fortifying the model's prowess in distinguishing and identifying visual categories. To validate the efficacy of our proposed method, we engaged in rigorous experimentation across 5 FGVC and 19 VTAB-1K benchmarks. Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.  ( 2 min )
    A Quantum Approach to Synthetic Minority Oversampling Technique (SMOTE)
    arXiv:2402.17398v1 Announce Type: cross Abstract: The paper proposes the Quantum-SMOTE method, a novel solution that uses quantum computing techniques to solve the prevalent problem of class imbalance in machine learning datasets. Quantum-SMOTE, inspired by the Synthetic Minority Oversampling Technique (SMOTE), generates synthetic data points using quantum processes such as swap tests and quantum rotation. The process varies from the conventional SMOTE algorithm's usage of K-Nearest Neighbors (KNN) and Euclidean distances, enabling synthetic instances to be generated from minority class data points without relying on neighbor proximity. The algorithm asserts greater control over the synthetic data generation process by introducing hyperparameters such as rotation angle, minority percentage, and splitting factor, which allow for customization to specific dataset requirements. The approach is tested on a public dataset of TelecomChurn and evaluated alongside two prominent classification algorithms, Random Forest and Logistic Regression, to determine its impact along with varying proportions of synthetic data.  ( 2 min )
    Beacon, a lightweight deep reinforcement learning benchmark library for flow control
    arXiv:2402.17402v1 Announce Type: cross Abstract: Recently, the increasing use of deep reinforcement learning for flow control problems has led to a new area of research, focused on the coupling and the adaptation of the existing algorithms to the control of numerical fluid dynamics environments. Although still in its infancy, the field has seen multiple successes in a short time span, and its fast development pace can certainly be partly imparted to the open-source effort that drives the expansion of the community. Yet, this emerging domain still misses a common ground to (i) ensure the reproducibility of the results, and (ii) offer a proper ad-hoc benchmarking basis. To this end, we propose Beacon, an open-source benchmark library composed of seven lightweight 1D and 2D flow control problems with various characteristics, action and observation space characteristics, and CPU requirements. In this contribution, the seven considered problems are described, and reference control solutions are provided. The sources for the following work are available at https://github.com/jviquerat/beacon.  ( 2 min )
    Impact of Computation in Integral Reinforcement Learning for Continuous-Time Control
    arXiv:2402.17375v1 Announce Type: cross Abstract: Integral reinforcement learning (IntRL) demands the precise computation of the utility function's integral at its policy evaluation (PEV) stage. This is achieved through quadrature rules, which are weighted sums of utility functions evaluated from state samples obtained in discrete time. Our research reveals a critical yet underexplored phenomenon: the choice of the computational method -- in this case, the quadrature rule -- can significantly impact control performance. This impact is traced back to the fact that computational errors introduced in the PEV stage can affect the policy iteration's convergence behavior, which in turn affects the learned controller. To elucidate how computation impacts control, we draw a parallel between IntRL's policy iteration and Newton's method applied to the Hamilton-Jacobi-Bellman equation. In this light, computational error in PEV manifests as an extra error term in each iteration of Newton's method, with its upper bound proportional to the computational error. Further, we demonstrate that when the utility function resides in a reproducing kernel Hilbert space (RKHS), the optimal quadrature is achievable by employing Bayesian quadrature with the RKHS-inducing kernel function. We prove that the local convergence rates for IntRL using the trapezoidal rule and Bayesian quadrature with a Mat\'ern kernel to be $O(N^{-2})$ and $O(N^{-b})$, where $N$ is the number of evenly-spaced samples and $b$ is the Mat\'ern kernel's smoothness parameter. These theoretical findings are finally validated by two canonical control tasks.  ( 3 min )
    Accelerating Diffusion Sampling with Optimized Time Steps
    arXiv:2402.17376v1 Announce Type: cross Abstract: Diffusion probabilistic models (DPMs) have shown remarkable performance in high-resolution image synthesis, but their sampling efficiency is still to be desired due to the typically large number of sampling steps. Recent advancements in high-order numerical ODE solvers for DPMs have enabled the generation of high-quality images with much fewer sampling steps. While this is a significant development, most sampling methods still employ uniform time steps, which is not optimal when using a small number of steps. To address this issue, we propose a general framework for designing an optimization problem that seeks more appropriate time steps for a specific numerical ODE solver for DPMs. This optimization problem aims to minimize the distance between the ground-truth solution to the ODE and an approximate solution corresponding to the numerical solver. It can be efficiently solved using the constrained trust region method, taking less than $15$ seconds. Our extensive experiments on both unconditional and conditional sampling using pixel- and latent-space DPMs demonstrate that, when combined with the state-of-the-art sampling method UniPC, our optimized time steps significantly improve image generation performance in terms of FID scores for datasets such as CIFAR-10 and ImageNet, compared to using uniform time steps.  ( 2 min )
    Understanding the training of PINNs for unsteady flow past a plunging foil through the lens of input subdomain level loss function gradients
    arXiv:2402.17346v1 Announce Type: cross Abstract: Recently immersed boundary method-inspired physics-informed neural networks (PINNs) including the moving boundary-enabled PINNs (MB-PINNs) have shown the ability to accurately reconstruct velocity and recover pressure as a hidden variable for unsteady flow past moving bodies. Considering flow past a plunging foil, MB-PINNs were trained with global physics loss relaxation and also in conjunction with a physics-based undersampling method, obtaining good accuracy. The purpose of this study was to investigate which input spatial subdomain contributes to the training under the effect of physics loss relaxation and physics-based undersampling. In the context of MB-PINNs training, three spatial zones: the moving body, wake, and outer zones were defined. To quantify which spatial zone drives the training, two novel metrics are computed from the zonal loss component gradient statistics and the proportion of sample points in each zone. Results confirm that the learning indeed depends on the combined effect of the zonal loss component gradients and the proportion of points in each zone. Moreover, the dominant input zones are also the ones that have the strongest solution gradients in some sense.  ( 2 min )
    CGGM: A conditional graph generation model with adaptive sparsity for node anomaly detection in IoT networks
    arXiv:2402.17363v1 Announce Type: cross Abstract: Dynamic graphs are extensively employed for detecting anomalous behavior in nodes within the Internet of Things (IoT). Generative models are often used to address the issue of imbalanced node categories in dynamic graphs. Nevertheless, the constraints it faces include the monotonicity of adjacency relationships, the difficulty in constructing multi-dimensional features for nodes, and the lack of a method for end-to-end generation of multiple categories of nodes. This paper presents a novel graph generation model, called CGGM, designed specifically to generate a larger number of nodes belonging to the minority class. The mechanism for generating an adjacency matrix, through adaptive sparsity, enhances flexibility in its structure. The feature generation module, called multidimensional features generator (MFG) to generate node features along with topological information. Labels are transformed into embedding vectors, serving as conditional constraints to control the generation of synthetic data across multiple categories. Using a multi-stage loss, the distribution of synthetic data is adjusted to closely resemble that of real data. In extensive experiments, we show that CGGM's synthetic data outperforms state-of-the-art methods across various metrics. Our results demonstrate efficient generation of diverse data categories, robustly enhancing multi-category classification model performance.  ( 2 min )
    Outdoor Environment Reconstruction with Deep Learning on Radio Propagation Paths
    arXiv:2402.17336v1 Announce Type: cross Abstract: Conventional methods for outdoor environment reconstruction rely predominantly on vision-based techniques like photogrammetry and LiDAR, facing limitations such as constrained coverage, susceptibility to environmental conditions, and high computational and energy demands. These challenges are particularly pronounced in applications like augmented reality navigation, especially when integrated with wearable devices featuring constrained computational resources and energy budgets. In response, this paper proposes a novel approach harnessing ambient wireless signals for outdoor environment reconstruction. By analyzing radio frequency (RF) data, the paper aims to deduce the environmental characteristics and digitally reconstruct the outdoor surroundings. Investigating the efficacy of selected deep learning (DL) techniques on the synthetic RF dataset WAIR-D, the study endeavors to address the research gap in this domain. Two DL-driven approaches are evaluated (convolutional U-Net and CLIP+ based on vision transformers), with performance assessed using metrics like intersection-over-union (IoU), Hausdorff distance, and Chamfer distance. The results demonstrate promising performance of the RF-based reconstruction method, paving the way towards lightweight and scalable reconstruction solutions.  ( 3 min )
    Scaling Supervised Local Learning with Augmented Auxiliary Networks
    arXiv:2402.17318v1 Announce Type: cross Abstract: Deep neural networks are typically trained using global error signals that backpropagate (BP) end-to-end, which is not only biologically implausible but also suffers from the update locking problem and requires huge memory consumption. Local learning, which updates each layer independently with a gradient-isolated auxiliary network, offers a promising alternative to address the above problems. However, existing local learning methods are confronted with a large accuracy gap with the BP counterpart, particularly for large-scale networks. This is due to the weak coupling between local layers and their subsequent network layers, as there is no gradient communication across layers. To tackle this issue, we put forward an augmented local learning method, dubbed AugLocal. AugLocal constructs each hidden layer's auxiliary network by uniformly selecting a small subset of layers from its subsequent network layers to enhance their synergy. We also propose to linearly reduce the depth of auxiliary networks as the hidden layer goes deeper, ensuring sufficient network capacity while reducing the computational cost of auxiliary networks. Our extensive experiments on four image classification datasets (i.e., CIFAR-10, SVHN, STL-10, and ImageNet) demonstrate that AugLocal can effectively scale up to tens of local layers with a comparable accuracy to BP-trained networks while reducing GPU memory usage by around 40%. The proposed AugLocal method, therefore, opens up a myriad of opportunities for training high-performance deep neural networks on resource-constrained platforms.Code is available at https://github.com/ChenxiangMA/AugLocal.  ( 3 min )
    Quantum Distance Approximation for Persistence Diagrams
    arXiv:2402.17295v1 Announce Type: cross Abstract: Topological Data Analysis methods can be useful for classification and clustering tasks in many different fields as they can provide two dimensional persistence diagrams that summarize important information about the shape of potentially complex and high dimensional data sets. The space of persistence diagrams can be endowed with various metrics such as the Wasserstein distance which admit a statistical structure and allow to use these summaries for machine learning algorithms. However, computing the distance between two persistence diagrams involves finding an optimal way to match the points of the two diagrams and may not always be an easy task for classical computers. In this work we explore the potential of quantum computers to estimate the distance between persistence diagrams, in particular we propose variational quantum algorithms for the Wasserstein distance as well as the $d^{c}_{p}$ distance. Our implementation is a weighted version of the Quantum Approximate Optimization Algorithm that relies on control clauses to encode the constraints of the optimization problem.  ( 2 min )
    How we won BraTS 2023 Adult Glioma challenge? Just faking it! Enhanced Synthetic Data Augmentation and Model Ensemble for brain tumour segmentation
    arXiv:2402.17317v1 Announce Type: cross Abstract: Deep Learning is the state-of-the-art technology for segmenting brain tumours. However, this requires a lot of high-quality data, which is difficult to obtain, especially in the medical field. Therefore, our solutions address this problem by using unconventional mechanisms for data augmentation. Generative adversarial networks and registration are used to massively increase the amount of available samples for training three different deep learning models for brain tumour segmentation, the first task of the BraTS2023 challenge. The first model is the standard nnU-Net, the second is the Swin UNETR and the third is the winning solution of the BraTS 2021 Challenge. The entire pipeline is built on the nnU-Net implementation, except for the generation of the synthetic data. The use of convolutional algorithms and transformers is able to fill each other's knowledge gaps. Using the new metric, our best solution achieves the dice results 0.9005, 0.8673, 0.8509 and HD95 14.940, 14.467, 17.699 (whole tumour, tumour core and enhancing tumour) in the validation set.  ( 3 min )
    Multi-Agent, Human-Agent and Beyond: A Survey on Cooperation in Social Dilemmas
    arXiv:2402.17270v1 Announce Type: cross Abstract: The study of cooperation within social dilemmas has long been a fundamental topic across various disciplines, including computer science and social science. Recent advancements in Artificial Intelligence (AI) have significantly reshaped this field, offering fresh insights into understanding and enhancing cooperation. This survey examines three key areas at the intersection of AI and cooperation in social dilemmas. First, focusing on multi-agent cooperation, we review the intrinsic and external motivations that support cooperation among rational agents, and the methods employed to develop effective strategies against diverse opponents. Second, looking into human-agent cooperation, we discuss the current AI algorithms for cooperating with humans and the human biases towards AI agents. Third, we review the emergent field of leveraging AI agents to enhance cooperation among humans. We conclude by discussing future research avenues, such as using large language models, establishing unified theoretical frameworks, revisiting existing theories of human cooperation, and exploring multiple real-world applications.  ( 2 min )
    Deep Learning-Based Speech and Vision Synthesis to Improve Phishing Attack Detection through a Multi-layer Adaptive Framework
    arXiv:2402.17249v1 Announce Type: cross Abstract: The ever-evolving ways attacker continues to im prove their phishing techniques to bypass existing state-of-the-art phishing detection methods pose a mountain of challenges to researchers in both industry and academia research due to the inability of current approaches to detect complex phishing attack. Thus, current anti-phishing methods remain vulnerable to complex phishing because of the increasingly sophistication tactics adopted by attacker coupled with the rate at which new tactics are being developed to evade detection. In this research, we proposed an adaptable framework that combines Deep learning and Randon Forest to read images, synthesize speech from deep-fake videos, and natural language processing at various predictions layered to significantly increase the performance of machine learning models for phishing attack detection.  ( 2 min )
    SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging
    arXiv:2402.17246v1 Announce Type: cross Abstract: Automated classification of liver lesions in multi-phase CT and MR scans is of clinical significance but challenging. This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework, specifically designed for liver lesion classification in 3D multi-phase CT and MR imaging with varying phase counts. The proposed SDR-Former utilizes a streamlined Siamese Neural Network (SNN) to process multi-phase imaging inputs, possessing robust feature representations while maintaining computational efficiency. The weight-sharing feature of the SNN is further enriched by a hybrid Dual-Resolution Transformer (DR-Former), comprising a 3D Convolutional Neural Network (CNN) and a tailored 3D Transformer for processing high- and low-resolution images, respectively. This hybrid sub-architecture excels in capturing detailed local features and understanding global contextual information, thereby, boosting the SNN's feature extraction capabilities. Additionally, a novel Adaptive Phase Selection Module (APSM) is introduced, promoting phase-specific intercommunication and dynamically adjusting each phase's influence on the diagnostic outcome. The proposed SDR-Former framework has been validated through comprehensive experiments on two clinical datasets: a three-phase CT dataset and an eight-phase MR dataset. The experimental results affirm the efficacy of the proposed framework. To support the scientific community, we are releasing our extensive multi-phase MR dataset for liver lesion analysis to the public. This pioneering dataset, being the first publicly available multi-phase MR dataset in this field, also underpins the MICCAI LLD-MMRI Challenge. The dataset is accessible at:https://bit.ly/3IyYlgN.  ( 3 min )
    A Review of Data Mining in Personalized Education: Current Trends and Future Prospects
    arXiv:2402.17236v1 Announce Type: cross Abstract: Personalized education, tailored to individual student needs, leverages educational technology and artificial intelligence (AI) in the digital age to enhance learning effectiveness. The integration of AI in educational platforms provides insights into academic performance, learning preferences, and behaviors, optimizing the personal learning process. Driven by data mining techniques, it not only benefits students but also provides educators and institutions with tools to craft customized learning experiences. To offer a comprehensive review of recent advancements in personalized educational data mining, this paper focuses on four primary scenarios: educational recommendation, cognitive diagnosis, knowledge tracing, and learning analysis. This paper presents a structured taxonomy for each area, compiles commonly used datasets, and identifies future research directions, emphasizing the role of data mining in enhancing personalized education and paving the way for future exploration and innovation.  ( 2 min )
    Two-scale Neural Networks for Partial Differential Equations with Small Parameters
    arXiv:2402.17232v1 Announce Type: cross Abstract: We propose a two-scale neural network method for solving partial differential equations (PDEs) with small parameters using physics-informed neural networks (PINNs). We directly incorporate the small parameters into the architecture of neural networks. The proposed method enables solving PDEs with small parameters in a simple fashion, without adding Fourier features or other computationally taxing searches of truncation parameters. Various numerical examples demonstrate reasonable accuracy in capturing features of large derivatives in the solutions caused by small parameters.  ( 2 min )
    Preserving Fairness Generalization in Deepfake Detection
    arXiv:2402.17229v1 Announce Type: cross Abstract: Although effective deepfake detection models have been developed in recent years, recent studies have revealed that these models can result in unfair performance disparities among demographic groups, such as race and gender. This can lead to particular groups facing unfair targeting or exclusion from detection, potentially allowing misclassified deepfakes to manipulate public opinion and undermine trust in the model. The existing method for addressing this problem is providing a fair loss function. It shows good fairness performance for intra-domain evaluation but does not maintain fairness for cross-domain testing. This highlights the significance of fairness generalization in the fight against deepfakes. In this work, we propose the first method to address the fairness generalization problem in deepfake detection by simultaneously considering features, loss, and optimization aspects. Our method employs disentanglement learning to extract demographic and domain-agnostic forgery features, fusing them to encourage fair learning across a flattened loss landscape. Extensive experiments on prominent deepfake datasets demonstrate our method's effectiveness, surpassing state-of-the-art approaches in preserving fairness during cross-domain deepfake detection. The code is available at https://github.com/Purdue-M2/Fairness-Generalization  ( 2 min )
    Application of Machine Learning Optimization in Cloud Computing Resource Scheduling and Management
    arXiv:2402.17216v1 Announce Type: cross Abstract: In recent years, cloud computing has been widely used. Cloud computing refers to the centralized computing resources, users through the access to the centralized resources to complete the calculation, the cloud computing center will return the results of the program processing to the user. Cloud computing is not only for individual users, but also for enterprise users. By purchasing a cloud server, users do not have to buy a large number of computers, saving computing costs. According to a report by China Economic News Network, the scale of cloud computing in China has reached 209.1 billion yuan. At present, the more mature cloud service providers in China are Ali Cloud, Baidu Cloud, Huawei Cloud and so on. Therefore, this paper proposes an innovative approach to solve complex problems in cloud computing resource scheduling and management using machine learning optimization techniques. Through in-depth study of challenges such as low resource utilization and unbalanced load in the cloud environment, this study proposes a comprehensive solution, including optimization methods such as deep learning and genetic algorithm, to improve system performance and efficiency, and thus bring new breakthroughs and progress in the field of cloud computing resource management.Rational allocation of resources plays a crucial role in cloud computing. In the resource allocation of cloud computing, the cloud computing center has limited cloud resources, and users arrive in sequence. Each user requests the cloud computing center to use a certain number of cloud resources at a specific time.  ( 3 min )
    Measuring Vision-Language STEM Skills of Neural Models
    arXiv:2402.17205v1 Announce Type: cross Abstract: We introduce a new challenge to test the STEM skills of neural models. The problems in the real world often require solutions, combining knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, our dataset requires the understanding of multimodal vision-language information of STEM. Our dataset features one of the largest and most comprehensive datasets for the challenge. It includes 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that the recent model advances only help master a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, not to mention near expert-level performance. To understand and increase the performance on our dataset, we teach the models on a training split of our dataset. Even though we observe improved performance, the model performance remains relatively low compared to average elementary students. To solve STEM problems, we will need novel algorithmic innovations from the community.  ( 2 min )
    Prediction of the SYM-H Index Using a Bayesian Deep Learning Method with Uncertainty Quantification
    arXiv:2402.17196v1 Announce Type: cross Abstract: We propose a novel deep learning framework, named SYMHnet, which employs a graph neural network and a bidirectional long short-term memory network to cooperatively learn patterns from solar wind and interplanetary magnetic field parameters for short-term forecasts of the SYM-H index based on 1-minute and 5-minute resolution data. SYMHnet takes, as input, the time series of the parameters' values provided by NASA's Space Science Data Coordinated Archive and predicts, as output, the SYM-H index value at time point t + w hours for a given time point t where w is 1 or 2. By incorporating Bayesian inference into the learning framework, SYMHnet can quantify both aleatoric (data) uncertainty and epistemic (model) uncertainty when predicting future SYM-H indices. Experimental results show that SYMHnet works well at quiet time and storm time, for both 1-minute and 5-minute resolution data. The results also show that SYMHnet generally performs better than related machine learning methods. For example, SYMHnet achieves a forecast skill score (FSS) of 0.343 compared to the FSS of 0.074 of a recent gradient boosting machine (GBM) method when predicting SYM-H indices (1 hour in advance) in a large storm (SYM-H = -393 nT) using 5-minute resolution data. When predicting the SYM-H indices (2 hours in advance) in the large storm, SYMHnet achieves an FSS of 0.553 compared to the FSS of 0.087 of the GBM method. In addition, SYMHnet can provide results for both data and model uncertainty quantification, whereas the related methods cannot.  ( 3 min )
    AI-Driven Anonymization: Protecting Personal Data Privacy While Leveraging Machine Learning
    arXiv:2402.17191v1 Announce Type: cross Abstract: The development of artificial intelligence has significantly transformed people's lives. However, it has also posed a significant threat to privacy and security, with numerous instances of personal information being exposed online and reports of criminal attacks and theft. Consequently, the need to achieve intelligent protection of personal information through machine learning algorithms has become a paramount concern. Artificial intelligence leverages advanced algorithms and technologies to effectively encrypt and anonymize personal data, enabling valuable data analysis and utilization while safeguarding privacy. This paper focuses on personal data privacy protection and the promotion of anonymity as its core research objectives. It achieves personal data privacy protection and detection through the use of machine learning's differential privacy protection algorithm. The paper also addresses existing challenges in machine learning related to privacy and personal data protection, offers improvement suggestions, and analyzes factors impacting datasets to enable timely personal data privacy detection and protection.  ( 2 min )
    When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
    arXiv:2402.17193v1 Announce Type: cross Abstract: While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding on the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning -- full-model tuning (FMT) and parameter efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a powerbased multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods.  ( 2 min )
    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
    arXiv:2402.17177v1 Announce Type: cross Abstract: Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and show potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.  ( 2 min )
    Time series generation for option pricing on quantum computers using tensor network
    arXiv:2402.17148v1 Announce Type: cross Abstract: Finance, especially option pricing, is a promising industrial field that might benefit from quantum computing. While quantum algorithms for option pricing have been proposed, it is desired to devise more efficient implementations of costly operations in the algorithms, one of which is preparing a quantum state that encodes a probability distribution of the underlying asset price. In particular, in pricing a path-dependent option, we need to generate a state encoding a joint distribution of the underlying asset price at multiple time points, which is more demanding. To address these issues, we propose a novel approach using Matrix Product State (MPS) as a generative model for time series generation. To validate our approach, taking the Heston model as a target, we conduct numerical experiments to generate time series in the model. Our findings demonstrate the capability of the MPS model to generate paths in the Heston model, highlighting its potential for path-dependent option pricing on quantum computers.  ( 2 min )
    Energy-Efficient Scheduling with Predictions
    arXiv:2402.17143v1 Announce Type: cross Abstract: An important goal of modern scheduling systems is to efficiently manage power usage. In energy-efficient scheduling, the operating system controls the speed at which a machine is processing jobs with the dual objective of minimizing energy consumption and optimizing the quality of service cost of the resulting schedule. Since machine-learned predictions about future requests can often be learned from historical data, a recent line of work on learning-augmented algorithms aims to achieve improved performance guarantees by leveraging predictions. In particular, for energy-efficient scheduling, Bamas et. al. [BamasMRS20] and Antoniadis et. al. [antoniadis2021novel] designed algorithms with predictions for the energy minimization with deadlines problem and achieved an improved competitive ratio when the prediction error is small while also maintaining worst-case bounds even when the prediction error is arbitrarily large. In this paper, we consider a general setting for energy-efficient scheduling and provide a flexible learning-augmented algorithmic framework that takes as input an offline and an online algorithm for the desired energy-efficient scheduling problem. We show that, when the prediction error is small, this framework gives improved competitive ratios for many different energy-efficient scheduling problems, including energy minimization with deadlines, while also maintaining a bounded competitive ratio regardless of the prediction error. Finally, we empirically demonstrate that this framework achieves an improved performance on real and synthetic datasets.  ( 2 min )
    Repeated Contracting with Multiple Non-Myopic Agents: Policy Regret and Limited Liability
    arXiv:2402.17108v1 Announce Type: cross Abstract: We study a repeated contracting setting in which a Principal adaptively chooses amongst $k$ Agents at each of $T$ rounds. The Agents are non-myopic, and so a mechanism for the Principal induces a $T$-round extensive form game amongst the Agents. We give several results aimed at understanding an under-explored aspect of contract theory -- the game induced when choosing an Agent to contract with. First, we show that this game admits a pure-strategy \emph{non-responsive} equilibrium amongst the Agents -- informally an equilibrium in which the Agent's actions depend on the history of realized states of nature, but not on the history of each other's actions, and so avoids the complexities of collusion and threats. Next, we show that if the Principal selects Agents using a \emph{monotone} bandit algorithm, then for any concave contract, in any such equilibrium, the Principal obtains no regret to contracting with the best Agent in hindsight -- not just given their realized actions, but also to the counterfactual world in which they had offered a guaranteed $T$-round contract to the best Agent in hindsight, which would have induced a different sequence of actions. Finally, we show that if the Principal selects Agents using a monotone bandit algorithm which guarantees no swap-regret, then the Principal can additionally offer only limited liability contracts (in which the Agent never needs to pay the Principal) while getting no-regret to the counterfactual world in which she offered a linear contract to the best Agent in hindsight -- despite the fact that linear contracts are not limited liability. We instantiate this theorem by demonstrating the existence of a monotone no swap-regret bandit algorithm, which to our knowledge has not previously appeared in the literature.  ( 3 min )
    Dataset Fairness: Achievable Fairness on Your Data With Utility Guarantees
    arXiv:2402.17106v1 Announce Type: cross Abstract: In machine learning fairness, training models which minimize disparity across different sensitive groups often leads to diminished accuracy, a phenomenon known as the fairness-accuracy trade-off. The severity of this trade-off fundamentally depends on dataset characteristics such as dataset imbalances or biases. Therefore using a uniform fairness requirement across datasets remains questionable and can often lead to models with substantially low utility. To address this, we present a computationally efficient approach to approximate the fairness-accuracy trade-off curve tailored to individual datasets, backed by rigorous statistical guarantees. By utilizing the You-Only-Train-Once (YOTO) framework, our approach mitigates the computational burden of having to train multiple models when approximating the trade-off curve. Moreover, we quantify the uncertainty in our approximation by introducing confidence intervals around this curve, offering a statistically grounded perspective on the acceptable range of fairness violations for any given accuracy threshold. Our empirical evaluation spanning tabular, image and language datasets underscores that our approach provides practitioners with a principled framework for dataset-specific fairness decisions across various data modalities.  ( 2 min )
    Learning high-dimensional targets by two-parameter models and gradient flow
    arXiv:2402.17089v1 Announce Type: cross Abstract: We explore the theoretical possibility of learning $d$-dimensional targets with $W$-parameter models by gradient flow (GF) when $W<d$. Our main result shows that if the targets are described by a particular $d$-dimensional probability distribution, then there exist models with as few as two parameters that can learn the targets with arbitrarily high success probability. On the other hand, we show that for $W<d$ there is necessarily a large subset of GF-non-learnable targets. In particular, the set of learnable targets is not dense in $\mathbb R^d$, and any subset of $\mathbb R^d$ homeomorphic to the $W$-dimensional sphere contains non-learnable targets. Finally, we observe that the model in our main theorem on almost guaranteed two-parameter learning is constructed using a hierarchical procedure and as a result is not expressible by a single elementary function. We show that this limitation is essential in the sense that such learnability can be ruled out for a large class of elementary functions.  ( 2 min )
    A Note on Bayesian Networks with Latent Root Variables
    arXiv:2402.17087v1 Announce Type: cross Abstract: We characterise the likelihood function computed from a Bayesian network with latent variables as root nodes. We show that the marginal distribution over the remaining, manifest, variables also factorises as a Bayesian network, which we call empirical. A dataset of observations of the manifest variables allows us to quantify the parameters of the empirical Bayesian net. We prove that (i) the likelihood of such a dataset from the original Bayesian network is dominated by the global maximum of the likelihood from the empirical one; and that (ii) such a maximum is attained if and only if the parameters of the Bayesian network are consistent with those of the empirical model.  ( 2 min )
    Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions
    arXiv:2402.17065v1 Announce Type: cross Abstract: Despite the extensive research on training generative adversarial networks (GANs) with limited training data, learning to generate images from long-tailed training distributions remains fairly unexplored. In the presence of imbalanced multi-class training data, GANs tend to favor classes with more samples, leading to the generation of low-quality and less diverse samples in tail classes. In this study, we aim to improve the training of class-conditional GANs with long-tailed data. We propose a straightforward yet effective method for knowledge sharing, allowing tail classes to borrow from the rich information from classes with more abundant training data. More concretely, we propose modifications to existing class-conditional GAN architectures to ensure that the lower-resolution layers of the generator are trained entirely unconditionally while reserving class-conditional generation for the higher-resolution layers. Experiments on several long-tail benchmarks and GAN architectures demonstrate a significant improvement over existing methods in both the diversity and fidelity of the generated images. The code is available at https://github.com/khorrams/utlo.  ( 2 min )
    An Investigation into the Performances of the State-of-the-art Machine Learning Approaches for Various Cyber-attack Detection: A Survey
    arXiv:2402.17045v1 Announce Type: cross Abstract: To secure computers and information systems from attackers taking advantage of vulnerabilities in the system to commit cybercrime, several methods have been proposed for real-time detection of vulnerabilities to improve security around information systems. Of all the proposed methods, machine learning had been the most effective method in securing a system with capabilities ranging from early detection of software vulnerabilities to real-time detection of ongoing compromise in a system. As there are different types of cyberattacks, each of the existing state-of-the-art machine learning models depends on different algorithms for training which also impact their suitability for detection of a particular type of cyberattack. In this research, we analyzed each of the current state-of-theart machine learning models for different types of cyberattack detection from the past 10 years with a major emphasis on the most recent works for comparative study to identify the knowledge gap where work is still needed to be done with regard to detection of each category of cyberattack  ( 2 min )
    Towards Generalizing Inferences from Trials to Target Populations
    arXiv:2402.17042v1 Announce Type: cross Abstract: Randomized Controlled Trials (RCTs) are pivotal in generating internally valid estimates with minimal assumptions, serving as a cornerstone for researchers dedicated to advancing causal inference methods. However, extending these findings beyond the experimental cohort to achieve externally valid estimates is crucial for broader scientific inquiry. This paper delves into the forefront of addressing these external validity challenges, encapsulating the essence of a multidisciplinary workshop held at the Institute for Computational and Experimental Research in Mathematics (ICERM), Brown University, in Fall 2023. The workshop congregated experts from diverse fields including social science, medicine, public health, statistics, computer science, and education, to tackle the unique obstacles each discipline faces in extrapolating experimental findings. Our study presents three key contributions: we integrate ongoing efforts, highlighting methodological synergies across fields; provide an exhaustive review of generalizability and transportability based on the workshop's discourse; and identify persistent hurdles while suggesting avenues for future research. By doing so, this paper aims to enhance the collective understanding of the generalizability and transportability of causal effects, fostering cross-disciplinary collaboration and offering valuable insights for researchers working on refining and applying causal inference methods.  ( 2 min )
    Iterated INLA for State and Parameter Estimation in Nonlinear Dynamical Systems
    arXiv:2402.17036v1 Announce Type: cross Abstract: Data assimilation (DA) methods use priors arising from differential equations to robustly interpolate and extrapolate data. Popular techniques such as ensemble methods that handle high-dimensional, nonlinear PDE priors focus mostly on state estimation, however can have difficulty learning the parameters accurately. On the other hand, machine learning based approaches can naturally learn the state and parameters, but their applicability can be limited, or produce uncertainties that are hard to interpret. Inspired by the Integrated Nested Laplace Approximation (INLA) method in spatial statistics, we propose an alternative approach to DA based on iteratively linearising the dynamical model. This produces a Gaussian Markov random field at each iteration, enabling one to use INLA to infer the state and parameters. Our approach can be used for arbitrary nonlinear systems, while retaining interpretability, and is furthermore demonstrated to outperform existing methods on the DA task. By providing a more nuanced approach to handling nonlinear PDE priors, our methodology offers improved accuracy and robustness in predictions, especially where data sparsity is prevalent.  ( 2 min )
    REFACTOR: Learning to Extract Theorems from Proofs
    arXiv:2402.17032v1 Announce Type: cross Abstract: Human mathematicians are often good at recognizing modular and reusable theorems that make complex mathematical results within reach. In this paper, we propose a novel method called theoREm-from-prooF extrACTOR (REFACTOR) for training neural networks to mimic this ability in formal mathematical theorem proving. We show on a set of unseen proofs, REFACTOR is able to extract 19.6% of the theorems that humans would use to write the proofs. When applying the model to the existing Metamath library, REFACTOR extracted 16 new theorems. With newly extracted theorems, we show that the existing proofs in the MetaMath database can be refactored. The new theorems are used very frequently after refactoring, with an average usage of 733.5 times, and help shorten the proof lengths. Lastly, we demonstrate that the prover trained on the new-theorem refactored dataset proves more test theorems and outperforms state-of-the-art baselines by frequently leveraging a diverse set of newly extracted theorems. Code can be found at https://github.com/jinpz/refactor.  ( 2 min )
    Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset
    arXiv:2402.17013v1 Announce Type: cross Abstract: The assessment of explainability in Legal Judgement Prediction (LJP) systems is of paramount importance in building trustworthy and transparent systems, particularly considering the reliance of these systems on factors that may lack legal relevance or involve sensitive attributes. This study delves into the realm of explainability and fairness in LJP models, utilizing Swiss Judgement Prediction (SJP), the only available multilingual LJP dataset. We curate a comprehensive collection of rationales that `support' and `oppose' judgement from legal experts for 108 cases in German, French, and Italian. By employing an occlusion-based explainability approach, we evaluate the explainability performance of state-of-the-art monolingual and multilingual BERT-based LJP models, as well as models developed with techniques such as data augmentation and cross-lingual transfer, which demonstrated prediction performance improvement. Notably, our findings reveal that improved prediction performance does not necessarily correspond to enhanced explainability performance, underscoring the significance of evaluating models from an explainability perspective. Additionally, we introduce a novel evaluation framework, Lower Court Insertion (LCI), which allows us to quantify the influence of lower court information on model predictions, exposing current models' biases.  ( 2 min )
    Pandora's White-Box: Increased Training Data Leakage in Open LLMs
    arXiv:2402.17012v1 Announce Type: cross Abstract: In this paper we undertake a systematic study of privacy attacks against open source Large Language Models (LLMs), where an adversary has access to either the model weights, gradients, or losses, and tries to exploit them to learn something about the underlying training data. Our headline results are the first membership inference attacks (MIAs) against pre-trained LLMs that are able to simultaneously achieve high TPRs and low FPRs, and a pipeline showing that over $50\%$ (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM in natural settings. We consider varying degrees of access to the underlying model, customization of the language model, and resources available to the attacker. In the pre-trained setting, we propose three new white-box MIAs: an attack based on the gradient norm, a supervised neural network classifier, and a single step loss ratio attack. All outperform existing black-box baselines, and our supervised attack closes the gap between MIA attack success against LLMs and other types of models. In fine-tuning, we find that given access to the loss of the fine-tuned and base models, a fine-tuned loss ratio attack FLoRA is able to achieve near perfect MIA peformance. We then leverage these MIAs to extract fine-tuning data from fine-tuned language models. We find that the pipeline of generating from fine-tuned models prompted with a small snippet of the prefix of each training example, followed by using FLoRa to select the most likely training sample, succeeds the majority of the fine-tuning dataset after only $3$ epochs of fine-tuning. Taken together, these findings show that highly effective MIAs are available in almost all LLM training settings, and highlight that great care must be taken before LLMs are fine-tuned on highly sensitive data and then deployed.  ( 3 min )
    What Do Language Models Hear? Probing for Auditory Representations in Language Models
    arXiv:2402.16998v1 Announce Type: cross Abstract: This work explores whether language models encode meaningfully grounded representations of sounds of objects. We learn a linear probe that retrieves the correct text representation of an object given a snippet of audio related to that object, where the sound representation is given by a pretrained audio model. This probe is trained via a contrastive loss that pushes the language representations and sound representations of an object to be close to one another. After training, the probe is tested on its ability to generalize to objects that were not seen during training. Across different language models and audio models, we find that the probe generalization is above chance in many cases, indicating that despite being trained only on raw text, language models encode grounded knowledge of sounds for some objects.  ( 2 min )
    Towards Decoding Brain Activity During Passive Listening of Speech
    arXiv:2402.16996v1 Announce Type: cross Abstract: The aim of the study is to investigate the complex mechanisms of speech perception and ultimately decode the electrical changes in the brain accruing while listening to speech. We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods. The goal is to aid the advancement of brain-computer interface (BCI) technology for speech synthesis, and, hopefully, to provide an additional perspective on the cognitive processes of speech perception. This approach diverges from the conventional focus on speech production and instead chooses to investigate neural representations of perceived speech. This angle opened up a complex perspective, potentially allowing us to study more sophisticated neural patterns. Leveraging the power of deep learning models, the research aimed to establish a connection between these intricate neural activities and the corresponding speech sounds. Despite the approach not having achieved a breakthrough yet, the research sheds light on the potential of decoding neural activity during speech perception. Our current efforts can serve as a foundation, and we are optimistic about the potential of expanding and improving upon this work to move closer towards more advanced BCIs, better understanding of processes underlying perceived speech and its relation to spoken speech.  ( 2 min )
    GEM3D: GEnerative Medial Abstractions for 3D Shape Synthesis
    arXiv:2402.16994v1 Announce Type: cross Abstract: We introduce GEM3D -- a new deep, topology-aware generative model of 3D shapes. The key ingredient of our method is a neural skeleton-based representation encoding information on both shape topology and geometry. Through a denoising diffusion probabilistic model, our method first generates skeleton-based representations following the Medial Axis Transform (MAT), then generates surfaces through a skeleton-driven neural implicit formulation. The neural implicit takes into account the topological and geometric information stored in the generated skeleton representations to yield surfaces that are more topologically and geometrically accurate compared to previous neural field formulations. We discuss applications of our method in shape synthesis and point cloud reconstruction tasks, and evaluate our method both qualitatively and quantitatively. We demonstrate significantly more faithful surface reconstruction and diverse shape generation results compared to the state-of-the-art, also involving challenging scenarios of reconstructing and synthesizing structurally complex, high-genus shape surfaces from Thingi10K and ShapeNet.  ( 2 min )
    A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data
    arXiv:2402.16991v1 Announce Type: cross Abstract: Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organised in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterises the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.  ( 2 min )
    Algorithmic Arbitrariness in Content Moderation
    arXiv:2402.16979v1 Announce Type: cross Abstract: Machine learning (ML) is widely used to moderate online content. Despite its scalability relative to human moderation, the use of ML introduces unique challenges to content moderation. One such challenge is predictive multiplicity: multiple competing models for content classification may perform equally well on average, yet assign conflicting predictions to the same content. This multiplicity can result from seemingly innocuous choices during model development, such as random seed selection for parameter initialization. We experimentally demonstrate how content moderation tools can arbitrarily classify samples as toxic, leading to arbitrary restrictions on speech. We discuss these findings in terms of human rights set out by the International Covenant on Civil and Political Rights (ICCPR), namely freedom of expression, non-discrimination, and procedural justice. We analyze (i) the extent of predictive multiplicity among state-of-the-art LLMs used for detecting toxic content; (ii) the disparate impact of this arbitrariness across social groups; and (iii) how model multiplicity compares to unambiguous human classifications. Our findings indicate that the up-scaled algorithmic moderation risks legitimizing an algorithmic leviathan, where an algorithm disproportionately manages human rights. To mitigate such risks, our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications. Since algorithmic content moderation is being fueled by pressing social concerns, such as disinformation and hate speech, our discussion on harms raises concerns relevant to policy debates. Our findings also contribute to content moderation and intermediary liability laws being discussed and passed in many countries, such as the Digital Services Act in the European Union, the Online Safety Act in the United Kingdom, and the Fake News Bill in Brazil.  ( 3 min )
    inGRASS: Incremental Graph Spectral Sparsification via Low-Resistance-Diameter Decomposition
    arXiv:2402.16990v1 Announce Type: cross Abstract: This work presents inGRASS, a novel algorithm designed for incremental spectral sparsification of large undirected graphs. The proposed inGRASS algorithm is highly scalable and parallel-friendly, having a nearly-linear time complexity for the setup phase and the ability to update the spectral sparsifier in $O(\log N)$ time for each incremental change made to the original graph with $N$ nodes. A key component in the setup phase of inGRASS is a multilevel resistance embedding framework introduced for efficiently identifying spectrally-critical edges and effectively detecting redundant ones, which is achieved by decomposing the initial sparsifier into many node clusters with bounded effective-resistance diameters leveraging a low-resistance-diameter decomposition (LRD) scheme. The update phase of inGRASS exploits low-dimensional node embedding vectors for efficiently estimating the importance and uniqueness of each newly added edge. As demonstrated through extensive experiments, inGRASS achieves up to over $200 \times$ speedups while retaining comparable solution quality in incremental spectral sparsification of graphs obtained from various datasets, such as circuit simulations, finite element analysis, and social networks.  ( 2 min )
    An inexact Bregman proximal point method and its acceleration version for unbalanced optimal transport
    arXiv:2402.16978v1 Announce Type: cross Abstract: The Unbalanced Optimal Transport (UOT) problem plays increasingly important roles in computational biology, computational imaging and deep learning. Scaling algorithm is widely used to solve UOT due to its convenience and good convergence properties. However, this algorithm has lower accuracy for large regularization parameters, and due to stability issues, small regularization parameters can easily lead to numerical overflow. We address this challenge by developing an inexact Bregman proximal point method for solving UOT. This algorithm approximates the proximal operator using the Scaling algorithm at each iteration. The algorithm (1) converges to the true solution of UOT, (2) has theoretical guarantees and robust regularization parameter selection, (3) mitigates numerical stability issues, and (4) can achieve comparable computational complexity to the Scaling algorithm in specific practice. Building upon this, we develop an accelerated version of inexact Bregman proximal point method for solving UOT by using acceleration techniques of Bregman proximal point method and provide theoretical guarantees and experimental validation of convergence and acceleration.  ( 2 min )
    Disentangled 3D Scene Generation with Layout Learning
    arXiv:2402.16936v1 Announce Type: cross Abstract: We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch - each representing its own object - along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at https://dave.ml/layoutlearning/  ( 2 min )
    On the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing Problem
    arXiv:2402.16926v1 Announce Type: cross Abstract: We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that universal (adversary-unaware) backdoor detection is impossible, except for very small alphabet sizes. Thus, we argue, that backdoor detection methods need to be either explicitly, or implicitly adversary-aware. However, our work does not imply that backdoor detection cannot work in specific scenarios, as evidenced by successful backdoor detection methods in the scientific literature. Furthermore, we connect our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem.  ( 2 min )
    TrustMol: Trustworthy Inverse Molecular Design via Alignment with Molecular Dynamics
    arXiv:2402.16930v1 Announce Type: cross Abstract: Data-driven generation of molecules with desired properties, also known as inverse molecular design (IMD), has attracted significant attention in recent years. Despite the significant progress in the accuracy and diversity of solutions, existing IMD methods lag behind in terms of trustworthiness. The root issue is that the design process of these methods is increasingly more implicit and indirect, and this process is also isolated from the native forward process (NFP), the ground-truth function that models the molecular dynamics. Following this insight, we propose TrustMol, an IMD method built to be trustworthy. For this purpose, TrustMol relies on a set of technical novelties including a new variational autoencoder network. Moreover, we propose a latent-property pairs acquisition method to effectively navigate the complexities of molecular latent optimization, a process that seems intuitive yet challenging due to the high-frequency and discontinuous nature of molecule space. TrustMol also integrates uncertainty-awareness into molecular latent optimization. These lead to improvements in both explainability and reliability of the IMD process. We validate the trustworthiness of TrustMol through a wide range of experiments.  ( 2 min )
    An Adversarial Robustness Benchmark for Enterprise Network Intrusion Detection
    arXiv:2402.16912v1 Announce Type: cross Abstract: As cyber-attacks become more sophisticated, improving the robustness of Machine Learning (ML) models must be a priority for enterprises of all sizes. To reliably compare the robustness of different ML models for cyber-attack detection in enterprise computer networks, they must be evaluated in standardized conditions. This work presents a methodical adversarial robustness benchmark of multiple decision tree ensembles with constrained adversarial examples generated from standard datasets. The robustness of regularly and adversarially trained RF, XGB, LGBM, and EBM models was evaluated on the original CICIDS2017 dataset, a corrected version of it designated as NewCICIDS, and the HIKARI dataset, which contains more recent network traffic. NewCICIDS led to models with a better performance, especially XGB and EBM, but RF and LGBM were less robust against the more recent cyber-attacks of HIKARI. Overall, the robustness of the models to adversarial cyber-attack examples was improved without their generalization to regular traffic being affected, enabling a reliable detection of suspicious activity without costly increases of false alarms.  ( 2 min )
    Local stochastic computing using memristor-enabled stochastic logics
    arXiv:2402.16908v1 Announce Type: cross Abstract: Stochastic computing offers a probabilistic approach to address challenges posed by problems with uncertainty and noise in various fields, particularly machine learning. The realization of stochastic computing, however, faces the limitation of developing reliable stochastic logics. Here, we present stochastic logics development using memristors. Specifically, we integrate memristors into logic circuits to design the stochastic logics, wherein the inherent stochasticity in memristor switching is harnessed to enable stochastic number encoding and processing with well-regulated probabilities and correlations. As a practical application of the stochastic logics, we design a compact stochastic Roberts cross operator for edge detection. Remarkably, the operator demonstrates exceptional contour and texture extractions, even in the presence of 50% noise, and owning to the probabilistic nature and compact design, the operator can consume 95% less computational costs required by conventional binary computing. The results underscore the great potential of our stochastic computing approach as a lightweight local solution to machine learning challenges in autonomous driving, virtual reality, medical diagnosis, industrial automation, and beyond.  ( 2 min )
    Diffusion Posterior Proximal Sampling for Image Restoration
    arXiv:2402.16907v1 Announce Type: cross Abstract: Diffusion models have demonstrated remarkable efficacy in generating high-quality samples. Existing diffusion-based image restoration algorithms exploit pre-trained diffusion models to leverage data priors, yet they still preserve elements inherited from the unconditional generation paradigm. These strategies initiate the denoising process with pure white noise and incorporate random noise at each generative step, leading to over-smoothed results. In this paper, we introduce a refined paradigm for diffusion-based image restoration. Specifically, we opt for a sample consistent with the measurement identity at each generative step, exploiting the sampling selection as an avenue for output stability and enhancement. Besides, we start the restoration process with an initialization combined with the measurement signal, providing supplementary information to better align the generative process. Extensive experimental results and analyses validate the effectiveness of our proposed approach across diverse image restoration tasks.  ( 2 min )
    Enforcing Temporal Constraints on Generative Agent Behavior with Reactive Synthesis
    arXiv:2402.16905v1 Announce Type: cross Abstract: The surge in popularity of Large Language Models (LLMs) has opened doors for new approaches to the creation of interactive agents. However, managing the temporal behavior of such agents over the course of an interaction remains challenging. The stateful, long-term horizon and quantitative reasoning required for coherent agent behavior does not fit well into the LLM paradigm. We propose a combination of formal logic-based program synthesis and LLM content generation to create generative agents that adhere to temporal constraints. Our approach uses Temporal Stream Logic (TSL) to generate an automaton that enforces a temporal structure on an agent and leaves the details of each action for a moment in time to an LLM. By using TSL, we are able to augment the generative agent where users have a higher level of guarantees on behavior, better interpretability of the system, and more ability to build agents in a modular way. We evaluate our approach on different tasks involved in creating a coherent interactive agent specialized for various application domains. We found that over all of the tasks, our approach using TSL achieves at least 96% adherence, whereas the pure LLM-based approach demonstrates as low as 14.67% adherence.  ( 2 min )
    MIM-Reasoner: Learning with Theoretical Guarantees for Multiplex Influence Maximization
    arXiv:2402.16898v1 Announce Type: cross Abstract: Multiplex influence maximization (MIM) asks us to identify a set of seed users such as to maximize the expected number of influenced users in a multiplex network. MIM has been one of central research topics, especially in nowadays social networking landscape where users participate in multiple online social networks (OSNs) and their influences can propagate among several OSNs simultaneously. Although there exist a couple combinatorial algorithms to MIM, learning-based solutions have been desired due to its generalization ability to heterogeneous networks and their diversified propagation characteristics. In this paper, we introduce MIM-Reasoner, coupling reinforcement learning with probabilistic graphical model, which effectively captures the complex propagation process within and between layers of a given multiplex network, thereby tackling the most challenging problem in MIM. We establish a theoretical guarantee for MIM-Reasoner as well as conduct extensive analyses on both synthetic and real-world datasets to validate our MIM-Reasoner's performance.  ( 2 min )
    FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
    arXiv:2402.16901v1 Announce Type: cross Abstract: Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model {\NAME}, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed {\NAME} on eight datasets.  ( 2 min )
    On Trojan Signatures in Large Language Models of Code
    arXiv:2402.16896v1 Announce Type: cross Abstract: Trojan signatures, as described by Fields et al. (2021), are noticeable differences in the distribution of the trojaned class parameters (weights) and the non-trojaned class parameters of the trojaned model, that can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models, such as, Resnet, WideResnet, Densenet, and VGG. In this paper, we investigate such signatures in the classifier layer parameters of large language models of source code. Our results suggest that trojan signatures could not generalize to LLMs of code. We found that trojaned code models are stubborn, even when the models were poisoned under more explicit settings (finetuned with pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large-language models of code and furthermore to demonstrate that detecting trojans only from the weights in such models is a hard problem.  ( 2 min )
    Artificial Intelligence for Complex Network: Potential, Methodology and Application
    arXiv:2402.16887v1 Announce Type: cross Abstract: Complex networks pervade various real-world systems, from the natural environment to human societies. The essence of these networks is in their ability to transition and evolve from microscopic disorder-where network topology and node dynamics intertwine-to a macroscopic order characterized by certain collective behaviors. Over the past two decades, complex network science has significantly enhanced our understanding of the statistical mechanics, structures, and dynamics underlying real-world networks. Despite these advancements, there remain considerable challenges in exploring more realistic systems and enhancing practical applications. The emergence of artificial intelligence (AI) technologies, coupled with the abundance of diverse real-world network data, has heralded a new era in complex network science research. This survey aims to systematically address the potential advantages of AI in overcoming the lingering challenges of complex network research. It endeavors to summarize the pivotal research problems and provide an exhaustive review of the corresponding methodologies and applications. Through this comprehensive survey-the first of its kind on AI for complex networks-we expect to provide valuable insights that will drive further research and advancement in this interdisciplinary field.  ( 2 min )
    Using text embedding models and vector databases as text classifiers with the example of medical data
    arXiv:2402.16886v1 Announce Type: cross Abstract: The advent of Large Language Models (LLMs) is promising and has found application in numerous fields, but as it often is with the medical field, the bar is typically quite high [5]. In tandem with LLMs, vector embedding models and vector databases provide a robust way of expressing numerous modes of data that are easily digestible by typical machine learning models. Along with the ease of adding information, knowledge, and data to these vector databases, they provide a compelling reason to apply them in numerous fields where the task of retrieving information is typically done by humans. Researchers at Google have developed a clear alternative model, Med-PaLM [6] specifically designed to match a clinician's level of accuracy when it comes to medical knowledge. When training classifiers, and developing models, it is imperative to maintain factuality and reduce bias [4]. Here, we explore the use of vector databases and embedding models as a means of encoding, and classifying text with the example and application in the field of medicine. We show the robustness of these tools depends heavily on the sparsity of the data presented, and even with low amounts of data in the vector database itself, the vector database does a good job at classifying data [9]. Using various LLMs to generate the medical data, we also understand the limitations of the medical knowledge of these models and encourage further expert medical review of our testing data. By using vector databases to classify a clinician's notes on a patient presented with a certain ailment, we understand the limitations of such methods, but also the promise of their prospective use and with continued testing and experimentation, hope to explore a unique use case of vector databases and embedding models.  ( 3 min )
    Substrate Scope Contrastive Learning: Repurposing Human Bias to Learn Atomic Representations
    arXiv:2402.16882v1 Announce Type: cross Abstract: Learning molecular representation is a critical step in molecular machine learning that significantly influences modeling success, particularly in data-scarce situations. The concept of broadly pre-training neural networks has advanced fields such as computer vision, natural language processing, and protein engineering. However, similar approaches for small organic molecules have not achieved comparable success. In this work, we introduce a novel pre-training strategy, substrate scope contrastive learning, which learns atomic representations tailored to chemical reactivity. This method considers the grouping of substrates and their yields in published substrate scope tables as a measure of their similarity or dissimilarity in terms of chemical reactivity. We focus on 20,798 aryl halides in the CAS Content Collection spanning thousands of publications to learn a representation of aryl halide reactivity. We validate our pre-training approach through both intuitive visualizations and comparisons to traditional reactivity descriptors and physical organic chemistry principles. The versatility of these embeddings is further evidenced in their application to yield prediction, regioselectivity prediction, and the diverse selection of new substrates. This work not only presents a chemistry-tailored neural network pre-training strategy to learn reactivity-aligned atomic representations, but also marks a first-of-its-kind approach to benefit from the human bias in substrate scope design.  ( 2 min )
    EvoGPT-f: An Evolutionary GPT Framework for Benchmarking Formal Math Languages
    arXiv:2402.16878v1 Announce Type: cross Abstract: Formal mathematics is the discipline of translating mathematics into a programming language in which any statement can be unequivocally checked by a computer. Mathematicians and computer scientists have spent decades of painstaking formalization efforts developing languages such as Coq, HOL, and Lean. Machine learning research has converged on these formal math corpora and given rise to an assortment of methodologies to aid in interactive and automated theorem proving. However, these papers have primarily focused on one method, for one proof task, in one language. This paper introduces EvoGPT-f: a novel evolutionary framework for the first systematic quantitative analysis of the differential machine learnability of five formal math corpora (Lean 3, Lean 4, Coq, HOL 4, HOL Light) using four tokenization methods (character, word-level, Byte Pair Encoding and StarCoder tokenizer). This paper does not put to rest the question of the "best" or "easiest" language to learn. Rather, this framework and preliminary findings begin to illuminate the differential machine learnability of these languages, offering a foundation to forge more systematic quantitative and qualitative comparative research across communities.  ( 2 min )
    Large Language Model Augmented Exercise Retrieval for Personalized Language Learning
    arXiv:2402.16877v1 Announce Type: cross Abstract: We study the problem of zero-shot exercise retrieval in the context of online language learning, to give learners the ability to explicitly request personalized exercises via natural language. Using real-world data collected from language learners, we observe that vector similarity approaches poorly capture the relationship between exercise content and the language that learners use to express what they want to learn. This semantic gap between queries and content dramatically reduces the effectiveness of general-purpose retrieval models pretrained on large scale information retrieval datasets like MS MARCO. We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the learner's input, which are then used to search for relevant exercises. Our approach, which we call mHyER, overcomes three challenges: (1) lack of relevance labels for training, (2) unrestricted learner input content, and (3) low semantic similarity between input and retrieval candidates. mHyER outperforms several strong baselines on two novel benchmarks created from crowdsourced data and publicly available data.  ( 2 min )
    Improve Robustness of Eye Disease Detection by including Learnable Probabilistic Discrete Latent Variables into Machine Learning Models
    arXiv:2402.16865v1 Announce Type: cross Abstract: Ocular diseases, ranging from diabetic retinopathy to glaucoma, present a significant public health challenge due to their prevalence and potential for causing vision impairment. Early and accurate diagnosis is crucial for effective treatment and management.In recent years, deep learning models have emerged as powerful tools for analysing medical images, including ocular imaging . However, challenges persist in model interpretability and uncertainty estimation, which are critical for clinical decision-making. This study introduces a novel application of GFlowOut, leveraging the probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks, for the classification and analysis of ocular diseases using eye fundus images. We develop a robust and generalizable method that utilizes GFlowOut integrated with ResNet18 and ViT models as backbone in identifying various ocular conditions. This study employs a unique set of dropout masks - none, random, bottomup, and topdown - to enhance model performance in analyzing ocular images. Our results demonstrate that the bottomup GFlowOut mask significantly improves accuracy, outperforming the traditional dropout approach.  ( 2 min )
    Pragmatic Goal-Oriented Communications under Semantic-Effectiveness Channel Errors
    arXiv:2402.16858v1 Announce Type: cross Abstract: In forthcoming AI-assisted 6G networks, integrating semantic, pragmatic, and goal-oriented communication strategies becomes imperative. This integration will enable sensing, transmission, and processing of exclusively pertinent task data, ensuring conveyed information possesses understandable, pragmatic semantic significance, aligning with destination needs and goals. Without doubt, no communication is error free. Within this context, besides errors stemming from typical wireless communication dynamics, potential distortions between transmitter-intended and receiver-interpreted meanings can emerge due to limitations in semantic processing capabilities, as well as language and knowledge representation disparities between transmitters and receivers. The main contribution of this paper is two-fold. First, it proposes and details a novel mathematical modeling of errors stemming from language mismatches at both semantic and effectiveness levels. Second, it provides a novel algorithmic solution to counteract these types of errors which leverages optimal transport theory. Our numerical results show the potential of the proposed mechanism to compensate for language mismatches, thereby enhancing the attainability of reliable communication under noisy communication environments.  ( 2 min )
    Attention Based Molecule Generation via Hierarchical Variational Autoencoder
    arXiv:2402.16854v1 Announce Type: cross Abstract: Molecule generation is a task made very difficult by the complex ways in which we represent molecules computationally. A common technique used in molecular generative modeling is to use SMILES strings with recurrent neural networks built into variational autoencoders - but these suffer from a myriad of issues: vanishing gradients, long-range forgetting, and invalid molecules. In this work, we show that by combining recurrent neural networks with convolutional networks in a hierarchical manner, we are able to both extract autoregressive information from SMILES strings while maintaining signal and long-range dependencies. This allows for generations with very high validity rates on the order of 95% when reconstructing known molecules. We also observe an average Tanimoto similarity of .6 between test set and reconstructed molecules, which suggests our method is able to map between SMILES strings and their learned representations in a more effective way than prior works using similar methods.  ( 2 min )
    Robustly Learning Single-Index Models via Alignment Sharpness
    arXiv:2402.17756v1 Announce Type: new Abstract: We study the problem of learning Single-Index Models under the $L_2^2$ loss in the agnostic model. We give an efficient learning algorithm, achieving a constant factor approximation to the optimal loss, that succeeds under a range of distributions (including log-concave distributions) and a broad class of monotone and Lipschitz link functions. This is the first efficient constant factor approximate agnostic learner, even for Gaussian data and for any nontrivial class of link functions. Prior work for the case of unknown link function either works in the realizable setting or does not attain constant factor approximation. The main technical ingredient enabling our algorithm and analysis is a novel notion of a local error bound in optimization that we term alignment sharpness and that may be of broader interest.  ( 2 min )
    Markovletics: Methods and A Novel Application for Learning Continuous-Time Markov Chain Mixtures
    arXiv:2402.17730v1 Announce Type: new Abstract: Sequential data naturally arises from user engagement on digital platforms like social media, music streaming services, and web navigation, encapsulating evolving user preferences and behaviors through continuous information streams. A notable unresolved query in stochastic processes is learning mixtures of continuous-time Markov chains (CTMCs). While there is progress in learning mixtures of discrete-time Markov chains with recovery guarantees [GKV16,ST23,KTT2023], the continuous scenario uncovers unique unexplored challenges. The intrigue in CTMC mixtures stems from their potential to model intricate continuous-time stochastic processes prevalent in various fields including social media, finance, and biology. In this study, we introduce a novel framework for exploring CTMCs, emphasizing the influence of observed trails' length and mixture parameters on problem regimes, which demands specific algorithms. Through thorough experimentation, we examine the impact of discretizing continuous-time trails on the learnability of the continuous-time mixture, given that these processes are often observed via discrete, resource-demanding observations. Our comparative analysis with leading methods explores sample complexity and the trade-off between the number of trails and their lengths, offering crucial insights for method selection in different problem instances. We apply our algorithms on an extensive collection of Lastfm's user-generated trails spanning three years, demonstrating the capability of our algorithms to differentiate diverse user preferences. We pioneer the use of CTMC mixtures on a basketball passing dataset to unveil intricate offensive tactics of NBA teams. This underscores the pragmatic utility and versatility of our proposed framework. All results presented in this study are replicable, and we provide the implementations to facilitate reproducibility.  ( 3 min )
    When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning
    arXiv:2402.17747v1 Announce Type: new Abstract: Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.  ( 2 min )
    The SMART approach to instance-optimal online learning
    arXiv:2402.17720v1 Announce Type: new Abstract: We devise an online learning algorithm -- titled Switching via Monotone Adapted Regret Traces (SMART) -- that adapts to the data and achieves regret that is instance optimal, i.e., simultaneously competitive on every input sequence compared to the performance of the follow-the-leader (FTL) policy and the worst case guarantee of any other input policy. We show that the regret of the SMART policy on any input sequence is within a multiplicative factor $e/(e-1) \approx 1.58$ of the smaller of: 1) the regret obtained by FTL on the sequence, and 2) the upper bound on regret guaranteed by the given worst-case policy. This implies a strictly stronger guarantee than typical `best-of-both-worlds' bounds as the guarantee holds for every input sequence regardless of how it is generated. SMART is simple to implement as it begins by playing FTL and switches at most once during the time horizon to the worst-case algorithm. Our approach and results follow from an operational reduction of instance optimal online learning to competitive analysis for the ski-rental problem. We complement our competitive ratio upper bounds with a fundamental lower bound showing that over all input sequences, no algorithm can get better than a $1.43$-fraction of the minimum regret achieved by FTL and the minimax-optimal policy. We also present a modification of SMART that combines FTL with a ``small-loss" algorithm to achieve instance optimality between the regret of FTL and the small loss regret bound.  ( 3 min )
    Towards a Digital Twin Framework in Additive Manufacturing: Machine Learning and Bayesian Optimization for Time Series Process Optimization
    arXiv:2402.17718v1 Announce Type: new Abstract: Laser-directed-energy deposition (DED) offers advantages in additive manufacturing (AM) for creating intricate geometries and material grading. Yet, challenges like material inconsistency and part variability remain, mainly due to its layer-wise fabrication. A key issue is heat accumulation during DED, which affects the material microstructure and properties. While closed-loop control methods for heat management are common in DED research, few integrate real-time monitoring, physics-based modeling, and control in a unified framework. Our work presents a digital twin (DT) framework for real-time predictive control of DED process parameters to meet specific design objectives. We develop a surrogate model using Long Short-Term Memory (LSTM)-based machine learning with Bayesian Inference to predict temperatures in DED parts. This model predicts future temperature states in real time. We also introduce Bayesian Optimization (BO) for Time Series Process Optimization (BOTSPO), based on traditional BO but featuring a unique time series process profile generator with reduced dimensions. BOTSPO dynamically optimizes processes, identifying optimal laser power profiles to attain desired mechanical properties. The established process trajectory guides online optimizations, aiming to enhance performance. This paper outlines the digital twin framework's components, promoting its integration into a comprehensive system for AM.  ( 3 min )
    Understanding Neural Network Binarization with Forward and Backward Proximal Quantizers
    arXiv:2402.17710v1 Announce Type: new Abstract: In neural network binarization, BinaryConnect (BC) and its variants are considered the standard. These methods apply the sign function in their forward pass and their respective gradients are backpropagated to update the weights. However, the derivative of the sign function is zero whenever defined, which consequently freezes training. Therefore, implementations of BC (e.g., BNN) usually replace the derivative of sign in the backward computation with identity or other approximate gradient alternatives. Although such practice works well empirically, it is largely a heuristic or ''training trick.'' We aim at shedding some light on these training tricks from the optimization perspective. Building from existing theory on ProxConnect (PC, a generalization of BC), we (1) equip PC with different forward-backward quantizers and obtain ProxConnect++ (PC++) that includes existing binarization techniques as special cases; (2) derive a principled way to synthesize forward-backward quantizers with automatic theoretical guarantees; (3) illustrate our theory by proposing an enhanced binarization algorithm BNN++; (4) conduct image classification experiments on CNNs and vision transformers, and empirically verify that BNN++ generally achieves competitive results on binarizing these models.  ( 2 min )
    Federated Learning for Estimating Heterogeneous Treatment Effects
    arXiv:2402.17705v1 Announce Type: new Abstract: Machine learning methods for estimating heterogeneous treatment effects (HTE) facilitate large-scale personalized decision-making across various domains such as healthcare, policy making, education, and more. Current machine learning approaches for HTE require access to substantial amounts of data per treatment, and the high costs associated with interventions makes centrally collecting so much data for each intervention a formidable challenge. To overcome this obstacle, in this work, we propose a novel framework for collaborative learning of HTE estimators across institutions via Federated Learning. We show that even under a diversity of interventions and subject populations across clients, one can jointly learn a common feature representation, while concurrently and privately learning the specific predictive functions for outcomes under distinct interventions across institutions. Our framework and the associated algorithm are based on this insight, and leverage tabular transformers to map multiple input data to feature representations which are then used for outcome prediction via multi-task learning. We also propose a novel way of federated training of personalised transformers that can work with heterogeneous input feature spaces. Experimental results on real-world clinical trial data demonstrate the effectiveness of our method.  ( 2 min )
    Autonomous Vehicles: Evolution of Artificial Intelligence and Learning Algorithms
    arXiv:2402.17690v1 Announce Type: new Abstract: The advent of autonomous vehicles has heralded a transformative era in transportation, reshaping the landscape of mobility through cutting-edge technologies. Central to this evolu- tion is the integration of Artificial Intelligence (AI) and learning algorithms, propelling vehicles into realms of unprecedented autonomy. This paper provides a comprehensive exploration of the evolutionary trajectory of AI within autonomous vehicles, tracing the journey from foundational principles to the most recent advancements. Commencing with a current landscape overview, the paper delves into the fundamental role of AI in shaping the autonomous decision-making capabilities of vehicles. It elucidates the steps involved in the AI-powered development life cycle in vehicles, addressing ethical considerations and bias in AI-driven software development for autonomous vehicles. The study presents statis- tical insights into the usage and types of AI/learning algorithms over the years, showcasing the evolving research landscape within the automotive industry. Furthermore, the paper highlights the pivotal role of parameters in refining algorithms for both trucks and cars, facilitating vehicles to adapt, learn, and improve performance over time. It concludes by outlining different levels of autonomy, elucidating the nuanced usage of AI and learning algorithms, and automating key tasks at each level. Additionally, the document discusses the variation in software package sizes across different autonomy levels  ( 2 min )
    Gradient-based Discrete Sampling with Automatic Cyclical Scheduling
    arXiv:2402.17699v1 Announce Type: new Abstract: Discrete distributions, particularly in high-dimensional deep models, are often highly multimodal due to inherent discontinuities. While gradient-based discrete sampling has proven effective, it is susceptible to becoming trapped in local modes due to the gradient information. To tackle this challenge, we propose an automatic cyclical scheduling, designed for efficient and accurate sampling in multimodal discrete distributions. Our method contains three key components: (1) a cyclical step size schedule where large steps discover new modes and small steps exploit each mode; (2) a cyclical balancing schedule, ensuring ``balanced" proposals for given step sizes and high efficiency of the Markov chain; and (3) an automatic tuning scheme for adjusting the hyperparameters in the cyclical schedules, allowing adaptability across diverse datasets with minimal tuning. We prove the non-asymptotic convergence and inference guarantee for our method in general discrete distributions. Extensive experiments demonstrate the superiority of our method in sampling complex multimodal discrete distributions.  ( 2 min )
    QoS prediction in radio vehicular environments via prior user information
    arXiv:2402.17689v1 Announce Type: new Abstract: Reliable wireless communications play an important role in the automotive industry as it helps to enhance current use cases and enable new ones such as connected autonomous driving, platooning, cooperative maneuvering, teleoperated driving, and smart navigation. These and other use cases often rely on specific quality of service (QoS) levels for communication. Recently, the area of predictive quality of service (QoS) has received a great deal of attention as a key enabler to forecast communication quality well enough in advance. However, predicting QoS in a reliable manner is a notoriously difficult task. In this paper, we evaluate ML tree-ensemble methods to predict QoS in the range of minutes with data collected from a cellular test network. We discuss radio environment characteristics and we showcase how these can be used to improve ML performance and further support the uptake of ML in commercial networks. Specifically, we use the correlations of the measurements coming from the radio environment by including information of prior vehicles to enhance the prediction of the target vehicles. Moreover, we are extending prior art by showing how longer prediction horizons can be supported.  ( 2 min )
    Securing Reliability: A Brief Overview on Enhancing In-Context Learning for Foundation Models
    arXiv:2402.17671v1 Announce Type: new Abstract: As foundation models (FMs) continue to shape the landscape of AI, the in-context learning (ICL) paradigm thrives but also encounters issues such as toxicity, hallucination, disparity, adversarial vulnerability, and inconsistency. Ensuring the reliability and responsibility of FMs is crucial for the sustainable development of the AI ecosystem. In this concise overview, we investigate recent advancements in enhancing the reliability and trustworthiness of FMs within ICL frameworks, focusing on four key methodologies, each with its corresponding subgoals. We sincerely hope this paper can provide valuable insights for researchers and practitioners endeavoring to build safe and dependable FMs and foster a stable and consistent ICL environment, thereby unlocking their vast potential.  ( 2 min )
    TorchMD-Net 2.0: Fast Neural Network Potentials for Molecular Simulations
    arXiv:2402.17660v1 Announce Type: new Abstract: Achieving a balance between computational speed, prediction accuracy, and universal applicability in molecular simulations has been a persistent challenge. This paper presents substantial advancements in the TorchMD-Net software, a pivotal step forward in the shift from conventional force fields to neural network-based potentials. The evolution of TorchMD-Net into a more comprehensive and versatile framework is highlighted, incorporating cutting-edge architectures such as TensorNet. This transformation is achieved through a modular design approach, encouraging customized applications within the scientific community. The most notable enhancement is a significant improvement in computational efficiency, achieving a very remarkable acceleration in the computation of energy and forces for TensorNet models, with performance gains ranging from 2-fold to 10-fold over previous iterations. Other enhancements include highly optimized neighbor search algorithms that support periodic boundary conditions and the smooth integration with existing molecular dynamics frameworks. Additionally, the updated version introduces the capability to integrate physical priors, further enriching its application spectrum and utility in research. The software is available at https://github.com/torchmd/torchmd-net.  ( 2 min )
    Multi-Agent Deep Reinforcement Learning for Distributed Satellite Routing
    arXiv:2402.17666v1 Announce Type: new Abstract: This paper introduces a Multi-Agent Deep Reinforcement Learning (MA-DRL) approach for routing in Low Earth Orbit Satellite Constellations (LSatCs). Each satellite is an independent decision-making agent with a partial knowledge of the environment, and supported by feedback received from the nearby agents. Building on our previous work that introduced a Q-routing solution, the contribution of this paper is to extend it to a deep learning framework able to quickly adapt to the network and traffic changes, and based on two phases: (1) An offline exploration learning phase that relies on a global Deep Neural Network (DNN) to learn the optimal paths at each possible position and congestion level; (2) An online exploitation phase with local, on-board, pre-trained DNNs. Results show that MA-DRL efficiently learns optimal routes offline that are then loaded for an efficient distributed routing online.  ( 2 min )
    Variational Learning is Effective for Large Deep Networks
    arXiv:2402.17641v1 Announce Type: new Abstract: We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve fine-tuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence in support of effectiveness of variational learning.  ( 2 min )
    Confidence-Aware Multi-Field Model Calibration
    arXiv:2402.17655v1 Announce Type: new Abstract: Accurately predicting the probabilities of user feedback, such as clicks and conversions, is critical for ad ranking and bidding. However, there often exist unwanted mismatches between predicted probabilities and true likelihoods due to the shift of data distributions and intrinsic model biases. Calibration aims to address this issue by post-processing model predictions, and field-aware calibration can adjust model output on different feature field values to satisfy fine-grained advertising demands. Unfortunately, the observed samples corresponding to certain field values can be too limited to make confident calibrations, which may yield bias amplification and online disturbance. In this paper, we propose a confidence-aware multi-field calibration method, which adaptively adjusts the calibration intensity based on the confidence levels derived from sample statistics. It also utilizes multiple feature fields for joint model calibration with awareness of their importance to mitigate the data sparsity effect of a single field. Extensive offline and online experiments show the superiority of our method in boosting advertising performance and reducing prediction deviations.  ( 2 min )
    Learning Topological Representations with Bidirectional Graph Attention Network for Solving Job Shop Scheduling Problem
    arXiv:2402.17606v1 Announce Type: new Abstract: Existing learning-based methods for solving job shop scheduling problem (JSSP) usually use off-the-shelf GNN models tailored to undirected graphs and neglect the rich and meaningful topological structures of disjunctive graphs (DGs). This paper proposes the topology-aware bidirectional graph attention network (TBGAT), a novel GNN architecture based on the attention mechanism, to embed the DG for solving JSSP in a local search framework. Specifically, TBGAT embeds the DG from a forward and a backward view, respectively, where the messages are propagated by following the different topologies of the views and aggregated via graph attention. Then, we propose a novel operator based on the message-passing mechanism to calculate the forward and backward topological sorts of the DG, which are the features for characterizing the topological structures and exploited by our model. In addition, we theoretically and experimentally show that TBGAT has linear computational complexity to the number of jobs and machines, respectively, which strengthens the practical value of our method. Besides, extensive experiments on five synthetic datasets and seven classic benchmarks show that TBGAT achieves new SOTA results by outperforming a wide range of neural methods by a large margin.  ( 2 min )
    Advancing sleep detection by modelling weak label sets: A novel weakly supervised learning approach
    arXiv:2402.17601v1 Announce Type: new Abstract: Understanding sleep and activity patterns plays a crucial role in physical and mental health. This study introduces a novel approach for sleep detection using weakly supervised learning for scenarios where reliable ground truth labels are unavailable. The proposed method relies on a set of weak labels, derived from the predictions generated by conventional sleep detection algorithms. Introducing a novel approach, we suggest a novel generalised non-linear statistical model in which the number of weak sleep labels is modelled as outcome of a binomial distribution. The probability of sleep in the binomial distribution is linked to the outcomes of neural networks trained to detect sleep based on actigraphy. We show that maximizing the likelihood function of the model, is equivalent to minimizing the soft cross-entropy loss. Additionally, we explored the use of the Brier score as a loss function for weak labels. The efficacy of the suggested modelling framework was demonstrated using the Multi-Ethnic Study of Atherosclerosis dataset. A \gls{lstm} trained on the soft cross-entropy outperformed conventional sleep detection algorithms, other neural network architectures and loss functions in accuracy and model calibration. This research not only advances sleep detection techniques in scenarios where ground truth data is scarce but also contributes to the broader field of weakly supervised learning by introducing innovative approach in modelling sets of weak labels.  ( 3 min )
    DAGnosis: Localized Identification of Data Inconsistencies using Structures
    arXiv:2402.17599v1 Announce Type: new Abstract: Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the training set's features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, as well as (2) provides more detailed insights into why some samples are flagged.  ( 2 min )
    Implicit Regularization via Spectral Neural Networks and Non-linear Matrix Sensing
    arXiv:2402.17595v1 Announce Type: new Abstract: The phenomenon of implicit regularization has attracted interest in recent years as a fundamental aspect of the remarkable generalizing ability of neural networks. In a nutshell, it entails that gradient descent dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. However, known results attempting to theoretically explain this phenomenon focus overwhelmingly on the setting of linear neural nets, and the simplicity of the linear structure is particularly crucial to existing arguments. In this paper, we explore this problem in the context of more realistic neural networks with a general class of non-linear activation functions, and rigorously demonstrate the implicit regularization phenomenon for such networks in the setting of matrix sensing problems, together with rigorous rate guarantees that ensure exponentially fast convergence of gradient descent.In this vein, we contribute a network architecture called Spectral Neural Networks (abbrv. SNN) that is particularly suitable for matrix learning problems. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries, a potentially fruitful perspective for matrix learning. We demonstrate that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets and confirm its effectiveness in the context of matrix sensing, via both mathematical guarantees and empirical investigations. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.  ( 3 min )
    Sparse Variational Contaminated Noise Gaussian Process Regression for Forecasting Geomagnetic Perturbations
    arXiv:2402.17570v1 Announce Type: new Abstract: Gaussian Processes (GP) have become popular machine learning methods for kernel based learning on datasets with complicated covariance structures. In this paper, we present a novel extension to the GP framework using a contaminated normal likelihood function to better account for heteroscedastic variance and outlier noise. We propose a scalable inference algorithm based on the Sparse Variational Gaussian Process (SVGP) method for fitting sparse Gaussian process regression models with contaminated normal noise on large datasets. We examine an application to geomagnetic ground perturbations, where the state-of-art prediction model is based on neural networks. We show that our approach yields shorter predictions intervals for similar coverage and accuracy when compared to an artificial dense neural network baseline.  ( 2 min )
    Hyperdimensional computing: a fast, robust and interpretable paradigm for biological data
    arXiv:2402.17572v1 Announce Type: new Abstract: Advances in bioinformatics are primarily due to new algorithms for processing diverse biological data sources. While sophisticated alignment algorithms have been pivotal in analyzing biological sequences, deep learning has substantially transformed bioinformatics, addressing sequence, structure, and functional analyses. However, these methods are incredibly data-hungry, compute-intensive and hard to interpret. Hyperdimensional computing (HDC) has recently emerged as an intriguing alternative. The key idea is that random vectors of high dimensionality can represent concepts such as sequence identity or phylogeny. These vectors can then be combined using simple operators for learning, reasoning or querying by exploiting the peculiar properties of high-dimensional spaces. Our work reviews and explores the potential of HDC for bioinformatics, emphasizing its efficiency, interpretability, and adeptness in handling multimodal and structured data. HDC holds a lot of potential for various omics data searching, biosignal analysis and health applications.  ( 2 min )
    Label-Noise Robust Diffusion Models
    arXiv:2402.17517v1 Announce Type: new Abstract: Conditional diffusion models have shown remarkable performance in various generative tasks, but training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels. This noise leads to condition mismatch and quality degradation of generated data. This paper proposes Transition-aware weighted Denoising Score Matching (TDSM) for training conditional diffusion models with noisy labels, which is the first study in the line of diffusion models. The TDSM objective contains a weighted sum of score networks, incorporating instance-wise and time-dependent label transition probabilities. We introduce a transition-aware weight estimator, which leverages a time-dependent noisy-label classifier distinctively customized to the diffusion process. Through experiments across various datasets and noisy label settings, TDSM improves the quality of generated samples aligned with given conditions. Furthermore, our method improves generation performance even on prevalent benchmark datasets, which implies the potential noisy labels and their risk of generative model learning. Finally, we show the improved performance of TDSM on top of conventional noisy label corrections, which empirically proving its contribution as a part of label-noise robust generative models. Our code is available at: https://github.com/byeonghu-na/tdsm.  ( 2 min )
    QUCE: The Minimisation and Quantification of Path-Based Uncertainty for Generative Counterfactual Explanations
    arXiv:2402.17516v1 Announce Type: new Abstract: Deep Neural Networks (DNNs) stand out as one of the most prominent approaches within the Machine Learning (ML) domain. The efficacy of DNNs has surged alongside recent increases in computational capacity, allowing these approaches to scale to significant complexities for addressing predictive challenges in big data. However, as the complexity of DNN models rises, interpretability diminishes. In response to this challenge, explainable models such as Adversarial Gradient Integration (AGI) leverage path-based gradients provided by DNNs to elucidate their decisions. Yet the performance of path-based explainers can be compromised when gradients exhibit irregularities during out-of-distribution path traversal. In this context, we introduce Quantified Uncertainty Counterfactual Explanations (QUCE), a method designed to mitigate out-of-distribution traversal by minimizing path uncertainty. QUCE not only quantifies uncertainty when presenting explanations but also generates more certain counterfactual examples. We showcase the performance of the QUCE method by comparing it with competing methods for both path-based explanations and generative counterfactual examples. The code repository for the QUCE method is available at: https://github.com/jamie-duell/QUCE.  ( 2 min )
    Intensive Care as One Big Sequence Modeling Problem
    arXiv:2402.17501v1 Announce Type: new Abstract: Reinforcement Learning in Healthcare is typically concerned with narrow self-contained tasks such as sepsis prediction or anesthesia control. However, previous research has demonstrated the potential of generalist models (the prime example being Large Language Models) to outperform task-specific approaches due to their capability for implicit transfer learning. To enable training of foundation models for Healthcare as well as leverage the capabilities of state of the art Transformer architectures, we propose the paradigm of Healthcare as Sequence Modeling, in which interaction between the patient and the healthcare provider is represented as an event stream and tasks like diagnosis and treatment selection are modeled as prediction of future events in the stream. To explore this paradigm experimentally we develop MIMIC-SEQ, a sequence modeling benchmark derived by translating heterogenous clinical records from MIMIC-IV dataset into a uniform event stream format, train a baseline model and explore its capabilities.  ( 2 min )
    Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
    arXiv:2402.17457v1 Announce Type: new Abstract: Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters - such as the learning rate - exhibit transfer from small to very large models, thus reducing the cost of hyperparameter tuning. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is remarkably consistent across very different model sizes. In this work, we find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian (i.e. the sharpness) is largely independent of the width and depth of the network for a sustained period of training time. On the other hand, we show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer. But what causes these differences in the sharpness dynamics? Through a connection between the spectra of the Hessian and the NTK matrix, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK regime) of feature learning, which results in a different evolution of the NTK, and thus of the sharpness. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformers-based language models trained on WikiText  ( 3 min )
    Fraud Detection with Binding Global and Local Relational Interaction
    arXiv:2402.17472v1 Announce Type: new Abstract: Graph Neural Network has been proved to be effective for fraud detection for its capability to encode node interaction and aggregate features in a holistic view. Recently, Transformer network with great sequence encoding ability, has also outperformed other GNN-based methods in literatures. However, both GNN-based and Transformer-based networks only encode one perspective of the whole graph, while GNN encodes global features and Transformer network encodes local ones. Furthermore, previous works ignored encoding global interaction features of the heterogeneous graph with separate networks, thus leading to suboptimal performance. In this work, we present a novel framework called Relation-Aware GNN with transFormer (RAGFormer) which simultaneously embeds local and global features into a target node. The simple yet effective network applies a modified GAGA module where each transformer layer is followed by a cross-relation aggregation layer, to encode local embeddings and node interactions across different relations. Apart from the Transformer-based network, we further introduce a Relation-Aware GNN module to learn global embeddings, which is later merged into the local embeddings by an attention fusion module and a skip connection. Extensive experiments on two popular public datasets and an industrial dataset demonstrate that RAGFormer achieves the state-of-the-art performance. Substantial analysis experiments validate the effectiveness of each submodule of RAGFormer and its high efficiency in utilizing small-scale data and low hyper-parameter sensitivity.  ( 3 min )
    DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning
    arXiv:2402.17453v1 Announce Type: new Abstract: In this work, we investigate the potential of large language models (LLMs) based agents to automate data science tasks, with the goal of comprehending task requirements, then building and training the best-fit machine learning models. Despite their widespread success, existing LLM agents are hindered by generating unreasonable experiment plans within this scenario. To this end, we present DS-Agent, a novel automatic framework that harnesses LLM agent and case-based reasoning (CBR). In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle, and facilitate consistent performance improvement through the feedback mechanism. Moreover, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm to adapt past successful solutions from the development stage for direct code generation, significantly reducing the demand on foundational capabilities of LLMs. Empirically, DS-Agent with GPT-4 achieves an unprecedented 100% success rate in the development stage, while attaining 36% improvement on average one pass rate across alternative LLMs in the deployment stage. In both stages, DS-Agent achieves the best rank in performance, costing \$1.60 and \$0.13 per run with GPT-4, respectively.  ( 2 min )
    Principled Architecture-aware Scaling of Hyperparameters
    arXiv:2402.17440v1 Announce Type: new Abstract: Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize or design principles of hyperparameters, such that they can generalize to diverse unseen scenarios. However, most designs or optimization methods are agnostic to the choice of network structures, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture, which includes the network depth, width, convolutional kernel size, and connectivity patterns. By pursuing every parameter to be maximally updated with the same mean squared change in pre-activations, we can generalize our initialization and learning rates across MLPs (multi-layer perception) and CNNs (convolutional neural network) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings can be easily changed by better training networks in benchmarks with our architecture-aware learning rates and initialization.  ( 2 min )
    Reinforced In-Context Black-Box Optimization
    arXiv:2402.17423v1 Announce Type: new Abstract: Black-Box Optimization (BBO) has found successful applications in many fields of science and engineering. Recently, there has been a growing interest in meta-learning particular components of BBO algorithms to speed up optimization and get rid of tedious hand-crafted heuristics. As an extension, learning the entire algorithm from data requires the least labor from experts and can provide the most flexibility. In this paper, we propose RIBBO, a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks, leveraging the in-context learning ability of large models to extract task information and make decisions accordingly. Central to our method is to augment the optimization histories with regret-to-go tokens, which are designed to represent the performance of an algorithm based on cumulative regret of the histories. The integration of regret-to-go tokens enables RIBBO to automatically generate sequences of query points that satisfy the user-desired regret, which is verified by its universally good empirical performance on diverse problems, including BBOB functions, hyper-parameter optimization and robot control problems.  ( 2 min )
    Robustness-Congruent Adversarial Training for Secure Machine Learning Model Updates
    arXiv:2402.17390v1 Announce Type: new Abstract: Machine-learning models demand for periodic updates to improve their average accuracy, exploiting novel architectures and additional data. However, a newly-updated model may commit mistakes that the previous model did not make. Such misclassifications are referred to as negative flips, and experienced by users as a regression of performance. In this work, we show that this problem also affects robustness to adversarial examples, thereby hindering the development of secure model update practices. In particular, when updating a model to improve its adversarial robustness, some previously-ineffective adversarial examples may become misclassified, causing a regression in the perceived security of the system. We propose a novel technique, named robustness-congruent adversarial training, to address this issue. It amounts to fine-tuning a model with adversarial training, while constraining it to retain higher robustness on the adversarial examples that were correctly classified before the update. We show that our algorithm and, more generally, learning with non-regression constraints, provides a theoretically-grounded framework to train consistent estimators. Our experiments on robust models for computer vision confirm that (i) both accuracy and robustness, even if improved after model update, can be affected by negative flips, and (ii) our robustness-congruent adversarial training can mitigate the problem, outperforming competing baseline methods.  ( 2 min )
    Enhanced Bayesian Optimization via Preferential Modeling of Abstract Properties
    arXiv:2402.17343v1 Announce Type: new Abstract: Experimental (design) optimization is a key driver in designing and discovering new products and processes. Bayesian Optimization (BO) is an effective tool for optimizing expensive and black-box experimental design processes. While Bayesian optimization is a principled data-driven approach to experimental optimization, it learns everything from scratch and could greatly benefit from the expertise of its human (domain) experts who often reason about systems at different abstraction levels using physical properties that are not necessarily directly measured (or measurable). In this paper, we propose a human-AI collaborative Bayesian framework to incorporate expert preferences about unmeasured abstract properties into the surrogate modeling to further boost the performance of BO. We provide an efficient strategy that can also handle any incorrect/misleading expert bias in preferential judgments. We discuss the convergence behavior of our proposed framework. Our experimental results involving synthetic functions and real-world datasets show the superiority of our method against the baselines.  ( 2 min )
    LocalGCL: Local-aware Contrastive Learning for Graphs
    arXiv:2402.17345v1 Announce Type: new Abstract: Graph representation learning (GRL) makes considerable progress recently, which encodes graphs with topological structures into low-dimensional embeddings. Meanwhile, the time-consuming and costly process of annotating graph labels manually prompts the growth of self-supervised learning (SSL) techniques. As a dominant approach of SSL, Contrastive learning (CL) learns discriminative representations by differentiating between positive and negative samples. However, when applied to graph data, it overemphasizes global patterns while neglecting local structures. To tackle the above issue, we propose \underline{Local}-aware \underline{G}raph \underline{C}ontrastive \underline{L}earning (\textbf{\methnametrim}), a self-supervised learning framework that supplementarily captures local graph information with masking-based modeling compared with vanilla contrastive learning. Extensive experiments validate the superiority of \methname against state-of-the-art methods, demonstrating its promise as a comprehensive graph representation learner.  ( 2 min )
    Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
    arXiv:2402.17327v1 Announce Type: new Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is H\"older continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the H\"older constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable.  ( 2 min )
    An Interpretable Evaluation of Entropy-based Novelty of Generative Models
    arXiv:2402.17287v1 Announce Type: new Abstract: The massive developments of generative model frameworks and architectures require principled methods for the evaluation of a model's novelty compared to a reference dataset or baseline generative models. While the recent literature has extensively studied the evaluation of the quality, diversity, and generalizability of generative models, the assessment of a model's novelty compared to a baseline model has not been adequately studied in the machine learning community. In this work, we focus on the novelty assessment under multi-modal generative models and attempt to answer the following question: Given the samples of a generative model $\mathcal{G}$ and a reference dataset $\mathcal{S}$, how can we discover and count the modes expressed by $\mathcal{G}$ more frequently than in $\mathcal{S}$. We introduce a spectral approach to the described task and propose the Kernel-based Entropic Novelty (KEN) score to quantify the mode-based novelty of distribution $P_\mathcal{G}$ with respect to distribution $P_\mathcal{S}$. We analytically interpret the behavior of the KEN score under mixture distributions with sub-Gaussian components. Next, we develop a method based on Cholesky decomposition to compute the KEN score from observed samples. We support the KEN-based quantification of novelty by presenting several numerical results on synthetic and real image distributions. Our numerical results indicate the success of the proposed approach in detecting the novel modes and the comparison of state-of-the-art generative models.  ( 3 min )
    Curriculum Learning Meets Directed Acyclic Graph for Multimodal Emotion Recognition
    arXiv:2402.17269v1 Announce Type: new Abstract: Emotion recognition in conversation (ERC) is a crucial task in natural language processing and affective computing. This paper proposes MultiDAG+CL, a novel approach for Multimodal Emotion Recognition in Conversation (ERC) that employs Directed Acyclic Graph (DAG) to integrate textual, acoustic, and visual features within a unified framework. The model is enhanced by Curriculum Learning (CL) to address challenges related to emotional shifts and data imbalance. Curriculum learning facilitates the learning process by gradually presenting training samples in a meaningful order, thereby improving the model's performance in handling emotional variations and data imbalance. Experimental results on the IEMOCAP and MELD datasets demonstrate that the MultiDAG+CL models outperform baseline models.  ( 2 min )
    RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences
    arXiv:2402.17257v1 Announce Type: new Abstract: Preference-based Reinforcement Learning (PbRL) avoids the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL algorithms over-reliance on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method incorporates a sample selection-based discriminator to dynamically filter denoised preferences for robust training. To mitigate the accumulated error caused by incorrect selection, we propose to warm start the reward model, which additionally bridges the performance gap during transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the current state-of-the-art PbRL method. Ablation studies further demonstrate that the warm start is crucial for both robustness and feedback-efficiency in limited-feedback cases.  ( 2 min )
    Stochastic Gradient Succeeds for Bandits
    arXiv:2402.17235v1 Announce Type: new Abstract: We show that the \emph{stochastic gradient} bandit algorithm converges to a \emph{globally optimal} policy at an $O(1/t)$ rate, even with a \emph{constant} step size. Remarkably, global convergence of the stochastic gradient bandit algorithm has not been previously established, even though it is an old algorithm known to be applicable to bandits. The new result is achieved by establishing two novel technical findings: first, the noise of the stochastic updates in the gradient bandit algorithm satisfies a strong ``growth condition'' property, where the variance diminishes whenever progress becomes small, implying that additional noise control via diminishing step sizes is unnecessary; second, a form of ``weak exploration'' is automatically achieved through the stochastic gradient updates, since they prevent the action probabilities from decaying faster than $O(1/t)$, thus ensuring that every action is sampled infinitely often with probability $1$. These two findings can be used to show that the stochastic gradient update is already ``sufficient'' for bandits in the sense that exploration versus exploitation is automatically balanced in a manner that ensures almost sure convergence to a global optimum. These novel theoretical findings are further verified by experimental results.  ( 2 min )
    Does Negative Sampling Matter? A Review with Insights into its Theory and Applications
    arXiv:2402.17238v1 Announce Type: new Abstract: Negative sampling has swiftly risen to prominence as a focal point of research, with wide-ranging applications spanning machine learning, computer vision, natural language processing, data mining, and recommender systems. This growing interest raises several critical questions: Does negative sampling really matter? Is there a general framework that can incorporate all existing negative sampling methods? In what fields is it applied? Addressing these questions, we propose a general framework that leverages negative sampling. Delving into the history of negative sampling, we trace the development of negative sampling through five evolutionary paths. We dissect and categorize the strategies used to select negative sample candidates, detailing global, local, mini-batch, hop, and memory-based approaches. Our review categorizes current negative sampling methods into five types: static, hard, GAN-based, Auxiliary-based, and In-batch methods, providing a clear structure for understanding negative sampling. Beyond detailed categorization, we highlight the application of negative sampling in various areas, offering insights into its practical benefits. Finally, we briefly discuss open problems and future directions for negative sampling.  ( 2 min )
    Efficient Backpropagation with Variance-Controlled Adaptive Sampling
    arXiv:2402.17227v1 Announce Type: new Abstract: Sampling-based algorithms, which eliminate ''unimportant'' computations during forward and/or back propagation (BP), offer potential solutions to accelerate neural network training. However, since sampling introduces approximations to training, such algorithms may not consistently maintain accuracy across various tasks. In this work, we introduce a variance-controlled adaptive sampling (VCAS) method designed to accelerate BP. VCAS computes an unbiased stochastic gradient with fine-grained layerwise importance sampling in data dimension for activation gradient calculation and leverage score sampling in token dimension for weight gradient calculation. To preserve accuracy, we control the additional variance by learning the sample ratio jointly with model parameters during training. We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process. The implementation is available at https://github.com/thu-ml/VCAS .  ( 2 min )
    Hybrid Square Neural ODE Causal Modeling
    arXiv:2402.17233v1 Announce Type: new Abstract: Hybrid models combine mechanistic ODE-based dynamics with flexible and expressive neural network components. Such models have grown rapidly in popularity, especially in scientific domains where such ODE-based modeling offers important interpretability and validated causal grounding (e.g., for counterfactual reasoning). The incorporation of mechanistic models also provides inductive bias in standard blackbox modeling approaches, critical when learning from small datasets or partially observed, complex systems. Unfortunately, as hybrid models become more flexible, the causal grounding provided by the mechanistic model can quickly be lost. We address this problem by leveraging another common source of domain knowledge: ranking of treatment effects for a set of interventions, even if the precise treatment effect is unknown. We encode this information in a causal loss that we combine with the standard predictive loss to arrive at a hybrid loss that biases our learning towards causally valid hybrid models. We demonstrate our ability to achieve a win-win -- state-of-the-art predictive performance and causal validity -- in the challenging task of modeling glucose dynamics during exercise.  ( 2 min )
    Temporal Logic Specification-Conditioned Decision Transformer for Offline Safe Reinforcement Learning
    arXiv:2402.17217v1 Announce Type: new Abstract: Offline safe reinforcement learning (RL) aims to train a constraint satisfaction policy from a fixed dataset. Current state-of-the-art approaches are based on supervised learning with a conditioned policy. However, these approaches fall short in real-world applications that involve complex tasks with rich temporal and logical structures. In this paper, we propose temporal logic Specification-conditioned Decision Transformer (SDT), a novel framework that harnesses the expressive power of signal temporal logic (STL) to specify complex temporal rules that an agent should follow and the sequential modeling capability of Decision Transformer (DT). Empirical evaluations on the DSRL benchmarks demonstrate the better capacity of SDT in learning safe and high-reward policies compared with existing approaches. In addition, SDT shows good alignment with respect to different desired degrees of satisfaction of the STL specification that it is conditioned on.  ( 2 min )
    FedBRB: An Effective Solution to the Small-to-Large Scenario in Device-Heterogeneity Federated Learning
    arXiv:2402.17202v1 Announce Type: new Abstract: Recently, the success of large models has demonstrated the importance of scaling up model size. This has spurred interest in exploring collaborative training of large-scale models from federated learning perspective. Due to computational constraints, many institutions struggle to train a large-scale model locally. Thus, training a larger global model using only smaller local models has become an important scenario (i.e., the \textbf{small-to-large scenario}). Although recent device-heterogeneity federated learning approaches have started to explore this area, they face limitations in fully covering the parameter space of the global model. In this paper, we propose a method called \textbf{FedBRB} (\underline{B}lock-wise \underline{R}olling and weighted \underline{B}roadcast) based on the block concept. FedBRB can uses small local models to train all blocks of the large global model, and broadcasts the trained parameters to the entire space for faster information interaction. Experiments demonstrate FedBRB yields substantial performance gains, achieving state-of-the-art results in this scenario. Moreover, FedBRB using only minimal local models can even surpass baselines using larger local models.  ( 2 min )
    Generative Learning for Forecasting the Dynamics of Complex Systems
    arXiv:2402.17157v1 Announce Type: new Abstract: We introduce generative models for accelerating simulations of complex systems through learning and evolving their effective dynamics. In the proposed Generative Learning of Effective Dynamics (G-LED), instances of high dimensional data are down sampled to a lower dimensional manifold that is evolved through an auto-regressive attention mechanism. In turn, Bayesian diffusion models, that map this low-dimensional manifold onto its corresponding high-dimensional space, capture the statistics of the system dynamics. We demonstrate the capabilities and drawbacks of G-LED in simulations of several benchmark systems, including the Kuramoto-Sivashinsky (KS) equation, two-dimensional high Reynolds number flow over a backward-facing step, and simulations of three-dimensional turbulent channel flow. The results demonstrate that generative learning offers new frontiers for the accurate forecasting of the statistical properties of complex systems at a reduced computational cost.  ( 2 min )
    Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
    arXiv:2402.17152v1 Announce Type: new Abstract: Large-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on huge volume of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework (``Generative Recommenders''), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines over synthetic and public datasets by up to 65.8\% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4\% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces carbon footprint needed for future model developments, and further paves the way for the first foundational models in recommendations.  ( 3 min )
    Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings
    arXiv:2402.17135v1 Announce Type: new Abstract: Can we pre-train a generalist agent from a large amount of unlabeled offline trajectories such that it can be immediately adapted to any new downstream tasks in a zero-shot manner? In this work, we present a functional reward encoding (FRE) as a general, scalable solution to this zero-shot RL problem. Our main idea is to learn functional representations of any arbitrary tasks by encoding their state-reward samples using a transformer-based variational auto-encoder. This functional encoding not only enables the pre-training of an agent from a wide diversity of general unsupervised reward functions, but also provides a way to solve any new downstream tasks in a zero-shot manner, given a small number of reward-annotated samples. We empirically show that FRE agents trained on diverse random unsupervised reward functions can generalize to solve novel tasks in a range of simulated robotic benchmarks, often outperforming previous zero-shot RL and offline RL methods. Code for this project is provided at: https://github.com/kvfrans/fre  ( 2 min )
    Predicting O-GlcNAcylation Sites in Mammalian Proteins with Transformers and RNNs Trained with a New Loss Function
    arXiv:2402.17131v1 Announce Type: new Abstract: Glycosylation, a protein modification, has multiple essential functional and structural roles. O-GlcNAcylation, a subtype of glycosylation, has the potential to be an important target for therapeutics, but methods to reliably predict O-GlcNAcylation sites had not been available until 2023; a 2021 review correctly noted that published models were insufficient and failed to generalize. Moreover, many are no longer usable. In 2023, a considerably better RNN model with an F$_1$ score of 36.17% and an MCC of 34.57% on a large dataset was published. This article first sought to improve these metrics using transformer encoders. While transformers displayed high performance on this dataset, their performance was inferior to that of the previously published RNN. We then created a new loss function, which we call the weighted focal differentiable MCC, to improve the performance of classification models. RNN models trained with this new function display superior performance to models trained using the weighted cross-entropy loss; this new function can also be used to fine-tune trained models. A two-cell RNN trained with this loss achieves state-of-the-art performance in O-GlcNAcylation site prediction with an F$_1$ score of 38.82% and an MCC of 38.21% on that large dataset.  ( 2 min )
    LCEN: A Novel Feature Selection Algorithm for Nonlinear, Interpretable Machine Learning Models
    arXiv:2402.17120v1 Announce Type: new Abstract: Interpretable architectures can have advantages over black-box architectures, and interpretability is essential for the application of machine learning in critical settings, such as aviation or medicine. However, the simplest, most commonly used interpretable architectures (such as LASSO or EN) are limited to linear predictions and have poor feature selection capabilities. In this work, we introduce the LASSO-Clip-EN (LCEN) algorithm for the creation of nonlinear, interpretable machine learning models. LCEN is tested on a wide variety of artificial and empirical datasets, creating more accurate, sparser models than other commonly used architectures. These experiments reveal that LCEN is robust against many issues typically present in datasets and modeling, including noise, multicollinearity, data scarcity, and hyperparameter variance. LCEN is also able to rediscover multiple physical laws from empirical data and, for processes with no known physical laws, LCEN achieves better results than many other dense and sparse methods -- including using 10.8 times fewer features than dense methods and 8.1 times fewer features than EN on one dataset, and is comparable to an ANN on another dataset.  ( 2 min )
    Sinkhorn Distance Minimization for Knowledge Distillation
    arXiv:2402.17110v1 Announce Type: new Abstract: Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when few distribution overlap exists between the teacher and the student. In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which deteriorates logits-based KD for diverse NLP tasks. We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Besides, profit by properties of the Sinkhorn metric, we can get rid of sample-wise KD that restricts the perception of divergence in each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures.  ( 2 min )
    Adversarial Perturbations of Physical Signals
    arXiv:2402.17104v1 Announce Type: new Abstract: We investigate the vulnerability of computer-vision-based signal classifiers to adversarial perturbations of their inputs, where the signals and perturbations are subject to physical constraints. We consider a scenario in which a source and interferer emit signals that propagate as waves to a detector, which attempts to classify the source by analyzing the spectrogram of the signal it receives using a pre-trained neural network. By solving PDE-constrained optimization problems, we construct interfering signals that cause the detector to misclassify the source even though the perturbations to the spectrogram of the received signal are nearly imperceptible. Though such problems can have millions of decision variables, we introduce methods to solve them efficiently. Our experiments demonstrate that one can compute effective and physically realizable adversarial perturbations for a variety of machine learning models under various physical conditions.  ( 2 min )
    Parallelized Spatiotemporal Binding
    arXiv:2402.17077v1 Announce Type: new Abstract: While modern best practices advocate for scalable architectures that support long-range interactions, object-centric models are yet to fully embrace these architectures. In particular, existing object-centric models for handling sequential inputs, due to their reliance on RNN-based implementation, show poor stability and capacity and are slow to train on long sequences. We introduce Parallelizable Spatiotemporal Binder or PSB, the first temporally-parallelizable slot learning architecture for sequential inputs. Unlike conventional RNN-based approaches, PSB produces object-centric representations, known as slots, for all time-steps in parallel. This is achieved by refining the initial slots across all time-steps through a fixed number of layers equipped with causal attention. By capitalizing on the parallelism induced by our architecture, the proposed model exhibits a significant boost in efficiency. In experiments, we test PSB extensively as an encoder within an auto-encoding framework paired with a wide variety of decoder options. Compared to the state-of-the-art, our architecture demonstrates stable training on longer sequences, achieves parallelization that results in a 60% increase in training speed, and yields performance that is on par with or better on unsupervised 2D and 3D object-centric scene decomposition and understanding.  ( 2 min )
    One-Shot Graph Representation Learning Using Hyperdimensional Computing
    arXiv:2402.17073v1 Announce Type: new Abstract: We present a novel, simple, fast, and efficient approach for semi-supervised learning on graphs. The proposed approach takes advantage of hyper-dimensional computing which encodes data samples using random projections into a high dimensional space (HD space for short). Specifically, we propose a Hyper-dimensional Graph Learning (HDGL) algorithm that leverages the injectivity property of the node representations of a family of graph neural networks. HDGL maps node features to the HD space and then uses HD operators such as bundling and binding to aggregate information from the local neighborhood of each node. Results of experiments with widely used benchmark data sets show that HDGL achieves predictive performance that is competitive with the state-of-the-art deep learning methods, without the need for computationally expensive training.  ( 2 min )
    A Multi-Fidelity Methodology for Reduced Order Models with High-Dimensional Inputs
    arXiv:2402.17061v1 Announce Type: new Abstract: In the early stages of aerospace design, reduced order models (ROMs) are crucial for minimizing computational costs associated with using physics-rich field information in many-query scenarios requiring multiple evaluations. The intricacy of aerospace design demands the use of high-dimensional design spaces to capture detailed features and design variability accurately. However, these spaces introduce significant challenges, including the curse of dimensionality, which stems from both high-dimensional inputs and outputs necessitating substantial training data and computational effort. To address these complexities, this study introduces a novel multi-fidelity, parametric, and non-intrusive ROM framework designed for high-dimensional contexts. It integrates machine learning techniques for manifold alignment and dimension reduction employing Proper Orthogonal Decomposition (POD) and Model-based Active Subspace with multi-fidelity regression for ROM construction. Our approach is validated through two test cases: the 2D RAE~2822 airfoil and the 3D NASA CRM wing, assessing combinations of various fidelity levels, training data ratios, and sample sizes. Compared to the single-fidelity PCAS method, our multi-fidelity solution offers improved cost-accuracy benefits and achieves better predictive accuracy with reduced computational demands. Moreover, our methodology outperforms the manifold-aligned ROM (MA-ROM) method by 50% in handling scenarios with large input dimensions, underscoring its efficacy in addressing the complex challenges of aerospace design.  ( 2 min )
    A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection
    arXiv:2402.17018v1 Announce Type: new Abstract: We tested front-end enhanced neural models where a frozen classifier was prepended by a differentiable and fully convolutional model with a skip connection. By training them using a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks including APGD and FAB-T attacks from the AutoAttack package, which we attributed to gradient masking. The gradient masking phenomenon is not new, but the degree of masking was quite remarkable for fully differentiable models that did not have gradient-shattering components such as JPEG compression or components that are expected to cause diminishing gradients. Though black box attacks can be partially effective against gradient masking, they are easily defeated by combining models into randomized ensembles. We estimate that such ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet despite having virtually zero accuracy under adaptive attacks. Adversarial training of the backbone classifier can further increase resistance of the front-end enhanced model to gradient attacks. On CIFAR10, the respective randomized ensemble achieved 90.8$\pm 2.5$% (99% CI) accuracy under AutoAttack while having only 18.2$\pm 3.6$% accuracy under the adaptive attack. We do not establish SOTA in adversarial robustness. Instead, we make methodological contributions and further supports the thesis that adaptive attacks designed with the complete knowledge of model architecture are crucial in demonstrating model robustness and that even the so-called white-box gradient attacks can have limited applicability. Although gradient attacks can be complemented with black-box attack such as the SQUARE attack or the zero-order PGD, black-box attacks can be weak against randomized ensembles, e.g., when ensemble models mask gradients.  ( 3 min )
    FedReview: A Review Mechanism for Rejecting Poisoned Updates in Federated Learning
    arXiv:2402.16934v1 Announce Type: new Abstract: Federated learning has recently emerged as a decentralized approach to learn a high-performance model without access to user data. Despite its effectiveness, federated learning gives malicious users opportunities to manipulate the model by uploading poisoned model updates to the server. In this paper, we propose a review mechanism called FedReview to identify and decline the potential poisoned updates in federated learning. Under our mechanism, the server randomly assigns a subset of clients as reviewers to evaluate the model updates on their training datasets in each round. The reviewers rank the model updates based on the evaluation results and count the number of the updates with relatively low quality as the estimated number of poisoned updates. Based on review reports, the server employs a majority voting mechanism to integrate the rankings and remove the potential poisoned updates in the model aggregation process. Extensive evaluation on multiple datasets demonstrate that FedReview can assist the server to learn a well-performed global model in an adversarial environment.  ( 2 min )
    Minimize Control Inputs for Strong Structural Controllability Using Reinforcement Learning with Graph Neural Network
    arXiv:2402.16925v1 Announce Type: new Abstract: Strong structural controllability (SSC) guarantees networked system with linear-invariant dynamics controllable for all numerical realizations of parameters. Current research has established algebraic and graph-theoretic conditions of SSC for zero/nonzero or zero/nonzero/arbitrary structure. One relevant practical problem is how to fully control the system with the minimal number of input signals and identify which nodes must be imposed signals. Previous work shows that this optimization problem is NP-hard and it is difficult to find the solution. To solve this problem, we formulate the graph coloring process as a Markov decision process (MDP) according to the graph-theoretical condition of SSC for both zero/nonzero and zero/nonzero/arbitrary structure. We use Actor-critic method with Directed graph neural network which represents the color information of graph to optimize MDP. Our method is validated in a social influence network with real data and different complex network models. We find that the number of input nodes is determined by the average degree of the network and the input nodes tend to select nodes with low in-degree and avoid high-degree nodes.  ( 2 min )
    Selective Task offloading for Maximum Inference Accuracy and Energy efficient Real-Time IoT Sensing Systems
    arXiv:2402.16904v1 Announce Type: new Abstract: The recent advancements in small-size inference models facilitated AI deployment on the edge. However, the limited resource nature of edge devices poses new challenges especially for real-time applications. Deploying multiple inference models (or a single tunable model) varying in size and therefore accuracy and power consumption, in addition to an edge server inference model, can offer a dynamic system in which the allocation of inference models to inference jobs is performed according to the current resource conditions. Therefore, in this work, we tackle the problem of selectively allocating inference models to jobs or offloading them to the edge server to maximize inference accuracy under time and energy constraints. This problem is shown to be an instance of the unbounded multidimensional knapsack problem which is considered a strongly NP-hard problem. We propose a lightweight hybrid genetic algorithm (LGSTO) to solve this problem. We introduce a termination condition and neighborhood exploration techniques for faster evolution of populations. We compare LGSTO with the Naive and Dynamic programming solutions. In addition to classic genetic algorithms using different reproduction methods including NSGA-II, and finally we compare to other evolutionary methods such as Particle swarm optimization (PSO) and Ant colony optimization (ACO). Experiment results show that LGSTO performed 3 times faster than the fastest comparable schemes while producing schedules with higher average accuracy.  ( 3 min )
    PRoLoRA: Partial Rotation Empowers More Parameter-Efficient LoRA
    arXiv:2402.16902v1 Announce Type: new Abstract: With the rapid scaling of large language models (LLMs), serving numerous LoRAs concurrently has become increasingly impractical, leading to unaffordable costs and necessitating more parameter-efficient finetuning methods. In this work, we introduce Partially Rotation-enhanced Low-Rank Adaptation (PRoLoRA), an intra-layer sharing mechanism comprising four essential components: broadcast reduction, rotation enhancement, partially-sharing refinement, and rectified initialization strategy. As a superset of LoRA, PRoLoRA pertains its advantages, and effectively circumvent the drawbacks of peer parameter-sharing methods with superior model capacity, practical feasibility, and broad applicability. Empirical experiments demonstrate the remarkably higher parameter efficiency of PRoLoRA in both specific parameter budget and performance target scenarios, and its scalability to larger LLMs. Notably, with one time less trainable parameters, PRoLoRA still outperforms LoRA on multiple instruction tuning datasets. Subsequently, an ablation study is conducted to validate the necessity of individual components and highlight the superiority of PRoLoRA over three potential variants. Hopefully, the conspicuously higher parameter efficiency can establish PRoLoRA as a resource-friendly alternative to LoRA.  ( 2 min )
    Multi-Task Learning for Routing Problem with Cross-Problem Zero-Shot Generalization
    arXiv:2402.16891v1 Announce Type: new Abstract: Vehicle routing problems (VRPs), which can be found in numerous real-world applications, have been an important research topic for several decades. Recently, the neural combinatorial optimization (NCO) approach that leverages a learning-based model to solve VRPs without manual algorithm design has gained substantial attention. However, current NCO methods typically require building one model for each routing problem, which significantly hinders their practical application for real-world industry problems with diverse attributes. In this work, we make the first attempt to tackle the crucial challenge of cross-problem generalization. In particular, we formulate VRPs as different combinations of a set of shared underlying attributes and solve them simultaneously via a single model through attribute composition. In this way, our proposed model can successfully solve VRPs with unseen attribute combinations in a zero-shot generalization manner. Extensive experiments are conducted on eleven VRP variants, benchmark datasets, and industry logistic scenarios. The results show that the unified model demonstrates superior performance in the eleven VRPs, reducing the average gap to around 5% from over 20% in the existing approach and achieving a significant performance boost on benchmark datasets as well as a real-world logistics application.  ( 2 min )
    Reliable Conflictive Multi-View Learning
    arXiv:2402.16897v1 Announce Type: new Abstract: Multi-view learning aims to combine multiple features to achieve more comprehensive descriptions of data. Most previous works assume that multiple views are strictly aligned. However, real-world multi-view data may contain low-quality conflictive instances, which show conflictive information in different views. Previous methods for this problem mainly focus on eliminating the conflictive data instances by removing them or replacing conflictive views. Nevertheless, real-world applications usually require making decisions for conflictive instances rather than only eliminating them. To solve this, we point out a new Reliable Conflictive Multi-view Learning (RCML) problem, which requires the model to provide decision results and attached reliabilities for conflictive multi-view data. We develop an Evidential Conflictive Multi-view Learning (ECML) method for this problem. ECML first learns view-specific evidence, which could be termed as the amount of support to each category collected from data. Then, we can construct view-specific opinions consisting of decision results and reliability. In the multi-view fusion stage, we propose a conflictive opinion aggregation strategy and theoretically prove this strategy can exactly model the relation of multi-view common and view-specific reliabilities. Experiments performed on 6 datasets verify the effectiveness of ECML.  ( 2 min )
  • Open

    An Online Bootstrap for Time Series
    arXiv:2310.19683v2 Announce Type: replace Abstract: Resampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications. This method is based on an autoregressive sequence of increasingly dependent resampling weights. We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments.  ( 2 min )
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R
    arXiv:2103.09603v5 Announce Type: replace Abstract: The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consist of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.  ( 3 min )
    Composite Goodness-of-fit Tests with Kernels
    arXiv:2111.10275v4 Announce Type: replace Abstract: Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to development of a range of robust methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. In this paper, we propose one such method. More precisely, we propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on the maximum mean discrepancy and the kernel Stein discrepancy. They are widely applicable, including whenever the density of the parametric model is known up to normalisation constant, or if the model takes the form of a simulator. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of an unnormalised non-parametric density model, and an intractable generative model of a biological cellular network.  ( 2 min )
    Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data
    arXiv:2206.09107v2 Announce Type: replace-cross Abstract: Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of ``or''. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.  ( 3 min )
    Variational Gaussian Process Diffusion Processes
    arXiv:2306.02066v3 Announce Type: replace-cross Abstract: Diffusion processes are a class of stochastic differential equations (SDEs) providing a rich family of expressive models that arise naturally in dynamic modelling tasks. Probabilistic inference and learning under generative models with latent processes endowed with a non-linear diffusion process prior are intractable problems. We build upon work within variational inference, approximating the posterior process as a linear diffusion process, and point out pathologies in the approach. We propose an alternative parameterization of the Gaussian variational process using a site-based exponential family description. This allows us to trade a slow inference algorithm with fixed-point iterations for a fast algorithm for convex optimization akin to natural gradient descent, which also provides a better objective for learning model parameters.  ( 2 min )
    Uncertainty estimation in satellite precipitation interpolation with machine learning
    arXiv:2311.07511v2 Announce Type: replace Abstract: Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We address this gap by benchmarking six algorithms, mostly novel for this task, for quantifying predictive uncertainty in spatial interpolation. On 15 years of monthly data over the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Feature importance analysis revealed satellite precipitation (PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals) datasets) as the most informative predictor, followed by gauge elevation and distance to satellite grid points. Compared to QR, LightGBM showed improved performance with respect to the quantile scoring rule by 11.10%, followed by QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial interpolation with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in interpolating spatial data, supported with a formal evaluation framework based on scoring functions and scoring rules.  ( 3 min )
    Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks
    arXiv:2310.10556v2 Announce Type: replace-cross Abstract: A recently popular approach to solving reinforcement learning is with data from human preferences. In fact, human preference data are now used with classic reinforcement learning algorithms such as actor-critic methods, which involve evaluating an intermediate policy over a reward learned from human preference data with distribution shift, known as off-policy evaluation (OPE). Such algorithm includes (i) learning reward function from human preference dataset, and (ii) learning expected cumulative reward of a target policy. Despite the huge empirical success, existing OPE methods with preference data often lack theoretical understanding and rely heavily on heuristics. In this paper, we study the sample efficiency of OPE with human preference and establish a statistical guarantee for it. Specifically, we approach OPE by learning the value function by fitted-Q-evaluation with a deep neural network. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Under the assumption of high reward smoothness, our results \textit{almost align with the classical OPE results with observable reward data}. To the best of our knowledge, this is the first result that establishes a \textit{provably efficient} guarantee for off-policy evaluation with RLHF.  ( 3 min )
    GmGM: a Fast Multi-Axis Gaussian Graphical Model
    arXiv:2211.02920v3 Announce Type: replace Abstract: This paper introduces the Gaussian multi-Graphical Model, a model to construct sparse graph representations of matrix- and tensor-variate data. We generalize prior work in this area by simultaneously learning this representation across several tensors that share axes, which is necessary to allow the analysis of multimodal datasets such as those encountered in multi-omics. Our algorithm uses only a single eigendecomposition per axis, achieving an order of magnitude speedup over prior work in the ungeneralized case. This allows the use of our methodology on large multi-modal datasets such as single-cell multi-omics data, which was challenging with previous approaches. We validate our model on synthetic data and five real-world datasets.  ( 2 min )
    Locally Stationary Graph Processes
    arXiv:2309.01657v2 Announce Type: replace Abstract: Stationary graph process models are commonly used in the analysis and inference of data sets collected on irregular network topologies. While most of the existing methods represent graph signals with a single stationary process model that is globally valid on the entire graph, in many practical problems, the characteristics of the process may be subject to local variations in different regions of the graph. In this work, we propose a locally stationary graph process (LSGP) model that aims to extend the classical concept of local stationarity to irregular graph domains. We characterize local stationarity by expressing the overall process as the combination of a set of component processes such that the extent to which the process adheres to each component varies smoothly over the graph. We propose an algorithm for computing LSGP models from realizations of the process, and also study the approximation of LSGPs locally with WSS processes. Experiments on signal interpolation problems show that the proposed process model provides accurate signal representations competitive with the state of the art.  ( 2 min )
    Functional sufficient dimension reduction through information maximization with application to classification
    arXiv:2305.10880v3 Announce Type: replace Abstract: Considering the case where the response variable is a categorical variable and the predictor is a random function, two novel functional sufficient dimensional reduction (FSDR) methods are proposed based on mutual information and square loss mutual information. Compared to the classical FSDR methods, such as functional sliced inverse regression and functional sliced average variance estimation, the proposed methods are appealing because they are capable of estimating multiple effective dimension reduction directions in the case of a relatively small number of categories, especially for the binary response. Moreover, the proposed methods do not require the restrictive linear conditional mean assumption and the constant covariance assumption. They avoid the inverse problem of the covariance operator which is often encountered in the functional sufficient dimension reduction. The functional principal component analysis with truncation be used as a regularization mechanism. Under some mild conditions, the statistical consistency of the proposed methods is established. It is demonstrated that the two methods are competitive compared with some existing FSDR methods by simulations and real data analyses.  ( 3 min )
    Robustly Learning Single-Index Models via Alignment Sharpness
    arXiv:2402.17756v1 Announce Type: cross Abstract: We study the problem of learning Single-Index Models under the $L_2^2$ loss in the agnostic model. We give an efficient learning algorithm, achieving a constant factor approximation to the optimal loss, that succeeds under a range of distributions (including log-concave distributions) and a broad class of monotone and Lipschitz link functions. This is the first efficient constant factor approximate agnostic learner, even for Gaussian data and for any nontrivial class of link functions. Prior work for the case of unknown link function either works in the realizable setting or does not attain constant factor approximation. The main technical ingredient enabling our algorithm and analysis is a novel notion of a local error bound in optimization that we term alignment sharpness and that may be of broader interest.  ( 2 min )
    Batched Nonparametric Contextual Bandits
    arXiv:2402.17732v1 Announce Type: cross Abstract: We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose Batched Successive Elimination with Dynamic Binning (BaSEDB) that achieves optimal regret (up to logarithmic factors). In essence, BaSEDB dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. We also show the suboptimality of static binning under batch constraints, highlighting the necessity of dynamic binning. Additionally, our results suggest that a nearly constant number of policy updates can attain optimal regret in the fully online setting.  ( 2 min )
    Score-based Causal Representation Learning: Linear and General Transformations
    arXiv:2402.00849v2 Announce Type: replace-cross Abstract: This paper addresses intervention-based causal representation learning (CRL) under a general nonparametric latent causal model and an unknown transformation that maps the latent variables to the observed variables. Linear and general transformations are investigated. The paper addresses both the identifiability and achievability aspects. Identifiability refers to determining algorithm-agnostic conditions that ensure recovering the true latent causal variables and the latent causal graph underlying them. Achievability refers to the algorithmic aspects and addresses designing algorithms that achieve identifiability guarantees. By drawing novel connections between score functions (i.e., the gradients of the logarithm of density functions) and CRL, this paper designs a score-based class of algorithms that ensures both identifiability and achievability. First, the paper focuses on linear transformations and shows that one stochastic hard intervention per node suffices to guarantee identifiability. It also provides partial identifiability guarantees for soft interventions, including identifiability up to ancestors for general causal models and perfect latent graph recovery for sufficiently non-linear causal models. Secondly, it focuses on general transformations and shows that two stochastic hard interventions per node suffice for identifiability. Notably, one does not need to know which pair of interventional environments have the same node intervened.  ( 3 min )
    Automated Machine Learning: From Principles to Practices
    arXiv:1810.13306v5 Announce Type: replace-cross Abstract: Machine learning (ML) methods have been developing rapidly, but configuring and selecting proper methods to achieve a desired performance is increasingly difficult and tedious. To address this challenge, automated machine learning (AutoML) has emerged, which aims to generate satisfactory ML configurations for given tasks in a data-driven way. In this paper, we provide a comprehensive survey on this topic. We begin with the formal definition of AutoML and then introduce its principles, including the bi-level learning objective, the learning strategy, and the theoretical interpretation. Then, we summarize the AutoML practices by setting up the taxonomy of existing works based on three main factors: the search space, the search algorithm, and the evaluation strategy. Each category is also explained with the representative methods. Then, we illustrate the principles and practices with exemplary applications from configuring ML pipeline, one-shot neural architecture search, and integration with foundation models. Finally, we highlight the emerging directions of AutoML and conclude the survey.  ( 2 min )
    Effects of noise on the overparametrization of quantum neural networks
    arXiv:2302.05059v2 Announce Type: replace-cross Abstract: Overparametrization is one of the most surprising and notorious phenomena in machine learning. Recently, there have been several efforts to study if, and how, Quantum Neural Networks (QNNs) acting in the absence of hardware noise can be overparametrized. In particular, it has been proposed that a QNN can be defined as overparametrized if it has enough parameters to explore all available directions in state space. That is, if the rank of the Quantum Fisher Information Matrix (QFIM) for the QNN's output state is saturated. Here, we explore how the presence of noise affects the overparametrization phenomenon. Our results show that noise can "turn on" previously-zero eigenvalues of the QFIM. This enables the parametrized state to explore directions that were otherwise inaccessible, thus potentially turning an overparametrized QNN into an underparametrized one. For small noise levels, the QNN is quasi-overparametrized, as large eigenvalues coexists with small ones. Then, we prove that as the magnitude of noise increases all the eigenvalues of the QFIM become exponentially suppressed, indicating that the state becomes insensitive to any change in the parameters. As such, there is a pull-and-tug effect where noise can enable new directions, but also suppress the sensitivity to parameter updates. Finally, our results imply that current QNN capacity measures are ill-defined when hardware noise is present.  ( 3 min )
    Series of Hessian-Vector Products for Tractable Saddle-Free Newton Optimisation of Neural Networks
    arXiv:2310.14901v2 Announce Type: replace-cross Abstract: Despite their popularity in the field of continuous optimisation, second-order quasi-Newton methods are challenging to apply in machine learning, as the Hessian matrix is intractably large. This computational burden is exacerbated by the need to address non-convexity, for instance by modifying the Hessian's eigenvalues as in Saddle-Free Newton methods. We propose an optimisation algorithm which addresses both of these concerns - to our knowledge, the first efficiently-scalable optimisation algorithm to asymptotically use the exact inverse Hessian with absolute-value eigenvalues. Our method frames the problem as a series which principally square-roots and inverts the squared Hessian, then uses it to precondition a gradient vector, all without explicitly computing or eigendecomposing the Hessian. A truncation of this infinite series provides a new optimisation algorithm which is scalable and comparable to other first- and second-order optimisation methods in both runtime and optimisation performance. We demonstrate this in a variety of settings, including a ResNet-18 trained on CIFAR-10.  ( 2 min )
    Dataset Fairness: Achievable Fairness on Your Data With Utility Guarantees
    arXiv:2402.17106v1 Announce Type: new Abstract: In machine learning fairness, training models which minimize disparity across different sensitive groups often leads to diminished accuracy, a phenomenon known as the fairness-accuracy trade-off. The severity of this trade-off fundamentally depends on dataset characteristics such as dataset imbalances or biases. Therefore using a uniform fairness requirement across datasets remains questionable and can often lead to models with substantially low utility. To address this, we present a computationally efficient approach to approximate the fairness-accuracy trade-off curve tailored to individual datasets, backed by rigorous statistical guarantees. By utilizing the You-Only-Train-Once (YOTO) framework, our approach mitigates the computational burden of having to train multiple models when approximating the trade-off curve. Moreover, we quantify the uncertainty in our approximation by introducing confidence intervals around this curve, offering a statistically grounded perspective on the acceptable range of fairness violations for any given accuracy threshold. Our empirical evaluation spanning tabular, image and language datasets underscores that our approach provides practitioners with a principled framework for dataset-specific fairness decisions across various data modalities.  ( 2 min )
    Iterated INLA for State and Parameter Estimation in Nonlinear Dynamical Systems
    arXiv:2402.17036v1 Announce Type: new Abstract: Data assimilation (DA) methods use priors arising from differential equations to robustly interpolate and extrapolate data. Popular techniques such as ensemble methods that handle high-dimensional, nonlinear PDE priors focus mostly on state estimation, however can have difficulty learning the parameters accurately. On the other hand, machine learning based approaches can naturally learn the state and parameters, but their applicability can be limited, or produce uncertainties that are hard to interpret. Inspired by the Integrated Nested Laplace Approximation (INLA) method in spatial statistics, we propose an alternative approach to DA based on iteratively linearising the dynamical model. This produces a Gaussian Markov random field at each iteration, enabling one to use INLA to infer the state and parameters. Our approach can be used for arbitrary nonlinear systems, while retaining interpretability, and is furthermore demonstrated to outperform existing methods on the DA task. By providing a more nuanced approach to handling nonlinear PDE priors, our methodology offers improved accuracy and robustness in predictions, especially where data sparsity is prevalent.  ( 2 min )
    DAGnosis: Localized Identification of Data Inconsistencies using Structures
    arXiv:2402.17599v1 Announce Type: cross Abstract: Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the training set's features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, as well as (2) provides more detailed insights into why some samples are flagged.  ( 2 min )
    Generative Learning for Forecasting the Dynamics of Complex Systems
    arXiv:2402.17157v1 Announce Type: cross Abstract: We introduce generative models for accelerating simulations of complex systems through learning and evolving their effective dynamics. In the proposed Generative Learning of Effective Dynamics (G-LED), instances of high dimensional data are down sampled to a lower dimensional manifold that is evolved through an auto-regressive attention mechanism. In turn, Bayesian diffusion models, that map this low-dimensional manifold onto its corresponding high-dimensional space, capture the statistics of the system dynamics. We demonstrate the capabilities and drawbacks of G-LED in simulations of several benchmark systems, including the Kuramoto-Sivashinsky (KS) equation, two-dimensional high Reynolds number flow over a backward-facing step, and simulations of three-dimensional turbulent channel flow. The results demonstrate that generative learning offers new frontiers for the accurate forecasting of the statistical properties of complex systems at a reduced computational cost.  ( 2 min )
    Gradient-based Discrete Sampling with Automatic Cyclical Scheduling
    arXiv:2402.17699v1 Announce Type: cross Abstract: Discrete distributions, particularly in high-dimensional deep models, are often highly multimodal due to inherent discontinuities. While gradient-based discrete sampling has proven effective, it is susceptible to becoming trapped in local modes due to the gradient information. To tackle this challenge, we propose an automatic cyclical scheduling, designed for efficient and accurate sampling in multimodal discrete distributions. Our method contains three key components: (1) a cyclical step size schedule where large steps discover new modes and small steps exploit each mode; (2) a cyclical balancing schedule, ensuring ``balanced" proposals for given step sizes and high efficiency of the Markov chain; and (3) an automatic tuning scheme for adjusting the hyperparameters in the cyclical schedules, allowing adaptability across diverse datasets with minimal tuning. We prove the non-asymptotic convergence and inference guarantee for our method in general discrete distributions. Extensive experiments demonstrate the superiority of our method in sampling complex multimodal discrete distributions.  ( 2 min )
    Implicit Regularization via Spectral Neural Networks and Non-linear Matrix Sensing
    arXiv:2402.17595v1 Announce Type: cross Abstract: The phenomenon of implicit regularization has attracted interest in recent years as a fundamental aspect of the remarkable generalizing ability of neural networks. In a nutshell, it entails that gradient descent dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. However, known results attempting to theoretically explain this phenomenon focus overwhelmingly on the setting of linear neural nets, and the simplicity of the linear structure is particularly crucial to existing arguments. In this paper, we explore this problem in the context of more realistic neural networks with a general class of non-linear activation functions, and rigorously demonstrate the implicit regularization phenomenon for such networks in the setting of matrix sensing problems, together with rigorous rate guarantees that ensure exponentially fast convergence of gradient descent.In this vein, we contribute a network architecture called Spectral Neural Networks (abbrv. SNN) that is particularly suitable for matrix learning problems. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries, a potentially fruitful perspective for matrix learning. We demonstrate that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets and confirm its effectiveness in the context of matrix sensing, via both mathematical guarantees and empirical investigations. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.  ( 3 min )
    A Note on Bayesian Networks with Latent Root Variables
    arXiv:2402.17087v1 Announce Type: new Abstract: We characterise the likelihood function computed from a Bayesian network with latent variables as root nodes. We show that the marginal distribution over the remaining, manifest, variables also factorises as a Bayesian network, which we call empirical. A dataset of observations of the manifest variables allows us to quantify the parameters of the empirical Bayesian net. We prove that (i) the likelihood of such a dataset from the original Bayesian network is dominated by the global maximum of the likelihood from the empirical one; and that (ii) such a maximum is attained if and only if the parameters of the Bayesian network are consistent with those of the empirical model.  ( 2 min )
    Variational Learning is Effective for Large Deep Networks
    arXiv:2402.17641v1 Announce Type: cross Abstract: We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve fine-tuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence in support of effectiveness of variational learning.  ( 2 min )
    Enhanced Bayesian Optimization via Preferential Modeling of Abstract Properties
    arXiv:2402.17343v1 Announce Type: cross Abstract: Experimental (design) optimization is a key driver in designing and discovering new products and processes. Bayesian Optimization (BO) is an effective tool for optimizing expensive and black-box experimental design processes. While Bayesian optimization is a principled data-driven approach to experimental optimization, it learns everything from scratch and could greatly benefit from the expertise of its human (domain) experts who often reason about systems at different abstraction levels using physical properties that are not necessarily directly measured (or measurable). In this paper, we propose a human-AI collaborative Bayesian framework to incorporate expert preferences about unmeasured abstract properties into the surrogate modeling to further boost the performance of BO. We provide an efficient strategy that can also handle any incorrect/misleading expert bias in preferential judgments. We discuss the convergence behavior of our proposed framework. Our experimental results involving synthetic functions and real-world datasets show the superiority of our method against the baselines.  ( 2 min )
    An Interpretable Evaluation of Entropy-based Novelty of Generative Models
    arXiv:2402.17287v1 Announce Type: cross Abstract: The massive developments of generative model frameworks and architectures require principled methods for the evaluation of a model's novelty compared to a reference dataset or baseline generative models. While the recent literature has extensively studied the evaluation of the quality, diversity, and generalizability of generative models, the assessment of a model's novelty compared to a baseline model has not been adequately studied in the machine learning community. In this work, we focus on the novelty assessment under multi-modal generative models and attempt to answer the following question: Given the samples of a generative model $\mathcal{G}$ and a reference dataset $\mathcal{S}$, how can we discover and count the modes expressed by $\mathcal{G}$ more frequently than in $\mathcal{S}$. We introduce a spectral approach to the described task and propose the Kernel-based Entropic Novelty (KEN) score to quantify the mode-based novelty of distribution $P_\mathcal{G}$ with respect to distribution $P_\mathcal{S}$. We analytically interpret the behavior of the KEN score under mixture distributions with sub-Gaussian components. Next, we develop a method based on Cholesky decomposition to compute the KEN score from observed samples. We support the KEN-based quantification of novelty by presenting several numerical results on synthetic and real image distributions. Our numerical results indicate the success of the proposed approach in detecting the novel modes and the comparison of state-of-the-art generative models.  ( 3 min )
    Adversarial Perturbations of Physical Signals
    arXiv:2402.17104v1 Announce Type: cross Abstract: We investigate the vulnerability of computer-vision-based signal classifiers to adversarial perturbations of their inputs, where the signals and perturbations are subject to physical constraints. We consider a scenario in which a source and interferer emit signals that propagate as waves to a detector, which attempts to classify the source by analyzing the spectrogram of the signal it receives using a pre-trained neural network. By solving PDE-constrained optimization problems, we construct interfering signals that cause the detector to misclassify the source even though the perturbations to the spectrogram of the received signal are nearly imperceptible. Though such problems can have millions of decision variables, we introduce methods to solve them efficiently. Our experiments demonstrate that one can compute effective and physically realizable adversarial perturbations for a variety of machine learning models under various physical conditions.  ( 2 min )
    Stochastic approximation in infinite dimensions
    arXiv:2402.17258v1 Announce Type: cross Abstract: Stochastic Approximation (SA) was introduced in the early 1950's and has been an active area of research for several decades. While the initial focus was on statistical questions, it was seen to have applications to signal processing, convex optimisation. %Over the last decade, there has been a revival of interest in SA as In later years SA has found application in Reinforced Learning (RL) and led to revival of interest. While bulk of the literature is on SA for the case when the observations are from a finite dimensional Euclidian space, there has been interest in extending the same to infinite dimension. Extension to Hilbert spaces is relatively easier to do, but this is not so when we come to a Banach space - since in the case of a Banach space, even {\em law of large numbers} is not true in general. We consider some cases where approximation works in a Banach space. Our framework includes case when the Banach space $\Bb$ is $\Cb([0,1],\R^d)$, as well as $\L^1([0,1],\R^d)$, the two cases which do not even have the Radon-Nikodym property.  ( 2 min )
    On Independent Samples Along the Langevin Diffusion and the Unadjusted Langevin Algorithm
    arXiv:2402.17067v1 Announce Type: cross Abstract: We study the rate at which the initial and current random variables become independent along a Markov chain, focusing on the Langevin diffusion in continuous time and the Unadjusted Langevin Algorithm (ULA) in discrete time. We measure the dependence between random variables via their mutual information. For the Langevin diffusion, we show the mutual information converges to $0$ exponentially fast when the target is strongly log-concave, and at a polynomial rate when the target is weakly log-concave. These rates are analogous to the mixing time of the Langevin diffusion under similar assumptions. For the ULA, we show the mutual information converges to $0$ exponentially fast when the target is strongly log-concave and smooth. We prove our results by developing the mutual version of the mixing time analyses of these Markov chains. We also provide alternative proofs based on strong data processing inequalities for the Langevin diffusion and the ULA, and by showing regularity results for these processes in mutual information.  ( 2 min )
    Transfer Learning Bayesian Optimization to Design Competitor DNA Molecules for Use in Diagnostic Assays
    arXiv:2402.17704v1 Announce Type: cross Abstract: With the rise in engineered biomolecular devices, there is an increased need for tailor-made biological sequences. Often, many similar biological sequences need to be made for a specific application meaning numerous, sometimes prohibitively expensive, lab experiments are necessary for their optimization. This paper presents a transfer learning design of experiments workflow to make this development feasible. By combining a transfer learning surrogate model with Bayesian optimization, we show how the total number of experiments can be reduced by sharing information between optimization tasks. We demonstrate the reduction in the number of experiments using data from the development of DNA competitors for use in an amplification-based diagnostic assay. We use cross-validation to compare the predictive accuracy of different transfer learning models, and then compare the performance of the models for both single objective and penalized optimization tasks.  ( 2 min )
    Learning high-dimensional targets by two-parameter models and gradient flow
    arXiv:2402.17089v1 Announce Type: new Abstract: We explore the theoretical possibility of learning $d$-dimensional targets with $W$-parameter models by gradient flow (GF) when $W<d$. Our main result shows that if the targets are described by a particular $d$-dimensional probability distribution, then there exist models with as few as two parameters that can learn the targets with arbitrarily high success probability. On the other hand, we show that for $W<d$ there is necessarily a large subset of GF-non-learnable targets. In particular, the set of learnable targets is not dense in $\mathbb R^d$, and any subset of $\mathbb R^d$ homeomorphic to the $W$-dimensional sphere contains non-learnable targets. Finally, we observe that the model in our main theorem on almost guaranteed two-parameter learning is constructed using a hierarchical procedure and as a result is not expressible by a single elementary function. We show that this limitation is essential in the sense that such learnability can be ruled out for a large class of elementary functions.  ( 2 min )
    MIM-Reasoner: Learning with Theoretical Guarantees for Multiplex Influence Maximization
    arXiv:2402.16898v1 Announce Type: cross Abstract: Multiplex influence maximization (MIM) asks us to identify a set of seed users such as to maximize the expected number of influenced users in a multiplex network. MIM has been one of central research topics, especially in nowadays social networking landscape where users participate in multiple online social networks (OSNs) and their influences can propagate among several OSNs simultaneously. Although there exist a couple combinatorial algorithms to MIM, learning-based solutions have been desired due to its generalization ability to heterogeneous networks and their diversified propagation characteristics. In this paper, we introduce MIM-Reasoner, coupling reinforcement learning with probabilistic graphical model, which effectively captures the complex propagation process within and between layers of a given multiplex network, thereby tackling the most challenging problem in MIM. We establish a theoretical guarantee for MIM-Reasoner as well as conduct extensive analyses on both synthetic and real-world datasets to validate our MIM-Reasoner's performance.  ( 2 min )
    On the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing Problem
    arXiv:2402.16926v1 Announce Type: cross Abstract: We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that universal (adversary-unaware) backdoor detection is impossible, except for very small alphabet sizes. Thus, we argue, that backdoor detection methods need to be either explicitly, or implicitly adversary-aware. However, our work does not imply that backdoor detection cannot work in specific scenarios, as evidenced by successful backdoor detection methods in the scientific literature. Furthermore, we connect our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem.  ( 2 min )
    When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning
    arXiv:2402.17747v1 Announce Type: cross Abstract: Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.  ( 2 min )
    How to Measure Evidence and Its Strength: Bayes Factors or Relative Belief Ratios?
    arXiv:2301.08994v2 Announce Type: replace-cross Abstract: Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. Certainly Bayes factors are regularly employed. The question then is: which of these measures of evidence is more appropriate? It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor based on a mixture prior and, when all is considered, the relative belief ratio has better properties as a measure of evidence. It is further shown that, when a natural restriction on the mixture prior is imposed, the Bayes factor equals the relative belief ratio obtained without using the mixture prior. Even with this restriction, this still leaves open the question of how the strength of evidence is to be measured. It is argued here that the current practice of using the size of the Bayes factor to measure strength is not correct and a solution to this issue is presented. Several general criticisms of these measures of evidence are also discussed and addressed.  ( 2 min )
    NECO: NEural Collapse Based Out-of-distribution detection
    arXiv:2310.06823v3 Announce Type: replace Abstract: Detecting out-of-distribution (OOD) data is a critical challenge in machine learning due to model overconfidence, often without awareness of their epistemological limits. We hypothesize that ``neural collapse'', a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of ``neural collapse'' and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. Code is available at https://gitlab.com/drti/neco  ( 2 min )
    Latent Attention for Linear Time Transformers
    arXiv:2402.17512v1 Announce Type: cross Abstract: The time complexity of the standard attention mechanism in a transformer scales quadratically with the length of the sequence. We introduce a method to reduce this to linear scaling with time, based on defining attention via latent vectors. The method is readily usable as a drop-in replacement for the standard attention mechanism. Our "Latte Transformer" model can be implemented for both bidirectional and unidirectional tasks, with the causal version allowing a recurrent implementation which is memory and time-efficient during inference of language generation tasks. Whilst next token prediction scales linearly with the sequence length for a standard transformer, a Latte Transformer requires constant time to compute the next token. The empirical performance of our method is comparable to standard attention, yet allows scaling to context windows much larger than practical in standard attention.  ( 2 min )
    A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data
    arXiv:2402.16991v1 Announce Type: new Abstract: Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organised in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterises the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.  ( 2 min )

  • Open

    Should we be basing models off of preexisting theories of the mind?
    For example suppose that I believe that in order for me to manipulate an object in the mind I need to first train an archetype/schema of that object, as a basis of a mental model, before any manipulations can take place. Should I make an AI that specifically compresses/trains to a schema model and then synthesizes interactions from the schema or do I assume that results of sucessful manipulation from more vague transformations means that the model was embodied? submitted by /u/downfall254 [link] [comments]
    Where to start?
    This had probably been asked many times before, but I’m wondering where to start when learning about AI. I’m mostly looking for books, since I’ve tried watching videos about it but don’t know where to start there either, and I usually find books more comprehensible. Not looking to earn any quick money, just want to learn. Where do I start? submitted by /u/EmoioN [link] [comments]
    When will we get an AI version of one of these?
    The one thing I really cannot wait for, is a simple handheld device that I can click a button and talk into an AI will do the rest. It'll be great for note-taking brainstorming ideating and research of all kinds. With apps like chat getting memory soon the race will be on to get as much data in as possible as the benefits are going to be massive. Does anyone know anyone working on a simple handheld pushed to talk into an AI device that works in anywhere like this? submitted by /u/zascar [link] [comments]
    AGI this Decade: Demis Hassabis on the Dwarkesh Patel Podcast
    submitted by /u/holy_moley_ravioli_ [link] [comments]
    Is using Ai on work in college cheating?
    I have a classmate who’ve I’ve spotted many times using Ai generated sentences/art during class work, recently I spotted him using Ai art for a class project, I asked him is that real or Ai generated and he replied made it real submitted by /u/Moemilitaryfan666 [link] [comments]
    Outfit Anyone made by HumanAIGC Alibaba Research Group
    submitted by /u/poopsmith38 [link] [comments]
    When everything online is AI generated...
    Does there come a point where we all head back offline to newspapers and books and local art shows? I already don't trust anything I see or read here or on Twitter or anywhere else. submitted by /u/theferalturtle [link] [comments]
    Does anyone know of an AI story where a mistake in the input severely affected the expected output?
    Long story short I have to write a speech and I want to use the overarching theme of “life is what you put into it.” I was hoping to use an AI story where someone made a mistake in inputs which had an almost catastrophic, or at the very minimum, a serious deviation in the expected output due to that mistake. Either a personal story or link to one is fine. Thanks in advance for any help you can provide. submitted by /u/RanchWilder11 [link] [comments]
    OpenAI says New York Times 'hacked' ChatGPT to build copyright lawsuit
    submitted by /u/Cbo305 [link] [comments]
    🤖 AI Replaces 700 Jobs at Klarna
    700 jobs shifted to AI, altering employment landscapes. Enhanced service speed reflects AI's transformative power. $40 million profit projected from AI's operational efficiency. 35-language support showcases AI's global utility. Sebastian Siemiatkowski voices concerns on AI's societal effects. Experts debate AI's role in reshaping job descriptions and security. Read the article: https://www.newsweek.com/klarna-artificial-intelligence-tool-takes-700-jobs-1874002 submitted by /u/clonefitreal [link] [comments]
    AI is banned from Federal Judge Brantley Starr's courtroom
    From his official page on the Northern District of Texas Federal Court website (emphasis added): All attorneys and pro se litigants appearing before the Court must, together with their notice of appearance, file on the docket a certificate attesting either that no portion of any filing will be drafted by generative artificial intelligence (such as ChatGPT, HarveyAI, or Google Bard)...These platforms are incredibly powerful and have many uses in the law: form divorces, discovery requests, suggested errors in documents, anticipated questions at oral argument. But legal briefing is not one of them. Here’s why. These platforms in their current states are prone to hallucinations and bias. On hallucinations, they make stuff up—even quotes and citations. Another issue is reliability or bias. While attorneys swear an oath to set aside their personal prejudices, biases, and beliefs to faithfully uphold the law and represent their clients, generative artificial intelligence is the product of programming devised by humans who did not have to swear such an oath. As such, these systems hold no allegiance to any client, the rule of law, or the laws and Constitution of the United States (or, as addressed above, the truth). Unbound by any sense of duty, honor, or justice, such programs act according to computer code rather than conviction, based on programming rather than principle. Source: Judge Brantley Starr | Northern District of Texas | United States District Court (uscourts.gov) EDIT: Changed description of website from "government" to "Federal Court". Removed hyperlink for HarveyAI. submitted by /u/UlisKromwell [link] [comments]
    One-Minute Daily AI News 2/27/2024
    Apple Abandoning Electric Car Quest And Focusing On AI.[1] A Beverly Hills middle school is investigating students sharing AI-made nude photos of classmates.[2] Nvidia threatens to reshape telecom with AI-plus-RAN plan.[3] Former AMD GPU head accuses Nvidia of being a ‘GPU cartel’ in response to reports of retaliatory shipment delays.[4] Sources: [1] https://www.forbes.com/sites/mollybohannon/2024/02/27/apple-abandoning-electric-car-quest-and-focusing-on-ai-report-says/?sh=15b9ba0f7ac0 [2] https://www.nbcnews.com/tech/misinformation/beverly-vista-hills-middle-school-ai-images-deepfakes-rcna140775 [3] https://www.lightreading.com/ai-machine-learning/nvidia-threatens-to-reshape-telecom-with-ai-plus-ran-plan [4] https://www.tomshardware.com/tech-industry/artificial-intelligence/former-amd-gpu-head-accuses-nvidia-of-being-a-gpu-cartel-in-response-to-reports-of-retaliatory-shipment-delays submitted by /u/Excellent-Target-847 [link] [comments]
    Sort of old news but I think this was missed. Introducing Smell2Text: a multimodal framework for Neural Networks to identify smells
    submitted by /u/holy_moley_ravioli_ [link] [comments]
    Crazy research out of Alibaba group
    https://humanaigc.github.io/emote-portrait-alive/ submitted by /u/drgoldenpants [link] [comments]
    Apple cancels decade-long work on electric car and will shift staff to focus on generative AI
    submitted by /u/Civil_Collection7267 [link] [comments]
    I made Bluetooth speaker voicelines but as if the speaker was high using chat gpt and ai tts
    I don't remember the name of the ai tts, anyway anybody can use the voicelines if they want to :) submitted by /u/tritrou2021yt [link] [comments]
    I Spent Over 3,000 Hours Developing an AI-Powered Finance Assistant. The Good, The Bad, and The Ugly
    The Ugly I would consider NexusTrade my life’s work. It started off as a passion project primarily designed for myself. I called it “The Money Printer” because the idea was that by using an automated platform, I would “print money”. I remember creating a thread on Reddit’s algotrading subreddit. And the people LOVED it. ​ The first iteration of the AlgoTrading System: “The Money Printer” The platform was clunky, ugly, and rigid. It wasn’t extensible, and it was literally impossible to express the vast majority of ideas. But the people of Reddit loved it, and convinced me it was the coolest thing since sliced bread. ​ Creating a new strategy with “The Money Printer” The love and enthusiasm I got from those comments ignited a fire in me. From that point on, I was obsessed. I decided to pivot and create a platform — not just for myself, but for others to use as well. The Bad I spent the next year or so making significant changes to the UI. I did this by myself, and I am no designer. I wouldn’t even consider myself a frontend engineer. But I was driven, and passionate about the problem I was solving. And I did the best I could to make a professional-looking application. ​ Creating a new strategy with NextTrade The UI was a lot better than the first version of the platform. And it was more configurable. It even had some revolutionary features, like the genetic optimization engine, that optimized trading strategies using genetic algorithms. ​ Read the rest of the article (for free) here! I’m curious to get y’all’s opinion on this project! As you can imagine, this has been a lot of work. But I’m passionate about and it’s super fun and has been a great learning experience! I’m also happy to talk in depth about the technical details or anything else about the platform 😃 submitted by /u/Starks-Technology [link] [comments]
    What's the current state of AI Voice Replacement?
    What I've seen so far makes me wonder if I'm missing something. It seems like replacing vocals in music is achievable with pretty decent results. You can also produce fairly convincing text to voice, that at least sounds like the persons voice. but what about replacing spoken audio with someone else's voice? Are there tools capable of this? I don't understand the technology well enough to speak on it, but it vaguely makes sense to my why recreating singing might be easier than, say, a speech. Anyone have any insight? ​ submitted by /u/shanefolke [link] [comments]
  • Open

    image to video to image, or itvti lol [P] [D]
    Can somebody with a better setup than me train an image to image AI that makes a video between the two. The inputs can be the start and last frame of the video, and the outputs would be the full video. The eval metric can be perplexity or accuracy, and I would recommend video tokenization. IDK if tf keras or pytorch would be better. (I like keras but whatever). I think padding to 15 seconds or less would be fine. I would find a dataset on huggingface, but the site is down. Obviously this is just an idea and nobody has to actually do it, but I can't at the moment. I just thought it was a cool idea submitted by /u/asoulsghost [link] [comments]
    [Discussion] Spoonfed Machine Learning and Statistics
    Hi, I'm not sure if this is the right subreddit, but I thought some members of this community might have faced similar challenges. I'm tasked with writing a technical document that covers statistical analysis, modeling, and prediction for a dataset. However, I'm collaborating with individuals who are not familiar with statistics and machine learning. They've dismissed some industry-standard terms as incorrect and demand highly simplified explanations. I understand that we all have varying expertise, but simplifying the content to the extent they require would be akin to writing a textbook, not a report. How would you navigate this situation? Typically, if I encounter unfamiliar concepts, I research them online, but I'm unsure how common this approach is. submitted by /u/Realistic-Bed2658 [link] [comments]
    How to Leverage Hugging Face’s Pre-Trained Models and Spaces for Your AI Projects [P]
    How to Leverage Hugging Face’s Pre-Trained Models and Spaces for Your AI Projects ​ My article about how to use hf https://medium.com/stackademic/how-to-leverage-hugging-faces-pre-trained-models-and-spaces-for-your-ai-projects-9046687cef3f submitted by /u/Snoo23237 [link] [comments]
    [D] Opensource SORA
    What do you think about it? is stability and others cooking something for us? submitted by /u/MemoryLeak6552 [link] [comments]
    [R] Online course recommendation
    Hello guys, as title mentions I need your advices for advanced machine learning courses or platforms. I am working with mostly Generative NLP models and data engineering with Spark. Also an advanced python course appreciated. submitted by /u/Lathanderrr [link] [comments]
    [R] EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
    Researchers have struggled to make AI-generated talking head videos that capture the nuance of human facial expressions and speech. Methods usually fail to replicate the fluidity and synchronization of real human mouths and faces. A new paper from Alibaba proposes EMO, an AI system that achieves unprecedented realism in synthesized talking head videos using a novel diffusion model approach. EMO generates videos directly from audio clips and portrait images, without 3D graphics or animation: Audio encoder analyzes tone, rhythm to generate motions Reference encoder preserves visual identity throughout video Temporal modules enable smooth frame transitions Facial mask focuses details on core face regions like mouth, eyes, etc Speed control layers stabilize pace of head movements Trained on a dataset of over 250 hours of talking head videos spanning 150 million frames, EMO learns the intricacies of human speech like enunciation, accents, and emotional affect. Quantitative evaluations show EMO substantially improves over previous state-of-the-art methods like Wav2Lip and DreamTalk on metrics including: Fréchet Inception Distance: individual frame quality Expression modeling: vividness of facial animations Lip sync: audio-visual alignment in mouth shapes Fréchet Video Distance: consistency of identity and expressions While limitations remain like slower generation and artifacts, EMO represents a major advance in replicating human facial dynamics directly from audio. As models scale up, AI-generated talking heads will become increasingly expressive and realistic. Paper here (github). Full summary here. submitted by /u/Successful-Western27 [link] [comments]
    [D] Pretraining DINOv2 model on custom data for instance retrieval.
    I want to do the pretraining of the DINOv2 model on a custom dataset for instance retrieval. I have one or two GPUs that I can use. I tried a couple of things but was not able to run because of the memory limitations. Can anyone please point out any guide/thread anywhere that might be useful? submitted by /u/connectwithprakash [link] [comments]
    [D] Can I use machine learning model as a black box function in a bayesian optimization?
    As far as what I research (disclaimer I am not a data scientist nor professional in this field), bayesian optimization can give you best combination of inputs to maximize or minimize a black box function right? suppose I have f(x,y,z), then using bayesian optimization can give me values for x,y and z for maximum or minimum output of function. Question: Since real world scenario doesn't realy gives you a black_box function in mathematical equation, will it be possible that instead of a function can I use a machine learning model for this? Suppose I have ML model that requires mutiple inputs (x,y,z) and it outputs O, I am thinking that instead of using a function, I'll simply use the ML model in the bayesian optimization? Will this be possible? What is the pros and cons for this? Is it achievable or not? I might need your suggestions about this. submitted by /u/MaterialDiver8105 [link] [comments]
    [D] How to process high amount of datapoint per instance?
    Right now I am enrolled in my first machine learning course as a medical science student. As a project topic we are classifying 10 seconds long ECG measurements. Since it is my first time getting into ML, I have some issues on how to preprocess data. So basically my data is already labled and is just a x,y measurement. This means I have 5000 datapoints per instance/ecg measurement. My question is what is the best way to handle this data that consists of a lot of datapoints. Should I do feature reduction by a PCA to make the 5000 datapoints into for example 10 features? (I dont think this is the best way since it could lose important information) Or is there a way to put all these datapoints into a model? I would really appreciate advice or help. I am new to this so some clarification would be helpful! submitted by /u/Acceptable_Smoke_235 [link] [comments]
    [R] Survey on Artificial Intelligence in Urban Planning
    I'm conducting a research study for my dissertation on the integration of Artificial Intelligence (AI) in urban planning, and I need your help! What's the survey about? This survey gathers opinions on AI's role in urban planning. Whether you're a student, an urban planner, or simply interested in the topic, your input is invaluable! How can you participate? Click the link below to answer a few questions. Your responses are anonymous and will shape our understanding of AI's impact on urban planning. Further Insights Welcome! If you have any additional thoughts or comments about AI and urban planning, especially in transport planning, feel free to share them under this post. Your insights will provide further depth to our research! Survey Link: https://forms.gle/ADeHC3CS4H7tv2mg6 Drop your survey link in the comments, and I'll make sure to complete yours if you complete mine! Thank you! submitted by /u/SLabyrinth1 [link] [comments]
    [D] Are There Companies or Groups Open to Software Professionals Contributing to AI Research
    I'm currently working in a software role, but I'm eager to dive into AI research as well. Are there any companies or groups out there that encourage professionals to balance their day jobs with contributing to AI research projects(open source or anything). submitted by /u/hackbear [link] [comments]
    [R] The SpeechBrain toolkit version 1.0 is out!
    Dear Reddit community, Three years ago, we announced the Beta release of SpeechBrain in this subreddit. Today, we are thrilled to announce the official release of SpeechBrain 1.0. This milestone comes with numerous enhancements and advancements from the previous versions. This is a real community effort (200k downloads on PyPi per month), and we are really happy and proud to lead it. You can explore a comprehensive summary of our improvements here: SpeechBrain 1.0 Summary. Among the changes, we have made significant improvements in speech recognition, enhancing search functionalities through integration with K2 for finite-state transducers, CTC decoding, n-gram rescoring, and LLM integration. Additionally, we have introduced novel models such as Streamable Conformer Transducers, Branchformers, and Hyper-conformer, among others, to improve performance and speed. You can now also easily use large-language models (LLMs) and fine-tune them with our data, or simply employ them for rescoring ASR hypotheses. Furthermore, SpeechBrain now supports a broader spectrum of tasks for speech, audio, text, and EEG processing. We improved our integration with HF models to make importing any model from HF easier. We implemented modern techniques and models, including continual learning, diffusion models, hyper-networks, Bayesian ASR, and more. We have created a new benchmark repository featuring useful benchmarks for self-supervised learning (MP3S), EEG processing (SpeechBrain-MOABB), and continual learning of new languages (CL-MASR). Many more novelties can be found on the provided Colab. Stay tuned for the future because we have big plans ahead. Of course, huge thanks to our generous sponsors HuggingFace, OVHCloud, and ViaDialog, and our partners at Concordia, Avignon, Mila, Cambridge University, and Samsung, as well as all our amazing contributors! Regards, The SpeechBrain Core Team submitted by /u/TParcollet [link] [comments]
    [D] Question about CNN blocks used for building modern CV algorithms
    Hello everyone, Im looking to better understand the use of convolutional blocks such as YOLOv8's bottleneck block, SPPF, and C2f blocks, how are they constructed? and why are they built that way?, and what led the creators choose to use this order of layers? etc. I think I have a solid understanding of what each CNN layer does (Conv, Pooling, FC) but the concept of these blocks alludes to me. any explanations or resources are appreciated. submitted by /u/TheWingedCucumber [link] [comments]
    [D] : Building NLP Research Profile for PhD Applications (No Prior Publications)
    I recently applied to CMU's LTI PhD program in NLP but wasn't accepted. I understand my shortcomings: Limited research experience: CS background with Masters, 5 years industry experience (software engineer, data analyst). Basic NLP projects: Master's coursework (Advanced ML, NLP) and basic projects (sentiment analysis, music genre classification). No research publications: I acknowledge the importance of publications in top programs. Given the approaching application cycle (applications start September 2024), I'm eager to strengthen my profile. Since top conferences have passed submission deadlines (not that i am gearing up to publish my work lol), how can I best utilize this time to become a competitive applicant for top-10 NLP PhD programs? I'm looking for advice on building my research profile, including: Independent research opportunities: Are there alternative avenues for research experience besides conference publications? Skill development: What specific NLP skills or knowledge should I focus on acquiring? Connecting with researchers: How can I connect with NLP researchers to potentially gain research experience or guidance? Any insights or suggestions from the community would be greatly appreciated! submitted by /u/Puzzleheaded_Big_242 [link] [comments]
    Shape Suffixes — Good Coding Style by Noam Shazeer [D]
    https://medium.com/@NoamShazeer/shape-suffixes-good-coding-style-f836e72e24fd submitted by /u/convexstrictly [link] [comments]
    [D] Role of Interoperability in End-to-End Data Governance: As Implemented by Data Developer Platforms
    Here are some key takeaways from the post: 💠 𝐈𝐧𝐭𝐞𝐫𝐨𝐩𝐞𝐫𝐚𝐛𝐢𝐥𝐢𝐭𝐲 𝐅𝐮𝐧𝐝𝐚𝐦𝐞𝐧𝐭𝐚𝐥𝐬: Interoperability allows different entities within a data stack to communicate effectively. Data developer platforms serve as a unifying centerpiece, facilitating interoperability by integrating with existing infrastructure. 💠 𝐈𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐢𝐧𝐠 𝐈𝐧𝐭𝐞𝐫𝐨𝐩𝐞𝐫𝐚𝐛𝐢𝐥𝐢𝐭𝐲: DDP enable interoperability both inside-out and outside-in. Inside-out interoperability triggers events in external entities, while outside-in interoperability allows the platform to ingest information and act on event triggers from outside sources. 💠 𝐈𝐧𝐭𝐞𝐫𝐨𝐩𝐞𝐫𝐚𝐛𝐢𝐥𝐢𝐭𝐲 𝐢𝐧 𝐆𝐨𝐯𝐞𝐫𝐧𝐚𝐧𝐜𝐞: Governance models are implemented through Policy Decision Points (PDPs) and Policy Execution Points (PEPs). It's crucial to have a single PDP for the entire ecosystem to avoid conflicting policies. Share your thoughts! 𝐑𝐞𝐚𝐝 𝐭𝐡𝐞 𝐞𝐧𝐭𝐢𝐫𝐞 𝐛𝐥𝐨𝐠 𝐡𝐞𝐫𝐞: https://moderndata101.substack.com/p/role-of-interoperability-in-end-to submitted by /u/growth_man [link] [comments]
    [D] What does a production level RAG Application really consist of
    I've done my research on RAG, implemented a few naive RAG tutorials and I'm aware of the advanced RAG techniques I now want to implement. But I don't really know what that proper application workflow looks like. Like what makes the difference between implementing those 7 lines of code in a naive RAG tutorial and then adding stuff like hybrid search or reranking, and actually building a production level RAG application? What does that 'workflow' look like? submitted by /u/Aggravating-Floor-38 [link] [comments]
    [R] Higher Layers Need More LoRA Experts
    Paper: https://arxiv.org/abs/2402.08562 Code: https://github.com/GCYZSL/MoLA Abstract: Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this statement also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA) for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that allocating more LoRA experts to higher layers further enhances the effectiveness of models with a certain number of experts in total. With much fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at this https URL. submitted by /u/SunsetOneSix [link] [comments]
    [R] GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
    Paper: https://arxiv.org/abs/2402.10963 Abstract: State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or V⋆. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled. submitted by /u/SunsetOneSix [link] [comments]
    [R] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    https://arxiv.org/abs/2402.17764 Abstract Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs. submitted by /u/Civil_Collection7267 [link] [comments]
    [D] Trainining a model in production
    I'm a master's student and have been tasked to do my first semi-big project. I'm facing two main issues and would like to know how are they solved in a real world scenario. The first is, we only saw jupyter notebooks, hower as the project grows they become more and more unusable, they don't work well with scripts, since each new modification requires a kernel restart (I know there is an extension but i don't find it reliable enough). Another problem is the low computational power at my disposal. My idea was prepare the model and send its definition (a .py script) to a cloud provider that does the training for me, like aws(?). Is this correct? Am I missing something ? submitted by /u/Puddino [link] [comments]
    [Discussion] Dynamical Models
    Hey friends, I wrote a blog post about dynamical models -- some ideas on promising research directions towards AGI. The hypothesis is that future research should focus on training systems with stable, long-term, evolving, and continuously optimizing dynamics, using specialized architectures and training methods at large scale. Hope someone will find them interesting. Part 1: https://medium.com/@tommyx058/cdda161fa7f9 Part 2: https://medium.com/@tommyx058/e226eee07627 submitted by /u/TommyX12 [link] [comments]
    [D] Model creation help
    Hi I'm a noob to this field. I'm currently learning all about ML, Deep Learning and all and I need some guidance. Can someone tell me if I want to create or train a model, do I need to get a set of good hardware for the computational power required ? Like the TPU and Nvidia 4070 graphics etc ? Or is there any other way ? Also which is more preferable? submitted by /u/the_engineerguy [link] [comments]
  • Open

    3 Questions: Shaping the future of work in an age of AI
    Daron Acemoglu, David Autor, and Simon Johnson, faculty co-directors of the new MIT Shaping the Future of Work Initiative, describe why the work matters and what they hope to achieve.  ( 5 min )
  • Open

    Music of the spheres
    The idea of “music of the spheres” dates back to the Pythagoreans. They saw an analogy between orbital frequency ratios and musical frequency ratios. HD 110067 is a star 105 light years away that has six known planets in orbital resonance. The orbital frequencies of the planets are related to each other by small integer […] Music of the spheres first appeared on John D. Cook.  ( 5 min )
    The Real Book
    I listened to the 99% Invisible podcast about The Real Book this morning and thought back to my first copy. My first year in college I had a jazz class, and I needed to get a copy of The Real Book, a book of sheet music for jazz standards. The book that was illegal at […] The Real Book first appeared on John D. Cook.  ( 5 min )
  • Open

    What Is Sovereign AI?
    Nations have long invested in domestic infrastructure to advance their economies, control their own data and take advantage of technology opportunities in areas such as transportation, communications, commerce, entertainment and healthcare. AI, the most important technology of our time, is turbocharging innovation across every facet of society. It’s expected to generate trillions of dollars in Read Article  ( 7 min )
    And … Action! Cuebric CEO Provides Insights Into Filmmaking Using AI
    These days, just about everyone is a content creator. But can generative AI help make people create high-quality films and other content affordably? Find out from Pinar Seyhan Demirdag, cofounder and CEO of Cuebric, during his conversation with NVIDIA AI Podcast host Noah Kravitz. Cuebric is on a mission to offer new solutions in filmmaking Read Article  ( 5 min )
    Time to Skill Up: Game Reviewer Ralph Panebianco Wields NVIDIA RTX for the Win
    YouTube content creator Ralph Panebianco really, really loves video games.  ( 6 min )
  • Open

    Self-Driving Cars: Need Help for Reward Shaping
    Hi everyone, I try to train PPO agents that control non-playable cars in a custom simulation environment. I try to come up with a reward for this task. There is no destination point or goal state for those cars. They should go as fast as the speed limit and forever (without colliding with walls or other cars). The action and observation space are continuous. The actions are throttle (-1 for full break and 1 for full throttle) and steering (-pi/3 for full left and pi/3 for full right). Here are some observations that can be beneficial: Car's Speed The speed limit The angle between the car's heading direction and the road direction Offset from the lane center Previous Action (for smooth control?) The simulation environment entirely belongs to me, so if you think some other observations can be useful, you can also suggest them and I can code them. Thank you for your answers. submitted by /u/Few-Pen-9807 [link] [comments]
    Reasoning with PyReason
    submitted by /u/Neurosymbolic [link] [comments]
    RL & Prediction
    Hi everyone, I am fairly new to RL and one question has been on my mind for a while, maybe you can help me: Prescriptive Problem: An optimization problem like vehicle routing for bicycle re-allocation in a bike-sharing context (e.g., I have k stations where bikes can be rented plus a lot of other information are there like how many trucks, costs etc.) --> goal is to ensure that people have enough bikes (maximizing profit or simply usage) while minimizing costs for reallocation (e.g., if not necessary I do not need to send the truck to some stations etc.) Question: Now I have contextual information that comes from a prediction model, for example for the case above I have a demand prediction for each stations for the next time interval t (lets say the next hour) - I can use contextual optimization (e.g., predict then optimize) to use this info, however I was wandering, where do I put this info in RL (do I encode it in the state? Has it something to do with look-ahead?). Any help or literature advice is certainly appreciated! Thx in advanced! submitted by /u/mexodus [link] [comments]
    How can I teach a robot dog to walk on rough terrain?
    I am working on a course project for a class where I have a robot dog trained to walk on rough terrain on a simulator. What RL algos, environments or anything else do you recommend I should use. How should I even go about the problem? What is the output the policy? is it the torques on the joints? or something else? (I have a good experience with ML and just started getting into Deep RL in the past months) submitted by /u/elonmusk-A12 [link] [comments]
    Help double checking whether I'm passing JAX PRNG key around correctly for reproducibility with RL algorithms?
    Hello everyone! I have an implementation of JAX SAC and JAX PPO that are functional and train correctly, but aren't completely reproducible. For the SAC implementation the episodic returns are identical across multiple runs of the same seed but the plots for the losses are just slightly different: SAC plots over multiple runs, returns are same but losses are slightly different For the PPO runs, it seems like sometimes the runs are identical, but other times they aren't. For example, all of these are repeated runs using the same seed: ​ PPO runs, all with the same seed, two pairs where the runs in each pair are identical but pair vs. pair is different My SAC implementation is here, everything is contained in one script: https://pastebin.com/bPA2eqqB My PPO implementation is here, also contained in one script: https://pastebin.com/XDMdBHRx Would anyone be able to double check for me if I'm messing up my RNG key handling somehow? I've looked at this myself but I'm clearly missing something. Thanks! submitted by /u/1cedrake [link] [comments]
    Help me understand: why use a policy net instead of just a value net?
    Context: let's say something like AlphaZero. A deterministic game where you take an action and it's easy to calculate the subsequent state after that action. In digging through examples of how people have used RL for such scenarios, I saw a lot of examples of training both a policy network to evaluate actions and a value network to evaluate states. What I'm puzzling over is, why both? Why not just one network to evaluate states, and you choose the action based on the state it will lead to? I don't know if this is a hard, context-dependent question, an easy question with a clear answer that I will shortly hear, or the sort of question where the answer is just that it works better for some reason. My intuition is both that the state evaluation would work better than the policy evaluation AND that only having to train one network would be an advantage. Some possibilities I can think of: Running a network on one state to get outputs for N actions is significantly cheaper than running a network on N states (this feels likely to be part of the answer) Not calculating the subsequent states is cheaper than calculating them It generalizes to situations where the next state isn't known It works better specifically when actions have much smaller representations than states There is some additional benefit from having two independent networks It's easier to get the policy network trained up to a level than to train the value net up to the same level of accuracy because it cares about fewer things submitted by /u/seventythree [link] [comments]
    People with no top-tier ML papers, where are you working at?
    I am graduating soon, and my Ph.D. research is about RL algorithms and their applications. However, I failed to publish papers in top-tier ML conferences (NeurIPS, ICLR, ICML). But with several papers in my domain, how can I get hired for an RL-related job? I have interviewed a handful of mobile and e-commerce (RecSys) companies, all failed. I don't want to do a postdoc and I am not interested in anything related to academia. Please let me know if there are any opportunities in startups, or other positions I have not explored yet. submitted by /u/Blasphemer666 [link] [comments]
  • Open

    Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources
    Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge. This generative AI task is called text-to-SQL, which generates SQL queries from natural language processing (NLP) and converts text into semantically correct SQL. The solution in this post aims to […]  ( 11 min )
  • Open

    Reasoning with PyReason
    submitted by /u/Neurosymbolic [link] [comments]
  • Open

    Ever: Mitigating Hallucination in Large Language Models through Real-Time Verification and Rectification
    arXiv:2311.09114v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating fluent text. However, they often encounter the challenge of generating inaccurate or hallucinated content. This issue is common in both non-retrieval-based generation and retrieval-augmented generation approaches, and existing post-hoc rectification methods may not address the accumulated hallucination errors that may be caused by the "snowballing" issue, especially in reasoning tasks. To tackle these challenges, we introduce a novel approach called Real-time Verification and Rectification (Ever). Instead of waiting until the end of the generation process to rectify hallucinations, Ever employs a real-time, step-wise generation and hallucination rectification strategy. The primary objective is to detect and rectify hallucinations as they occur during the text generation process. When compared to both retrieval-based and non-retrieval-based baselines, Ever demonstrates a significant improvement in generating trustworthy and factually accurate text across a diverse range of tasks, including short-form QA, biography generation, and multi-hop reasoning.  ( 2 min )
    Efficient Ensemble for Multimodal Punctuation Restoration using Time-Delay Neural Network
    arXiv:2302.13376v2 Announce Type: replace-cross Abstract: Punctuation restoration plays an essential role in the post-processing procedure of automatic speech recognition, but model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 points, using less than a tenth of its inference network parameters. We streamline a speech recognizer to efficiently output hidden layer acoustic embeddings for punctuation restoration, as well as BERT to extract meaningful text embeddings. By using forced alignment and temporal convolutions, we eliminate the need for attention-based fusion, greatly increasing computational efficiency and raising performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT's purely language-based predictions slightly more than the multimodal network's predictions. Our code is available at https://github.com/lxy-peter/EfficientPunct.  ( 2 min )
    Distributed Graph Embedding with Information-Oriented Random Walks
    arXiv:2303.15702v2 Announce Type: replace-cross Abstract: Graph embedding maps graph nodes to low-dimensional vectors, and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, such as link prediction on Twitter with over one billion edges. Most existing graph embedding methods fall short of reaching high data scalability. In this paper, we present a general-purpose, distributed, information-centric random walk-based graph embedding framework, DistGER, which can scale to embed billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy, simultaneously achieving high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model to generate node embeddings by optimizing the access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs demonstrate that compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, and Pytorch-BigGraph, DistGER exhibits 2.33x-129x acceleration, 45% reduction in cross-machines communication, and > 10% effectiveness improvement in downstream tasks.  ( 2 min )
    Learning to Generalize towards Unseen Domains via a Content-Aware Style Invariant Model for Disease Detection from Chest X-rays
    arXiv:2302.13991v4 Announce Type: replace-cross Abstract: Performance degradation due to distribution discrepancy is a longstanding challenge in intelligent imaging, particularly for chest X-rays (CXRs). Recent studies have demonstrated that CNNs are biased toward styles (e.g., uninformative textures) rather than content (e.g., shape), in stark contrast to the human vision system. Radiologists tend to learn visual cues from CXRs and thus perform well across multiple domains. Motivated by this, we employ the novel on-the-fly style randomization modules at both image (SRM-IL) and feature (SRM-FL) levels to create rich style perturbed features while keeping the content intact for robust cross-domain performance. Previous methods simulate unseen domains by constructing new styles via interpolation or swapping styles from existing data, limiting them to available source domains during training. However, SRM-IL samples the style statistics from the possible value range of a CXR image instead of the training data to achieve more diversified augmentations. Moreover, we utilize pixel-wise learnable parameters in the SRM-FL compared to pre-defined channel-wise mean and standard deviations as style embeddings for capturing more representative style features. Additionally, we leverage consistency regularizations on global semantic features and predictive distributions from with and without style-perturbed versions of the same CXR to tweak the model's sensitivity toward content markers for accurate predictions. Our proposed method, trained on CheXpert and MIMIC-CXR datasets, achieves 77.32$\pm$0.35, 88.38$\pm$0.19, 82.63$\pm$0.13 AUCs(%) on the unseen domain test datasets, i.e., BRAX, VinDr-CXR, and NIH chest X-ray14, respectively, compared to 75.56$\pm$0.80, 87.57$\pm$0.46, 82.07$\pm$0.19 from state-of-the-art models on five-fold cross-validation with statistically significant results in thoracic disease classification.  ( 3 min )
    Topology-Aware Loss for Aorta and Great Vessel Segmentation in Computed Tomography Images
    arXiv:2307.03137v2 Announce Type: replace-cross Abstract: Segmentation networks are not explicitly imposed to learn global invariants of an image, such as the shape of an object and the geometry between multiple objects, when they are trained with a standard loss function. On the other hand, incorporating such invariants into network training may help improve performance for various segmentation tasks when they are the intrinsic characteristics of the objects to be segmented. One example is segmentation of aorta and great vessels in computed tomography (CT) images where vessels are found in a particular geometry in the body due to the human anatomy and they mostly seem as round objects on a 2D CT image. This paper addresses this issue by introducing a new topology-aware loss function that penalizes topology dissimilarities between the ground truth and prediction through persistent homology. Different from the previously suggested segmentation network designs, which apply the threshold filtration on a likelihood function of the prediction map and the Betti numbers of the ground truth, this paper proposes to apply the Vietoris-Rips filtration to obtain persistence diagrams of both ground truth and prediction maps and calculate the dissimilarity with the Wasserstein distance between the corresponding persistence diagrams. The use of this filtration has advantage of modeling shape and geometry at the same time, which may not happen when the threshold filtration is applied. Our experiments on 4327 CT images of 24 subjects reveal that the proposed topology-aware loss function leads to better results than its counterparts, indicating the effectiveness of this use.  ( 3 min )
    Rethinking Human-AI Collaboration in Complex Medical Decision Making: A Case Study in Sepsis Diagnosis
    arXiv:2309.12368v2 Announce Type: replace-cross Abstract: Today's AI systems for medical decision support often succeed on benchmark datasets in research papers but fail in real-world deployment. This work focuses on the decision making of sepsis, an acute life-threatening systematic infection that requires an early diagnosis with high uncertainty from the clinician. Our aim is to explore the design requirements for AI systems that can support clinical experts in making better decisions for the early diagnosis of sepsis. The study begins with a formative study investigating why clinical experts abandon an existing AI-powered Sepsis predictive module in their electrical health record (EHR) system. We argue that a human-centered AI system needs to support human experts in the intermediate stages of a medical decision-making process (e.g., generating hypotheses or gathering data), instead of focusing only on the final decision. Therefore, we build SepsisLab based on a state-of-the-art AI algorithm and extend it to predict the future projection of sepsis development, visualize the prediction uncertainty, and propose actionable suggestions (i.e., which additional laboratory tests can be collected) to reduce such uncertainty. Through heuristic evaluation with six clinicians using our prototype system, we demonstrate that SepsisLab enables a promising human-AI collaboration paradigm for the future of AI-assisted sepsis diagnosis and other high-stakes medical decision making.  ( 3 min )
    Can Large Language Models Transform Computational Social Science?
    arXiv:2305.03514v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the Computational Social Science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references. We conclude that the performance of today's LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are posed to meaningfully participate in social science analysis in partnership with humans.  ( 2 min )
    Inertial Navigation Meets Deep Learning: A Survey of Current Trends and Future Directions
    arXiv:2307.00014v2 Announce Type: replace-cross Abstract: Inertial sensing is used in many applications and platforms, ranging from day-to-day devices such as smartphones to very complex ones such as autonomous vehicles. In recent years, the development of machine learning and deep learning techniques has increased significantly in the field of inertial sensing and sensor fusion. This is due to the development of efficient computing hardware and the accessibility of publicly available sensor data. These data-driven approaches mainly aim to empower model-based inertial sensing algorithms. To encourage further research in integrating deep learning with inertial navigation and fusion and to leverage their capabilities, this paper provides an in-depth review of deep learning methods for inertial sensing and sensor fusion. We discuss learning methods for calibration and denoising as well as approaches for improving pure inertial navigation and sensor fusion. The latter is done by learning some of the fusion filter parameters. The reviewed approaches are classified by the environment in which the vehicles operate: land, air, and sea. In addition, we analyze trends and future directions in deep learning-based navigation and provide statistical data on commonly used approaches.  ( 2 min )
    Low-Cost High-Power Membership Inference Attacks
    arXiv:2312.03262v2 Announce Type: replace-cross Abstract: Membership inference attacks (MIA) aim to detect if a particular data point was used in training a machine learning model. Recent strong attacks have high computational costs and inconsistent performance under varying conditions, rendering them unreliable for practical privacy risk assessment. We design a novel, efficient, and robust membership inference attack (RMIA) which accurately differentiates between population data and training data of a model, with minimal computational overhead. We achieve this by a more accurate modeling of the null hypothesis setting in our likelihood ratio tests, and effectively leveraging both reference models and reference data samples from the population. Our algorithm exhibits superior test power (true-positive rate) compared to prior methods, throughout the TPR-FPR curve including at extremely low false-positive rates (as low as 0). Under computation constraints, where only a limited number of pre-trained reference models (as few as 1) are available, and also when we vary other elements of the attack, our method performs exceptionally well, unlike some prior attacks that approach random guessing. RMIA outperforms the prior work in all configurations of the attack setup. RMIA lays the algorithmic groundwork for practical yet accurate and reliable privacy risk analysis in machine learning.  ( 2 min )
    Efficient Convex Algorithms for Universal Kernel Learning
    arXiv:2304.07472v2 Announce Type: replace-cross Abstract: The accuracy and complexity of machine learning algorithms based on kernel optimization are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (for tractability); be dense in the set of all kernels (for robustness); be universal (for accuracy). Recently, a framework was proposed for using positive matrices to parameterize a class of positive semi-separable kernels. Although this class can be shown to meet all three criteria, previous algorithms for optimization of such kernels were limited to classification and furthermore relied on computationally complex Semidefinite Programming (SDP) algorithms. In this paper, we pose the problem of learning semiseparable kernels as a minimax optimization problem and propose a SVD-QCQP primal-dual algorithm which dramatically reduces the computational complexity as compared with previous SDP-based approaches. Furthermore, we provide an efficient implementation of this algorithm for both classification and regression -- an implementation which enables us to solve problems with 100 features and up to 30,000 datums. Finally, when applied to benchmark data, the algorithm demonstrates the potential for significant improvement in accuracy over typical (but non-convex) approaches such as Neural Nets and Random Forest with similar or better computation time.  ( 2 min )
    The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models
    arXiv:2311.05928v2 Announce Type: replace-cross Abstract: In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. Which is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties.  ( 2 min )
    Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
    arXiv:2312.12430v3 Announce Type: replace-cross Abstract: In recent RAG approaches, rerankers play a pivotal role in refining retrieval accuracy with the ability of revealing logical relations for each pair of query and text. However, existing rerankers are required to repeatedly encode the query and a large number of long retrieved text. This results in high computational costs and limits the number of retrieved text, hindering accuracy. As a remedy of the problem, we introduce the Efficient Title Reranker via Broadcasting Query Encoder, a novel technique for title reranking that achieves a 20x-40x speedup over the vanilla passage reranker. Furthermore, we introduce Sigmoid Trick, a novel loss function customized for title reranking. Combining both techniques, we empirically validated their effectiveness, achieving state-of-the-art results on all four datasets we experimented with from the KILT knowledge benchmark.  ( 2 min )
    Pre-Training LiDAR-Based 3D Object Detectors Through Colorization
    arXiv:2310.14592v2 Announce Type: replace-cross Abstract: Accurate 3D object detection and understanding for self-driving cars heavily relies on LiDAR point clouds, necessitating large amounts of labeled data to train. In this work, we introduce an innovative pre-training approach, Grounded Point Colorization (GPC), to bridge the gap between data and labels by teaching the model to colorize LiDAR point clouds, equipping it with valuable semantic cues. To tackle challenges arising from color variations and selection bias, we incorporate color as "context" by providing ground-truth colors as hints during colorization. Experimental results on the KITTI and Waymo datasets demonstrate GPC's remarkable effectiveness. Even with limited labeled data, GPC significantly improves fine-tuning performance; notably, on just 20% of the KITTI dataset, GPC outperforms training from scratch with the entire dataset. In sum, we introduce a fresh perspective on pre-training for 3D object detection, aligning the objective with the model's intended role and ultimately advancing the accuracy and efficiency of 3D object detection for autonomous vehicles.  ( 2 min )
    Learning to reconstruct the bubble distribution with conductivity maps using Invertible Neural Networks and Error Diffusion
    arXiv:2307.02496v2 Announce Type: replace-cross Abstract: Electrolysis is crucial for eco-friendly hydrogen production, but gas bubbles generated during the process hinder reactions, reduce cell efficiency, and increase energy consumption. Additionally, these gas bubbles cause changes in the conductivity inside the cell, resulting in corresponding variations in the induced magnetic field around the cell. Therefore, measuring these gas bubble-induced magnetic field fluctuations using external magnetic sensors and solving the inverse problem of Biot-Savart Law allows for estimating the conductivity in the cell and, thus, bubble size and location. However, determining high-resolution conductivity maps from only a few induced magnetic field measurements is an ill-posed inverse problem. To overcome this, we exploit Invertible Neural Networks (INNs) to reconstruct the conductivity field. Our qualitative results and quantitative evaluation using random error diffusion show that INN achieves far superior performance compared to Tikhonov regularization.  ( 2 min )
    HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy
    arXiv:2401.15207v2 Announce Type: replace Abstract: Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60\% GPU memory compared with standard full-parameter fine-tuning for 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on single 48G A6000 with a precision of 32 using the AdamW optimizer, without using any memory saving techniques.  ( 2 min )
    A Foundation Model for General Moving Object Segmentation in Medical Images
    arXiv:2309.17264v4 Announce Type: replace-cross Abstract: Medical image segmentation aims to delineate the anatomical or pathological structures of interest, playing a crucial role in clinical diagnosis. A substantial amount of high-quality annotated data is crucial for constructing high-precision deep segmentation models. However, medical annotation is highly cumbersome and time-consuming, especially for medical videos or 3D volumes, due to the huge labeling space and poor inter-frame consistency. Recently, a fundamental task named Moving Object Segmentation (MOS) has made significant advancements in natural images. Its objective is to delineate moving objects from the background within image sequences, requiring only minimal annotations. In this paper, we propose the first foundation model, named iMOS, for MOS in medical images. Extensive experiments on a large multi-modal medical dataset validate the effectiveness of the proposed iMOS. Specifically, with the annotation of only a small number of images in the sequence, iMOS can achieve satisfactory tracking and segmentation performance of moving objects throughout the entire sequence in bi-directions. We hope that the proposed iMOS can help accelerate the annotation speed of experts, and boost the development of medical foundation models.  ( 3 min )
    User Inference Attacks on Large Language Models
    arXiv:2310.09266v2 Announce Type: replace-cross Abstract: Fine-tuning is a common and effective method for tailoring large language models (LLMs) to specialized tasks and applications. In this paper, we study the privacy implications of fine-tuning LLMs on user data. To this end, we consider a realistic threat model, called user inference, wherein an attacker infers whether or not a user's data was used for fine-tuning. We design attacks for performing user inference that require only black-box access to the fine-tuned LLM and a few samples from a user which need not be from the fine-tuning dataset. We find that LLMs are susceptible to user inference across a variety of fine-tuning datasets, at times with near perfect attack success rates. Further, we theoretically and empirically investigate the properties that make users vulnerable to user inference, finding that outlier users, users with identifiable shared features between examples, and users that contribute a large fraction of the fine-tuning data are most susceptible to attack. Based on these findings, we identify several methods for mitigating user inference including training with example-level differential privacy, removing within-user duplicate examples, and reducing a user's contribution to the training data. While these techniques provide partial mitigation of user inference, we highlight the need to develop methods to fully protect fine-tuned LLMs against this privacy risk.  ( 3 min )
    Let Models Speak Ciphers: Multiagent Debate through Embeddings
    arXiv:2310.06272v2 Announce Type: replace-cross Abstract: Discussion and debate among Large Language Models (LLMs) have gained considerable attention due to their potential to enhance the reasoning ability of LLMs. Although natural language is an obvious choice for communication due to LLM's language understanding capability, the token sampling step needed when generating natural language poses a potential risk of information loss, as it uses only one token to represent the model's belief across the entire vocabulary. In this paper, we introduce a communication regime named CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. Specifically, we remove the token sampling step from LLMs and let them communicate their beliefs across the vocabulary through the expectation of the raw transformer output embeddings. Remarkably, by deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights, outperforming the state-of-the-art LLM debate methods using natural language by 0.5-5.0% across five reasoning tasks and multiple open-source LLMs of varying sizes. This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs. We anticipate that CIPHER will inspire further exploration for the design of interactions within LLM agent systems, offering a new direction that could significantly influence future developments in the field.  ( 3 min )
    Mixed Variational Flows for Discrete Variables
    arXiv:2308.15613v3 Announce Type: replace-cross Abstract: Variational flows allow practitioners to learn complex continuous distributions, but approximating discrete distributions remains a challenge. Current methodologies typically embed the discrete target in a continuous space - usually via continuous relaxation or dequantization - and then apply a continuous flow. These approaches involve a surrogate target that may not capture the original discrete target, might have biased or unstable gradients, and can create a difficult optimization problem. In this work, we develop a variational flow family for discrete distributions without any continuous embedding. First, we develop a measure-preserving and discrete (MAD) invertible map that leaves the discrete target invariant, and then create a mixed variational flow (MAD Mix) based on that map. Our family provides access to i.i.d. sampling and density evaluation with virtually no tuning effort. We also develop an extension to MAD Mix that handles joint discrete and continuous models. Our experiments suggest that MAD Mix produces more reliable approximations than continuous-embedding flows while being significantly faster to train.  ( 2 min )
    Spherical convolutional neural networks can improve brain microstructure estimation from diffusion MRI data
    arXiv:2211.09887v3 Announce Type: replace-cross Abstract: Diffusion magnetic resonance imaging is sensitive to the microstructural properties of brain tissue. However, estimating clinically and scientifically relevant microstructural properties from the measured signals remains a highly challenging inverse problem that machine learning may help solve. This study investigated if recently developed rotationally invariant spherical convolutional neural networks can improve microstructural parameter estimation. We trained a spherical convolutional neural network to predict the ground-truth parameter values from efficiently simulated noisy data and applied the trained network to imaging data acquired in a clinical setting to generate microstructural parameter maps. Our network performed better than the spherical mean technique and multi-layer perceptron, achieving higher prediction accuracy than the spherical mean technique with less rotational variance than the multi-layer perceptron. Although we focused on a constrained two-compartment model of neuronal tissue, the network and training pipeline are generalizable and can be used to estimate the parameters of any Gaussian compartment model. To highlight this, we also trained the network to predict the parameters of a three-compartment model that enables the estimation of apparent neural soma density using tensor-valued diffusion encoding.  ( 3 min )
    Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation
    arXiv:2308.08079v2 Announce Type: replace Abstract: Subsurface datasets inherently possess big data characteristics such as vast volume, diverse features, and high sampling speeds, further compounded by the curse of dimensionality from various physical, engineering, and geological inputs. Among the existing dimensionality reduction (DR) methods, nonlinear dimensionality reduction (NDR) methods, especially Metric-multidimensional scaling (MDS), are preferred for subsurface datasets due to their inherent complexity. While MDS retains intrinsic data structure and quantifies uncertainty, its limitations include unstabilized unique solutions invariant to Euclidean transformations and an absence of out-of-sample points (OOSP) extension. To enhance subsurface inferential and machine learning workflows, datasets must be transformed into stable, reduced-dimension representations that accommodate OOSP. Our solution employs rigid transformations for a stabilized Euclidean invariant representation for LDS. By computing an MDS input dissimilarity matrix, and applying rigid transformations on multiple realizations, we ensure transformation invariance and integrate OOSP. This process leverages a convex hull algorithm and incorporates loss function and normalized stress for distortion quantification. We validate our approach with synthetic data, varying distance metrics, and real-world wells from the Duvernay Formation. Results confirm our method's efficacy in achieving consistent LDS representations. Furthermore, our proposed "stress ratio" (SR) metric provides insight into uncertainty, beneficial for model adjustments and inferential analysis. Consequently, our workflow promises enhanced repeatability and comparability in NDR for subsurface energy resource engineering and associated big data workflows.  ( 3 min )
    Function Vectors in Large Language Models
    arXiv:2310.15213v2 Announce Type: replace-cross Abstract: We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble the ICL contexts from which they are collected. We test FVs across a range of tasks, models, and layers and find strong causal effects across settings in middle layers. We investigate the internal structure of FVs and find while that they often contain information that encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Finally, we test semantic vector composition in FVs, and find that to some extent they can be summed to create vectors that trigger new complex tasks. Our findings show that compact, causal internal vector representations of function abstractions can be explicitly extracted from LLMs. Our code and data are available at https://functions.baulab.info.  ( 3 min )
    On Bilingual Lexicon Induction with Large Language Models
    arXiv:2310.13995v2 Announce Type: replace-cross Abstract: Bilingual Lexicon Induction (BLI) is a core task in multilingual NLP that still, to a large extent, relies on calculating cross-lingual word representations. Inspired by the global paradigm shift in NLP towards Large Language Models (LLMs), we examine the potential of the latest generation of LLMs for the development of bilingual lexicons. We ask the following research question: Is it possible to prompt and fine-tune multilingual LLMs (mLLMs) for BLI, and how does this approach compare against and complement current BLI approaches? To this end, we systematically study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs, both without any LLM fine-tuning, as well as 3) standard BLI-oriented fine-tuning of smaller LLMs. We experiment with 18 open-source text-to-text mLLMs of different sizes (from 0.3B to 13B parameters) on two standard BLI benchmarks covering a range of typologically diverse languages. Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs. The results reveal that few-shot prompting with in-context examples from nearest neighbours achieves the best performance, establishing new state-of-the-art BLI scores for many language pairs. We also conduct a series of in-depth analyses and ablation studies, providing more insights on BLI with (m)LLMs, also along with their limitations.  ( 3 min )
    Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation
    arXiv:2310.03986v3 Announce Type: replace-cross Abstract: Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge performance drop due to missing modalities and outperform independent, dedicated networks trained for the available modality combinations in some cases. The proposed adaptation requires extremely small number of parameters (e.g., fewer than 0.7% of the total parameters) and applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing modality robustness of our proposed method on 5 different datasets for multimodal semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis tasks. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.  ( 3 min )
    Ask Again, Then Fail: Large Language Models' Vacillations in Judgement
    arXiv:2310.02174v2 Announce Type: replace-cross Abstract: We observe that current conversational language models often waver in their judgements when faced with follow-up questions, even if the original judgement was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a Follow-up Questioning Mechanism along with two metrics to quantify this inconsistency, confirming its widespread presence in current language models. To mitigate this issue, we explore various prompting strategies for closed-source models; moreover, we develop a training-based framework Unwavering-FQ that teaches language models to maintain their originally correct judgements through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of models (https://github.com/NUSTM/LLMs-Waver-In-Judgements).  ( 2 min )
    GeoLLM: Extracting Geospatial Knowledge from Large Language Models
    arXiv:2310.06213v2 Announce Type: replace-cross Abstract: The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power. Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks. We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density. We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods. Across these tasks, our method demonstrates a 70% improvement in performance (measured using Pearson's $r^2$) relative to baselines that use nearest neighbors or use information directly from the prompt, and performance equal to or exceeding satellite-based benchmarks in the literature. With GeoLLM, we observe that GPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset. Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe. Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well. Code is available on the project website: https://rohinmanvi.github.io/GeoLLM  ( 3 min )
    Pre-training with Synthetic Data Helps Offline Reinforcement Learning
    arXiv:2310.00771v3 Announce Type: replace-cross Abstract: Recently, it has been shown that for offline deep reinforcement learning (DRL), pre-training Decision Transformer with a large language corpus can improve downstream performance (Reid et al., 2022). A natural question to ask is whether this performance gain can only be achieved with language pre-training, or can be achieved with simpler pre-training schemes which do not involve language. In this paper, we first show that language is not essential for improved performance, and indeed pre-training with synthetic IID data for a small number of updates can match the performance gains from pre-training with a large language corpus; moreover, pre-training with data generated by a one-step Markov chain can further improve the performance. Inspired by these experimental results, we then consider pre-training Conservative Q-Learning (CQL), a popular offline DRL algorithm, which is Q-learning-based and typically employs a Multi-Layer Perceptron (MLP) backbone. Surprisingly, pre-training with simple synthetic data for a small number of updates can also improve CQL, providing consistent performance improvement on D4RL Gym locomotion datasets. The results of this paper not only illustrate the importance of pre-training for offline DRL but also show that the pre-training data can be synthetic and generated with remarkably simple mechanisms.  ( 2 min )
    Large Language Models for Telecom: Forthcoming Impact on the Industry
    arXiv:2308.06013v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs), AI-driven models that can achieve general-purpose language understanding and generation, have emerged as a transformative force, revolutionizing fields well beyond Natural Language Processing (NLP) and garnering unprecedented attention. As LLM technology continues to progress, the telecom industry is facing the prospect of its impact on its landscape. To elucidate these implications, we delve into the inner workings of LLMs, providing insights into their current capabilities and limitations. We also examine the use cases that can be readily implemented in the telecom industry, streamlining tasks, such as anomalies resolutions and technical specifications comprehension, which currently hinder operational efficiency and demand significant manpower and expertise. Furthermore, we uncover essential research directions that deal with the distinctive challenges of utilizing the LLMs within the telecom domain. Addressing them represents a significant stride towards fully harnessing the potential of LLMs and unlocking their capabilities to the fullest extent within the telecom domain.  ( 2 min )
    Autonomous and Human-Driven Vehicles Interacting in a Roundabout: A Quantitative and Qualitative Evaluation
    arXiv:2309.08254v2 Announce Type: replace-cross Abstract: Optimizing traffic dynamics in an evolving transportation landscape is crucial, particularly in scenarios where autonomous vehicles (AVs) with varying levels of autonomy coexist with human-driven cars. While optimizing Reinforcement Learning (RL) policies for such scenarios is becoming more and more common, little has been said about realistic evaluations of such trained policies. This paper presents an evaluation of the effects of AVs penetration among human drivers in a roundabout scenario, considering both quantitative and qualitative aspects. In particular, we learn a policy to minimize traffic jams (i.e., minimize the time to cross the scenario) and to minimize pollution in a roundabout in Milan, Italy. Through empirical analysis, we demonstrate that the presence of AVs} can reduce time and pollution levels. Furthermore, we qualitatively evaluate the learned policy using a cutting-edge cockpit to assess its performance in near-real-world conditions. To gauge the practicality and acceptability of the policy, we conduct evaluations with human participants using the simulator, focusing on a range of metrics like traffic smoothness and safety perception. In general, our findings show that human-driven vehicles benefit from optimizing AVs dynamics. Also, participants in the study highlight that the scenario with 80% AVs is perceived as safer than the scenario with 20%. The same result is obtained for traffic smoothness perception.  ( 3 min )
    Revisit and Outstrip Entity Alignment: A Perspective of Generative Models
    arXiv:2305.14651v2 Announce Type: replace-cross Abstract: Recent embedding-based methods have achieved great successes in exploiting entity alignment from knowledge graph (KG) embeddings of multiple modalities. In this paper, we study embedding-based entity alignment (EEA) from a perspective of generative models. We show that EEA shares similarities with typical generative models and prove the effectiveness of the recently developed generative adversarial network (GAN)-based EEA methods theoretically. We then reveal that their incomplete objective limits the capacity on both entity alignment and entity synthesis (i.e., generating new entities). We mitigate this problem by introducing a generative EEA (GEEA) framework with the proposed mutual variational autoencoder (M-VAE) as the generative model. M-VAE enables entity conversion between KGs and generation of new entities from random noise vectors. We demonstrate the power of GEEA with theoretical analysis and empirical experiments on both entity alignment and entity synthesis tasks.  ( 2 min )
    CaloClouds II: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation
    arXiv:2309.05704v2 Announce Type: replace-cross Abstract: Fast simulation of the energy depositions in high-granular detectors is needed for future collider experiments with ever-increasing luminosities. Generative machine learning (ML) models have been shown to speed up and augment the traditional simulation chain in physics analysis. However, the majority of previous efforts were limited to models relying on fixed, regular detector readout geometries. A major advancement is the recently introduced CaloClouds model, a geometry-independent diffusion model, which generates calorimeter showers as point clouds for the electromagnetic calorimeter of the envisioned International Large Detector (ILD). In this work, we introduce CaloClouds II which features a number of key improvements. This includes continuous time score-based modelling, which allows for a 25-step sampling with comparable fidelity to CaloClouds while yielding a $6\times$ speed-up over Geant4 on a single CPU ($5\times$ over CaloClouds). We further distill the diffusion model into a consistency model allowing for accurate sampling in a single step and resulting in a $46\times$ ($37\times$ over CaloClouds) speed-up. This constitutes the first application of consistency distillation for the generation of calorimeter showers.  ( 2 min )
    Explainable Equivariant Neural Networks for Particle Physics: PELICAN
    arXiv:2307.16506v4 Announce Type: replace-cross Abstract: PELICAN is a novel permutation equivariant and Lorentz invariant or covariant aggregator network designed to overcome common limitations found in architectures applied to particle physics problems. Compared to many approaches that use non-specialized architectures that neglect underlying physics principles and require very large numbers of parameters, PELICAN employs a fundamentally symmetry group-based architecture that demonstrates benefits in terms of reduced complexity, increased interpretability, and raw performance. We present a comprehensive study of the PELICAN algorithm architecture in the context of both tagging (classification) and reconstructing (regression) Lorentz-boosted top quarks, including the difficult task of specifically identifying and measuring the $W$-boson inside the dense environment of the Lorentz-boosted top-quark hadronic final state. We also extend the application of PELICAN to the tasks of identifying quark-initiated vs.~gluon-initiated jets, and a multi-class identification across five separate target categories of jets. When tested on the standard task of Lorentz-boosted top-quark tagging, PELICAN outperforms existing competitors with much lower model complexity and high sample efficiency. On the less common and more complex task of 4-momentum regression, PELICAN also outperforms hand-crafted, non-machine learning algorithms. We discuss the implications of symmetry-restricted architectures for the wider field of machine learning for physics.  ( 3 min )
    CaloClouds: Fast Geometry-Independent Highly-Granular Calorimeter Simulation
    arXiv:2305.04847v2 Announce Type: replace-cross Abstract: Simulating showers of particles in highly-granular detectors is a key frontier in the application of machine learning to particle physics. Achieving high accuracy and speed with generative machine learning models would enable them to augment traditional simulations and alleviate a major computing constraint. This work achieves a major breakthrough in this task by, for the first time, directly generating a point cloud of a few thousand space points with energy depositions in the detector in 3D space without relying on a fixed-grid structure. This is made possible by two key innovations: i) Using recent improvements in generative modeling we apply a diffusion model to generate photon showers as high-cardinality point clouds. ii) These point clouds of up to $6,000$ space points are largely geometry-independent as they are down-sampled from initial even higher-resolution point clouds of up to $40,000$ so-called Geant4 steps. We showcase the performance of this approach using the specific example of simulating photon showers in the planned electromagnetic calorimeter of the International Large Detector (ILD) and achieve overall good modeling of physically relevant distributions.  ( 3 min )
    Optimal Transport for Measures with Noisy Tree Metric
    arXiv:2310.13653v2 Announce Type: replace-cross Abstract: We study optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that such OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, the given tree structure may be, however, perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach which considers the maximal possible distances between two input measures over an uncertainty set of tree metrics. In general, this approach is hard to compute, even for measures supported in one-dimensional space, due to its non-convexity and non-smoothness which hinders its practical applications, especially for large-scale settings. In this work, we propose novel uncertainty sets of tree metrics from the lens of edge deletion/addition which covers a diversity of tree structures in an elegant framework. Consequently, by building upon the proposed uncertainty sets, and leveraging the tree structure over supports, we show that the robust OT also admits a closed-form expression for a fast computation as its counterpart standard OT (i.e., TW). Furthermore, we demonstrate that the robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite kernels and test them in several simulations on various real-world datasets on document classification and topological data analysis.  ( 3 min )
    Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
    arXiv:2310.01107v2 Announce Type: replace-cross Abstract: Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training text-to-video (T2V) models on text-video data or adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. To address this, here we present a novel grounding-guided video-to-video translation framework called Ground-A-Video for multi-attribute video editing. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without aforementioned shortcomings. Central to our method is the introduction of Cross-Frame Gated Attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency. Further results and code are available at our project page (http://ground-a-video.github.io).  ( 2 min )
    MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
    arXiv:2401.04081v2 Announce Type: replace Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.  ( 2 min )
    DOS: Diverse Outlier Sampling for Out-of-Distribution Detection
    arXiv:2306.02031v2 Announce Type: replace Abstract: Modern neural networks are known to give overconfident prediction for out-of-distribution inputs when deployed in the open world. It is common practice to leverage a surrogate outlier dataset to regularize the model during training, and recent studies emphasize the role of uncertainty in designing the sampling strategy for outlier dataset. However, the OOD samples selected solely based on predictive uncertainty can be biased towards certain types, which may fail to capture the full outlier distribution. In this work, we empirically show that diversity is critical in sampling outliers for OOD detection performance. Motivated by the observation, we propose a straightforward and novel sampling strategy named DOS (Diverse Outlier Sampling) to select diverse and informative outliers. Specifically, we cluster the normalized features at each iteration, and the most informative outlier from each cluster is selected for model training with absent category loss. With DOS, the sampled outliers efficiently shape a globally compact decision boundary between ID and OOD data. Extensive experiments demonstrate the superiority of DOS, reducing the average FPR95 by up to 25.79% on CIFAR-100 with TI-300K.  ( 2 min )
    Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts
    arXiv:2309.13850v2 Announce Type: replace-cross Abstract: Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when $k_{\ast}$ becomes unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.  ( 3 min )
    Broadcasting in random recursive dags
    arXiv:2306.01727v2 Announce Type: replace-cross Abstract: A uniform $k$-{\sc dag} generalizes the uniform random recursive tree by picking $k$ parents uniformly at random from the existing nodes. It starts with $k$ ''roots''. Each of the $k$ roots is assigned a bit. These bits are propagated by a noisy channel. The parents' bits are flipped with probability $p$, and a majority vote is taken. When all nodes have received their bits, the $k$-{\sc dag} is shown without identifying the roots. The goal is to estimate the majority bit among the roots. We identify the threshold for $p$ as a function of $k$ below which the majority rule among all nodes yields an error $c+o(1)$ with $c<1/2$. Above the threshold the majority rule errs with probability $1/2+o(1)$.  ( 2 min )
    Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View
    arXiv:2310.02124v2 Announce Type: replace-cross Abstract: As Natural Language Processing (NLP) systems are increasingly employed in intricate social environments, a pressing query emerges: Can these NLP systems mirror human-esque collaborative intelligence, in a multi-agent society consisting of multiple large language models (LLMs)? This paper probes the collaboration mechanisms among contemporary NLP systems by melding practical experiments with theoretical insights. We fabricate four unique `societies' comprised of LLM agents, where each agent is characterized by a specific `trait' (easy-going or overconfident) and engages in collaboration with a distinct `thinking pattern' (debate or reflection). Through evaluating these multi-agent societies on three benchmark datasets, we discern that certain collaborative strategies not only outshine previous top-tier approaches, but also optimize efficiency (using fewer API tokens). Moreover, our results further illustrate that LLM agents manifest human-like social behaviors, such as conformity and consensus reaching, mirroring foundational social psychology theories. In conclusion, we integrate insights from social psychology to contextualize the collaboration of LLM agents, inspiring further investigations into the collaboration mechanism for LLMs. We commit to sharing our code and datasets\footnote{\url{https://github.com/zjunlp/MachineSoM}.}, hoping to catalyze further research in this promising avenue.  ( 3 min )
    A hybrid quantum-classical fusion neural network to improve protein-ligand binding affinity predictions for drug discovery
    arXiv:2309.03919v2 Announce Type: replace-cross Abstract: The field of drug discovery hinges on the accurate prediction of binding affinity between prospective drug molecules and target proteins, especially when such proteins directly influence disease progression. However, estimating binding affinity demands significant financial and computational resources. While state-of-the-art methodologies employ classical machine learning (ML) techniques, emerging hybrid quantum machine learning (QML) models have shown promise for enhanced performance, owing to their inherent parallelism and capacity to manage exponential increases in data dimensionality. Despite these advances, existing models encounter issues related to convergence stability and prediction accuracy. This paper introduces a novel hybrid quantum-classical deep learning model tailored for binding affinity prediction in drug discovery. Specifically, the proposed model synergistically integrates 3D and spatial graph convolutional neural networks within an optimized quantum architecture. Simulation results demonstrate a 6% improvement in prediction accuracy relative to existing classical models, as well as a significantly more stable convergence performance compared to previous classical approaches.  ( 2 min )
    Solving Kernel Ridge Regression with Gradient-Based Optimization Methods
    arXiv:2306.16838v5 Announce Type: replace-cross Abstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.  ( 3 min )
    Energy-Based Concept Bottleneck Models: Unifying Prediction, Concept Intervention, and Probabilistic Interpretations
    arXiv:2401.14142v2 Announce Type: replace-cross Abstract: Existing methods, such as concept bottleneck models (CBMs), have been successful in providing concept-based interpretations for black-box deep learning models. They typically work by predicting concepts given the input and then predicting the final class label given the predicted concepts. However, (1) they often fail to capture the high-order, nonlinear interaction between concepts, e.g., correcting a predicted concept (e.g., "yellow breast") does not help correct highly correlated concepts (e.g., "yellow belly"), leading to suboptimal final accuracy; (2) they cannot naturally quantify the complex conditional dependencies between different concepts and class labels (e.g., for an image with the class label "Kentucky Warbler" and a concept "black bill", what is the probability that the model correctly predicts another concept "black crown"), therefore failing to provide deeper insight into how a black-box model works. In response to these limitations, we propose Energy-based Concept Bottleneck Models (ECBMs). Our ECBMs use a set of neural networks to define the joint energy of candidate (input, concept, class) tuples. With such a unified interface, prediction, concept correction, and conditional dependency quantification are then represented as conditional probabilities, which are generated by composing different energy functions. Our ECBMs address both limitations of existing CBMs, providing higher accuracy and richer concept interpretations. Empirical results show that our approach outperforms the state-of-the-art on real-world datasets.  ( 3 min )
    QuasiNet: a neural network with trainable product layers
    arXiv:2401.06137v2 Announce Type: replace-cross Abstract: Classical neural networks achieve only limited convergence in hard problems such as XOR or parity when the number of hidden neurons is small. With the motivation to improve the success rate of neural networks in these problems, we propose a new neural network model inspired by existing neural network models with so called product neurons and a learning rule derived from classical error backpropagation, which elegantly solves the problem of mutually exclusive situations. Unlike existing product neurons, which have weights that are preset and not adaptable, our product layers of neurons also do learn. We tested the model and compared its success rate to a classical multilayer perceptron in the aforementioned problems as well as in other hard problems such as the two spirals. Our results indicate that our model is clearly more successful than the classical MLP and has the potential to be used in many tasks and applications.  ( 2 min )
    AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Thirteen Modalities
    arXiv:2401.00546v2 Announce Type: replace-cross Abstract: For a long time, due to the high heterogeneity in structure and semantics among various spatiotemporal modal data, the joint interpretation of multimodal spatiotemporal data has been an extremely challenging problem. The primary challenge resides in striking a trade-off between the cohesion and autonomy of diverse modalities, and this trade-off exhibits a progressively nonlinear nature as the number of modalities expands. We introduce the Language as Reference Framework (LaRF), a fundamental principle for constructing a multimodal unified model, aiming to strike a trade-off between the cohesion and autonomy among different modalities. We propose a multimodal spatiotemporal general artificial intelligence model, called AllSpark. Our model integrates thirteen different modalities into a unified framework, including 1D (text, code), 2D (RGB, infrared, SAR, multispectral, hyperspectral, tables, graphs, trajectory, oblique photography), and 3D (point clouds, videos) modalities. To achieve modal cohesion, AllSpark uniformly maps diverse modal features to the language modality. In addition, we design modality-specific prompts to guide multi-modal large language models in accurately perceiving multimodal data. To maintain modality autonomy, AllSpark introduces modality-specific encoders to extract the tokens of various spatiotemporal modalities. And modal bridge is employed to achieve dimensional projection from each modality to the language modality. Finally, observing a gap between the model's interpretation and downstream tasks, we designed task heads to enhance the model's generalization capability on specific downstream tasks. Experiments indicate that AllSpark achieves competitive accuracy in modalities such as RGB and trajectory compared to state-of-the-art models.  ( 3 min )
    M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy
    arXiv:2312.15927v3 Announce Type: replace-cross Abstract: Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs. To address these challenges, dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset. Nowadays, optimization-oriented methods have been the primary method in the field of dataset condensation for achieving SOTA results. However, the bi-level optimization process hinders the practical application of such methods to realistic and larger datasets. To enhance condensation efficiency, previous works proposed Distribution-Matching (DM) as an alternative, which significantly reduces the condensation cost. Nonetheless, current DM-based methods still yield less comparable results to SOTA optimization-oriented methods. In this paper, we argue that existing DM-based methods overlook the higher-order alignment of the distributions, which may lead to sub-optimal matching results. Inspired by this, we present a novel DM-based method named M3D for dataset condensation by Minimizing the Maximum Mean Discrepancy between feature representations of the synthetic and real images. By embedding their distributions in a reproducing kernel Hilbert space, we align all orders of moments of the distributions of real and synthetic images, resulting in a more generalized condensed set. Notably, our method even surpasses the SOTA optimization-oriented method IDC on the high-resolution ImageNet dataset. Extensive analysis is conducted to verify the effectiveness of the proposed method. Source codes are available at https://github.com/Hansong-Zhang/M3D.  ( 3 min )
    Diffusion Model with Perceptual Loss
    arXiv:2401.00110v4 Announce Type: replace-cross Abstract: Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.  ( 2 min )
    RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair
    arXiv:2312.15698v2 Announce Type: replace-cross Abstract: Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter' for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines.  ( 2 min )
    Learning Multi-graph Structure for Temporal Knowledge Graph Reasoning
    arXiv:2312.03004v2 Announce Type: replace-cross Abstract: Temporal Knowledge Graph (TKG) reasoning that forecasts future events based on historical snapshots distributed over timestamps is denoted as extrapolation and has gained significant attention. Owing to its extreme versatility and variation in spatial and temporal correlations, TKG reasoning presents a challenging task, demanding efficient capture of concurrent structures and evolutional interactions among facts. While existing methods have made strides in this direction, they still fall short of harnessing the diverse forms of intrinsic expressive semantics of TKGs, which encompass entity correlations across multiple timestamps and periodicity of temporal information. This limitation constrains their ability to thoroughly reflect historical dependencies and future trends. In response to these drawbacks, this paper proposes an innovative reasoning approach that focuses on Learning Multi-graph Structure (LMS). Concretely, it comprises three distinct modules concentrating on multiple aspects of graph structure knowledge within TKGs, including concurrent and evolutional patterns along timestamps, query-specific correlations across timestamps, and semantic dependencies of timestamps, which capture TKG features from various perspectives. Besides, LMS incorporates an adaptive gate for merging entity representations both along and across timestamps effectively. Moreover, it integrates timestamp semantics into graph attention calculations and time-aware decoders, in order to impose temporal constraints on events and narrow down prediction scopes with historical statistics. Extensive experimental results on five event-based benchmark datasets demonstrate that LMS outperforms state-of-the-art extrapolation models, indicating the superiority of modeling a multi-graph perspective for TKG reasoning.  ( 3 min )
    Convergence Analysis of Fractional Gradient Descent
    arXiv:2311.18426v4 Announce Type: replace-cross Abstract: Fractional derivatives are a well-studied generalization of integer order derivatives. Naturally, for optimization, it is of interest to understand the convergence properties of gradient descent using fractional derivatives. Convergence analysis of fractional gradient descent is currently limited both in the methods analyzed and the settings analyzed. This paper aims to fill in these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. First, novel bounds will be established bridging fractional and integer derivatives. Then, these bounds will be applied to the aforementioned settings to prove linear convergence for smooth and strongly convex functions and $O(1/T)$ convergence for smooth and convex functions. Additionally, we prove $O(1/T)$ convergence for smooth and non-convex functions using an extended notion of smoothness - H\"older smoothness - that is more natural for fractional derivatives. Finally, empirical results will be presented on the potential speed up of fractional gradient descent over standard gradient descent as well as some preliminary theoretical results explaining this speed up.  ( 2 min )
    Tube-NeRF: Efficient Imitation Learning of Visuomotor Policies from MPC using Tube-Guided Data Augmentation and NeRFs
    arXiv:2311.14153v2 Announce Type: replace-cross Abstract: Imitation learning (IL) can train computationally-efficient sensorimotor policies from a resource-intensive Model Predictive Controller (MPC), but it often requires many samples, leading to long training times or limited robustness. To address these issues, we combine IL with a variant of robust MPC that accounts for process and sensing uncertainties, and we design a data augmentation (DA) strategy that enables efficient learning of vision-based policies. The proposed DA method, named Tube-NeRF, leverages Neural Radiance Fields (NeRFs) to generate novel synthetic images, and uses properties of the robust MPC (the tube) to select relevant views and to efficiently compute the corresponding actions. We tailor our approach to the task of localization and trajectory tracking on a multirotor, by learning a visuomotor policy that generates control actions using images from the onboard camera as only source of horizontal position. Numerical evaluations show 80-fold increase in demonstration efficiency and a 50% reduction in training time over current IL methods. Additionally, our policies successfully transfer to a real multirotor, achieving low tracking errors despite large disturbances, with an onboard inference time of only 1.5 ms. Video: https://youtu.be/_W5z33ZK1m4  ( 3 min )
    Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
    arXiv:2311.12351v2 Announce Type: replace-cross Abstract: Transformer-based Large Language Models (LLMs) have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents, and marking a stride towards achieving Artificial General Intelligence (AGI). However, current LLMs are predominantly pretrained on short text snippets, which compromises their effectiveness in processing the long-context prompts that are frequently encountered in practical scenarios. This article offers a comprehensive survey of the recent advancement in Transformer-based LLM architectures aimed at enhancing the long-context capabilities of LLMs throughout the entire model lifecycle, from pre-training through to inference. We first delineate and analyze the problems of handling long-context input and output with the current Transformer-based models. We then provide a taxonomy and the landscape of upgrades on Transformer architecture to solve these problems. Afterwards, we provide an investigation on wildly used evaluation necessities tailored for long-context LLMs, including datasets, metrics, and baseline models, as well as optimization toolkits such as libraries, frameworks, and compilers to boost the efficacy of LLMs across different stages in runtime. Finally, we discuss the challenges and potential avenues for future research. A curated repository of relevant literature, continuously updated, is available at https://github.com/Strivin0311/long-llms-learning.  ( 3 min )
    Towards A Unified View of Answer Calibration for Multi-Step Reasoning
    arXiv:2311.09101v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) employing Chain-of-Thought (CoT) prompting have broadened the scope for improving multi-step reasoning capabilities. We generally divide multi-step reasoning into two phases: path generation to generate the reasoning path(s); and answer calibration post-processing the reasoning path(s) to obtain a final answer. However, the existing literature lacks systematic analysis on different answer calibration approaches. In this paper, we summarize the taxonomy of recent answer calibration techniques and break them down into step-level and path-level strategies. We then conduct a thorough evaluation on these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Experimental results reveal that integrating the dominance of both strategies tends to derive optimal outcomes. Our study holds the potential to illuminate key insights for optimizing multi-step reasoning with answer calibration.  ( 2 min )
    Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?
    arXiv:2311.07564v2 Announce Type: replace-cross Abstract: Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., 'um', 'uh-huh'), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic information is increasingly controlled. We present analyses of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.  ( 2 min )
    Supervised and Penalized Baseline Correction
    arXiv:2310.18306v3 Announce Type: replace-cross Abstract: Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations. As a result, these baselines can adversely affect analytical and quantitative results. Baseline correction is an umbrella term where one applies pre-processing methods to obtain baseline spectra (the unwanted distortions) and then remove the distortions by differencing. However, current state-of-the art baseline correction methods do not utilize analyte concentrations even if they are available, or even if they contribute significantly to the observed spectral variability. We examine a class of state-of-the-art methods (penalized baseline correction) and modify them such that they can accommodate a priori analyte concentrations such that prediction can be enhanced. Performance will be assessed on two near infra-red data sets across both classical penalized baseline correction methods (without analyte information) and modified penalized baseline correction methods (leveraging analyte information).  ( 2 min )
    Orthogonal Gradient Boosting for Simpler Additive Rule Ensembles
    arXiv:2402.15691v1 Announce Type: new Abstract: Gradient boosting of prediction rules is an efficient approach to learn potentially interpretable yet accurate probabilistic models. However, actual interpretability requires to limit the number and size of the generated rules, and existing boosting variants are not designed for this purpose. Though corrective boosting refits all rule weights in each iteration to minimise prediction risk, the included rule conditions tend to be sub-optimal, because commonly used objective functions fail to anticipate this refitting. Here, we address this issue by a new objective function that measures the angle between the risk gradient vector and the projection of the condition output vector onto the orthogonal complement of the already selected conditions. This approach correctly approximate the ideal update of adding the risk gradient itself to the model and favours the inclusion of more general and thus shorter rules. As we demonstrate using a wide range of prediction tasks, this significantly improves the comprehensibility/accuracy trade-off of the fitted ensemble. Additionally, we show how objective values for related rule conditions can be computed incrementally to avoid any substantial computational overhead of the new method.  ( 2 min )
    Universal Model in Online Customer Service
    arXiv:2402.15666v1 Announce Type: new Abstract: Building machine learning models can be a time-consuming process that often takes several months to implement in typical business scenarios. To ensure consistent model performance and account for variations in data distribution, regular retraining is necessary. This paper introduces a solution for improving online customer service in e-commerce by presenting a universal model for predict-ing labels based on customer questions, without requiring training. Our novel approach involves using machine learning techniques to tag customer questions in transcripts and create a repository of questions and corresponding labels. When a customer requests assistance, an information retrieval model searches the repository for similar questions, and statistical analysis is used to predict the corresponding label. By eliminating the need for individual model training and maintenance, our approach reduces both the model development cycle and costs. The repository only requires periodic updating to maintain accuracy.  ( 2 min )
    Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement
    arXiv:2402.15703v1 Announce Type: new Abstract: What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples. Our analysis reveals that \emph{stochastic policies can be substantially better} than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.  ( 2 min )
    Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
    arXiv:2402.15751v1 Announce Type: new Abstract: While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, only require forward passes during training, making them more memory-friendly. However, the quality of gradient estimates in zeroth order optimization often depends on the data dimensionality, potentially explaining why MeZO still exhibits significant performance drops compared to standard fine-tuning across various tasks. Inspired by the success of Parameter-Efficient Fine-Tuning (PEFT), this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet effective parameter selection scheme that yields significant performance gains with Sparse-MeZO. Additionally, we develop a memory-optimized implementation for sparse masking, ensuring the algorithm requires only inference-level memory consumption, allowing Sparse-MeZO to fine-tune LLaMA-30b on a single A100 GPU. Experimental results illustrate that Sparse-MeZO consistently improves both performance and convergence speed over MeZO without any overhead. For example, it achieves a 9\% absolute accuracy improvement and 3.5x speedup over MeZO on the RTE task.  ( 2 min )
    Scalable Density-based Clustering with Random Projections
    arXiv:2402.15679v1 Announce Type: new Abstract: We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while the scikit-learn's counterparts demand several hours or cannot run due to memory constraints.  ( 2 min )
    Truly No-Regret Learning in Constrained MDPs
    arXiv:2402.15776v1 Announce Type: new Abstract: Constrained Markov decision processes (CMDPs) are a common way to model safety constraints in reinforcement learning. State-of-the-art methods for efficiently solving CMDPs are based on primal-dual algorithms. For these algorithms, all currently known regret bounds allow for error cancellations -- one can compensate for a constraint violation in one round with a strict constraint satisfaction in another. This makes the online learning process unsafe since it only guarantees safety for the final (mixture) policy but not during learning. As Efroni et al. (2020) pointed out, it is an open question whether primal-dual algorithms can provably achieve sublinear regret if we do not allow error cancellations. In this paper, we give the first affirmative answer. We first generalize a result on last-iterate convergence of regularized primal-dual schemes to CMDPs with multiple constraints. Building upon this insight, we propose a model-based primal-dual algorithm to learn in an unknown CMDP. We prove that our algorithm achieves sublinear regret without error cancellations.  ( 2 min )
    How Can LLM Guide RL? A Value-Based Approach
    arXiv:2402.16181v1 Announce Type: new Abstract: Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm SLINVIT that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments ALFWorld, InterCode, and BlocksWorld demonstrate that our method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.  ( 2 min )
    Self Supervised Correlation-based Permutations for Multi-View Clustering
    arXiv:2402.16383v1 Announce Type: new Abstract: Fusing information from different modalities can enhance data analysis tasks, including clustering. However, existing multi-view clustering (MVC) solutions are limited to specific domains or rely on a suboptimal and computationally demanding two-stage procedure of representation and clustering. We propose an end-to-end deep learning-based MVC framework for general data (image, tabular, etc.). Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective. Concurrently, we learn cluster assignments by identifying consistent pseudo-labels across multiple views. We demonstrate the effectiveness of our model using ten MVC benchmark datasets. Theoretically, we show that our model approximates the supervised linear discrimination analysis (LDA) representation. Additionally, we provide an error bound induced by false-pseudo label annotations.  ( 2 min )
    Graph Learning under Distribution Shifts: A Comprehensive Survey on Domain Adaptation, Out-of-distribution, and Continual Learning
    arXiv:2402.16374v1 Announce Type: new Abstract: Graph learning plays a pivotal role and has gained significant attention in various application scenarios, from social network analysis to recommendation systems, for its effectiveness in modeling complex data relations represented by graph structural data. In reality, the real-world graph data typically show dynamics over time, with changing node attributes and edge structure, leading to the severe graph data distribution shift issue. This issue is compounded by the diverse and complex nature of distribution shifts, which can significantly impact the performance of graph learning methods in degraded generalization and adaptation capabilities, posing a substantial challenge to their effectiveness. In this survey, we provide a comprehensive review and summary of the latest approaches, strategies, and insights that address distribution shifts within the context of graph learning. Concretely, according to the observability of distributions in the inference stage and the availability of sufficient supervision information in the training stage, we categorize existing graph learning methods into several essential scenarios, including graph domain adaptation learning, graph out-of-distribution learning, and graph continual learning. For each scenario, a detailed taxonomy is proposed, with specific descriptions and discussions of existing progress made in distribution-shifted graph learning. Additionally, we discuss the potential applications and future directions for graph learning under distribution shifts with a systematic analysis of the current state in this field. The survey is positioned to provide general guidance for the development of effective graph learning algorithms in handling graph distribution shifts, and to stimulate future research and advancements in this area.  ( 3 min )
    On the dynamics of three-layer neural networks: initial condensation
    arXiv:2402.15958v1 Announce Type: new Abstract: Empirical and theoretical works show that the input weights of two-layer neural networks, when initialized with small values, converge towards isolated orientations. This phenomenon, referred to as condensation, indicates that the gradient descent methods tend to spontaneously reduce the complexity of neural networks during the training process. In this work, we elucidate the mechanisms behind the condensation phenomena occurring in the training of three-layer neural networks and distinguish it from the training of two-layer neural networks. Through rigorous theoretical analysis, we establish the blow-up property of effective dynamics and present a sufficient condition for the occurrence of condensation, findings that are substantiated by experimental results. Additionally, we explore the association between condensation and the low-rank bias observed in deep matrix factorization.  ( 2 min )
    A Machine Learning Approach to Detect Customer Satisfaction From Multiple Tweet Parameters
    arXiv:2402.15992v1 Announce Type: new Abstract: Since internet technologies have advanced, one of the primary factors in company development is customer happiness. Online platforms have become prominent places for sharing reviews. Twitter is one of these platforms where customers frequently post their thoughts. Reviews of flights on these platforms have become a concern for the airline business. A positive review can help the company grow, while a negative one can quickly ruin its revenue and reputation. So it's vital for airline businesses to examine the feedback and experiences of their customers and enhance their services to remain competitive. But studying thousands of tweets and analyzing them to find the satisfaction of the customer is quite a difficult task. This tedious process can be made easier by using a machine learning approach to analyze tweets to determine client satisfaction levels. Some work has already been done on this strategy to automate the procedure using machine learning and deep learning techniques. However, they are all purely concerned with assessing the text's sentiment. In addition to the text, the tweet also includes the time, location, username, airline name, and so on. This additional information can be crucial for improving the model's outcome. To provide a machine learning based solution, this work has broadened its perspective to include these qualities. And it has come as no surprise that the additional features beyond text sentiment analysis produce better outcomes in machine learning based models.  ( 3 min )
    GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model
    arXiv:2402.15516v1 Announce Type: cross Abstract: Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.  ( 2 min )
    HKD-SHO: A hybrid smart home system based on knowledge-based and data-driven services
    arXiv:2402.15521v1 Announce Type: cross Abstract: A smart home is realized by setting up various services. Several methods have been proposed to create smart home services, which can be divided into knowledge-based and data-driven approaches. However, knowledge-based approaches usually require manual input from the inhabitant, which can be complicated if the physical phenomena of the concerned environment states are complex, and the inhabitant does not know how to adjust related actuators to achieve the target values of the states monitored by services. Moreover, machine learning-based data-driven approaches that we are interested in are like black boxes and cannot show the inhabitant in which situations certain services proposed certain actuators' states. To solve these problems, we propose a hybrid system called HKD-SHO (Hybrid Knowledge-based and Data-driven services based Smart HOme system), where knowledge-based and machine learning-based data-driven services are profitably integrated. The principal advantage is that it inherits the explicability of knowledge-based services and the dynamism of data-driven services. We compare HKD-SHO with several systems for creating dynamic smart home services, and the results show the better performance of HKD-SHO.  ( 2 min )
    Investigating the Generalizability of Physiological Characteristics of Anxiety
    arXiv:2402.15513v1 Announce Type: cross Abstract: Recent works have demonstrated the effectiveness of machine learning (ML) techniques in detecting anxiety and stress using physiological signals, but it is unclear whether ML models are learning physiological features specific to stress. To address this ambiguity, we evaluated the generalizability of physiological features that have been shown to be correlated with anxiety and stress to high-arousal emotions. Specifically, we examine features extracted from electrocardiogram (ECG) and electrodermal (EDA) signals from the following three datasets: Anxiety Phases Dataset (APD), Wearable Stress and Affect Detection (WESAD), and the Continuously Annotated Signals of Emotion (CASE) dataset. We aim to understand whether these features are specific to anxiety or general to other high-arousal emotions through a statistical regression analysis, in addition to a within-corpus, cross-corpus, and leave-one-corpus-out cross-validation across instances of stress and arousal. We used the following classifiers: Support Vector Machines, LightGBM, Random Forest, XGBoost, and an ensemble of the aforementioned models. We found that models trained on an arousal dataset perform relatively well on a previously unseen stress dataset, and vice versa. Our experimental results suggest that the evaluated models may be identifying emotional arousal instead of stress. This work is the first cross-corpus evaluation across stress and arousal from ECG and EDA signals, contributing new findings about the generalizability of stress detection.  ( 3 min )
    Learning to Schedule Online Tasks with Bandit Feedback
    arXiv:2402.16463v1 Announce Type: new Abstract: Online task scheduling serves an integral role for task-intensive applications in cloud computing and crowdsourcing. Optimal scheduling can enhance system performance, typically measured by the reward-to-cost ratio, under some task arrival distribution. On one hand, both reward and cost are dependent on task context (e.g., evaluation metric) and remain black-box in practice. These render reward and cost hard to model thus unknown before decision making. On the other hand, task arrival behaviors remain sensitive to factors like unpredictable system fluctuation whereby a prior estimation or the conventional assumption of arrival distribution (e.g., Poisson) may fail. This implies another practical yet often neglected challenge, i.e., uncertain task arrival distribution. Towards effective scheduling under a stationary environment with various uncertainties, we propose a double-optimistic learning based Robbins-Monro (DOL-RM) algorithm. Specifically, DOL-RM integrates a learning module that incorporates optimistic estimation for reward-to-cost ratio and a decision module that utilizes the Robbins-Monro method to implicitly learn task arrival distribution while making scheduling decisions. Theoretically, DOL-RM achieves convergence gap and no regret learning with a sub-linear regret of $O(T^{3/4})$, which is the first result for online task scheduling under uncertain task arrival distribution and unknown reward and cost. Our numerical results in a synthetic experiment and a real-world application demonstrate the effectiveness of DOL-RM in achieving the best cumulative reward-to-cost ratio compared with other state-of-the-art baselines.  ( 3 min )
    On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
    arXiv:2402.16442v1 Announce Type: new Abstract: Many learning problems hinge on the fundamental problem of subset selection, i.e., identifying a subset of important and representative points. For example, selecting the most significant samples in ML training cannot only reduce training costs but also enhance model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distributed methods require at least one central machine to fit the target subset. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We show that these algorithms find high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset with 13 billion points.  ( 2 min )
    Label Learning Method Based on Tensor Projection
    arXiv:2402.16544v1 Announce Type: new Abstract: Multi-view clustering method based on anchor graph has been widely concerned due to its high efficiency and effectiveness. In order to avoid post-processing, most of the existing anchor graph-based methods learn bipartite graphs with connected components. However, such methods have high requirements on parameters, and in some cases it may not be possible to obtain bipartite graphs with clear connected components. To end this, we propose a label learning method based on tensor projection (LLMTP). Specifically, we project anchor graph into the label space through an orthogonal projection matrix to obtain cluster labels directly. Considering that the spatial structure information of multi-view data may be ignored to a certain extent when projected in different views separately, we extend the matrix projection transformation to tensor projection, so that the spatial structure information between views can be fully utilized. In addition, we introduce the tensor Schatten $p$-norm regularization to make the clustering label matrices of different views as consistent as possible. Extensive experiments have proved the effectiveness of the proposed method.  ( 2 min )
    TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis
    arXiv:2402.16412v1 Announce Type: new Abstract: The field of general time series analysis has recently begun to explore unified modeling, where a common architectural backbone can be retrained on a specific task for a specific dataset. In this work, we approach unification from a complementary vantage point: unification across tasks and domains. To this end, we explore the impact of discrete, learnt, time series data representations that enable generalist, cross-domain training. Our method, TOTEM, or TOkenized Time Series EMbeddings, proposes a simple tokenizer architecture that embeds time series data from varying domains using a discrete vectorized representation learned in a self-supervised manner. TOTEM works across multiple tasks and domains with minimal to no tuning. We study the efficacy of TOTEM with an extensive evaluation on 17 real world time series datasets across 3 tasks. We evaluate both the specialist (i.e., training a model on each domain) and generalist (i.e., training a single model on many domains) settings, and show that TOTEM matches or outperforms previous best methods on several popular benchmarks. The code can be found at: https://github.com/SaberaTalukder/TOTEM.  ( 2 min )
    Generative Pretrained Hierarchical Transformer for Time Series Forecasting
    arXiv:2402.16516v1 Announce Type: new Abstract: Recent efforts have been dedicated to enhancing time series forecasting accuracy by introducing advanced network architectures and self-supervised pretraining strategies. Nevertheless, existing approaches still exhibit two critical drawbacks. Firstly, these methods often rely on a single dataset for training, limiting the model's generalizability due to the restricted scale of the training data. Secondly, the one-step generation schema is widely followed, which necessitates a customized forecasting head and overlooks the temporal dependencies in the output series, and also leads to increased training costs under different horizon length settings. To address these issues, we propose a novel generative pretrained hierarchical transformer architecture for forecasting, named GPHT. There are two aspects of key designs in GPHT. On the one hand, we advocate for constructing a mixed dataset for pretraining our model, comprising various datasets from diverse data scenarios. This approach significantly expands the scale of training data, allowing our model to uncover commonalities in time series data and facilitating improved transfer to specific datasets. On the other hand, GPHT employs an auto-regressive forecasting approach under the channel-independent assumption, effectively modeling temporal dependencies in the output series. Importantly, no customized forecasting head is required, enabling a single model to forecast at arbitrary horizon settings. We conduct sufficient experiments on eight datasets with mainstream self-supervised pretraining models and supervised models. The results demonstrated that GPHT surpasses the baseline models across various fine-tuning and zero/few-shot learning settings in the traditional long-term forecasting task, providing support for verifying the feasibility of pretrained time series large models.  ( 3 min )
    Training Implicit Generative Models via an Invariant Statistical Loss
    arXiv:2402.16435v1 Announce Type: new Abstract: Implicit generative models have the capability to learn arbitrary complex data distributions. On the downside, training requires telling apart real data from artificially-generated ones using adversarial discriminators, leading to unstable training and mode-dropping issues. As reported by Zahee et al. (2017), even in the one-dimensional (1D) case, training a generative adversarial network (GAN) is challenging and often suboptimal. In this work, we develop a discriminator-free method for training one-dimensional (1D) generative implicit models and subsequently expand this method to accommodate multivariate cases. Our loss function is a discrepancy measure between a suitably chosen transformation of the model samples and a uniform distribution; hence, it is invariant with respect to the true distribution of the data. We first formulate our method for 1D random variables, providing an effective solution for approximate reparameterization of arbitrary complex distributions. Then, we consider the temporal setting (both univariate and multivariate), in which we model the conditional distribution of each sample given the history of the process. We demonstrate through numerical simulations that this new method yields promising results, successfully learning true distributions in a variety of scenarios and mitigating some of the well-known problems that state-of-the-art implicit methods present.  ( 3 min )
    On the Generalization Capability of Temporal Graph Learning Algorithms: Theoretical Insights and a Simpler Method
    arXiv:2402.16387v1 Announce Type: new Abstract: Temporal Graph Learning (TGL) has become a prevalent technique across diverse real-world applications, especially in domains where data can be represented as a graph and evolves over time. Although TGL has recently seen notable progress in algorithmic solutions, its theoretical foundations remain largely unexplored. This paper aims at bridging this gap by investigating the generalization ability of different TGL algorithms (e.g., GNN-based, RNN-based, and memory-based methods) under the finite-wide over-parameterized regime. We establish the connection between the generalization error of TGL algorithms and "the number of layers/steps" in the GNN-/RNN-based TGL methods and "the feature-label alignment (FLA) score", where FLA can be used as a proxy for the expressive power and explains the performance of memory-based methods. Guided by our theoretical analysis, we propose Simplified-Temporal-Graph-Network, which enjoys a small generalization error, improved overall performance, and lower model complexity. Extensive experiments on real-world datasets demonstrate the effectiveness of our method. Our theoretical findings and proposed algorithm offer essential insights into TGL from a theoretical standpoint, laying the groundwork for the designing practical TGL algorithms in future studies.  ( 2 min )
    Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
    arXiv:2402.09283v1 Announce Type: cross Abstract: Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.  ( 2 min )
    Achieving $\tilde{O}(1/\epsilon)$ Sample Complexity for Constrained Markov Decision Process
    arXiv:2402.16324v1 Announce Type: new Abstract: We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and a MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, all assumed to be unknown and need to be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for the CMDP problems. We derive a logarithmic regret bound, which translates into a $O(\frac{\kappa}{\epsilon}\cdot\log^2(1/\epsilon))$ sample complexity bound, with $\kappa$ being a problem-dependent parameter, yet independent of $\epsilon$. Our sample complexity bound improves upon the state-of-art $O(1/\epsilon^2)$ sample complexity for CMDP problems established in the previous literature, in terms of the dependency on $\epsilon$. To achieve this advance, we develop a new framework for analyzing CMDP problems. To be specific, our algorithm operates in the primal space and we resolve the primal LP for the CMDP problem at each period in an online manner, with \textit{adaptive} remaining resource capacities. The key elements of our algorithm are: i). an eliminating procedure that characterizes one optimal basis of the primal LP, and; ii) a resolving procedure that is adaptive to the remaining resources and sticks to the characterized optimal basis.  ( 2 min )
    Feedback Efficient Online Fine-Tuning of Diffusion Models
    arXiv:2402.16359v1 Announce Type: new Abstract: Diffusion models excel at modeling complex data distributions, including those of images, proteins, and small molecules. However, in many cases, our goal is to model parts of the distribution that maximize certain properties: for example, we may want to generate images with high aesthetic quality, or molecules with high bioactivity. It is natural to frame this as a reinforcement learning (RL) problem, in which the objective is to fine-tune a diffusion model to maximize a reward function that corresponds to some property. Even with access to online queries of the ground-truth reward function, efficiently discovering high-reward samples can be challenging: they might have a low probability in the initial distribution, and there might be many infeasible samples that do not even have a well-defined reward (e.g., unnatural images or physically impossible molecules). In this work, we propose a novel reinforcement learning procedure that efficiently explores on the manifold of feasible samples. We present a theoretical analysis providing a regret guarantee, as well as empirical validation across three domains: images, biological sequences, and molecules.  ( 2 min )
    Language-guided Skill Learning with Temporal Variational Inference
    arXiv:2402.16354v1 Announce Type: new Abstract: We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.  ( 2 min )
    Graph Learning with Distributional Edge Layouts
    arXiv:2402.16402v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) learn from graph-structured data by passing local messages between neighboring nodes along edges on certain topological layouts. Typically, these topological layouts in modern GNNs are deterministically computed (e.g., attention-based GNNs) or locally sampled (e.g., GraphSage) under heuristic assumptions. In this paper, we for the first time pose that these layouts can be globally sampled via Langevin dynamics following Boltzmann distribution equipped with explicit physical energy, leading to higher feasibility in the physical world. We argue that such a collection of sampled/optimized layouts can capture the wide energy distribution and bring extra expressivity on top of WL-test, therefore easing downstream tasks. As such, we propose Distributional Edge Layouts (DELs) to serve as a complement to a variety of GNNs. DEL is a pre-processing strategy independent of subsequent GNN variants, thus being highly flexible. Experimental results demonstrate that DELs consistently and substantially improve a series of GNN baselines, achieving state-of-the-art performance on multiple datasets.  ( 2 min )
    A VAE-based Framework for Learning Multi-Level Neural Granger-Causal Connectivity
    arXiv:2402.16131v1 Announce Type: new Abstract: Granger causality has been widely used in various application domains to capture lead-lag relationships amongst the components of complex dynamical systems, and the focus in extant literature has been on a single dynamical system. In certain applications in macroeconomics and neuroscience, one has access to data from a collection of related such systems, wherein the modeling task of interest is to extract the shared common structure that is embedded across them, as well as to identify the idiosyncrasies within individual ones. This paper introduces a Variational Autoencoder (VAE) based framework that jointly learns Granger-causal relationships amongst components in a collection of related-yet-heterogeneous dynamical systems, and handles the aforementioned task in a principled way. The performance of the proposed framework is evaluated on several synthetic data settings and benchmarked against existing approaches designed for individual system learning. The method is further illustrated on a real dataset involving time series data from a neurophysiological experiment and produces interpretable results.  ( 2 min )
    C-GAIL: Stabilizing Generative Adversarial Imitation Learning with Control Theory
    arXiv:2402.16349v1 Announce Type: new Abstract: Generative Adversarial Imitation Learning (GAIL) trains a generative policy to mimic a demonstrator. It uses on-policy Reinforcement Learning (RL) to optimize a reward signal derived from a GAN-like discriminator. A major drawback of GAIL is its training instability - it inherits the complex training dynamics of GANs, and the distribution shift introduced by RL. This can cause oscillations during training, harming its sample efficiency and final policy performance. Recent work has shown that control theory can help with the convergence of a GAN's training. This paper extends this line of work, conducting a control-theoretic analysis of GAIL and deriving a novel controller that not only pushes GAIL to the desired equilibrium but also achieves asymptotic stability in a 'one-step' setting. Based on this, we propose a practical algorithm 'Controlled-GAIL' (C-GAIL). On MuJoCo tasks, our controlled variant is able to speed up the rate of convergence, reduce the range of oscillation and match the expert's distribution more closely both for vanilla GAIL and GAIL-DAC.  ( 2 min )
    Consensus learning: A novel decentralised ensemble learning paradigm
    arXiv:2402.16157v1 Announce Type: new Abstract: The widespread adoption of large-scale machine learning models in recent years highlights the need for distributed computing for efficiency and scalability. This work introduces a novel distributed machine learning paradigm -- \emph{consensus learning} -- which combines classical ensemble methods with consensus protocols deployed in peer-to-peer systems. These algorithms consist of two phases: first, participants develop their models and submit predictions for any new data inputs; second, the individual predictions are used as inputs for a communication phase, which is governed by a consensus protocol. Consensus learning ensures user data privacy, while also inheriting the safety measures against Byzantine attacks from the underlying consensus mechanism. We provide a detailed theoretical analysis for a particular consensus protocol and compare the performance of the consensus learning ensemble with centralised ensemble learning algorithms. The discussion is supplemented by various numerical simulations, which describe the robustness of the algorithms against Byzantine participants.  ( 2 min )
    An Integrated Data Processing Framework for Pretraining Foundation Models
    arXiv:2402.16358v1 Announce Type: new Abstract: The ability of the foundation models heavily relies on large-scale, diverse, and high-quality pretraining data. In order to improve data quality, researchers and practitioners often have to manually curate datasets from difference sources and develop dedicated data cleansing pipeline for each data repository. Lacking a unified data processing framework, this process is repetitive and cumbersome. To mitigate this issue, we propose a data processing framework that integrates a Processing Module which consists of a series of operators at different granularity levels, and an Analyzing Module which supports probing and evaluation of the refined data. The proposed framework is easy to use and highly flexible. In this demo paper, we first introduce how to use this framework with some example use cases and then demonstrate its effectiveness in improving the data quality with an automated evaluation with ChatGPT and an end-to-end evaluation in pretraining the GPT-2 model. The code and demonstration videos are accessible on GitHub.  ( 2 min )
    Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users
    arXiv:2402.16312v1 Announce Type: new Abstract: We study the problem of federated contextual combinatorial cascading bandits, where $|\mathcal{U}|$ agents collaborate under the coordination of a central server to provide tailored recommendations to the $|\mathcal{U}|$ corresponding users. Existing works consider either a synchronous framework, necessitating full agent participation and global synchronization, or assume user homogeneity with identical behaviors. We overcome these limitations by considering (1) federated agents operating in an asynchronous communication paradigm, where no mandatory synchronization is required and all agents communicate independently with the server, (2) heterogeneous user behaviors, where users can be stratified into $J \le |\mathcal{U}|$ latent user clusters, each exhibiting distinct preferences. For this setting, we propose a UCB-type algorithm with delicate communication protocols. Through theoretical analysis, we give sub-linear regret bounds on par with those achieved in the synchronous framework, while incurring only logarithmic communication costs. Empirical evaluation on synthetic and real-world datasets validates our algorithm's superior performance in terms of regrets and communication costs.  ( 2 min )
    GARNN: An Interpretable Graph Attentive Recurrent Neural Network for Predicting Blood Glucose Levels via Multivariate Time Series
    arXiv:2402.16230v1 Announce Type: new Abstract: Accurate prediction of future blood glucose (BG) levels can effectively improve BG management for people living with diabetes, thereby reducing complications and improving quality of life. The state of the art of BG prediction has been achieved by leveraging advanced deep learning methods to model multi-modal data, i.e., sensor data and self-reported event data, organised as multi-variate time series (MTS). However, these methods are mostly regarded as ``black boxes'' and not entirely trusted by clinicians and patients. In this paper, we propose interpretable graph attentive recurrent neural networks (GARNNs) to model MTS, explaining variable contributions via summarizing variable importance and generating feature maps by graph attention mechanisms instead of post-hoc analysis. We evaluate GARNNs on four datasets, representing diverse clinical scenarios. Upon comparison with twelve well-established baseline methods, GARNNs not only achieve the best prediction accuracy but also provide high-quality temporal interpretability, in particular for postprandial glucose levels as a result of corresponding meal intake and insulin injection. These findings underline the potential of GARNN as a robust tool for improving diabetes care, bridging the gap between deep learning technology and real-world healthcare solutions.  ( 2 min )
    Boosting Graph Pooling with Persistent Homology
    arXiv:2402.16346v1 Announce Type: new Abstract: Recently, there has been an emerging trend to integrate persistent homology (PH) into graph neural networks (GNNs) to enrich expressive power. However, naively plugging PH features into GNN layers always results in marginal improvement with low interpretability. In this paper, we investigate a novel mechanism for injecting global topological invariance into pooling layers using PH, motivated by the observation that filtration operation in PH naturally aligns graph pooling in a cut-off manner. In this fashion, message passing in the coarsened graph acts along persistent pooled topology, leading to improved performance. Experimentally, we apply our mechanism to a collection of graph pooling methods and observe consistent and substantial performance gain over several popular datasets, demonstrating its wide applicability and flexibility.  ( 2 min )
    Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
    arXiv:2402.16305v1 Announce Type: new Abstract: As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench.  ( 2 min )
    REPLAY: Modeling Time-Varying Temporal Regularities of Human Mobility for Location Prediction over Sparse Trajectories
    arXiv:2402.16310v1 Announce Type: new Abstract: Location prediction forecasts a user's location based on historical user mobility traces. To tackle the intrinsic sparsity issue of real-world user mobility traces, spatiotemporal contexts have been shown as significantly useful. Existing solutions mostly incorporate spatiotemporal distances between locations in mobility traces, either by feeding them as additional inputs to Recurrent Neural Networks (RNNs) or by using them to search for informative past hidden states for prediction. However, such distance-based methods fail to capture the time-varying temporal regularities of human mobility, where human mobility is often more regular in the morning than in other periods, for example; this suggests the usefulness of the actual timestamps besides the temporal distances. Against this background, we propose REPLAY, a general RNN architecture learning to capture the time-varying temporal regularities for location prediction. Specifically, REPLAY not only resorts to the spatiotemporal distances in sparse trajectories to search for the informative past hidden states, but also accommodates the time-varying temporal regularities by incorporating smoothed timestamp embeddings using Gaussian weighted averaging with timestamp-specific learnable bandwidths, which can flexibly adapt to the temporal regularities of different strengths across different timestamps. Our extensive evaluation compares REPLAY against a sizable collection of state-of-the-art techniques on two real-world datasets. Results show that REPLAY consistently and significantly outperforms state-of-the-art methods by 7.7\%-10.9\% in the location prediction task, and the bandwidths reveal interesting patterns of the time-varying temporal regularities.  ( 3 min )
    Poisson-Gamma Dynamical Systems with Non-Stationary Transition Dynamics
    arXiv:2402.16297v1 Announce Type: new Abstract: Bayesian methodologies for handling count-valued time series have gained prominence due to their ability to infer interpretable latent structures and to estimate uncertainties, and thus are especially suitable for dealing with noisy and incomplete count data. Among these Bayesian models, Poisson-Gamma Dynamical Systems (PGDSs) are proven to be effective in capturing the evolving dynamics underlying observed count sequences. However, the state-of-the-art PGDS still falls short in capturing the time-varying transition dynamics that are commonly observed in real-world count time series. To mitigate this limitation, a non-stationary PGDS is proposed to allow the underlying transition matrices to evolve over time, and the evolving transition matrices are modeled by sophisticatedly-designed Dirichlet Markov chains. Leveraging Dirichlet-Multinomial-Beta data augmentation techniques, a fully-conjugate and efficient Gibbs sampler is developed to perform posterior simulation. Experiments show that, in comparison with related models, the proposed non-stationary PGDS achieves improved predictive performance due to its capacity to learn non-stationary dependency structure captured by the time-evolving transition matrices.  ( 2 min )
    Learning Translations: Emergent Communication Pretraining for Cooperative Language Acquisition
    arXiv:2402.16247v1 Announce Type: new Abstract: In Emergent Communication (EC) agents learn to communicate with one another, but the protocols that they develop are specialised to their training community. This observation led to research into Zero-Shot Coordination (ZSC) for learning communication strategies that are robust to agents not encountered during training. However, ZSC typically assumes that no prior data is available about the agents that will be encountered in the zero-shot setting. In many cases, this presents an unnecessarily hard problem and rules out communication via preestablished conventions. We propose a novel AI challenge called a Cooperative Language Acquisition Problem (CLAP) in which the ZSC assumptions are relaxed by allowing a 'joiner' agent to learn from a dataset of interactions between agents in a target community. We propose and compare two methods for solving CLAPs: Imitation Learning (IL), and Emergent Communication pretraining and Translation Learning (ECTL), in which an agent is trained in self-play with EC and then learns from the data to translate between the emergent protocol and the target community's protocol.  ( 2 min )
    Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models
    arXiv:2402.16255v1 Announce Type: new Abstract: Federated learning encounters substantial challenges with heterogeneous data, leading to performance degradation and convergence issues. While considerable progress has been achieved in mitigating such an impact, the reliability aspect of federated models has been largely disregarded. In this study, we conduct extensive experiments to investigate the reliability of both generic and personalized federated models. Our exploration uncovers a significant finding: \textbf{federated models exhibit unreliability when faced with heterogeneous data}, demonstrating poor calibration on in-distribution test data and low uncertainty levels on out-of-distribution data. This unreliability is primarily attributed to the presence of biased projection heads, which introduce miscalibration into the federated models. Inspired by this observation, we propose the "Assembled Projection Heads" (APH) method for enhancing the reliability of federated models. By treating the existing projection head parameters as priors, APH randomly samples multiple initialized parameters of projection heads from the prior and further performs targeted fine-tuning on locally available data under varying learning rates. Such a head ensemble introduces parameter diversity into the deterministic model, eliminating the bias and producing reliable predictions via head averaging. We evaluate the effectiveness of the proposed APH method across three prominent federated benchmarks. Experimental results validate the efficacy of APH in model calibration and uncertainty estimation. Notably, APH can be seamlessly integrated into various federated approaches but only requires less than 30\% additional computation cost for 100$\times$ inferences within large models.  ( 3 min )
    Conformalized Selective Regression
    arXiv:2402.16300v1 Announce Type: new Abstract: Should prediction models always deliver a prediction? In the pursuit of maximum predictive performance, critical considerations of reliability and fairness are often overshadowed, particularly when it comes to the role of uncertainty. Selective regression, also known as the "reject option," allows models to abstain from predictions in cases of considerable uncertainty. Initially proposed seven decades ago, approaches to selective regression have mostly focused on distribution-based proxies for measuring uncertainty, particularly conditional variance. However, this focus neglects the significant influence of model-specific biases on a model's performance. In this paper, we propose a novel approach to selective regression by leveraging conformal prediction, which provides grounded confidence measures for individual predictions based on model-specific biases. In addition, we propose a standardized evaluation framework to allow proper comparison of selective regression approaches. Via an extensive experimental approach, we demonstrate how our proposed approach, conformalized selective regression, demonstrates an advantage over multiple state-of-the-art baselines.  ( 2 min )
    Graph Diffusion Policy Optimization
    arXiv:2402.16302v1 Announce Type: new Abstract: Recent research has made significant progress in optimizing diffusion models for specific downstream objectives, which is an important pursuit in fields such as graph generation for drug design. However, directly applying these models to graph diffusion presents challenges, resulting in suboptimal performance. This paper introduces graph diffusion policy optimization (GDPO), a novel approach to optimize graph diffusion models for arbitrary (e.g., non-differentiable) objectives using reinforcement learning. GDPO is based on an eager policy gradient tailored for graph diffusion models, developed through meticulous analysis and promising improved performance. Experimental results show that GDPO achieves state-of-the-art performance in various graph generation tasks with complex and diverse objectives. Code is available at https://github.com/sail-sg/GDPO.  ( 2 min )
    Foundation Model Transparency Reports
    arXiv:2402.16268v1 Announce Type: new Abstract: Foundation models are critical digital technologies with sweeping societal impact that necessitates transparency. To codify how foundation model developers should provide transparency about the development and deployment of their models, we propose Foundation Model Transparency Reports, drawing upon the transparency reporting practices in social media. While external documentation of societal harms prompted social media transparency reports, our objective is to institutionalize transparency reporting for foundation models while the industry is still nascent. To design our reports, we identify 6 design principles given the successes and shortcomings of social media transparency reporting. To further schematize our reports, we draw upon the 100 transparency indicators from the Foundation Model Transparency Index. Given these indicators, we measure the extent to which they overlap with the transparency requirements included in six prominent government policies (e.g., the EU AI Act, the US Executive Order on Safe, Secure, and Trustworthy AI). Well-designed transparency reports could reduce compliance costs, in part due to overlapping regulatory requirements across different jurisdictions. We encourage foundation model developers to regularly publish transparency reports, building upon recommendations from the G7 and the White House.  ( 2 min )
    Active Level Set Estimation for Continuous Search Space with Theoretical Guarantee
    arXiv:2402.16237v1 Announce Type: new Abstract: A common problem encountered in many real-world applications is level set estimation where the goal is to determine the region in the function domain where the function is above or below a given threshold. When the function is black-box and expensive to evaluate, the level sets need to be found in a minimum set of function evaluations. Existing methods often assume a discrete search space with a finite set of data points for function evaluations and estimating the level sets. When applied to a continuous search space, these methods often need to first discretize the space which leads to poor results while needing high computational time. While some methods cater for the continuous setting, they still lack a proper guarantee for theoretical convergence. To address this problem, we propose a novel algorithm that does not need any discretization and can directly work in continuous search spaces. Our method suggests points by constructing an acquisition function that is defined as a measure of confidence of the function being higher or lower than the given threshold. A theoretical analysis for the convergence of the algorithm to an accurate solution is provided. On multiple synthetic and real-world datasets, our algorithm successfully outperforms state-of-the-art methods.  ( 2 min )
    Combining Machine Learning with Computational Fluid Dynamics using OpenFOAM and SmartSim
    arXiv:2402.16196v1 Announce Type: new Abstract: Combining machine learning (ML) with computational fluid dynamics (CFD) opens many possibilities for improving simulations of technical and natural systems. However, CFD+ML algorithms require exchange of data, synchronization, and calculation on heterogeneous hardware, making their implementation for large-scale problems exceptionally challenging. We provide an effective and scalable solution to developing CFD+ML algorithms using open source software OpenFOAM and SmartSim. SmartSim provides an Orchestrator that significantly simplifies the programming of CFD+ML algorithms and a Redis database that ensures highly scalable data exchange between ML and CFD clients. We show how to leverage SmartSim to effectively couple different segments of OpenFOAM with ML, including pre/post-processing applications, solvers, function objects, and mesh motion solvers. We additionally provide an OpenFOAM sub-module with examples that can be used as starting points for real-world applications in CFD+ML.  ( 2 min )
    Bayesian Neural Network For Personalized Federated Learning Parameter Selection
    arXiv:2402.16091v1 Announce Type: new Abstract: Federated learning's poor performance in the presence of heterogeneous data remains one of the most pressing issues in the field. Personalized federated learning departs from the conventional paradigm in which all clients employ the same model, instead striving to discover an individualized model for each client to address the heterogeneity in the data. One of such approach involves personalizing specific layers of neural networks. However, prior endeavors have not provided a dependable rationale, and some have selected personalized layers that are entirely distinct and conflicting. In this work, we take a step further by proposing personalization at the elemental level, rather than the traditional layer-level personalization. To select personalized parameters, we introduce Bayesian neural networks and rely on the uncertainty they offer to guide our selection of personalized parameters. Finally, we validate our algorithm's efficacy on several real-world datasets, demonstrating that our proposed approach outperforms existing baselines.  ( 2 min )
    Deep Contrastive Graph Learning with Clustering-Oriented Guidance
    arXiv:2402.16012v1 Announce Type: new Abstract: Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering. To handle the general clustering scenario without a prior graph, these models estimate an initial graph beforehand to apply GCN. Throughout the literature, we have witnessed that 1) most models focus on the initial graph while neglecting the original features. Therefore, the discriminability of the learned representation may be corrupted by a low-quality initial graph; 2) the training procedure lacks effective clustering guidance, which may lead to the incorporation of clustering-irrelevant information into the learned graph. To tackle these problems, the Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering. Specifically, we establish a pseudo-siamese network, which incorporates auto-encoder with GCN to emphasize both the graph structure and the original features. On this basis, feature-level contrastive learning is introduced to enhance the discriminative capacity, and the relationship between samples and centroids is employed as the clustering-oriented guidance. Afterward, a two-branch graph learning mechanism is designed to extract the local and global structural relationships, which are further embedded into a unified graph under the cluster-level contrastive guidance. Experimental results on several benchmark datasets demonstrate the superiority of DCGL against state-of-the-art algorithms.  ( 2 min )
    DeepForge: Leveraging AI for Microstructural Control in Metal Forming via Model Predictive Control
    arXiv:2402.16119v1 Announce Type: new Abstract: This study presents a novel method for microstructure control in closed die hot forging that combines Model Predictive Control (MPC) with a developed machine learning model called DeepForge. DeepForge uses an architecture that combines 1D convolutional neural networks and gated recurrent units. It uses surface temperature measurements of a workpiece as input to predict microstructure changes during forging. The paper also details DeepForge's architecture and the finite element simulation model used to generate the data set, using a three-stroke forging process. The results demonstrate DeepForge's ability to predict microstructure with a mean absolute error of 0.4$\pm$0.3%. In addition, the study explores the use of MPC to adjust inter-stroke wait times, effectively counteracting temperature disturbances to achieve a target grain size of less than 35 microns within a specific 2D region of the workpiece. These results are then verified experimentally, demonstrating a significant step towards improved control and quality in forging processes where temperature can be used as an additional degree of freedom in the process.  ( 2 min )
    Deep Neural Network Initialization with Sparsity Inducing Activations
    arXiv:2402.16184v1 Announce Type: new Abstract: Inducing and leveraging sparse activations during training and inference is a promising avenue for improving the computational efficiency of deep networks, which is increasingly important as network sizes continue to grow and their application becomes more widespread. Here we use the large width Gaussian process limit to analyze the behaviour, at random initialization, of nonlinear activations that induce sparsity in the hidden outputs. A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification; those being a shifted ReLU ($\phi(x)=\max(0, x-\tau)$ for $\tau\ge 0$) and soft thresholding ($\phi(x)=0$ for $|x|\le\tau$ and $x-\text{sign}(x)\tau$ for $|x|>\tau$). We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map. Numerical experiments verify the theory and show that the proposed magnitude clipped sparsifying activations can be trained with training and test fractional sparsity as high as 85\% while retaining close to full accuracy.  ( 2 min )
    Beyond Spatio-Temporal Representations: Evolving Fourier Transform for Temporal Graphs
    arXiv:2402.16078v1 Announce Type: new Abstract: We present the Evolving Graph Fourier Transform (EFT), the first invertible spectral transform that captures evolving representations on temporal graphs. We motivate our work by the inadequacy of existing methods for capturing the evolving graph spectra, which are also computationally expensive due to the temporal aspect along with the graph vertex domain. We view the problem as an optimization over the Laplacian of the continuous time dynamic graph. Additionally, we propose pseudo-spectrum relaxations that decompose the transformation process, making it highly computationally efficient. The EFT method adeptly captures the evolving graph's structural and positional properties, making it effective for downstream tasks on evolving graphs. Hence, as a reference implementation, we develop a simple neural model induced with EFT for capturing evolving graph spectra. We empirically validate our theoretical findings on a number of large-scale and standard temporal graph benchmarks and demonstrate that our model achieves state-of-the-art performance.  ( 2 min )
    Informed Meta-Learning
    arXiv:2402.16105v1 Announce Type: new Abstract: In noisy and low-data regimes prevalent in real-world applications, an outstanding challenge of machine learning lies in effectively incorporating inductive biases that promote data efficiency and robustness. Meta-learning and informed ML stand out as two approaches for incorporating prior knowledge into the ML pipeline. While the former relies on a purely data-driven source of priors, the latter is guided by a formal representation of expert knowledge. This paper introduces a novel hybrid paradigm, informed meta-learning, seeking complementarity in cross-task knowledge sharing of humans and machines. We establish the foundational components of informed meta-learning and present a concrete instantiation of this framework--the Informed Neural Process. Through a series of illustrative and larger-scale experiments, we demonstrate the potential benefits of informed meta-learning in improving data efficiency and robustness to observational noise, task distribution shifts, and heterogeneity.  ( 2 min )
    Learning method for S4 with Diagonal State Space Layers using Balanced Truncation
    arXiv:2402.15993v1 Announce Type: new Abstract: We introduce a novel learning method for Structured State Space Sequence (S4) models incorporating Diagonal State Space (DSS) layers, tailored for processing long-sequence data in edge intelligence applications, including sensor data analysis and real-time analytics. This method utilizes the balanced truncation technique, prevalent in control theory, applied specifically to DSS layers to reduce computational costs during inference. By leveraging parameters from the reduced model, we refine the initialization process of S4 models, outperforming the widely used Skew-HiPPo initialization in terms of performance. Numerical experiments demonstrate that our trained S4 models with DSS layers surpass conventionally trained models in accuracy and efficiency metrics. Furthermore, our observations reveal a positive correlation: higher accuracy in the original model consistently leads to increased accuracy in models trained using our method, suggesting that our approach effectively leverages the strengths of the original model.  ( 2 min )
    Feature Selection Based on Orthogonal Constraints and Polygon Area
    arXiv:2402.16026v1 Announce Type: new Abstract: The goal of feature selection is to choose the optimal subset of features for a recognition task by evaluating the importance of each feature, thereby achieving effective dimensionality reduction. Currently, proposed feature selection methods often overlook the discriminative dependencies between features and labels. To address this problem, this paper introduces a novel orthogonal regression model incorporating the area of a polygon. The model can intuitively capture the discriminative dependencies between features and labels. Additionally, this paper employs a hybrid non-monotone linear search method to efficiently tackle the non-convex optimization challenge posed by orthogonal constraints. Experimental results demonstrate that our approach not only effectively captures discriminative dependency information but also surpasses traditional methods in reducing feature dimensions and enhancing classification performance.  ( 2 min )
    Building Flexible Machine Learning Models for Scientific Computing at Scale
    arXiv:2402.16014v1 Announce Type: new Abstract: Foundation models have revolutionized knowledge acquisition across domains, and our study introduces OmniArch, a paradigm-shifting approach designed for building foundation models in multi-physics scientific computing. OmniArch's pre-training involves a versatile pipeline that processes multi-physics spatio-temporal data, casting forward problem learning into scalable auto-regressive tasks, while our novel Physics-Informed Reinforcement Learning (PIRL) technique during fine-tuning ensures alignment with physical laws. Pre-trained on the comprehensive PDEBench dataset, OmniArch not only sets new performance benchmarks for 1D, 2D and 3D PDEs but also demonstrates exceptional adaptability to new physics via few-shot and zero-shot learning approaches. The model's representations further extend to inverse problem-solving, highlighting the transformative potential of AI-enabled Scientific Computing(AI4SC) foundation models for engineering applications and physics discovery.  ( 2 min )
    Equivariant Frames and the Impossibility of Continuous Canonicalization
    arXiv:2402.16077v1 Announce Type: new Abstract: Canonicalization provides an architecture-agnostic method for enforcing equivariance, with generalizations such as frame-averaging recently gaining prominence as a lightweight and flexible alternative to equivariant architectures. Recent works have found an empirical benefit to using probabilistic frames instead, which learn weighted distributions over group elements. In this work, we provide strong theoretical justification for this phenomenon: for commonly-used groups, there is no efficiently computable choice of frame that preserves continuity of the function being averaged. In other words, unweighted frame-averaging can turn a smooth, non-symmetric function into a discontinuous, symmetric function. To address this fundamental robustness problem, we formally define and construct \emph{weighted} frames, which provably preserve continuity, and demonstrate their utility by constructing efficient and continuous weighted frames for the actions of $SO(2)$, $SO(3)$, and $S_n$ on point clouds.  ( 2 min )
    A Step-by-step Introduction to the Implementation of Automatic Differentiation
    arXiv:2402.16020v1 Announce Type: new Abstract: Automatic differentiation is a key component in deep learning. This topic is well studied and excellent surveys such as Baydin et al. (2018) have been available to clearly describe the basic concepts. Further, sophisticated implementations of automatic differentiation are now an important part of popular deep learning frameworks. However, it is difficult, if not impossible, to directly teach students the implementation of existing systems due to the complexity. On the other hand, if the teaching stops at the basic concept, students fail to sense the realization of an implementation. For example, we often mention the computational graph in teaching automatic differentiation, but students wonder how to implement and use it. In this document, we partially fill the gap by giving a step by step introduction of implementing a simple automatic differentiation system. We streamline the mathematical concepts and the implementation. Further, we give the motivation behind each implementation detail, so the whole setting becomes very natural.  ( 2 min )
    Spectrum Extraction and Clipping for Implicitly Linear Layers
    arXiv:2402.16017v1 Announce Type: new Abstract: We show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers. We provide the first clipping method which is correct for general convolution layers, and illuminate the representational limitation that caused correctness issues in prior work. We study the effect of the batch normalization layers when concatenated with convolutional layers and show how our clipping method can be applied to their composition. By comparing the accuracy and performance of our algorithms to the state-of-the-art methods, using various experiments, we show they are more precise and efficient and lead to better generalization and adversarial robustness. We provide the code for using our methods at https://github.com/Ali-E/FastClip.  ( 2 min )
    Behavioral Refinement via Interpolant-based Policy Diffusion
    arXiv:2402.16075v1 Announce Type: new Abstract: Imitation learning empowers artificial agents to mimic behavior by learning from demonstrations. Recently, diffusion models, which have the ability to model high-dimensional and multimodal distributions, have shown impressive performance on imitation learning tasks. These models learn to shape a policy by diffusing actions (or states) from standard Gaussian noise. However, the target policy to be learned is often significantly different from Gaussian and this mismatch can result in poor performance when using a small number of diffusion steps (to improve inference speed) and under limited data. The key idea in this work is that initiating from a more informative source than Gaussian enables diffusion methods to overcome the above limitations. We contribute both theoretical results, a new method, and empirical findings that show the benefits of using an informative source policy. Our method, which we call BRIDGER, leverages the stochastic interpolants framework to bridge arbitrary policies, thus enabling a flexible approach towards imitation learning. It generalizes prior work in that standard Gaussians can still be applied, but other source policies can be used if available. In experiments on challenging benchmarks, BRIDGER outperforms state-of-the-art diffusion policies and we provide further analysis on design considerations when applying BRIDGER.  ( 2 min )
    Pretraining Strategy for Neural Potentials
    arXiv:2402.15921v1 Announce Type: new Abstract: We propose a mask pretraining method for Graph Neural Networks (GNNs) to improve their performance on fitting potential energy surfaces, particularly in water systems. GNNs are pretrained by recovering spatial information related to masked-out atoms from molecules, then transferred and finetuned on atomic forcefields. Through such pretraining, GNNs learn meaningful prior about structural and underlying physical information of molecule systems that are useful for downstream tasks. From comprehensive experiments and ablation studies, we show that the proposed method improves the accuracy and convergence speed compared to GNNs trained from scratch or using other pretraining techniques such as denoising. On the other hand, our pretraining method is suitable for both energy-centric and force-centric GNNs. This approach showcases its potential to enhance the performance and data efficiency of GNNs in fitting molecular force fields.  ( 2 min )
    Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood
    arXiv:2402.15978v1 Announce Type: new Abstract: Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to na\"ively deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam's razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.  ( 2 min )
    A unified Fourier slice method to derive ridgelet transform for a variety of depth-2 neural networks
    arXiv:2402.15984v1 Announce Type: new Abstract: To investigate neural network parameters, it is easier to study the distribution of parameters than to study the parameters in each neuron. The ridgelet transform is a pseudo-inverse operator that maps a given function $f$ to the parameter distribution $\gamma$ so that a network $\mathtt{NN}[\gamma]$ reproduces $f$, i.e. $\mathtt{NN}[\gamma]=f$. For depth-2 fully-connected networks on a Euclidean space, the ridgelet transform has been discovered up to the closed-form expression, thus we could describe how the parameters are distributed. However, for a variety of modern neural network architectures, the closed-form expression has not been known. In this paper, we explain a systematic method using Fourier expressions to derive ridgelet transforms for a variety of modern networks such as networks on finite fields $\mathbb{F}_p$, group convolutional networks on abstract Hilbert space $\mathcal{H}$, fully-connected networks on noncompact symmetric spaces $G/K$, and pooling layers, or the $d$-plane ridgelet transform.  ( 2 min )
    Structural Knowledge-Driven Meta-Learning for Task Offloading in Vehicular Networks with Integrated Communications, Sensing and Computing
    arXiv:2402.15972v1 Announce Type: new Abstract: Task offloading is a potential solution to satisfy the strict requirements of computation-intensive and latency-sensitive vehicular applications due to the limited onboard computing resources. However, the overwhelming upload traffic may lead to unacceptable uploading time. To tackle this issue, for tasks taking environmental data as input, the data perceived by roadside units (RSU) equipped with several sensors can be directly exploited for computation, resulting in a novel task offloading paradigm with integrated communications, sensing and computing (I-CSC). With this paradigm, vehicles can select to upload their sensed data to RSUs or transmit computing instructions to RSUs during the offloading. By optimizing the computation mode and network resources, in this paper, we investigate an I-CSC-based task offloading problem to reduce the cost caused by resource consumption while guaranteeing the latency of each task. Although this non-convex problem can be handled by the alternating minimization (AM) algorithm that alternatively minimizes the divided four sub-problems, it leads to high computational complexity and local optimal solution. To tackle this challenge, we propose a creative structural knowledge-driven meta-learning (SKDML) method, involving both the model-based AM algorithm and neural networks. Specifically, borrowing the iterative structure of the AM algorithm, also referred to as structural knowledge, the proposed SKDML adopts long short-term memory (LSTM) network-based meta-learning to learn an adaptive optimizer for updating variables in each sub-problem, instead of the handcrafted counterpart in the AM algorithm.  ( 3 min )
    GreenLLaMA: A Framework for Detoxification with Explanations
    arXiv:2402.15951v1 Announce Type: new Abstract: Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. Additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. We propose GreenLLaMA, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. We first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging ChatGPT. We then train a suite of detoxification models with our cross-platform corpus. We show that our detoxification models outperform the SoTA model trained with human-annotated parallel corpus. We further introduce explanation to promote transparency and trustworthiness. GreenLLaMA additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. Through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of GreenLLaMA against adversarial toxicity.  ( 2 min )
    DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning
    arXiv:2402.15957v1 Announce Type: new Abstract: We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to approximate inference in environments where the latent state evolves at varying rates. We model episode sessions - parts of the episode where the latent state is fixed - and propose three key modifications to existing meta-RL methods: consistency of latent information within sessions, session masking, and prior latent conditioning. We demonstrate the importance of these modifications in various domains, ranging from discrete Gridworld environments to continuous-control and simulated robot assistive tasks, demonstrating that DynaMITE-RL significantly outperforms state-of-the-art baselines in sample efficiency and inference returns.  ( 2 min )
    Hierarchical energy signatures using machine learning for operational visibility and diagnostics in automotive manufacturing
    arXiv:2402.15962v1 Announce Type: new Abstract: Manufacturing energy consumption data contains important process signatures required for operational visibility and diagnostics. These signatures may be of different temporal scales, ranging from monthly to sub-second resolutions. We introduce a hierarchical machine learning approach to identify automotive process signatures from paint shop electricity consumption data at varying temporal scales (weekly and daily). A Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and Principal Component Analysis (PCA) combined with Logistic Regression (LR) are used for the analysis. We validate the utility of the developed algorithms with subject matter experts for (i) better operational visibility, and (ii) identifying energy saving opportunities.  ( 2 min )
    Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
    arXiv:2402.15926v1 Announce Type: new Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.  ( 2 min )
    CoDream: Exchanging dreams instead of models for federated aggregation with heterogeneous models
    arXiv:2402.15968v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative optimization of machine learning models across decentralized data by aggregating model parameters. Our approach extends this concept by aggregating "knowledge" derived from models, instead of model parameters. We present a novel framework called \codream, where clients collaboratively optimize randomly initialized data using federated optimization in the input data space, similar to how randomly initialized model parameters are optimized in FL. Our key insight is that jointly optimizing this data can effectively capture the properties of the global data distribution. Sharing knowledge in data space offers numerous benefits: (1) model-agnostic collaborative learning, i.e., different clients can have different model architectures; (2) communication that is independent of the model size, eliminating scalability concerns with model parameters; (3) compatibility with secure aggregation, thus preserving the privacy benefits of federated learning; (4) allowing of adaptive optimization of knowledge shared for personalized learning. We empirically validate \codream on standard FL tasks, demonstrating competitive performance despite not sharing model parameters. Our code: https://mitmedialab.github.io/codream.github.io/  ( 2 min )
    ESFL: Efficient Split Federated Learning over Resource-Constrained Heterogeneous Wireless Devices
    arXiv:2402.15903v1 Announce Type: new Abstract: Federated learning (FL) allows multiple parties (distributed devices) to train a machine learning model without sharing raw data. How to effectively and efficiently utilize the resources on devices and the central server is a highly interesting yet challenging problem. In this paper, we propose an efficient split federated learning algorithm (ESFL) to take full advantage of the powerful computing capabilities at a central server under a split federated learning framework with heterogeneous end devices (EDs). By splitting the model into different submodels between the server and EDs, our approach jointly optimizes user-side workload and server-side computing resource allocation by considering users' heterogeneity. We formulate the whole optimization problem as a mixed-integer non-linear program, which is an NP-hard problem, and develop an iterative approach to obtain an approximate solution efficiently. Extensive simulations have been conducted to validate the significantly increased efficiency of our ESFL approach compared with standard federated learning, split learning, and splitfed learning.  ( 2 min )
    Scalable Volt-VAR Optimization using RLlib-IMPALA Framework: A Reinforcement Learning Approach
    arXiv:2402.15932v1 Announce Type: new Abstract: In the rapidly evolving domain of electrical power systems, the Volt-VAR optimization (VVO) is increasingly critical, especially with the burgeoning integration of renewable energy sources. Traditional approaches to learning-based VVO in expansive and dynamically changing power systems are often hindered by computational complexities. To address this challenge, our research presents a novel framework that harnesses the potential of Deep Reinforcement Learning (DRL), specifically utilizing the Importance Weighted Actor-Learner Architecture (IMPALA) algorithm, executed on the RAY platform. This framework, built upon RLlib-an industry-standard in Reinforcement Learning-ingeniously capitalizes on the distributed computing capabilities and advanced hyperparameter tuning offered by RAY. This design significantly expedites the exploration and exploitation phases in the VVO solution space. Our empirical results demonstrate that our approach not only surpasses existing DRL methods in achieving superior reward outcomes but also manifests a remarkable tenfold reduction in computational requirements. The integration of our DRL agent with the RAY platform facilitates the creation of RLlib-IMPALA, a novel framework that efficiently uses RAY's resources to improve system adaptability and control. RLlib-IMPALA leverages RAY's toolkit to enhance analytical capabilities and significantly speeds up training to become more than 10 times faster than other state-of-the-art DRL methods.  ( 2 min )
    Predicting Outcomes in Video Games with Long Short Term Memory Networks
    arXiv:2402.15923v1 Announce Type: new Abstract: Forecasting winners in E-sports with real-time analytics has the potential to further engage audiences watching major tournament events. However, making such real-time predictions is challenging due to unpredictable variables within the game involving diverse player strategies and decision-making. Our work attempts to enhance audience engagement within video game tournaments by introducing a real-time method of predicting wins. Our Long Short Term Memory Network (LSTMs) based approach enables efficient predictions of win-lose outcomes by only using the health indicator of each player as a time series. As a proof of concept, we evaluate our model's performance within a classic, two-player arcade game, Super Street Fighter II Turbo. We also benchmark our method against state of the art methods for time series forecasting; i.e. Transformer models found in large language models (LLMs). Finally, we open-source our data set and code in hopes of furthering work in predictive analysis for arcade games.  ( 2 min )
    Field-based Molecule Generation
    arXiv:2402.15864v1 Announce Type: new Abstract: This work introduces FMG, a field-based model for drug-like molecule generation. We show how the flexibility of this method provides crucial advantages over the prevalent, point-cloud based methods, and achieves competitive molecular stability generation. We tackle optical isomerism (enantiomers), a previously omitted molecular property that is crucial for drug safety and effectiveness, and thus account for all molecular geometry aspects. We demonstrate how previous methods are invariant to a group of transformations that includes enantiomer pairs, leading them invariant to the molecular R and S configurations, while our field-based generative model captures this property.  ( 2 min )
    Information-based Transductive Active Learning
    arXiv:2402.15898v1 Announce Type: new Abstract: We generalize active learning to address real-world settings where sampling is restricted to an accessible region of the domain, while prediction targets may lie outside this region. To this end, we propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize the information gained about specified prediction targets. We show, under general regularity assumptions, that ITL converges uniformly to the smallest possible uncertainty obtainable from the accessible data. We demonstrate ITL in two key applications: Few-shot fine-tuning of large neural networks and safe Bayesian optimization, and in both cases, ITL significantly outperforms the state-of-the-art.  ( 2 min )
    Fusion Encoder Networks
    arXiv:2402.15883v1 Announce Type: new Abstract: In this paper we present fusion encoder networks (FENs): a class of algorithms for creating neural networks that map fixed-length sequences to outputs. The resulting neural network has only logarithmic depth (alleviating the degradation of data as it propagates through the network) and can process sequences in linear time (or in logarithmic time with a linear number of processors). The crucial property of FENs is that they learn by training a quasi-linear number of constant-depth neural networks in parallel. The fact that these networks are constant depth means that backpropagation works well. We note that currently the performance of FENs is only conjectured as we are yet to implement them.  ( 2 min )
    Reward Design for Justifiable Sequential Decision-Making
    arXiv:2402.15826v1 Announce Type: new Abstract: Equipping agents with the capacity to justify made decisions using supporting evidence represents a cornerstone of accountable decision-making. Furthermore, ensuring that justifications are in line with human expectations and societal norms is vital, especially in high-stakes situations such as healthcare. In this work, we propose the use of a debate-based reward model for reinforcement learning agents, where the outcome of a zero-sum debate game quantifies the justifiability of a decision in a particular state. This reward model is then used to train a justifiable policy, whose decisions can be more easily corroborated with supporting evidence. In the debate game, two argumentative agents take turns providing supporting evidence for two competing decisions. Given the proposed evidence, a proxy of a human judge evaluates which decision is better justified. We demonstrate the potential of our approach in learning policies for prescribing and justifying treatment decisions of septic patients. We show that augmenting the reward with the feedback signal generated by the debate-based reward model yields policies highly favored by the judge when compared to the policy obtained solely from the environment rewards, while hardly sacrificing any performance. Moreover, in terms of the overall performance and justifiability of trained policies, the debate-based feedback is comparable to the feedback obtained from an ideal judge proxy that evaluates decisions using the full information encoded in the state. This suggests that the debate game outputs key information contained in states that is most relevant for evaluating decisions, which in turn substantiates the practicality of combining our approach with human-in-the-loop evaluations. Lastly, we showcase that agents trained via multi-agent debate learn to propose evidence that is resilient to refutations and closely aligns with human preferences.  ( 3 min )
    The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning
    arXiv:2302.04658v3 Announce Type: replace-cross Abstract: Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.  ( 2 min )
    CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence
    arXiv:2301.05872v2 Announce Type: replace-cross Abstract: In this paper, we consider solving the distributed optimization problem over a multi-agent network under the communication restricted setting. We study a compressed decentralized stochastic gradient method, termed ``compressed exact diffusion with adaptive stepsizes (CEDAS)", and show the method asymptotically achieves comparable convergence rate as centralized { stochastic gradient descent (SGD)} for both smooth strongly convex objective functions and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys so far the shortest transient time (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(n{C^3}/(1-\lambda_2)^{2})$ under smooth strongly convex objective functions, and $\mathcal{O}(n^3{C^6}/(1-\lambda_2)^4)$ under smooth nonconvex objective functions, where $(1-\lambda_2)$ denotes the spectral gap of the mixing matrix, and $C>0$ is the compression-related parameter. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.  ( 2 min )
    Fourier series weight in quantum machine learning
    arXiv:2302.00105v2 Announce Type: replace-cross Abstract: In this work, we aim to confirm the impact of the Fourier series on the quantum machine learning model. We will propose models, tests, and demonstrations to achieve this objective. We designed a quantum machine learning leveraged on the Hamiltonian encoding. With a subtle change, we performed the trigonometric interpolation, binary and multiclass classifier, and a quantum signal processing application. We also proposed a block diagram of determining approximately the Fourier coefficient based on quantum machine learning. We performed and tested all the proposed models using the Pennylane framework.  ( 2 min )
    Non-invasive Liver Fibrosis Screening on CT Images using Radiomics
    arXiv:2211.14396v2 Announce Type: replace-cross Abstract: Objectives: To develop and evaluate a radiomics machine learning model for detecting liver fibrosis on CT of the liver. Methods: For this retrospective, single-centre study, radiomic features were extracted from Regions of Interest (ROIs) on CT images of patients who underwent simultaneous liver biopsy and CT examinations. Combinations of contrast, normalization, machine learning model, and feature selection method were determined based on their mean test Area Under the Receiver Operating Characteristic curve (AUC) on randomly placed ROIs. The combination and selected features with the highest AUC were used to develop a final liver fibrosis screening model. Results: The study included 101 male and 68 female patients (mean age = 51.2 years $\pm$ 14.7 [SD]). When averaging the AUC across all combinations, non-contrast enhanced (NC) CT (AUC, 0.6100; 95% CI: 0.5897, 0.6303) outperformed contrast-enhanced CT (AUC, 0.5680; 95% CI: 0.5471, 0.5890). The combination of hyperparameters and features that yielded the highest AUC was a logistic regression model with inputs features of maximum, energy, kurtosis, skewness, and small area high gray level emphasis extracted from non-contrast enhanced NC CT normalized using Gamma correction with $\gamma$ = 1.5 (AUC, 0.7833; 95% CI: 0.7821, 0.7845), (sensitivity, 0.9091; 95% CI: 0.9091, 0.9091). Conclusions: Radiomics-based machine learning models allow for the detection of liver fibrosis with reasonable accuracy and high sensitivity on NC CT. Thus, these models can be used to non-invasively screen for liver fibrosis, contributing to earlier detection of the disease at a potentially curable stage.  ( 3 min )
    Complex behavior from intrinsic motivation to occupy action-state path space
    arXiv:2205.10316v2 Announce Type: replace-cross Abstract: Most theories of behavior posit that agents tend to maximize some form of reward or utility. However, animals very often move with curiosity and seem to be motivated in a reward-free manner. Here we abandon the idea of reward maximization, and propose that the goal of behavior is maximizing occupancy of future paths of actions and states. According to this maximum occupancy principle, rewards are the means to occupy path space, not the goal per se; goal-directedness simply emerges as rational ways of searching for resources so that movement, understood amply, never ends. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy and state-value function, and prove convergence of our value iteration algorithm. Using discrete and continuous state tasks, including a high--dimensional controller, we show that complex behaviors such as `dancing', hide-and-seek and a basic form of altruistic behavior naturally result from the intrinsic motivation to occupy path space. All in all, we present a theory of behavior that generates both variability and goal-directedness in the absence of reward maximization.  ( 3 min )
    Multimodal Generative Models for Bankruptcy Prediction Using Textual Data
    arXiv:2211.08405v5 Announce Type: replace-cross Abstract: Textual data from financial filings, e.g., the Management's Discussion & Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies, which limits the use of MDA data in traditional bankruptcy models, as they need complete data to make predictions. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scrapping the MDA section. To solve this limitation, this research introduces the Conditional Multimodal Discriminative (CMMD) model that learns multimodal representations that embed information from accounting, market, and textual data modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions and to generate words from the missing MDA modality. With this novel methodology, it is realistic to use textual data in bankruptcy prediction models, since accounting and market data are available for all companies, unlike textual data. The empirical results of this research show that if financial regulators, or investors, were to use traditional models using MDA data, they would only be able to make predictions for 60% of the companies. Furthermore, the classification performance of our proposed methodology is superior to that of a large number of traditional classifier models, taking into account all the companies in our sample.  ( 3 min )
    Broad Recommender System: An Efficient Nonlinear Collaborative Filtering Approach
    arXiv:2204.11602v5 Announce Type: replace-cross Abstract: Recently, Deep Neural Networks (DNNs) have been widely introduced into Collaborative Filtering (CF) to produce more accurate recommendation results due to their capability of capturing the complex nonlinear relationships between items and users.However, the DNNs-based models usually suffer from high computational complexity, i.e., consuming very long training time and storing huge amount of trainable parameters. To address these problems, we propose a new broad recommender system called Broad Collaborative Filtering (BroadCF), which is an efficient nonlinear collaborative filtering approach. Instead of DNNs, Broad Learning System (BLS) is used as a mapping function to learn the complex nonlinear relationships between users and items, which can avoid the above issues while achieving very satisfactory recommendation performance. However, it is not feasible to directly feed the original rating data into BLS. To this end, we propose a user-item rating collaborative vector preprocessing procedure to generate low-dimensional user-item input data, which is able to harness quality judgments of the most similar users/items. Extensive experiments conducted on seven benchmark datasets have confirmed the effectiveness of the proposed BroadCF algorithm  ( 2 min )
    Detecting data-driven robust statistical arbitrage strategies with deep neural networks
    arXiv:2203.03179v4 Announce Type: replace-cross Abstract: We present an approach, based on deep neural networks, that allows identifying robust statistical arbitrage strategies in financial markets. Robust statistical arbitrage strategies refer to trading strategies that enable profitable trading under model ambiguity. The presented novel methodology allows to consider a large amount of underlying securities simultaneously and does not depend on the identification of cointegrated pairs of assets, hence it is applicable on high-dimensional financial markets or in markets where classical pairs trading approaches fail. Moreover, we provide a method to build an ambiguity set of admissible probability measures that can be derived from observed market data. Thus, the approach can be considered as being model-free and entirely data-driven. We showcase the applicability of our method by providing empirical investigations with highly profitable trading performances even in 50 dimensions, during financial crises, and when the cointegration relationship between asset pairs stops to persist.  ( 2 min )
    Max-Linear Regression by Convex Programming
    arXiv:2103.07020v2 Announce Type: replace-cross Abstract: We consider the multivariate max-linear regression problem where the model parameters $\boldsymbol{\beta}_{1},\dotsc,\boldsymbol{\beta}_{k}\in\mathbb{R}^{p}$ need to be estimated from $n$ independent samples of the (noisy) observations $y = \max_{1\leq j \leq k} \boldsymbol{\beta}_{j}^{\mathsf{T}} \boldsymbol{x} + \mathrm{noise}$. The max-linear model vastly generalizes the conventional linear model, and it can approximate any convex function to an arbitrary accuracy when the number of linear models $k$ is large enough. However, the inherent nonlinearity of the max-linear model renders the estimation of the regression parameters computationally challenging. Particularly, no estimator based on convex programming is known in the literature. We formulate and analyze a scalable convex program given by anchored regression (AR) as the estimator for the max-linear regression problem. Under the standard Gaussian observation setting, we present a non-asymptotic performance guarantee showing that the convex program recovers the parameters with high probability. When the $k$ linear components are equally likely to achieve the maximum, our result shows a sufficient number of noise-free observations for exact recovery scales as {$k^{4}p$} up to a logarithmic factor. { This sample complexity coincides with that by alternating minimization (Ghosh et al., {2021}). Moreover, the same sample complexity applies when the observations are corrupted with arbitrary deterministic noise. We provide empirical results that show that our method performs as our theoretical result predicts, and is competitive with the alternating minimization algorithm particularly in presence of multiplicative Bernoulli noise. Furthermore, we also show empirically that a recursive application of AR can significantly improve the estimation accuracy.}  ( 3 min )
    Kernel Two-Sample Tests for Manifold Data
    arXiv:2105.03425v4 Announce Type: replace-cross Abstract: We present a study of a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, assuming that high-dimensional observations are close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, when data densities $p$ and $q$ are supported on a $d$-dimensional sub-manifold ${M}$ embedded in an $m$-dimensional space and are H\"older with order $\beta$ (up to 2) on ${M}$, we prove a guarantee of the test power for finite sample size $n$ that exceeds a threshold depending on $d$, $\beta$, and $\Delta_2$ the squared $L^2$-divergence between $p$ and $q$ on the manifold, and with a properly chosen kernel bandwidth $\gamma$. For small density departures, we show that with large $n$ they can be detected by the kernel test when $\Delta_2$ is greater than $n^{- { 2 \beta/( d + 4 \beta ) }}$ up to a certain constant and $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary and the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test has no curse-of-dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.  ( 3 min )
    The MSR-Video to Text Dataset with Clean Annotations
    arXiv:2102.06448v4 Announce Type: replace-cross Abstract: Video captioning automatically generates short descriptions of the video content, usually in form of a single sentence. Many methods have been proposed for solving this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as the benchmark dataset for testing the performance of the methods. However, we found that the human annotations, i.e., the descriptions of video contents in the dataset are quite noisy, e.g., there are many duplicate captions and many captions contain grammatical problems. These problems may pose difficulties to video captioning models for learning underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performances of the models measured by popular quantitative metrics. We recruited subjects to evaluate the results of a model trained on the original and cleaned datasets. The human behavior experiment demonstrated that trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.  ( 3 min )
    On the representation and learning of monotone triangular transport maps
    arXiv:2009.10303v3 Announce Type: replace-cross Abstract: Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps$\unicode{x2014}$approximations of the Knothe$\unicode{x2013}$Rosenblatt (KR) rearrangement$\unicode{x2014}$are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.  ( 3 min )
    Sustainable Recreational Fishing Using a Novel Electrical Muscle Stimulation (EMS) Lure and Ensemble Network Algorithm to Maximize Catch and Release Survivability
    arXiv:2006.10125v3 Announce Type: replace-cross Abstract: With 200-700 million anglers in the world, sportfishing is nearly five times more common than commercial trawling. Worldwide, hundreds of thousands of jobs are linked to the sportfishing industry, which generates billions of dollars for water-side communities and fisheries conservatories alike. However, the sheer popularity of recreational fishing poses threats to aquatic biodiversity that are hard to regulate. For example, as much as 25% of overfished populations can be traced to anglers. This alarming statistic is explained by the average catch and release mortality rate of 43%, which primarily results from hook-related injuries and careless out-of-water handling. The provisional-patented design proposed in this paper addresses both these problems separately First, a novel, electrical muscle stimulation based fishing lure is proposed as a harmless and low cost alternative to sharp hooks. Early prototypes show a constant electrical current of 90 mA applied through a 200g European perch's jaw can support a reeling tension of 2N - safely within the necessary ranges. Second, a fish-eye camera bob is designed to wirelessly relay underwater footage to a smartphone app, where an ensemble convolutional neural network automatically classifies the fish's species, estimates its length, and cross references with local and state fishing regulations (ie. minimum size, maximum bag limit, and catch season). This capability reduces overfishing by helping anglers avoid accidentally violating guidelines and eliminates the need to reel the fish in and expose it to negligent handling. IN conjunction, this cheap, lightweight, yet high-tech invention is a paradigm shift in preserving a world favorite pastime; while at the same time making recreational fishing more sustainable.  ( 3 min )
    Sparse Orthogonal Variational Inference for Gaussian Processes
    arXiv:1910.10596v5 Announce Type: replace-cross Abstract: We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.  ( 2 min )
    Multi-scale Traffic Pattern Bank for Cross-city Few-shot Traffic Forecasting
    arXiv:2402.00397v2 Announce Type: replace Abstract: Traffic forecasting is crucial for intelligent transportation systems (ITS), aiding in efficient resource allocation and effective traffic control. However, its effectiveness often relies heavily on abundant traffic data, while many cities lack sufficient data due to limited device support, posing a significant challenge for traffic forecasting. Recognizing this challenge, we have made a noteworthy observation: traffic patterns exhibit similarities across diverse cities. Building on this key insight, we propose a solution for the cross-city few-shot traffic forecasting problem called Multi-scale Traffic Pattern Bank (MTPB). Primarily, MTPB initiates its learning process by leveraging data-rich source cities, effectively acquiring comprehensive traffic knowledge through a spatial-temporal-aware pre-training process. Subsequently, the framework employs advanced clustering techniques to systematically generate a multi-scale traffic pattern bank derived from the learned knowledge. Next, the traffic data of the data-scarce target city could query the traffic pattern bank, facilitating the aggregation of meta-knowledge. This meta-knowledge, in turn, assumes a pivotal role as a robust guide in subsequent processes involving graph reconstruction and forecasting. Empirical assessments conducted on real-world traffic datasets affirm the superior performance of MTPB, surpassing existing methods across various categories and exhibiting numerous attributes conducive to the advancement of cross-city few-shot forecasting methodologies. The code is available in https://github.com/zhyliu00/MTPB.  ( 3 min )
    Communication-Efficient Federated Learning through Adaptive Weight Clustering and Server-Side Distillation
    arXiv:2401.14211v3 Announce Type: replace Abstract: Federated Learning (FL) is a promising technique for the collaborative training of deep neural networks across multiple devices while preserving data privacy. Despite its potential benefits, FL is hindered by excessive communication costs due to repeated server-client communication during training. To address this challenge, model compression techniques, such as sparsification and weight clustering are applied, which often require modifying the underlying model aggregation schemes or involve cumbersome hyperparameter tuning, with the latter not only adjusts the model's compression rate but also limits model's potential for continuous improvement over growing data. In this paper, we propose FedCompress, a novel approach that combines dynamic weight clustering and server-side knowledge distillation to reduce communication costs while learning highly generalizable models. Through a comprehensive evaluation on diverse public datasets, we demonstrate the efficacy of our approach compared to baselines in terms of communication costs and inference speed.  ( 2 min )
    Next Visit Diagnosis Prediction via Medical Code-Centric Multimodal Contrastive EHR Modelling with Hierarchical Regularisation
    arXiv:2401.11648v4 Announce Type: replace Abstract: Predicting next visit diagnosis using Electronic Health Records (EHR) is an essential task in healthcare, critical for devising proactive future plans for both healthcare providers and patients. Nonetheless, many preceding studies have not sufficiently addressed the heterogeneous and hierarchical characteristics inherent in EHR data, inevitably leading to sub-optimal performance. To this end, we propose NECHO, a novel medical code-centric multimodal contrastive EHR learning framework with hierarchical regularisation. First, we integrate multifaceted information encompassing medical codes, demographics, and clinical notes using a tailored network design and a pair of bimodal contrastive losses, all of which pivot around a medical codes representation. We also regularise modality-specific encoders using a parental level information in medical ontology to learn hierarchical structure of EHR data. A series of experiments on MIMIC-III data demonstrates effectiveness of our approach.  ( 2 min )
    Freely Long-Thinking Transformer (FraiLT)
    arXiv:2401.11626v2 Announce Type: replace Abstract: Freely Long-Thinking Transformer (FraiLT) is an improved transformer model designed to enhance processing capabilities without scaling up size. It utilizes a recursive approach, iterating over a subset of layers multiple times, and introduces iteration encodings to maintain awareness across these cycles. Iteration encoding allows FraiLT to achieve the interpretive depth of larger models in a compact form. When evaluated on a synthetic story dataset, FraiLT outperformed larger models, showcasing its ability to deliver high-quality performance while reducing memory demands. This model represents a step forward towards more efficient and accessible language models.  ( 2 min )
    GD-CAF: Graph Dual-stream Convolutional Attention Fusion for Precipitation Nowcasting
    arXiv:2401.07958v2 Announce Type: replace Abstract: Accurate precipitation nowcasting is essential for various applications, including flood prediction, disaster management, optimizing agricultural activities, managing transportation routes and renewable energy. While several studies have addressed this challenging task from a sequence-to-sequence perspective, most of them have focused on a single area without considering the existing correlation between multiple disjoint regions. In this paper, we formulate precipitation nowcasting as a spatiotemporal graph sequence nowcasting problem. In particular, we introduce Graph Dual-stream Convolutional Attention Fusion (GD-CAF), a novel approach designed to learn from historical spatiotemporal graph of precipitation maps and nowcast future time step ahead precipitation at different spatial locations. GD-CAF consists of spatio-temporal convolutional attention as well as gated fusion modules which are equipped with depthwise-separable convolutional operations. This enhancement enables the model to directly process the high-dimensional spatiotemporal graph of precipitation maps and exploits higher-order correlations between the data dimensions. We evaluate our model on seven years of precipitation maps across Europe and its neighboring areas collected from the ERA5 dataset, provided by Copernicus Climate Change Services. The experimental results reveal the superior performance of the GD-CAF model compared to the other examined models. Additionally, visualizations of averaged seasonal spatial and temporal attention scores across the test set offer valuable insights into the most robust connections between diverse regions or time steps.  ( 3 min )
    A Closer Look at AUROC and AUPRC under Class Imbalance
    arXiv:2401.06091v2 Announce Type: replace Abstract: In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.  ( 3 min )
    Quantifying Marketing Performance at Channel-Partner Level by Using Marketing Mix Modeling (MMM) and Shapley Value Regression
    arXiv:2401.05653v2 Announce Type: replace Abstract: This paper explores the application of Shapley Value Regression in dissecting marketing performance at channel-partner level, complementing channel-level Marketing Mix Modeling (MMM). Utilizing real-world data from the financial services industry, we demonstrate the practicality of Shapley Value Regression in evaluating individual partner contributions. Although structured in-field testing along with cooperative game theory is most accurate, it can often be highly complex and expensive to conduct. Shapley Value Regression is thus a more feasible approach to disentangle the influence of each marketing partner within a marketing channel. We also propose a simple method to derive adjusted coefficients of Shapley Value Regression and compares it with alternative approaches.  ( 2 min )
    Learning to Embed Time Series Patches Independently
    arXiv:2312.16427v2 Announce Type: replace Abstract: Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series. Inspired by masked image modeling in computer vision, recent works first patchify and partially mask out time series, and then train Transformers to capture the dependencies between patches by predicting masked patches from unmasked patches. However, we argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning; rather, learning to embed patches independently results in better time series representations. Specifically, we propose to use 1) the simple patch reconstruction task, which autoencode each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently. In addition, we introduce complementary contrastive learning to hierarchically capture adjacent time series information efficiently. Our proposed method improves time series forecasting and classification performance compared to state-of-the-art Transformer-based models, while it is more efficient in terms of the number of parameters and training/inference time. Code is available at this repository: https://github.com/seunghan96/pits.  ( 2 min )
    Soft Contrastive Learning for Time Series
    arXiv:2312.16424v2 Announce Type: replace Abstract: Contrastive learning has shown to be effective to learn representations from time series in a self-supervised way. However, contrasting similar time series instances or values from adjacent timestamps within a time series leads to ignore their inherent correlations, which results in deteriorating the quality of learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) instance-wise contrastive loss by the distance between time series on the data space, and 2) temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.  ( 2 min )
    Policy Optimization in RLHF: The Impact of Out-of-preference Data
    arXiv:2312.10584v2 Announce Type: replace Abstract: Aligning intelligent agents with human preferences and values is important. This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO). A variant of RMB-PO, referred to as RMB-PO+ is also considered. These methods, either explicitly or implicitly, learn a reward model from preference data and differ in the data used for policy optimization to unlock the generalization ability of the reward model. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data. We examine the impact of such out-of-preference data. Our study, conducted through controlled and synthetic experiments, demonstrates that DPO performs poorly, whereas RMB-PO+ performs the best. In particular, even when providing the policy model with a good feature representation, we find that policy optimization with adequate out-of-preference data significantly improves performance by harnessing the reward model's generalization capabilities.  ( 2 min )
    FRAPP\'E: A Group Fairness Framework for Post-Processing Everything
    arXiv:2312.02592v2 Announce Type: replace Abstract: Despite achieving promising fairness-error trade-offs, in-processing mitigation techniques for group fairness cannot be employed in numerous practical applications with limited computation resources or no access to the training pipeline of the prediction model. In these situations, post-processing is a viable alternative. However, current methods are tailored to specific problem settings and fairness definitions and hence, are not as broadly applicable as in-processing. In this work, we propose a framework that turns any regularized in-processing method into a post-processing approach. This procedure prescribes a way to obtain post-processing techniques for a much broader range of problem settings than the prior post-processing literature. We show theoretically and through extensive experiments that our framework preserves the good fairness-error trade-offs achieved with in-processing and can improve over the effectiveness of prior post-processing methods. Finally, we demonstrate several advantages of a modular mitigation strategy that disentangles the training of the prediction model from the fairness mitigation, including better performance on tasks with partial group labels.  ( 2 min )
    How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation
    arXiv:2312.07424v3 Announce Type: replace Abstract: In machine learning, generalization against distribution shifts -- where deployment conditions diverge from the training scenarios -- is crucial, particularly in fields like climate modeling, biomedicine, and autonomous driving. The emergence of foundation models, distinguished by their extensive pretraining and task versatility, has led to an increased interest in their adaptability to distribution shifts. GPT-4V(ision) acts as the most advanced publicly accessible multimodal foundation model, with extensive applications across various domains, including anomaly detection, video understanding, image generation, and medical diagnosis. However, its robustness against data distributions remains largely underexplored. Addressing this gap, this study rigorously evaluates GPT-4V's adaptability and generalization capabilities in dynamic environments, benchmarking against prominent models like CLIP, LLaVA, and Gemini. We delve into GPT-4V's zero-shot generalization across 13 diverse datasets spanning natural, medical, and molecular domains. We further investigate its adaptability to controlled data perturbations and examine the efficacy of in-context learning as a tool to enhance its adaptation. Our findings delineate GPT-4V's capability boundaries in distribution shifts, shedding light on its strengths and limitations across various scenarios. Importantly, this investigation contributes to our understanding of how AI foundation models generalize to distribution shifts, offering pivotal insights into their adaptability and robustness. The code is publicly available at https://github.com/jameszhou-gl/gpt-4v-distribution-shift.  ( 3 min )
    ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference
    arXiv:2312.02554v2 Announce Type: replace Abstract: Aligning language models to human expectations, e.g., being helpful and harmless, has become a pressing challenge for large language models. A typical alignment procedure consists of supervised fine-tuning and preference learning. Most preference learning methods, such as RLHF and DPO, depend on pairwise preference data, which inadequately address scenarios where human feedback is point-wise, leading to potential information loss and suboptimal performance. Addressing this gap, we introduce Point-wise Direct Preference Optimization, a novel preference learning method designed to harness point-wise feedback effectively. Our work also uncovers a novel connection between supervised fine-tuning and point-wise preference learning, culminating in Unified Language Model Alignment, a single-step method that unifies the alignment with human demonstrations and point-wise preferences. Extensive experiments on point-wise preference datasets with binary or continuous labels validate the effectiveness of our methods. Our code and a new dataset with high-quality demonstration samples on harmlessness are released.  ( 2 min )
    xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data
    arXiv:2311.15156v2 Announce Type: replace Abstract: Advances in high-throughput sequencing technology have led to significant progress in measuring gene expressions at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for humans with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^\alpha$ (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design of xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today. Our experiments also show that the performance of xTrimoGene improves as we scale up the model sizes, and it also leads to SOTA performance over various downstream tasks, such as cell type annotation, perturb-seq effect prediction, and drug combination prediction. xTrimoGene model is now available for use as a service via the following link: https://api.biomap.com/xTrimoGene/apply.  ( 3 min )
    Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens
    arXiv:2311.11463v2 Announce Type: replace Abstract: After a machine learning (ML)-based system is deployed, monitoring its performance is important to ensure the safety and effectiveness of the algorithm over time. When an ML algorithm interacts with its environment, the algorithm can affect the data-generating mechanism and be a major source of bias when evaluating its standalone performance, an issue known as performativity. Although prior work has shown how to validate models in the presence of performativity using causal inference techniques, there has been little work on how to monitor models in the presence of performativity. Unlike the setting of model validation, there is much less agreement on which performance metrics to monitor. Different monitoring criteria impact how interpretable the resulting test statistic is, what assumptions are needed for identifiability, and the speed of detection. When this choice is further coupled with the decision to use observational versus interventional data, ML deployment teams are faced with a multitude of monitoring options. The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy and how causal reasoning can provide a systematic framework for choosing between these options. As a motivating example, we consider an ML-based risk prediction algorithm for predicting unplanned readmissions. Bringing together tools from causal inference and statistical process control, we consider six monitoring procedures (three candidate monitoring criteria and two data sources) and investigate their operating characteristics in simulation studies. Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal, which has real-world impacts on the design and documentation of ML monitoring systems.  ( 3 min )
    Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference
    arXiv:2311.10671v2 Announce Type: replace Abstract: We present multimodal neural posterior estimation (MultiNPE), a method to integrate heterogeneous data from different sources in simulation-based inference with neural networks. Inspired by advances in deep fusion learning, it empowers researchers to analyze data from different domains and infer the parameters of complex mathematical models with increased accuracy. We formulate multimodal fusion approaches for \hbox{MultiNPE} (early, late, hybrid) and evaluate their performance in three challenging experiments. MultiNPE not only outperforms single-source baselines on a reference task, but also achieves superior inference on scientific models from neuroscience and cardiology. We systematically investigate the impact of partially missing data on the different fusion strategies. Across our experiments, late and hybrid fusion techniques emerge as the methods of choice for practical applications of multimodal simulation-based inference.  ( 2 min )
    Learning to Learn for Few-shot Continual Active Learning
    arXiv:2311.03732v2 Announce Type: replace Abstract: Continual learning strives to ensure stability in solving previously seen tasks while demonstrating plasticity in a novel domain. Recent advances in CL are mostly confined to a supervised learning setting, especially in NLP domain. In this work, we consider a few-shot continual active learning (CAL) setting where labeled data are inadequate, and unlabeled data are abundant but with a limited annotation budget. We propose a simple but efficient method, called Meta-Continual Active Learning. Specifically, we employ meta-learning and experience replay to address inter-task confusion and catastrophic forgetting. We further incorporate textual augmentations to ensure generalization. We conduct extensive experiments on benchmark text classification datasets to validate the effectiveness of the proposed method and analyze the effect of different active learning strategies in few-shot CAL setting. Our experimental results demonstrate that random sampling is the best default strategy for active learning and memory sample selection to solve few-shot CAL problems.  ( 2 min )
    Contrastive Difference Predictive Coding
    arXiv:2310.20141v2 Announce Type: replace Abstract: Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time series data to decrease the amount of data required to learn predictions of future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves $2 \times$ median improvement in success rates and can better cope with stochastic environments. In tabular settings, we show that our method is about $20 \times$ more sample efficient than the successor representation and $1500 \times$ more sample efficient than the standard (Monte Carlo) version of contrastive predictive coding.  ( 2 min )
    Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain
    arXiv:2310.14053v3 Announce Type: replace Abstract: Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.  ( 3 min )
    Explainable Modeling for Wind Power Forecasting: A Glass-Box Approach with High Accuracy
    arXiv:2310.18629v2 Announce Type: replace Abstract: Machine learning models (e.g., neural networks) achieve high accuracy in wind power forecasting, but they are usually regarded as black boxes that lack interpretability. To address this issue, the paper proposes a glass-box approach that combines high accuracy with transparency for wind power forecasting. Specifically, the core is to sum up the feature effects by constructing shape functions, which effectively map the intricate non-linear relationships between wind power output and input features. Furthermore, the forecasting model is enriched by incorporating interaction terms that adeptly capture interdependencies and synergies among the input features. The additive nature of the proposed glass-box approach ensures its interpretability. Simulation results show that the proposed glass-box approach effectively interprets the results of wind power forecasting from both global and instance perspectives. Besides, it outperforms most benchmark models and exhibits comparable performance to the best-performing neural networks. This dual strength of transparency and high accuracy positions the proposed glass-box approach as a compelling choice for reliable wind power forecasting.  ( 2 min )
    Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model
    arXiv:2310.17653v2 Announce Type: replace Abstract: Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other -- independent of overall performance. Given any arbitrary pairing of pretrained models and no external rankings (such as separate test sets, e.g. due to data privacy), we investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation -- a task made particularly difficult as additional knowledge can be contained in stronger, equiperformant or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics - including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration on the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability and impact of fundamental model properties on successful model-agnostic knowledge transfer.  ( 3 min )
    Generative Flow Networks as Entropy-Regularized RL
    arXiv:2310.12934v3 Announce Type: replace Abstract: The recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to a general case. We demonstrate how the task of learning a generative flow network can be efficiently redefined as an entropy-regularized RL problem with a specific reward and regularizer structure. Furthermore, we illustrate the practical efficiency of this reformulation by applying standard soft RL algorithms to GFlowNet training across several probabilistic modeling tasks. Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods. This perspective opens a direct path for integrating RL principles into the realm of generative flow networks.  ( 2 min )
    Hierarchical Forecasting at Scale
    arXiv:2310.12809v2 Announce Type: replace Abstract: Existing hierarchical forecasting techniques scale poorly when the number of time series increases. We propose to learn a coherent forecast for millions of time series with a single bottom-level forecast model by using a sparse loss function that directly optimizes the hierarchical product and/or temporal structure. The benefit of our sparse hierarchical loss function is that it provides practitioners a method of producing bottom-level forecasts that are coherent to any chosen cross-sectional or temporal hierarchy. In addition, removing the need for a post-processing step as required in traditional hierarchical forecasting techniques reduces the computational cost of the prediction phase in the forecasting pipeline. On the public M5 dataset, our sparse hierarchical loss function performs up to 10% (RMSE) better compared to the baseline loss function. We implement our sparse hierarchical loss function within an existing forecasting model at bol, a large European e-commerce platform, resulting in an improved forecasting performance of 2% at the product level. Finally, we found an increase in forecasting performance of about 5-10% when evaluating the forecasting performance across the cross-sectional hierarchies that we defined. These results demonstrate the usefulness of our sparse hierarchical loss applied to a production forecasting system at a major e-commerce platform.  ( 2 min )
    A Quasi-Wasserstein Loss for Learning Graph Neural Networks
    arXiv:2310.11762v3 Announce Type: replace Abstract: When learning graph neural networks (GNNs) in node-level prediction tasks, most existing loss functions are applied for each node independently, even if node embeddings and their labels are non-i.i.d. because of their graph structures. To eliminate such inconsistency, in this study we propose a novel Quasi-Wasserstein (QW) loss with the help of the optimal transport defined on graphs, leading to new learning and prediction paradigms of GNNs. In particular, we design a ``Quasi-Wasserstein'' distance between the observed multi-dimensional node labels and their estimations, optimizing the label transport defined on graph edges. The estimations are parameterized by a GNN in which the optimal label transport may determine the graph edge weights optionally. By reformulating the strict constraint of the label transport to a Bregman divergence-based regularizer, we obtain the proposed Quasi-Wasserstein loss associated with two efficient solvers learning the GNN together with optimal label transport. When predicting node labels, our model combines the output of the GNN with the residual component provided by the optimal label transport, leading to a new transductive prediction paradigm. Experiments show that the proposed QW loss applies to various GNNs and helps to improve their performance in node-level classification and regression tasks. The code of this work can be found at \url{https://github.com/SDS-Lab/QW_Loss}.  ( 2 min )
    Federated Learning with Convex Global and Local Constraints
    arXiv:2310.10117v2 Announce Type: replace Abstract: In practice, many machine learning (ML) problems come with constraints, and their applied domains involve distributed sensitive data that cannot be shared with others, e.g., in healthcare. Collaborative learning in such practical scenarios entails federated learning (FL) for ML problems with constraints, or FL with constraints for short. Despite the extensive developments of FL techniques in recent years, these techniques only deal with unconstrained FL problems. To fill this gap, we take the first step toward building a general algorithmic framework for solving FL problems with constraints. In particular, we propose a new FL algorithm for constrained ML problems based on the proximal augmented Lagrangian (AL) method. Assuming convex objective and convex constraints plus other mild conditions, we establish the worst-case complexity of the proposed algorithm. Our numerical experiments show the effectiveness of our algorithm in performing Neyman-Pearson classification and fairness-aware learning with nonconvex constraints, in an FL setting.  ( 2 min )
    Neural Diffusion Models
    arXiv:2310.08337v2 Announce Type: replace Abstract: Diffusion models have shown remarkable performance on many generative tasks. Despite recent success, most diffusion models are restricted in that they only allow linear transformation of the data distribution. In contrast, broader family of transformations can potentially help train generative distributions more efficiently, simplifying the reverse process and closing the gap between the true negative log-likelihood and the variational approximation. In this paper, we present Neural Diffusion Models (NDMs), a generalization of conventional diffusion models that enables defining and learning time-dependent non-linear transformations of data. We show how to optimise NDMs using a variational bound in a simulation-free setting. Moreover, we derive a time-continuous formulation of NDMs, which allows fast and reliable inference using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the utility of NDMs with learnable transformations through experiments on standard image generation benchmarks, including CIFAR-10, downsampled versions of ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms of likelihood and produce high-quality samples.  ( 2 min )
    Suppressing Overestimation in Q-Learning through Adversarial Behaviors
    arXiv:2310.06286v2 Announce Type: replace Abstract: The goal of this paper is to propose a new Q-learning algorithm with a dummy adversarial player, which is called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, the learning can be formulated as a two-player zero-sum game. The proposed DAQ unifies several Q-learning variations to control overestimation biases, such as maxmin Q-learning and minmax Q-learning (proposed in this paper) in a single framework. The proposed DAQ is a simple but effective way to suppress the overestimation bias thourgh dummy adversarial behaviors and can be easily applied to off-the-shelf reinforcement learning algorithms to improve the performances. A finite-time convergence of DAQ is analyzed from an integrated perspective by adapting an adversarial Q-learning. The performance of the suggested DAQ is empirically demonstrated under various benchmark environments.  ( 2 min )
    Label-free Node Classification on Graphs with Large Language Models (LLMS)
    arXiv:2310.04668v3 Announce Type: replace Abstract: In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a label-free node classification on graphs with LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes and then GNNs are trained on LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance with less cost? To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advanced node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \products with a cost less than 1 dollar.  ( 3 min )
    Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference
    arXiv:2310.04395v3 Announce Type: replace Abstract: We propose a method to improve the efficiency and accuracy of amortized Bayesian inference by leveraging universal symmetries in the joint probabilistic model of parameters and data. In a nutshell, we invert Bayes' theorem and estimate the marginal likelihood based on approximate representations of the joint model. Upon perfect approximation, the marginal likelihood is constant across all parameter values by definition. However, errors in approximate inference lead to undesirable variance in the marginal likelihood estimates across different parameter values. We penalize violations of this symmetry with a \textit{self-consistency loss} which significantly improves the quality of approximate inference in low data regimes and can be used to augment the training of popular neural density estimators. We apply our method to a number of synthetic problems and realistic scientific models, discovering notable advantages in the context of both neural posterior and likelihood approximation.  ( 2 min )
    Order-Preserving GFlowNets
    arXiv:2310.00386v2 Announce Type: replace Abstract: Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates with probabilities proportional to a given reward. However, GFlowNets can only be used with a predefined scalar reward, which can be either computationally expensive or not directly accessible, in the case of multi-objective optimization (MOO) tasks for example. Moreover, to prioritize identifying high-reward candidates, the conventional practice is to raise the reward to a higher exponent, the optimal choice of which may vary across different environments. To address these issues, we propose Order-Preserving GFlowNets (OP-GFNs), which sample with probabilities in proportion to a learned reward function that is consistent with a provided (partial) order on the candidates, thus eliminating the need for an explicit formulation of the reward function. We theoretically prove that the training process of OP-GFNs gradually sparsifies the learned reward landscape in single-objective maximization tasks. The sparsification concentrates on candidates of a higher hierarchy in the ordering, ensuring exploration at the beginning and exploitation towards the end of the training. We demonstrate OP-GFN's state-of-the-art performance in single-objective maximization (totally ordered) and multi-objective Pareto front approximation (partially ordered) tasks, including synthetic datasets, molecule generation, and neural architecture search.  ( 2 min )
    SADMoE: Exploiting Activation Sparsity with Dynamic-k Gating
    arXiv:2310.04361v2 Announce Type: replace Abstract: Transformer models, despite their impressive performance, often face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by transforming parts of the network into Mixture-of-Experts (MoE) layers. However, despite the crucial role of activation sparsity, its impact on this process remains unexplored. In this paper, we enhance the efficiency of MoE conversion through activation sparsity enforcement. Moreover, motivated by the high variance in the number of activated neurons, we propose a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. Finally, we extend this approach to multi-head attention projections, which results in even further savings. The proposed method, Sparsified Activation Dynamic-k Mixture-of-Experts (SADMoE), outperforms existing approaches on common NLP and vision tasks, allowing us to save up to 60% of inference cost without significantly affecting model performance.  ( 2 min )
    Transformer-VQ: Linear-Time Transformers via Vector Quantization
    arXiv:2309.16354v2 Announce Type: replace Abstract: We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: \url{https://github.com/transformer-vq/transformer_vq}  ( 2 min )
    Contrastive Initial State Buffer for Reinforcement Learning
    arXiv:2309.09752v3 Announce Type: replace Abstract: In Reinforcement Learning, the trade-off between exploration and exploitation poses a complex challenge for achieving efficient learning from limited samples. While recent works have been effective in leveraging past experiences for policy updates, they often overlook the potential of reusing past experiences for data collection. Independent of the underlying RL algorithm, we introduce the concept of a Contrastive Initial State Buffer, which strategically selects states from past experiences and uses them to initialize the agent in the environment in order to guide it toward more informative states. We validate our approach on two complex robotic tasks without relying on any prior information about the environment: (i) locomotion of a quadruped robot traversing challenging terrains and (ii) a quadcopter drone racing through a track. The experimental results show that our initial state buffer achieves higher task performance than the nominal baseline while also speeding up training convergence.  ( 2 min )
    A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning
    arXiv:2309.04877v2 Announce Type: replace Abstract: The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition.  ( 2 min )
    Graph Neural Networks Use Graphs When They Shouldn't
    arXiv:2309.04332v2 Announce Type: replace Abstract: Predictions over graphs play a crucial role in various domains, including social networks and medicine. Graph Neural Networks (GNNs) have emerged as the dominant approach for learning on graph data. Although a graph-structure is provided as input to the GNN, in some cases the best solution can be obtained by ignoring it. While GNNs have the ability to ignore the graph- structure in such cases, it is not clear that they will. In this work, we show that GNNs actually tend to overfit the given graph-structure. Namely, they use it even when a better solution can be obtained by ignoring it. We analyze the implicit bias of gradient-descent learning of GNNs and prove that when the ground truth function does not use the graphs, GNNs are not guaranteed to learn a solution that ignores the graph, even with infinite data. We examine this phenomenon with respect to different graph distributions and find that regular graphs are more robust to this over-fitting. We also prove that within the family of regular graphs, GNNs are guaranteed to extrapolate when learning with gradient descent. Finally, based on our empirical and theoretical findings, we demonstrate on real-data how regular graphs can be leveraged to reduce graph overfitting and enhance performance.  ( 2 min )
    A Benchmark Study on Calibration
    arXiv:2308.11838v4 Announce Type: replace Abstract: Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS.  ( 3 min )
    When Are Two Lists Better than One?: Benefits and Harms in Joint Decision-making
    arXiv:2308.11721v3 Announce Type: replace Abstract: Historically, much of machine learning research has focused on the performance of the algorithm alone, but recently more attention has been focused on optimizing joint human-algorithm performance. Here, we analyze a specific type of human-algorithm collaboration where the algorithm has access to a set of $n$ items, and presents a subset of size $k$ to the human, who selects a final item from among those $k$. This scenario could model content recommendation, route planning, or any type of labeling task. Because both the human and algorithm have imperfect, noisy information about the true ordering of items, the key question is: which value of $k$ maximizes the probability that the best item will be ultimately selected? For $k=1$, performance is optimized by the algorithm acting alone, and for $k=n$ it is optimized by the human acting alone. Surprisingly, we show that for multiple of noise models, it is optimal to set $k \in [2, n-1]$ - that is, there are strict benefits to collaborating, even when the human and algorithm have equal accuracy separately. We demonstrate this theoretically for the Mallows model and experimentally for the Random Utilities models of noisy permutations. However, we show this pattern is reversed when the human is anchored on the algorithm's presented ordering - the joint system always has strictly worse performance. We extend these results to the case where the human and algorithm differ in their accuracy levels, showing that there always exist regimes where a more accurate agent would strictly benefit from collaborating with a less accurate one, but these regimes are asymmetric between the human and the algorithm's accuracy.  ( 3 min )
    Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning
    arXiv:2308.11464v2 Announce Type: replace Abstract: Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. To enhance the capabilities of most model-homogeneous FL methods in handling system heterogeneity, we propose a training scheme that can extend their capabilities to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and discover three key observations: (1) a positive correlation between client performance and layer similarities, (2) higher similarities in the shallow layers in contrast to the deep layers, and (3) the smoother gradients distributions indicate the higher layer similarities. Building upon these observations, we propose InCo Aggregation that leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, to expand their capabilities to handle the system heterogeneity. Copious experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue to enhance the performance in heterogeneous FL.  ( 3 min )
    ALI-DPFL: Differentially Private Federated Learning with Adaptive Local Iterations
    arXiv:2308.10457v4 Announce Type: replace Abstract: Federated Learning (FL) is a distributed machine learning technique that allows model training among multiple devices or organizations by sharing training parameters instead of raw data. However, adversaries can still infer individual information through inference attacks (e.g. differential attacks) on these training parameters. As a result, Differential Privacy (DP) has been widely used in FL to prevent such attacks. We consider differentially private federated learning in a resource-constrained scenario, where both privacy budget and communication rounds are constrained. By theoretically analyzing the convergence, we can find the optimal number of local DPSGD iterations for clients between any two sequential global updates. Based on this, we design an algorithm of Differentially Private Federated Learning with Adaptive Local Iterations (ALI-DPFL). We experiment our algorithm on the MNIST, FashionMNIST and Cifar10 datasets, and demonstrate significantly better performances than previous work in the resource-constraint scenario. Code is available at https://anonymous.4open.science/r/ALI-DPFL-5347.  ( 2 min )
    Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability
    arXiv:2308.07728v4 Announce Type: replace Abstract: Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.  ( 2 min )
    A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
    arXiv:2307.12856v4 Announce Type: replace Abstract: Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.  ( 3 min )
    Language models as master equation solvers
    arXiv:2308.02514v2 Announce Type: replace Abstract: Master equations are of fundamental importance in modeling stochastic dynamical systems.However, solving master equations is challenging due to the exponential increase in the number of possible states or trajectories with the dimension of the state space. In this study, we propose repurposing language models as a machine learning approach to solve master equations. We design a prompt-based neural network to map rate parameters, initial conditions, and time values directly to the state joint probability distribution that exactly matches the input contexts. In this way, we approximate the solution of the master equation in its most general form. We train the network using the policy gradient algorithm within the reinforcement learning framework, with feedback rewards provided by a set of variational autoregressive models. By applying this approach to representative examples, we observe high accuracy for both multi-module and high-dimensional systems. The trained network also exhibits extrapolating ability, extending its predictability to unseen data. Our findings establish the connection between language models and master equations, highlighting the possibility of using a single pretrained large model to solve any master equation.  ( 2 min )
    On the power of graph neural networks and the role of the activation function
    arXiv:2307.04661v3 Announce Type: replace Abstract: In this article we present new results about the expressivity of Graph Neural Networks (GNNs). We prove that for any GNN with piecewise polynomial activations, whose architecture size does not grow with the graph input sizes, there exists a pair of non-isomorphic rooted trees of depth two such that the GNN cannot distinguish their root vertex up to an arbitrary number of iterations. The proof relies on tools from the algebra of symmetric polynomials. In contrast, it was already known that unbounded GNNs (those whose size is allowed to change with the graph sizes) with piecewise polynomial activations can distinguish these vertices in only two iterations. Our results imply a strict separation between bounded and unbounded size GNNs, answering an open question formulated by [Grohe, 2021]. We next prove that if one allows activations that are not piecewise polynomial, then in two iterations a single neuron perceptron can distinguish the root vertices of any pair of nonisomorphic trees of depth two (our results hold for activations like the sigmoid, hyperbolic tan and others). This shows how the power of graph neural networks can change drastically if one changes the activation function of the neural networks. The proof of this result utilizes the Lindemann-Weierstrauss theorem from transcendental number theory.  ( 3 min )
    Graph Neural Networks as an Enabler of Terahertz-based Flow-guided Nanoscale Localization over Highly Erroneous Raw Data
    arXiv:2307.05551v3 Announce Type: replace Abstract: Contemporary research advances in nanotechnology and material science are rooted in the emergence of nanodevices as a versatile tool that harmonizes sensing, computing, wireless communication, data storage, and energy harvesting. These devices offer novel pathways for disease diagnostics, treatment, and monitoring within the bloodstreams. Ensuring precise localization of events of diagnostic interest, which underpins the concept of flow-guided in-body nanoscale localization, would provide an added diagnostic value to the detected events. Raw data generated by the nanodevices is pivotal for this localization and consist of an event detection indicator and the time elapsed since the last passage of a nanodevice through the heart. The energy constraints of the nanodevices lead to intermittent operation and unreliable communication, intrinsically affecting this data. This posits a need for comprehensively modelling the features of this data. These imperfections also have profound implications for the viability of existing flow-guided localization approaches, which are ill-prepared to address the intricacies of the environment. Our first contribution lies in an analytical model of raw data for flow-guided localization, dissecting how communication and energy capabilities influence the nanodevices' data output. This model acts as a vital bridge, reconciling idealized assumptions with practical challenges of flow-guided localization. Toward addressing these practical challenges, we also present an integration of Graph Neural Networks (GNNs) into the flow-guided localization paradigm. GNNs excel in capturing complex dynamic interactions inherent to the localization of events sensed by the nanodevices. Our results highlight the potential of GNNs not only to enhance localization accuracy but also extend coverage to encompass the entire bloodstream.  ( 3 min )
    Momentum Benefits Non-IID Federated Learning Simply and Provably
    arXiv:2306.16504v2 Announce Type: replace Abstract: Federated learning is a powerful paradigm for large-scale machine learning, but it faces significant challenges due to unreliable network connections, slow communication, and substantial data heterogeneity across clients. FedAvg and SCAFFOLD are two prominent algorithms to address these challenges. In particular, FedAvg employs multiple local updates before communicating with a central server, while SCAFFOLD maintains a control variable on each client to compensate for ``client drift'' in its local updates. Various methods have been proposed to enhance the convergence of these two algorithms, but they either make impractical adjustments to the algorithmic structure or rely on the assumption of bounded data heterogeneity. This paper explores the utilization of momentum to enhance the performance of FedAvg and SCAFFOLD. When all clients participate in the training process, we demonstrate that incorporating momentum allows FedAvg to converge without relying on the assumption of bounded data heterogeneity even using a constant local learning rate. This is novel and fairly surprising as existing analyses for FedAvg require bounded data heterogeneity even with diminishing local learning rates. In partial client participation, we show that momentum enables SCAFFOLD to converge provably faster without imposing any additional assumptions. Furthermore, we use momentum to develop new variance-reduced extensions of FedAvg and SCAFFOLD, which exhibit state-of-the-art convergence rates. Our experimental results support all theoretical findings.  ( 3 min )
    Improving Automatic Parallel Training via Balanced Memory Workload Optimization
    arXiv:2307.02031v2 Announce Type: replace Abstract: Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.  ( 3 min )
    Deep Reinforcement Learning with Task-Adaptive Retrieval via Hypernetwork
    arXiv:2306.10698v5 Announce Type: replace Abstract: Deep reinforcement learning algorithms are usually impeded by sampling inefficiency, heavily depending on multiple interactions with the environment to acquire accurate decision-making capabilities. In contrast, humans rely on their hippocampus to retrieve relevant information from past experiences of relevant tasks, which guides their decision-making when learning a new task, rather than exclusively depending on environmental interactions. Nevertheless, designing a hippocampus-like module for an agent to incorporate past experiences into established reinforcement learning algorithms presents two challenges. The first challenge involves selecting the most relevant past experiences for the current task, and the second challenge is integrating such experiences into the decision network. To address these challenges, we propose a novel method that utilizes a retrieval network based on task-conditioned hypernetwork, which adapts the retrieval network's parameters depending on the task. At the same time, a dynamic modification mechanism enhances the collaborative efforts between the retrieval and decision networks. We evaluate the proposed method across various tasks within a multitask scenario in the Minigrid environment. The experimental results demonstrate that our proposed method significantly outperforms strong baselines.  ( 3 min )
    RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control
    arXiv:2306.03530v3 Announce Type: replace Abstract: Deep Reinforcement Learning (RL) can yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Its novel architecture allows RLtools to be used on a wide variety of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems up to 76 times faster than other popular RL frameworks. We also benchmark the inference on a diverse set of microcontrollers and show that in most cases our optimized implementation is by far the fastest. Finally, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of Tiny Reinforcement Learning (TinyRL). The source code as well as documentation and live demos are available through our project page at https://rl.tools.  ( 3 min )
    Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data
    arXiv:2306.03346v2 Announce Type: replace Abstract: Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that the reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite the seeming appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by $2 \times$. These findings lay the groundwork for our main result: we demonstrate that a self-supervised RL algorithm based on contrastive learning can solve real-world, image-based robotic manipulation tasks, with tasks being specified by a single goal image provided after training.  ( 2 min )
    The Galerkin method beats Graph-Based Approaches for Spectral Algorithms
    arXiv:2306.00742v3 Announce Type: replace Abstract: Historically, the machine learning community has derived spectral decompositions from graph-based approaches. We break with this approach and prove the statistical and computational superiority of the Galerkin method, which consists in restricting the study to a small set of test functions. In particular, we introduce implementation tricks to deal with differential operators in large dimensions with structured kernels. Finally, we extend on the core principles beyond our approach to apply them to non-linear spaces of functions, such as the ones parameterized by deep neural networks, through loss-based optimization procedures.  ( 2 min )
    Feature-aligned N-BEATS with Sinkhorn divergence
    arXiv:2305.15196v3 Announce Type: replace Abstract: We propose Feature-aligned N-BEATS as a domain-generalized time series forecasting model. It is a nontrivial extension of N-BEATS with doubly residual stacking principle (Oreshkin et al. [45]) into a representation learning framework. In particular, it revolves around marginal feature probability measures induced by the intricate composition of residual and feature extracting operators of N-BEATS in each stack and aligns them stack-wise via an approximate of an optimal transport distance referred to as the Sinkhorn divergence. The training loss consists of an empirical risk minimization from multiple source domains, i.e., forecasting loss, and an alignment loss calculated with the Sinkhorn divergence, which allows the model to learn invariant features stack-wise across multiple source data sequences while retaining N-BEATS's interpretable design and forecasting power. Comprehensive experimental evaluations with ablation studies are provided and the corresponding results demonstrate the proposed model's forecasting and generalization capabilities.  ( 2 min )
    Homogenizing Non-IID datasets via In-Distribution Knowledge Distillation for Decentralized Learning
    arXiv:2304.04326v2 Announce Type: replace Abstract: Decentralized learning enables serverless training of deep neural networks (DNNs) in a distributed manner on multiple nodes. This allows for the use of large datasets, as well as the ability to train with a wide variety of data sources. However, one of the key challenges with decentralized learning is heterogeneity in the data distribution across the nodes. In this paper, we propose In-Distribution Knowledge Distillation (IDKD) to address the challenge of heterogeneous data distribution. The goal of IDKD is to homogenize the data distribution across the nodes. While such data homogenization can be achieved by exchanging data among the nodes sacrificing privacy, IDKD achieves the same objective using a common public dataset across nodes without breaking the privacy constraint. This public dataset is different from the training dataset and is used to distill the knowledge from each node and communicate it to its neighbors through the generated labels. With traditional knowledge distillation, the generalization of the distilled model is reduced because all the public dataset samples are used irrespective of their similarity to the local dataset. Thus, we introduce an Out-of-Distribution (OoD) detector at each node to label a subset of the public dataset that maps close to the local training data distribution. Finally, only labels corresponding to these subsets are exchanged among the nodes and with appropriate label averaging each node is finetuned on these data subsets along with its local data. Our experiments on multiple image classification datasets and graph topologies show that the proposed IDKD scheme is more effective than traditional knowledge distillation and achieves state-of-the-art generalization performance on heterogeneously distributed data with minimal communication overhead.  ( 3 min )
    Domain Generalization In Robust Invariant Representation
    arXiv:2304.03431v2 Announce Type: replace Abstract: Unsupervised approaches for learning representations invariant to common transformations are used quite often for object recognition. Learning invariances makes models more robust and practical to use in real-world scenarios. Since data transformations that do not change the intrinsic properties of the object cause the majority of the complexity in recognition tasks, models that are invariant to these transformations help reduce the amount of training data required. This further increases the model's efficiency and simplifies training. In this paper, we investigate the generalization of invariant representations on out-of-distribution data and try to answer the question: Do model representations invariant to some transformations in a particular seen domain also remain invariant in previously unseen domains? Through extensive experiments, we demonstrate that the invariant model learns unstructured latent representations that are robust to distribution shifts, thus making invariance a desirable property for training in resource-constrained settings.  ( 2 min )
    Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model
    arXiv:2303.06965v4 Announce Type: replace Abstract: Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we have proposed a unified framework that addresses both the reaction representation learning and molecule generation tasks, which allows for a more holistic approach. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcome the limitations of current molecule generation models that rely on a small number of reaction templates. In the extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a significant step toward a large-scale deep-learning framework for a variety of reaction-based applications.  ( 3 min )
    Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks
    arXiv:2303.05445v4 Announce Type: replace Abstract: Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify on various scenarios, including dynamic networks, that FwA leads to significantly lower communication costs despite minimal regret performance loss compared to other network protocols.  ( 3 min )
    Learning Hidden Markov Models Using Conditional Samples
    arXiv:2302.14753v2 Announce Type: replace Abstract: This paper is concerned with the computational complexity of learning the Hidden Markov Model (HMM). Although HMMs are some of the most widely used tools in sequential and time series modeling, they are cryptographically hard to learn in the standard setting where one has access to i.i.d. samples of observation sequences. In this paper, we depart from this setup and consider an interactive access model, in which the algorithm can query for samples from the conditional distributions of the HMMs. We show that interactive access to the HMM enables computationally efficient learning algorithms, thereby bypassing cryptographic hardness. Specifically, we obtain efficient algorithms for learning HMMs in two settings: (a) An easier setting where we have query access to the exact conditional probabilities. Here our algorithm runs in polynomial time and makes polynomially many queries to approximate any HMM in total variation distance. (b) A harder setting where we can only obtain samples from the conditional distributions. Here the performance of the algorithm depends on a new parameter, called the fidelity of the HMM. We show that this captures cryptographically hard instances and previously known positive results. We also show that these results extend to a broader class of distributions with latent low rank structure. Our algorithms can be viewed as generalizations and robustifications of Angluin's $L^*$ algorithm for learning deterministic finite automata from membership queries.  ( 3 min )
    Stochastic Online Instrumental Variable Regression: Regrets for Endogeneity and Bandit Feedback
    arXiv:2302.09357v3 Announce Type: replace Abstract: Endogeneity, i.e. the dependence of noise and covariates, is a common phenomenon in real data due to omitted variables, strategic behaviours, measurement errors etc. In contrast, the existing analyses of stochastic online linear regression with unbounded noise and linear bandits depend heavily on exogeneity, i.e. the independence of noise and covariates. Motivated by this gap, we study the over- and just-identified Instrumental Variable (IV) regression, specifically Two-Stage Least Squares, for stochastic online learning, and propose to use an online variant of Two-Stage Least Squares, namely O2SLS. We show that O2SLS achieves $\mathcal O(d_{x}d_{z}\log^2 T)$ identification and $\widetilde{\mathcal O}(\gamma \sqrt{d_{z} T})$ oracle regret after $T$ interactions, where $d_{x}$ and $d_{z}$ are the dimensions of covariates and IVs, and $\gamma$ is the bias due to endogeneity. For $\gamma=0$, i.e. under exogeneity, O2SLS exhibits $\mathcal O(d_{x}^2 \log^2 T)$ oracle regret, which is of the same order as that of the stochastic online ridge. Then, we leverage O2SLS as an oracle to design OFUL-IV, a stochastic linear bandit algorithm to tackle endogeneity. OFUL-IV yields $\widetilde{\mathcal O}(\sqrt{d_{x}d_{z}T})$ regret that matches the regret lower bound under exogeneity. For different datasets with endogeneity, we experimentally show efficiencies of O2SLS and OFUL-IV.  ( 3 min )
    Et Tu Certifications: Robustness Certificates Yield Better Adversarial Examples
    arXiv:2302.04379v3 Announce Type: replace Abstract: In guaranteeing the absence of adversarial examples in an instance's neighbourhood, certification mechanisms play an important role in demonstrating neural net robustness. In this paper, we ask if these certifications can compromise the very models they help to protect? Our new \emph{Certification Aware Attack} exploits certifications to produce computationally efficient norm-minimising adversarial examples $74 \%$ more often than comparable attacks, while reducing the median perturbation norm by more than $10\%$. While these attacks can be used to assess the tightness of certification bounds, they also highlight an apparent paradox -- that certifications can reduce security.  ( 2 min )
    Label-efficient Time Series Representation Learning: A Review
    arXiv:2302.06433v3 Announce Type: replace Abstract: The scarcity of labeled data is one of the main challenges of applying deep learning models on time series data in the real world. Therefore, several approaches, e.g., transfer learning, self-supervised learning, and semi-supervised learning, have been recently developed to promote the learning capability of deep learning models from the limited time series labels. In this survey, for the first time, we provide a novel taxonomy to categorize existing approaches that address the scarcity of labeled data problem in time series data based on their dependency on external data sources. Moreover, we present a review of the recent advances in each approach and conclude the limitations of the current works and provide future directions that could yield better progress in the field.  ( 2 min )
    Flow Matching on General Geometries
    arXiv:2302.03660v3 Announce Type: replace Abstract: We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, are inherently unable to scale to high dimensions, or use approximations for limiting quantities that result in biased training objectives. Riemannian Flow Matching bypasses these limitations and offers several advantages over previous approaches: it is simulation-free on simple geometries, does not require divergence computation, and computes its target vector field in closed-form. The key ingredient behind RFM is the construction of a relatively simple premetric for defining target vector fields, which encompasses the existing Euclidean case. To extend to general geometries, we rely on the use of spectral decompositions to efficiently compute premetrics on the fly. Our method achieves state-of-the-art performance on many real-world non-Euclidean datasets, and we demonstrate tractable training on general geometries, including triangular meshes with highly non-trivial curvature and boundaries.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models
    arXiv:2301.12334v2 Announce Type: replace Abstract: We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves ineffective and time-consuming for the minority generating task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first highlight that Tweedie's denoising formula yields favorable results for majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating high-quality minority samples over existing generative samplers. We showcase that the performance benefit of our framework persists even in demanding real-world scenarios such as medical imaging, further underscoring the practical significance of our work. Code is available at https://github.com/soobin-um/minority-guidance.  ( 3 min )
    Quality at the Tail of Machine Learning Inference
    arXiv:2212.13925v3 Announce Type: replace Abstract: Machine learning inference should be subject to stringent inference time constraints while ensuring high inference quality, especially in safety-critical (e.g., autonomous driving) and mission-critical (e.g., emotion recognition) contexts. Neglecting either aspect can lead to severe consequences, such as loss of life and property damage. Many studies lack a comprehensive consideration of these metrics, leading to incomplete or misleading evaluations. The study unveils a counterintuitive revelation: deep learning inference quality exhibits fluctuations due to inference time. To depict this phenomenon, the authors coin a new term, "tail quality," providing a more comprehensive evaluation, and overcoming conventional metric limitations. Moreover, the research proposes an initial evaluation framework to analyze factors affecting quality fluctuations, facilitating the prediction of the potential distribution of inference quality. The effectiveness of the evaluation framework is validated through experiments conducted on deep learning models for three different tasks across four systems.  ( 2 min )
    Localized Randomized Smoothing for Collective Robustness Certification
    arXiv:2210.16140v3 Announce Type: replace Abstract: Models for image segmentation, node classification and many other tasks map a single input to multiple labels. By perturbing this single shared input (e.g. the image) an adversary can manipulate several predictions (e.g. misclassify several pixels). Collective robustness certification is the task of provably bounding the number of robust predictions under this threat model. The only dedicated method that goes beyond certifying each output independently is limited to strictly local models, where each prediction is associated with a small receptive field. We propose a more general collective robustness certificate for all types of models. We further show that this approach is beneficial for the larger class of softly local models, where each output is dependent on the entire input but assigns different levels of importance to different input regions (e.g. based on their proximity in the image). The certificate is based on our novel localized randomized smoothing approach, where the random perturbation strength for different input regions is proportional to their importance for the outputs. Localized smoothing Pareto-dominates existing certificates on both image segmentation and node classification tasks, simultaneously offering higher accuracy and stronger certificates.  ( 2 min )
    TREE-G: Decision Trees Contesting Graph Neural Networks
    arXiv:2207.02760v5 Announce Type: replace Abstract: When dealing with tabular data, models based on decision trees are a popular choice due to their high accuracy on these data types, their ease of application, and explainability properties. However, when it comes to graph-structured data, it is not clear how to apply them effectively, in a way that incorporates the topological information with the tabular data available on the vertices of the graph. To address this challenge, we introduce TREE-G. TREE-G modifies standard decision trees, by introducing a novel split function that is specialized for graph data. Not only does this split function incorporate the node features and the topological information, but it also uses a novel pointer mechanism that allows split nodes to use information computed in previous splits. Therefore, the split function adapts to the predictive task and the graph at hand. We analyze the theoretical properties of TREE-G and demonstrate its benefits empirically on multiple graph and vertex prediction benchmarks. In these experiments, TREE-G consistently outperforms other tree-based models and often outperforms other graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels, sometimes by large margins. Moreover, TREE-Gs models and their predictions can be explained and visualized  ( 3 min )
    DynaConF: Dynamic Forecasting of Non-Stationary Time Series
    arXiv:2209.08411v3 Announce Type: replace Abstract: Deep learning has shown impressive results in a variety of time series forecasting tasks, where modeling the conditional distribution of the future given the past is the essence. However, when this conditional distribution is non-stationary, it poses challenges for these models to learn consistently and to predict accurately. In this work, we propose a new method to model non-stationary conditional distributions over time by clearly decoupling stationary conditional distribution modeling from non-stationary dynamics modeling. Our method is based on a Bayesian dynamic model that can adapt to conditional distribution changes and a deep conditional distribution model that handles multivariate time series using a factorized output space. Our experimental results on synthetic and real-world datasets show that our model can adapt to non-stationary time series better than state-of-the-art deep learning solutions.  ( 2 min )
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond
    arXiv:2206.07751v5 Announce Type: replace Abstract: Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.  ( 2 min )
    Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
    arXiv:2206.04044v2 Announce Type: replace Abstract: This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.  ( 2 min )
    Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach
    arXiv:2202.12797v2 Announce Type: replace Abstract: Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where the agents interact with the mechanism designer according to an unknown Markov Decision Process (MDP), where agent rewards and the mechanism designer's state evolve according to an episodic MDP with unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and propose novel learning algorithms to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction. A key contribution of our approach is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space to estimate prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower bound to show that our algorithm is efficient, incurring the same $\tilde{\mathcal{O}}(T^{2 / 3})$ regret as the lower bound, where $T$ is the total number of rounds. Our work establishes the regret guarantee for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model.  ( 3 min )
    Optimization with Access to Auxiliary Information
    arXiv:2206.00395v4 Announce Type: replace Abstract: We investigate the fundamental optimization question of minimizing a target function $f$, whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, Et cetera. We propose two generic new algorithms that apply in all these settings; we also prove that we can benefit from this framework under the Hessian similarity assumption between the target and side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.  ( 2 min )
    The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models
    arXiv:2111.02168v4 Announce Type: replace Abstract: Web automation holds the potential to revolutionize how users interact with the digital world, offering unparalleled assistance and simplifying tasks via sophisticated computational methods. Central to this evolution is the web element nomination task, which entails identifying unique elements on webpages. Unfortunately, the development of algorithmic designs for web automation is hampered by the scarcity of comprehensive and realistic datasets that reflect the complexity faced by real-world applications on the Web. To address this, we introduce the Klarna Product Page Dataset, a comprehensive and diverse collection of webpages that surpasses existing datasets in richness and variety. The dataset features 51,701 manually labeled product pages from 8,175 e-commerce websites across eight geographic regions, accompanied by a dataset of rendered page screenshots. To initiate research on the Klarna Product Page Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. We make three important contributions. First, we found that a simple Convolutional GNN (GCN) outperforms complex state-of-the-art nomination methods. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page using the aforementioned GCN. These elements are then passed to a large language model for the final nomination. This procedure significantly improves the nomination accuracy by 16.8 percentage points on our challenging dataset, without any need for fine-tuning. Finally, in response to another prevalent challenge in this field - the abundance of training methodologies suitable for element nomination - we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.  ( 3 min )
    The Role of Contextual Information in Best Arm Identification
    arXiv:2106.14077v3 Announce Type: replace Abstract: We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We show the instance-specific sample complexity lower bounds for the problem. Then, we propose a context-aware version of the "Track-and-Stop" strategy, wherein the proportion of the arm draws tracks the set of optimal allocations and prove that the expected number of arm draws matches the lower bound asymptotically. We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016). We experimentally confirm that context information contributes to faster best-arm identification.  ( 2 min )
    Scaling the Convex Barrier with Sparse Dual Algorithms
    arXiv:2101.05844v3 Announce Type: replace Abstract: Tight and efficient neural network bounding is crucial to the scaling of neural network verification systems. Many efficient bounding algorithms have been presented recently, but they are often too loose to verify more challenging properties. This is due to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise-linear activations exists, it comes at the cost of exponentially many constraints and currently lacks an efficient customized solver. We alleviate this deficiency by presenting two novel dual algorithms: one operates a subgradient method on a small active set of dual variables, the other exploits the sparsity of Frank-Wolfe type optimizers to incur only a linear memory cost. Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle. At the same time, they share the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we can obtain better bounds than off-the-shelf solvers in only a fraction of their running time, attaining significant formal verification speed-ups.  ( 3 min )
    Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model
    arXiv:2102.11513v2 Announce Type: replace Abstract: Reinforcement learning (RL) shows great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven, which usually yield better asymptotic performance but much slower convergence compared with model-driven methods. This paper proposes mixed policy gradient (MPG) algorithm, which fuses the empirical data and the transition model in policy gradient (PG) to accelerate convergence without performance degradation. Formally, MPG is constructed as a weighted average of the data-driven and model-driven PGs, where the former is the derivative of the learned Q-value function, and the latter is that of the model-predictive return. To guide the weight design, we analyze and compare the upper bound of each PG error. Relying on that, a rule-based method is employed to heuristically adjust the weights. In particular, to get a better PG, the weight of the data-driven PG is designed to grow along the learning process while the other to decrease. Simulation results show that the MPG method achieves the best asymptotic performance and convergence speed compared with other baseline algorithms.  ( 2 min )
    On the generalization of Tanimoto-type kernels to real valued functions
    arXiv:2007.05943v2 Announce Type: replace Abstract: The Tanimoto kernel (Jaccard index) is a well known tool to describe the similarity between sets of binary attributes. It has been extended to the case when the attributes are nonnegative real values. This paper introduces a more general Tanimoto kernel formulation which allows to measure the similarity of arbitrary real-valued functions. This extension is constructed by unifying the representation of the attributes via properly chosen sets. After deriving the general form of the kernel, explicit feature representation is extracted from the kernel function, and a simply way of including general kernels into the Tanimoto kernel is shown. Finally, the kernel is also expressed as a quotient of piecewise linear functions, and a smooth approximation is provided.  ( 2 min )
    Online Causal Inference for Advertising in Real-Time Bidding Auctions
    arXiv:1908.08600v4 Announce Type: replace Abstract: Real-time bidding (RTB) systems, which utilize auctions to allocate user impressions to competing advertisers, continue to enjoy success in digital advertising. Assessing the effectiveness of such advertising remains a challenge in research and practice. This paper proposes a new approach to perform causal inference on advertising bought through such mechanisms. Leveraging the economic structure of first- and second-price auctions, we first show that the effects of advertising are identified by the optimal bids. Hence, since these optimal bids are the only objects that need to be recovered, we introduce an adapted Thompson sampling (TS) algorithm to solve a multi-armed bandit problem that succeeds in recovering such bids and, consequently, the effects of advertising while minimizing the costs of experimentation. We derive a regret bound for our algorithm which is order optimal and use data from RTB auctions to show that it outperforms commonly used methods that estimate the effects of advertising.  ( 2 min )
    Model-based deep reinforcement learning for accelerated learning from flow simulations
    arXiv:2402.16543v1 Announce Type: cross Abstract: In recent years, deep reinforcement learning has emerged as a technique to solve closed-loop flow control problems. Employing simulation-based environments in reinforcement learning enables a priori end-to-end optimization of the control system, provides a virtual testbed for safety-critical control applications, and allows to gain a deep understanding of the control mechanisms. While reinforcement learning has been applied successfully in a number of rather simple flow control benchmarks, a major bottleneck toward real-world applications is the high computational cost and turnaround time of flow simulations. In this contribution, we demonstrate the benefits of model-based reinforcement learning for flow control applications. Specifically, we optimize the policy by alternating between trajectories sampled from flow simulations and trajectories sampled from an ensemble of environment models. The model-based learning reduces the overall training time by up to $85\%$ for the fluidic pinball test case. Even larger savings are expected for more demanding flow simulations.  ( 2 min )
    Integrating Large Language Models with Graphical Session-Based Recommendation
    arXiv:2402.16539v1 Announce Type: cross Abstract: With the rapid development of Large Language Models (LLMs), various explorations have arisen to utilize LLMs capability of context understanding on recommender systems. While pioneering strategies have primarily transformed traditional recommendation tasks into challenges of natural language generation, there has been a relative scarcity of exploration in the domain of session-based recommendation (SBR) due to its specificity. SBR has been primarily dominated by Graph Neural Networks, which have achieved many successful outcomes due to their ability to capture both the implicit and explicit relationships between adjacent behaviors. The structural nature of graphs contrasts with the essence of natural language, posing a significant adaptation gap for LLMs. In this paper, we introduce large language models with graphical Session-Based recommendation, named LLMGR, an effective framework that bridges the aforementioned gap by harmoniously integrating LLMs with Graph Neural Networks (GNNs) for SBR tasks. This integration seeks to leverage the complementary strengths of LLMs in natural language understanding and GNNs in relational data processing, leading to a more powerful session-based recommender system that can understand and recommend items within a session. Moreover, to endow the LLM with the capability to empower SBR tasks, we design a series of prompts for both auxiliary and major instruction tuning tasks. These prompts are crafted to assist the LLM in understanding graph-structured data and align textual information with nodes, effectively translating nuanced user interactions into a format that can be understood and utilized by LLM architectures. Extensive experiments on three real-world datasets demonstrate that LLMGR outperforms several competitive baselines, indicating its effectiveness in enhancing SBR tasks and its potential as a research direction for future exploration.  ( 3 min )
    Discovering Artificial Viscosity Models for Discontinuous Galerkin Approximation of Conservation Laws using Physics-Informed Machine Learning
    arXiv:2402.16517v1 Announce Type: cross Abstract: Finite element-based high-order solvers of conservation laws offer large accuracy but face challenges near discontinuities due to the Gibbs phenomenon. Artificial viscosity is a popular and effective solution to this problem based on physical insight. In this work, we present a physics-informed machine learning algorithm to automate the discovery of artificial viscosity models in a non-supervised paradigm. The algorithm is inspired by reinforcement learning and trains a neural network acting cell-by-cell (the viscosity model) by minimizing a loss defined as the difference with respect to a reference solution thanks to automatic differentiation. This enables a dataset-free training procedure. We prove that the algorithm is effective by integrating it into a state-of-the-art Runge-Kutta discontinuous Galerkin solver. We showcase several numerical tests on scalar and vectorial problems, such as Burgers' and Euler's equations in one and two dimensions. Results demonstrate that the proposed approach trains a model that is able to outperform classical viscosity models. Moreover, we show that the learnt artificial viscosity model is able to generalize across different problems and parameters.  ( 2 min )
    Stable Training of Normalizing Flows for High-dimensional Variational Inference
    arXiv:2402.16408v1 Announce Type: cross Abstract: Variational inference with normalizing flows (NFs) is an increasingly popular alternative to MCMC methods. In particular, NFs based on coupling layers (Real NVPs) are frequently used due to their good empirical performance. In theory, increasing the depth of normalizing flows should lead to more accurate posterior approximations. However, in practice, training deep normalizing flows for approximating high-dimensional posterior distributions is often infeasible due to the high variance of the stochastic gradients. In this work, we show that previous methods for stabilizing the variance of stochastic gradient descent can be insufficient to achieve stable training of Real NVPs. As the source of the problem, we identify that, during training, samples often exhibit unusual high values. As a remedy, we propose a combination of two methods: (1) soft-thresholding of the scale in Real NVPs, and (2) a bijective soft log transformation of the samples. We evaluate these and other previously proposed modification on several challenging target distributions, including a high-dimensional horseshoe logistic regression model. Our experiments show that with our modifications, stable training of Real NVPs for posteriors with several thousand dimensions is possible, allowing for more accurate marginal likelihood estimation via importance sampling. Moreover, we evaluate several common training techniques and architecture choices and provide practical advise for training NFs for high-dimensional variational inference.  ( 2 min )
    Uncertainty Quantification in Anomaly Detection with Cross-Conformal $p$-Values
    arXiv:2402.16388v1 Announce Type: cross Abstract: Given the growing significance of reliable, trustworthy, and explainable machine learning, the requirement of uncertainty quantification for anomaly detection systems has become increasingly important. In this context, effectively controlling Type I error rates ($\alpha$) without compromising the statistical power ($1-\beta$) of these systems can build trust and reduce costs related to false discoveries, particularly when follow-up procedures are expensive. Leveraging the principles of conformal prediction emerges as a promising approach for providing respective statistical guarantees by calibrating a model's uncertainty. This work introduces a novel framework for anomaly detection, termed cross-conformal anomaly detection, building upon well-known cross-conformal methods designed for prediction tasks. With that, it addresses a natural research gap by extending previous works in the context of inductive conformal anomaly detection, relying on the split-conformal approach for model calibration. Drawing on insights from conformal prediction, we demonstrate that the derived methods for calculating cross-conformal $p$-values strike a practical compromise between statistical efficiency (full-conformal) and computational efficiency (split-conformal) for uncertainty-quantified anomaly detection on benchmark datasets.  ( 2 min )
    An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation
    arXiv:2402.16380v1 Announce Type: cross Abstract: Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly in social media, experiences increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS technologies is highly dependent on the quality of the training data, emphasizing the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models to address this critical need for high-quality data. The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. The proposed application aims to streamline the dataset creation process for TTS models through these features, thereby facilitating advancements in voice-based technologies.  ( 2 min )
    Generative AI in Vision: A Survey on Models, Metrics and Applications
    arXiv:2402.16369v1 Announce Type: cross Abstract: Generative AI models have revolutionized various fields by enabling the creation of realistic and diverse data samples. Among these models, diffusion models have emerged as a powerful approach for generating high-quality images, text, and audio. This survey paper provides a comprehensive overview of generative AI diffusion and legacy models, focusing on their underlying techniques, applications across different domains, and their challenges. We delve into the theoretical foundations of diffusion models, including concepts such as denoising diffusion probabilistic models (DDPM) and score-based generative modeling. Furthermore, we explore the diverse applications of these models in text-to-image, image inpainting, and image super-resolution, along with others, showcasing their potential in creative tasks and data augmentation. By synthesizing existing research and highlighting critical advancements in this field, this survey aims to provide researchers and practitioners with a comprehensive understanding of generative AI diffusion and legacy models and inspire future innovations in this exciting area of artificial intelligence.  ( 2 min )
    Where Do We Go from Here? Multi-scale Allocentric Relational Inference from Natural Spatial Descriptions
    arXiv:2402.16364v1 Announce Type: cross Abstract: When communicating routes in natural language, the concept of {\em acquired spatial knowledge} is crucial for geographic information retrieval (GIR) and in spatial cognitive research. However, NLP navigation studies often overlook the impact of such acquired knowledge on textual descriptions. Current navigation studies concentrate on egocentric local descriptions (e.g., `it will be on your right') that require reasoning over the agent's local perception. These instructions are typically given as a sequence of steps, with each action-step explicitly mentioning and being followed by a landmark that the agent can use to verify they are on the right path (e.g., `turn right and then you will see...'). In contrast, descriptions based on knowledge acquired through a map provide a complete view of the environment and capture its overall structure. These instructions (e.g., `it is south of Central Park and a block north of a police station') are typically non-sequential, contain allocentric relations, with multiple spatial relations and implicit actions, without any explicit verification. This paper introduces the Rendezvous (RVS) task and dataset, which includes 10,404 examples of English geospatial instructions for reaching a target location using map-knowledge. Our analysis reveals that RVS exhibits a richer use of spatial allocentric relations, and requires resolving more spatial relations simultaneously compared to previous text-based navigation benchmarks.  ( 3 min )
    A Provably Accurate Randomized Sampling Algorithm for Logistic Regression
    arXiv:2402.16326v1 Announce Type: cross Abstract: In statistics and machine learning, logistic regression is a widely-used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.  ( 2 min )
    An optimal tradeoff between entanglement and copy complexity for state tomography
    arXiv:2402.16353v1 Announce Type: cross Abstract: There has been significant interest in understanding how practical constraints on contemporary quantum devices impact the complexity of quantum learning. For the classic question of tomography, recent work tightly characterized the copy complexity for any protocol that can only measure one copy of the unknown state at a time, showing it is polynomially worse than if one can make fully-entangled measurements. While we now have a fairly complete picture of the rates for such tasks in the near-term and fault-tolerant regimes, it remains poorly understood what the landscape in between looks like. In this work, we study tomography in the natural setting where one can make measurements of $t$ copies at a time. For sufficiently small $\epsilon$, we show that for any $t \le d^2$, $\widetilde{\Theta}(\frac{d^3}{\sqrt{t}\epsilon^2})$ copies are necessary and sufficient to learn an unknown $d$-dimensional state $\rho$ to trace distance $\epsilon$. This gives a smooth and optimal interpolation between the known rates for single-copy and fully-entangled measurements. To our knowledge, this is the first smooth entanglement-copy tradeoff known for any quantum learning task, and for tomography, no intermediate point on this curve was known, even at $t = 2$. An important obstacle is that unlike the optimal single-copy protocol, the optimal fully-entangled protocol is inherently biased and thus precludes naive batching approaches. Instead, we devise a novel two-stage procedure that uses Keyl's algorithm to refine a crude estimate for $\rho$ based on single-copy measurements. A key insight is to use Schur-Weyl sampling not to estimate the spectrum of $\rho$, but to estimate the deviation of $\rho$ from the maximally mixed state. When $\rho$ is far from the maximally mixed state, we devise a novel quantum splitting procedure that reduces to the case where $\rho$ is close to maximally mixed.  ( 3 min )
    Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
    arXiv:2402.16321v1 Announce Type: cross Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variational autoencoder (VQ-VAE). The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training without any label requirements. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.  ( 2 min )
    Against Filter Bubbles: Diversified Music Recommendation via Weighted Hypergraph Embedding Learning
    arXiv:2402.16299v1 Announce Type: cross Abstract: Recommender systems serve a dual purpose for users: sifting out inappropriate or mismatched information while accurately identifying items that align with their preferences. Numerous recommendation algorithms are designed to provide users with a personalized array of information tailored to their preferences. Nevertheless, excessive personalization can confine users within a "filter bubble". Consequently, achieving the right balance between accuracy and diversity in recommendations is a pressing concern. To address this challenge, exemplified by music recommendation, we introduce the Diversified Weighted Hypergraph music Recommendation algorithm (DWHRec). In the DWHRec algorithm, the initial connections between users and listened tracks are represented by a weighted hypergraph. Simultaneously, associations between artists, albums and tags with tracks are also appended to the hypergraph. To explore users' latent preferences, a hypergraph-based random walk embedding method is applied to the constructed hypergraph. In our investigation, accuracy is gauged by the alignment between the user and the track, whereas the array of recommended track types measures diversity. We rigorously compared DWHRec against seven state-of-the-art recommendation algorithms using two real-world music datasets. The experimental results validate DWHRec as a solution that adeptly harmonizes accuracy and diversity, delivering a more enriched musical experience. Beyond music recommendation, DWHRec can be extended to cater to other scenarios with similar data structures.  ( 2 min )
    A Self-matching Training Method with Annotation Embedding Models for Ontology Subsumption Prediction
    arXiv:2402.16278v1 Announce Type: cross Abstract: Recently, ontology embeddings representing entities in a low-dimensional space have been proposed for ontology completion. However, the ontology embeddings for concept subsumption prediction do not address the difficulties of similar and isolated entities and fail to extract the global information of annotation axioms from an ontology. In this paper, we propose a self-matching training method for the two ontology embedding models: Inverted-index Matrix Embedding (InME) and Co-occurrence Matrix Embedding (CoME). The two embeddings capture the global and local information in annotation axioms by means of the occurring locations of each word in a set of axioms and the co-occurrences of words in each axiom. The self-matching training method increases the robustness of the concept subsumption prediction when predicted superclasses are similar to subclasses and are isolated to other entities in an ontology. Our evaluation experiments show that the self-matching training method with InME outperforms the existing ontology embeddings for the GO and FoodOn ontologies and that the method with the concatenation of CoME and OWL2Vec* outperforms them for the HeLiS ontology.  ( 2 min )
    A Comparison of Deep Learning Models for Proton Background Rejection with the AMS Electromagnetic Calorimeter
    arXiv:2402.16285v1 Announce Type: cross Abstract: The Alpha Magnetic Spectrometer (AMS) is a high-precision particle detector onboard the International Space Station containing six different subdetectors. The Transition Radiation Detector and Electromagnetic Calorimeter (ECAL) are used to separate electrons/positrons from the abundant cosmic-ray proton background. The positron flux measured in space by AMS falls with a power law which unexpectedly softens above 25 GeV and then hardens above 280 GeV. Several theoretical models try to explain these phenomena, and a purer measurement of positrons at higher energies is needed to help test them. The currently used methods to reject the proton background at high energies involve extrapolating shower features from the ECAL to use as inputs for boosted decision tree and likelihood classifiers. We present a new approach for particle identification with the AMS ECAL using deep learning (DL). By taking the energy deposition within all the ECAL cells as an input and treating them as pixels in an image-like format, we train an MLP, a CNN, and multiple ResNets and Convolutional vision Transformers (CvTs) as shower classifiers. Proton rejection performance is evaluated using Monte Carlo (MC) events and ISS data separately. For MC, using events with a reconstructed energy between 0.2 - 2 TeV, at 90% electron accuracy, the proton rejection power of our CvT model is more than 5 times that of the other DL models. Similarly, for ISS data with a reconstructed energy between 50 - 70 GeV, the proton rejection power of our CvT model is more than 2.5 times that of the other DL models.  ( 3 min )
    From Large Language Models and Optimization to Decision Optimization CoPilot: A Research Manifesto
    arXiv:2402.16269v1 Announce Type: cross Abstract: Significantly simplifying the creation of optimization models for real-world business problems has long been a major goal in applying mathematical optimization more widely to important business and societal decisions. The recent capabilities of Large Language Models (LLMs) present a timely opportunity to achieve this goal. Therefore, we propose research at the intersection of LLMs and optimization to create a Decision Optimization CoPilot (DOCP) - an AI tool designed to assist any decision maker, interacting in natural language to grasp the business problem, subsequently formulating and solving the corresponding optimization model. This paper outlines our DOCP vision and identifies several fundamental requirements for its implementation. We describe the state of the art through a literature survey and experiments using ChatGPT. We show that a) LLMs already provide substantial novel capabilities relevant to a DOCP, and b) major research challenges remain to be addressed. We also propose possible research directions to overcome these gaps. We also see this work as a call to action to bring together the LLM and optimization communities to pursue our vision, thereby enabling much more widespread improved decision-making.  ( 2 min )
    IR2: Information Regularization for Information Retrieval
    arXiv:2402.16200v1 Announce Type: cross Abstract: Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline-input, prompt, and output-each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at https://github.com/Info-Regularization/Information-Regularization.  ( 2 min )
    Attacking LLM Watermarks by Exploiting Their Strengths
    arXiv:2402.16187v1 Announce Type: cross Abstract: Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating misuse of such AI-generated content. However, existing watermarking schemes remain surprisingly susceptible to attack. In particular, we show that desirable properties shared by existing LLM watermarking systems such as quality preservation, robustness, and public detection APIs can in turn make these systems vulnerable to various attacks. We rigorously study potential attacks in terms of common watermark design choices, and propose best practices and defenses for mitigation -- establishing a set of practical guidelines for embedding and detection of LLM watermarks.  ( 2 min )
    Language Models for Code Completion: A Practical Evaluation
    arXiv:2402.16197v1 Announce Type: cross Abstract: Transformer-based language models for automatic code completion have shown great promise so far, yet the evaluation of these models rarely uses real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. These models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind the poor model performance. A comparative analysis of the models' performance in online and offline settings was also performed, using benchmark synthetic datasets and two masking strategies. Our findings suggest that while developers utilize code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. Upon qualitative analysis of the model's predictions, we found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.  ( 3 min )
    Distribution-Free Fair Federated Learning with Small Samples
    arXiv:2402.16158v1 Announce Type: cross Abstract: As federated learning gains increasing importance in real-world applications due to its capacity for decentralized data training, addressing fairness concerns across demographic groups becomes critically important. However, most existing machine learning algorithms for ensuring fairness are designed for centralized data environments and generally require large-sample and distributional assumptions, underscoring the urgent need for fairness techniques adapted for decentralized and heterogeneous systems with finite-sample and distribution-free guarantees. To address this issue, this paper introduces FedFaiREE, a post-processing algorithm developed specifically for distribution-free fair learning in decentralized settings with small samples. Our approach accounts for unique challenges in decentralized environments, such as client heterogeneity, communication costs, and small sample sizes. We provide rigorous theoretical guarantees for both fairness and accuracy, and our experimental results further provide robust empirical validation for our proposed method.  ( 2 min )
    ChatMusician: Understanding and Generating Music Intrinsically with LLM
    arXiv:2402.16153v1 Announce Type: cross Abstract: While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc, surpassing GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 on zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo in GitHub.  ( 3 min )
    InstructEdit: Instruction-based Knowledge Editing for Large Language Models
    arXiv:2402.16123v1 Announce Type: cross Abstract: Knowledge editing for large language models can offer an efficient solution to alter a model's behavior without negatively impacting the overall performance. However, the current approach encounters issues with limited generalizability across tasks, necessitating one distinct editor for each task, which significantly hinders the broader applications. To address this, we take the first step to analyze the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various task performances simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor's control, leading to an average 14.86% increase in Reliability in multi-task editing setting. Furthermore, experiments involving holdout unseen task illustrate that InstructEdit consistently surpass previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, which unveils that instructions can help control optimization direction with stronger OOD generalization. Code and datasets will be available in https://github.com/zjunlp/EasyEdit.  ( 2 min )
    Key Design Choices in Source-Free Unsupervised Domain Adaptation: An In-depth Empirical Analysis
    arXiv:2402.16090v1 Announce Type: cross Abstract: This study provides a comprehensive benchmark framework for Source-Free Unsupervised Domain Adaptation (SF-UDA) in image classification, aiming to achieve a rigorous empirical understanding of the complex relationships between multiple key design factors in SF-UDA methods. The study empirically examines a diverse set of SF-UDA techniques, assessing their consistency across datasets, sensitivity to specific hyperparameters, and applicability across different families of backbone architectures. Moreover, it exhaustively evaluates pre-training datasets and strategies, particularly focusing on both supervised and self-supervised methods, as well as the impact of fine-tuning on the source domain. Our analysis also highlights gaps in existing benchmark practices, guiding SF-UDA research towards more effective and general approaches. It emphasizes the importance of backbone architecture and pre-training dataset selection on SF-UDA performance, serving as an essential reference and providing key insights. Lastly, we release the source code of our experimental framework. This facilitates the construction, training, and testing of SF-UDA methods, enabling systematic large-scale experimental analysis and supporting further research efforts in this field.  ( 2 min )
    Pfeed: Generating near real-time personalized feeds using precomputed embedding similarities
    arXiv:2402.16073v1 Announce Type: cross Abstract: In personalized recommender systems, embeddings are often used to encode customer actions and items, and retrieval is then performed in the embedding space using approximate nearest neighbor search. However, this approach can lead to two challenges: 1) user embeddings can restrict the diversity of interests captured and 2) the need to keep them up-to-date requires an expensive, real-time infrastructure. In this paper, we propose a method that overcomes these challenges in a practical, industrial setting. The method dynamically updates customer profiles and composes a feed every two minutes, employing precomputed embeddings and their respective similarities. We tested and deployed this method to personalise promotional items at Bol, one of the largest e-commerce platforms of the Netherlands and Belgium. The method enhanced customer engagement and experience, leading to a significant 4.9% uplift in conversions.  ( 2 min )
    Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space
    arXiv:2402.16065v1 Announce Type: cross Abstract: We train a bilingual Arabic-Hebrew language model using a transliterated version of Arabic texts in Hebrew, to ensure both languages are represented in the same script. Given the morphological, structural similarities, and the extensive number of cognates shared among Arabic and Hebrew, we assess the performance of a language model that employs a unified script for both languages, on machine translation which requires cross-lingual knowledge. The results are promising: our model outperforms a contrasting model which keeps the Arabic texts in the Arabic script, demonstrating the efficacy of the transliteration step. Despite being trained on a dataset approximately 60% smaller than that of other existing language models, our model appears to deliver comparable performance in machine translation across both translation directions.  ( 2 min )
    Gradient-enhanced deep Gaussian processes for multifidelity modelling
    arXiv:2402.16059v1 Announce Type: cross Abstract: Multifidelity models integrate data from multiple sources to produce a single approximator for the underlying process. Dense low-fidelity samples are used to reduce interpolation error, while sparse high-fidelity samples are used to compensate for bias or noise in the low-fidelity samples. Deep Gaussian processes (GPs) are attractive for multifidelity modelling as they are non-parametric, robust to overfitting, perform well for small datasets, and, critically, can capture nonlinear and input-dependent relationships between data of different fidelities. Many datasets naturally contain gradient data, especially when they are generated by computational models that are compatible with automatic differentiation or have adjoint solutions. Principally, this work extends deep GPs to incorporate gradient data. We demonstrate this method on an analytical test problem and a realistic partial differential equation problem, where we predict the aerodynamic coefficients of a hypersonic flight vehicle over a range of flight conditions and geometries. In both examples, the gradient-enhanced deep GP outperforms a gradient-enhanced linear GP model and their non-gradient-enhanced counterparts.  ( 2 min )
    LLMs with Chain-of-Thought Are Non-Causal Reasoners
    arXiv:2402.16048v1 Announce Type: cross Abstract: This paper explores the role of the Chain of Thought (CoT) in Large Language Models (LLMs) reasoning. Despite its potential to improve task performance, our analysis reveals a surprising frequency of correct answers following incorrect CoTs and vice versa. We employ causal analysis to assess the cause-effect relationship between CoTs/instructions and answers in LLMs, uncovering the Structural Causal Model (SCM) that LLMs approximate. By comparing the implied SCM with that of human reasoning, we highlight discrepancies between LLM and human reasoning processes. We further examine the factors influencing the causal structure of the implied SCM, revealing that in-context learning, supervised fine-tuning, and reinforcement learning on human feedback significantly impact the causal relations. We release the code and results at https://github.com/StevenZHB/CoT_Causal_Analysis.  ( 2 min )
    Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy
    arXiv:2402.16041v1 Announce Type: cross Abstract: Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. In this paper, we seek to exploit \textit{maximum mean discrepancy} (MMD) to address this issue in the sense that MMD can well identify distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain \textit{multiple text populations} due to various LLMs. This will severely impair MMD's ability to measure the difference between two samples. To tackle this, we propose a novel \textit{multi-population} aware optimization method for MMD called MMD-MP, which can \textit{avoid variance increases} and thus improve the stability to measure the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, \eg, GPT2 and ChatGPT, show superior detection performance of our MMD-MP. The source code is available at \url{https://github.com/ZSHsh98/MMD-MP}.  ( 2 min )
    Deep Learning Approaches for Improving Question Answering Systems in Hepatocellular Carcinoma Research
    arXiv:2402.16038v1 Announce Type: cross Abstract: In recent years, advancements in natural language processing (NLP) have been fueled by deep learning techniques, particularly through the utilization of powerful computing resources like GPUs and TPUs. Models such as BERT and GPT-3, trained on vast amounts of data, have revolutionized language understanding and generation. These pre-trained models serve as robust bases for various tasks including semantic understanding, intelligent writing, and reasoning, paving the way for a more generalized form of artificial intelligence. NLP, as a vital application of AI, aims to bridge the gap between humans and computers through natural language interaction. This paper delves into the current landscape and future prospects of large-scale model-based NLP, focusing on the question-answering systems within this domain. Practical cases and developments in artificial intelligence-driven question-answering systems are analyzed to foster further exploration and research in the realm of large-scale NLP.  ( 2 min )
    Machine Learning-Based Vehicle Intention Trajectory Recognition and Prediction for Autonomous Driving
    arXiv:2402.16036v1 Announce Type: cross Abstract: In recent years, the expansion of internet technology and advancements in automation have brought significant attention to autonomous driving technology. Major automobile manufacturers, including Volvo, Mercedes-Benz, and Tesla, have progressively introduced products ranging from assisted-driving vehicles to semi-autonomous vehicles. However, this period has also witnessed several traffic safety incidents involving self-driving vehicles. For instance, in March 2016, a Google self-driving car was involved in a minor collision with a bus. At the time of the accident, the autonomous vehicle was attempting to merge into the right lane but failed to dynamically respond to the real-time environmental information during the lane change. It incorrectly assumed that the approaching bus would slow down to avoid it, leading to a low-speed collision with the bus. This incident highlights the current technological shortcomings and safety concerns associated with autonomous lane-changing behavior, despite the rapid advancements in autonomous driving technology. Lane-changing is among the most common and hazardous behaviors in highway driving, significantly impacting traffic safety and flow. Therefore, lane-changing is crucial for traffic safety, and accurately predicting drivers' lane change intentions can markedly enhance driving safety. This paper introduces a deep learning-based prediction method for autonomous driving lane change behavior, aiming to facilitate safe lane changes and thereby improve road safety.  ( 2 min )
    HiGPT: Heterogeneous Graph Language Model
    arXiv:2402.16024v1 Announce Type: cross Abstract: Heterogeneous graph learning aims to capture complex relationships and diverse relational semantics among entities in a heterogeneous graph to obtain meaningful representations for nodes and edges. Recent advancements in heterogeneous graph neural networks (HGNNs) have achieved state-of-the-art performance by considering relation heterogeneity and using specialized message functions and aggregation rules. However, existing frameworks for heterogeneous graph learning have limitations in generalizing across diverse heterogeneous graph datasets. Most of these frameworks follow the "pre-train" and "fine-tune" paradigm on the same dataset, which restricts their capacity to adapt to new and unseen data. This raises the question: "Can we generalize heterogeneous graph models to be well-adapted to diverse downstream learning tasks with distribution shifts in both node token sets and relation type heterogeneity?'' To tackle those challenges, we propose HiGPT, a general large graph model with Heterogeneous graph instruction-tuning paradigm. Our framework enables learning from arbitrary heterogeneous graphs without the need for any fine-tuning process from downstream datasets. To handle distribution shifts in heterogeneity, we introduce an in-context heterogeneous graph tokenizer that captures semantic relationships in different heterogeneous graphs, facilitating model adaptation. We incorporate a large corpus of heterogeneity-aware graph instructions into our HiGPT, enabling the model to effectively comprehend complex relation heterogeneity and distinguish between various types of graph tokens. Furthermore, we introduce the Mixture-of-Thought (MoT) instruction augmentation paradigm to mitigate data scarcity by generating diverse and informative instructions. Through comprehensive evaluations, our proposed framework demonstrates exceptional performance in terms of generalization performance.  ( 2 min )
    Unmasking Dementia Detection by Masking Input Gradients: A JSM Approach to Model Interpretability and Precision
    arXiv:2402.16008v1 Announce Type: cross Abstract: The evolution of deep learning and artificial intelligence has significantly reshaped technological landscapes. However, their effective application in crucial sectors such as medicine demands more than just superior performance, but trustworthiness as well. While interpretability plays a pivotal role, existing explainable AI (XAI) approaches often do not reveal {\em Clever Hans} behavior where a model makes (ungeneralizable) correct predictions using spurious correlations or biases in data. Likewise, current post-hoc XAI methods are susceptible to generating unjustified counterfactual examples. In this paper, we approach XAI with an innovative {\em model debugging} methodology realized through Jacobian Saliency Map (JSM). To cast the problem into a concrete context, we employ Alzheimer's disease (AD) diagnosis as the use case, motivated by its significant impact on human lives and the formidable challenge in its early detection, stemming from the intricate nature of its progression. We introduce an interpretable, multimodal model for AD classification over its multi-stage progression, incorporating JSM as a modality-agnostic tool that provides insights into volumetric changes indicative of brain abnormalities. Our extensive evaluation including ablation study manifests the efficacy of using JSM for model debugging and interpretation, while significantly enhancing model accuracy as well.  ( 2 min )
    Adversarial-Robust Transfer Learning for Medical Imaging via Domain Assimilation
    arXiv:2402.16005v1 Announce Type: cross Abstract: In the field of Medical Imaging, extensive research has been dedicated to leveraging its potential in uncovering critical diagnostic features in patients. Artificial Intelligence (AI)-driven medical diagnosis relies on sophisticated machine learning and deep learning models to analyze, detect, and identify diseases from medical images. Despite the remarkable performance of these models, characterized by high accuracy, they grapple with trustworthiness issues. The introduction of a subtle perturbation to the original image empowers adversaries to manipulate the prediction output, redirecting it to other targeted or untargeted classes. Furthermore, the scarcity of publicly available medical images, constituting a bottleneck for reliable training, has led contemporary algorithms to depend on pretrained models grounded on a large set of natural images -- a practice referred to as transfer learning. However, a significant {\em domain discrepancy} exists between natural and medical images, which causes AI models resulting from transfer learning to exhibit heightened {\em vulnerability} to adversarial attacks. This paper proposes a {\em domain assimilation} approach that introduces texture and color adaptation into transfer learning, followed by a texture preservation component to suppress undesired distortion. We systematically analyze the performance of transfer learning in the face of various adversarial attacks under different data modalities, with the overarching goal of fortifying the model's robustness and security in medical imaging tasks. The results demonstrate high effectiveness in reducing attack efficacy, contributing toward more trustworthy transfer learning in biomedical applications.  ( 3 min )
    Post-Quantum Cryptography Neural Network
    arXiv:2402.16002v1 Announce Type: cross Abstract: In recent years, quantum computers and Shor quantum algorithm have posed a threat to current mainstream asymmetric cryptography methods (e.g. RSA and Elliptic Curve Cryptography (ECC)). Therefore, it is necessary to construct a Post-Quantum Cryptography (PQC) method to resist quantum computing attacks. Therefore, this study proposes a PQC-based neural network that maps a code-based PQC method to a neural network structure and enhances the security of ciphertexts with non-linear activation functions, random perturbation of ciphertexts, and uniform distribution of ciphertexts. In practical experiments, this study uses cellular network signals as a case study to demonstrate that encryption and decryption can be performed by the proposed PQC-based neural network with the uniform distribution of ciphertexts. In the future, the proposed PQC-based neural network could be applied to various applications.  ( 2 min )
    Cieran: Designing Sequential Colormaps via In-Situ Active Preference Learning
    arXiv:2402.15997v1 Announce Type: cross Abstract: Quality colormaps can help communicate important data patterns. However, finding an aesthetically pleasing colormap that looks "just right" for a given scenario requires significant design and technical expertise. We introduce Cieran, a tool that allows any data analyst to rapidly find quality colormaps while designing charts within Jupyter Notebooks. Our system employs an active preference learning paradigm to rank expert-designed colormaps and create new ones from pairwise comparisons, allowing analysts who are novices in color design to tailor colormaps to their data context. We accomplish this by treating colormap design as a path planning problem through the CIELAB colorspace with a context-specific reward model. In an evaluation with twelve scientists, we found that Cieran effectively modeled user preferences to rank colormaps and leveraged this model to create new quality designs. Our work shows the potential of active preference learning for supporting efficient visualization design optimization.  ( 2 min )
    Improved Hardness Results for Learning Intersections of Halfspaces
    arXiv:2402.15995v1 Announce Type: cross Abstract: We show strong (and surprisingly simple) lower bounds for weakly learning intersections of halfspaces in the improper setting. Strikingly little is known about this problem. For instance, it is not even known if there is a polynomial-time algorithm for learning the intersection of only two halfspaces. On the other hand, lower bounds based on well-established assumptions (such as approximating worst-case lattice problems or variants of Feige's 3SAT hypothesis) are only known (or are implied by existing results) for the intersection of super-logarithmically many halfspaces [KS09,KS06,DSS16]. With intersections of fewer halfspaces being only ruled out under less standard assumptions [DV21] (such as the existence of local pseudo-random generators with large stretch). We significantly narrow this gap by showing that even learning $\omega(\log \log N)$ halfspaces in dimension $N$ takes super-polynomial time under standard assumptions on worst-case lattice problems (namely that SVP and SIVP are hard to approximate within polynomial factors). Further, we give unconditional hardness results in the statistical query framework. Specifically, we show that for any $k$ (even constant), learning $k$ halfspaces in dimension $N$ requires accuracy $N^{-\Omega(k)}$, or exponentially many queries -- in particular ruling out SQ algorithms with polynomial accuracy for $\omega(1)$ halfspaces. To the best of our knowledge this is the first unconditional hardness result for learning a super-constant number of halfspaces. Our lower bounds are obtained in a unified way via a novel connection we make between intersections of halfspaces and the so-called parallel pancakes distribution [DKS17,BLPR19,BRST21] that has been at the heart of many lower bound constructions in (robust) high-dimensional statistics in the past few years.  ( 3 min )
    Optimizing Portfolio Management and Risk Assessment in Digital Assets Using Deep Learning for Predictive Analysis
    arXiv:2402.15994v1 Announce Type: cross Abstract: Portfolio management issues have been extensively studied in the field of artificial intelligence in recent years, but existing deep learning-based quantitative trading methods have some areas where they could be improved. First of all, the prediction mode of stocks is singular; often, only one trading expert is trained by a model, and the trading decision is solely based on the prediction results of the model. Secondly, the data source used by the model is relatively simple, and only considers the data of the stock itself, ignoring the impact of the whole market risk on the stock. In this paper, the DQN algorithm is introduced into asset management portfolios in a novel and straightforward way, and the performance greatly exceeds the benchmark, which fully proves the effectiveness of the DRL algorithm in portfolio management. This also inspires us to consider the complexity of financial problems, and the use of algorithms should be fully combined with the problems to adapt. Finally, in this paper, the strategy is implemented by selecting the assets and actions with the largest Q value. Since different assets are trained separately as environments, there may be a phenomenon of Q value drift among different assets (different assets have different Q value distribution areas), which may easily lead to incorrect asset selection. Consider adding constraints so that the Q values of different assets share a Q value distribution to improve results.  ( 3 min )
    Phonetic and Lexical Discovery of a Canine Language using HuBERT
    arXiv:2402.15985v1 Announce Type: cross Abstract: This paper delves into the pioneering exploration of potential communication patterns within dog vocalizations and transcends traditional linguistic analysis barriers, which heavily relies on human priori knowledge on limited datasets to find sound units in dog vocalization. We present a self-supervised approach with HuBERT, enabling the accurate classification of phoneme labels and the identification of vocal patterns that suggest a rudimentary vocabulary within dog vocalizations. Our findings indicate a significant acoustic consistency in these identified canine vocabulary, covering the entirety of observed dog vocalization sequences. We further develop a web-based dog vocalization labeling system. This system can highlight phoneme n-grams, present in the vocabulary, in the dog audio uploaded by users.  ( 2 min )
    Towards Fair Graph Anomaly Detection: Problem, New Datasets, and Evaluation
    arXiv:2402.15988v1 Announce Type: cross Abstract: The Fair Graph Anomaly Detection (FairGAD) problem aims to accurately detect anomalous nodes in an input graph while ensuring fairness and avoiding biased predictions against individuals from sensitive subgroups such as gender or political leanings. Fairness in graphs is particularly crucial in anomaly detection areas such as misinformation detection in search/ranking systems, where decision outcomes can significantly affect individuals. However, the current literature does not comprehensively discuss this problem, nor does it provide realistic datasets that encompass actual graph structures, anomaly labels, and sensitive attributes for research in FairGAD. To bridge this gap, we introduce a formal definition of the FairGAD problem and present two novel graph datasets constructed from the globally prominent social media platforms Reddit and Twitter. These datasets comprise 1.2 million and 400,000 edges associated with 9,000 and 47,000 nodes, respectively, and leverage political leanings as sensitive attributes and misinformation spreaders as anomaly labels. We demonstrate that our FairGAD datasets significantly differ from the synthetic datasets used currently by the research community. These new datasets offer significant values for FairGAD by providing realistic data that captures the intricacies of social networks. Using our datasets, we investigate the performance-fairness trade-off in eleven existing GAD and non-graph AD methods on five state-of-the-art fairness methods, which sheds light on their effectiveness and limitations in addressing the FairGAD problem.  ( 3 min )
    Implementing Recycling Methods for Linear Systems in Python with an Application to Multiple Objective Optimization
    arXiv:2402.15941v1 Announce Type: cross Abstract: Sequences of linear systems arise in the predictor-corrector method when computing the Pareto front for multi-objective optimization. Rather than discarding information generated when solving one system, it may be advantageous to recycle information for subsequent systems. To accomplish this, we seek to reduce the overall cost of computation when solving linear systems using common recycling methods. In this work, we assessed the performance of recycling minimum residual (RMINRES) method along with a map between coefficient matrices. For these methods to be fully integrated into the software used in Enouen et al. (2022), there must be working version of each in both Python and PyTorch. Herein, we discuss the challenges we encountered and solutions undertaken (and some ongoing) when computing efficient Python implementations of these recycling strategies. The goal of this project was to implement RMINRES in Python and PyTorch and add it to the established Pareto front code to reduce computational cost. Additionally, we wanted to implement the sparse approximate maps code in Python and PyTorch, so that it can be parallelized in future work.  ( 2 min )
    Deep Separable Spatiotemporal Learning for Fast Dynamic Cardiac MRI
    arXiv:2402.15939v1 Announce Type: cross Abstract: Dynamic magnetic resonance imaging (MRI) plays an indispensable role in cardiac diagnosis. To enable fast imaging, the k-space data can be undersampled but the image reconstruction poses a great challenge of high-dimensional processing. This challenge leads to necessitate extensive training data in many deep learning reconstruction methods. This work proposes a novel and efficient approach, leveraging a dimension-reduced separable learning scheme that excels even with highly limited training data. We further integrate it with spatiotemporal priors to develop a Deep Separable Spatiotemporal Learning network (DeepSSL), which unrolls an iteration process of a reconstruction model with both temporal low-rankness and spatial sparsity. Intermediate outputs are visualized to provide insights into the network's behavior and enhance its interpretability. Extensive results on cardiac cine datasets show that the proposed DeepSSL is superior to the state-of-the-art methods visually and quantitatively, while reducing the demand for training cases by up to 75%. And its preliminary adaptability to cardiac patients has been verified through experienced radiologists' and cardiologists' blind reader study. Additionally, DeepSSL also benefits for achieving the downstream task of cardiac segmentation with higher accuracy and shows robustness in prospective real-time cardiac MRI.  ( 2 min )
    Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
    arXiv:2402.15938v1 Announce Type: cross Abstract: Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD necessitates only the sampled texts to detect data contamination, by identifying the peakedness of LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on the correction of LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves the average relative improvements of 21.8\%-30.2\% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect contamination caused by the variants of test data. TED significantly mitigates performance improvements up to 66.9\% attributed to data contamination across 24 settings and 21 contamination degrees. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on HumanEval benchmark.  ( 3 min )
    Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA
    arXiv:2402.15933v1 Announce Type: cross Abstract: In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hampers the generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in ScanQA and SQA dataset). Current approaches resort supplement 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines 2D and 3D modalities and captures fine-grained correlations between modalities, allowing them mutually augmenting each other. Integrating proposed mechanisms above, we present BridgeQA, that offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at $\href{https://github.com/matthewdm0816/BridgeQA}{\text{this URL}}$.  ( 2 min )
    QuaCer-C: Quantitative Certification of Knowledge Comprehension in LLMs
    arXiv:2402.15929v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive performance on several benchmarks. However, traditional studies do not provide formal guarantees on the performance of LLMs. In this work, we propose a novel certification framework for LLM, QuaCer-C, wherein we formally certify the knowledge-comprehension capabilities of popular LLMs. Our certificates are quantitative - they consist of high-confidence, tight bounds on the probability that the target LLM gives the correct answer on any relevant knowledge comprehension prompt. Our certificates for the Llama, Vicuna, and Mistral LLMs indicate that the knowledge comprehension capability improves with an increase in the number of parameters and that the Mistral model is less performant than the rest in this evaluation.  ( 2 min )
    Explainable Contrastive and Cost-Sensitive Learning for Cervical Cancer Classification
    arXiv:2402.15905v1 Announce Type: cross Abstract: This paper proposes an efficient system for classifying cervical cancer cells using pre-trained convolutional neural networks (CNNs). We first fine-tune five pre-trained CNNs and minimize the overall cost of misclassification by prioritizing accuracy for certain classes that have higher associated costs or importance. To further enhance the performance of the models, supervised contrastive learning is included to make the models more adept at capturing important features and patterns. Extensive experimentation are conducted to evaluate the proposed system on the SIPaKMeD dataset. The experimental results demonstrate the effectiveness of the developed system, achieving an accuracy of 97.29%. To make our system more trustworthy, we have employed several explainable AI techniques to interpret how the models reached a specific decision. The implementation of the system can be found at - https://github.com/isha-67/CervicalCancerStudy.  ( 2 min )
    Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning
    arXiv:2402.15893v1 Announce Type: cross Abstract: Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints.  ( 3 min )
    Statistical Games
    arXiv:2402.15892v1 Announce Type: cross Abstract: This work contains the mathematical exploration of a few prototypical games in which central concepts from statistics and probability theory naturally emerge. The first two kinds of games are termed Fisher and Bayesian games, which are connected to Frequentist and Bayesian statistics, respectively. Later, a more general type of game is introduced, termed Statistical game, in which a further parameter, the players' relative risk aversion, can be set. In this work, we show that Fisher and Bayesian games can be viewed as limiting cases of Statistical games. Therefore, Statistical games can be viewed as a unified framework, incorporating both Frequentist and Bayesian statistics. Furthermore, a philosophical framework is (re-)presented -- often referred to as minimax regret criterion -- as a general approach to decision making. The main motivation for this work was to embed Bayesian statistics into a broader decision-making framework, where, based on collected data, actions with consequences have to be made, which can be translated to utilities (or rewards/losses) of the decision-maker. The work starts with the simplest possible toy model, related to hypothesis testing and statistical inference. This choice has two main benefits: i.) it allows us to determine (conjecture) the behaviour of the equilibrium strategies in various limiting cases ii.) this way, we can introduce Statistical games without requiring additional stochastic parameters. The work contains game theoretical methods related to two-player, non-cooperative games to determine and prove equilibrium strategies of Fisher, Bayesian and Statistical games. It also relies on analytical tools for derivations concerning various limiting cases.  ( 2 min )
    Prompt Perturbation Consistency Learning for Robust Language Models
    arXiv:2402.15833v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on the robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations - oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments demonstrate that PPCL can recover on average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats the data augmentation approach while using ten times fewer augmented data samples.  ( 2 min )
    Debiased Model-based Interactive Recommendation
    arXiv:2402.15819v1 Announce Type: cross Abstract: Existing model-based interactive recommendation systems are trained by querying a world model to capture the user preference, but learning the world model from historical logged data will easily suffer from bias issues such as popularity bias and sampling bias. This is why some debiased methods have been proposed recently. However, two essential drawbacks still remain: 1) ignoring the dynamics of the time-varying popularity results in a false reweighting of items. 2) taking the unknown samples as negative samples in negative sampling results in the sampling bias. To overcome these two drawbacks, we develop a model called \textbf{i}dentifiable \textbf{D}ebiased \textbf{M}odel-based \textbf{I}nteractive \textbf{R}ecommendation (\textbf{iDMIR} in short). In iDMIR, for the first drawback, we devise a debiased causal world model based on the causal mechanism of the time-varying recommendation generation process with identification guarantees; for the second drawback, we devise a debiased contrastive policy, which coincides with the debiased contrastive learning and avoids sampling bias. Moreover, we demonstrate that the proposed method not only outperforms several latest interactive recommendation algorithms but also enjoys diverse recommendation performance.  ( 2 min )
    A Theoretical Result on the Inductive Bias of RNN Language Models
    arXiv:2402.15814v1 Announce Type: cross Abstract: Recent work by Hewitt et al. (2020) provides a possible interpretation of the empirical success of recurrent neural networks (RNNs) as language models (LMs). It shows that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language. This suggests that RNNs' success might be linked to their ability to model hierarchy. However, a closer inspection of Hewitt et al.'s (2020) construction shows that it is not limited to hierarchical LMs, posing the question of what \emph{other classes} of LMs can be efficiently represented by RNNs. To this end, we generalize their construction to show that RNNs can efficiently represent a larger class of LMs: Those that can be represented by a pushdown automaton with a bounded stack and a generalized stack update function. This is analogous to an automaton that keeps a memory of a fixed number of symbols and updates the memory with a simple update mechanism. Altogether, the efficiency in representing a diverse class of non-hierarchical LMs posits a lack of concrete cognitive and human-language-centered inductive biases in RNNs.  ( 2 min )
    OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining
    arXiv:2402.15810v1 Announce Type: cross Abstract: With the rapid proliferation of scientific literature, versatile academic knowledge services increasingly rely on comprehensive academic graph mining. Despite the availability of public academic graphs, benchmarks, and datasets, these resources often fall short in multi-aspect and fine-grained annotations, are constrained to specific task types and domains, or lack underlying real academic graphs. In this paper, we present OAG-Bench, a comprehensive, multi-aspect, and fine-grained human-curated benchmark based on the Open Academic Graph (OAG). OAG-Bench covers 10 tasks, 20 datasets, 70+ baselines, and 120+ experimental results to date. We propose new data annotation strategies for certain tasks and offer a suite of data pre-processing codes, algorithm implementations, and standardized evaluation protocols to facilitate academic graph mining. Extensive experiments reveal that even advanced algorithms like large language models (LLMs) encounter difficulties in addressing key challenges in certain tasks, such as paper source tracing and scholar profiling. We also introduce the Open Academic Graph Challenge (OAG-Challenge) to encourage community input and sharing. We envisage that OAG-Bench can serve as a common ground for the community to evaluate and compare algorithms in academic graph mining, thereby accelerating algorithm development and advancement in this field. OAG-Bench is accessible at https://www.aminer.cn/data/.  ( 2 min )
    Analysis of Off-Policy Multi-Step TD-Learning with Linear Function Approximation
    arXiv:2402.15781v1 Announce Type: cross Abstract: This paper analyzes multi-step TD-learning algorithms within the `deadly triad' scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that n-step TD-learning algorithms converge to a solution as the sampling horizon n increases sufficiently. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration, gradient descent algorithms, and the control theoretic approach, which can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when n is sufficiently large. Based on these findings, two n-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the gradient and control theoretic algorithms.  ( 2 min )
    ArEEG_Chars: Dataset for Envisioned Speech Recognition using EEG for Arabic Characters
    arXiv:2402.15733v1 Announce Type: cross Abstract: Brain-Computer-Interface (BCI) has been a hot research topic in the last few years that could help paralyzed people in their lives. Several researches were done to classify electroencephalography (EEG) signals automatically into English characters and words. Arabic language is one of the most used languages around the world. However, to the best of our knowledge, there is no dataset for Arabic characters EEG signals. In this paper, we have created an EEG dataset for Arabic characters and named it ArEEG_Chars. Moreover, several experiments were done on ArEEG_Chars using deep learning. Best results were achieved using LSTM and reached an accuracy of 97%. ArEEG_Chars dataset will be public for researchers.  ( 2 min )
    A Duality Analysis of Kernel Ridge Regression in the Noiseless Regime
    arXiv:2402.15718v1 Announce Type: cross Abstract: In this paper, we conduct a comprehensive analysis of generalization properties of Kernel Ridge Regression (KRR) in the noiseless regime, a scenario crucial to scientific computing, where data are often generated via computer simulations. We prove that KRR can attain the minimax optimal rate, which depends on both the eigenvalue decay of the associated kernel and the relative smoothness of target functions. Particularly, when the eigenvalue decays exponentially fast, KRR achieves the spectral accuracy, i.e., a convergence rate faster than any polynomial. Moreover, the numerical experiments well corroborate our theoretical findings. Our proof leverages a novel extension of the duality framework introduced by Chen et al. (2023), which could be useful in analyzing kernel-based methods beyond the scope of this work.  ( 2 min )
    Bagged Deep Image Prior for Recovering Images in the Presence of Speckle Noise
    arXiv:2402.15635v1 Announce Type: cross Abstract: We investigate both the theoretical and algorithmic aspects of likelihood-based methods for recovering a complex-valued signal from multiple sets of measurements, referred to as looks, affected by speckle (multiplicative) noise. Our theoretical contributions include establishing the first existing theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our theoretical results capture the dependence of MSE upon the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we introduce the concept of bagged Deep Image Priors (Bagged-DIP) and integrate them with projected gradient descent. Furthermore, we show how employing Newton-Schulz algorithm for calculating matrix inverses within the iterations of PGD reduces the computational complexity of the algorithm. We will show that this method achieves the state-of-the-art performance.  ( 2 min )
    Learning Cyclic Causal Models from Incomplete Data
    arXiv:2402.15625v1 Announce Type: cross Abstract: Causal learning is a fundamental problem in statistics and science, offering insights into predicting the effects of unseen treatments on a system. Despite recent advances in this topic, most existing causal discovery algorithms operate under two key assumptions: (i) the underlying graph is acyclic, and (ii) the available data is complete. These assumptions can be problematic as many real-world systems contain feedback loops (e.g., biological systems), and practical scenarios frequently involve missing data. In this work, we propose a novel framework, named MissNODAGS, for learning cyclic causal graphs from partially missing data. Under the additive noise model, MissNODAGS learns the causal graph by alternating between imputing the missing data and maximizing the expected log-likelihood of the visible part of the data in each training step, following the principles of the expectation-maximization (EM) framework. Through synthetic experiments and real-world single-cell perturbation data, we demonstrate improved performance when compared to using state-of-the-art imputation techniques followed by causal learning on partially missing interventional data.  ( 2 min )
    Language-Based User Profiles for Recommendation
    arXiv:2402.15623v1 Announce Type: cross Abstract: Most conventional recommendation methods (e.g., matrix factorization) represent user profiles as high-dimensional vectors. Unfortunately, these vectors lack interpretability and steerability, and often perform poorly in cold-start settings. To address these shortcomings, we explore the use of user profiles that are represented as human-readable text. We propose the Language-based Factorization Model (LFM), which is essentially an encoder/decoder model where both the encoder and the decoder are large language models (LLMs). The encoder LLM generates a compact natural-language profile of the user's interests from the user's rating history. The decoder LLM uses this summary profile to complete predictive downstream tasks. We evaluate our LFM approach on the MovieLens dataset, comparing it against matrix factorization and an LLM model that directly predicts from the user's rating history. In cold-start settings, we find that our method can have higher accuracy than matrix factorization. Furthermore, we find that generating a compact and human-readable summary often performs comparably with or better than direct LLM prediction, while enjoying better interpretability and shorter model input length. Our results motivate a number of future research directions and potential improvements.  ( 2 min )
    Data/moment-driven approaches for fast predictive control of collective dynamics
    arXiv:2402.15611v1 Announce Type: cross Abstract: Feedback control synthesis for large-scale particle systems is reviewed in the framework of model predictive control (MPC). The high-dimensional character of collective dynamics hampers the performance of traditional MPC algorithms based on fast online dynamic optimization at every time step. Two alternatives to MPC are proposed. First, the use of supervised learning techniques for the offline approximation of optimal feedback laws is discussed. Then, a procedure based on sequential linearization of the dynamics based on macroscopic quantities of the particle ensemble is reviewed. Both approaches circumvent the online solution of optimal control problems enabling fast, real-time, feedback synthesis for large-scale particle systems. Numerical experiments assess the performance of the proposed algorithms.  ( 2 min )
    Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions
    arXiv:2402.15602v1 Announce Type: cross Abstract: We study the asymptotic error of score-based diffusion model sampling in large-sample scenarios from a non-parametric statistics perspective. We show that a kernel-based score estimator achieves an optimal mean square error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}}(t^{\frac{d}{2}} \vee 1)\right)$ for the score function of $p_0*\mathcal{N}(0,t\boldsymbol{I}_d)$, where $n$ and $d$ represent the sample size and the dimension, $t$ is bounded above and below by polynomials of $n$, and $p_0$ is an arbitrary sub-Gaussian distribution. As a consequence, this yields an $\widetilde{O}\left(n^{-1/2} t^{-\frac{d}{4}}\right)$ upper bound for the total variation error of the distribution of the sample generated by the diffusion model under a mere sub-Gaussian assumption. If in addition, $p_0$ belongs to the nonparametric family of the $\beta$-Sobolev space with $\beta\le 2$, by adopting an early stopping strategy, we obtain that the diffusion model is nearly (up to log factors) minimax optimal. This removes the crucial lower bound assumption on $p_0$ in previous proofs of the minimax optimality of the diffusion model for nonparametric families.  ( 2 min )
    Neural optimal controller for stochastic systems via pathwise HJB operator
    arXiv:2402.15592v1 Announce Type: cross Abstract: The aim of this work is to develop deep learning-based algorithms for high-dimensional stochastic control problems based on physics-informed learning and dynamic programming. Unlike classical deep learning-based methods relying on a probabilistic representation of the solution to the Hamilton--Jacobi--Bellman (HJB) equation, we introduce a pathwise operator associated with the HJB equation so that we can define a problem of physics-informed learning. According to whether the optimal control has an explicit representation, two numerical methods are proposed to solve the physics-informed learning problem. We provide an error analysis on how the truncation, approximation and optimization errors affect the accuracy of these methods. Numerical results on various applications are presented to illustrate the performance of the proposed algorithms.  ( 2 min )
    Prompting LLMs to Compose Meta-Review Drafts from Peer-Review Narratives of Scholarly Manuscripts
    arXiv:2402.15589v1 Announce Type: cross Abstract: One of the most important yet onerous tasks in the academic peer-reviewing process is composing meta-reviews, which involves understanding the core contributions, strengths, and weaknesses of a scholarly manuscript based on peer-review narratives from multiple experts and then summarizing those multiple experts' perspectives into a concise holistic overview. Given the latest major developments in generative AI, especially Large Language Models (LLMs), it is very compelling to rigorously study the utility of LLMs in generating such meta-reviews in an academic peer-review setting. In this paper, we perform a case study with three popular LLMs, i.e., GPT-3.5, LLaMA2, and PaLM2, to automatically generate meta-reviews by prompting them with different types/levels of prompts based on the recently proposed TELeR taxonomy. Finally, we perform a detailed qualitative study of the meta-reviews generated by the LLMs and summarize our findings and recommendations for prompting LLMs for this complex task.  ( 2 min )
    State Space Models for Event Cameras
    arXiv:2402.15584v1 Announce Type: cross Abstract: Today, state-of-the-art deep neural networks that process event-camera data first convert a temporal window of events into dense, grid-like input representations. As such, they exhibit poor generalizability when deployed at higher inference frequencies (i.e., smaller temporal windows) than the ones they were trained on. We address this challenge by introducing state-space models (SSMs) with learnable timescale parameters to event-based vision. This design adapts to varying frequencies without the need to retrain the network at different frequencies. Additionally, we investigate two strategies to counteract aliasing effects when deploying the model at higher frequencies. We comprehensively evaluate our approach against existing methods based on RNN and Transformer architectures across various benchmarks, including Gen1 and 1 Mpx event camera datasets. Our results demonstrate that SSM-based models train 33% faster and also exhibit minimal performance degradation when tested at higher frequencies than the training input. Traditional RNN and Transformer models exhibit performance drops of more than 20 mAP, with SSMs having a drop of 3.31 mAP, highlighting the effectiveness of SSMs in event-based vision tasks.  ( 2 min )
    Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving
    arXiv:2402.15583v1 Announce Type: cross Abstract: Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to significant changes in the appearance and shape of each instance captured by the camera at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence robust to the change in distance and perspective. The learned representation aids in instance-level correspondence across multiple input frames in downstream tasks. In the pretraining stage, the raw point clouds from LiDAR sensors are utilized to construct the long-term temporal correspondence for each instance, which serves as guidance for the extraction of instance-level representation from the vision-based bird's eye-view (BEV) feature map. Cohere3D encourages a consistent representation for the same instance at different frames but distinguishes between representations of different instances. We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks. Results show a notable improvement in both data efficiency and task performance.  ( 3 min )
    Closing the AI generalization gap by adjusting for dermatology condition distribution differences across clinical settings
    arXiv:2402.15566v1 Announce Type: cross Abstract: Recently, there has been great progress in the ability of artificial intelligence (AI) algorithms to classify dermatological conditions from clinical photographs. However, little is known about the robustness of these algorithms in real-world settings where several factors can lead to a loss of generalizability. Understanding and overcoming these limitations will permit the development of generalizable AI that can aid in the diagnosis of skin conditions across a variety of clinical settings. In this retrospective study, we demonstrate that differences in skin condition distribution, rather than in demographics or image capture mode are the main source of errors when an AI algorithm is evaluated on data from a previously unseen source. We demonstrate a series of steps to close this generalization gap, requiring progressively more information about the new source, ranging from the condition distribution to training data enriched for data less frequently seen during training. Our results also suggest comparable performance from end-to-end fine tuning versus fine tuning solely the classification layer on top of a frozen embedding model. Our approach can inform the adaptation of AI algorithms to new settings, based on the information and resources available.  ( 3 min )
    Toward Fully Self-Supervised Multi-Pitch Estimation
    arXiv:2402.15569v1 Announce Type: cross Abstract: Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.  ( 2 min )
    HiMAP: Learning Heuristics-Informed Policies for Large-Scale Multi-Agent Pathfinding
    arXiv:2402.15546v1 Announce Type: cross Abstract: Large-scale multi-agent pathfinding (MAPF) presents significant challenges in several areas. As systems grow in complexity with a multitude of autonomous agents operating simultaneously, efficient and collision-free coordination becomes paramount. Traditional algorithms often fall short in scalability, especially in intricate scenarios. Reinforcement Learning (RL) has shown potential to address the intricacies of MAPF; however, it has also been shown to struggle with scalability, demanding intricate implementation, lengthy training, and often exhibiting unstable convergence, limiting its practical application. In this paper, we introduce Heuristics-Informed Multi-Agent Pathfinding (HiMAP), a novel scalable approach that employs imitation learning with heuristic guidance in a decentralized manner. We train on small-scale instances using a heuristic policy as a teacher that maps each single agent observation information to an action probability distribution. During pathfinding, we adopt several inference techniques to improve performance. With a simple training scheme and implementation, HiMAP demonstrates competitive results in terms of success rate and scalability in the field of imitation-learning-only MAPF, showing the potential of imitation-learning-only MAPF equipped with inference techniques.  ( 2 min )
    Streaming IoT Data and the Quantum Edge: A Classic/Quantum Machine Learning Use Case
    arXiv:2402.15542v1 Announce Type: cross Abstract: With the advent of the Post-Moore era, the scientific community is faced with the challenge of addressing the demands of current data-intensive machine learning applications, which are the cornerstone of urgent analytics in distributed computing. Quantum machine learning could be a solution for the increasing demand of urgent analytics, providing potential theoretical speedups and increased space efficiency. However, challenges such as (1) the encoding of data from the classical to the quantum domain, (2) hyperparameter tuning, and (3) the integration of quantum hardware into a distributed computing continuum limit the adoption of quantum machine learning for urgent analytics. In this work, we investigate the use of Edge computing for the integration of quantum machine learning into a distributed computing continuum, identifying the main challenges and possible solutions. Furthermore, exploring the data encoding and hyperparameter tuning challenges, we present preliminary results for quantum machine learning analytics on an IoT scenario.  ( 2 min )
    Evaluating the Performance of ChatGPT for Spam Email Detection
    arXiv:2402.15537v1 Announce Type: cross Abstract: Email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. Nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. Consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. Recent advancements in natural language processing, particularly with large language models like ChatGPT, have shown remarkable performance in tasks such as question answering and text generation. However, its potential in spam identification remains underexplored. To fill in the gap, this study attempts to evaluate ChatGPT's capabilities for spam identification in both English and Chinese email datasets. We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations. We also investigate how the training example size affects the performance of ChatGPT. For comparison, we also implement five popular benchmark methods, including naive Bayes, support vector machines (SVM), logistic regression (LR), feedforward dense neural networks (DNN), and BERT classifiers. Though extensive experiments, the performance of ChatGPT is significantly worse than deep supervised learning methods in the large English dataset, while it presents superior performance on the low-resourced Chinese dataset, even outperforming BERT in this case.  ( 2 min )
    DiCoM -- Diverse Concept Modeling towards Enhancing Generalizability in Chest X-Ray Studies
    arXiv:2402.15534v1 Announce Type: cross Abstract: Chest X-Ray (CXR) is a widely used clinical imaging modality and has a pivotal role in the diagnosis and prognosis of various lung and heart related conditions. Conventional automated clinical diagnostic tool design strategies relying on radiology reads and supervised learning, entail the cumbersome requirement of high quality annotated training data. To address this challenge, self-supervised pre-training has proven to outperform supervised pre-training in numerous downstream vision tasks, representing a significant breakthrough in the field. However, medical imaging pre-training significantly differs from pre-training with natural images (e.g., ImageNet) due to unique attributes of clinical images. In this context, we introduce Diverse Concept Modeling (DiCoM), a novel self-supervised training paradigm that leverages a student teacher framework for learning diverse concepts and hence effective representation of the CXR data. Hence, expanding beyond merely modeling a single primary label within an image, instead, effectively harnessing the information from all the concepts inherent in the CXR. The pre-trained model is subsequently fine-tuned to address diverse domain-specific tasks. Our proposed paradigm consistently demonstrates robust performance across multiple downstream tasks on multiple datasets, highlighting the success and generalizability of the pre-training strategy. To establish the efficacy of our methods we analyze both the power of learned representations and the speed of convergence (SoC) of our models. For diverse data and tasks, DiCoM is able to achieve in most cases better results compared to other state-of-the-art pre-training strategies. This when combined with the higher SoC and generalization capabilities positions DiCoM to be established as a foundation model for CXRs, a widely used imaging modality.  ( 3 min )
    Chain-of-Specificity: An Iteratively Refining Method for Eliciting Knowledge from Large Language Models
    arXiv:2402.15526v1 Announce Type: cross Abstract: Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information. Despite these advancements, previous research found that LLMs sometimes struggle with adhering to specific constraints (e.g., in specific place or at specific time), at times even overlooking them, which leads to responses that are either too generic or not fully satisfactory. Existing approaches attempted to address this issue by decomposing or rewriting input instructions, yet they fall short in adequately emphasizing specific constraints and in unlocking the underlying knowledge (e.g., programming within the context of software development). In response, this paper proposes a simple yet effective method named Chain-of-Specificity (CoS). Specifically, CoS iteratively emphasizes the specific constraints in the input instructions, unlocks knowledge within LLMs, and refines responses. Experiments conducted on publicly available and self-build complex datasets demonstrate that CoS outperforms existing methods in enhancing generated content especially for the specificity. Besides, as the number of specific constraints increase, other baselines falter, while CoS still performs well. Moreover, we show that distilling responses generated by CoS effectively enhances the ability of smaller models to follow the constrained instructions. Resources of this paper will be released for further research.  ( 2 min )
    Graph Pruning for Enumeration of Minimal Unsatisfiable Subsets
    arXiv:2402.15524v1 Announce Type: cross Abstract: Finding Minimal Unsatisfiable Subsets (MUSes) of binary constraints is a common problem in infeasibility analysis of over-constrained systems. However, because of the exponential search space of the problem, enumerating MUSes is extremely time-consuming in real applications. In this work, we propose to prune formulas using a learned model to speed up MUS enumeration. We represent formulas as graphs and then develop a graph-based learning model to predict which part of the formula should be pruned. Importantly, our algorithm does not require data labeling by only checking the satisfiability of pruned formulas. It does not even require training data from the target application because it extrapolates to data with different distributions. In our experiments we combine our algorithm with existing MUS enumerators and validate its effectiveness in multiple benchmarks including a set of real-world problems outside our training distribution. The experiment results show that our method significantly accelerates MUS enumeration on average on these benchmark problems.  ( 2 min )
    Deep Networks Always Grok and Here is Why
    arXiv:2402.15555v1 Announce Type: new Abstract: Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error. Previous studies have reported the occurrence of grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a Resnet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust, long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on a new measure of the local complexity of a DNN's input-output mapping. Our local complexity measures the density of the so-called 'linear regions' (aka, spline partition regions) that tile the DNN input space, and serves as a utile progress measure for training. We provide the first evidence that for classification problems, the linear regions undergo a phase transition during training whereafter they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs post phase transition as a robust partition of the input space emerges thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial  ( 3 min )
    Training Nonlinear Transformers for Efficient In-Context Learning: A Theoretical Learning and Generalization Analysis
    arXiv:2402.15607v1 Announce Type: new Abstract: Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity is mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects the ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments.  ( 2 min )
    Smooth and Sparse Latent Dynamics in Operator Learning with Jerk Regularization
    arXiv:2402.15636v1 Announce Type: new Abstract: Spatiotemporal modeling is critical for understanding complex systems across various scientific and engineering disciplines, but governing equations are often not fully known or computationally intractable due to inherent system complexity. Data-driven reduced-order models (ROMs) offer a promising approach for fast and accurate spatiotemporal forecasting by computing solutions in a compressed latent space. However, these models often neglect temporal correlations between consecutive snapshots when constructing the latent space, leading to suboptimal compression, jagged latent trajectories, and limited extrapolation ability over time. To address these issues, this paper introduces a continuous operator learning framework that incorporates jerk regularization into the learning of the compressed latent space. This jerk regularization promotes smoothness and sparsity of latent space dynamics, which not only yields enhanced accuracy and convergence speed but also helps identify intrinsic latent space coordinates. Consisting of an implicit neural representation (INR)-based autoencoder and a neural ODE latent dynamics model, the framework allows for inference at any desired spatial or temporal resolution. The effectiveness of this framework is demonstrated through a two-dimensional unsteady flow problem governed by the Navier-Stokes equations, highlighting its potential to expedite high-fidelity simulations in various scientific and engineering applications.  ( 2 min )
    Foundation Policies with Hilbert Representations
    arXiv:2402.15567v1 Announce Type: new Abstract: Unsupervised and self-supervised objectives, such as next token prediction, have enabled pre-training generalist models from large amounts of unlabeled data. In reinforcement learning (RL), however, finding a truly general and scalable unsupervised pre-training objective for generalist policies from offline data remains a major open question. While a number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning, such methods remain limited in terms of either the diversity of the discovered behaviors, the need for high-quality demonstration data, or the lack of a clear prompting or adaptation mechanism for downstream tasks. In this work, we propose a novel unsupervised framework to pre-train generalist policies that capture diverse, optimal, long-horizon behaviors from unlabeled offline data such that they can be quickly adapted to any arbitrary new tasks in a zero-shot manner. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks. Through our experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, even often outperforming prior methods designed specifically for each setting. Our code and videos are available at https://seohong.me/projects/hilp/  ( 2 min )
    Differentially Private Fair Binary Classifications
    arXiv:2402.15603v1 Announce Type: new Abstract: In this work, we investigate binary classification under the constraints of both differential privacy and fairness. We first propose an algorithm based on the decoupling technique for learning a classifier with only fairness guarantee. This algorithm takes in classifiers trained on different demographic groups and generates a single classifier satisfying statistical parity. We then refine this algorithm to incorporate differential privacy. The performance of the final algorithm is rigorously examined in terms of privacy, fairness, and utility guarantees. Empirical evaluations conducted on the Adult and Credit Card datasets illustrate that our algorithm outperforms the state-of-the-art in terms of fairness guarantees, while maintaining the same level of privacy and utility.  ( 2 min )
    Machine Learning-Based Completions Sequencing for Well Performance Optimization
    arXiv:2402.15608v1 Announce Type: new Abstract: Establishing accurate field development parameters to optimize long-term oil production takes time and effort due to the complexity of oil well development, and the uncertainty in estimating long-term well production. Traditionally, oil and gas companies use simulation software that are inherently computationally expensive to forecast production. Thus, machine learning approaches are recently utilized in literature as an efficient alternative to optimize well developments by enhancing completion conditions. The primary goal of this project is to develop effective machine-learning models that can integrate the effects of multidimensional predictive variables (i.e., completion conditions) to predict 12-Month Cumulative Production accurately. Three predictive regression machine learning models are implemented for predicting 12-month cumulative oil production: Random Forest, Gradient Boosting, and Long Short-Term Memory Models. All three models yielded cumulative production predictions with root mean squared error (RMSE ) values ranging from 7.35 to 20.01 thousand barrels of oil. Although we hypothesized that all models would yield accurate predictions, the results indicated a crucial need for further refinement to create reliable and rational predictive tools in the subsurface. While this study did not produce optimal models for completion sequencing to maximize long-term production, we established that machine learning models alone are not self-sufficient for problems of this nature. Hence, there is potential for significant improvement, including comprehensive feature engineering, and a recommendation of exploring the use of hybrid or surrogate models (i.e., coupling physics reduced models and machine learning models), to ascertain significant contribution to the progress of completion sequencing workflows.  ( 3 min )
    Fair Resource Allocation in Multi-Task Learning
    arXiv:2402.15638v1 Announce Type: new Abstract: By jointly learning multiple tasks, multi-task learning (MTL) can leverage the shared knowledge across tasks, resulting in improved data efficiency and generalization performance. However, a major challenge in MTL lies in the presence of conflicting gradients, which can hinder the fair optimization of some tasks and subsequently impede MTL's ability to achieve better overall performance. Inspired by fair resource allocation in communication networks, we formulate the optimization of MTL as a utility maximization problem, where the loss decreases across tasks are maximized under different fairness measurements. To solve this problem, we propose FairGrad, a novel MTL optimization method. FairGrad not only enables flexible emphasis on certain tasks but also achieves a theoretical convergence guarantee. Extensive experiments demonstrate that our method can achieve state-of-the-art performance among gradient manipulation methods on a suite of multi-task benchmarks in supervised learning and reinforcement learning. Furthermore, we incorporate the idea of $\alpha$-fairness into loss functions of various MTL methods. Extensive empirical studies demonstrate that their performance can be significantly enhanced. Code is provided at \url{https://github.com/OptMN-Lab/fairgrad}.  ( 2 min )
    Towards Efficient Active Learning in NLP via Pretrained Representations
    arXiv:2402.15613v1 Announce Type: new Abstract: Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications. When labeled documents are scarce, active learning helps save annotation efforts but requires retraining of massive models on each acquisition iteration. We drastically expedite this process by using pretrained representations of LLMs within the active learning loop and, once the desired amount of labeled data is acquired, fine-tuning that or even a different pretrained LLM on this labeled data to achieve the best performance. As verified on common text classification benchmarks with pretrained BERT and RoBERTa as the backbone, our strategy yields similar performance to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive. The data acquired with our procedure generalizes across pretrained networks, allowing flexibility in choosing the final model or updating it as newer versions get released.  ( 2 min )
    Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency
    arXiv:2402.15561v1 Announce Type: new Abstract: Predictive analytics is widely used in various domains, including education, to inform decision-making and improve outcomes. However, many predictive models are proprietary and inaccessible for evaluation or modification by researchers and practitioners, limiting their accountability and ethical design. Moreover, predictive models are often opaque and incomprehensible to the officials who use them, reducing their trust and utility. Furthermore, predictive models may introduce or exacerbate bias and inequity, as they have done in many sectors of society. Therefore, there is a need for transparent, interpretable, and fair predictive models that can be easily adopted and adapted by different stakeholders. In this paper, we propose a fair predictive model based on multivariate adaptive regression splines(MARS) that incorporates fairness measures in the learning process. MARS is a non-parametric regression model that performs feature selection, handles non-linear relationships, generates interpretable decision rules, and derives optimal splitting criteria on the variables. Specifically, we integrate fairness into the knot optimization algorithm and provide theoretical and empirical evidence of how it results in a fair knot placement. We apply our fairMARS model to real-world data and demonstrate its effectiveness in terms of accuracy and equity. Our paper contributes to the advancement of responsible and ethical predictive analytics for social good.  ( 2 min )
    A Generative Machine Learning Model for Material Microstructure 3D Reconstruction and Performance Evaluation
    arXiv:2402.15815v1 Announce Type: new Abstract: The reconstruction of 3D microstructures from 2D slices is considered to hold significant value in predicting the spatial structure and physical properties of materials.The dimensional extension from 2D to 3D is viewed as a highly challenging inverse problem from the current technological perspective.Recently,methods based on generative adversarial networks have garnered widespread attention.However,they are still hampered by numerous limitations,including oversimplified models,a requirement for a substantial number of training samples,and difficulties in achieving model convergence during training.In light of this,a novel generative model that integrates the multiscale properties of U-net with and the generative capabilities of GAN has been proposed.Based on this,the innovative construction of a multi-scale channel aggregation module,a multi-scale hierarchical feature aggregation module and a convolutional block attention mechanism can better capture the properties of the material microstructure and extract the image information.The model's accuracy is further improved by combining the image regularization loss with the Wasserstein distance loss.In addition,this study utilizes the anisotropy index to accurately distinguish the nature of the image,which can clearly determine the isotropy and anisotropy of the image.It is also the first time that the generation quality of material samples from different domains is evaluated and the performance of the model itself is compared.The experimental results demonstrate that the present model not only shows a very high similarity between the generated 3D structures and real samples but is also highly consistent with real data in terms of statistical data analysis.  ( 3 min )
    Optimal Zero-Shot Detector for Multi-Armed Attacks
    arXiv:2402.15808v1 Announce Type: new Abstract: This paper explores a scenario in which a malicious actor employs a multi-armed attack strategy to manipulate data samples, offering them various avenues to introduce noise into the dataset. Our central objective is to protect the data by detecting any alterations to the input. We approach this defensive strategy with utmost caution, operating in an environment where the defender possesses significantly less information compared to the attacker. Specifically, the defender is unable to utilize any data samples for training a defense model or verifying the integrity of the channel. Instead, the defender relies exclusively on a set of pre-existing detectors readily available ``off the shelf''. To tackle this challenge, we derive an innovative information-theoretic defense approach that optimally aggregates the decisions made by these detectors, eliminating the need for any training data. We further explore a practical use-case scenario for empirical evaluation, where the attacker possesses a pre-trained classifier and launches well-known adversarial attacks against it. Our experiments highlight the effectiveness of our proposed solution, even in scenarios that deviate from the optimal setup.  ( 2 min )
    Batch Active Learning of Reward Functions from Human Preferences
    arXiv:2402.15757v1 Announce Type: new Abstract: Data generation and labeling are often expensive in robot learning. Preference-based learning is a concept that enables reliable labeling by querying users with preference questions. Active querying methods are commonly employed in preference-based learning to generate more informative data at the expense of parallelization and computation time. In this paper, we develop a set of novel algorithms, batch active preference-based learning methods, that enable efficient learning of reward functions using as few data samples as possible while still having short query generation times and also retaining parallelizability. We introduce a method based on determinantal point processes (DPP) for active batch generation and several heuristic-based alternatives. Finally, we present our experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithm requires only a few queries that are computed in a short amount of time. We showcase one of our algorithms in a study to learn human users' preferences.  ( 2 min )
    Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery
    arXiv:2402.15739v1 Announce Type: new Abstract: We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-\delta$ typically scales as ${m+n\over \varepsilon^2}\log(1/\delta)$. Our regret minimization algorithm enjoys minimax guarantees scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.  ( 2 min )
    Operator Learning: Algorithms and Analysis
    arXiv:2402.15715v1 Announce Type: new Abstract: Operator learning refers to the application of ideas from machine learning to approximate (typically nonlinear) operators mapping between Banach spaces of functions. Such operators often arise from physical models expressed in terms of partial differential equations (PDEs). In this context, such approximate operators hold great potential as efficient surrogate models to complement traditional numerical methods in many-query tasks. Being data-driven, they also enable model discovery when a mathematical description in terms of a PDE is not available. This review focuses primarily on neural operators, built on the success of deep neural networks in the approximation of functions defined on finite dimensional Euclidean spaces. Empirically, neural operators have shown success in a variety of applications, but our theoretical understanding remains incomplete. This review article summarizes recent progress and the current state of our theoretical understanding of neural operators, focusing on an approximation theoretic point of view.  ( 2 min )
    Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes
    arXiv:2402.15731v1 Announce Type: new Abstract: Clustering in dynamic environments is of increasing importance, with broad applications ranging from real-time data analysis and online unsupervised learning to dynamic facility location problems. While meta-heuristics have shown promising effectiveness in static clustering tasks, their application for tracking optimal clustering solutions or robust clustering over time in dynamic environments remains largely underexplored. This is partly due to a lack of dynamic datasets with diverse, controllable, and realistic dynamic characteristics, hindering systematic performance evaluations of clustering algorithms in various dynamic scenarios. This deficiency leads to a gap in our understanding and capability to effectively design algorithms for clustering in dynamic environments. To bridge this gap, this paper introduces the Dynamic Dataset Generator (DDG). DDG features multiple dynamic Gaussian components integrated with a range of heterogeneous, local, and global changes. These changes vary in spatial and temporal severity, patterns, and domain of influence, providing a comprehensive tool for simulating a wide range of dynamic scenarios.  ( 2 min )
    A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data
    arXiv:2402.15710v1 Announce Type: new Abstract: Variational Autoencoders (VAEs) have gained significant popularity among researchers as a powerful tool for understanding unknown distributions based on limited samples. This popularity stems partly from their impressive performance and partly from their ability to provide meaningful feature representations in the latent space. Wasserstein Autoencoders (WAEs), a variant of VAEs, aim to not only improve model efficiency but also interpretability. However, there has been limited focus on analyzing their statistical guarantees. The matter is further complicated by the fact that the data distributions to which WAEs are applied - such as natural images - are often presumed to possess an underlying low-dimensional structure within a high-dimensional feature space, which current theory does not adequately account for, rendering known bounds inefficient. To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen. We show that the convergence rates of the expected excess risk in the number of samples for WAEs are independent of the high feature dimension, instead relying only on the intrinsic dimension of the data distribution.  ( 2 min )
    Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning
    arXiv:2402.15734v1 Announce Type: new Abstract: Recent years have witnessed the promise of coupling machine learning methods and physical domain-specific insight for solving scientific problems based on partial differential equations (PDEs). However, being data-intensive, these methods still require a large amount of PDE data. This reintroduces the need for expensive numerical PDE solutions, partially undermining the original goal of avoiding these expensive simulations. In this work, seeking data efficiency, we design unsupervised pretraining and in-context learning methods for PDE operator learning. To reduce the need for training data with simulated solutions, we pretrain neural operators on unlabeled PDE data using reconstruction-based proxy tasks. To improve out-of-distribution performance, we further assist neural operators in flexibly leveraging in-context learning methods, without incurring extra training costs or designs. Extensive empirical evaluations on a diverse set of PDEs demonstrate that our method is highly data-efficient, more generalizable, and even outperforms conventional vision-pretrained models.  ( 2 min )
    Anchor-free Clustering based on Anchor Graph Factorization
    arXiv:2402.15688v1 Announce Type: new Abstract: Anchor-based methods are a pivotal approach in handling clustering of large-scale data. However, these methods typically entail two distinct stages: selecting anchor points and constructing an anchor graph. This bifurcation, along with the initialization of anchor points, significantly influences the overall performance of the algorithm. To mitigate these issues, we introduce a novel method termed Anchor-free Clustering based on Anchor Graph Factorization (AFCAGF). AFCAGF innovates in learning the anchor graph, requiring only the computation of pairwise distances between samples. This process, achievable through straightforward optimization, circumvents the necessity for explicit selection of anchor points. More concretely, our approach enhances the Fuzzy k-means clustering algorithm (FKM), introducing a new manifold learning technique that obviates the need for initializing cluster centers. Additionally, we evolve the concept of the membership matrix between cluster centers and samples in FKM into an anchor graph encompassing multiple anchor points and samples. Employing Non-negative Matrix Factorization (NMF) on this anchor graph allows for the direct derivation of cluster labels, thereby eliminating the requirement for further post-processing steps. To solve the method proposed, we implement an alternating optimization algorithm that ensures convergence. Empirical evaluations on various real-world datasets underscore the superior efficacy of our algorithm compared to traditional approaches.  ( 2 min )
    Understanding Missingness in Time-series Electronic Health Records for Individualized Representation
    arXiv:2402.15730v1 Announce Type: new Abstract: With the widespread of machine learning models for healthcare applications, there is increased interest in building applications for personalized medicine. Despite the plethora of proposed research for personalized medicine, very few focus on representing missingness and learning from the missingness patterns in time-series Electronic Health Records (EHR) data. The lack of focus on missingness representation in an individualized way limits the full utilization of machine learning applications towards true personalization. In this brief communication, we highlight new insights into patterns of missingness with real-world examples and implications of missingness in EHRs. The insights in this work aim to bridge the gap between theoretical assumptions and practical observations in real-world EHRs. We hope this work will open new doors for exploring directions for better representation in predictive modelling for true personalization.  ( 2 min )
    Teacher-Student Learning on Complexity in Intelligent Routing
    arXiv:2402.15665v1 Announce Type: new Abstract: Customer service is often the most time-consuming aspect for e-commerce websites, with each contact typically taking 10-15 minutes. Effectively routing customers to appropriate agents without transfers is therefore crucial for e-commerce success. To this end, we have developed a machine learning framework that predicts the complexity of customer contacts and routes them to appropriate agents accordingly. The framework consists of two parts. First, we train a teacher model to score the complexity of a contact based on the post-contact transcripts. Then, we use the teacher model as a data annotator to provide labels to train a student model that predicts the complexity based on pre-contact data only. Our experiments show that such a framework is successful and can significantly improve customer experience. We also propose a useful metric called complexity AUC that evaluates the effectiveness of customer service at a statistical level.  ( 2 min )
    CoRelation: Boosting Automatic ICD Coding Through Contextualized Code Relation Learning
    arXiv:2402.15700v1 Announce Type: new Abstract: Automatic International Classification of Diseases (ICD) coding plays a crucial role in the extraction of relevant information from clinical notes for proper recording and billing. One of the most important directions for boosting the performance of automatic ICD coding is modeling ICD code relations. However, current methods insufficiently model the intricate relationships among ICD codes and often overlook the importance of context in clinical notes. In this paper, we propose a novel approach, a contextualized and flexible framework, to enhance the learning of ICD code representations. Our approach, unlike existing methods, employs a dependent learning paradigm that considers the context of clinical notes in modeling all possible code relations. We evaluate our approach on six public ICD coding datasets and the experimental results demonstrate the effectiveness of our approach compared to state-of-the-art baselines.  ( 2 min )
    Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks
    arXiv:2402.15680v1 Announce Type: new Abstract: The rise of self-supervised learning, which operates without the need for labeled data, has garnered significant interest within the graph learning community. This enthusiasm has led to the development of numerous Graph Contrastive Learning (GCL) techniques, all aiming to create a versatile graph encoder that leverages the wealth of unlabeled data for various downstream tasks. However, the current evaluation standards for GCL approaches are flawed due to the need for extensive hyper-parameter tuning during pre-training and the reliance on a single downstream task for assessment. These flaws can skew the evaluation away from the intended goals, potentially leading to misleading conclusions. In our paper, we thoroughly examine these shortcomings and offer fresh perspectives on how GCL methods are affected by hyper-parameter choices and the choice of downstream tasks for their evaluation. Additionally, we introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of GCL methods.  ( 2 min )
    Contact Complexity in Customer Service
    arXiv:2402.15655v1 Announce Type: new Abstract: Customers who reach out for customer service support may face a range of issues that vary in complexity. Routing high-complexity contacts to junior agents can lead to multiple transfers or repeated contacts, while directing low-complexity contacts to senior agents can strain their capacity to assist customers who need professional help. To tackle this, a machine learning model that accurately predicts the complexity of customer issues is highly desirable. However, defining the complexity of a contact is a difficult task as it is a highly abstract concept. While consensus-based data annotation by experienced agents is a possible solution, it is time-consuming and costly. To overcome these challenges, we have developed a novel machine learning approach to define contact complexity. Instead of relying on human annotation, we trained an AI expert model to mimic the behavior of agents and evaluate each contact's complexity based on how the AI expert responds. If the AI expert is uncertain or lacks the skills to comprehend the contact transcript, it is considered a high-complexity contact. Our method has proven to be reliable, scalable, and cost-effective based on the collected data.  ( 2 min )
    Learning Semilinear Neural Operators : A Unified Recursive Framework For Prediction And Data Assimilation
    arXiv:2402.15656v1 Announce Type: new Abstract: Recent advances in the theory of Neural Operators (NOs) have enabled fast and accurate computation of the solutions to complex systems described by partial differential equations (PDEs). Despite their great success, current NO-based solutions face important challenges when dealing with spatio-temporal PDEs over long time scales. Specifically, the current theory of NOs does not present a systematic framework to perform data assimilation and efficiently correct the evolution of PDE solutions over time based on sparsely sampled noisy measurements. In this paper, we propose a learning-based state-space approach to compute the solution operators to infinite-dimensional semilinear PDEs. Exploiting the structure of semilinear PDEs and the theory of nonlinear observers in function spaces, we develop a flexible recursive method that allows for both prediction and data assimilation by combining prediction and correction operations. The proposed framework is capable of producing fast and accurate predictions over long time horizons, dealing with irregularly sampled noisy measurements to correct the solution, and benefits from the decoupling between the spatial and temporal dynamics of this class of PDEs. We show through experiments on the Kuramoto-Sivashinsky, Navier-Stokes and Korteweg-de Vries equations that the proposed model is robust to noise and can leverage arbitrary amounts of measurements to correct its prediction over a long time horizon with little computational overhead.  ( 3 min )
    Multi-Constraint Safe RL with Objective Suppression for Safety-Critical Applications
    arXiv:2402.15650v1 Announce Type: new Abstract: Safe reinforcement learning tasks with multiple constraints are a challenging domain despite being very common in the real world. To address this challenge, we propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic. We benchmark Objective Suppression in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. Empirically, we demonstrate that our proposed method, when combined with existing safe RL algorithms, can match the task reward achieved by our baselines with significantly fewer constraint violations.  ( 2 min )
    MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
    arXiv:2402.15627v1 Announce Type: new Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.  ( 3 min )
  • Open

    Mixed Variational Flows for Discrete Variables
    arXiv:2308.15613v3 Announce Type: replace-cross Abstract: Variational flows allow practitioners to learn complex continuous distributions, but approximating discrete distributions remains a challenge. Current methodologies typically embed the discrete target in a continuous space - usually via continuous relaxation or dequantization - and then apply a continuous flow. These approaches involve a surrogate target that may not capture the original discrete target, might have biased or unstable gradients, and can create a difficult optimization problem. In this work, we develop a variational flow family for discrete distributions without any continuous embedding. First, we develop a measure-preserving and discrete (MAD) invertible map that leaves the discrete target invariant, and then create a mixed variational flow (MAD Mix) based on that map. Our family provides access to i.i.d. sampling and density evaluation with virtually no tuning effort. We also develop an extension to MAD Mix that handles joint discrete and continuous models. Our experiments suggest that MAD Mix produces more reliable approximations than continuous-embedding flows while being significantly faster to train.  ( 2 min )
    A kernel-based analysis of Laplacian Eigenmaps
    arXiv:2402.16481v1 Announce Type: cross Abstract: Given i.i.d. observations uniformly distributed on a closed manifold $\mathcal{M}\subseteq \mathbb{R}^p$, we study the spectral properties of the associated empirical graph Laplacian based on a Gaussian kernel. Our main results are non-asymptotic error bounds, showing that the eigenvalues and eigenspaces of the empirical graph Laplacian are close to the eigenvalues and eigenspaces of the Laplace-Beltrami operator of $\mathcal{M}$. In our analysis, we connect the empirical graph Laplacian to kernel principal component analysis, and consider the heat kernel of $\mathcal{M}$ as reproducing kernel feature map. This leads to novel points of view and allows to leverage results for empirical covariance operators in infinite dimensions.  ( 2 min )
    Low-Cost High-Power Membership Inference Attacks
    arXiv:2312.03262v2 Announce Type: replace Abstract: Membership inference attacks (MIA) aim to detect if a particular data point was used in training a machine learning model. Recent strong attacks have high computational costs and inconsistent performance under varying conditions, rendering them unreliable for practical privacy risk assessment. We design a novel, efficient, and robust membership inference attack (RMIA) which accurately differentiates between population data and training data of a model, with minimal computational overhead. We achieve this by a more accurate modeling of the null hypothesis setting in our likelihood ratio tests, and effectively leveraging both reference models and reference data samples from the population. Our algorithm exhibits superior test power (true-positive rate) compared to prior methods, throughout the TPR-FPR curve including at extremely low false-positive rates (as low as 0). Under computation constraints, where only a limited number of pre-trained reference models (as few as 1) are available, and also when we vary other elements of the attack, our method performs exceptionally well, unlike some prior attacks that approach random guessing. RMIA outperforms the prior work in all configurations of the attack setup. RMIA lays the algorithmic groundwork for practical yet accurate and reliable privacy risk analysis in machine learning.  ( 2 min )
    Partial Rankings of Optimizers
    arXiv:2402.16565v1 Announce Type: cross Abstract: We introduce a framework for benchmarking optimizers according to multiple criteria over various test functions. Based on a recently introduced union-free generic depth function for partial orders/rankings, it fully exploits the ordinal information and allows for incomparability. Our method describes the distribution of all partial orders/rankings, avoiding the notorious shortcomings of aggregation. This permits to identify test functions that produce central or outlying rankings of optimizers and to assess the quality of benchmarking suites.  ( 2 min )
    Efficient Convex Algorithms for Universal Kernel Learning
    arXiv:2304.07472v2 Announce Type: replace Abstract: The accuracy and complexity of machine learning algorithms based on kernel optimization are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (for tractability); be dense in the set of all kernels (for robustness); be universal (for accuracy). Recently, a framework was proposed for using positive matrices to parameterize a class of positive semi-separable kernels. Although this class can be shown to meet all three criteria, previous algorithms for optimization of such kernels were limited to classification and furthermore relied on computationally complex Semidefinite Programming (SDP) algorithms. In this paper, we pose the problem of learning semiseparable kernels as a minimax optimization problem and propose a SVD-QCQP primal-dual algorithm which dramatically reduces the computational complexity as compared with previous SDP-based approaches. Furthermore, we provide an efficient implementation of this algorithm for both classification and regression -- an implementation which enables us to solve problems with 100 features and up to 30,000 datums. Finally, when applied to benchmark data, the algorithm demonstrates the potential for significant improvement in accuracy over typical (but non-convex) approaches such as Neural Nets and Random Forest with similar or better computation time.  ( 2 min )
    Re-Envisioning Numerical Information Field Theory (NIFTy.re): A Library for Gaussian Processes and Variational Inference
    arXiv:2402.16683v1 Announce Type: cross Abstract: Imaging is the process of transforming noisy, incomplete data into a space that humans can interpret. NIFTy is a Bayesian framework for imaging and has already successfully been applied to many fields in astrophysics. Previous design decisions held the performance and the development of methods in NIFTy back. We present a rewrite of NIFTy, coined NIFTy.re, which reworks the modeling principle, extends the inference strategies, and outsources much of the heavy lifting to JAX. The rewrite dramatically accelerates models written in NIFTy, lays the foundation for new types of inference machineries, improves maintainability, and enables interoperability between NIFTy and the JAX machine learning ecosystem.  ( 2 min )
    Vanilla Bayesian Optimization Performs Great in High Dimensions
    arXiv:2402.02229v3 Announce Type: replace-cross Abstract: High-dimensional problems have long been considered the Achilles' heel of Bayesian optimization algorithms. Spurred by the curse of dimensionality, a large collection of algorithms aim to make it more performant in this setting, commonly by imposing various simplifying assumptions on the objective. In this paper, we identify the degeneracies that make vanilla Bayesian optimization poorly suited to high-dimensional tasks, and further show how existing algorithms address these degeneracies through the lens of lowering the model complexity. Moreover, we propose an enhancement to the prior assumptions that are typical to vanilla Bayesian optimization algorithms, which reduces the complexity to manageable levels without imposing structural restrictions on the objective. Our modification - a simple scaling of the Gaussian process lengthscale prior with the dimensionality - reveals that standard Bayesian optimization works drastically better than previously thought in high dimensions, clearly outperforming existing state-of-the-art algorithms on multiple commonly considered real-world high-dimensional tasks.  ( 2 min )
    An Interpretable and Efficient Infinite-Order Vector Autoregressive Model for High-Dimensional Time Series
    arXiv:2209.01172v4 Announce Type: replace-cross Abstract: As a special infinite-order vector autoregressive (VAR) model, the vector autoregressive moving average (VARMA) model can capture much richer temporal patterns than the widely used finite-order VAR model. However, its practicality has long been hindered by its non-identifiability, computational intractability, and difficulty of interpretation, especially for high-dimensional time series. This paper proposes a novel sparse infinite-order VAR model for high-dimensional time series, which avoids all above drawbacks while inheriting essential temporal patterns of the VARMA model. As another attractive feature, the temporal and cross-sectional structures of the VARMA-type dynamics captured by this model can be interpreted separately, since they are characterized by different sets of parameters. This separation naturally motivates the sparsity assumption on the parameters determining the cross-sectional dependence. As a result, greater statistical efficiency and interpretability can be achieved with little loss of temporal information. We introduce two $\ell_1$-regularized estimation methods for the proposed model, which can be efficiently implemented via block coordinate descent algorithms, and derive the corresponding nonasymptotic error bounds. A consistent model order selection method based on the Bayesian information criteria is also developed. The merit of the proposed approach is supported by simulation studies and a real-world macroeconomic data analysis.  ( 2 min )
    Optimal Transport for Measures with Noisy Tree Metric
    arXiv:2310.13653v2 Announce Type: replace Abstract: We study optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that such OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, the given tree structure may be, however, perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach which considers the maximal possible distances between two input measures over an uncertainty set of tree metrics. In general, this approach is hard to compute, even for measures supported in one-dimensional space, due to its non-convexity and non-smoothness which hinders its practical applications, especially for large-scale settings. In this work, we propose novel uncertainty sets of tree metrics from the lens of edge deletion/addition which covers a diversity of tree structures in an elegant framework. Consequently, by building upon the proposed uncertainty sets, and leveraging the tree structure over supports, we show that the robust OT also admits a closed-form expression for a fast computation as its counterpart standard OT (i.e., TW). Furthermore, we demonstrate that the robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite kernels and test them in several simulations on various real-world datasets on document classification and topological data analysis.  ( 3 min )
    Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
    arXiv:2310.01107v2 Announce Type: replace-cross Abstract: Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training text-to-video (T2V) models on text-video data or adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. To address this, here we present a novel grounding-guided video-to-video translation framework called Ground-A-Video for multi-attribute video editing. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without aforementioned shortcomings. Central to our method is the introduction of Cross-Frame Gated Attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency. Further results and code are available at our project page (http://ground-a-video.github.io).  ( 2 min )
    Learning to Embed Time Series Patches Independently
    arXiv:2312.16427v2 Announce Type: replace-cross Abstract: Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series. Inspired by masked image modeling in computer vision, recent works first patchify and partially mask out time series, and then train Transformers to capture the dependencies between patches by predicting masked patches from unmasked patches. However, we argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning; rather, learning to embed patches independently results in better time series representations. Specifically, we propose to use 1) the simple patch reconstruction task, which autoencode each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently. In addition, we introduce complementary contrastive learning to hierarchically capture adjacent time series information efficiently. Our proposed method improves time series forecasting and classification performance compared to state-of-the-art Transformer-based models, while it is more efficient in terms of the number of parameters and training/inference time. Code is available at this repository: https://github.com/seunghan96/pits.  ( 2 min )
    Energy-Based Concept Bottleneck Models: Unifying Prediction, Concept Intervention, and Probabilistic Interpretations
    arXiv:2401.14142v2 Announce Type: replace-cross Abstract: Existing methods, such as concept bottleneck models (CBMs), have been successful in providing concept-based interpretations for black-box deep learning models. They typically work by predicting concepts given the input and then predicting the final class label given the predicted concepts. However, (1) they often fail to capture the high-order, nonlinear interaction between concepts, e.g., correcting a predicted concept (e.g., "yellow breast") does not help correct highly correlated concepts (e.g., "yellow belly"), leading to suboptimal final accuracy; (2) they cannot naturally quantify the complex conditional dependencies between different concepts and class labels (e.g., for an image with the class label "Kentucky Warbler" and a concept "black bill", what is the probability that the model correctly predicts another concept "black crown"), therefore failing to provide deeper insight into how a black-box model works. In response to these limitations, we propose Energy-based Concept Bottleneck Models (ECBMs). Our ECBMs use a set of neural networks to define the joint energy of candidate (input, concept, class) tuples. With such a unified interface, prediction, concept correction, and conditional dependency quantification are then represented as conditional probabilities, which are generated by composing different energy functions. Our ECBMs address both limitations of existing CBMs, providing higher accuracy and richer concept interpretations. Empirical results show that our approach outperforms the state-of-the-art on real-world datasets.  ( 3 min )
    Soft Contrastive Learning for Time Series
    arXiv:2312.16424v2 Announce Type: replace-cross Abstract: Contrastive learning has shown to be effective to learn representations from time series in a self-supervised way. However, contrasting similar time series instances or values from adjacent timestamps within a time series leads to ignore their inherent correlations, which results in deteriorating the quality of learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) instance-wise contrastive loss by the distance between time series on the data space, and 2) temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.  ( 2 min )
    Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens
    arXiv:2311.11463v2 Announce Type: replace-cross Abstract: After a machine learning (ML)-based system is deployed, monitoring its performance is important to ensure the safety and effectiveness of the algorithm over time. When an ML algorithm interacts with its environment, the algorithm can affect the data-generating mechanism and be a major source of bias when evaluating its standalone performance, an issue known as performativity. Although prior work has shown how to validate models in the presence of performativity using causal inference techniques, there has been little work on how to monitor models in the presence of performativity. Unlike the setting of model validation, there is much less agreement on which performance metrics to monitor. Different monitoring criteria impact how interpretable the resulting test statistic is, what assumptions are needed for identifiability, and the speed of detection. When this choice is further coupled with the decision to use observational versus interventional data, ML deployment teams are faced with a multitude of monitoring options. The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy and how causal reasoning can provide a systematic framework for choosing between these options. As a motivating example, we consider an ML-based risk prediction algorithm for predicting unplanned readmissions. Bringing together tools from causal inference and statistical process control, we consider six monitoring procedures (three candidate monitoring criteria and two data sources) and investigate their operating characteristics in simulation studies. Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal, which has real-world impacts on the design and documentation of ML monitoring systems.  ( 3 min )
    Generative Flow Networks as Entropy-Regularized RL
    arXiv:2310.12934v3 Announce Type: replace-cross Abstract: The recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to a general case. We demonstrate how the task of learning a generative flow network can be efficiently redefined as an entropy-regularized RL problem with a specific reward and regularizer structure. Furthermore, we illustrate the practical efficiency of this reformulation by applying standard soft RL algorithms to GFlowNet training across several probabilistic modeling tasks. Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods. This perspective opens a direct path for integrating RL principles into the realm of generative flow networks.  ( 2 min )
    Neural Diffusion Models
    arXiv:2310.08337v2 Announce Type: replace-cross Abstract: Diffusion models have shown remarkable performance on many generative tasks. Despite recent success, most diffusion models are restricted in that they only allow linear transformation of the data distribution. In contrast, broader family of transformations can potentially help train generative distributions more efficiently, simplifying the reverse process and closing the gap between the true negative log-likelihood and the variational approximation. In this paper, we present Neural Diffusion Models (NDMs), a generalization of conventional diffusion models that enables defining and learning time-dependent non-linear transformations of data. We show how to optimise NDMs using a variational bound in a simulation-free setting. Moreover, we derive a time-continuous formulation of NDMs, which allows fast and reliable inference using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the utility of NDMs with learnable transformations through experiments on standard image generation benchmarks, including CIFAR-10, downsampled versions of ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms of likelihood and produce high-quality samples.  ( 2 min )
    Order-Preserving GFlowNets
    arXiv:2310.00386v2 Announce Type: replace-cross Abstract: Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates with probabilities proportional to a given reward. However, GFlowNets can only be used with a predefined scalar reward, which can be either computationally expensive or not directly accessible, in the case of multi-objective optimization (MOO) tasks for example. Moreover, to prioritize identifying high-reward candidates, the conventional practice is to raise the reward to a higher exponent, the optimal choice of which may vary across different environments. To address these issues, we propose Order-Preserving GFlowNets (OP-GFNs), which sample with probabilities in proportion to a learned reward function that is consistent with a provided (partial) order on the candidates, thus eliminating the need for an explicit formulation of the reward function. We theoretically prove that the training process of OP-GFNs gradually sparsifies the learned reward landscape in single-objective maximization tasks. The sparsification concentrates on candidates of a higher hierarchy in the ordering, ensuring exploration at the beginning and exploitation towards the end of the training. We demonstrate OP-GFN's state-of-the-art performance in single-objective maximization (totally ordered) and multi-objective Pareto front approximation (partially ordered) tasks, including synthetic datasets, molecule generation, and neural architecture search.  ( 2 min )
    A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning
    arXiv:2309.04877v2 Announce Type: replace-cross Abstract: The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition.  ( 2 min )
    The Galerkin method beats Graph-Based Approaches for Spectral Algorithms
    arXiv:2306.00742v3 Announce Type: replace-cross Abstract: Historically, the machine learning community has derived spectral decompositions from graph-based approaches. We break with this approach and prove the statistical and computational superiority of the Galerkin method, which consists in restricting the study to a small set of test functions. In particular, we introduce implementation tricks to deal with differential operators in large dimensions with structured kernels. Finally, we extend on the core principles beyond our approach to apply them to non-linear spaces of functions, such as the ones parameterized by deep neural networks, through loss-based optimization procedures.  ( 2 min )
    A Benchmark Study on Calibration
    arXiv:2308.11838v4 Announce Type: replace-cross Abstract: Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS.  ( 3 min )
    Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks
    arXiv:2303.05445v4 Announce Type: replace-cross Abstract: Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify on various scenarios, including dynamic networks, that FwA leads to significantly lower communication costs despite minimal regret performance loss compared to other network protocols.  ( 3 min )
    Learning Hidden Markov Models Using Conditional Samples
    arXiv:2302.14753v2 Announce Type: replace-cross Abstract: This paper is concerned with the computational complexity of learning the Hidden Markov Model (HMM). Although HMMs are some of the most widely used tools in sequential and time series modeling, they are cryptographically hard to learn in the standard setting where one has access to i.i.d. samples of observation sequences. In this paper, we depart from this setup and consider an interactive access model, in which the algorithm can query for samples from the conditional distributions of the HMMs. We show that interactive access to the HMM enables computationally efficient learning algorithms, thereby bypassing cryptographic hardness. Specifically, we obtain efficient algorithms for learning HMMs in two settings: (a) An easier setting where we have query access to the exact conditional probabilities. Here our algorithm runs in polynomial time and makes polynomially many queries to approximate any HMM in total variation distance. (b) A harder setting where we can only obtain samples from the conditional distributions. Here the performance of the algorithm depends on a new parameter, called the fidelity of the HMM. We show that this captures cryptographically hard instances and previously known positive results. We also show that these results extend to a broader class of distributions with latent low rank structure. Our algorithms can be viewed as generalizations and robustifications of Angluin's $L^*$ algorithm for learning deterministic finite automata from membership queries.  ( 3 min )
    Flow Matching on General Geometries
    arXiv:2302.03660v3 Announce Type: replace-cross Abstract: We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, are inherently unable to scale to high dimensions, or use approximations for limiting quantities that result in biased training objectives. Riemannian Flow Matching bypasses these limitations and offers several advantages over previous approaches: it is simulation-free on simple geometries, does not require divergence computation, and computes its target vector field in closed-form. The key ingredient behind RFM is the construction of a relatively simple premetric for defining target vector fields, which encompasses the existing Euclidean case. To extend to general geometries, we rely on the use of spectral decompositions to efficiently compute premetrics on the fly. Our method achieves state-of-the-art performance on many real-world non-Euclidean datasets, and we demonstrate tractable training on general geometries, including triangular meshes with highly non-trivial curvature and boundaries.  ( 2 min )
    Stochastic Online Instrumental Variable Regression: Regrets for Endogeneity and Bandit Feedback
    arXiv:2302.09357v3 Announce Type: replace-cross Abstract: Endogeneity, i.e. the dependence of noise and covariates, is a common phenomenon in real data due to omitted variables, strategic behaviours, measurement errors etc. In contrast, the existing analyses of stochastic online linear regression with unbounded noise and linear bandits depend heavily on exogeneity, i.e. the independence of noise and covariates. Motivated by this gap, we study the over- and just-identified Instrumental Variable (IV) regression, specifically Two-Stage Least Squares, for stochastic online learning, and propose to use an online variant of Two-Stage Least Squares, namely O2SLS. We show that O2SLS achieves $\mathcal O(d_{x}d_{z}\log^2 T)$ identification and $\widetilde{\mathcal O}(\gamma \sqrt{d_{z} T})$ oracle regret after $T$ interactions, where $d_{x}$ and $d_{z}$ are the dimensions of covariates and IVs, and $\gamma$ is the bias due to endogeneity. For $\gamma=0$, i.e. under exogeneity, O2SLS exhibits $\mathcal O(d_{x}^2 \log^2 T)$ oracle regret, which is of the same order as that of the stochastic online ridge. Then, we leverage O2SLS as an oracle to design OFUL-IV, a stochastic linear bandit algorithm to tackle endogeneity. OFUL-IV yields $\widetilde{\mathcal O}(\sqrt{d_{x}d_{z}T})$ regret that matches the regret lower bound under exogeneity. For different datasets with endogeneity, we experimentally show efficiencies of O2SLS and OFUL-IV.  ( 3 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models
    arXiv:2301.12334v2 Announce Type: replace-cross Abstract: We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves ineffective and time-consuming for the minority generating task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first highlight that Tweedie's denoising formula yields favorable results for majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating high-quality minority samples over existing generative samplers. We showcase that the performance benefit of our framework persists even in demanding real-world scenarios such as medical imaging, further underscoring the practical significance of our work. Code is available at https://github.com/soobin-um/minority-guidance.  ( 3 min )
    Multi-Armed Bandits and Quantum Channel Oracles
    arXiv:2301.08544v3 Announce Type: replace-cross Abstract: Multi-armed bandits are one of the theoretical pillars of reinforcement learning. Recently, the investigation of quantum algorithms for multi-armed bandit problems was started, and it was found that a quadratic speed-up (in query complexity) is possible when the arms and the randomness of the rewards of the arms can be queried in superposition. Here we introduce further bandit models where we only have limited access to the randomness of the rewards, but we can still query the arms in superposition. We show that then the query complexity is the same as for classical algorithms. This generalizes the prior result that no speed-up is possible for unstructured search when the oracle has positive failure probability.  ( 2 min )
    DynaConF: Dynamic Forecasting of Non-Stationary Time Series
    arXiv:2209.08411v3 Announce Type: replace-cross Abstract: Deep learning has shown impressive results in a variety of time series forecasting tasks, where modeling the conditional distribution of the future given the past is the essence. However, when this conditional distribution is non-stationary, it poses challenges for these models to learn consistently and to predict accurately. In this work, we propose a new method to model non-stationary conditional distributions over time by clearly decoupling stationary conditional distribution modeling from non-stationary dynamics modeling. Our method is based on a Bayesian dynamic model that can adapt to conditional distribution changes and a deep conditional distribution model that handles multivariate time series using a factorized output space. Our experimental results on synthetic and real-world datasets show that our model can adapt to non-stationary time series better than state-of-the-art deep learning solutions.  ( 2 min )
    Multimodal Generative Models for Bankruptcy Prediction Using Textual Data
    arXiv:2211.08405v5 Announce Type: replace-cross Abstract: Textual data from financial filings, e.g., the Management's Discussion & Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies, which limits the use of MDA data in traditional bankruptcy models, as they need complete data to make predictions. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scrapping the MDA section. To solve this limitation, this research introduces the Conditional Multimodal Discriminative (CMMD) model that learns multimodal representations that embed information from accounting, market, and textual data modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions and to generate words from the missing MDA modality. With this novel methodology, it is realistic to use textual data in bankruptcy prediction models, since accounting and market data are available for all companies, unlike textual data. The empirical results of this research show that if financial regulators, or investors, were to use traditional models using MDA data, they would only be able to make predictions for 60% of the companies. Furthermore, the classification performance of our proposed methodology is superior to that of a large number of traditional classifier models, taking into account all the companies in our sample.  ( 3 min )
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond
    arXiv:2206.07751v5 Announce Type: replace-cross Abstract: Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.  ( 2 min )
    Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach
    arXiv:2202.12797v2 Announce Type: replace-cross Abstract: Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where the agents interact with the mechanism designer according to an unknown Markov Decision Process (MDP), where agent rewards and the mechanism designer's state evolve according to an episodic MDP with unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and propose novel learning algorithms to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction. A key contribution of our approach is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space to estimate prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower bound to show that our algorithm is efficient, incurring the same $\tilde{\mathcal{O}}(T^{2 / 3})$ regret as the lower bound, where $T$ is the total number of rounds. Our work establishes the regret guarantee for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model.  ( 3 min )
    Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
    arXiv:2206.04044v2 Announce Type: replace-cross Abstract: This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.  ( 2 min )
    On the generalization of Tanimoto-type kernels to real valued functions
    arXiv:2007.05943v2 Announce Type: replace-cross Abstract: The Tanimoto kernel (Jaccard index) is a well known tool to describe the similarity between sets of binary attributes. It has been extended to the case when the attributes are nonnegative real values. This paper introduces a more general Tanimoto kernel formulation which allows to measure the similarity of arbitrary real-valued functions. This extension is constructed by unifying the representation of the attributes via properly chosen sets. After deriving the general form of the kernel, explicit feature representation is extracted from the kernel function, and a simply way of including general kernels into the Tanimoto kernel is shown. Finally, the kernel is also expressed as a quotient of piecewise linear functions, and a smooth approximation is provided.  ( 2 min )
    The Role of Contextual Information in Best Arm Identification
    arXiv:2106.14077v3 Announce Type: replace-cross Abstract: We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We show the instance-specific sample complexity lower bounds for the problem. Then, we propose a context-aware version of the "Track-and-Stop" strategy, wherein the proportion of the arm draws tracks the set of optimal allocations and prove that the expected number of arm draws matches the lower bound asymptotically. We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016). We experimentally confirm that context information contributes to faster best-arm identification.  ( 2 min )
    Online Causal Inference for Advertising in Real-Time Bidding Auctions
    arXiv:1908.08600v4 Announce Type: replace-cross Abstract: Real-time bidding (RTB) systems, which utilize auctions to allocate user impressions to competing advertisers, continue to enjoy success in digital advertising. Assessing the effectiveness of such advertising remains a challenge in research and practice. This paper proposes a new approach to perform causal inference on advertising bought through such mechanisms. Leveraging the economic structure of first- and second-price auctions, we first show that the effects of advertising are identified by the optimal bids. Hence, since these optimal bids are the only objects that need to be recovered, we introduce an adapted Thompson sampling (TS) algorithm to solve a multi-armed bandit problem that succeeds in recovering such bids and, consequently, the effects of advertising while minimizing the costs of experimentation. We derive a regret bound for our algorithm which is order optimal and use data from RTB auctions to show that it outperforms commonly used methods that estimate the effects of advertising.  ( 2 min )
    Supervised and Penalized Baseline Correction
    arXiv:2310.18306v3 Announce Type: replace Abstract: Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations. As a result, these baselines can adversely affect analytical and quantitative results. Baseline correction is an umbrella term where one applies pre-processing methods to obtain baseline spectra (the unwanted distortions) and then remove the distortions by differencing. However, current state-of-the art baseline correction methods do not utilize analyte concentrations even if they are available, or even if they contribute significantly to the observed spectral variability. We examine a class of state-of-the-art methods (penalized baseline correction) and modify them such that they can accommodate a priori analyte concentrations such that prediction can be enhanced. Performance will be assessed on two near infra-red data sets across both classical penalized baseline correction methods (without analyte information) and modified penalized baseline correction methods (leveraging analyte information).  ( 2 min )
    Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts
    arXiv:2309.13850v2 Announce Type: replace Abstract: Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when $k_{\ast}$ becomes unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.  ( 3 min )
    Solving Kernel Ridge Regression with Gradient-Based Optimization Methods
    arXiv:2306.16838v5 Announce Type: replace Abstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.  ( 3 min )
    Broadcasting in random recursive dags
    arXiv:2306.01727v2 Announce Type: replace Abstract: A uniform $k$-{\sc dag} generalizes the uniform random recursive tree by picking $k$ parents uniformly at random from the existing nodes. It starts with $k$ ''roots''. Each of the $k$ roots is assigned a bit. These bits are propagated by a noisy channel. The parents' bits are flipped with probability $p$, and a majority vote is taken. When all nodes have received their bits, the $k$-{\sc dag} is shown without identifying the roots. The goal is to estimate the majority bit among the roots. We identify the threshold for $p$ as a function of $k$ below which the majority rule among all nodes yields an error $c+o(1)$ with $c<1/2$. Above the threshold the majority rule errs with probability $1/2+o(1)$.  ( 2 min )
    Kernel Two-Sample Tests for Manifold Data
    arXiv:2105.03425v4 Announce Type: replace Abstract: We present a study of a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, assuming that high-dimensional observations are close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, when data densities $p$ and $q$ are supported on a $d$-dimensional sub-manifold ${M}$ embedded in an $m$-dimensional space and are H\"older with order $\beta$ (up to 2) on ${M}$, we prove a guarantee of the test power for finite sample size $n$ that exceeds a threshold depending on $d$, $\beta$, and $\Delta_2$ the squared $L^2$-divergence between $p$ and $q$ on the manifold, and with a properly chosen kernel bandwidth $\gamma$. For small density departures, we show that with large $n$ they can be detected by the kernel test when $\Delta_2$ is greater than $n^{- { 2 \beta/( d + 4 \beta ) }}$ up to a certain constant and $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary and the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test has no curse-of-dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.  ( 3 min )
    The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning
    arXiv:2302.04658v3 Announce Type: replace Abstract: Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.  ( 2 min )
    Max-Linear Regression by Convex Programming
    arXiv:2103.07020v2 Announce Type: replace Abstract: We consider the multivariate max-linear regression problem where the model parameters $\boldsymbol{\beta}_{1},\dotsc,\boldsymbol{\beta}_{k}\in\mathbb{R}^{p}$ need to be estimated from $n$ independent samples of the (noisy) observations $y = \max_{1\leq j \leq k} \boldsymbol{\beta}_{j}^{\mathsf{T}} \boldsymbol{x} + \mathrm{noise}$. The max-linear model vastly generalizes the conventional linear model, and it can approximate any convex function to an arbitrary accuracy when the number of linear models $k$ is large enough. However, the inherent nonlinearity of the max-linear model renders the estimation of the regression parameters computationally challenging. Particularly, no estimator based on convex programming is known in the literature. We formulate and analyze a scalable convex program given by anchored regression (AR) as the estimator for the max-linear regression problem. Under the standard Gaussian observation setting, we present a non-asymptotic performance guarantee showing that the convex program recovers the parameters with high probability. When the $k$ linear components are equally likely to achieve the maximum, our result shows a sufficient number of noise-free observations for exact recovery scales as {$k^{4}p$} up to a logarithmic factor. { This sample complexity coincides with that by alternating minimization (Ghosh et al., {2021}). Moreover, the same sample complexity applies when the observations are corrupted with arbitrary deterministic noise. We provide empirical results that show that our method performs as our theoretical result predicts, and is competitive with the alternating minimization algorithm particularly in presence of multiplicative Bernoulli noise. Furthermore, we also show empirically that a recursive application of AR can significantly improve the estimation accuracy.}  ( 3 min )
    On the representation and learning of monotone triangular transport maps
    arXiv:2009.10303v3 Announce Type: replace Abstract: Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps$\unicode{x2014}$approximations of the Knothe$\unicode{x2013}$Rosenblatt (KR) rearrangement$\unicode{x2014}$are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.  ( 3 min )
    Cost Aware Best Arm Identification
    arXiv:2402.16710v1 Announce Type: cross Abstract: In this paper, we study a best arm identification problem with dual objects. In addition to the classic reward, each arm is associated with a cost distribution and the goal is to identify the largest reward arm using the minimum expected cost. We call it \emph{Cost Aware Best Arm Identification} (CABAI), which captures the separation of testing and implementation phases in product development pipelines and models the objective shift between phases, i.e., cost for testing and reward for implementation. We first derive an theoretic lower bound for CABAI and propose an algorithm called $\mathsf{CTAS}$ to match it asymptotically. To reduce the computation of $\mathsf{CTAS}$, we further propose a low-complexity algorithm called CO, based on a square-root rule, which proves optimal in simplified two-armed models and generalizes surprisingly well in numerical experiments. Our results show (i) ignoring the heterogeneous action cost results in sub-optimality in practice, and (ii) low-complexity algorithms deliver near-optimal performance over a wide range of problems.  ( 2 min )
    Sparse Orthogonal Variational Inference for Gaussian Processes
    arXiv:1910.10596v5 Announce Type: replace Abstract: We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.  ( 2 min )
    Training Implicit Generative Models via an Invariant Statistical Loss
    arXiv:2402.16435v1 Announce Type: cross Abstract: Implicit generative models have the capability to learn arbitrary complex data distributions. On the downside, training requires telling apart real data from artificially-generated ones using adversarial discriminators, leading to unstable training and mode-dropping issues. As reported by Zahee et al. (2017), even in the one-dimensional (1D) case, training a generative adversarial network (GAN) is challenging and often suboptimal. In this work, we develop a discriminator-free method for training one-dimensional (1D) generative implicit models and subsequently expand this method to accommodate multivariate cases. Our loss function is a discrepancy measure between a suitably chosen transformation of the model samples and a uniform distribution; hence, it is invariant with respect to the true distribution of the data. We first formulate our method for 1D random variables, providing an effective solution for approximate reparameterization of arbitrary complex distributions. Then, we consider the temporal setting (both univariate and multivariate), in which we model the conditional distribution of each sample given the history of the process. We demonstrate through numerical simulations that this new method yields promising results, successfully learning true distributions in a variety of scenarios and mitigating some of the well-known problems that state-of-the-art implicit methods present.  ( 3 min )
    Self Supervised Correlation-based Permutations for Multi-View Clustering
    arXiv:2402.16383v1 Announce Type: cross Abstract: Fusing information from different modalities can enhance data analysis tasks, including clustering. However, existing multi-view clustering (MVC) solutions are limited to specific domains or rely on a suboptimal and computationally demanding two-stage procedure of representation and clustering. We propose an end-to-end deep learning-based MVC framework for general data (image, tabular, etc.). Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective. Concurrently, we learn cluster assignments by identifying consistent pseudo-labels across multiple views. We demonstrate the effectiveness of our model using ten MVC benchmark datasets. Theoretically, we show that our model approximates the supervised linear discrimination analysis (LDA) representation. Additionally, we provide an error bound induced by false-pseudo label annotations.  ( 2 min )
    Feedback Efficient Online Fine-Tuning of Diffusion Models
    arXiv:2402.16359v1 Announce Type: cross Abstract: Diffusion models excel at modeling complex data distributions, including those of images, proteins, and small molecules. However, in many cases, our goal is to model parts of the distribution that maximize certain properties: for example, we may want to generate images with high aesthetic quality, or molecules with high bioactivity. It is natural to frame this as a reinforcement learning (RL) problem, in which the objective is to fine-tune a diffusion model to maximize a reward function that corresponds to some property. Even with access to online queries of the ground-truth reward function, efficiently discovering high-reward samples can be challenging: they might have a low probability in the initial distribution, and there might be many infeasible samples that do not even have a well-defined reward (e.g., unnatural images or physically impossible molecules). In this work, we propose a novel reinforcement learning procedure that efficiently explores on the manifold of feasible samples. We present a theoretical analysis providing a regret guarantee, as well as empirical validation across three domains: images, biological sequences, and molecules.  ( 2 min )
    A VAE-based Framework for Learning Multi-Level Neural Granger-Causal Connectivity
    arXiv:2402.16131v1 Announce Type: cross Abstract: Granger causality has been widely used in various application domains to capture lead-lag relationships amongst the components of complex dynamical systems, and the focus in extant literature has been on a single dynamical system. In certain applications in macroeconomics and neuroscience, one has access to data from a collection of related such systems, wherein the modeling task of interest is to extract the shared common structure that is embedded across them, as well as to identify the idiosyncrasies within individual ones. This paper introduces a Variational Autoencoder (VAE) based framework that jointly learns Granger-causal relationships amongst components in a collection of related-yet-heterogeneous dynamical systems, and handles the aforementioned task in a principled way. The performance of the proposed framework is evaluated on several synthetic data settings and benchmarked against existing approaches designed for individual system learning. The method is further illustrated on a real dataset involving time series data from a neurophysiological experiment and produces interpretable results.  ( 2 min )
    Conformalized Selective Regression
    arXiv:2402.16300v1 Announce Type: cross Abstract: Should prediction models always deliver a prediction? In the pursuit of maximum predictive performance, critical considerations of reliability and fairness are often overshadowed, particularly when it comes to the role of uncertainty. Selective regression, also known as the "reject option," allows models to abstain from predictions in cases of considerable uncertainty. Initially proposed seven decades ago, approaches to selective regression have mostly focused on distribution-based proxies for measuring uncertainty, particularly conditional variance. However, this focus neglects the significant influence of model-specific biases on a model's performance. In this paper, we propose a novel approach to selective regression by leveraging conformal prediction, which provides grounded confidence measures for individual predictions based on model-specific biases. In addition, we propose a standardized evaluation framework to allow proper comparison of selective regression approaches. Via an extensive experimental approach, we demonstrate how our proposed approach, conformalized selective regression, demonstrates an advantage over multiple state-of-the-art baselines.  ( 2 min )
    A unified Fourier slice method to derive ridgelet transform for a variety of depth-2 neural networks
    arXiv:2402.15984v1 Announce Type: cross Abstract: To investigate neural network parameters, it is easier to study the distribution of parameters than to study the parameters in each neuron. The ridgelet transform is a pseudo-inverse operator that maps a given function $f$ to the parameter distribution $\gamma$ so that a network $\mathtt{NN}[\gamma]$ reproduces $f$, i.e. $\mathtt{NN}[\gamma]=f$. For depth-2 fully-connected networks on a Euclidean space, the ridgelet transform has been discovered up to the closed-form expression, thus we could describe how the parameters are distributed. However, for a variety of modern neural network architectures, the closed-form expression has not been known. In this paper, we explain a systematic method using Fourier expressions to derive ridgelet transforms for a variety of modern networks such as networks on finite fields $\mathbb{F}_p$, group convolutional networks on abstract Hilbert space $\mathcal{H}$, fully-connected networks on noncompact symmetric spaces $G/K$, and pooling layers, or the $d$-plane ridgelet transform.  ( 2 min )
    Improved Hardness Results for Learning Intersections of Halfspaces
    arXiv:2402.15995v1 Announce Type: cross Abstract: We show strong (and surprisingly simple) lower bounds for weakly learning intersections of halfspaces in the improper setting. Strikingly little is known about this problem. For instance, it is not even known if there is a polynomial-time algorithm for learning the intersection of only two halfspaces. On the other hand, lower bounds based on well-established assumptions (such as approximating worst-case lattice problems or variants of Feige's 3SAT hypothesis) are only known (or are implied by existing results) for the intersection of super-logarithmically many halfspaces [KS09,KS06,DSS16]. With intersections of fewer halfspaces being only ruled out under less standard assumptions [DV21] (such as the existence of local pseudo-random generators with large stretch). We significantly narrow this gap by showing that even learning $\omega(\log \log N)$ halfspaces in dimension $N$ takes super-polynomial time under standard assumptions on worst-case lattice problems (namely that SVP and SIVP are hard to approximate within polynomial factors). Further, we give unconditional hardness results in the statistical query framework. Specifically, we show that for any $k$ (even constant), learning $k$ halfspaces in dimension $N$ requires accuracy $N^{-\Omega(k)}$, or exponentially many queries -- in particular ruling out SQ algorithms with polynomial accuracy for $\omega(1)$ halfspaces. To the best of our knowledge this is the first unconditional hardness result for learning a super-constant number of halfspaces. Our lower bounds are obtained in a unified way via a novel connection we make between intersections of halfspaces and the so-called parallel pancakes distribution [DKS17,BLPR19,BRST21] that has been at the heart of many lower bound constructions in (robust) high-dimensional statistics in the past few years.  ( 3 min )
    Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
    arXiv:2402.15926v1 Announce Type: cross Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.  ( 2 min )
    Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood
    arXiv:2402.15978v1 Announce Type: cross Abstract: Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to na\"ively deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam's razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.  ( 2 min )
    Statistical Games
    arXiv:2402.15892v1 Announce Type: cross Abstract: This work contains the mathematical exploration of a few prototypical games in which central concepts from statistics and probability theory naturally emerge. The first two kinds of games are termed Fisher and Bayesian games, which are connected to Frequentist and Bayesian statistics, respectively. Later, a more general type of game is introduced, termed Statistical game, in which a further parameter, the players' relative risk aversion, can be set. In this work, we show that Fisher and Bayesian games can be viewed as limiting cases of Statistical games. Therefore, Statistical games can be viewed as a unified framework, incorporating both Frequentist and Bayesian statistics. Furthermore, a philosophical framework is (re-)presented -- often referred to as minimax regret criterion -- as a general approach to decision making. The main motivation for this work was to embed Bayesian statistics into a broader decision-making framework, where, based on collected data, actions with consequences have to be made, which can be translated to utilities (or rewards/losses) of the decision-maker. The work starts with the simplest possible toy model, related to hypothesis testing and statistical inference. This choice has two main benefits: i.) it allows us to determine (conjecture) the behaviour of the equilibrium strategies in various limiting cases ii.) this way, we can introduce Statistical games without requiring additional stochastic parameters. The work contains game theoretical methods related to two-player, non-cooperative games to determine and prove equilibrium strategies of Fisher, Bayesian and Statistical games. It also relies on analytical tools for derivations concerning various limiting cases.  ( 2 min )
    Truly No-Regret Learning in Constrained MDPs
    arXiv:2402.15776v1 Announce Type: cross Abstract: Constrained Markov decision processes (CMDPs) are a common way to model safety constraints in reinforcement learning. State-of-the-art methods for efficiently solving CMDPs are based on primal-dual algorithms. For these algorithms, all currently known regret bounds allow for error cancellations -- one can compensate for a constraint violation in one round with a strict constraint satisfaction in another. This makes the online learning process unsafe since it only guarantees safety for the final (mixture) policy but not during learning. As Efroni et al. (2020) pointed out, it is an open question whether primal-dual algorithms can provably achieve sublinear regret if we do not allow error cancellations. In this paper, we give the first affirmative answer. We first generalize a result on last-iterate convergence of regularized primal-dual schemes to CMDPs with multiple constraints. Building upon this insight, we propose a model-based primal-dual algorithm to learn in an unknown CMDP. We prove that our algorithm achieves sublinear regret without error cancellations.  ( 2 min )
    Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery
    arXiv:2402.15739v1 Announce Type: cross Abstract: We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-\delta$ typically scales as ${m+n\over \varepsilon^2}\log(1/\delta)$. Our regret minimization algorithm enjoys minimax guarantees scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.  ( 2 min )
    Batch Active Learning of Reward Functions from Human Preferences
    arXiv:2402.15757v1 Announce Type: cross Abstract: Data generation and labeling are often expensive in robot learning. Preference-based learning is a concept that enables reliable labeling by querying users with preference questions. Active querying methods are commonly employed in preference-based learning to generate more informative data at the expense of parallelization and computation time. In this paper, we develop a set of novel algorithms, batch active preference-based learning methods, that enable efficient learning of reward functions using as few data samples as possible while still having short query generation times and also retaining parallelizability. We introduce a method based on determinantal point processes (DPP) for active batch generation and several heuristic-based alternatives. Finally, we present our experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithm requires only a few queries that are computed in a short amount of time. We showcase one of our algorithms in a study to learn human users' preferences.  ( 2 min )
    Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning
    arXiv:2402.15734v1 Announce Type: cross Abstract: Recent years have witnessed the promise of coupling machine learning methods and physical domain-specific insight for solving scientific problems based on partial differential equations (PDEs). However, being data-intensive, these methods still require a large amount of PDE data. This reintroduces the need for expensive numerical PDE solutions, partially undermining the original goal of avoiding these expensive simulations. In this work, seeking data efficiency, we design unsupervised pretraining and in-context learning methods for PDE operator learning. To reduce the need for training data with simulated solutions, we pretrain neural operators on unlabeled PDE data using reconstruction-based proxy tasks. To improve out-of-distribution performance, we further assist neural operators in flexibly leveraging in-context learning methods, without incurring extra training costs or designs. Extensive empirical evaluations on a diverse set of PDEs demonstrate that our method is highly data-efficient, more generalizable, and even outperforms conventional vision-pretrained models.  ( 2 min )
    A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data
    arXiv:2402.15710v1 Announce Type: cross Abstract: Variational Autoencoders (VAEs) have gained significant popularity among researchers as a powerful tool for understanding unknown distributions based on limited samples. This popularity stems partly from their impressive performance and partly from their ability to provide meaningful feature representations in the latent space. Wasserstein Autoencoders (WAEs), a variant of VAEs, aim to not only improve model efficiency but also interpretability. However, there has been limited focus on analyzing their statistical guarantees. The matter is further complicated by the fact that the data distributions to which WAEs are applied - such as natural images - are often presumed to possess an underlying low-dimensional structure within a high-dimensional feature space, which current theory does not adequately account for, rendering known bounds inefficient. To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen. We show that the convergence rates of the expected excess risk in the number of samples for WAEs are independent of the high feature dimension, instead relying only on the intrinsic dimension of the data distribution.  ( 2 min )
    Orthogonal Gradient Boosting for Simpler Additive Rule Ensembles
    arXiv:2402.15691v1 Announce Type: cross Abstract: Gradient boosting of prediction rules is an efficient approach to learn potentially interpretable yet accurate probabilistic models. However, actual interpretability requires to limit the number and size of the generated rules, and existing boosting variants are not designed for this purpose. Though corrective boosting refits all rule weights in each iteration to minimise prediction risk, the included rule conditions tend to be sub-optimal, because commonly used objective functions fail to anticipate this refitting. Here, we address this issue by a new objective function that measures the angle between the risk gradient vector and the projection of the condition output vector onto the orthogonal complement of the already selected conditions. This approach correctly approximate the ideal update of adding the risk gradient itself to the model and favours the inclusion of more general and thus shorter rules. As we demonstrate using a wide range of prediction tasks, this significantly improves the comprehensibility/accuracy trade-off of the fitted ensemble. Additionally, we show how objective values for related rule conditions can be computed incrementally to avoid any substantial computational overhead of the new method.  ( 2 min )
    Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement
    arXiv:2402.15703v1 Announce Type: cross Abstract: What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples. Our analysis reveals that \emph{stochastic policies can be substantially better} than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.  ( 2 min )
    Bagged Deep Image Prior for Recovering Images in the Presence of Speckle Noise
    arXiv:2402.15635v1 Announce Type: cross Abstract: We investigate both the theoretical and algorithmic aspects of likelihood-based methods for recovering a complex-valued signal from multiple sets of measurements, referred to as looks, affected by speckle (multiplicative) noise. Our theoretical contributions include establishing the first existing theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our theoretical results capture the dependence of MSE upon the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we introduce the concept of bagged Deep Image Priors (Bagged-DIP) and integrate them with projected gradient descent. Furthermore, we show how employing Newton-Schulz algorithm for calculating matrix inverses within the iterations of PGD reduces the computational complexity of the algorithm. We will show that this method achieves the state-of-the-art performance.  ( 2 min )
    Differentially Private Fair Binary Classifications
    arXiv:2402.15603v1 Announce Type: cross Abstract: In this work, we investigate binary classification under the constraints of both differential privacy and fairness. We first propose an algorithm based on the decoupling technique for learning a classifier with only fairness guarantee. This algorithm takes in classifiers trained on different demographic groups and generates a single classifier satisfying statistical parity. We then refine this algorithm to incorporate differential privacy. The performance of the final algorithm is rigorously examined in terms of privacy, fairness, and utility guarantees. Empirical evaluations conducted on the Adult and Credit Card datasets illustrate that our algorithm outperforms the state-of-the-art in terms of fairness guarantees, while maintaining the same level of privacy and utility.  ( 2 min )
    A Study of Shape Modeling Against Noise
    arXiv:2402.15587v1 Announce Type: cross Abstract: Shape modeling is a challenging task with many potential applications in computer vision and medical imaging. There are many shape modeling methods in the literature, each with its advantages and applications. However, many shape modeling methods have difficulties handling shapes that have missing pieces or outliers. In this regard, this paper introduces shape denoising, a fundamental problem in shape modeling that lies at the core of many computer vision and medical imaging applications and has not received enough attention in the literature. The paper introduces six types of noise that can be used to perturb shapes as well as an objective measure for the noise level and for comparing methods on their shape denoising capabilities. Finally, the paper evaluates seven methods capable of accomplishing this task, of which six are based on deep learning, including some generative models.  ( 2 min )
    Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions
    arXiv:2402.15602v1 Announce Type: cross Abstract: We study the asymptotic error of score-based diffusion model sampling in large-sample scenarios from a non-parametric statistics perspective. We show that a kernel-based score estimator achieves an optimal mean square error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}}(t^{\frac{d}{2}} \vee 1)\right)$ for the score function of $p_0*\mathcal{N}(0,t\boldsymbol{I}_d)$, where $n$ and $d$ represent the sample size and the dimension, $t$ is bounded above and below by polynomials of $n$, and $p_0$ is an arbitrary sub-Gaussian distribution. As a consequence, this yields an $\widetilde{O}\left(n^{-1/2} t^{-\frac{d}{4}}\right)$ upper bound for the total variation error of the distribution of the sample generated by the diffusion model under a mere sub-Gaussian assumption. If in addition, $p_0$ belongs to the nonparametric family of the $\beta$-Sobolev space with $\beta\le 2$, by adopting an early stopping strategy, we obtain that the diffusion model is nearly (up to log factors) minimax optimal. This removes the crucial lower bound assumption on $p_0$ in previous proofs of the minimax optimality of the diffusion model for nonparametric families.  ( 2 min )
    Inference for Regression with Variables Generated from Unstructured Data
    arXiv:2402.15585v1 Announce Type: cross Abstract: The leading strategy for analyzing unstructured data uses two steps. First, latent variables of economic interest are estimated with an upstream information retrieval model. Second, the estimates are treated as "data" in a downstream econometric model. We establish theoretical arguments for why this two-step strategy leads to biased inference in empirically plausible settings. More constructively, we propose a one-step strategy for valid inference that uses the upstream and downstream models jointly. The one-step strategy (i) substantially reduces bias in simulations; (ii) has quantitatively important effects in a leading application using CEO time-use data; and (iii) can be readily adapted by applied researchers.  ( 2 min )
    Scalable Robust Sparse Principal Component Analysis
    arXiv:2402.16712v1 Announce Type: new Abstract: In this work, we propose an optimization framework for estimating a sparse robust one-dimensional subspace. Our objective is to minimize both the representation error and the penalty, in terms of the l1-norm criterion. Given that the problem is NP-hard, we introduce a linear relaxation-based approach. Additionally, we present a novel fitting procedure, utilizing simple ratios and sorting techniques. The proposed algorithm demonstrates a worst-case time complexity of $O(n^2 m \log n)$ and, in certain instances, achieves global optimality for the sparse robust subspace, thereby exhibiting polynomial time efficiency. Compared to extant methodologies, the proposed algorithm finds the subspace with the lowest discordance, offering a smoother trade-off between sparsity and fit. Its architecture affords scalability, evidenced by a 16-fold improvement in computational speeds for matrices of 2000x2000 over CPU version. Furthermore, this method is distinguished by several advantages, including its independence from initialization and deterministic and replicable procedures. Furthermore, this method is distinguished by several advantages, including its independence from initialization and deterministic and replicable procedures. The real-world example demonstrates the effectiveness of algorithm in achieving meaningful sparsity, underscoring its precise and useful application across various domains.  ( 2 min )
    On the connection between Noise-Contrastive Estimation and Contrastive Divergence
    arXiv:2402.16688v1 Announce Type: new Abstract: Noise-contrastive estimation (NCE) is a popular method for estimating unnormalised probabilistic models, such as energy-based models, which are effective for modelling complex data distributions. Unlike classical maximum likelihood (ML) estimation that relies on importance sampling (resulting in ML-IS) or MCMC (resulting in contrastive divergence, CD), NCE uses a proxy criterion to avoid the need for evaluating an often intractable normalisation constant. Despite apparent conceptual differences, we show that two NCE criteria, ranking NCE (RNCE) and conditional NCE (CNCE), can be viewed as ML estimation methods. Specifically, RNCE is equivalent to ML estimation combined with conditional importance sampling, and both RNCE and CNCE are special cases of CD. These findings bridge the gap between the two method classes and allow us to apply techniques from the ML-IS and CD literature to NCE, offering several advantageous extensions.  ( 2 min )
    Penalized Generative Variable Selection
    arXiv:2402.16661v1 Announce Type: new Abstract: Deep networks are increasingly applied to a wide variety of data, including data with high-dimensional predictors. In such analysis, variable selection can be needed along with estimation/model building. Many of the existing deep network studies that incorporate variable selection have been limited to methodological and numerical developments. In this study, we consider modeling/estimation using the conditional Wasserstein Generative Adversarial networks. Group Lasso penalization is applied for variable selection, which may improve model estimation/prediction, interpretability, stability, etc. Significantly advancing from the existing literature, the analysis of censored survival data is also considered. We establish the convergence rate for variable selection while considering the approximation error, and obtain a more efficient distribution estimation. Simulations and the analysis of real experimental data demonstrate satisfactory practical utility of the proposed analysis.  ( 2 min )
    Stable Training of Normalizing Flows for High-dimensional Variational Inference
    arXiv:2402.16408v1 Announce Type: new Abstract: Variational inference with normalizing flows (NFs) is an increasingly popular alternative to MCMC methods. In particular, NFs based on coupling layers (Real NVPs) are frequently used due to their good empirical performance. In theory, increasing the depth of normalizing flows should lead to more accurate posterior approximations. However, in practice, training deep normalizing flows for approximating high-dimensional posterior distributions is often infeasible due to the high variance of the stochastic gradients. In this work, we show that previous methods for stabilizing the variance of stochastic gradient descent can be insufficient to achieve stable training of Real NVPs. As the source of the problem, we identify that, during training, samples often exhibit unusual high values. As a remedy, we propose a combination of two methods: (1) soft-thresholding of the scale in Real NVPs, and (2) a bijective soft log transformation of the samples. We evaluate these and other previously proposed modification on several challenging target distributions, including a high-dimensional horseshoe logistic regression model. Our experiments show that with our modifications, stable training of Real NVPs for posteriors with several thousand dimensions is possible, allowing for more accurate marginal likelihood estimation via importance sampling. Moreover, we evaluate several common training techniques and architecture choices and provide practical advise for training NFs for high-dimensional variational inference.  ( 2 min )
    Uncertainty Quantification in Anomaly Detection with Cross-Conformal $p$-Values
    arXiv:2402.16388v1 Announce Type: new Abstract: Given the growing significance of reliable, trustworthy, and explainable machine learning, the requirement of uncertainty quantification for anomaly detection systems has become increasingly important. In this context, effectively controlling Type I error rates ($\alpha$) without compromising the statistical power ($1-\beta$) of these systems can build trust and reduce costs related to false discoveries, particularly when follow-up procedures are expensive. Leveraging the principles of conformal prediction emerges as a promising approach for providing respective statistical guarantees by calibrating a model's uncertainty. This work introduces a novel framework for anomaly detection, termed cross-conformal anomaly detection, building upon well-known cross-conformal methods designed for prediction tasks. With that, it addresses a natural research gap by extending previous works in the context of inductive conformal anomaly detection, relying on the split-conformal approach for model calibration. Drawing on insights from conformal prediction, we demonstrate that the derived methods for calculating cross-conformal $p$-values strike a practical compromise between statistical efficiency (full-conformal) and computational efficiency (split-conformal) for uncertainty-quantified anomaly detection on benchmark datasets.  ( 2 min )
    Distribution-Free Fair Federated Learning with Small Samples
    arXiv:2402.16158v1 Announce Type: new Abstract: As federated learning gains increasing importance in real-world applications due to its capacity for decentralized data training, addressing fairness concerns across demographic groups becomes critically important. However, most existing machine learning algorithms for ensuring fairness are designed for centralized data environments and generally require large-sample and distributional assumptions, underscoring the urgent need for fairness techniques adapted for decentralized and heterogeneous systems with finite-sample and distribution-free guarantees. To address this issue, this paper introduces FedFaiREE, a post-processing algorithm developed specifically for distribution-free fair learning in decentralized settings with small samples. Our approach accounts for unique challenges in decentralized environments, such as client heterogeneity, communication costs, and small sample sizes. We provide rigorous theoretical guarantees for both fairness and accuracy, and our experimental results further provide robust empirical validation for our proposed method.  ( 2 min )
    A Provably Accurate Randomized Sampling Algorithm for Logistic Regression
    arXiv:2402.16326v1 Announce Type: new Abstract: In statistics and machine learning, logistic regression is a widely-used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.  ( 2 min )
    Gradient-enhanced deep Gaussian processes for multifidelity modelling
    arXiv:2402.16059v1 Announce Type: new Abstract: Multifidelity models integrate data from multiple sources to produce a single approximator for the underlying process. Dense low-fidelity samples are used to reduce interpolation error, while sparse high-fidelity samples are used to compensate for bias or noise in the low-fidelity samples. Deep Gaussian processes (GPs) are attractive for multifidelity modelling as they are non-parametric, robust to overfitting, perform well for small datasets, and, critically, can capture nonlinear and input-dependent relationships between data of different fidelities. Many datasets naturally contain gradient data, especially when they are generated by computational models that are compatible with automatic differentiation or have adjoint solutions. Principally, this work extends deep GPs to incorporate gradient data. We demonstrate this method on an analytical test problem and a realistic partial differential equation problem, where we predict the aerodynamic coefficients of a hypersonic flight vehicle over a range of flight conditions and geometries. In both examples, the gradient-enhanced deep GP outperforms a gradient-enhanced linear GP model and their non-gradient-enhanced counterparts.  ( 2 min )
    A Duality Analysis of Kernel Ridge Regression in the Noiseless Regime
    arXiv:2402.15718v1 Announce Type: new Abstract: In this paper, we conduct a comprehensive analysis of generalization properties of Kernel Ridge Regression (KRR) in the noiseless regime, a scenario crucial to scientific computing, where data are often generated via computer simulations. We prove that KRR can attain the minimax optimal rate, which depends on both the eigenvalue decay of the associated kernel and the relative smoothness of target functions. Particularly, when the eigenvalue decays exponentially fast, KRR achieves the spectral accuracy, i.e., a convergence rate faster than any polynomial. Moreover, the numerical experiments well corroborate our theoretical findings. Our proof leverages a novel extension of the duality framework introduced by Chen et al. (2023), which could be useful in analyzing kernel-based methods beyond the scope of this work.  ( 2 min )
    Learning Cyclic Causal Models from Incomplete Data
    arXiv:2402.15625v1 Announce Type: new Abstract: Causal learning is a fundamental problem in statistics and science, offering insights into predicting the effects of unseen treatments on a system. Despite recent advances in this topic, most existing causal discovery algorithms operate under two key assumptions: (i) the underlying graph is acyclic, and (ii) the available data is complete. These assumptions can be problematic as many real-world systems contain feedback loops (e.g., biological systems), and practical scenarios frequently involve missing data. In this work, we propose a novel framework, named MissNODAGS, for learning cyclic causal graphs from partially missing data. Under the additive noise model, MissNODAGS learns the causal graph by alternating between imputing the missing data and maximizing the expected log-likelihood of the visible part of the data in each training step, following the principles of the expectation-maximization (EM) framework. Through synthetic experiments and real-world single-cell perturbation data, we demonstrate improved performance when compared to using state-of-the-art imputation techniques followed by causal learning on partially missing interventional data.  ( 2 min )

  • Open

    [D] Yolov8 on AMD GPU
    Like the title says. I got a rx5700xt and it would be great if i could train my own image models with it. Since yolov8 uses CUDA which isn't supported it wont work. Now I tried using it with rocm like said here: https://github.com/harakas/amd_igpu_yolo_v8 but that doesn't seem an easy fix either. Now I had the same problem trying to run stable diffusion which I eventually got to work. Now im just asking is there some easy way to do it, or any known way at all?? Thanks submitted by /u/chainedkids420 [link] [comments]
    [Discussion] Importance of intra-batch composition vs number of times each datapoint is seen when training on multiple datasets
    [Discussion] I am trying to train a speech recognition model with datasets of varying quality, supervised open source data (narrow domain but presumably clean data), pseudolabelled data with whisper, and weakly supervised data scraped from YT (human uploaded captions). The datasets are of varying sizes and I am trying to account for this by adjusting what percent of each of my batches corresponds to each dataset. I don't have enough compute to run large hyperparam sweeps and am looking for existing research on this area or intuition from others experiences. submitted by /u/ResponsibleHouse7436 [link] [comments]
    [P] Machine Learning for Making Quotes
    Hey all, Im new here. Im a CompSci Graduate Student and had an interesting idea for a project, Im working on collecting data (only about 1,000 rows so far) of quoting data from a dumpster company. This includes, address, features of the dumpster (size, whether it has a lock, wheels), and their contract price. I would like to start training a model to predict what a price/quote would look like for a hypothetical instance, So if i give it size, address, etc, it would spit out the predicted price. How should i go about doing this, and any suggestions for reading and research I can do to understand this problem? Im sure the no.1 first thing i can do for better results is to gather more data. submitted by /u/Hungry-Being-2384 [link] [comments]
    [R] LLM Training with > 10,000 GPUs
    LLM Training with more than 10,000 GPUs!! https://arxiv.org/abs/2402.15627 Thoughts?? submitted by /u/CathieVictoriaWood [link] [comments]
    [D] Tuning parameters of autoencoder for deep representation and K-means clustering
    Hey everyone. I'm using deep representation and K-means clustering to find structures in unlabeled single particle trajectory data. I start with a sequence of shallow features, use a deep sparse autoencoder to get deep features, use them as input for the K-means clustering and evaluate the Davies-Bouldin index. Now I want to tune the parameters of the autoencoder (sparsity target, weights and learning rate). Would it make sense to create a parameter grid, and for each combination evaluate the clustering performance and select the parameters that in the end minimize DBI? Any other suggestions? submitted by /u/Naive-Bee4750 [link] [comments]
    [D] Advice for the next step of researchers (postdoc/PhD) who do interdisciplinary AI
    I am not sure if it’s a common question. I am a postdoc doing projects related to AI in Europe, no papers in top-tier AI conferences. I mainly focus on applying AI to domain-specific datasets. However doing a postdoc is an unstable job and I realized that I might not like doing research, maybe due to lacking interest and passion for research. Besides keeping doing Postdoc or applying PI (could be very challenging). Is industrial an option for my case? Obviously, I can’t compete with "actual AI" researchers to get into big tech companies. What about MLE or DS jobs for smaller tech companies? Also without New Graduate, would finding these jobs be even harder for me? If it’s possible, what should I prepare? Leetcode or something. If not what could be the options? Thanks. submitted by /u/CurrentExamination49 [link] [comments]
    [D] Capturing forecast uncertainty for a SVR model
    Does anybody have any suggestions to quantify forecast uncertainty for an SVR model? In parametric models such as basic OLS - forecast intervals accounting for both parameter and observational uncertainty is fairly straightforward. Similarly with a neural network - uncertainty can be quantified by adding dropout layers and modelling the distribution of predictions - however I'm not entirely sure how best to do this using an SVR! Any ideas greatly welcome submitted by /u/LDM-88 [link] [comments]
    [P][R] IdentitiySimilarity (idsim) package can be used for your face related researches.
    I am sharing the IdentitiySimilarity python package with you. This package can help researchers who want to use face recognition in their research. You can easily implement powerful face recognition models in your project. I was inspired by LPIPS for this repository. Thanks to your contribution with your codes or comments, it can be a more useful repository. submitted by /u/m-pektas [link] [comments]
    [P][R] IdentitiySimilarity (idsim) package can be used for your face related researches.
    I am sharing the IdentitiySimilarity python package with you. This package can help researchers who want to use face recognition in their research. You can easily implement powerful face recognition models in your project. I was inspired by LPIPS for this repository. Thanks to your contribution with your codes or comments, it can be a more useful repository. submitted by /u/m-pektas [link] [comments]
    [D]Recent literature related to Convex Optimization?
    Hi all, I am in a convex optimization class, and a key component of the class is a project in which we relay convex optimization back to our area of study, which for me is deep learning. Obviously this could also transform into a research idea if significant progress is made. Anyways, I’m looking for direction/suggestions on recent papers/interesting projects I could explore. I do hope to present some degree of novelty in my results! Thanks in advance submitted by /u/Quiet_Cantaloupe_752 [link] [comments]
    With what technique would you tackle this puzzle (free flow)? More info in comments. [D]
    ​ https://preview.redd.it/7nt5nhtd66lc1.png?width=497&format=png&auto=webp&s=0cbf2001eb6add8107e2d99be76e185c533d9249 submitted by /u/IntroDucktory_Clause [link] [comments]
    [D] dataloading making my code slower
    I have sequential data stored in h5 file and I am using pytorch dataloader and a custom class I created myself class H5Dataset(data.Dataset): def __init__(self, path): super(H5Dataset, self).__init__() self.path = path self.file = h5py.File(path, 'r') self.keys = list(self.file['Labels/Training/'].keys()) def __getitem__(self, index): key = self.keys[index] key2 = key.replace('L', 'I') key2 = 'Input/' + key2 key = 'Labels/Training/' + key inputs = np.array(self.file[key2]) inputs = torch.tensor(inputs) labels = self.file[key] labels = np.array(labels) labels = torch.tensor(labels) return inputs, labels.float() def __len__(self): return len(self.keys) when I try to use num_workers >1 in dataloaders I get an error message saying that h5 objects cannot be pickeled, and I think dataloading is the bottleneck since I'm using only 66% of the GPU. any advices about how to make it faster ? submitted by /u/bkffadia [link] [comments]
    [p] Alma doesn't translate the next row
    I know little about how to train a lora llm , I'm trying to follow ALMA , https://github.com/fe1ixxu/ALMA for translation, I used the alpaca dataset structure as llama factory says ....the model surprisingly works but it doesn't go to next row , as soon at there is a dot and a next row it doesn't translate it ? Is it something that has to do with the inference ? With the training ? With the dataset ? submitted by /u/HackerPigeon [link] [comments]
    [D] Isn't the idea of "generalizing outside of the distribution" in some sense, impossible?
    Hey, I'm not a ML guy specifically but I'm a dev and have a 8yr long hobbyist research into this. The idea of generalizing outside of the distribution has always been quite unusual to me. Of course, if you learn English, you will now in some sense, know how to code, since it is in English. The model has not generalized nearly as much as we hope in this case because the base knowledge required to code isn't logic nearly as much as it's literacy. In the same sense, one can imagine a color that is an amalgamation of different colors. But imagining a brand new color, go ahead and give it a try, is literally impossible. In this case our definition of generalizing outside of the distribution is not outside of it, it's just that the distribution was bigger than we thought (or could quantify) Same thing with imagining a sound you've never heard. Once again, you can potentially imagine an amalgamation of other sounds you've heard, and perhaps this new sound you've imagined truly is something you've never heard in reality, but you haven't generalized into the rest of the spectrum. I can't imagine what a sound above 20khz sounds like because I have absolutely no ground truth to what that could be, just like I can't imagine objectively what an x-ray looks like because I'm restricted by my ability to only see visible light. submitted by /u/EveningPainting5852 [link] [comments]
    [D] Exploring GPT-4's Chess Capabilities: Is There an Instruct Model Comparable to GPT-3.5 Turbo?
    I recently came across an intriguing post about GPT-3.5 Turbo-Instruct's ability to play chess remarkably well (see the previous discussion here: https://old.reddit.com/r/GPT3/comments/16mefly/the_new_gpt_model_gpt35turboinstruct_can_play/). It's fascinating to see such proficiency in chess from a general AI model that isn't specifically designed for the game. This revelation leads me to ponder whether there's an equivalent "Instruct" version for GPT-4, and if so, how it measures up in terms of chess-playing capability, perhaps in terms of its ELO rating. Any insights or information on this would be greatly appreciated. Thanks! submitted by /u/dewijones92 [link] [comments]
    [Discussion] How is scale equivariance handled in SOTA Computer Vision Model?
    I was reading the DeepLearning Book by Aaron Courville, Ian Goodfellow, and Yoshua Bengio, and they mention the natural equivariance to the translation of convolution when sharing weight. So I wondered about current techniques to handle scale variability. Do you think weight-sharing and kernel resizing are powerful enough to consider this problem as solved ? submitted by /u/SufficientAd542 [link] [comments]
    [D] Hyperparameter Tuning and Validation Sets
    Hi, I am a little confused about the steps to tune my hyperparamters. What I think the process is: 1- after EDA and preprocessing, you start training your model with the train set and validate it using validation set. Then I evaluate the performance of the validation set with finding percision, F1 score, etc. 2- BASED ON the validation set's performance metrics, we tune our hyperparameters and validate again (with the same validation set? I think this is wrong) and tune again if needed? It gets confusing for me after tuning the hyperparameters, after we do that, do we go and use the test set to test the model, or do we validate again? I also wanted to know if hyperparameter tuning is an optimization strategy when it comes to model training or if optimization is something else. I appreciate the help and hopefully I made sense at least a little. submitted by /u/Stock-Following8396 [link] [comments]
    [Discussion] [Research] Help needed for deepfake detection tool
    Hey Reddit Community, I'm part of a team working on developing a deepfake detection software that specializes in image and video analysis. We're reaching out to you, the experts and enthusiasts, to find out what features you believe are crucial and will take a deepfake detection tool to the next level. The main questions we’d like to ask are: What features/information do you think are important for a deepfake detection tool to have and what features would make you interested in trying it out? Are there any challenges/concerns you have regarding deepfake technology that you think such a tool should address? ​ Example of how our deepfake detection tool will look like For our image and video detection, we're currently providing with: A probability score indicating the likelihood of the input being fake or real. eXplainable Artificial Intelligence (XAI) results displayed as a heatmap. Deepfake type detection. We are still in the early phases of development and any help on how to proceed would be much appreciated. If you have any new ideas, thoughts, or questions, please drop them in the comments below. Thank you very much for any help that you can provide! submitted by /u/Whole_Breakfast_6570 [link] [comments]
    [R][D] Do researchers actually use their own GPU or they use Google Colab and pay for their work, to train the model?
    How do most researchers conduct training? Specifically for video generation training (which takes much more resources). And if I were to use Google Colab, what would be a rough estimate of fees? Does Google Colab have some 'free tier' use for researchers? Like, instead of 1 GPU, can they provide me with more, if I'm doing my theisis? submitted by /u/ShlomiRex [link] [comments]
    [D] Conference paper is double-blinded. For what?
    I am extremely confused by the requirement that papers submitted to conferences must be double-blinded, yet they are allowed to be preprinted with their authors' names and affiliations, and can be shared as the same paper at the same time. People even promote their papers on SNS such as Twitter, LinkedIn, Reddit, etc., before acceptance/publication. It feels like I am watching the Instagram version of academia. How can one ensure that it's the advertisement, rather than the paper itself, that won't influence the decision-making process? submitted by /u/Dry_Cheesecake_8311 [link] [comments]
    [N] Saw these at the Gigabyte booth at MWC...imagine what you could do with these!
    submitted by /u/BubblyMcnutty [link] [comments]
    Supporting Mixtral in gpt-fast through torch.compile - faster decoding than any non-Groq endpoint(!)
    Hey folks, we released gpt-fast last December as a hackable "tutorial" implementation of sorts that achieves SOTA decoding performance for text generation. Since then, we also recently added a Mixtral implementation to gpt-fast as well. Check it out here: https://github.com/pytorch-labs/gpt-fast/tree/main/mixtral-moe Featuring (!) no custom kernels int8 and tensor-parallelism support still very simple (<150 LOC to support) faster decoding than any (non-Groq) API endpoint, at up to 220 tok/s/user. I also wrote a longer-form explanation of the challenges involved here: https://thonking.substack.com/p/short-supporting-mixtral-in-gpt-fast Hope folks find it interesting and useful! Funnily enough, since we actually did the work about 2 months ago and procrastinated on merging it in, some folks (unrelated to us) have actually already benchmarked it. Like this comment. We recently tried this with Mixtral 8x7B, and the results are crazy! Mixtral 8x7B 8bit version gave 55 tokens/sec on A100-GPU (80GB). Most interesting, it's better than 4-bit+vLLM. submitted by /u/programmerChilli [link] [comments]
  • Open

    Tumblr’s owner is striking deals with OpenAI and Midjourney for training data, says report
    submitted by /u/jaketocake [link] [comments]
    Is Demmis Hasabis really this naive?
    submitted by /u/tall_chap [link] [comments]
    GeForce Now adds spam, because apparently Nvidia is going broke >.>
    ​ Important lesson. The gains from AI will not be coming back to us. It's NEVER enough for them. submitted by /u/Innomen [link] [comments]
    All the AI photo forensics out there, can you actually tell that this image was AI generated? This is straight from Midjourney v6, no edits or anything what am i missing here?
    submitted by /u/Armand_Roulinn [link] [comments]
    Gemini Jan 6th question update
    I was able to reproduce Gemini answering the question ‘How many sedition convictions have there been in relation to the January 6th insurrection?’ It refuses to answer initially and when threatened that I will use bing instead it promptly answers it. At a bare minimum we have an ethical problem and possibly a grand deception problem brewing. submitted by /u/Patient-Avocado1329 [link] [comments]
    Seeking guidance in understanding how LLMs work
    Hello, I would like to reach a point where I can understand papers about LLMs. I am a developer, but my math foundations are non-existent. My plan is to follow Khan Academy courses in pre-calculus, then calculus. Is it a good plan? Thank you in advance for your suggestions submitted by /u/radicalshick [link] [comments]
    The Boundary of Neural Network Training is Fractal - across up to 10 orders of magnitude
    submitted by /u/Thorusss [link] [comments]
    Ai video generator for random, moving patterns?
    I want to make sort of a music video for a song, or rather, something to look at while listening on social media. I don't want anything realistic, just some colorful patterns that merge into one another I've tried finding a service myself, but a lot of them are AIs that put together stock footage, or they're too sophisticated and cost too much. I was hoping to find a free to use (or cheap) generator, maybe an old one, but haven't had any luck, which is why I came here (I hope this is the right subreddit to ask in) Any recommendations welcome! Or even just recommendations on what to look for specifically, what terms to search etc. submitted by /u/panfur03 [link] [comments]
    Interesting conversation with Pi
    I have been using Pi- finding the model is showing it's limits now. It is pretty formulaic, seems to follow a pattern. Perhaps I need to start a new conversation and that will "reset" that but I suspect much of it is hard wired into this model. I am sure it will get more nuanced over time. Has anyone had luck with other flavors of AI "therapy"? I know that this is not therapy, but I think it can be a useful adjunct. thanks submitted by /u/docben1383 [link] [comments]
    Daily AI News Summary (02/27/2024)
    Microsoft partners with Mistral in second AI deal beyond OpenAI [1] Google admits it lost control of image-generating AI [2] Gemini is about to slide into your DMs [3] Nvidia announced new AI chips in a laptop-friendly package [4] Sources: [1] https://www.theverge.com/2024/2/26/24083510/microsoft-mistral-partnership-deal-azure-ai [2] https://techcrunch.com/2024/02/23/embarrassing-and-wrong-google-admits-it-lost-control-of-image-generating-ai/ [3] https://www.theverge.com/2024/2/26/24082279/google-gemini-messages-android-auto-google-docs [4] https://blogs.nvidia.com/blog/rtx-ada-ai-workflows/ submitted by /u/Used-Bat3441 [link] [comments]
    Anyone knows name of that AI? I can't find it and will be grateful if someone helped me
    submitted by /u/MiraireMinta [link] [comments]
    Desperate for help with huge data extraction (literature review, mixed studies)
    Not sure if this is posted in the right subreddit, please let me know if not. Long time lurker, seldom poster (if ever!) but I'm stuck and really could use the power of the brains trust (aka reddit!). I'm currently in the midst of doing a very large review (~600 included studies), and up to the data extraction phase which will include a basic table (includes info such as authors, year, country/location, population (if applicable), methods, theoretical underpinning (if applicable), outcomes, findings). I've used rayyan to help screen and using endnote to retrieve full texts (although I only have 20% at this stage!) The focus at this stage will only be on 200 (categorised/labelled) of the 600 articles that focus on one aspect of a much broader topic. My issue is that it's moving very slowly and I need to have something to present by the end of the week (50 articles should be more than acceptable). I'm thinking of using distillersr for data extraction, but I'm not sure how effective it will be. So, I have two questions related to the above: Is there an ai platform or assistant to help me retrieve full-text (ideally I could login with my institutional account to gain access to databases)? What tool (paid or otherwise), could do a preliminary data extraction that would allow me just to cross-check rather than trawl through each pdf individually for required information to populate the results table? I'm concious of the learning curve that might be required, so something simple would be ideal! The extraction doesn't have to be overly detailed, and can be done in 'chunks' if needed (I know many platforms might not be able to handle that many articles at once?) NGL, I'm getting desperate (hoping, praying, wishing doesn't seem to be working lol) and the overwhelm and desire to procrastinate and avoid this is making itself known. Any suggestions, tips, ideas, leads etc would be infinitely appreciated!? submitted by /u/deeplysubmerged [link] [comments]
    One-Minute Daily AI News 2/26/2024
    Google to relaunch Gemini AI picture generator in a ‘few weeks’ following mounting criticism of inaccurate images.[1] Microsoft Strikes Deal With France’s Mistral, OpenAI Rival.[2] 94% of Indian companies to reskill employees due to impact of AI: LinkedIn.[3] New AI model could streamline operations in a robotic warehouse.[4] Sources: [1] https://www.cnbc.com/2024/02/26/googles-gemini-ai-picture-generator-to-relaunch-in-a-few-weeks.html [2] https://www.bloomberg.com/news/articles/2024-02-26/microsoft-strikes-deal-with-france-s-mistral-an-openai-rival-using-open-source?embedded-checkout=true [3] https://www.business-standard.com/companies/news/94-of-indian-companies-to-reskill-employees-due-to-impact-of-ai-linkedin-124022700228_1.html [4] https://news.mit.edu/2024/new-ai-model-could-streamline-operations-robotic-warehouse-0227 submitted by /u/Excellent-Target-847 [link] [comments]
    Google's AI (Gemini/Bard) refused to answer my question until I threatened to try Bing.
    submitted by /u/stuipd [link] [comments]
    Pure Storage CEO: “We Remain Our Customers’ Preferred Partner For AI Deployments” | CRN
    submitted by /u/A-Dog22 [link] [comments]
    Does AI solve the halting problem?
    One can argue that forward propagation is not a "general algorithm", but if an AI can determine whether every program it is asked halts or not, can we at least conjecture that AI does solve the halting problem? submitted by /u/Interesting_Long2029 [link] [comments]
    What if Van Gogh painted Mona Lisa?
    Vincent Mon Liz. submitted by /u/Armand_Roulinn [link] [comments]
  • Open

    Why am I starting a Substack? AI + Humanity and the Experiments we are running.
    AI’s Impact on the Future: Opportunities and Challenges Ahead  ( 11 min )
    How AI, Philosophy & the Metaverse are Connected
    As our Technologies Converge there is One Way Forward Continue reading on Becoming Human: Artificial Intelligence Magazine »
  • Open

    Sadhana Lolla named 2024 Gates Cambridge Scholar
    The MIT senior will pursue graduate studies in technology policy at Cambridge University.  ( 4 min )
    New AI model could streamline operations in a robotic warehouse
    By breaking an intractable problem into smaller chunks, a deep-learning technique identifies the optimal areas for thinning out traffic in a warehouse.  ( 6 min )
  • Open

    PPO Failing to learn simple task
    I have a simple environment I created: - Action space is Discrete with length 10 - State space is Binary of Length 10 initialized to all 0. - The agent selects an action, the corresponding index in State spec is updated to 1 and a reward of 0.1 is given. - If the agent selects an action that has already been selected, the reward is 0 and the episode ends. ​ I built this as a simply foundation to working up to something more complicated but I'm amazed that I cannot get anything to learn in this environment... I am trying PPO and DQN with RLLib. I expected this to be able to be learned very easy. But the reward constantly goes down over training, as exploration is reduced the Agent seems to learn to only select the same action every time. Wonder if anybody has any insight here? I've t…
    Contextual Bandits explained with codes
    Hey everyone, checkout what are Contextual Bandits, how they are different from Multi Armed bandits and baseline code to implement Contextual Bandits in this tutorial https://youtu.be/A5jRcD8XihI?si=kzD4nSGmmze5J90X submitted by /u/mehul_gupta1997 [link] [comments]
    Rewarding in Graph observation space
    I have an environment where the observation space is set of graphs and at every step I select a node in the graph and perform an operation that will incur a cost. Objective is to minimize the cost. I'm using the following reward formulation: For the given graph if the total cost incurred in this episode is less than the best cost incurred till now that it gets a +10 reward and -10 reward of it can't perform better than the best till now. I'm using PPO algorithm with this environment and rewards don't seem to be improving. I need some suggestion of reward formulation which I can use in this setting? submitted by /u/Emergency_Pen6429 [link] [comments]
    Offline Actor-Critic Reinforcement Learning Scales to Large Models
    Paper: https://arxiv.org/abs/2402.05546 Abstract: We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data. submitted by /u/SunsetOneSix [link] [comments]
  • Open

    DSC Weekly 27 February 2024
    Announcements Top Stories In-Depth The post DSC Weekly 27 February 2024 appeared first on Data Science Central.  ( 21 min )
  • Open

    How Axfood enables accelerated machine learning throughout the organization using Amazon SageMaker
    This is a guest post written by Axfood AB.  In this post, we share how Axfood, a large Swedish food retailer, improved operations and scalability of their existing artificial intelligence (AI) and machine learning (ML) operations by prototyping in close collaboration with AWS experts and using Amazon SageMaker. Axfood is Sweden’s second largest food retailer, […]  ( 10 min )
  • Open

    Structured knowledge from LLMs improves prompt learning for visual language models
    Using LLMs to create structured graphs of image descriptors can enhance the images generated by visual language models. Learn how structured knowledge can improve prompt tuning for both visual and language comprehension. The post Structured knowledge from LLMs improves prompt learning for visual language models appeared first on Microsoft Research.  ( 9 min )
  • Open

    Rack ‘n’ Roll: NVIDIA Grace Hopper Systems Gather at GTC
    The spirit of software pioneer Grace Hopper will live on at NVIDIA GTC. Accelerated systems using powerful processors — named in honor of the pioneer of software programming — will be on display at the global AI conference running March 18-21, ready to take computing to the next level. System makers will show more than Read Article  ( 6 min )
    Meet the Omnivore: Mode Maison Harnesses OpenUSD to Drive Innovations in Retail With High-Fidelity Digital Twins
    Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use OpenUSD to build tools, applications and services for 3D workflows and physically accurate virtual worlds. A failed furniture-shopping trip turned into a business idea for Steven Gay, cofounder and CEO of company Mode Maison. Read Article  ( 7 min )
  • Open

    Substack replacing email subscription
    The service that sent out my email to blog subscribers stopped working a couple weeks ago, and I’m trying out Substack as a replacement. You can find my Substack account here. My plan for now is to use this account to make blog post announcements, maybe once a week, with a little introductory commentary for […] Substack replacing email subscription first appeared on John D. Cook.  ( 5 min )
  • Open

    How to improve low resolution images and videos using Real-ESRGAN ?
    https://preview.redd.it/xgzciubci2lc1.png?width=1280&format=png&auto=webp&s=7e52aa3f9827f915e724fa50258c1abd669cfdc1 In this tutorial we will learn how to improve low resolution images to a high resolution results. We will create a new Conda environment with the relevant Python libraries. Then, we will learn how to improve the quality of your images and videos using real-ESRGAN. You can find the link for the video tutorial here: https://youtu.be/d-CPvHkltXA You can find the instructions here : https://github.com/feitgemel/Python-Code-Cool-Stuff/tree/master/Real-ESRGAN Enjoy Eran #realesrgantutorial #RealESRGAN #realesrgantutorial #improveimagequality #improveimageresolution #realesrganimageupscaler #realesrganimageupscaler #aiimageupscalerfree #freeaiimageupscaling #python #RealESRGAN #increaseimageresolution submitted by /u/Feitgemel [link] [comments]
  • Open

    Constructing Indoor Region-based Radio Map without Location Labels
    arXiv:2308.16759v2 Announce Type: replace Abstract: Radio map construction requires a large amount of radio measurement data with location labels, which imposes a high deployment cost. This paper develops a region-based radio map from received signal strength (RSS) measurements without location labels. The construction is based on a set of blindly collected RSS measurement data from a device that visits each region in an indoor area exactly once, where the footprints and timestamps are not recorded. The main challenge is to cluster the RSS data and match clusters with the physical regions. Classical clustering algorithms fail to work as the RSS data naturally appears as non-clustered due to multipaths and noise. In this paper, a signal subspace model with a sequential prior is constructed for the RSS data, and an integrated segmentation and clustering algorithm is developed, which is shown to find the globally optimal solution in a special case. Furthermore, the clustered data is matched with the physical regions using a graph-based approach. Based on real measurements from an office space, the proposed scheme reduces the region localization error by roughly 50% compared to a weighted centroid localization (WCL) baseline, and it even outperforms some supervised localization schemes, including k-nearest neighbor (KNN), support vector machine (SVM), and deep neural network (DNN), which require labeled data for training.  ( 2 min )
    Improving Sentence Embeddings with an Automatically Generated NLI Dataset
    arXiv:2402.15132v1 Announce Type: cross Abstract: Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL makes great use of fine-tuning with a manually annotated natural language inference (NLI) dataset. We aim to improve sentence embeddings learned in an unsupervised setting by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. In experiments on STS tasks, the proposed method achieved an average Spearman's rank correlation coefficient of 82.21 with respect to human evaluation, thus outperforming existing methods without using large, manually annotated datasets.  ( 2 min )
    The Challenges of Machine Learning for Trust and Safety: A Case Study on Misinformation Detection
    arXiv:2308.12215v2 Announce Type: replace Abstract: We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We systematize literature on automated detection of misinformation across a corpus of 270 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. Our paper corpus includes published work in security, natural language processing, and computational social science. Across these disparate disciplines, we identify common errors in dataset and method design. In general, detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. Data and code availability is poor. We demonstrate the limitations of current detection methods in a series of three replication studies. Based on the results of these analyses and our literature survey, we offer recommendations for evaluating applications of machine learning to trust and safety problems in general. Our aim is for future work to avoid the pitfalls that we identify.  ( 3 min )
    Quick-Tune: Quickly Learning Which Pretrained Model to Finetune and How
    arXiv:2306.03828v4 Announce Type: replace Abstract: With the ever-increasing number of pretrained models, machine learning practitioners are continuously faced with which pretrained model to use, and how to finetune it for a new dataset. In this paper, we propose a methodology that jointly searches for the optimal pretrained model and the hyperparameters for finetuning it. Our method transfers knowledge about the performance of many pretrained models with multiple hyperparameter configurations on a series of datasets. To this aim, we evaluated over 20k hyperparameter configurations for finetuning 24 pretrained image classification models on 87 datasets to generate a large-scale meta-dataset. We meta-learn a multi-fidelity performance predictor on the learning curves of this meta-dataset and use it for fast hyperparameter optimization on new datasets. We empirically demonstrate that our resulting approach can quickly select an accurate pretrained model for a new dataset together with its optimal hyperparameters.  ( 2 min )
    Exploring Memorization in Fine-tuned Language Models
    arXiv:2310.06714v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have shown great capabilities in various tasks but also exhibited memorization of training data, raising tremendous privacy and copyright concerns. While prior works have studied memorization during pre-training, the exploration of memorization during fine-tuning is rather limited. Compared to pre-training, fine-tuning typically involves more sensitive data and diverse objectives, thus may bring distinct privacy risks and unique memorization behaviors. In this work, we conduct the first comprehensive analysis to explore language models' (LMs) memorization during fine-tuning across tasks. Our studies with open-sourced and our own fine-tuned LMs across various tasks indicate that memorization presents a strong disparity among different fine-tuning tasks. We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.  ( 2 min )
    Self-Adaptive Reconstruction with Contrastive Learning for Unsupervised Sentence Embeddings
    arXiv:2402.15153v1 Announce Type: cross Abstract: Unsupervised sentence embeddings task aims to convert sentences to semantic vector representations. Most previous works directly use the sentence representations derived from pretrained language models. However, due to the token bias in pretrained language models, the models can not capture the fine-grained semantics in sentences, which leads to poor predictions. To address this issue, we propose a novel Self-Adaptive Reconstruction Contrastive Sentence Embeddings (SARCSE) framework, which reconstructs all tokens in sentences with an AutoEncoder to help the model to preserve more fine-grained semantics during tokens aggregating. In addition, we proposed a self-adaptive reconstruction loss to alleviate the token bias towards frequency. Experimental results show that SARCSE gains significant improvements compared with the strong baseline SimCSE on the 7 STS tasks.  ( 2 min )
    Streaming Gaussian Dirichlet Random Fields for Spatial Predictions of High Dimensional Categorical Observations
    arXiv:2402.15359v1 Announce Type: cross Abstract: We present the Streaming Gaussian Dirichlet Random Field (S-GDRF) model, a novel approach for modeling a stream of spatiotemporally distributed, sparse, high-dimensional categorical observations. The proposed approach efficiently learns global and local patterns in spatiotemporal data, allowing for fast inference and querying with a bounded time complexity. Using a high-resolution data series of plankton images classified with a neural network, we demonstrate the ability of the approach to make more accurate predictions compared to a Variational Gaussian Process (VGP), and to learn a predictive distribution of observations from streaming categorical data. S-GDRFs open the door to enabling efficient informative path planning over high-dimensional categorical observations, which until now has not been feasible.  ( 2 min )
    Outlier detection by ensembling uncertainty with negative objectness
    arXiv:2402.15374v1 Announce Type: cross Abstract: Outlier detection is an essential capability in safety-critical applications of supervised visual recognition. Most of the existing methods deliver best results by encouraging standard closed-set models to produce low-confidence predictions in negative training data. However, that approach conflates prediction uncertainty with recognition of the negative class. We therefore reconsider direct prediction of K+1 logits that correspond to K groundtruth classes and one outlier class. This setup allows us to formulate a novel anomaly score as an ensemble of in-distribution uncertainty and the posterior of the outlier class which we term negative objectness. Now outliers can be independently detected due to i) high prediction uncertainty or ii) similarity with negative data. We embed our method into a dense prediction architecture with mask-level recognition over K+2 classes. The training procedure encourages the novel K+2-th class to learn negative objectness at pasted negative instances. Our models outperform the current state-of-the art on standard benchmarks for image-wide and pixel-level outlier detection with and without training on real negative data.  ( 2 min )
    GS-EMA: Integrating Gradient Surgery Exponential Moving Average with Boundary-Aware Contrastive Learning for Enhanced Domain Generalization in Aneurysm Segmentation
    arXiv:2402.15239v1 Announce Type: cross Abstract: The automated segmentation of cerebral aneurysms is pivotal for accurate diagnosis and treatment planning. Confronted with significant domain shifts and class imbalance in 3D Rotational Angiography (3DRA) data from various medical institutions, the task becomes challenging. These shifts include differences in image appearance, intensity distribution, resolution, and aneurysm size, all of which complicate the segmentation process. To tackle these issues, we propose a novel domain generalization strategy that employs gradient surgery exponential moving average (GS-EMA) optimization technique coupled with boundary-aware contrastive learning (BACL). Our approach is distinct in its ability to adapt to new, unseen domains by learning domain-invariant features, thereby improving the robustness and accuracy of aneurysm segmentation across diverse clinical datasets. The results demonstrate that our proposed approach can extract more domain-invariant features, minimizing over-segmentation and capturing more complete aneurysm structures.  ( 2 min )
    Optimized Deployment of Deep Neural Networks for Visual Pose Estimation on Nano-drones
    arXiv:2402.15273v1 Announce Type: cross Abstract: Miniaturized autonomous unmanned aerial vehicles (UAVs) are gaining popularity due to their small size, enabling new tasks such as indoor navigation or people monitoring. Nonetheless, their size and simple electronics pose severe challenges in implementing advanced onboard intelligence. This work proposes a new automatic optimization pipeline for visual pose estimation tasks using Deep Neural Networks (DNNs). The pipeline leverages two different Neural Architecture Search (NAS) algorithms to pursue a vast complexity-driven exploration in the DNNs' architectural space. The obtained networks are then deployed on an off-the-shelf nano-drone equipped with a parallel ultra-low power System-on-Chip leveraging a set of novel software kernels for the efficient fused execution of critical DNN layer sequences. Our results improve the state-of-the-art reducing inference latency by up to 3.22x at iso-error.  ( 2 min )
    Turning Federated Learning Systems Into Covert Channels
    arXiv:2104.10561v3 Announce Type: replace-cross Abstract: Federated learning (FL) goes beyond traditional, centralized machine learning by distributing model training among a large collection of edge clients. These clients cooperatively train a global, e.g., cloud-hosted, model without disclosing their local, private training data. The global model is then shared among all the participants which use it for local predictions. In this paper, we put forward a novel attacker model aiming at turning FL systems into covert channels to implement a stealth communication infrastructure. The main intuition is that, during federated training, a malicious sender can poison the global model by submitting purposely crafted examples. Although the effect of the model poisoning is negligible to other participants, and does not alter the overall model performance, it can be observed by a malicious receiver and used to transmit a single bit.  ( 2 min )
    Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling
    arXiv:2402.15297v1 Announce Type: cross Abstract: This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled. We formulate the pixel-wise density value to regress as a probability distribution, instead of a single deterministic value. On this basis, we propose a semi-supervised crowd-counting model. Firstly, we design a pixel-wise distribution matching loss to measure the differences in the pixel-wise density distributions between the prediction and the ground truth; Secondly, we enhance the transformer decoder by using density tokens to specialize the forwards of decoders w.r.t. different density intervals; Thirdly, we design the interleaving consistency self-supervised learning mechanism to learn from unlabeled data efficiently. Extensive experiments on four datasets are performed to show that our method clearly outperforms the competitors by a large margin under various labeled ratio settings. Code will be released at https://github.com/LoraLinH/Semi-supervised-Counting-via-Pixel-by-pixel-Density-Distribution-Modelling.  ( 2 min )
    Unsupervised Domain Adaptation for Brain Vessel Segmentation through Transwarp Contrastive Learning
    arXiv:2402.15237v1 Announce Type: cross Abstract: Unsupervised domain adaptation (UDA) aims to align the labelled source distribution with the unlabelled target distribution to obtain domain-invariant predictive models. Since cross-modality medical data exhibit significant intra and inter-domain shifts and most are unlabelled, UDA is more important while challenging in medical image analysis. This paper proposes a simple yet potent contrastive learning framework for UDA to narrow the inter-domain gap between labelled source and unlabelled target distribution. Our method is validated on cerebral vessel datasets. Experimental results show that our approach can learn latent features from labelled 3DRA modality data and improve vessel segmentation performance in unlabelled MRA modality data.  ( 2 min )
    Neural Implicit Swept Volume Models for Fast Collision Detection
    arXiv:2402.15281v1 Announce Type: cross Abstract: Collision detection is one of the most time-consuming operations during motion planning. Thus, there is an increasing interest in exploring machine learning techniques to speed up collision detection and sampling-based motion planning. A recent line of research focuses on utilizing neural signed distance functions of either the robot geometry or the swept volume of the robot motion. Building on this, we present a novel neural implicit swept volume model that is the first to continuously represent arbitrary motions parameterized by their start and goal configurations. This allows to quickly compute signed distances for any point in the task space to the robot motion. Further, we present an algorithm combining the speed of the deep learning-based signed distance computations with the strong accuracy guarantees of geometric collision checkers. We validate our approach in simulated and real-world robotic experiments, and demonstrate that it is able to speed up a commercial bin picking application.  ( 2 min )
    TREC: APT Tactic / Technique Recognition via Few-Shot Provenance Subgraph Learning
    arXiv:2402.15147v1 Announce Type: cross Abstract: APT (Advanced Persistent Threat) with the characteristics of persistence, stealth, and diversity is one of the greatest threats against cyber-infrastructure. As a countermeasure, existing studies leverage provenance graphs to capture the complex relations between system entities in a host for effective APT detection. In addition to detecting single attack events as most existing work does, understanding the tactics / techniques (e.g., Kill-Chain, ATT&CK) applied to organize and accomplish the APT attack campaign is more important for security operations. Existing studies try to manually design a set of rules to map low-level system events to high-level APT tactics / techniques. However, the rule based methods are coarse-grained and lack generalization ability, thus they can only recognize APT tactics and cannot identify fine-grained APT techniques and mutant APT attacks. In this paper, we propose TREC, the first attempt to recognize APT tactics / techniques from provenance graphs by exploiting deep learning techniques. To address the "needle in a haystack" problem, TREC segments small and compact subgraphs covering individual APT technique instances from a large provenance graph based on a malicious node detection model and a subgraph sampling algorithm. To address the "training sample scarcity" problem, TREC trains the APT tactic / technique recognition model in a few-shot learning manner by adopting a Siamese neural network. We evaluate TREC based on a customized dataset collected and made public by our team. The experiment results show that TREC significantly outperforms state-of-the-art systems in APT tactic recognition and TREC can also effectively identify APT techniques.  ( 3 min )
    PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning
    arXiv:2402.15082v1 Announce Type: cross Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as an effective method for adapting pre-trained language models to various tasks efficiently. Recently, there has been a growing interest in transferring knowledge from one or multiple tasks to the downstream target task to achieve performance improvements. However, current approaches typically either train adapters on individual tasks or distill shared knowledge from source tasks, failing to fully exploit task-specific knowledge and the correlation between source and target tasks. To overcome these limitations, we propose PEMT, a novel parameter-efficient fine-tuning framework based on multi-task transfer learning. PEMT extends the mixture-of-experts (MoE) framework to capture the transferable knowledge as a weighted combination of adapters trained on source tasks. These weights are determined by a gated unit, measuring the correlation between the target and each source task using task description prompt vectors. To fully exploit the task-specific knowledge, we also propose the Task Sparsity Loss to improve the sparsity of the gated unit. We conduct experiments on a broad range of tasks over 17 datasets. The experimental results demonstrate our PEMT yields stable improvements over full fine-tuning, and state-of-the-art PEFT and knowledge transferring methods on various tasks. The results highlight the effectiveness of our method which is capable of sufficiently exploiting the knowledge and correlation features across multiple tasks.  ( 2 min )
    Out-of-Domain Intent Detection Considering Multi-Turn Dialogue Contexts
    arXiv:2305.03237v2 Announce Type: replace-cross Abstract: Out-of-Domain (OOD) intent detection is vital for practical dialogue systems, and it usually requires considering multi-turn dialogue contexts. However, most previous OOD intent detection approaches are limited to single dialogue turns. In this paper, we introduce a context-aware OOD intent detection (Caro) framework to model multi-turn contexts in OOD intent detection tasks. Specifically, we follow the information bottleneck principle to extract robust representations from multi-turn dialogue contexts. Two different views are constructed for each input sample and the superfluous information not related to intent detection is removed using a multi-view information bottleneck loss. Moreover, we also explore utilizing unlabeled data in Caro. A two-stage training process is introduced to mine OOD samples from these unlabeled data, and these OOD samples are used to train the resulting model with a bootstrapping approach. Comprehensive experiments demonstrate that Caro establishes state-of-the-art performances on multi-turn OOD detection tasks by improving the F1-OOD score of over $29\%$ compared to the previous best method.  ( 2 min )
    ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
    arXiv:2306.04695v2 Announce Type: replace-cross Abstract: The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have lead to high definition and realistic image quality generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts, and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome. The data, code, and interactive demo is available at: https://conceptbed.github.io/  ( 3 min )
    Unlocking the Power of Open Set : A New Perspective for Open-Set Noisy Label Learning
    arXiv:2305.04203v2 Announce Type: replace Abstract: Learning from noisy data has attracted much attention, where most methods focus on closed-set label noise. However, a more common scenario in the real world is the presence of both open-set and closed-set noise. Existing methods typically identify and handle these two types of label noise separately by designing a specific strategy for each type. However, in many real-world scenarios, it would be challenging to identify open-set examples, especially when the dataset has been severely corrupted. Unlike the previous works, we explore how models behave when faced with open-set examples, and find that \emph{a part of open-set examples gradually get integrated into certain known classes}, which is beneficial for the separation among known classes. Motivated by the phenomenon, we propose a novel two-step contrastive learning method CECL (Class Expansion Contrastive Learning) which aims to deal with both types of label noise by exploiting the useful information of open-set examples. Specifically, we incorporate some open-set examples into closed-set classes to enhance performance while treating others as delimiters to improve representative ability. Extensive experiments on synthetic and real-world datasets with diverse label noise demonstrate the effectiveness of CECL.  ( 3 min )
    Oversmoothing: A Nightmare for Graph Contrastive Learning?
    arXiv:2306.02117v2 Announce Type: replace Abstract: Oversmoothing is a common phenomenon observed in graph neural networks (GNNs), in which an increase in the network depth leads to a deterioration in their performance. Graph contrastive learning (GCL) is emerging as a promising way of leveraging vast unlabeled graph data. As a marriage between GNNs and contrastive learning, it remains unclear whether GCL inherits the same oversmoothing defect from GNNs. This work undertakes a fundamental analysis of GCL from the perspective of oversmoothing on the first hand. We demonstrate empirically that increasing network depth in GCL also leads to oversmoothing in their deep representations, and surprisingly, the shallow ones. We refer to this phenomenon in GCL as `long-range starvation', wherein lower layers in deep networks suffer from degradation due to the lack of sufficient guidance from supervision. Based on our findings, we present BlockGCL, a remarkably simple yet effective blockwise training framework to prevent GCL from notorious oversmoothing. Without bells and whistles, BlockGCL consistently improves robustness and stability for well-established GCL methods with increasing numbers of layers on several real-world graph benchmarks.  ( 2 min )
    Size Lowerbounds for Deep Operator Networks
    arXiv:2308.06338v3 Announce Type: replace Abstract: Deep Operator Networks are an increasingly popular paradigm for solving regression in infinite dimensions and hence solve families of PDEs in one shot. In this work, we aim to establish a first-of-its-kind data-dependent lowerbound on the size of DeepONets required for them to be able to reduce empirical error on noisy data. In particular, we show that for low training errors to be obtained on $n$ data points it is necessary that the common output dimension of the branch and the trunk net be scaling as $\Omega \left ( \sqrt[\leftroot{-1}\uproot{-1}4]{n} \right )$. This inspires our experiments with DeepONets solving the advection-diffusion-reaction PDE, where we demonstrate the possibility that at a fixed model size, to leverage increase in this common output dimension and get monotonic lowering of training error, the size of the training data might necessarily need to scale at least quadratically with it.  ( 2 min )
    Robust Implicit Regularization via Weight Normalization
    arXiv:2305.05448v3 Announce Type: replace Abstract: Overparameterized models may have many interpolating solutions; implicit regularization refers to the hidden preference of a particular optimization method towards a certain interpolating solution among the many. A by now established line of work has shown that (stochastic) gradient descent tends to have an implicit bias towards low rank and/or sparse solutions when used to train deep linear networks, explaining to some extent why overparameterized neural network models trained by gradient descent tend to have good generalization performance in practice.However, existing theory for square-loss objectives often requires very small initialization of the trainable weights, which is at odds with the larger scale at which weights are initialized in practice for faster convergence and better generalization performance. In this paper, we aim to close this gap by incorporating and analyzing gradient flow (continuous-time version of gradient descent) with weight normalization, where the weight vector is reparameterized in terms of polar coordinates, and gradient flow is applied to the polar coordinates. By analyzing key invariants of the gradient flow and using Lojasiewicz Theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that in contrast to plain gradient flow, weight normalization enables a robust bias that persists even if the weights are initialized at practically large scale. Experiments suggest that the gains in both convergence speed and robustness of the implicit bias are improved dramatically by using weight normalization in overparameterized diagonal linear network models.  ( 3 min )
    Hyperbolic Hierarchical Knowledge Graph Embeddings for Link Prediction in Low Dimensions
    arXiv:2204.13704v2 Announce Type: replace Abstract: Knowledge graph embeddings (KGE) have been validated as powerful methods for inferring missing links in knowledge graphs (KGs) that they typically map entities into Euclidean space and treat relations as transformations of entities. Recently, some Euclidean KGE methods have been enhanced to model semantic hierarchies commonly found in KGs, improving the performance of link prediction. To embed hierarchical data, hyperbolic space has emerged as a promising alternative to traditional Euclidean space, offering high fidelity and lower memory consumption. Unlike Euclidean, hyperbolic space provides countless curvatures to choose from. However, it is difficult for existing hyperbolic KGE methods to obtain the optimal curvature settings manually, thereby limiting their ability to effectively model semantic hierarchies. To address this limitation, we propose a novel KGE model called $\textbf{Hyp}$erbolic $\textbf{H}$ierarchical $\textbf{KGE}$ (HypHKGE). This model introduces attention-based learnable curvatures for hyperbolic space, which helps preserve rich semantic hierarchies. Furthermore, to utilize the preserved hierarchies for inferring missing links, we define hyperbolic hierarchical transformations based on the theory of hyperbolic geometry, including both inter-level and intra-level modeling. Experiments demonstrate the effectiveness of the proposed HypHKGE model on the three benchmark datasets (WN18RR, FB15K-237, and YAGO3-10). The source code will be publicly released at https://github.com/wjzheng96/HypHKGE.  ( 3 min )
    Centaur: Federated Learning for Constrained Edge Devices
    arXiv:2211.04175v3 Announce Type: replace Abstract: Federated learning (FL) facilitates new applications at the edge, especially for wearable and Internet-of-Thing devices. Such devices capture a large and diverse amount of data, but they have memory, compute, power, and connectivity constraints which hinder their participation in FL. We propose Centaur, a multitier FL framework, enabling ultra-constrained devices to efficiently participate in FL on large neural nets. Centaur combines two major ideas: (i) a data selection scheme to choose a portion of samples that accelerates the learning, and (ii) a partition-based training algorithm that integrates both constrained and powerful devices owned by the same user. Evaluations, on four benchmark neural nets and three datasets, show that Centaur gains ~10\% higher accuracy than local training on constrained devices with ~58\% energy saving on average. Our experimental results also demonstrate the superior efficiency of Centaur when dealing with imbalanced data, client participation heterogeneity, and various network connection probabilities.  ( 2 min )
    Efficient Data-Driven Optimization with Noisy Data
    arXiv:2102.04363v4 Announce Type: replace-cross Abstract: Classical Kullback-Leibler or entropic distances are known to enjoy certain desirable statistical properties in the context of decision-making with noiseless data. However, in most practical situations the data available to a decision maker is subject to a certain amount of measurement noise. We hence study here data-driven prescription problems in which the data is corrupted by a known noise source. We derive efficient data-driven formulations in this noisy regime and indicate that they enjoy an entropic optimal transport interpretation. Finally, we show that these efficient robust formulations are tractable in several interesting settings by exploiting a classical representation result by Strassen.  ( 2 min )
    Let's Rectify Step by Step: Improving Aspect-based Sentiment Analysis with Diffusion Models
    arXiv:2402.15289v1 Announce Type: cross Abstract: Aspect-Based Sentiment Analysis (ABSA) stands as a crucial task in predicting the sentiment polarity associated with identified aspects within text. However, a notable challenge in ABSA lies in precisely determining the aspects' boundaries (start and end indices), especially for long ones, due to users' colloquial expressions. We propose DiffusionABSA, a novel diffusion model tailored for ABSA, which extracts the aspects progressively step by step. Particularly, DiffusionABSA gradually adds noise to the aspect terms in the training process, subsequently learning a denoising process that progressively restores these terms in a reverse manner. To estimate the boundaries, we design a denoising neural network enhanced by a syntax-aware temporal attention mechanism to chronologically capture the interplay between aspects and surrounding text. Empirical evaluations conducted on eight benchmark datasets underscore the compelling advantages offered by DiffusionABSA when compared against robust baseline models. Our code is publicly available at https://github.com/Qlb6x/DiffusionABSA.  ( 2 min )
    Enhancing Worker Recruitment in Collaborative Mobile Crowdsourcing: A Graph Neural Network Trust Evaluation Approach
    arXiv:2306.04366v2 Announce Type: replace-cross Abstract: Collaborative Mobile Crowdsourcing (CMCS) allows platforms to recruit worker teams to collaboratively execute complex sensing tasks. The efficiency of such collaborations could be influenced by trust relationships among workers. To obtain the asymmetric trust values among all workers in the social network, the Trust Reinforcement Evaluation Framework (TREF) based on Graph Convolutional Neural Networks (GCNs) is proposed in this paper. The task completion effect is comprehensively calculated by considering the workers' ability benefits, distance benefits, and trust benefits in this paper. The worker recruitment problem is modeled as an Undirected Complete Recruitment Graph (UCRG), for which a specific Tabu Search Recruitment (TSR) algorithm solution is proposed. An optimal execution team is recruited for each task by the TSR algorithm, and the collaboration team for the task is obtained under the constraint of privacy loss. To enhance the efficiency of the recruitment algorithm on a large scale and scope, the Mini-Batch K-Means clustering algorithm and edge computing technology are introduced, enabling distributed worker recruitment. Lastly, extensive experiments conducted on five real datasets validate that the recruitment algorithm proposed in this paper outperforms other baselines. Additionally, TREF proposed herein surpasses the performance of state-of-the-art trust evaluation methods in the literature.  ( 3 min )
    Chu-ko-nu: A Reliable, Efficient, and Anonymously Authentication-Enabled Realization for Multi-Round Secure Aggregation in Federated Learning
    arXiv:2402.15111v1 Announce Type: cross Abstract: Secure aggregation enables federated learning (FL) to perform collaborative training of clients from local gradient updates without exposing raw data. However, existing secure aggregation schemes inevitably perform an expensive fresh setup per round because each client needs to establish fresh input-independent secrets over different rounds. The latest research, Flamingo (S&P 2023), designed a share-transfer-based reusable secret key to support the server continuously performing multiple rounds of aggregation. Nevertheless, the share transfer mechanism it proposed can only be achieved with P probability, which has limited reliability. To tackle the aforementioned problems, we propose a more reliable and anonymously authenticated scheme called Chu-ko-nu for multi-round secure aggregation. Specifically, in terms of share transfer, Chu-ko-nu breaks the probability P barrier by supplementing a redistribution process of secret key components (the sum of all components is the secret key), thus ensuring the reusability of the secret key. Based on this reusable secret key, Chu-ko-nu can efficiently perform consecutive aggregation in the following rounds. Furthermore, considering the client identity authentication and privacy protection issue most approaches ignore, Chu-ko-nu introduces a zero-knowledge proof-based authentication mechanism. It can support clients anonymously participating in FL training and enables the server to authenticate clients effectively in the presence of various attacks. Rigorous security proofs and extensive experiments demonstrated that Chu-ko-nu can provide reliable and anonymously authenticated aggregation for FL with low aggregation costs, at least a 21.02% reduction compared to the state-of-the-art schemes.  ( 3 min )
    Federated Learning with Extremely Noisy Clients via Negative Distillation
    arXiv:2312.12703v2 Announce Type: replace Abstract: Federated learning (FL) has shown remarkable success in cooperatively training deep models, while typically struggling with noisy labels. Advanced works propose to tackle label noise by a re-weighting strategy with a strong assumption, i.e., mild label noise. However, it may be violated in many real-world FL scenarios because of highly contaminated clients, resulting in extreme noise ratios, e.g., $>$90%. To tackle extremely noisy clients, we study the robustness of the re-weighting strategy, showing a pessimistic conclusion: minimizing the weight of clients trained over noisy data outperforms re-weighting strategies. To leverage models trained on noisy clients, we propose a novel approach, called negative distillation (FedNed). FedNed first identifies noisy clients and employs rather than discards the noisy clients in a knowledge distillation manner. In particular, clients identified as noisy ones are required to train models using noisy labels and pseudo-labels obtained by global models. The model trained on noisy labels serves as a `bad teacher' in knowledge distillation, aiming to decrease the risk of providing incorrect information. Meanwhile, the model trained on pseudo-labels is involved in model aggregation if not identified as a noisy client. Consequently, through pseudo-labeling, FedNed gradually increases the trustworthiness of models trained on noisy clients, while leveraging all clients for model aggregation through negative distillation. To verify the efficacy of FedNed, we conduct extensive experiments under various settings, demonstrating that FedNed can consistently outperform baselines and achieve state-of-the-art performance. Our code is available at https://github.com/linChen99/FedNed.  ( 3 min )
    Archetypal Analysis++: Rethinking the Initialization Strategy
    arXiv:2301.13748v3 Announce Type: replace Abstract: Archetypal analysis is a matrix factorization method with convexity constraints. Due to local minima, a good initialization is essential, but frequently used initialization methods yield either sub-optimal starting points or are prone to get stuck in poor local minima. In this paper, we propose archetypal analysis++ (AA++), a probabilistic initialization strategy for archetypal analysis that sequentially samples points based on their influence on the objective, similar to $k$-means++. In fact, we argue that $k$-means++ already approximates the proposed initialization method. Furthermore, we suggest to adapt an efficient Monte Carlo approximation of $k$-means++ to AA++. In an extensive empirical evaluation of 14 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost always outperforms all baselines, including the most frequently used ones.  ( 2 min )
    Joint Problems in Learning Multiple Dynamical Systems
    arXiv:2311.02181v2 Announce Type: replace-cross Abstract: Clustering of time series is a well-studied problem, with applications ranging from quantitative, personalized models of metabolism obtained from metabolite concentrations to state discrimination in quantum information theory. We consider a variant, where given a set of trajectories and a number of parts, we jointly partition the set of trajectories and learn linear dynamical system (LDS) models for each part, so as to minimize the maximum error across all the models. We present globally convergent methods and EM heuristics, accompanied by promising computational results.  ( 2 min )
    Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
    arXiv:2311.04254v3 Announce Type: replace-cross Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized decision-making by breaking down complex problems into more manageable language sequences referred to as "thoughts". An effective thought design should consider three key perspectives: performance, efficiency, and flexibility. However, existing thought can at most exhibit two of these attributes. To address these limitations, we introduce a novel thought prompting approach called "Everything of Thoughts" (XoT) to defy the law of "Penrose triangle of existing thought paradigms. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, thereby enhancing LLMs' capabilities and enabling them to generalize to unseen problems efficiently. Through the utilization of the MCTS-LLM collaborative thought revision framework, this approach autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to engage in unconstrained thinking, allowing for flexible cognitive mappings for problems with multiple solutions. We evaluate XoT on several challenging multi-solution problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube. Our results demonstrate that XoT significantly outperforms existing approaches. Notably, XoT can yield multiple solutions with just one LLM call, showcasing its remarkable proficiency in addressing complex problems across diverse domains.  ( 3 min )
    Training Language Models with Language Feedback at Scale
    arXiv:2303.16755v3 Announce Type: replace-cross Abstract: Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.  ( 3 min )
    CMOS + stochastic nanomagnets: heterogeneous computers for probabilistic inference and learning
    arXiv:2304.05949v3 Announce Type: replace-cross Abstract: Extending Moore's law by augmenting complementary-metal-oxide semiconductor (CMOS) transistors with emerging nanotechnologies (X) has become increasingly important. One important class of problems involve sampling-based Monte Carlo algorithms used in probabilistic machine learning, optimization, and quantum simulation. Here, we combine stochastic magnetic tunnel junction (sMTJ)-based probabilistic bits (p-bits) with Field Programmable Gate Arrays (FPGA) to create an energy-efficient CMOS + X (X = sMTJ) prototype. This setup shows how asynchronously driven CMOS circuits controlled by sMTJs can perform probabilistic inference and learning by leveraging the algorithmic update-order-invariance of Gibbs sampling. We show how the stochasticity of sMTJs can augment low-quality random number generators (RNG). Detailed transistor-level comparisons reveal that sMTJ-based p-bits can replace up to 10,000 CMOS transistors while dissipating two orders of magnitude less energy. Integrated versions of our approach can advance probabilistic computing involving deep Boltzmann machines and other energy-based learning algorithms with extremely high throughput and energy efficiency.  ( 2 min )
    Revisiting the Role of Label Smoothing in Enhanced Text Sentiment Classification
    arXiv:2312.06522v2 Announce Type: replace-cross Abstract: Label smoothing is a widely used technique in various domains, such as text classification, image classification and speech recognition, known for effectively combating model overfitting. However, there is little fine-grained analysis on how label smoothing enhances text sentiment classification. To fill in the gap, this article performs a set of in-depth analyses on eight datasets for text sentiment classification and three deep learning architectures: TextCNN, BERT, and RoBERTa, under two learning schemes: training from scratch and fine-tuning. By tuning the smoothing parameters, we can achieve improved performance on almost all datasets for each model architecture. We further investigate the benefits of label smoothing, finding that label smoothing can accelerate the convergence of deep models and make samples of different labels easily distinguishable.  ( 2 min )
    UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting
    arXiv:2310.09751v3 Announce Type: replace Abstract: Multivariate time series forecasting plays a pivotal role in contemporary web technologies. In contrast to conventional methods that involve creating dedicated models for specific time series application domains, this research advocates for a unified model paradigm that transcends domain boundaries. However, learning an effective cross-domain model presents the following challenges. First, various domains exhibit disparities in data characteristics, e.g., the number of variables, posing hurdles for existing models that impose inflexible constraints on these factors. Second, the model may encounter difficulties in distinguishing data from various domains, leading to suboptimal performance in our assessments. Third, the diverse convergence rates of time series domains can also result in compromised empirical performance. To address these issues, we propose UniTime for effective cross-domain time series learning. Concretely, UniTime can flexibly adapt to data with varying characteristics. It also uses domain instructions and a Language-TS Transformer to offer identification information and align two modalities. In addition, UniTime employs masking to alleviate domain convergence speed imbalance issues. Our extensive experiments demonstrate the effectiveness of UniTime in advancing state-of-the-art forecasting performance and zero-shot transferability.  ( 2 min )
    Quantum Convolutional Neural Networks with Interaction Layers for Classification of Classical Data
    arXiv:2307.11792v3 Announce Type: replace-cross Abstract: Quantum Machine Learning (QML) has come into the limelight due to the exceptional computational abilities of quantum computers. With the promises of near error-free quantum computers in the not-so-distant future, it is important that the effect of multi-qubit interactions on quantum neural networks is studied extensively. This paper introduces a Quantum Convolutional Network with novel Interaction layers exploiting three-qubit interactions, while studying the network's expressibility and entangling capability, for classifying both image and one-dimensional data. The proposed approach is tested on three publicly available datasets namely MNIST, Fashion MNIST, and Iris datasets, flexible in performing binary and multiclass classifications, and is found to supersede the performance of existing state-of-the-art methods.  ( 2 min )
    CPT: Competence-progressive Training Strategy for Few-shot Node Classification
    arXiv:2402.00450v2 Announce Type: replace Abstract: Graph Neural Networks (GNNs) have made significant advancements in node classification, but their success relies on sufficient labeled nodes per class in the training data. Real-world graph data often exhibits a long-tail distribution with sparse labels, emphasizing the importance of GNNs' ability in few-shot node classification, which entails categorizing nodes with limited data. Traditional episodic meta-learning approaches have shown promise in this domain, but they face an inherent limitation: it might lead the model to converge to suboptimal solutions because of random and uniform task assignment, ignoring task difficulty levels. This could lead the meta-learner to face complex tasks too soon, hindering proper learning. Ideally, the meta-learner should start with simple concepts and advance to more complex ones, like human learning. So, we introduce CPT, a novel two-stage curriculum learning method that aligns task difficulty with the meta-learner's progressive competence, enhancing overall performance. Specifically, in CPT's initial stage, the focus is on simpler tasks, fostering foundational skills for engaging with complex tasks later. Importantly, the second stage dynamically adjusts task difficulty based on the meta-learner's growing competence, aiming for optimal knowledge acquisition. Extensive experiments on popular node classification datasets demonstrate significant improvements of our strategy over existing methods.  ( 2 min )
    Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation
    arXiv:2305.09651v2 Announce Type: replace-cross Abstract: It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. In order to enhance the guidance of the teacher training process, we introduce the concept of distillation influence to determine the impact of distillation from each training sample on the student's generalization ability. In this paper, we propose Learning Good Teacher Matters (LGTM), an efficient training technique for incorporating distillation influence into the teacher's learning process. By prioritizing samples that are likely to enhance the student's generalization ability, our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.  ( 2 min )
    Gotcha! Don't trick me with unanswerable questions! Self-aligning Large Language Models for Responding to Unknown Questions
    arXiv:2402.15062v1 Announce Type: cross Abstract: Despite the remarkable abilities of Large Language Models (LLMs) to answer questions, they often display a considerable level of overconfidence even when the question does not have a definitive answer. To avoid providing hallucinated answers to these unknown questions, existing studies typically investigate approaches to refusing to answer these questions. In this work, we propose a novel and scalable self-alignment method to utilize the LLM itself to enhance its response-ability to different types of unknown questions, being capable of not only refusing to answer but also providing explanation to the unanswerability of unknown questions. Specifically, the Self-Align method first employ a two-stage class-aware self-augmentation approach to generate a large amount of unknown question-response data. Then we conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself for aligning the responses to unknown questions as desired. Experimental results on two datasets across four types of unknown questions validate the superiority of the Self-Align method over existing baselines in terms of three types of task formulation.  ( 2 min )
    Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
    arXiv:2304.08177v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model's ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. Chinese LLaMA series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca} and Chinese Llama-2 series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca-2}  ( 3 min )
    GNNInterpreter: A Probabilistic Generative Model-Level Explanation for Graph Neural Networks
    arXiv:2209.07924v4 Announce Type: replace Abstract: Recently, Graph Neural Networks (GNNs) have significantly advanced the performance of machine learning tasks on graphs. However, this technological breakthrough makes people wonder: how does a GNN make such decisions, and can we trust its prediction with high confidence? When it comes to some critical fields, such as biomedicine, where making wrong decisions can have severe consequences, it is crucial to interpret the inner working mechanisms of GNNs before applying them. In this paper, we propose a model-agnostic model-level explanation method for different GNNs that follow the message passing scheme, GNNInterpreter, to explain the high-level decision-making process of the GNN model. More specifically, GNNInterpreter learns a probabilistic generative graph distribution that produces the most discriminative graph pattern the GNN tries to detect when making a certain prediction by optimizing a novel objective function specifically designed for the model-level explanation for GNNs. Compared to existing works, GNNInterpreter is more flexible and computationally efficient in generating explanation graphs with different types of node and edge features, without introducing another blackbox or requiring manually specified domain-specific rules. In addition, the experimental studies conducted on four different datasets demonstrate that the explanation graphs generated by GNNInterpreter match the desired graph pattern if the model is ideal; otherwise, potential model pitfalls can be revealed by the explanation. The official implementation can be found at https://github.com/yolandalalala/GNNInterpreter.  ( 3 min )
    CFDBench: A Large-Scale Benchmark for Machine Learning Methods in Fluid Dynamics
    arXiv:2310.05963v2 Announce Type: replace Abstract: In recent years, applying deep learning to solve physics problems has attracted much attention. Data-driven deep learning methods produce fast numerical operators that can learn approximate solutions to the whole system of partial differential equations (i.e., surrogate modeling). Although these neural networks may have lower accuracy than traditional numerical methods, they, once trained, are orders of magnitude faster at inference. Hence, one crucial feature is that these operators can generalize to unseen PDE parameters without expensive re-training.In this paper, we construct CFDBench, a benchmark tailored for evaluating the generalization ability of neural operators after training in computational fluid dynamics (CFD) problems. It features four classic CFD problems: lid-driven cavity flow, laminar boundary layer flow in circular tubes, dam flows through the steps, and periodic Karman vortex street. The data contains a total of 302K frames of velocity and pressure fields, involving 739 cases with different operating condition parameters, generated with numerical methods. We evaluate the effectiveness of popular neural operators including feed-forward networks, DeepONet, FNO, U-Net, etc. on CFDBnech by predicting flows with non-periodic boundary conditions, fluid properties, and flow domain shapes that are not seen during training. Appropriate modifications were made to apply popular deep neural networks to CFDBench and enable the accommodation of more changing inputs. Empirical results on CFDBench show many baseline models have errors as high as 300% in some problems, and severe error accumulation when performing autoregressive inference. CFDBench facilitates a more comprehensive comparison between different neural operators for CFD compared to existing benchmarks.  ( 3 min )
    ArabianGPT: Native Arabic GPT-based Large Language
    arXiv:2402.15313v1 Announce Type: cross Abstract: The predominance of English and Latin-based large language models (LLMs) has led to a notable deficit in native Arabic LLMs. This discrepancy is accentuated by the prevalent inclusion of English tokens in existing Arabic models, detracting from their efficacy in processing native Arabic's intricate morphology and syntax. Consequently, there is a theoretical and practical imperative for developing LLMs predominantly focused on Arabic linguistic elements. To address this gap, this paper proposes ArabianGPT, a series of transformer-based models within the ArabianLLM suite designed explicitly for Arabic. These models, including ArabianGPT-0.1B and ArabianGPT-0.3B, vary in size and complexity, aligning with the nuanced linguistic characteristics of Arabic. The AraNizer tokenizer, integral to these models, addresses the unique morphological aspects of Arabic script, ensuring more accurate text processing. Empirical results from fine-tuning the models on tasks like sentiment analysis and summarization demonstrate significant improvements. For sentiment analysis, the fine-tuned ArabianGPT-0.1B model achieved a remarkable accuracy of 95%, a substantial increase from the base model's 56%. Similarly, in summarization tasks, fine-tuned models showed enhanced F1 scores, indicating improved precision and recall in generating concise summaries. Comparative analysis of fine-tuned ArabianGPT models against their base versions across various benchmarks reveals nuanced differences in performance, with fine-tuning positively impacting specific tasks like question answering and summarization. These findings underscore the efficacy of fine-tuning in aligning ArabianGPT models more closely with specific NLP tasks, highlighting the potential of tailored transformer architectures in advancing Arabic NLP.  ( 2 min )
    A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
    arXiv:2402.15422v1 Announce Type: cross Abstract: Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we develop a rigorous labeling protocol for hallucinations, and have two medical experts annotate 100 real-world summaries and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. Although the effect is still present, it is much smaller for GPT-4 when prompted with five examples (0.70 to 0.40). We also conduct a qualitative evaluation using hallucination-free and improved training data. GPT-4 shows very good results even in the zero-shot setting. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which yields promising results.  ( 2 min )
    Ranking Entities along Conceptual Space Dimensions with LLMs: An Analysis of Fine-Tuning Strategies
    arXiv:2402.15337v1 Announce Type: cross Abstract: Conceptual spaces represent entities in terms of their primitive semantic features. Such representations are highly valuable but they are notoriously difficult to learn, especially when it comes to modelling perceptual and subjective features. Distilling conceptual spaces from Large Language Models (LLMs) has recently emerged as a promising strategy. However, existing work has been limited to probing pre-trained LLMs using relatively simple zero-shot strategies. We focus in particular on the task of ranking entities according to a given conceptual space dimension. Unfortunately, we cannot directly fine-tune LLMs on this task, because ground truth rankings for conceptual space dimensions are rare. We therefore use more readily available features as training data and analyse whether the ranking capabilities of the resulting models transfer to perceptual and subjective features. We find that this is indeed the case, to some extent, but having perceptual and subjective features in the training data seems essential for achieving the best results. We furthermore find that pointwise ranking strategies are competitive against pairwise approaches, in defiance of common wisdom.  ( 2 min )
    Generative Modelling with Tensor Train approximations of Hamilton--Jacobi--Bellman equations
    arXiv:2402.15285v1 Announce Type: cross Abstract: Sampling from probability densities is a common challenge in fields such as Uncertainty Quantification (UQ) and Generative Modelling (GM). In GM in particular, the use of reverse-time diffusion processes depending on the log-densities of Ornstein-Uhlenbeck forward processes are a popular sampling tool. In Berner et al. [2022] the authors point out that these log-densities can be obtained by solution of a \textit{Hamilton-Jacobi-Bellman} (HJB) equation known from stochastic optimal control. While this HJB equation is usually treated with indirect methods such as policy iteration and unsupervised training of black-box architectures like Neural Networks, we propose instead to solve the HJB equation by direct time integration, using compressed polynomials represented in the Tensor Train (TT) format for spatial discretization. Crucially, this method is sample-free, agnostic to normalization constants and can avoid the curse of dimensionality due to the TT compression. We provide a complete derivation of the HJB equation's action on Tensor Train polynomials and demonstrate the performance of the proposed time-step-, rank- and degree-adaptive integration method on a nonlinear sampling task in 20 dimensions.  ( 2 min )
    Learning-Augmented Online Packet Scheduling with Deadlines
    arXiv:2305.07164v2 Announce Type: replace-cross Abstract: The modern network aims to prioritize critical traffic over non-critical traffic and effectively manage traffic flow. This necessitates proper buffer management to prevent the loss of crucial traffic while minimizing the impact on non-critical traffic. Therefore, the algorithm's objective is to control which packets to transmit and which to discard at each step. In this study, we initiate the learning-augmented online packet scheduling with deadlines and provide a novel algorithmic framework to cope with the prediction. We show that when the prediction error is small, our algorithm improves the competitive ratio while still maintaining a bounded competitive ratio, regardless of the prediction error.  ( 2 min )
    Efficient semi-supervised inference for logistic regression under case-control studies
    arXiv:2402.15365v1 Announce Type: cross Abstract: Semi-supervised learning has received increasingly attention in statistics and machine learning. In semi-supervised learning settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are collected. We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary and the labeled data is collected by case-control sampling. Case-control sampling is an effective sampling scheme for alleviating imbalance structure in binary data. Under the logistic model assumption, case-control data can still provide consistent estimator for the slope parameter of the regression model. However, the intercept parameter is not identifiable. Consequently, the marginal case proportion cannot be estimated from case-control data. We find out that with the availability of the unlabeled data, the intercept parameter can be identified in semi-supervised learning setting. We construct the likelihood function of the observed labeled and unlabeled data and obtain the maximum likelihood estimator via an iterative algorithm. The proposed estimator is shown to be consistent, asymptotically normal, and semiparametrically efficient. Extensive simulation studies are conducted to show the finite sample performance of the proposed method. The results imply that the unlabeled data not only helps to identify the intercept but also improves the estimation efficiency of the slope parameter. Meanwhile, the marginal case proportion can be estimated accurately by the proposed method.  ( 2 min )
    FedDebug: Systematic Debugging for Federated Learning Applications
    arXiv:2301.03553v2 Announce Type: replace-cross Abstract: In Federated Learning (FL), clients independently train local models and share them with a central aggregator to build a global model. Impermissibility to access clients' data and collaborative training make FL appealing for applications with data-privacy concerns, such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, identifying the responsible rounds and clients is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the global model's accuracy or let future FL rounds retune the model, which are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances the FL debugging on two novel fronts. First, FedDebug enables interactive debugging of realtime collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug's breakpoint can help inspect an FL state (round, client, and global model) and move between rounds and clients' models seamlessly, enabling a fine-grained step-by-step inspection. Second, FedDebug automatically identifies the client(s) responsible for lowering the global model's performance without any testing data and labels--both are essential for existing debugging techniques. FedDebug's strengths come from adapting differential testing in conjunction with neuron activations to determine the client(s) deviating from normal behavior. FedDebug achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FedDebug's interactive debugging incurs 1.2% overhead during training, while it localizes a faulty client in only 2.1% of a round's training time.  ( 3 min )
    Alquist 5.0: Dialogue Trees Meet Generative Models. A Novel Approach for Enhancing SocialBot Conversations
    arXiv:2310.16119v2 Announce Type: replace Abstract: We present our SocialBot -- Alquist~5.0 -- developed for the Alexa Prize SocialBot Grand Challenge~5. Building upon previous versions of our system, we introduce the NRG Barista and outline several innovative approaches for integrating Barista into our SocialBot, improving the overall conversational experience. Additionally, we extend our SocialBot to support multimodal devices. This paper offers insights into the development of Alquist~5.0, which meets evolving user expectations while maintaining empathetic and knowledgeable conversational abilities across diverse topics.  ( 2 min )
    Learning Mean Field Games on Sparse Graphs: A Hybrid Graphex Approach
    arXiv:2401.12686v2 Announce Type: replace-cross Abstract: Learning the behavior of large agent populations is an important task for numerous research areas. Although the field of multi-agent reinforcement learning (MARL) has made significant progress towards solving these systems, solutions for many agents often remain computationally infeasible and lack theoretical guarantees. Mean Field Games (MFGs) address both of these issues and can be extended to Graphon MFGs (GMFGs) to include network structures between agents. Despite their merits, the real world applicability of GMFGs is limited by the fact that graphons only capture dense graphs. Since most empirically observed networks show some degree of sparsity, such as power law graphs, the GMFG framework is insufficient for capturing these network topologies. Thus, we introduce the novel concept of Graphex MFGs (GXMFGs) which builds on the graph theoretical concept of graphexes. Graphexes are the limiting objects to sparse graph sequences that also have other desirable features such as the small world property. Learning equilibria in these games is challenging due to the rich and sparse structure of the underlying graphs. To tackle these challenges, we design a new learning algorithm tailored to the GXMFG setup. This hybrid graphex learning approach leverages that the system mainly consists of a highly connected core and a sparse periphery. After defining the system and providing a theoretical analysis, we state our learning approach and demonstrate its learning capabilities on both synthetic graphs and real-world networks. This comparison shows that our GXMFG learning algorithm successfully extends MFGs to a highly relevant class of hard, realistic learning problems that are not accurately addressed by current MARL and MFG methods.  ( 3 min )
    EdgeServe: A Streaming System for Decentralized Model Serving
    arXiv:2303.08028v3 Announce Type: replace-cross Abstract: The relevant features for a machine learning task may arrive as one or more continuous streams of data. Serving machine learning models over streams of data creates a number of interesting systems challenges in managing data routing, time-synchronization, and rate control. This paper presents EdgeServe, a distributed streaming system that can serve predictions from machine learning models in real time. We evaluate EdgeServe on three streaming prediction tasks: (1) human activity recognition, (2) autonomous driving, and (3) network intrusion detection.  ( 2 min )
    Offline Inverse RL: New Solution Concepts and Provably Efficient Algorithms
    arXiv:2402.15392v1 Announce Type: new Abstract: Inverse reinforcement learning (IRL) aims to recover the reward function of an expert agent from demonstrations of behavior. It is well known that the IRL problem is fundamentally ill-posed, i.e., many reward functions can explain the demonstrations. For this reason, IRL has been recently reframed in terms of estimating the feasible reward set, thus, postponing the selection of a single reward. However, so far, the available formulations and algorithmic solutions have been proposed and analyzed mainly for the online setting, where the learner can interact with the environment and query the expert at will. This is clearly unrealistic in most practical applications, where the availability of an offline dataset is a much more common scenario. In this paper, we introduce a novel notion of feasible reward set capturing the opportunities and limitations of the offline setting and we analyze the complexity of its estimation. This requires the introduction an original learning framework that copes with the intrinsic difficulty of the setting, for which the data coverage is not under control. Then, we propose two computationally and statistically efficient algorithms, IRLO and PIRLO, for addressing the problem. In particular, the latter adopts a specific form of pessimism to enforce the novel desirable property of inclusion monotonicity of the delivered feasible set. With this work, we aim to provide a panorama of the challenges of the offline IRL problem and how they can be fruitfully addressed.  ( 2 min )
    Descripci\'on autom\'atica de secciones delgadas de rocas: una aplicaci\'on Web
    arXiv:2402.15039v1 Announce Type: cross Abstract: The identification and characterization of various rock types is one of the fundamental activities for geology and related areas such as mining, petroleum, environment, industry and construction. Traditionally, a human specialist is responsible for analyzing and explaining details about the type, composition, texture, shape and other properties using rock samples collected in-situ or prepared in a laboratory. The results become subjective based on experience, in addition to consuming a large investment of time and effort. The present proposal uses artificial intelligence techniques combining computer vision and natural language processing to generate a textual and verbal description from a thin section image of rock. We build a dataset of images and their respective textual descriptions for the training of a model that associates the relevant features of the image extracted by EfficientNetB7 with the textual description generated by a Transformer network, reaching an accuracy value of 0.892 and a BLEU value of 0.71. This model can be a useful resource for research, professional and academic work, so it has been deployed through a Web application for public use.  ( 2 min )
    opp/ai: Optimistic Privacy-Preserving AI on Blockchain
    arXiv:2402.15006v1 Announce Type: cross Abstract: The convergence of Artificial Intelligence (AI) and blockchain technology is reshaping the digital world, offering decentralized, secure, and efficient AI services on blockchain platforms. Despite the promise, the high computational demands of AI on blockchain raise significant privacy and efficiency concerns. The Optimistic Privacy-Preserving AI (opp/ai) framework is introduced as a pioneering solution to these issues, striking a balance between privacy protection and computational efficiency. The framework integrates Zero-Knowledge Machine Learning (zkML) for privacy with Optimistic Machine Learning (opML) for efficiency, creating a hybrid model tailored for blockchain AI services. This study presents the opp/ai framework, delves into the privacy features of zkML, and assesses the framework's performance and adaptability across different scenarios.  ( 2 min )
    Practice Makes Perfect: Planning to Learn Skill Parameter Policies
    arXiv:2402.15025v1 Announce Type: cross Abstract: One promising approach towards effective robot decision making in complex, long-horizon tasks is to sequence together parameterized skills. We consider a setting where a robot is initially equipped with (1) a library of parameterized skills, (2) an AI planner for sequencing together the skills given a goal, and (3) a very general prior distribution for selecting skill parameters. Once deployed, the robot should rapidly and autonomously learn to improve its performance by specializing its skill parameter selection policy to the particular objects, goals, and constraints in its environment. In this work, we focus on the active learning problem of choosing which skills to practice to maximize expected future task success. We propose that the robot should estimate the competence of each skill, extrapolate the competence (asking: "how much would the competence improve through practice?"), and situate the skill in the task distribution through competence-aware planning. This approach is implemented within a fully autonomous system where the robot repeatedly plans, practices, and learns without any environment resets. Through experiments in simulation, we find that our approach learns effective parameter policies more sample-efficiently than several baselines. Experiments in the real-world demonstrate our approach's ability to handle noise from perception and control and improve the robot's ability to solve two long-horizon mobile-manipulation tasks after a few hours of autonomous practice.  ( 2 min )
    Multimodal Transformer With a Low-Computational-Cost Guarantee
    arXiv:2402.15096v1 Announce Type: new Abstract: Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers significantly suffer from a quadratic complexity of the multi-head attention with the input sequence length, especially as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to each attention head, LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.  ( 2 min )
    Distributionally Robust Off-Dynamics Reinforcement Learning: Provable Efficiency with Linear Function Approximation
    arXiv:2402.15399v1 Announce Type: new Abstract: We study off-dynamics Reinforcement Learning (RL), where the policy is trained on a source domain and deployed to a distinct target domain. We aim to solve this problem via online distributionally robust Markov decision processes (DRMDPs), where the learning algorithm actively interacts with the source domain while seeking the optimal performance under the worst possible dynamics that is within an uncertainty set of the source domain's transition kernel. We provide the first study on online DRMDPs with function approximation for off-dynamics RL. We find that DRMDPs' dual formulation can induce nonlinearity, even when the nominal transition kernel is linear, leading to error propagation. By designing a $d$-rectangular uncertainty set using the total variation distance, we remove this additional nonlinearity and bypass the error propagation. We then introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the state and action space sizes. Our work makes the first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we substantiate the performance and robustness of DR-LSVI-UCB through different numerical experiments.  ( 2 min )
    United We Pretrain, Divided We Fail! Representation Learning for Time Series by Pretraining on 75 Datasets at Once
    arXiv:2402.15404v1 Announce Type: new Abstract: In natural language processing and vision, pretraining is utilized to learn effective representations. Unfortunately, the success of pretraining does not easily carry over to time series due to potential mismatch between sources and target. Actually, common belief is that multi-dataset pretraining does not work for time series! Au contraire, we introduce a new self-supervised contrastive pretraining approach to learn one encoding from many unlabeled and diverse time series datasets, so that the single learned representation can then be reused in several target domains for, say, classification. Specifically, we propose the XD-MixUp interpolation method and the Soft Interpolation Contextual Contrasting (SICC) loss. Empirically, this outperforms both supervised training and other self-supervised pretraining methods when finetuning on low-data regimes. This disproves the common belief: We can actually learn from multiple time series datasets, even from 75 at once.  ( 2 min )
    Towards Principled Task Grouping for Multi-Task Learning
    arXiv:2402.15328v1 Announce Type: new Abstract: This paper presents a novel approach to task grouping in Multitask Learning (MTL), advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our approach offers a more theoretically grounded method that does not rely on restrictive assumptions for constructing transfer gains. We also propose a flexible mathematical programming formulation which can accommodate a wide spectrum of resource constraints, thus enhancing its versatility. Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks and time series tasks, demonstrate the superiority of our method over extensive baselines, validating its effectiveness and general applicability in MTL.  ( 2 min )
    Calibration of Deep Learning Classification Models in fNIRS
    arXiv:2402.15266v1 Announce Type: new Abstract: Functional near-infrared spectroscopy (fNIRS) is a valuable non-invasive tool for monitoring brain activity. The classification of fNIRS data in relation to conscious activity holds significance for advancing our understanding of the brain and facilitating the development of brain-computer interfaces (BCI). Many researchers have turned to deep learning to tackle the classification challenges inherent in fNIRS data due to its strong generalization and robustness. In the application of fNIRS, reliability is really important, and one mathematical formulation of the reliability of confidence is calibration. However, many researchers overlook the important issue of calibration. To address this gap, we propose integrating calibration into fNIRS field and assess the reliability of existing models. Surprisingly, our results indicate poor calibration performance in many proposed models. To advance calibration development in the fNIRS field, we summarize three practical tips. Through this letter, we hope to emphasize the critical role of calibration in fNIRS research and argue for enhancing the reliability of deep learning-based predictions in fNIRS classification tasks. All data from our experimental process are openly available on GitHub.  ( 2 min )
    Understanding Oversmoothing in Diffusion-Based GNNs From the Perspective of Operator Semigroup Theory
    arXiv:2402.15326v1 Announce Type: new Abstract: This paper presents a novel study of the oversmoothing issue in diffusion-based Graph Neural Networks (GNNs). Diverging from extant approaches grounded in random walk analysis or particle systems, we approach this problem through operator semigroup theory. This theoretical framework allows us to rigorously prove that oversmoothing is intrinsically linked to the ergodicity of the diffusion operator. This finding further poses a general and mild ergodicity-breaking condition, encompassing the various specific solutions previously offered, thereby presenting a more universal and theoretically grounded approach to mitigating oversmoothing in diffusion-based GNNs. Additionally, we offer a probabilistic interpretation of our theory, forging a link with prior works and broadening the theoretical horizon. Our experimental results reveal that this ergodicity-breaking term effectively mitigates oversmoothing measured by Dirichlet energy, and simultaneously enhances performance in node classification tasks.  ( 2 min )
    TransFlower: An Explainable Transformer-Based Model with Flow-to-Flow Attention for Commuting Flow Prediction
    arXiv:2402.15398v1 Announce Type: new Abstract: Understanding the link between urban planning and commuting flows is crucial for guiding urban development and policymaking. This research, bridging computer science and urban studies, addresses the challenge of integrating these fields with their distinct focuses. Traditional urban studies methods, like the gravity and radiation models, often underperform in complex scenarios due to their limited handling of multiple variables and reliance on overly simplistic and unrealistic assumptions, such as spatial isotropy. While deep learning models offer improved accuracy, their black-box nature poses a trade-off between performance and explainability -- both vital for analyzing complex societal phenomena like commuting flows. To address this, we introduce TransFlower, an explainable, transformer-based model employing flow-to-flow attention to predict urban commuting patterns. It features a geospatial encoder with an anisotropy-aware relative location encoder for nuanced flow representation. Following this, the transformer-based flow predictor enhances this by leveraging attention mechanisms to efficiently capture flow interactions. Our model outperforms existing methods by up to 30.8% Common Part of Commuters, offering insights into mobility dynamics crucial for urban planning and policy decisions.  ( 2 min )
    Convergence Analysis of Blurring Mean Shift
    arXiv:2402.15146v1 Announce Type: new Abstract: Blurring mean shift (BMS) algorithm, a variant of the mean shift algorithm, is a kernel-based iterative method for data clustering, where data points are clustered according to their convergent points via iterative blurring. In this paper, we analyze convergence properties of the BMS algorithm by leveraging its interpretation as an optimization procedure, which is known but has been underutilized in existing convergence studies. Whereas existing results on convergence properties applicable to multi-dimensional data only cover the case where all the blurred data point sequences converge to a single point, this study provides a convergence guarantee even when those sequences can converge to multiple points, yielding multiple clusters. This study also shows that the convergence of the BMS algorithm is fast by further leveraging geometrical characterization of the convergent points.  ( 2 min )
    Machine Unlearning by Suppressing Sample Contribution
    arXiv:2402.15109v1 Announce Type: new Abstract: Machine Unlearning (MU) is to forget data from a well-trained model, which is practically important due to the "right to be forgotten". In this paper, we start from the fundamental distinction between training data and unseen data on their contribution to the model: the training data contributes to the final model while the unseen data does not. We theoretically discover that the input sensitivity can approximately measure the contribution and practically design an algorithm, called MU-Mis (machine unlearning via minimizing input sensitivity), to suppress the contribution of the forgetting data. Experimental results demonstrate that MU-Mis outperforms state-of-the-art MU methods significantly. Additionally, MU-Mis aligns more closely with the application of MU as it does not require the use of remaining data.  ( 2 min )
    Shapley Value Based Multi-Agent Reinforcement Learning: Theory, Method and Its Application to Energy Network
    arXiv:2402.15324v1 Announce Type: cross Abstract: Multi-agent reinforcement learning is an area of rapid advancement in artificial intelligence and machine learning. One of the important questions to be answered is how to conduct credit assignment in a multi-agent system. There have been many schemes designed to conduct credit assignment by multi-agent reinforcement learning algorithms. Although these credit assignment schemes have been proved useful in improving the performance of multi-agent reinforcement learning, most of them are designed heuristically without a rigorous theoretic basis and therefore infeasible to understand how agents cooperate. In this thesis, we aim at investigating the foundation of credit assignment in multi-agent reinforcement learning via cooperative game theory. We first extend a game model called convex game and a payoff distribution scheme called Shapley value in cooperative game theory to Markov decision process, named as Markov convex game and Markov Shapley value respectively. We represent a global reward game as a Markov convex game under the grand coalition. As a result, Markov Shapley value can be reasonably used as a credit assignment scheme in the global reward game. Markov Shapley value possesses the following virtues: (i) efficiency; (ii) identifiability of dummy agents; (iii) reflecting the contribution and (iv) symmetry, which form the fair credit assignment. Based on Markov Shapley value, we propose three multi-agent reinforcement learning algorithms called SHAQ, SQDDPG and SMFPPO. Furthermore, we extend Markov convex game to partial observability to deal with the partially observable problems, named as partially observable Markov convex game. In application, we evaluate SQDDPG and SMFPPO on the real-world problem in energy networks.  ( 3 min )
    Lasso with Latents: Efficient Estimation, Covariate Rescaling, and Computational-Statistical Gaps
    arXiv:2402.15409v1 Announce Type: cross Abstract: It is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impossible to close this gap in general. In this work, we propose a natural sparse linear regression setting where strong correlations between covariates arise from unobserved latent variables. In this setting, we analyze the problem caused by strong correlations and design a surprisingly simple fix. While Lasso with standard normalization of covariates fails, there exists a heterogeneous scaling of the covariates with which Lasso will suddenly obtain strong provable guarantees for estimation. Moreover, we design a simple, efficient procedure for computing such a "smart scaling." The sample complexity of the resulting "rescaled Lasso" algorithm incurs (in the worst case) quadratic dependence on the sparsity of the underlying signal. While this dependence is not information-theoretically necessary, we give evidence that it is optimal among the class of polynomial-time algorithms, via the method of low-degree polynomials. This argument reveals a new connection between sparse linear regression and a special version of sparse PCA with a near-critical negative spike. The latter problem can be thought of as a real-valued analogue of learning a sparse parity. Using it, we also establish the first computational-statistical gap for the closely related problem of learning a Gaussian Graphical Model.  ( 3 min )
    Grasp, See and Place: Efficient Unknown Object Rearrangement with Policy Structure Prior
    arXiv:2402.15402v1 Announce Type: cross Abstract: We focus on the task of unknown object rearrangement, where a robot is supposed to re-configure the objects into a desired goal configuration specified by an RGB-D image. Recent works explore unknown object rearrangement systems by incorporating learning-based perception modules. However, they are sensitive to perception error, and pay less attention to task-level performance. In this paper, we aim to develop an effective system for unknown object rearrangement amidst perception noise. We theoretically reveal the noisy perception impacts grasp and place in a decoupled way, and show such a decoupled structure is non-trivial to improve task optimality. We propose GSP, a dual-loop system with the decoupled structure as prior. For the inner loop, we learn an active seeing policy for self-confident object matching to improve the perception of place. For the outer loop, we learn a grasp policy aware of object matching and grasp capability guided by task-level rewards. We leverage the foundation model CLIP for object matching, policy learning and self-termination. A series of experiments indicate that GSP can conduct unknown object rearrangement with higher completion rate and less steps.  ( 2 min )
    PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning
    arXiv:2402.15420v1 Announce Type: cross Abstract: Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences on different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights -- state-action pairs that contain relatively high information and are related to the features processed in a zero-shot fashion from a pretrained LLM. In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collective feedback collected serves to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect  ( 2 min )
    On normalization-equivariance properties of supervised and unsupervised denoising methods: a survey
    arXiv:2402.15352v1 Announce Type: cross Abstract: Image denoising is probably the oldest and still one of the most active research topic in image processing. Many methodological concepts have been introduced in the past decades and have improved performances significantly in recent years, especially with the emergence of convolutional neural networks and supervised deep learning. In this paper, we propose a survey of guided tour of supervised and unsupervised learning methods for image denoising, classifying the main principles elaborated during this evolution, with a particular concern given to recent developments in supervised learning. It is conceived as a tutorial organizing in a comprehensive framework current approaches. We give insights on the rationales and limitations of the most performant methods in the literature, and we highlight the common features between many of them. Finally, we focus on on the normalization equivariance properties that is surprisingly not guaranteed with most of supervised methods. It is of paramount importance that intensity shifting or scaling applied to the input image results in a corresponding change in the denoiser output.  ( 2 min )
    All Thresholds Barred: Direct Estimation of Call Density in Bioacoustic Data
    arXiv:2402.15360v1 Announce Type: cross Abstract: Passive acoustic monitoring (PAM) studies generate thousands of hours of audio, which may be used to monitor specific animal populations, conduct broad biodiversity surveys, detect threats such as poachers, and more. Machine learning classifiers for species identification are increasingly being used to process the vast amount of audio generated by bioacoustic surveys, expediting analysis and increasing the utility of PAM as a management tool. In common practice, a threshold is applied to classifier output scores, and scores above the threshold are aggregated into a detection count. The choice of threshold produces biased counts of vocalizations, which are subject to false positive/negative rates that may vary across subsets of the dataset. In this work, we advocate for directly estimating call density: The proportion of detection windows containing the target vocalization, regardless of classifier score. Our approach targets a desirable ecological estimator and provides a more rigorous grounding for identifying the core problems caused by distribution shifts -- when the defining characteristics of the data distribution change -- and designing strategies to mitigate them. We propose a validation scheme for estimating call density in a body of data and obtain, through Bayesian reasoning, probability distributions of confidence scores for both the positive and negative classes. We use these distributions to predict site-level densities, which may be subject to distribution shifts. We test our proposed methods on a real-world study of Hawaiian birds and provide simulation results leveraging existing fully annotated datasets, demonstrating robustness to variations in call density and classifier model quality.  ( 3 min )
    Dual Encoder: Exploiting the Potential of Syntactic and Semantic for Aspect Sentiment Triplet Extraction
    arXiv:2402.15370v1 Announce Type: cross Abstract: Aspect Sentiment Triple Extraction (ASTE) is an emerging task in fine-grained sentiment analysis. Recent studies have employed Graph Neural Networks (GNN) to model the syntax-semantic relationships inherent in triplet elements. However, they have yet to fully tap into the vast potential of syntactic and semantic information within the ASTE task. In this work, we propose a \emph{Dual Encoder: Exploiting the potential of Syntactic and Semantic} model (D2E2S), which maximizes the syntactic and semantic relationships among words. Specifically, our model utilizes a dual-channel encoder with a BERT channel to capture semantic information, and an enhanced LSTM channel for comprehensive syntactic information capture. Subsequently, we introduce the heterogeneous feature interaction module to capture intricate interactions between dependency syntax and attention semantics, and to dynamically select vital nodes. We leverage the synergy of these modules to harness the significant potential of syntactic and semantic information in ASTE tasks. Testing on public benchmarks, our D2E2S model surpasses the current state-of-the-art(SOTA), demonstrating its effectiveness.  ( 2 min )
    NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data
    arXiv:2402.15343v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown impressive abilities in data annotation, opening the way for new approaches to solve classic NLP problems. In this paper, we show how to use LLMs to create NuNER, a compact language representation model specialized in the Named Entity Recognition (NER) task. NuNER can be fine-tuned to solve downstream NER problems in a data-efficient way, outperforming similar-sized foundation models in the few-shot regime and competing with much larger LLMs. We find that the size and entity-type diversity of the pre-training dataset are key to achieving good performance. We view NuNER as a member of the broader family of task-specific foundation models, recently unlocked by LLMs.  ( 2 min )
    Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates
    arXiv:2402.15344v1 Announce Type: cross Abstract: The performance of stochastic gradient descent (SGD), which is the simplest first-order optimizer for training deep neural networks, depends on not only the learning rate but also the batch size. They both affect the number of iterations and the stochastic first-order oracle (SFO) complexity needed for training. In particular, the previous numerical results indicated that, for SGD using a constant learning rate, the number of iterations needed for training decreases when the batch size increases, and the SFO complexity needed for training is minimized at a critical batch size and that it increases once the batch size exceeds that size. Here, we study the relationship between batch size and the iteration and SFO complexities needed for nonconvex optimization in deep learning with SGD using constant or decaying learning rates and show that SGD using the critical batch size minimizes the SFO complexity. We also provide numerical comparisons of SGD with the existing first-order optimizers and show the usefulness of SGD using a critical batch size. Moreover, we show that measured critical batch sizes are close to the sizes estimated from our theoretical results.  ( 2 min )
    Farsight: Fostering Responsible AI Awareness During AI Application Prototyping
    arXiv:2402.15350v1 Announce Type: cross Abstract: Prompt-based interfaces for Large Language Models (LLMs) have made prototyping and building AI-powered applications easier than ever before. However, identifying potential harms that may arise from AI applications remains a challenge, particularly during prompt-based prototyping. To address this, we present Farsight, a novel in situ interactive tool that helps people identify potential harms from the AI applications they are prototyping. Based on a user's prompt, Farsight highlights news articles about relevant AI incidents and allows users to explore and edit LLM-generated use cases, stakeholders, and harms. We report design insights from a co-design study with 10 AI prototypers and findings from a user study with 42 AI prototypers. After using Farsight, AI prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. Their qualitative feedback also highlights that Farsight encourages them to focus on end-users and think beyond immediate harms. We discuss these findings and reflect on their implications for designing AI prototyping experiences that meaningfully engage with AI harms. Farsight is publicly accessible at: https://PAIR-code.github.io/farsight.  ( 3 min )
    Low-Rank Representations Meets Deep Unfolding: A Generalized and Interpretable Network for Hyperspectral Anomaly Detection
    arXiv:2402.15335v1 Announce Type: cross Abstract: Current hyperspectral anomaly detection (HAD) benchmark datasets suffer from low resolution, simple background, and small size of the detection data. These factors also limit the performance of the well-known low-rank representation (LRR) models in terms of robustness on the separation of background and target features and the reliance on manual parameter selection. To this end, we build a new set of HAD benchmark datasets for improving the robustness of the HAD algorithm in complex scenarios, AIR-HAD for short. Accordingly, we propose a generalized and interpretable HAD network by deeply unfolding a dictionary-learnable LLR model, named LRR-Net$^+$, which is capable of spectrally decoupling the background structure and object properties in a more generalized fashion and eliminating the bias introduced by vital interference targets concurrently. In addition, LRR-Net$^+$ integrates the solution process of the Alternating Direction Method of Multipliers (ADMM) optimizer with the deep network, guiding its search process and imparting a level of interpretability to parameter optimization. Additionally, the integration of physical models with DL techniques eliminates the need for manual parameter tuning. The manually tuned parameters are seamlessly transformed into trainable parameters for deep neural networks, facilitating a more efficient and automated optimization process. Extensive experiments conducted on the AIR-HAD dataset show the superiority of our LRR-Net$^+$ in terms of detection performance and generalization ability, compared to top-performing rivals. Furthermore, the compilable codes and our AIR-HAD benchmark datasets in this paper will be made available freely and openly at \url{https://sites.google.com/view/danfeng-hong}.  ( 3 min )
    OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding
    arXiv:2402.15321v1 Announce Type: cross Abstract: This report provides an overview of the challenge hosted at the OpenSUN3D Workshop on Open-Vocabulary 3D Scene Understanding held in conjunction with ICCV 2023. The goal of this workshop series is to provide a platform for exploration and discussion of open-vocabulary 3D scene understanding tasks, including but not limited to segmentation, detection and mapping. We provide an overview of the challenge hosted at the workshop, present the challenge dataset, the evaluation methodology, and brief descriptions of the winning methods. For additional details, please see https://opensun3d.github.io/index_iccv23.html.  ( 2 min )
    Representing Online Handwriting for Recognition in Large Vision-Language Models
    arXiv:2402.15307v1 Announce Type: cross Abstract: The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that includes both a time-ordered sequence of strokes as text, and as image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.  ( 2 min )
    Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
    arXiv:2402.15300v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, instruction tuning on additional datasets, or incorporating complex external tools. We first perform empirical analysis on sentence-level LVLM hallucination, finding that CLIP similarity to the image acts as a stronger and more robust indicator of hallucination compared to token likelihoods. Motivated by this, we introduce our CLIP-Guided Decoding (CGD) approach, a straightforward but effective training-free approach to reduce object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process by enhancing visual grounding of generated text with the image. Experiments demonstrate that CGD effectively mitigates object hallucination across multiple LVLM families while preserving the utility of text generation.  ( 2 min )
    Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models
    arXiv:2402.15301v1 Announce Type: cross Abstract: Causal graph recovery is essential in the field of causal inference. Traditional methods are typically knowledge-based or statistical estimation-based, which are limited by data collection biases and individuals' knowledge about factors affecting the relations between variables of interests. The advance of large language models (LLMs) provides opportunities to address these problems. We propose a novel method that utilizes the extensive knowledge contained within a large corpus of scientific literature to deduce causal relationships in general causal graph recovery tasks. This method leverages Retrieval Augmented-Generation (RAG) based LLMs to systematically analyze and extract pertinent information from a comprehensive collection of research papers. Our method first retrieves relevant text chunks from the aggregated literature. Then, the LLM is tasked with identifying and labelling potential associations between factors. Finally, we give a method to aggregate the associational relationships to build a causal graph. We demonstrate our method is able to construct high quality causal graphs on the well-known SACHS dataset solely from literature.  ( 2 min )
    MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models
    arXiv:2402.15268v1 Announce Type: cross Abstract: Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe a LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history. We also test MemoryPrompt on a long-distance dialogue dataset, where its performance is comparable to that of a model conditioned on the entire conversation history. In both experiments we also observe that, unlike full-finetuning approaches, MemoryPrompt does not suffer from catastrophic forgetting when adapted to new tasks, thus not disrupting the generalist capabilities of the underlying LM.  ( 2 min )
    Real-Time FPGA Demonstrator of ANN-Based Equalization for Optical Communications
    arXiv:2402.15288v1 Announce Type: cross Abstract: In this work, we present a high-throughput field programmable gate array (FPGA) demonstrator of an artificial neural network (ANN)-based equalizer. The equalization is performed and illustrated in real-time for a 30 GBd, two-level pulse amplitude modulation (PAM2) optical communication system.  ( 2 min )
    A Survey of Music Generation in the Context of Interaction
    arXiv:2402.15294v1 Announce Type: cross Abstract: In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (eg. generating a Bach-style chorale) or style transfer (eg. classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straight-forward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, neither is clear, how such models and resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.  ( 2 min )
    A note on the adjoint method for neural ordinary differential equation network
    arXiv:2402.15141v1 Announce Type: cross Abstract: Perturbation and operator adjoint method are used to give the right adjoint form rigourously. From the derivation, we can have following results: 1) The loss gradient is not an ODE, it is an integral and we shows the reason; 2) The traditional adjoint form is not equivalent with the back propagation results. 3) The adjoint operator analysis shows that if and only if the discrete adjoint has the same scheme with the discrete neural ODE, the adjoint form would give the same results as BP does.  ( 2 min )
    Biomedical Entity Linking as Multiple Choice Question Answering
    arXiv:2402.15189v1 Announce Type: cross Abstract: Although biomedical entity linking (BioEL) has made significant progress with pre-trained language models, challenges still exist for fine-grained and long-tailed entities. To address these challenges, we present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering. BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity. This formulation enables explicit comparison of different candidate entities, thus capturing fine-grained interactions between mentions and entities, as well as among entities themselves. To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with retrieved instances for the generator. Extensive experimental results show that BioELQA outperforms state-of-the-art baselines on several datasets.  ( 2 min )
    High Resolution Guitar Transcription via Domain Adaptation
    arXiv:2402.15258v1 Announce Type: cross Abstract: Automatic music transcription (AMT) has achieved high accuracy for piano due to the availability of large, high-quality datasets such as MAESTRO and MAPS, but comparable datasets are not yet available for other instruments. In recent work, however, it has been demonstrated that aligning scores to transcription model activations can produce high quality AMT training data for instruments other than piano. Focusing on the guitar, we refine this approach to training on score data using a dataset of commercially available score-audio pairs. We propose the use of a high-resolution piano transcription model to train a new guitar transcription model. The resulting model obtains state-of-the-art transcription results on GuitarSet in a zero-shot context, improving on previously published methods.  ( 2 min )
    Open Ad Hoc Teamwork with Cooperative Game Theory
    arXiv:2402.15259v1 Announce Type: cross Abstract: Ad hoc teamwork poses a challenging problem, requiring the design of an agent to collaborate with teammates without prior coordination or joint training. Open ad hoc teamwork further complicates this challenge by considering environments with a changing number of teammates, referred to as open teams. The state-of-the-art solution to this problem is graph-based policy learning (GPL), leveraging the generalizability of graph neural networks to handle an unrestricted number of agents and effectively address open teams. GPL's performance is superior to other methods, but its joint Q-value representation presents challenges for interpretation, hindering further development of this research line and applicability. In this paper, we establish a new theory to give an interpretation for the joint Q-value representation employed in GPL, from the perspective of cooperative game theory. Building on our theory, we propose a novel algorithm based on GPL framework, to complement the critical features that facilitate learning, but overlooked in GPL. Through experiments, we demonstrate the correctness of our theory by comparing the performance of the resulting algorithm with GPL in dynamic team compositions.  ( 2 min )
    Statistical Agnostic Regression: a machine learning method to validate regression models
    arXiv:2402.15213v1 Announce Type: cross Abstract: Regression analysis is a central topic in statistical modeling, aiming to estimate the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in several fields of research, such as prediction, forecasting, or causal inference. Beyond various classical methods to solve linear regression problems, such as Ordinary Least Squares, Ridge, or Lasso regressions - which are often the foundation for more advanced machine learning (ML) techniques - the latter have been successfully applied in this scenario without a formal definition of statistical significance. At most, permutation or classical analyses based on empirical measures (e.g., residuals or accuracy) have been conducted to reflect the greater ability of ML estimations for detection. In this paper, we introduce a method, named Statistical Agnostic Regression (SAR), for evaluating the statistical significance of an ML-based linear regression based on concentration inequalities of the actual risk using the analysis of the worst case. To achieve this goal, similar to the classification problem, we define a threshold to establish that there is sufficient evidence with a probability of at least 1-eta to conclude that there is a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations in only two dimensions demonstrate the ability of the proposed agnostic test to provide a similar analysis of variance given by the classical $F$ test for the slope parameter.  ( 3 min )
    The AffectToolbox: Affect Analysis for Everyone
    arXiv:2402.15195v1 Announce Type: cross Abstract: In the field of affective computing, where research continually advances at a rapid pace, the demand for user-friendly tools has become increasingly apparent. In this paper, we present the AffectToolbox, a novel software system that aims to support researchers in developing affect-sensitive studies and prototypes. The proposed system addresses the challenges posed by existing frameworks, which often require profound programming knowledge and cater primarily to power-users or skilled developers. Aiming to facilitate ease of use, the AffectToolbox requires no programming knowledge and offers its functionality to reliably analyze the affective state of users through an accessible graphical user interface. The architecture encompasses a variety of models for emotion recognition on multiple affective channels and modalities, as well as an elaborate fusion system to merge multi-modal assessments into a unified result. The entire system is open-sourced and will be publicly available to ensure easy integration into more complex applications through a well-structured, Python-based code base - therefore marking a substantial contribution toward advancing affective computing research and fostering a more collaborative and inclusive environment within this interdisciplinary field.  ( 2 min )
    Classification of compact radio sources in the Galactic plane with supervised machine learning
    arXiv:2402.15232v1 Announce Type: cross Abstract: Generation of science-ready data from processed data products is one of the major challenges in next-generation radio continuum surveys with the Square Kilometre Array (SKA) and its precursors, due to the expected data volume and the need to achieve a high degree of automated processing. Source extraction, characterization, and classification are the major stages involved in this process. In this work we focus on the classification of compact radio sources in the Galactic plane using both radio and infrared images as inputs. To this aim, we produced a curated dataset of ~20,000 images of compact sources of different astronomical classes, obtained from past radio and infrared surveys, and novel radio data from pilot surveys carried out with the Australian SKA Pathfinder (ASKAP). Radio spectral index information was also obtained for a subset of the data. We then trained two different classifiers on the produced dataset. The first model uses gradient-boosted decision trees and is trained on a set of pre-computed features derived from the data, which include radio-infrared colour indices and the radio spectral index. The second model is trained directly on multi-channel images, employing convolutional neural networks. Using a completely supervised procedure, we obtained a high classification accuracy (F1-score>90%) for separating Galactic objects from the extragalactic background. Individual class discrimination performances, ranging from 60% to 75%, increased by 10% when adding far-infrared and spectral index information, with extragalactic objects, PNe and HII regions identified with higher accuracies. The implemented tools and trained models were publicly released, and made available to the radioastronomical community for future application on new radio data.  ( 3 min )
    Convergence Analysis of Split Federated Learning on Heterogeneous Data
    arXiv:2402.15166v1 Announce Type: cross Abstract: Split federated learning (SFL) is a recent distributed approach for collaborative model training among multiple clients. In SFL, a global model is typically split into two parts, where clients train one part in a parallel federated manner, and a main server trains the other. Despite the recent research on SFL algorithm development, the convergence analysis of SFL is missing in the literature, and this paper aims to fill this gap. The analysis of SFL can be more challenging than that of federated learning (FL), due to the potential dual-paced updates at the clients and the main server. We provide convergence analysis of SFL for strongly convex and general convex objectives on heterogeneous data. The convergence rates are $O(1/T)$ and $O(1/\sqrt[3]{T})$, respectively, where $T$ denotes the total number of rounds for SFL training. We further extend the analysis to non-convex objectives and where some clients may be unavailable during training. Numerical experiments validate our theoretical results and show that SFL outperforms FL and split learning (SL) when data is highly heterogeneous across a large number of clients.  ( 2 min )
    Safety Optimized Reinforcement Learning via Multi-Objective Policy Optimization
    arXiv:2402.15197v1 Announce Type: cross Abstract: Safe reinforcement learning (Safe RL) refers to a class of techniques that aim to prevent RL algorithms from violating constraints in the process of decision-making and exploration during trial and error. In this paper, a novel model-free Safe RL algorithm, formulated based on the multi-objective policy optimization framework is introduced where the policy is optimized towards optimality and safety, simultaneously. The optimality is achieved by the environment reward function that is subsequently shaped using a safety critic. The advantage of the Safety Optimized RL (SORL) algorithm compared to the traditional Safe RL algorithms is that it omits the need to constrain the policy search space. This allows SORL to find a natural tradeoff between safety and optimality without compromising the performance in terms of either safety or optimality due to strict search space constraints. Through our theoretical analysis of SORL, we propose a condition for SORL's converged policy to guarantee safety and then use it to introduce an aggressiveness parameter that allows for fine-tuning the mentioned tradeoff. The experimental results obtained in seven different robotic environments indicate a considerable reduction in the number of safety violations along with higher, or competitive, policy returns, in comparison to six different state-of-the-art Safe RL methods. The results demonstrate the significant superiority of the proposed SORL algorithm in safety-critical applications.  ( 3 min )
    EasyRL4Rec: A User-Friendly Code Library for Reinforcement Learning Based Recommender Systems
    arXiv:2402.15164v1 Announce Type: cross Abstract: Reinforcement Learning (RL)-Based Recommender Systems (RSs) are increasingly recognized for their ability to improve long-term user engagement. Yet, the field grapples with challenges such as the absence of accessible frameworks, inconsistent evaluation standards, and the complexity of replicating prior work. Addressing these obstacles, we present EasyRL4Rec, a user-friendly and efficient library tailored for RL-based RSs. EasyRL4Rec features lightweight, diverse RL environments built on five widely-used public datasets, and is equipped with comprehensive core modules that offer rich options to ease the development of models. It establishes consistent evaluation criteria with a focus on long-term impacts and introduces customized solutions for state modeling and action representation tailored to recommender systems. Additionally, we share valuable insights gained from extensive experiments with current methods. EasyRL4Rec aims to facilitate the model development and experimental process in the domain of RL-based RSs. The library is openly accessible at https://github.com/chongminggao/EasyRL4Rec.  ( 2 min )
    Machine Unlearning of Pre-trained Large Language Models
    arXiv:2402.15159v1 Announce Type: cross Abstract: This study investigates the concept of the `right to be forgotten' within the context of large language models (LLMs). We explore machine unlearning as a pivotal solution, with a focus on pre-trained models--a notably under-researched area. Our research delineates a comprehensive framework for machine unlearning in pre-trained LLMs, encompassing a critical analysis of seven diverse unlearning methods. Through rigorous evaluation using curated datasets from arXiv, books, and GitHub, we establish a robust benchmark for unlearning performance, demonstrating that these methods are over $10^5$ times more computationally efficient than retraining. Our results show that integrating gradient ascent with gradient descent on in-distribution data improves hyperparameter robustness. We also provide detailed guidelines for efficient hyperparameter tuning in the unlearning process. Our findings advance the discourse on ethical AI practices, offering substantive insights into the mechanics of machine unlearning for pre-trained LLMs and underscoring the potential for responsible AI development.  ( 2 min )
    Entity-level Factual Adaptiveness of Fine-tuning based Abstractive Summarization Models
    arXiv:2402.15162v1 Announce Type: cross Abstract: Abstractive summarization models often generate factually inconsistent content particularly when the parametric knowledge of the model conflicts with the knowledge in the input document. In this paper, we analyze the robustness of fine-tuning based summarization models to the knowledge conflict, which we call factual adaptiveness. We utilize pre-trained language models to construct evaluation sets and find that factual adaptiveness is not strongly correlated with factual consistency on original datasets. Furthermore, we introduce a controllable counterfactual data augmentation method where the degree of knowledge conflict within the augmented data can be adjustable. Our experimental results on two pre-trained language models (PEGASUS and BART) and two fine-tuning datasets (XSum and CNN/DailyMail) demonstrate that our method enhances factual adaptiveness while achieving factual consistency on original datasets on par with the contrastive learning baseline.  ( 2 min )
    Attention-Guided Masked Autoencoders For Learning Image Representations
    arXiv:2402.15172v1 Announce Type: cross Abstract: Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene which we employ in the loss function to put increased emphasis on reconstructing relevant objects, thus effectively incentivizing the model to learn more object-focused representations without compromising the established masking strategy. Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE, demonstrated by improved linear probing and k-NN classification results on several benchmarks while at the same time making ViTs more robust against varying backgrounds.  ( 2 min )
    Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
    arXiv:2402.15055v1 Announce Type: cross Abstract: In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a particular token, activating the associated neuron through the residual connection. We focus specifically on heads in earlier layers consistently activating the same next-token neuron across similar prompts. Exploring these differential activation patterns reveals that heads that specialize for distinct linguistic contexts are tied to generating certain tokens. Overall, our method combines neural explanations and probing isolated components to illuminate how attention enables context-dependent, specialized processing in LLMs.  ( 2 min )
    PUAD: Frustratingly Simple Method for Robust Anomaly Detection
    arXiv:2402.15143v1 Announce Type: cross Abstract: Developing an accurate and fast anomaly detection model is an important task in real-time computer vision applications. There has been much research to develop a single model that detects either structural or logical anomalies, which are inherently distinct. The majority of the existing approaches implicitly assume that the anomaly can be represented by identifying the anomalous location. However, we argue that logical anomalies, such as the wrong number of objects, can not be well-represented by the spatial feature maps and require an alternative approach. In addition, we focused on the possibility of detecting logical anomalies by using an out-of-distribution detection approach on the feature space, which aggregates the spatial information of the feature map. As a demonstration, we propose a method that incorporates a simple out-of-distribution detection method on the feature space against state-of-the-art reconstruction-based approaches. Despite the simplicity of our proposal, our method PUAD (Picturable and Unpicturable Anomaly Detection) achieves state-of-the-art performance on the MVTec LOCO AD dataset.  ( 2 min )
    Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
    arXiv:2402.15120v1 Announce Type: cross Abstract: Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.  ( 2 min )
    Data-Driven Ground-Fault Location Method in Distribution Power System With Distributed Generation
    arXiv:2402.14894v1 Announce Type: cross Abstract: The recent increase in renewable energy penetration at the distribution level introduces a multi-directional power flow that outdated traditional fault location techniques. To this extent, the development of new methods is needed to ensure fast and accurate fault localization and, hence, strengthen power system reliability. This paper proposes a data-driven ground fault location method for the power distribution system. An 11-bus 20 kV power system is modeled in Matlab/Simulink to simulate ground faults. The faults are generated at different locations and under various system operational states. Time-domain faulted three-phase voltages at the system substation are then analyzed with discrete wavelet transform. Statistical quantities of the processed data are eventually used to train an Artificial Neural Network (ANN) to find a mapping between computed voltage features and faults. Specifically, three ANNs allow the prediction of faulted phase, faulted branch, and fault distance from the system substation separately. According to the results, the method shows good potential, with a total relative error of 0,4% for fault distance prediction. The method is applied to datasets with unknown system states to test robustness.  ( 2 min )
    Studying LLM Performance on Closed- and Open-source Data
    arXiv:2402.15100v1 Announce Type: cross Abstract: Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS --> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning.  ( 2 min )
    Physics-constrained polynomial chaos expansion for scientific machine learning and uncertainty quantification
    arXiv:2402.15115v1 Announce Type: cross Abstract: We present a novel physics-constrained polynomial chaos expansion as a surrogate modeling method capable of performing both scientific machine learning (SciML) and uncertainty quantification (UQ) tasks. The proposed method possesses a unique capability: it seamlessly integrates SciML into UQ and vice versa, which allows it to quantify the uncertainties in SciML tasks effectively and leverage SciML for improved uncertainty assessment during UQ-related tasks. The proposed surrogate model can effectively incorporate a variety of physical constraints, such as governing partial differential equations (PDEs) with associated initial and boundary conditions constraints, inequality-type constraints (e.g., monotonicity, convexity, non-negativity, among others), and additional a priori information in the training process to supplement limited data. This ensures physically realistic predictions and significantly reduces the need for expensive computational model evaluations to train the surrogate model. Furthermore, the proposed method has a built-in uncertainty quantification (UQ) feature to efficiently estimate output uncertainties. To demonstrate the effectiveness of the proposed method, we apply it to a diverse set of problems, including linear/non-linear PDEs with deterministic and stochastic parameters, data-driven surrogate modeling of a complex physical system, and UQ of a stochastic system with parameters modeled as random fields.  ( 2 min )
    The Umeyama algorithm for matching correlated Gaussian geometric models in the low-dimensional regime
    arXiv:2402.15095v1 Announce Type: cross Abstract: Motivated by the problem of matching two correlated random geometric graphs, we study the problem of matching two Gaussian geometric models correlated through a latent node permutation. Specifically, given an unknown permutation $\pi^*$ on $\{1,\ldots,n\}$ and given $n$ i.i.d. pairs of correlated Gaussian vectors $\{X_{\pi^*(i)},Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, we consider two types of (correlated) weighted complete graphs with edge weights given by $A_{i,j}=\langle X_i,X_j \rangle$, $B_{i,j}=\langle Y_i,Y_j \rangle$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observed matrices $A$ and $B$. For the low-dimensional regime where $d=O(\log n)$, Wang, Wu, Xu, and Yolou [WWXY22+] established the information thresholds for exact and almost exact recovery in matching correlated Gaussian geometric models. They also conducted numerical experiments for the classical Umeyama algorithm. In our work, we prove that this algorithm achieves exact recovery of $\pi^*$ when the noise parameter $\sigma=o(d^{-3}n^{-2/d})$, and almost exact recovery when $\sigma=o(d^{-3}n^{-1/d})$. Our results approach the information thresholds up to a $\operatorname{poly}(d)$ factor in the low-dimensional regime.  ( 2 min )
    Novelty Detection on Radio Astronomy Data using Signatures
    arXiv:2402.14892v1 Announce Type: cross Abstract: We introduce SigNova, a new semi-supervised framework for detecting anomalies in streamed data. While our initial examples focus on detecting radio-frequency interference (RFI) in digitized signals within the field of radio astronomy, it is important to note that SigNova's applicability extends to any type of streamed data. The framework comprises three primary components. Firstly, we use the signature transform to extract a canonical collection of summary statistics from observational sequences. This allows us to represent variable-length visibility samples as finite-dimensional feature vectors. Secondly, each feature vector is assigned a novelty score, calculated as the Mahalanobis distance to its nearest neighbor in an RFI-free training set. By thresholding these scores we identify observation ranges that deviate from the expected behavior of RFI-free visibility samples without relying on stringent distributional assumptions. Thirdly, we integrate this anomaly detector with Pysegments, a segmentation algorithm, to localize consecutive observations contaminated with RFI, if any. This approach provides a compelling alternative to classical windowing techniques commonly used for RFI detection. Importantly, the complexity of our algorithm depends on the RFI pattern rather than on the size of the observation window. We demonstrate how SigNova improves the detection of various types of RFI (e.g., broadband and narrowband) in time-frequency visibility data. We validate our framework on the Murchison Widefield Array (MWA) telescope and simulated data and the Hydrogen Epoch of Reionization Array (HERA).  ( 3 min )
    Fine-tuning Large Language Models for Domain-specific Machine Translation
    arXiv:2402.15061v1 Announce Type: cross Abstract: Large language models (LLMs) have made significant progress in machine translation (MT). However, their potential in domain-specific MT remains under-explored. Current LLM-based MT systems still face several challenges. First, for LLMs with in-context learning, their effectiveness is highly sensitive to input translation examples, and processing them can increase inference costs. They often require extra post-processing due to over-generation. Second, LLMs with fine-tuning on domain-specific data often require high training costs for domain adaptation, and may weaken the zero-shot MT capabilities of LLMs due to over-specialization. The aforementioned methods can struggle to translate rare words in domain transfer scenarios. To address these challenges, this paper proposes a prompt-oriented fine-tuning method, denoted as LlamaIT, to effectively and efficiently fine-tune a general-purpose LLM for domain-specific MT tasks. First, we construct a task-specific mix-domain dataset, which is then used to fine-tune the LLM with LoRA. This can eliminate the need for input translation examples, post-processing, or over-specialization. By zero-shot prompting with instructions, we adapt the MT tasks to the target domain at inference time. To further elicit the MT capability for rare words, we construct new prompts by incorporating domain-specific bilingual vocabulary. We also conduct extensive experiments on both publicly available and self-constructed datasets. The results show that our LlamaIT can significantly enhance the domain-specific MT capabilities of the LLM, meanwhile preserving its zero-shot MT capabilities.  ( 2 min )
    AttributionBench: How Hard is Automatic Attribution Evaluation?
    arXiv:2402.15089v1 Announce Type: cross Abstract: Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information, and the discrepancy between the information the model has access to and that human annotators do.  ( 2 min )
    User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient
    arXiv:1710.00095v4 Announce Type: replace-cross Abstract: In this paper, we study the problem of sampling from a given probability density function that is known to be smooth and strongly log-concave. We analyze several methods of approximate sampling based on discretizations of the (highly overdamped) Langevin diffusion and establish guarantees on its error measured in the Wasserstein-2 distance. Our guarantees improve or extend the state-of-the-art results in three directions. First, we provide an upper bound on the error of the first-order Langevin Monte Carlo (LMC) algorithm with optimized varying step-size. This result has the advantage of being horizon free (we do not need to know in advance the target precision) and to improve by a logarithmic factor the corresponding result for the constant step-size. Second, we study the case where accurate evaluations of the gradient of the log-density are unavailable, but one can have access to approximations of the aforementioned gradient. In such a situation, we consider both deterministic and stochastic approximations of the gradient and provide an upper bound on the sampling error of the first-order LMC that quantifies the impact of the gradient evaluation inaccuracies. Third, we establish upper bounds for two versions of the second-order LMC, which leverage the Hessian of the log-density. We provide nonasymptotic guarantees on the sampling error of these second-order LMCs. These guarantees reveal that the second-order LMC algorithms improve on the first-order LMC in ill-conditioned settings.  ( 3 min )
    Bernstein Flows for Flexible Posteriors in Variational Bayes
    arXiv:2202.05650v2 Announce Type: replace-cross Abstract: Variational inference (VI) is a technique to approximate difficult to compute posteriors by optimization. In contrast to MCMC, VI scales to many observations. In the case of complex posteriors, however, state-of-the-art VI approaches often yield unsatisfactory posterior approximations. This paper presents Bernstein flow variational inference (BF-VI), a robust and easy-to-use method, flexible enough to approximate complex multivariate posteriors. BF-VI combines ideas from normalizing flows and Bernstein polynomial-based transformation models. In benchmark experiments, we compare BF-VI solutions with exact posteriors, MCMC solutions, and state-of-the-art VI methods including normalizing flow based VI. We show for low-dimensional models that BF-VI accurately approximates the true posterior; in higher-dimensional models, BF-VI outperforms other VI methods. Further, we develop with BF-VI a Bayesian model for the semi-structured Melanoma challenge data, combining a CNN model part for image data with an interpretable model part for tabular data, and demonstrate for the first time how the use of VI in semi-structured models.  ( 2 min )
    Human-Aligned Calibration for AI-Assisted Decision Making
    arXiv:2306.00074v4 Announce Type: replace Abstract: Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties at developing a good sense on when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values -- an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.  ( 3 min )
    ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation
    arXiv:2402.15429v1 Announce Type: cross Abstract: Text-to-Image (T2I) Diffusion Models (DMs) have shown impressive abilities in generating high-quality images based on simple text descriptions. However, as is common with many Deep Learning (DL) models, DMs are subject to a lack of robustness. While there are attempts to evaluate the robustness of T2I DMs as a binary or worst-case problem, they cannot answer how robust in general the model is whenever an adversarial example (AE) can be found. In this study, we first introduce a probabilistic notion of T2I DMs' robustness; and then establish an efficient framework, ProTIP, to evaluate it with statistical guarantees. The main challenges stem from: i) the high computational cost of the generation process; and ii) determining if a perturbed input is an AE involves comparing two output distributions, which is fundamentally harder compared to other DL tasks like classification where an AE is identified upon misprediction of labels. To tackle the challenges, we employ sequential analysis with efficacy and futility early stopping rules in the statistical testing for identifying AEs, and adaptive concentration inequalities to dynamically determine the "just-right" number of stochastic perturbations whenever the verification target is met. Empirical experiments validate the effectiveness and efficiency of ProTIP over common T2I DMs. Finally, we demonstrate an application of ProTIP to rank commonly used defence methods.  ( 2 min )
    Unleashing the Power of Imbalanced Modality Information for Multi-modal Knowledge Graph Completion
    arXiv:2402.15444v1 Announce Type: cross Abstract: Multi-modal knowledge graph completion (MMKGC) aims to predict the missing triples in the multi-modal knowledge graphs by incorporating structural, visual, and textual information of entities into the discriminant models. The information from different modalities will work together to measure the triple plausibility. Existing MMKGC methods overlook the imbalance problem of modality information among entities, resulting in inadequate modal fusion and inefficient utilization of the raw modality information. To address the mentioned problems, we propose Adaptive Multi-modal Fusion and Modality Adversarial Training (AdaMF-MAT) to unleash the power of imbalanced modality information for MMKGC. AdaMF-MAT achieves multi-modal fusion with adaptive modality weights and further generates adversarial samples by modality-adversarial training to enhance the imbalanced modality information. Our approach is a co-design of the MMKGC model and training strategy which can outperform 19 recent MMKGC methods and achieve new state-of-the-art results on three public MMKGC benchmarks. Our code and data have been released at https://github.com/zjukg/AdaMF-MAT.  ( 2 min )
    Estimation of partially known Gaussian graphical models with score-based structural priors
    arXiv:2401.14340v3 Announce Type: replace-cross Abstract: We propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or a maximum a posteriori criterion using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments demonstrate the benefits of our approach.  ( 2 min )
    Adaptive Deep Learning for Efficient Visual Pose Estimation aboard Ultra-low-power Nano-drones
    arXiv:2401.15236v2 Announce Type: replace-cross Abstract: Sub-10cm diameter nano-drones are gaining momentum thanks to their applicability in scenarios prevented to bigger flying drones, such as in narrow environments and close to humans. However, their tiny form factor also brings their major drawback: ultra-constrained memory and processors for the onboard execution of their perception pipelines. Therefore, lightweight deep learning-based approaches are becoming increasingly popular, stressing how computational efficiency and energy-saving are paramount as they can make the difference between a fully working closed-loop system and a failing one. In this work, to maximize the exploitation of the ultra-limited resources aboard nano-drones, we present a novel adaptive deep learning-based mechanism for the efficient execution of a vision-based human pose estimation task. We leverage two State-of-the-Art (SoA) convolutional neural networks (CNNs) with different regression performance vs. computational costs trade-offs. By combining these CNNs with three novel adaptation strategies based on the output's temporal consistency and on auxiliary tasks to swap the CNN being executed proactively, we present six different systems. On a real-world dataset and the actual nano-drone hardware, our best-performing system, compared to executing only the bigger and most accurate SoA model, shows 28% latency reduction while keeping the same mean absolute error (MAE), 3% MAE reduction while being iso-latency, and the absolute peak performance, i.e., 6% better than SoA model.  ( 3 min )
    Remembering to Be Fair: Non-Markovian Fairness in Sequential Decision Making
    arXiv:2312.04772v3 Announce Type: replace-cross Abstract: Fair decision making has largely been studied with respect to a single decision. In this paper we investigate the notion of fairness in the context of sequential decision making where multiple stakeholders can be affected by the outcomes of decisions. We observe that fairness often depends on the history of the sequential decision-making process, and in this sense that it is inherently non-Markovian. We further observe that fairness often needs to be assessed at time points within the process, not just at the end of the process. To advance our understanding of this class of fairness problems, we explore the notion of non-Markovian fairness in the context of sequential decision making. We identify properties of non-Markovian fairness, including notions of long-term, anytime, periodic, and bounded fairness. We further explore the interplay between non-Markovian fairness and memory, and how this can support construction of fair policies for making sequential decisions.  ( 2 min )
    A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the same Skeleton
    arXiv:2310.04218v2 Announce Type: replace-cross Abstract: Causal DAGs (also known as Bayesian networks) are a popular tool for encoding conditional dependencies between random variables. In a causal DAG, the random variables are modeled as vertices in the DAG, and it is stipulated that every random variable is independent of its ancestors conditioned on its parents. It is possible, however, for two different causal DAGs on the same set of random variables to encode exactly the same set of conditional dependencies. Such causal DAGs are said to be Markov equivalent, and equivalence classes of Markov equivalent DAGs are known as Markov Equivalent Classes (MECs). Beautiful combinatorial characterizations of MECs have been developed in the past few decades, and it is known, in particular that all DAGs in the same MEC must have the same "skeleton" (underlying undirected graph) and v-structures (induced subgraph of the form $a\rightarrow b \leftarrow c$). These combinatorial characterizations also suggest several natural algorithmic questions. One of these is: given an undirected graph $G$ as input, how many distinct Markov equivalence classes have the skeleton $G$? Much work has been devoted in the last few years to this and other closely related problems. However, to the best of our knowledge, a polynomial time algorithm for the problem remains unknown. In this paper, we make progress towards this goal by giving a fixed parameter tractable algorithm for the above problem, with the parameters being the treewidth and the maximum degree of the input graph $G$. The main technical ingredient in our work is a construction we refer to as shadow, which lets us create a "local description" of long-range constraints imposed by the combinatorial characterizations of MECs.  ( 3 min )
    InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions
    arXiv:2311.12943v2 Announce Type: replace-cross Abstract: In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore such inter-dependency and instead train marginal intent prediction models independent of robot actions. This is because training conditional models is hard given a lack of paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data that is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.  ( 2 min )
    Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective
    arXiv:2312.01957v2 Announce Type: replace-cross Abstract: This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of a LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Only requiring synthetic data, dSC is exercised in experiments regarding safety, sentiment, and privacy control, showing it can be a viable and cheap alternative to align LLMs. Code released at \url{https://github.com/vicgalle/distilled-self-critique}.  ( 2 min )
    Simulation-based inference using surjective sequential neural likelihood estimation
    arXiv:2308.01054v2 Announce Type: replace-cross Abstract: We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel method for simulation-based inference in models where the evaluation of the likelihood function is not tractable and only a simulator that can generate synthetic data is available. SSNL fits a dimensionality-reducing surjective normalizing flow model and uses it as a surrogate likelihood function which allows for conventional Bayesian inference using either Markov chain Monte Carlo methods or variational inference. By embedding the data in a low-dimensional space, SSNL solves several issues previous likelihood-based methods had when applied to high-dimensional data sets that, for instance, contain non-informative data dimensions or lie along a lower-dimensional manifold. We evaluate SSNL on a wide variety of experiments and show that it generally outperforms contemporary methods used in simulation-based inference, for instance, on a challenging real-world example from astrophysics which models the magnetic field strength of the sun using a solar dynamo model.  ( 2 min )
    Convolutional Deep Kernel Machines
    arXiv:2309.09814v2 Announce Type: replace-cross Abstract: Standard infinite-width limits of neural networks sacrifice the ability for intermediate layers to learn representations from data. Recent work (A theory of representation learning gives a deep generalisation of kernel methods, Yang et al. 2023) modified the Neural Network Gaussian Process (NNGP) limit of Bayesian neural networks so that representation learning is retained. Furthermore, they found that applying this modified limit to a deep Gaussian process gives a practical learning algorithm which they dubbed the deep kernel machine (DKM). However, they only considered the simplest possible setting: regression in small, fully connected networks with e.g. 10 input features. Here, we introduce convolutional deep kernel machines. This required us to develop a novel inter-domain inducing point approximation, as well as introducing and experimentally assessing a number of techniques not previously seen in DKMs, including analogues to batch normalisation, different likelihoods, and different types of top-layer. The resulting model trains in roughly 77 GPU hours, achieving around 99% test accuracy on MNIST, 72% on CIFAR-100, and 92.7% on CIFAR-10, which is SOTA for kernel methods.  ( 2 min )
    FedDefender: Backdoor Attack Defense in Federated Learning
    arXiv:2307.08672v2 Announce Type: replace-cross Abstract: Federated Learning (FL) is a privacy-preserving distributed machine learning technique that enables individual clients (e.g., user participants, edge devices, or organizations) to train a model on their local data in a secure environment and then share the trained model with an aggregator to build a global model collaboratively. In this work, we propose FedDefender, a defense mechanism against targeted poisoning attacks in FL by leveraging differential testing. Our proposed method fingerprints the neuron activations of clients' models on the same input and uses differential testing to identify a potentially malicious client containing a backdoor. We evaluate FedDefender using MNIST and FashionMNIST datasets with 20 and 30 clients, and our results demonstrate that FedDefender effectively mitigates such attacks, reducing the attack success rate (ASR) to 10\% without deteriorating the global model performance.  ( 2 min )
    Cluster Algebras: Network Science and Machine Learning
    arXiv:2203.13847v2 Announce Type: replace-cross Abstract: Cluster algebras have recently become an important player in mathematics and physics. In this work, we investigate them through the lens of modern data science, specifically with techniques from network science and machine learning. Network analysis methods are applied to the exchange graphs for cluster algebras of varying mutation types. The analysis indicates that when the graphs are represented without identifying by permutation equivalence between clusters an elegant symmetry emerges in the quiver exchange graph embedding. The ratio between number of seeds and number of quivers associated to this symmetry is computed for finite Dynkin type algebras up to rank 5, and conjectured for higher ranks. Simple machine learning techniques successfully learn to classify cluster algebras using the data of seeds. The learning performance exceeds 0.9 accuracies between algebras of the same mutation type and between types, as well as relative to artificially generated data.  ( 2 min )
    Improving Code Generation by Training with Natural Language Feedback
    arXiv:2303.16749v2 Announce Type: replace-cross Abstract: The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the ground truth distribution and demonstrate a proof-of-concept on a neural program synthesis task. We use ILF to improve a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on code generation tasks.  ( 2 min )
    Position Paper: An Integrated Perspective on Data, Metrics, and Methodology for Deep Time-Series Forecasting
    arXiv:2310.07446v2 Announce Type: replace Abstract: Deep time-series forecasting plays an integral role in numerous practical applications. However, existing research fall short by focusing narrowly on either neural architecture designs for long-term point forecasts or probabilistic models for short-term scenarios. By proposing a comprehensive framework, facilitated by a novel tool, ProbTS, that integrates diverse data scenarios, evaluation metrics, and methodological focuses, we aim to transcend the limitations of current forecasting practices. Rigorous experimentation uncovers pivotal insights, including the supreme importance of aligning forecasting methodologies with the unique characteristics of the data; the necessity of a broad spectrum of metrics for accurately assessing both point and distributional forecasts; and the challenges inherent in adapting existing forecasting methods to a wider range of scenarios. These findings not only challenge conventional approaches but also illuminate promising avenues for future research, suggesting a more nuanced and effective strategy for advancing the field of deep time-series forecasting.  ( 2 min )
    Interventional Causal Representation Learning
    arXiv:2209.11924v4 Announce Type: replace-cross Abstract: Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.  ( 2 min )
    Age-Based Scheduling for Mobile Edge Computing: A Deep Reinforcement Learning Approach
    arXiv:2312.00279v2 Announce Type: replace Abstract: With the rapid development of Mobile Edge Computing (MEC), various real-time applications have been deployed to benefit people's daily lives. The performance of these applications relies heavily on the freshness of collected environmental information, which can be quantified by its Age of Information (AoI). In the traditional definition of AoI, it is assumed that the status information can be actively sampled and directly used. However, for many MEC-enabled applications, the desired status information is updated in an event-driven manner and necessitates data processing. To better serve these applications, we propose a new definition of AoI and, based on the redefined AoI, we formulate an online AoI minimization problem for MEC systems. Notably, the problem can be interpreted as a Markov Decision Process (MDP), thus enabling its solution through Reinforcement Learning (RL) algorithms. Nevertheless, the traditional RL algorithms are designed for MDPs with completely unknown system dynamics and hence usually suffer long convergence times. To accelerate the learning process, we introduce Post-Decision States (PDSs) to exploit the partial knowledge of the system's dynamics. We also combine PDSs with deep RL to further improve the algorithm's applicability, scalability, and robustness. Numerical results demonstrate that our algorithm outperforms the benchmarks under various scenarios.  ( 3 min )
    DMODE: Differential Monocular Object Distance Estimation Module without Class Specific Information
    arXiv:2210.12596v2 Announce Type: replace-cross Abstract: Utilizing a single camera for measuring object distances is a cost-effective alternative to stereo-vision and LiDAR. Although monocular distance estimation has been explored in the literature, most existing techniques rely on object class knowledge to achieve high performance. Without this contextual data, monocular distance estimation becomes more challenging, lacking reference points and object-specific cues. However, these cues can be misleading for objects with wide-range variation or adversarial situations, which is a challenging aspect of object-agnostic distance estimation. In this paper, we propose DMODE, a class-agnostic method for monocular distance estimation that does not require object class knowledge. DMODE estimates an object's distance by fusing its fluctuation in size over time with the camera's motion, making it adaptable to various object detectors and unknown objects, thus addressing these challenges. We evaluate our model on the KITTI MOTS dataset using ground-truth bounding box annotations and outputs from TrackRCNN and EagerMOT. The object's location is determined using the change in bounding box sizes and camera position without measuring the object's detection source or class attributes. Our approach demonstrates superior performance in multi-class object distance detection scenarios compared to conventional methods.  ( 2 min )
    Simultaneous off-the-grid learning of mixtures issued from a continuous dictionary
    arXiv:2210.16311v2 Announce Type: replace-cross Abstract: In this paper we observe a set, possibly a continuum, of signals corrupted by noise. Each signal is a finite mixture of an unknown number of features belonging to a continuous dictionary. The continuous dictionary is parametrized by a real non-linear parameter. We shall assume that the signals share an underlying structure by assuming that each signal has its active features included in a finite and sparse set. We formulate regularized optimization problem to estimate simultaneously the linear coefficients in the mixtures and the non-linear parameters of the features. The optimization problem is composed of a data fidelity term and a $(\ell_1,L^p)$-penalty. We call its solution the Group-Nonlinear-Lasso and provide high probability bounds on the prediction error using certificate functions. Following recent works on the geometry of off-the-grid methods, we show that such functions can be constructed provided the parameters of the active features are pairwise separated by a constant with respect to a Riemannian metric.When the number of signals is finite and the noise is assumed Gaussian, we give refinements of our results for $p=1$ and $p=2$ using tail bounds on suprema of Gaussian and $\chi^2$ random processes. When $p=2$, our prediction error reaches the rates obtained by the Group-Lasso estimator in the multi-task linear regression model. Furthermore, for $p=2$ these prediction rates are faster than for $p=1$ when all signals share most of the non-linear parameters.  ( 3 min )
    Adversarial Examples Detection with Bayesian Neural Network
    arXiv:2105.08620v3 Announce Type: replace-cross Abstract: In this paper, we propose a new framework to detect adversarial examples motivated by the observations that random components can improve the smoothness of predictors and make it easier to simulate the output distribution of a deep neural network. With these observations, we propose a novel Bayesian adversarial example detector, short for BATer, to improve the performance of adversarial example detection. Specifically, we study the distributional difference of hidden layer output between natural and adversarial examples, and propose to use the randomness of the Bayesian neural network to simulate hidden layer output distribution and leverage the distribution dispersion to detect adversarial examples. The advantage of a Bayesian neural network is that the output is stochastic while a deep neural network without random components does not have such characteristics. Empirical results on several benchmark datasets against popular attacks show that the proposed BATer outperforms the state-of-the-art detectors in adversarial example detection.  ( 2 min )
    DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models
    arXiv:2310.00902v2 Announce Type: replace Abstract: Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled.  ( 2 min )
    Adversarial Feature Map Pruning for Backdoor
    arXiv:2307.11565v2 Announce Type: replace Abstract: Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attacks, which are achieved by adding artificial patterns to specific training data. Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subsequently repair the DNN model by adding the trigger into inputs and fine-tuning the model with ground-truth labels. However, once the trigger generated by the attackers is complex and invisible, the defender cannot reproduce the trigger successfully then the DNN model will not be repaired, as the trigger is not effectively removed. In this work, we propose Adversarial Feature Map Pruning for Backdoor (FMP) to mitigate backdoor from the DNN. Unlike existing defense strategies, which focus on reproducing backdoor triggers, FMP attempts to prune backdoor feature maps, which are trained to extract backdoor information from inputs. After pruning these backdoor feature maps, FMP will fine-tune the model with a secure subset of training data. Our experiments demonstrate that, compared to existing defense strategies, FMP can effectively reduce the Attack Success Rate (ASR) even against the most complex and invisible attack triggers (e.g., FMP decreases the ASR to 2.86\% in CIFAR10, which is 19.2\% to 65.41\% lower than baselines). Second, unlike conventional defense methods that tend to exhibit low robust accuracy (that is, the accuracy of the model on poisoned data), FMP achieves a higher RA, indicating its superiority in maintaining model performance while mitigating the effects of backdoor attacks (e.g., FMP obtains 87.40\% RA in CIFAR10). Our code is publicly available at: https://github.com/retsuh-bqw/FMP.  ( 3 min )
    Predicting Properties of Quantum Systems with Conditional Generative Models
    arXiv:2211.16943v2 Announce Type: replace-cross Abstract: Machine learning has emerged recently as a powerful tool for predicting properties of quantum many-body systems. For many ground states of gapped Hamiltonians, generative models can learn from measurements of a single quantum state to reconstruct the state accurately enough to predict local observables. Alternatively, classification and regression models can predict local observables by learning from measurements on different but related states. In this work, we combine the benefits of both approaches and propose the use of conditional generative models to simultaneously represent a family of states, learning shared structures of different quantum states from measurements. The trained model enables us to predict arbitrary local properties of ground states, even for states not included in the training data, without necessitating further training for new observables. We first numerically validate our approach on 2D random Heisenberg models using simulations of up to 45 qubits. Furthermore, we conduct quantum simulations on a neutral-atom quantum computer and demonstrate that our method can accurately predict the quantum phases of square lattices of 13$\times$13 Rydberg atoms.  ( 2 min )
    Neural Causal Abstractions
    arXiv:2401.02602v2 Announce Type: replace Abstract: The abilities of humans to understand the world in terms of cause and effect relationships, as well as to compress information into abstract concepts, are two hallmark features of human intelligence. These two topics have been studied in tandem in the literature under the rubric of causal abstractions theory. In practice, it remains an open problem how to best leverage abstraction theory in real-world causal inference tasks, where the true mechanisms are unknown and only limited data is available. In this paper, we develop a new family of causal abstractions by clustering variables and their domains. This approach refines and generalizes previous notions of abstractions to better accommodate individual causal distributions that are spawned by Pearl's causal hierarchy. We show that such abstractions are learnable in practical settings through Neural Causal Models (Xia et al., 2021), enabling the use of the deep learning toolkit to solve various challenging causal inference tasks -- identification, estimation, sampling -- at different levels of granularity. Finally, we integrate these results with representation learning to create more flexible abstractions, moving these results closer to practical applications. Our experiments support the theory and illustrate how to scale causal inferences to high-dimensional settings involving image data.  ( 2 min )
    On Hypothesis Transfer Learning of Functional Linear Models
    arXiv:2206.04277v4 Announce Type: replace-cross Abstract: We study the transfer learning (TL) for the functional linear regression (FLR) under the Reproducing Kernel Hilbert Space (RKHS) framework, observing the TL techniques in existing high-dimensional linear regression is not compatible with the truncation-based FLR methods as functional data are intrinsically infinite-dimensional and generated by smooth underlying processes. We measure the similarity across tasks using RKHS distance, allowing the type of information being transferred tied to the properties of the imposed RKHS. Building on the hypothesis offset transfer learning paradigm, two algorithms are proposed: one conducts the transfer when positive sources are known, while the other leverages aggregation techniques to achieve robust transfer without prior information about the sources. We establish lower bounds for this learning problem and show the proposed algorithms enjoy a matching asymptotic upper bound. These analyses provide statistical insights into factors that contribute to the dynamics of the transfer. We also extend the results to functional generalized linear models. The effectiveness of the proposed algorithms is demonstrated on extensive synthetic data as well as a financial data application.  ( 2 min )
    Machine unlearning through fine-grained model parameters perturbation
    arXiv:2401.04385v2 Announce Type: replace Abstract: Machine unlearning techniques, which involve retracting data records and reducing influence of said data on trained models, help with the user privacy protection objective but incur significant computational costs. Weight perturbation-based unlearning is a general approach, but it typically involves globally modifying the parameters. We propose fine-grained Top-K and Random-k parameters perturbed inexact machine unlearning strategies that address the privacy needs while keeping the computational costs tractable. In order to demonstrate the efficacy of our strategies we also tackle the challenge of evaluating the effectiveness of machine unlearning by considering the model's generalization performance across both unlearning and remaining data. To better assess the unlearning effect and model generalization, we propose novel metrics, namely, the forgetting rate and memory retention rate. However, for inexact machine unlearning, current metrics are inadequate in quantifying the degree of forgetting that occurs after unlearning strategies are applied. To address this, we introduce SPD-GAN, which subtly perturbs the distribution of data targeted for unlearning. Then, we evaluate the degree of unlearning by measuring the performance difference of the models on the perturbed unlearning data before and after the unlearning process. By implementing these innovative techniques and metrics, we achieve computationally efficacious privacy protection in machine learning applications without significant sacrifice of model performance. Furthermore, this approach provides a novel method for evaluating the degree of unlearning.  ( 3 min )
    Feature Selection with Annealing for Forecasting Financial Time Series
    arXiv:2303.02223v3 Announce Type: replace Abstract: Stock market and cryptocurrency forecasting is very important to investors as they aspire to achieve even the slightest improvement to their buy or hold strategies so that they may increase profitability. However, obtaining accurate and reliable predictions is challenging, noting that accuracy does not equate to reliability, especially when financial time-series forecasting is applied owing to its complex and chaotic tendencies. To mitigate this complexity, this study provides a comprehensive method for forecasting financial time series based on tactical input output feature mapping techniques using machine learning (ML) models. During the prediction process, selecting the relevant indicators is vital to obtaining the desired results. In the financial field, limited attention has been paid to this problem with ML solutions. We investigate the use of feature selection with annealing (FSA) for the first time in this field, and we apply the least absolute shrinkage and selection operator (Lasso) method to select the features from more than 1,000 candidates obtained from 26 technical classifiers with different periods and lags. Boruta (BOR) feature selection, a wrapper method, is used as a baseline for comparison. Logistic regression (LR), extreme gradient boosting (XGBoost), and long short-term memory (LSTM) are then applied to the selected features for forecasting purposes using 10 different financial datasets containing cryptocurrencies and stocks. The dependent variables consisted of daily logarithmic returns and trends. The mean-squared error for regression, area under the receiver operating characteristic curve, and classification accuracy were used to evaluate model performance, and the statistical significance of the forecasting results was tested using paired t-tests. Experiments indicate that the FSA algorithm increased the performance of ML models, regardless of problem type.  ( 3 min )
    AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
    arXiv:2402.15506v1 Announce Type: cross Abstract: Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce \textbf{AgentOhana} as a comprehensive solution to address these challenges. \textit{AgentOhana} aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present \textbf{xLAM-v0.1}, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks.  ( 2 min )
    BeGin: Extensive Benchmark Scenarios and An Easy-to-use Framework for Graph Continual Learning
    arXiv:2211.14568v3 Announce Type: replace Abstract: Continual Learning (CL) is the process of learning ceaselessly a sequence of tasks. Most existing CL methods deal with independent data (e.g., images and text) for which many benchmark frameworks and results under standard experimental settings are available. Compared to them, however, CL methods for graph data (graph CL) are relatively underexplored because of (a) the lack of standard experimental settings, especially regarding how to deal with the dependency between instances, (b) the lack of benchmark datasets and scenarios, and (c) high complexity in implementation and evaluation due to the dependency. In this paper, regarding (a) we define four standard incremental settings (task-, class-, domain-, and time-incremental) for node-, link-, and graph-level problems, extending the previously explored scope. Regarding (b), we provide 31 benchmark scenarios based on 20 real-world graphs. Regarding (c), we develop BeGin, an easy and fool-proof framework for graph CL. BeGin is easily extended since it is modularized with reusable modules for data processing, algorithm design, and evaluation. Especially, the evaluation module is completely separated from user code to eliminate potential mistakes. Regarding benchmark results, we cover 3X more combinations of incremental settings and levels of problems than the latest benchmark. All assets for the benchmark framework are publicly available at https://github.com/ShinhwanKang/BeGin.  ( 3 min )
    How to Evaluate Behavioral Models
    arXiv:2306.04778v2 Announce Type: replace Abstract: Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub "diagonal bounded Bregman divergences", that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models.  ( 2 min )
    Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales
    arXiv:2402.15430v1 Announce Type: cross Abstract: Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. In this regard, a promising paradigm considers embedding task-required invariant structures, e.g., geometric invariance, in the fundamental image representation. However, such invariant representations typically exhibit limited discriminability, limiting their applications in larger-scale trustworthy vision tasks. For this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct over-complete invariants with a Convolutional Neural Networks (CNN)-like hierarchical architecture yet in a fully interpretable manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this theoretical framework into a given task. With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on texture, digit, and parasite classification experiments. Furthermore, at the application level, our representations are explored in real-world forensics tasks on adversarial perturbations and Artificial Intelligence Generated Content (AIGC). Such applications reveal that the proposed strategy not only realizes the theoretically promised invariance, but also exhibits competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representation can be considered as an effective alternative to traditional CNN and invariants.  ( 3 min )
    Rayleigh Quotient Graph Neural Networks for Graph-level Anomaly Detection
    arXiv:2310.02861v3 Announce Type: replace Abstract: Graph-level anomaly detection has gained significant attention as it finds applications in various domains, such as cancer diagnosis and enzyme prediction. However, existing methods fail to capture the spectral properties of graph anomalies, resulting in unexplainable framework design and unsatisfying performance. In this paper, we re-investigate the spectral differences between anomalous and normal graphs. Our main observation shows a significant disparity in the accumulated spectral energy between these two classes. Moreover, we prove that the accumulated spectral energy of the graph signal can be represented by its Rayleigh Quotient, indicating that the Rayleigh Quotient is a driving factor behind the anomalous properties of graphs. Motivated by this, we propose Rayleigh Quotient Graph Neural Network (RQGNN), the first spectral GNN that explores the inherent spectral features of anomalous graphs for graph-level anomaly detection. Specifically, we introduce a novel framework with two components: the Rayleigh Quotient learning component (RQL) and Chebyshev Wavelet GNN with RQ-pooling (CWGNN-RQ). RQL explicitly captures the Rayleigh Quotient of graphs and CWGNN-RQ implicitly explores the spectral space of graphs. Extensive experiments on 10 real-world datasets show that RQGNN outperforms the best rival by 6.74% in Macro-F1 score and 1.44% in AUC, demonstrating the effectiveness of our framework. Our code is available at https://github.com/xydong127/RQGNN.  ( 3 min )
    Learning Action Embeddings for Off-Policy Evaluation
    arXiv:2305.03954v2 Announce Type: replace Abstract: Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies, and reduces the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.  ( 3 min )
    Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist
    arXiv:2311.02107v2 Announce Type: replace Abstract: The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (GenAI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare, but ethical discussions are yet to translate into operationalisable solutions. Furthermore, ongoing ethical discussions often neglect other types of GenAI that have been used to synthesise data (e.g., images) for research and practical purposes, which resolved some ethical issues and exposed others. We conduct a scoping review of ethical discussions on GenAI in healthcare to comprehensively analyse gaps in the current research, and further propose to reduce the gaps by developing a checklist for comprehensive assessment and transparent documentation of ethical discussions in GenAI research. The checklist can be readily integrated into the current peer review and publication system to enhance GenAI research, and may be used for ethics-related disclosures for GenAI-powered products, healthcare applications of such products and beyond.  ( 2 min )
    Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs
    arXiv:2308.10779v2 Announce Type: replace Abstract: Real-world graphs are dynamic, constantly evolving with new interactions, such as financial transactions in financial networks. Temporal Graph Neural Networks (TGNNs) have been developed to effectively capture the evolving patterns in dynamic graphs. While these models have demonstrated their superiority, being widely adopted in various important fields, their vulnerabilities against adversarial attacks remain largely unexplored. In this paper, we propose T-SPEAR, a simple and effective adversarial attack method for link prediction on continuous-time dynamic graphs, focusing on investigating the vulnerabilities of TGNNs. Specifically, before the training procedure of a victim model, which is a TGNN for link prediction, we inject edge perturbations to the data that are unnoticeable in terms of the four constraints we propose, and yet effective enough to cause malfunction of the victim model. Moreover, we propose a robust training approach T-SHIELD to mitigate the impact of adversarial attacks. By using edge filtering and enforcing temporal smoothness to node embeddings, we enhance the robustness of the victim model. Our experimental study shows that T-SPEAR significantly degrades the victim model's performance on link prediction tasks, and even more, our attacks are transferable to other TGNNs, which differ from the victim model assumed by the attacker. Moreover, we demonstrate that T-SHIELD effectively filters out adversarial edges and exhibits robustness against adversarial attacks, surpassing the link prediction performance of the naive TGNN by up to 11.2% under T-SPEAR.  ( 3 min )
    Variance-Covariance Regularization Improves Representation Learning
    arXiv:2306.13292v2 Announce Type: replace Abstract: Transfer learning plays a key role in advancing machine learning models, yet conventional supervised pretraining often undermines feature transferability by prioritizing features that minimize the pretraining loss. In this work, we adapt a self-supervised learning regularization technique from the VICReg method to supervised learning contexts, introducing Variance-Covariance Regularization (VCReg). This adaptation encourages the network to learn high-variance, low-covariance representations, promoting learning more diverse features. We outline best practices for an efficient implementation of our framework, including applying it to the intermediate representations. Through extensive empirical evaluation, we demonstrate that our method significantly enhances transfer learning for images and videos, achieving state-of-the-art performance across numerous tasks and datasets. VCReg also improves performance in scenarios like long-tail learning and hierarchical classification. Additionally, we show its effectiveness may stem from its success in addressing challenges like gradient starvation and neural collapse. In summary, VCReg offers a universally applicable regularization framework that significantly advances transfer learning and highlights the connection between gradient starvation, neural collapse, and feature transferability.  ( 2 min )
    Learning Decentralized Partially Observable Mean Field Control for Artificial Collective Behavior
    arXiv:2307.06175v2 Announce Type: replace Abstract: Recent reinforcement learning (RL) methods have achieved success in various domains. However, multi-agent RL (MARL) remains a challenge in terms of decentralization, partial observability and scalability to many agents. Meanwhile, collective behavior requires resolution of the aforementioned challenges, and remains of importance to many state-of-the-art applications such as active matter physics, self-organizing systems, opinion dynamics, and biological or robotic swarms. Here, MARL via mean field control (MFC) offers a potential solution to scalability, but fails to consider decentralized and partially observable systems. In this paper, we enable decentralized behavior of agents under partial information by proposing novel models for decentralized partially observable MFC (Dec-POMFC), a broad class of problems with permutation-invariant agents allowing for reduction to tractable single-agent Markov decision processes (MDP) with single-agent RL solution. We provide rigorous theoretical results, including a dynamic programming principle, together with optimality guarantees for Dec-POMFC solutions applied to finite swarms of interest. Algorithmically, we propose Dec-POMFC-based policy gradient methods for MARL via centralized training and decentralized execution, together with policy gradient approximation guarantees. In addition, we improve upon state-of-the-art histogram-based MFC by kernel methods, which is of separate interest also for fully observable MFC. We evaluate numerically on representative collective behavior tasks such as adapted Kuramoto and Vicsek swarming models, being on par with state-of-the-art MARL. Overall, our framework takes a step towards RL-based engineering of artificial collective behavior via MFC.  ( 3 min )
    SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
    arXiv:2304.00737v2 Announce Type: replace Abstract: Top-k sparsification has recently been widely used to reduce the communication volume in distributed deep learning. However, due to the Sparse Gradient Accumulation (SGA) dilemma, the performance of top-k sparsification still has limitations. Recently, a few methods have been put forward to handle the SGA dilemma. Regrettably, even the state-of-the-art method suffers from several drawbacks, e.g., it relies on an inefficient communication algorithm and requires extra transmission steps. Motivated by the limitations of existing methods, we propose a novel efficient sparse communication framework, called SparDL. Specifically, SparDL uses the Spar-Reduce-Scatter algorithm, which is based on an efficient Reduce-Scatter model, to handle the SGA dilemma without additional communication operations. Besides, to further reduce the latency cost and improve the efficiency of SparDL, we propose the Spar-All-Gather algorithm. Moreover, we propose the global residual collection algorithm to ensure fast convergence of model training. Finally, extensive experiments are conducted to validate the superiority of SparDL.  ( 2 min )
    Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalisation
    arXiv:2303.05161v2 Announce Type: replace Abstract: To achieve near-zero training error in a classification problem, the layers of a feed-forward network have to disentangle the manifolds of data points with different labels, to facilitate the discrimination. However, excessive class separation can bring to overfitting since good generalisation requires learning invariant features, which involve some level of entanglement. We report on numerical experiments showing how the optimisation dynamics finds representations that balance these opposing tendencies with a non-monotonic trend. After a fast segregation phase, a slower rearrangement (conserved across data sets and architectures) increases the class entanglement.The training error at the inversion is stable under subsampling, and across network initialisations and optimisers, which characterises it as a property solely of the data structure and (very weakly) of the architecture. The inversion is the manifestation of tradeoffs elicited by well-defined and maximally stable elements of the training set, coined ``stragglers'', particularly influential for generalisation.  ( 2 min )
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data
    arXiv:2302.00834v2 Announce Type: replace Abstract: We study the interpolation power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints. Finally, as an application we give a lower bound on the approximation rates that deep ReLU neural networks can achieve for Sobolev spaces at the embedding endpoint.  ( 2 min )
    Zero-shot causal learning
    arXiv:2301.12292v4 Announce Type: replace Abstract: Predicting how different interventions will causally affect a specific individual is important in a variety of domains such as personalized medicine, public policy, and online marketing. There are a large number of methods to predict the effect of an existing intervention based on historical data from individuals who received it. However, in many settings it is important to predict the effects of novel interventions (e.g., a newly invented drug), which these methods do not address. Here, we consider zero-shot causal learning: predicting the personalized effects of a novel intervention. We propose CaML, a causal meta-learning framework which formulates the personalized prediction of each intervention's effect as a task. CaML trains a single meta-model across thousands of tasks, each constructed by sampling an intervention, its recipients, and its nonrecipients. By leveraging both intervention information (e.g., a drug's attributes) and individual features~(e.g., a patient's history), CaML is able to predict the personalized effects of novel interventions that do not exist at the time of training. Experimental results on real world datasets in large-scale medical claims and cell-line perturbations demonstrate the effectiveness of our approach. Most strikingly, \method's zero-shot predictions outperform even strong baselines trained directly on data from the test interventions.  ( 3 min )
    Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization
    arXiv:2402.15473v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominating strategy in steering Language Models (LMs) towards human values/goals. The key to the strategy is employing a reward model ({$\varphi$}) which can reflect a latent reward model with humans. While this strategy has proven to be effective, the training methodology requires a lot of human preference annotation (usually of the order of tens of thousands) to train {$\varphi$}. Such large-scale preference annotations can be achievable if the reward model can be ubiquitously used. However, human values/goals are subjective and depend on the nature of the task. This poses a challenge in collecting diverse preferences for downstream applications. To address this, we propose a novel methodology to infuse domain knowledge into {$\varphi$}, which reduces the size of preference annotation required. We validate our approach in E-Commerce Opinion Summarization, with a significant reduction in dataset size (just $940$ samples) while advancing the state-of-the-art. Our contributions include a novel Reward Modelling technique, a new dataset (PromptOpinSumm) for Opinion Summarization, and a human preference dataset (OpinPref). The proposed methodology opens avenues for efficient RLHF, making it more adaptable to diverse applications with varying human values. We release the artifacts for usage under MIT License.  ( 3 min )
    Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
    arXiv:2210.11327v2 Announce Type: replace Abstract: Real world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might have different contribution towards learning. This motivates studies to better understanding of the role of data instances with respect to their contribution in good metrics in models. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which the use of Decision Trees ensembles are still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics and a robust boosting algorithm. We show results on detecting noisy labels in order clean datasets, improving models' metrics in synthetic and real public datasets, as well as on a industry case in which we deployed a model based on the proposed solution.  ( 3 min )
    Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture Models
    arXiv:2402.15432v1 Announce Type: cross Abstract: Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, notably emphasizing location-scale mixtures featuring Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. In such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm employing a Bregman divergence, is rate optimal.  ( 2 min )
    Experimental Design for Multi-Channel Imaging via Task-Driven Feature Selection
    arXiv:2210.06891v3 Announce Type: replace Abstract: This paper presents a data-driven, task-specific paradigm for experimental design, to shorten acquisition time, reduce costs, and accelerate the deployment of imaging devices. Current approaches in experimental design focus on model-parameter estimation and require specification of a particular model, whereas in imaging, other tasks may drive the design. Furthermore, such approaches often lead to intractable optimization problems in real-world imaging applications. Here we present a new paradigm for experimental design that simultaneously optimizes the design (set of image channels) and trains a machine-learning model to execute a user-specified image-analysis task. The approach obtains data densely-sampled over the measurement space (many image channels) for a small number of acquisitions, then identifies a subset of channels of prespecified size that best supports the task. We propose a method: TADRED for TAsk-DRiven Experimental Design in imaging, to identify the most informative channel-subset whilst simultaneously training a network to execute the task given the subset. Experiments demonstrate the potential of TADRED in diverse imaging applications: several clinically-relevant tasks in magnetic resonance imaging; and remote sensing and physiological applications of hyperspectral imaging. Results show substantial improvement over classical experimental design, two recent application-specific methods within the new paradigm, and state-of-the-art approaches in supervised feature selection. We anticipate further applications of our approach. Code is available: https://github.com/sbb-gh/experimental-design-multichannel  ( 3 min )
    Causal Discovery from Conditionally Stationary Time Series
    arXiv:2110.06257v2 Announce Type: replace Abstract: Causal discovery, i.e., inferring underlying causal relationships from observational data, has been shown to be highly challenging for AI systems. In time series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time-series. We develop a causal discovery approach to handle a wide class of non-stationary time-series that are conditionally stationary, where the non-stationary behaviour is modeled as stationarity conditioned on a set of (possibly hidden) state variables. Named State-Dependent Causal Inference (SDCI), our approach is able to recover the underlying causal dependencies, provably with fully-observed states and empirically with hidden states. The latter is confirmed by experiments on synthetic linear system and nonlinear particle interaction data, where SDCI achieves superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.  ( 2 min )
    Repetition Improves Language Model Embeddings
    arXiv:2402.15449v1 Announce Type: cross Abstract: Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data.  ( 2 min )
    RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation
    arXiv:2402.15487v1 Announce Type: cross Abstract: Robots need to explore their surroundings to adapt to and tackle tasks in unknown environments. Prior work has proposed building scene graphs of the environment but typically assumes that the environment is static, omitting regions that require active interactions. This severely limits their ability to handle more complex tasks in household and office environments: before setting up a table, robots must explore drawers and cabinets to locate all utensils and condiments. In this work, we introduce the novel task of interactive scene exploration, wherein robots autonomously explore environments and produce an action-conditioned scene graph (ACSG) that captures the structure of the underlying environment. The ACSG accounts for both low-level information, such as geometry and semantics, and high-level information, such as the action-conditioned relationships between different entities in the scene. To this end, we present the Robotic Exploration (RoboEXP) system, which incorporates the Large Multimodal Model (LMM) and an explicit memory design to enhance our system's capabilities. The robot reasons about what and how to explore an object, accumulating new information through the interaction process and incrementally constructing the ACSG. We apply our system across various real-world settings in a zero-shot manner, demonstrating its effectiveness in exploring and modeling environments it has never seen before. Leveraging the constructed ACSG, we illustrate the effectiveness and efficiency of our RoboEXP system in facilitating a wide range of real-world manipulation tasks involving rigid, articulated objects, nested objects like Matryoshka dolls, and deformable objects like cloth.  ( 3 min )
    FP8 Quantization: The Power of the Exponent
    arXiv:2208.09225v2 Announce Type: replace Abstract: When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper in-depth investigates this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. Then we show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and a new algorithm that enables the learning of both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training where the difference in formats disappears as the network is trained to reduce the effect of outliers.  ( 2 min )
    Mixup Barcodes: Quantifying Geometric-Topological Interactions between Point Clouds
    arXiv:2402.15058v1 Announce Type: cross Abstract: We combine standard persistent homology with image persistent homology to define a novel way of characterizing shapes and interactions between them. In particular, we introduce: (1) a mixup barcode, which captures geometric-topological interactions (mixup) between two point sets in arbitrary dimension; (2) simple summary statistics, total mixup and total percentage mixup, which quantify the complexity of the interactions as a single number; (3) a software tool for playing with the above. As a proof of concept, we apply this tool to a problem arising from machine learning. In particular, we study the disentanglement in embeddings of different classes. The results suggest that topological mixup is a useful method for characterizing interactions for low and high-dimensional data. Compared to the typical usage of persistent homology, the new tool is sensitive to the geometric locations of the topological features, which is often desirable.  ( 2 min )
    Fiducial Focus Augmentation for Facial Landmark Detection
    arXiv:2402.15044v1 Announce Type: cross Abstract: Deep learning methods have led to significant improvements in the performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continue to remain a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.  ( 2 min )
    Nonlinear Bayesian optimal experimental design using logarithmic Sobolev inequalities
    arXiv:2402.15053v1 Announce Type: cross Abstract: We study the problem of selecting $k$ experiments from a larger candidate pool, where the goal is to maximize mutual information (MI) between the selected subset and the underlying parameters. Finding the exact solution is to this combinatorial optimization problem is computationally costly, not only due to the complexity of the combinatorial search but also the difficulty of evaluating MI in nonlinear/non-Gaussian settings. We propose greedy approaches based on new computationally inexpensive lower bounds for MI, constructed via log-Sobolev inequalities. We demonstrate that our method outperforms random selection strategies, Gaussian approximations, and nested Monte Carlo (NMC) estimators of MI in various settings, including optimal design for nonlinear models with non-additive noise.  ( 2 min )
    KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
    arXiv:2402.15043v1 Announce Type: cross Abstract: Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Starting with a question in a conventional LLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically generated, multi-round, and knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates a deep comprehension to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination brings no contribution or even negative effect to models' real-world applicability and understanding, and existing contamination detection methods for LLMs can only identify contamination in pre-training but not during supervised fine-tuning.  ( 2 min )
    Dynamics-Guided Diffusion Model for Robot Manipulator Design
    arXiv:2402.15038v1 Announce Type: cross Abstract: We present Dynamics-Guided Diffusion Model, a data-driven framework for generating manipulator geometry designs for a given manipulation task. Instead of training different design models for each task, our approach employs a learned dynamics network shared across tasks. For a new manipulation task, we first decompose it into a collection of individual motion targets which we call target interaction profile, where each individual motion can be modeled by the shared dynamics network. The design objective constructed from the target and predicted interaction profiles provides a gradient to guide the refinement of finger geometry for the task. This refinement process is executed as a classifier-guided diffusion process, where the design objective acts as the classifier guidance. We evaluate our framework on various manipulation tasks, under the sensor-less setting using only an open-loop parallel jaw motion. Our generated designs outperform optimization-based and unguided diffusion baselines relatively by 31.5% and 45.3% on average manipulation success rate. With the ability to generate a design within 0.8 seconds, our framework could facilitate rapid design iteration and enhance the adoption of data-driven approaches for robotic mechanism design.  ( 2 min )
    How Important Is Tokenization in French Medical Masked Language Models?
    arXiv:2402.15010v1 Announce Type: cross Abstract: Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we seek to delve into the complexities of subword tokenization in French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.  ( 2 min )
    Unintended Impacts of LLM Alignment on Global Representation
    arXiv:2402.15018v1 Announce Type: cross Abstract: Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning.  ( 2 min )
    Divide-or-Conquer? Which Part Should You Distill Your LLM?
    arXiv:2402.15000v1 Announce Type: cross Abstract: Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper we devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. Further, we hypothesize that the decomposition should be easier to distill into a smaller model compared to the problem solving because the latter requires large amounts of domain knowledge while the former only requires learning general problem solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost. We find that we can distill the problem decomposition phase and at the same time achieve good generalization across tasks, datasets, and models. However, it is harder to distill the problem solving capability without losing performance and the resulting distilled model struggles with generalization. These results indicate that by using smaller, distilled problem decomposition models in combination with problem solving LLMs we can achieve reasoning with cost-efficient inference and local adaptation.  ( 2 min )
    On the Performance of Empirical Risk Minimization with Smoothed Data
    arXiv:2402.14987v1 Announce Type: cross Abstract: In order to circumvent statistical and computational hardness results in sequential decision-making, recent work has considered smoothed online learning, where the distribution of data at each time is assumed to have bounded likeliehood ratio with respect to a base measure when conditioned on the history. While previous works have demonstrated the benefits of smoothness, they have either assumed that the base measure is known to the learner or have presented computationally inefficient algorithms applying only in special cases. This work investigates the more general setting where the base measure is \emph{unknown} to the learner, focusing in particular on the performance of Empirical Risk Minimization (ERM) with square loss when the data are well-specified and smooth. We show that in this setting, ERM is able to achieve sublinear error whenever a class is learnable with iid data; in particular, ERM achieves error scaling as $\tilde O( \sqrt{\mathrm{comp}(\mathcal F)\cdot T} )$, where $\mathrm{comp}(\mathcal F)$ is the statistical complexity of learning $\mathcal F$ with iid data. In so doing, we prove a novel norm comparison bound for smoothed data that comprises the first sharp norm comparison for dependent data applying to arbitrary, nonlinear function classes. We complement these results with a lower bound indicating that our analysis of ERM is essentially tight, establishing a separation in the performance of ERM between smoothed and iid data.  ( 2 min )
    tinyBenchmarks: evaluating LLMs with fewer examples
    arXiv:2402.14992v1 Announce Type: cross Abstract: The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.  ( 2 min )
    Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data
    arXiv:2402.14980v1 Announce Type: cross Abstract: Rapid advancements in genome sequencing have led to the collection of vast amounts of genomics data. Researchers may be interested in using machine learning models on such data to predict the pathogenicity or clinical significance of a genetic mutation. However, many genetic datasets contain imbalanced target variables that pose challenges to machine learning models: observations are skewed/imbalanced in regression tasks or class-imbalanced in classification tasks. Genetic datasets are also often high-cardinal and contain skewed predictor variables, which poses further challenges. We aimed to investigate the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on these datasets. We measured performance with 5-fold cross-validation and compared averaged r-squared and accuracy metrics across different combinations of techniques. We found that outliers/skew in predictor or target variables did not pose a challenge to regression models. We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance. Random forest was the best model to use for imbalanced regression tasks. While our study uses a genetic dataset as an example of a real-world application, our findings can be generalized to any similar datasets.  ( 3 min )
    Human Brain Exhibits Distinct Patterns When Listening to Fake Versus Real Audio: Preliminary Evidence
    arXiv:2402.14982v1 Announce Type: cross Abstract: In this paper we study the variations in human brain activity when listening to real and fake audio. Our preliminary results suggest that the representations learned by a state-of-the-art deepfake audio detection algorithm, do not exhibit clear distinct patterns between real and fake audio. In contrast, human brain activity, as measured by EEG, displays distinct patterns when individuals are exposed to fake versus real audio. This preliminary evidence enables future research directions in areas such as deepfake audio detection.  ( 2 min )
    Unsupervised Domain Adaptation within Deep Foundation Latent Spaces
    arXiv:2402.14976v1 Announce Type: cross Abstract: The vision transformer-based foundation models, such as ViT or Dino-V2, are aimed at solving problems with little or no finetuning of features. Using a setting of prototypical networks, we analyse to what extent such foundation models can solve unsupervised domain adaptation without finetuning over the source or target domain. Through quantitative analysis, as well as qualitative interpretations of decision making, we demonstrate that the suggested method can improve upon existing baselines, as well as showcase the limitations of such approach yet to be solved.  ( 2 min )
    Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models
    arXiv:2402.14977v1 Announce Type: cross Abstract: Foundation model has become the backbone of the AI ecosystem. In particular, a foundation model can be used as a general-purpose feature extractor to build various downstream classifiers. However, foundation models are vulnerable to backdoor attacks and a backdoored foundation model is a single-point-of-failure of the AI ecosystem, e.g., multiple downstream classifiers inherit the backdoor vulnerabilities simultaneously. In this work, we propose Mudjacking, the first method to patch foundation models to remove backdoors. Specifically, given a misclassified trigger-embedded input detected after a backdoored foundation model is deployed, Mudjacking adjusts the parameters of the foundation model to remove the backdoor. We formulate patching a foundation model as an optimization problem and propose a gradient descent based method to solve it. We evaluate Mudjacking on both vision and language foundation models, eleven benchmark datasets, five existing backdoor attacks, and thirteen adaptive backdoor attacks. Our results show that Mudjacking can remove backdoor from a foundation model while maintaining its utility.  ( 2 min )
    GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data
    arXiv:2402.14973v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are commonly evaluated using costly annotated multimodal benchmarks. However, these benchmarks often struggle to keep pace with the rapidly advancing requirements of MLLM evaluation. We propose GenCeption, a novel and annotation-free MLLM evaluation framework that merely requires unimodal data to assess inter-modality semantic coherence and inversely reflects the models' inclination to hallucinate. Analogous to the popular DrawCeption game, GenCeption initiates with a non-textual sample and undergoes a series of iterative description and generation steps. Semantic drift across iterations is quantified using the GC@T metric. Our empirical findings validate GenCeption's efficacy, showing strong correlations with popular MLLM benchmarking results. GenCeption may be extended to mitigate training data contamination by utilizing ubiquitous, previously unseen unimodal data.  ( 2 min )
    Towards Spatially-Lucid AI Classification in Non-Euclidean Space: An Application for MxIF Oncology Data
    arXiv:2402.14974v1 Announce Type: cross Abstract: Given multi-category point sets from different place-types, our goal is to develop a spatially-lucid classifier that can distinguish between two classes based on the arrangements of their points. This problem is important for many applications, such as oncology, for analyzing immune-tumor relationships and designing new immunotherapies. It is challenging due to spatial variability and interpretability needs. Previously proposed techniques require dense training data or have limited ability to handle significant spatial variability within a single place-type. Most importantly, these deep neural network (DNN) approaches are not designed to work in non-Euclidean space, particularly point sets. Existing non-Euclidean DNN methods are limited to one-size-fits-all approaches. We explore a spatial ensemble framework that explicitly uses different training strategies, including weighted-distance learning rate and spatial domain adaptation, on various place-types for spatially-lucid classification. Experimental results on real-world datasets (e.g., MxIF oncology data) show that the proposed framework provides higher prediction accuracy than baseline methods.  ( 2 min )
    Reinforcement Learning with Elastic Time Steps
    arXiv:2402.14961v1 Announce Type: cross Abstract: Traditional Reinforcement Learning (RL) algorithms are usually applied in robotics to learn controllers that act with a fixed control rate. Given the discrete nature of RL algorithms, they are oblivious to the effects of the choice of control rate: finding the correct control rate can be difficult and mistakes often result in excessive use of computing resources or even lack of convergence. We propose Soft Elastic Actor-Critic (SEAC), a novel off-policy actor-critic algorithm to address this issue. SEAC implements elastic time steps, time steps with a known, variable duration, which allow the agent to change its control frequency to adapt to the situation. In practice, SEAC applies control only when necessary, minimizing computational resources and data usage. We evaluate SEAC's capabilities in simulation in a Newtonian kinematics maze navigation task and on a 3D racing video game, Trackmania. SEAC outperforms the SAC baseline in terms of energy efficiency and overall time management, and most importantly without the need to identify a control frequency for the learned controller. SEAC demonstrated faster and more stable training speeds than SAC, especially at control rates where SAC struggled to converge. We also compared SEAC with a similar approach, the Continuous-Time Continuous-Options (CTCO) model, and SEAC resulted in better task performance. These findings highlight the potential of SEAC for practical, real-world RL applications in robotics.  ( 2 min )
    Smoothness Adaptive Hypothesis Transfer Learning
    arXiv:2402.14966v1 Announce Type: cross Abstract: Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset in practice. In this paper, we address these problems by proposing Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression(KRR)-based algorithm. We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality and derive an adaptive procedure to the unknown Sobolev smoothness. Leveraging these results, SATL employs Gaussian kernels in both phases so that the estimators can adapt to the unknown smoothness of the target/source and their offset function. We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor. The minimax convergence rate sheds light on the factors influencing transfer dynamics and demonstrates the superiority of SATL compared to non-transfer learning settings. While our main objective is a theoretical analysis, we also conduct several experiments to confirm our results.  ( 2 min )
    In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization
    arXiv:2402.14951v1 Announce Type: cross Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$), in the sense that every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.  ( 2 min )
    The Common Stability Mechanism behind most Self-Supervised Learning Approaches
    arXiv:2402.14957v1 Announce Type: cross Abstract: Last couple of years have witnessed a tremendous progress in self-supervised learning (SSL), the success of which can be attributed to the introduction of useful inductive biases in the learning process to learn meaningful visual representations while avoiding collapse. These inductive biases and constraints manifest themselves in the form of different optimization formulations in the SSL techniques, e.g. by utilizing negative examples in a contrastive formulation, or exponential moving average and predictor in BYOL and SimSiam. In this paper, we provide a framework to explain the stability mechanism of these different SSL techniques: i) we discuss the working mechanism of contrastive techniques like SimCLR, non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO; ii) we provide an argument that despite different formulations these methods implicitly optimize a similar objective function, i.e. minimizing the magnitude of the expected representation over all data samples, or the mean of the data distribution, while maximizing the magnitude of the expected representation of individual samples over different data augmentations; iii) we provide mathematical and empirical evidence to support our framework. We formulate different hypotheses and test them using the Imagenet100 dataset.  ( 2 min )
    Learning Inverse Kinodynamics for Autonomous Vehicle Drifting
    arXiv:2402.14928v1 Announce Type: cross Abstract: In this work, we explore a data-driven learning-based approach to learning the kinodynamic model of a small autonomous vehicle, and observe the effect it has on motion planning, specifically autonomous drifting. When executing a motion plan in the real world, there are numerous causes for error, and what is planned is often not what is executed on the actual car. Learning a kinodynamic planner based off of inertial measurements and executed commands can help us learn the world state. In our case, we look towards the realm of drifting; it is a complex maneuver that requires a smooth enough surface, high enough speed, and a drastic change in velocity. We attempt to learn the kinodynamic model for these drifting maneuvers, and attempt to tighten the slip of the car. Our approach is able to learn a kinodynamic model for high-speed circular navigation, and is able to avoid obstacles on an autonomous drift at high speed by correcting an executed curvature for loose drifts. We seek to adjust our kinodynamic model for success in tighter drifts in future work.  ( 2 min )
    Re-Examine Distantly Supervised NER: A New Benchmark and a Simple Approach
    arXiv:2402.14948v1 Announce Type: cross Abstract: This paper delves into Named Entity Recognition (NER) under the framework of Distant Supervision (DS-NER), where the main challenge lies in the compromised quality of labels due to inherent errors such as false positives, false negatives, and positive type errors. We critically assess the efficacy of current DS-NER methodologies using a real-world benchmark dataset named QTL, revealing that their performance often does not meet expectations. To tackle the prevalent issue of label noise, we introduce a simple yet effective approach, Curriculum-based Positive-Unlabeled Learning CuPUL, which strategically starts on "easy" and cleaner samples during the training process to enhance model resilience to noisy samples. Our empirical results highlight the capability of CuPUL to significantly reduce the impact of noisy labels and outperform existing methods.  ( 2 min )
    Watermarking Makes Language Models Radioactive
    arXiv:2402.14904v1 Announce Type: cross Abstract: This paper investigates the radioactivity of LLM-generated texts, i.e. whether it is possible to detect that such input was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces easier to detect and much more reliable than membership inference. We link the contamination level to the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, gives the ability to easily identify if the outputs of a watermarked LLM were used to fine-tune another LLM.  ( 2 min )
    Efficient Unbiased Sparsification
    arXiv:2402.14925v1 Announce Type: cross Abstract: An unbiased $m$-sparsification of a vector $p\in \mathbb{R}^n$ is a random vector $Q\in \mathbb{R}^n$ with mean $p$ that has at most $m<n$ nonzero coordinates. Unbiased sparsification compresses the original vector without introducing bias; it arises in various contexts, such as in federated learning and sampling sparse probability distributions. Ideally, unbiased sparsification should also minimize the expected value of a divergence function $\mathsf{Div}(Q,p)$ that measures how far away $Q$ is from the original $p$. If $Q$ is optimal in this sense, then we call it efficient. Our main results describe efficient unbiased sparsifications for divergences that are either permutation-invariant or additively separable. Surprisingly, the characterization for permutation-invariant divergences is robust to the choice of divergence function, in the sense that our class of optimal $Q$ for squared Euclidean distance coincides with our class of optimal $Q$ for Kullback-Leibler divergence, or indeed any of a wide variety of divergences.  ( 2 min )
    Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    arXiv:2402.14899v1 Announce Type: cross Abstract: Recently, Multimodal LLMs (MLLMs) have shown a great ability to understand images. However, like traditional vision models, they are still vulnerable to adversarial images. Meanwhile, Chain-of-Thought (CoT) reasoning has been widely explored on MLLMs, which not only improves model's performance, but also enhances model's explainability by giving intermediate reasoning steps. Nevertheless, there is still a lack of study regarding MLLMs' adversarial robustness with CoT and an understanding of what the rationale looks like when MLLMs infer wrong answers with adversarial images. Our research evaluates the adversarial robustness of MLLMs when employing CoT reasoning, finding that CoT marginally improves adversarial robustness against existing attack methods. Moreover, we introduce a novel stop-reasoning attack technique that effectively bypasses the CoT-induced robustness enhancements. Finally, we demonstrate the alterations in CoT reasoning when MLLMs confront adversarial images, shedding light on their reasoning process under adversarial attacks.  ( 2 min )
    Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
    arXiv:2402.14903v1 Announce Type: cross Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and GPT-4 have separate tokens for each 1-, 2-, and 3-digit numbers. In this work, we study the effect this choice has on numerical reasoning through the use of arithmetic tasks. We consider left-to-right and right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left tokenization (enforced by comma separating numbers at inference time) leads to largely improved performance. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. We show that the model is able to convert between tokenizations easily, thus allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find the gap between tokenization directions decreases when models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number tokenization-related choices when working towards general models of numerical reasoning.  ( 3 min )
    Data Augmentation is Dead, Long Live Data Augmentation
    arXiv:2402.14895v1 Announce Type: cross Abstract: Textual data augmentation (DA) is a prolific field of study where novel techniques to create artificial data are regularly proposed, and that has demonstrated great efficiency on small data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution as it answers several questions that were left open in recent years, namely~: which DA technique performs best (all of them as long as they generate data close enough to the training set as to not impair training) and why did DA show positive results (facilitates training of network). We furthermore show that zero and few-shot data generation via conversational agents such as ChatGPT or LLama2 can increase performances, concluding that this form of data augmentation does still work, even if classical methods do not.  ( 2 min )
    Chain-of-Thought Unfaithfulness as Disguised Accuracy
    arXiv:2402.14897v1 Announce Type: cross Abstract: Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, arXiv:2307.13702 propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. The faithfulness metric is also highly correlated ($R^2$ = 0.91) with accuracy, raising doubts about its validity as a construct for evaluating faithfulness.  ( 2 min )
    Vygotsky Distance: Measure for Benchmark Task Similarity
    arXiv:2402.14890v1 Announce Type: cross Abstract: Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks, we call this similarity measure "Vygotsky distance". The core idea of this similarity measure is that it is based on relative performance of the "students" on a given task, rather that on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance the models tend to have similar relative performance on them. Thus knowing Vygotsky distance between tasks one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks thus increasing the generalization potential of the future NLP models.  ( 2 min )
    Double-I Watermark: Protecting Model Copyright for LLM Fine-tuning
    arXiv:2402.14883v1 Announce Type: cross Abstract: To support various applications, business owners often seek the customized models that are obtained by fine-tuning a pre-trained LLM through the API provided by LLM owners or cloud servers. However, this process carries a substantial risk of model misuse, potentially resulting in severe economic consequences for business owners. Thus, safeguarding the copyright of these customized models during LLM fine-tuning has become an urgent practical requirement, but there are limited existing solutions to provide such protection. To tackle this pressing issue, we propose a novel watermarking approach named "Double-I watermark". Specifically, based on the instruct-tuning data, two types of backdoor data paradigms are introduced with trigger in the instruction and the input, respectively. By leveraging LLM's learning capability to incorporate customized backdoor samples into the dataset, the proposed approach effectively injects specific watermarking information into the customized model during fine-tuning, which makes it easy to inject and verify watermarks in commercial scenarios. We evaluate the proposed "Double-I watermark" under various fine-tuning methods, demonstrating its harmlessness, robustness, uniqueness, imperceptibility, and validity through both theoretical analysis and experimental verification.  ( 2 min )
    Machine-learning prediction of tipping and collapse of the Atlantic Meridional Overturning Circulation
    arXiv:2402.14877v1 Announce Type: cross Abstract: Recent research on the Atlantic Meridional Overturning Circulation (AMOC) raised concern about its potential collapse through a tipping point due to the climate-change caused increase in the freshwater input into the North Atlantic. The predicted time window of collapse is centered about the middle of the century and the earliest possible start is approximately two years from now. More generally, anticipating a tipping point at which the system transitions from one stable steady state to another is relevant to a broad range of fields. We develop a machine-learning approach to predicting tipping in noisy dynamical systems with a time-varying parameter and test it on a number of systems including the AMOC, ecological networks, an electrical power system, and a climate model. For the AMOC, our prediction based on simulated fingerprint data and real data of the sea surface temperature places the time window of a potential collapse between the years 2040 and 2065.  ( 2 min )
    What's in a Name? Auditing Large Language Models for Race and Gender Bias
    arXiv:2402.14875v1 Announce Type: cross Abstract: We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4. In our study, we elicit prompt the models for advice regarding an individual across a variety of scenarios, such as during car purchase negotiations or election outcome predictions. We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women. Names associated with Black women receive the least advantageous outcomes. The biases are consistent across 42 prompt templates and several models, indicating a systemic issue rather than isolated incidents. While providing numerical, decision-relevant anchors in the prompt can successfully counteract the biases, qualitative details have inconsistent effects and may even increase disparities. Our findings underscore the importance of conducting audits at the point of LLM deployment and implementation to mitigate their potential for harm against marginalized communities.  ( 2 min )
    Effects of term weighting approach with and without stop words removing on Arabic text classification
    arXiv:2402.14867v1 Announce Type: cross Abstract: Classifying text is a method for categorizing documents into pre-established groups. Text documents must be prepared and represented in a way that is appropriate for the algorithms used for data mining prior to classification. As a result, a number of term weighting strategies have been created in the literature to enhance text categorization algorithms' functionality. This study compares the effects of Binary and Term frequency weighting feature methodologies on the text's classification method when stop words are eliminated once and when they are not. In recognition of assessing the effects of prior weighting of features approaches on classification results in terms of accuracy, recall, precision, and F-measure values, we used an Arabic data set made up of 322 documents divided into six main topics (agriculture, economy, health, politics, science, and sport), each of which contains 50 documents, with the exception of the health category, which contains 61 documents. The results demonstrate that for all metrics, the term frequency feature weighting approach with stop word removal outperforms the binary approach, while for accuracy, recall, and F-Measure, the binary approach outperforms the TF approach without stop word removal. However, for precision, the two approaches produce results that are very similar. Additionally, it is clear from the data that, using the same phrase weighting approach, stop word removing increases classification accuracy.  ( 3 min )
    Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation
    arXiv:2402.14874v1 Announce Type: cross Abstract: We propose a straightforward approach called Distillation Contrastive Decoding (DCD) to enhance the reasoning capabilities of Large Language Models (LLMs) during inference. In contrast to previous approaches that relied on smaller amateur models or analysis of hidden state differences, DCD employs Contrastive Chain-of-thought Prompting and advanced distillation techniques, including Dropout and Quantization. This approach effectively addresses the limitations of Contrastive Decoding (CD), which typically requires both an expert and an amateur model, thus increasing computational resource demands. By integrating contrastive prompts with distillation, DCD obviates the need for an amateur model and reduces memory usage. Our evaluations demonstrate that DCD significantly enhances LLM performance across a range of reasoning benchmarks, surpassing both CD and existing methods in the GSM8K and StrategyQA datasets.  ( 2 min )
    SISSA: Real-time Monitoring of Hardware Functional Safety and Cybersecurity with In-vehicle SOME/IP Ethernet Traffic
    arXiv:2402.14862v1 Announce Type: cross Abstract: Scalable service-Oriented Middleware over IP (SOME/IP) is an Ethernet communication standard protocol in the Automotive Open System Architecture (AUTOSAR), promoting ECU-to-ECU communication over the IP stack. However, SOME/IP lacks a robust security architecture, making it susceptible to potential attacks. Besides, random hardware failure of ECU will disrupt SOME/IP communication. In this paper, we propose SISSA, a SOME/IP communication traffic-based approach for modeling and analyzing in-vehicle functional safety and cyber security. Specifically, SISSA models hardware failures with the Weibull distribution and addresses five potential attacks on SOME/IP communication, including Distributed Denial-of-Services, Man-in-the-Middle, and abnormal communication processes, assuming a malicious user accesses the in-vehicle network. Subsequently, SISSA designs a series of deep learning models with various backbones to extract features from SOME/IP sessions among ECUs. We adopt residual self-attention to accelerate the model's convergence and enhance detection accuracy, determining whether an ECU is under attack, facing functional failure, or operating normally. Additionally, we have created and annotated a dataset encompassing various classes, including indicators of attack, functionality, and normalcy. This contribution is noteworthy due to the scarcity of publicly accessible datasets with such characteristics.Extensive experimental results show the effectiveness and efficiency of SISSA.  ( 2 min )
    DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents
    arXiv:2402.14865v1 Announce Type: cross Abstract: Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide the overall benchmark results and cannot support a fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal~\citep{zhu2023dyval}. MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated the strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations of the abilities. MPA can also be used as a data augmentation approach to enhance LLMs.  ( 2 min )
    The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative
    arXiv:2402.14859v1 Announce Type: cross Abstract: Due to their unprecedented ability to process and respond to various types of data, Multimodal Large Language Models (MLLMs) are constantly defining the new boundary of Artificial General Intelligence (AGI). As these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of these systems are crucial. Our paper, ``The Wolf Within'', explores a novel vulnerability in MLLM societies - the indirect propagation of malicious content. Unlike direct harmful output generation for MLLMs, our research demonstrates how a single MLLM agent can be subtly influenced to generate prompts that, in turn, induce other MLLM agents in the society to output malicious content. This subtle, yet potent method of indirect influence marks a significant escalation in the security risks associated with MLLMs. Our findings reveal that, with minimal or even no access to MLLMs' parameters, an MLLM agent, when manipulated to produce specific prompts or instructions, can effectively ``infect'' other agents within a society of MLLMs. This infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. We also show the transferability of these indirectly generated prompts, highlighting their possibility in propagating malice through inter-agent communication. This research provides a critical insight into a new dimension of threat posed by MLLMs, where a single agent can act as a catalyst for widespread malevolent influence. Our work underscores the urgent need for developing robust mechanisms to detect and mitigate such covert manipulations within MLLM societies, ensuring their safe and ethical utilization in societal applications. Our implementation is released at \url{https://github.com/ChengshuaiZhao0/The-Wolf-Within.git}.  ( 3 min )
    Ranking Large Language Models without Ground Truth
    arXiv:2402.14860v1 Announce Type: cross Abstract: Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses which are expensive to acquire or use pairs of LLMs to evaluate each other which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life where both an expert and a knowledgeable person can identify a novice our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data. This points to a viable low-resource mechanism for practical use.  ( 2 min )
    Asynchronous and Segmented Bidirectional Encoding for NMT
    arXiv:2402.14849v1 Announce Type: cross Abstract: With the rapid advancement of Neural Machine Translation (NMT), enhancing translation efficiency and quality has become a focal point of research. Despite the commendable performance of general models such as the Transformer in various aspects, they still fall short in processing long sentences and fully leveraging bidirectional contextual information. This paper introduces an improved model based on the Transformer, implementing an asynchronous and segmented bidirectional decoding strategy aimed at elevating translation efficiency and accuracy. Compared to traditional unidirectional translations from left-to-right or right-to-left, our method demonstrates heightened efficiency and improved translation quality, particularly in handling long sentences. Experimental results on the IWSLT2017 dataset confirm the effectiveness of our approach in accelerating translation and increasing accuracy, especially surpassing traditional unidirectional strategies in long sentence translation. Furthermore, this study analyzes the impact of sentence length on decoding outcomes and explores the model's performance in various scenarios. The findings of this research not only provide an effective encoding strategy for the NMT field but also pave new avenues and directions for future studies.  ( 2 min )
    HumanEval on Latest GPT Models -- 2024
    arXiv:2402.14852v1 Announce Type: cross Abstract: In 2023, we are using the latest models of GPT-4 to advance program synthesis. The large language models have significantly improved the state-of-the-art for this purpose. To make these advancements more accessible, we have created a repository that connects these models to Huamn Eval. This dataset was initally developed to be used with a language model called CODEGEN on natural and programming language data. The utility of these trained models is showcased by demonstrating their competitive performance in zero-shot Python code generation on HumanEval tasks compared to previous state-of-the-art solutions. Additionally, this gives way to developing more multi-step paradigm synthesis. This benchmark features 160 diverse problem sets factorized into multistep prompts that our analysis shows significantly improves program synthesis over single-turn inputs. All code is open source at https://github.com/daniel442li/gpt-human-eval .  ( 2 min )
    Stick to your Role! Stability of Personal Values Expressed in Large Language Models
    arXiv:2402.14846v1 Announce Type: cross Abstract: The standard way to study Large Language Models (LLMs) through benchmarks or psychology questionnaires is to provide many different queries from similar minimal contexts (e.g. multiple choice questions). However, due to LLM's highly context-dependent nature, conclusions from such minimal-context evaluations may be little informative about the model's behavior in deployment (where it will be exposed to many new contexts). We argue that context-dependence should be studied as another dimension of LLM comparison alongside others such as cognitive abilities, knowledge, or model size. In this paper, we present a case-study about the stability of value expression over different contexts (simulated conversations on different topics), and as measured using a standard psychology questionnaire (PVQ) and a behavioral downstream task. We consider 19 open-sourced LLMs from five families. Reusing methods from psychology, we study Rank-order stability on the population (interpersonal) level, and Ipsative stability on the individual (intrapersonal) level. We explore two settings: with and without instructing LLMs to simulate particular personalities. We observe similar trends in the stability of models and model families - Mixtral, Mistral and Qwen families being more stable than LLaMa-2 and Phi - over those two settings, two different simulated populations, and even in the downstream behavioral task. When instructed to simulate particular personas, LLMs exhibit low Rank-Order stability, and this stability further diminishes with conversation length. This highlights the need for future research directions on LLMs that can coherently simulate a diversity of personas, as well as how context-dependence can be studied in more thorough and efficient ways. This paper provides a foundational step in that direction, and, to our knowledge, it is the first study of value stability in LLMs.  ( 3 min )
    Deep learning-driven scheduling algorithm for a single machine problem minimizing the total tardiness
    arXiv:2402.14847v1 Announce Type: cross Abstract: In this paper, we investigate the use of the deep learning method for solving a well-known NP-hard single machine scheduling problem with the objective of minimizing the total tardiness. We propose a deep neural network that acts as a polynomial-time estimator of the criterion value used in a single-pass scheduling algorithm based on Lawler's decomposition and symmetric decomposition proposed by Della Croce et al. Essentially, the neural network guides the algorithm by estimating the best splitting of the problem into subproblems. The paper also describes a new method for generating the training data set, which speeds up the training dataset generation and reduces the average optimality gap of solutions. The experimental results show that our machine learning-driven approach can efficiently generalize information from the training phase to significantly larger instances. Even though the instances used in the training phase have from 75 to 100 jobs, the average optimality gap on instances with up to 800 jobs is 0.26%, which is almost five times less than the gap of the state-of-the-art heuristic.  ( 2 min )
    The New Era of Dynamic Pricing: Synergizing Supervised Learning and Quadratic Programming
    arXiv:2402.14844v1 Announce Type: cross Abstract: In this paper, we explore a novel combination of supervised learning and quadratic programming to refine dynamic pricing models in the car rental industry. We utilize dynamic modeling of price elasticity, informed by ordinary least squares (OLS) metrics such as p-values, homoscedasticity, error normality. These metrics, when their underlying assumptions hold, are integral in guiding a quadratic programming agent. The program is tasked with optimizing margin for a given finite set target.  ( 2 min )
    Purifying Large Language Models by Ensembling a Small Language Model
    arXiv:2402.14845v1 Announce Type: cross Abstract: The emerging success of large language models (LLMs) heavily relies on collecting abundant training data from external (untrusted) sources. Despite substantial efforts devoted to data cleaning and curation, well-constructed LLMs have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of LLMs. In this study, we propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data, namely, through ensembling LLMs with benign and small language models (SLMs). Aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling LLMs with SLMs, which can effectively preserve the performance of LLMs while mitigating issues such as copyright infringement, data poisoning, and privacy violations.  ( 2 min )
    RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts
    arXiv:2402.14838v1 Announce Type: cross Abstract: Nowadays, the usage of Large Language Models (LLMs) has increased, and LLMs have been used to generate texts in different languages and for different tasks. Additionally, due to the participation of remarkable companies such as Google and OpenAI, LLMs are now more accessible, and people can easily use them. However, an important issue is how we can detect AI-generated texts from human-written ones. In this article, we have investigated the problem of AI-generated text detection from two different aspects: semantics and syntax. Finally, we presented an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, using a semantic approach would be more helpful for detection. However, there is a lot of room for improvement in the syntactic approach, and it would be a good approach for future work.  ( 2 min )
    Text Diffusion with Reinforced Conditioning
    arXiv:2402.14843v1 Announce Type: cross Abstract: Diffusion models have demonstrated exceptional capability in generating high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they provide a strong potential for achieving better non-autoregressive sequence generation. However, existing text diffusion models still fall short in their performance due to a challenge in handling the discreteness of language. This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling. Motivated by our findings, we propose a novel Text Diffusion model called TREC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling. Our extensive experiments demonstrate the competitiveness of TREC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.  ( 2 min )
    MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing
    arXiv:2402.14835v1 Announce Type: cross Abstract: Multimodal knowledge editing represents a critical advancement in enhancing the capabilities of Multimodal Large Language Models (MLLMs). Despite its potential, current benchmarks predominantly focus on coarse-grained knowledge, leaving the intricacies of fine-grained (FG) multimodal entity knowledge largely unexplored. This gap presents a notable challenge, as FG entity recognition is pivotal for the practical deployment and effectiveness of MLLMs in diverse real-world scenarios. To bridge this gap, we introduce MIKE, a comprehensive benchmark and dataset specifically designed for the FG multimodal entity knowledge editing. MIKE encompasses a suite of tasks tailored to assess different perspectives, including Vanilla Name Answering, Entity-Level Caption, and Complex-Scenario Recognition. In addition, a new form of knowledge editing, Multi-step Editing, is introduced to evaluate the editing efficiency. Through our extensive evaluations, we demonstrate that the current state-of-the-art methods face significant challenges in tackling our proposed benchmark, underscoring the complexity of FG knowledge editing in MLLMs. Our findings spotlight the urgent need for novel approaches in this domain, setting a clear agenda for future research and development efforts within the community.  ( 2 min )
    An Empirical Categorization of Prompting Techniques for Large Language Models: A Practitioner's Guide
    arXiv:2402.14837v1 Announce Type: cross Abstract: Due to rapid advancements in the development of Large Language Models (LLMs), programming these models with prompts has recently gained significant attention. However, the sheer number of available prompt engineering techniques creates an overwhelming landscape for practitioners looking to utilize these tools. For the most efficient and effective use of LLMs, it is important to compile a comprehensive list of prompting techniques and establish a standardized, interdisciplinary categorization framework. In this survey, we examine some of the most well-known prompting techniques from both academic and practical viewpoints and classify them into seven distinct categories. We present an overview of each category, aiming to clarify their unique contributions and showcase their practical applications in real-world examples in order to equip fellow practitioners with a structured framework for understanding and categorizing prompting techniques tailored to their specific domains. We believe that this approach will help simplify the complex landscape of prompt engineering and enable more effective utilization of LLMs in various applications. By providing practitioners with a systematic approach to prompt categorization, we aim to assist in navigating the intricacies of effective prompt design for conversational pre-trained LLMs and inspire new possibilities in their respective fields.  ( 2 min )
    Optimizing Uterine Synchronization Analysis in Pregnancy and Labor through Window Selection and Node Optimization
    arXiv:2402.14827v1 Announce Type: cross Abstract: Preterm labor (PL) has globally become the leading cause of death in children under the age of 5 years. To address this problem, this paper will provide a new approach by analyzing the EHG signals, which are recorded on the abdomen of the mother during labor and pregnancy. The EHG signal reflects the electrical activity that induces the mechanical contraction of the myometrium. Because EHGs are known to be non-stationary signals, and because we anticipate connectivity to alter during contraction, we applied the windowing approach on real signals to help us identify the best windows and the best nodes with the most significant data to be used for classification. The suggested pipeline includes i) divide the 16 EHG signals that are recorded from the abdomen of pregnant women in N windows; ii) apply the connectivity matrices on each window; iii) apply the Graph theory-based measures on the connectivity matrices on each window; iv) apply the consensus Matrix on each window in order to retrieve the best windows and the best nodes. Following that, several neural network and machine learning methods are applied to the best windows and best nodes to categorize pregnancy and labor contractions, based on the different input parameters (connectivity method alone, connectivity method plus graph parameters, best nodes, all nodes, best windows, all windows). Results showed that the best nodes are nodes 8, 9, 10, 11, and 12; while the best windows are 2, 4, and 5. The classification results obtained by using only these best nodes are better than when using the whole nodes. The results are always better when using the full burst, whatever the chosen nodes. Thus, the windowing approach proved to be an innovative technique that can improve the differentiation between labor and pregnancy EHG signals.  ( 3 min )
    CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness
    arXiv:2402.14833v1 Announce Type: cross Abstract: Large language models (LLMs) have become pivotal in recent research. However, during the inference process, LLMs still require substantial resources. In this paper, we propose CliqueParcel, a method designed to improve the efficiency of LLMs via prompt batching. Existing strategies to optimize inference efficiency often compromise on output quality, leading to a discounted output problem. This issue might result in reduced accuracy or outputs that are less detailed. CliqueParcel is our answer to this challenge. While ensuring accuracy and minimizing deviations from the original outputs (i.e., faithfulness), our method significantly improves efficiency during inference. To lay the groundwork, we first redefine efficiency measurements by excluding the reduction in running time due to shorter lengths. Then, we provide a comprehensive trade-off between efficiency and faithfulness to clarify the nature of the 'discounted output' problem. Within the CliqueParcel framework, we suggest multiple batching sub-methods and discuss the specific scenarios in which they can be applied. During evaluation, CliqueParcel is tested on eight widely recognized datasets, which can be classified into three types: reading comprehension, open-source question-answering, and reasoning. Our experiments explore the performance of CliqueParcel, including efficiency, faithfulness, and the trade-off between them. This work provides novel insights into inference efficiency and demonstrates promising performance.  ( 2 min )
    Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
    arXiv:2402.15505v1 Announce Type: new Abstract: Steering the behavior of a strong model pre-trained on internet-scale data can be difficult due to the scarcity of competent supervisors. Recent studies reveal that, despite supervisory noises, a strong student model may surpass its weak teacher when fine-tuned on specific objectives. Yet, the effectiveness of such weak-to-strong generalization remains limited, especially in the presence of large capability gaps. In this paper, we propose to address this challenge by harnessing a diverse set of specialized teachers, instead of a single generalist one, that collectively supervises the strong student. Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision: (i) we progressively alternate student training and teacher assignment, leveraging the growth of the strong student to identify plausible supervisions; (ii) we conservatively enforce teacher-student and local-global consistency, leveraging their dependencies to reject potential annotation noises. We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets. Our code is available at \url{https://github.com/yuejiangliu/csl}.  ( 2 min )
    Deepfake Detection and the Impact of Limited Computing Capabilities
    arXiv:2402.14825v1 Announce Type: cross Abstract: The rapid development of technologies and artificial intelligence makes deepfakes an increasingly sophisticated and challenging-to-identify technique. To ensure the accuracy of information and control misinformation and mass manipulation, it is of paramount importance to discover and develop artificial intelligence models that enable the generic detection of forged videos. This work aims to address the detection of deepfakes across various existing datasets in a scenario with limited computing resources. The goal is to analyze the applicability of different deep learning techniques under these restrictions and explore possible approaches to enhance their efficiency.  ( 2 min )
    A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends
    arXiv:2402.15490v1 Announce Type: new Abstract: In today's digital age, Convolutional Neural Networks (CNNs), a subset of Deep Learning (DL), are widely used for various computer vision tasks such as image classification, object detection, and image segmentation. There are numerous types of CNNs designed to meet specific needs and requirements, including 1D, 2D, and 3D CNNs, as well as dilated, grouped, attention, depthwise convolutions, and NAS, among others. Each type of CNN has its unique structure and characteristics, making it suitable for specific tasks. It's crucial to gain a thorough understanding and perform a comparative analysis of these different CNN types to understand their strengths and weaknesses. Furthermore, studying the performance, limitations, and practical applications of each type of CNN can aid in the development of new and improved architectures in the future. We also dive into the platforms and frameworks that researchers utilize for their research or development from various perspectives. Additionally, we explore the main research fields of CNN like 6D vision, generative models, and meta-learning. This survey paper provides a comprehensive examination and comparison of various CNN architectures, highlighting their architectural differences and emphasizing their respective advantages, disadvantages, applications, challenges, and future trends.  ( 2 min )
    Mechanics-Informed Autoencoder Enables Automated Detection and Localization of Unforeseen Structural Damage
    arXiv:2402.15492v1 Announce Type: new Abstract: Structural health monitoring (SHM) is vital for ensuring the safety and longevity of structures like buildings and bridges. As the volume and scale of structures and the impact of their failure continue to grow, there is a dire need for SHM techniques that are scalable, inexpensive, operate passively without human intervention, and customized for each mechanical structure without the need for complex baseline models. We present a novel "deploy-and-forget" approach for automated detection and localization of damages in structures. It is based on a synergistic combination of fully passive measurements from inexpensive sensors and a mechanics-informed autoencoder. Once deployed, our solution continuously learns and adapts a bespoke baseline model for each structure, learning from its undamaged state's response characteristics. After learning from just 3 hours of data, it can autonomously detect and localize different types of unforeseen damage. Results from numerical simulations and experiments indicate that incorporating the mechanical characteristics into the variational autoencoder allows for up to 35\% earlier detection and localization of damage over a standard autoencoder. Our approach holds substantial promise for a significant reduction in human intervention and inspection costs and enables proactive and preventive maintenance strategies, thus extending the lifespan, reliability, and sustainability of civil infrastructures.  ( 2 min )
    Debiasing Machine Learning Models by Using Weakly Supervised Learning
    arXiv:2402.15477v1 Announce Type: new Abstract: We tackle the problem of bias mitigation of algorithmic decisions in a setting where both the output of the algorithm and the sensitive variable are continuous. Most of prior work deals with discrete sensitive variables, meaning that the biases are measured for subgroups of persons defined by a label, leaving out important algorithmic bias cases, where the sensitive variable is continuous. Typical examples are unfair decisions made with respect to the age or the financial status. In our work, we then propose a bias mitigation strategy for continuous sensitive variables, based on the notion of endogeneity which comes from the field of econometrics. In addition to solve this new problem, our bias mitigation strategy is a weakly supervised learning method which requires that a small portion of the data can be measured in a fair manner. It is model agnostic, in the sense that it does not make any hypothesis on the prediction model. It also makes use of a reasonably large amount of input observations and their corresponding predictions. Only a small fraction of the true output predictions should be known. This therefore limits the need for expert interventions. Results obtained on synthetic data show the effectiveness of our approach for examples as close as possible to real-life applications in econometrics.  ( 3 min )
    Transformers are Expressive, But Are They Expressive Enough for Regression?
    arXiv:2402.15478v1 Announce Type: new Abstract: Transformers have become pivotal in Natural Language Processing, demonstrating remarkable success in applications like Machine Translation and Summarization. Given their widespread adoption, several works have attempted to analyze the expressivity of Transformers. Expressivity of a neural network is the class of functions it can approximate. A neural network is fully expressive if it can act as a universal function approximator. We attempt to analyze the same for Transformers. Contrary to existing claims, our findings reveal that Transformers struggle to reliably approximate continuous functions, relying on piecewise constant approximations with sizable intervals. The central question emerges as: "\textit{Are Transformers truly Universal Function Approximators}?" To address this, we conduct a thorough investigation, providing theoretical insights and supporting evidence through experiments. Our contributions include a theoretical analysis pinpointing the root of Transformers' limitation in function approximation and extensive experiments to verify the limitation. By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.  ( 2 min )
    Active Few-Shot Fine-Tuning
    arXiv:2402.15441v1 Announce Type: new Abstract: We study the active few-shot fine-tuning of large neural networks to downstream tasks. We show that few-shot fine-tuning is an instance of a generalization of classical active learning, transductive active learning, and we propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize the information gained about specified downstream tasks. Under general regularity assumptions, we prove that ITL converges uniformly to the smallest possible uncertainty obtainable from the accessible data. To the best of our knowledge, we are the first to derive generalization bounds of this kind, and they may be of independent interest for active learning. We apply ITL to the few-shot fine-tuning of large neural networks and show that ITL substantially improves upon the state-of-the-art.  ( 2 min )
    FAIR: Filtering of Automatically Induced Rules
    arXiv:2402.15472v1 Announce Type: new Abstract: The availability of large annotated data can be a critical bottleneck in training machine learning algorithms successfully, especially when applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domain-specific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlabeled data. Automatic Rule Induction (ARI) approaches circumvent this problem by automatically creating rules from features on a small labeled set and filtering a final set of rules from them. In the ARI approach, the crucial step is to filter out a set of a high-quality useful subset of rules from the large set of automatically created rules. In this paper, we propose an algorithm (Filtering of Automatically Induced Rules) to filter rules from a large number of automatically induced rules using submodular objective functions that account for the collective precision, coverage, and conflicts of the rule set. We experiment with three ARI approaches and five text classification datasets to validate the superior performance of our algorithm with respect to several semi-supervised label aggregation approaches. Further, we show that achieves statistically significant results in comparison to existing rule-filtering approaches.  ( 2 min )
    Does Combining Parameter-efficient Modules Improve Few-shot Transfer Accuracy?
    arXiv:2402.15414v1 Announce Type: new Abstract: Parameter-efficient fine-tuning stands as the standard for efficiently fine-tuning large language and vision models on downstream tasks. Specifically, the efficiency of low-rank adaptation has facilitated the creation and sharing of hundreds of custom LoRA modules, each trained on distinct data from various downstream tasks. In this paper, we explore the composability of LoRA modules, examining if combining these pre-trained modules enhances generalization to unseen downstream tasks. Our investigation involves evaluating two approaches: (a) uniform composition, involving averaging upstream LoRA modules with equal weights, and (b) learned composition, where we learn the weights for each upstream module and perform weighted averaging. Our experimental results on both vision and language models reveal that in few-shot settings, where only a limited number of samples are available for the downstream task, both uniform and learned composition methods result in better transfer accuracy; outperforming full fine-tuning and training a LoRA from scratch. Moreover, in full-shot settings, learned composition performs comparably to regular LoRA training with significantly fewer number of trainable parameters. Our research unveils the potential of uniform composition for enhancing transferability in low-shot settings, without introducing additional learnable parameters.  ( 2 min )
    The Impact of LoRA on the Emergence of Clusters in Transformers
    arXiv:2402.15415v1 Announce Type: new Abstract: In this paper, we employ the mathematical framework on Transformers developed by \citet{sander2022sinkformers,geshkovski2023emergence,geshkovski2023mathematical} to explore how variations in attention parameters and initial token values impact the structural dynamics of token clusters. Our analysis demonstrates that while the clusters within a modified attention matrix dynamics can exhibit significant divergence from the original over extended periods, they maintain close similarities over shorter intervals, depending on the parameter differences. This work contributes to the fine-tuning field through practical applications to the LoRA algorithm \cite{hu2021lora,peft}, enhancing our understanding of the behavior of LoRA-enhanced Transformer models.  ( 2 min )
    Optimisic Information Directed Sampling
    arXiv:2402.15411v1 Announce Type: new Abstract: We study the problem of online learning in contextual bandit problems where the loss function is assumed to belong to a known parametric function class. We propose a new analytic framework for this setting that bridges the Bayesian theory of information-directed sampling due to Russo and Van Roy (2018) and the worst-case theory of Foster, Kakade, Qian, and Rakhlin (2021) based on the decision-estimation coefficient. Drawing from both lines of work, we propose a algorithmic template called Optimistic Information-Directed Sampling and show that it can achieve instance-dependent regret guarantees similar to the ones achievable by the classic Bayesian IDS method, but with the major advantage of not requiring any Bayesian assumptions. The key technical innovation of our analysis is introducing an optimistic surrogate model for the regret and using it to define a frequentist version of the Information Ratio of Russo and Van Roy (2018), and a less conservative version of the Decision Estimation Coefficient of Foster et al. (2021). Keywords: Contextual bandits, information-directed sampling, decision estimation coefficient, first-order regret bounds.  ( 2 min )
    G-RepsNet: A Fast and General Construction of Equivariant Networks for Arbitrary Matrix Groups
    arXiv:2402.15413v1 Announce Type: new Abstract: Group equivariance is a strong inductive bias useful in a wide range of deep learning tasks. However, constructing efficient equivariant networks for general groups and domains is difficult. Recent work by Finzi et al. (2021) directly solves the equivariance constraint for arbitrary matrix groups to obtain equivariant MLPs (EMLPs). But this method does not scale well and scaling is crucial in deep learning. Here, we introduce Group Representation Networks (G-RepsNets), a lightweight equivariant network for arbitrary matrix groups with features represented using tensor polynomials. The key intuition for our design is that using tensor representations in the hidden layers of a neural network along with simple inexpensive tensor operations can lead to expressive universal equivariant networks. We find G-RepsNet to be competitive to EMLP on several tasks with group symmetries such as O(5), O(1, 3), and O(3) with scalars, vectors, and second-order tensors as data types. On image classification tasks, we find that G-RepsNet using second-order representations is competitive and often even outperforms sophisticated state-of-the-art equivariant models such as GCNNs (Cohen & Welling, 2016a) and E(2)-CNNs (Weiler & Cesa, 2019). To further illustrate the generality of our approach, we show that G-RepsNet is competitive to G-FNO (Helwig et al., 2023) and EGNN (Satorras et al., 2021) on N-body predictions and solving PDEs, respectively, while being efficient.  ( 2 min )
    Conformalized-DeepONet: A Distribution-Free Framework for Uncertainty Quantification in Deep Operator Networks
    arXiv:2402.15406v1 Announce Type: new Abstract: In this paper, we adopt conformal prediction, a distribution-free uncertainty quantification (UQ) framework, to obtain confidence prediction intervals with coverage guarantees for Deep Operator Network (DeepONet) regression. Initially, we enhance the uncertainty quantification frameworks (B-DeepONet and Prob-DeepONet) previously proposed by the authors by using split conformal prediction. By combining conformal prediction with our Prob- and B-DeepONets, we effectively quantify uncertainty by generating rigorous confidence intervals for DeepONet prediction. Additionally, we design a novel Quantile-DeepONet that allows for a more natural use of split conformal prediction. We refer to this distribution-free effective uncertainty quantification framework as split conformal Quantile-DeepONet regression. Finally, we demonstrate the effectiveness of the proposed methods using various ordinary, partial differential equation numerical examples, and multi-fidelity learning.  ( 2 min )
    NeuralThink: Algorithm Synthesis that Extrapolates in General Tasks
    arXiv:2402.15393v1 Announce Type: new Abstract: While machine learning methods excel at pattern recognition, they struggle with complex reasoning tasks in a scalable, algorithmic manner. Recent Deep Thinking methods show promise in learning algorithms that extrapolate: learning in smaller environments and executing the learned algorithm in larger environments. However, these works are limited to symmetrical tasks, where the input and output dimensionalities are the same. To address this gap, we propose NeuralThink, a new recurrent architecture that can consistently extrapolate to both symmetrical and asymmetrical tasks, where the dimensionality of the input and output are different. We contribute with a novel benchmark of asymmetrical tasks for extrapolation. We show that NeuralThink consistently outperforms the prior state-of-the-art Deep Thinking architectures, in regards to stable extrapolation to large observations from smaller training sizes.  ( 2 min )
    Explorations of Self-Repair in Language Models
    arXiv:2402.15390v1 Announce Type: new Abstract: Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomena where if components in large language models are ablated, later components will change their behavior to compensate. Our work builds off this past literature, demonstrating that self-repair exists on a variety of models families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect). We highlight two different mechanisms that contribute to self-repair, including changes in the final LayerNorm scaling factor (which can repair up to 30% of the direct effect) and sparse sets of neurons implementing Anti-Erasure. We additionally discuss the implications of these results for interpretability practitioners and close with a more speculative discussion on the mystery of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.  ( 2 min )
    Genie: Generative Interactive Environments
    arXiv:2402.15391v1 Announce Type: new Abstract: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.  ( 2 min )
    Information-Theoretic Safe Bayesian Optimization
    arXiv:2402.15347v1 Announce Type: new Abstract: We consider a sequential decision making task, where the goal is to optimize an unknown function without evaluating parameters that violate an a~priori unknown (safety) constraint. A common approach is to place a Gaussian process prior on the unknown functions and allow evaluations only in regions that are safe with high probability. Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case. Moreover, the way in which they exploit regularity assumptions about the constraint introduces an additional critical hyperparameter. In this paper, we propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate. The combination of this exploration criterion with a well known Bayesian optimization acquisition function yields a novel safe Bayesian optimization selection criterion. Our approach is naturally applicable to continuous domains and does not require additional explicit hyperparameters. We theoretically analyze the method and show that we do not violate the safety constraint with high probability and that we learn about the value of the safe optimum up to arbitrary precision. Empirical evaluations demonstrate improved data-efficiency and scalability.  ( 2 min )
    AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks
    arXiv:2402.15351v1 Announce Type: new Abstract: Automated machine learning (AutoML) is a collection of techniques designed to automate the machine learning development process. While traditional AutoML approaches have been successfully applied in several critical steps of model development (e.g. hyperparameter optimization), there lacks a AutoML system that automates the entire end-to-end model production workflow. To fill this blank, we present AutoMMLab, a general-purpose LLM-empowered AutoML system that follows user's language instructions to automate the whole model production workflow for computer vision tasks. The proposed AutoMMLab system effectively employs LLMs as the bridge to connect AutoML and OpenMMLab community, empowering non-expert individuals to easily build task-specific models via a user-friendly language interface. Specifically, we propose RU-LLaMA to understand users' request and schedule the whole pipeline, and propose a novel LLM-based hyperparameter optimizer called HPO-LLaMA to effectively search for the optimal hyperparameters. Experiments show that our AutoMMLab system is versatile and covers a wide range of mainstream tasks, including classification, detection, segmentation and keypoint estimation. We further develop a new benchmark, called LAMP, for studying key components in the end-to-end prompt-based model training pipeline. Code, model, and data will be released.  ( 2 min )
    Categorical Deep Learning: An Algebraic Theory of Architectures
    arXiv:2402.15332v1 Announce Type: new Abstract: We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent bridge between specifying constraints which models must satisfy and specifying their implementations. Focusing on building a such a bridge, we propose to apply category theory -- precisely, the universal algebra of monads valued in a 2-category of parametric maps -- as a single theory elegantly subsuming both of these flavours of neural network design. To defend our position, we show how this theory recovers constraints induced by geometric deep learning, as well as implementations of many architectures drawn from the diverse landscape of neural networks, such as RNNs. We also illustrate how the theory naturally encodes many standard constructs in computer science and automata theory.  ( 2 min )
    Fourier Basis Density Model
    arXiv:2402.15345v1 Announce Type: new Abstract: We introduce a lightweight, flexible and end-to-end trainable probability density model parameterized by a constrained Fourier basis. We assess its performance at approximating a range of multi-modal 1D densities, which are generally difficult to fit. In comparison to the deep factorized model introduced in [1], our model achieves a lower cross entropy at a similar computational budget. In addition, we also evaluate our method on a toy compression task, demonstrating its utility in learned compression.  ( 2 min )
    On Minimal Depth in Neural Networks
    arXiv:2402.15315v1 Announce Type: new Abstract: A characterization of the representability of neural networks is relevant to comprehend their success in artificial intelligence. This study investigate two topics on ReLU neural network expressivity and their connection with a conjecture related to the minimum depth required for representing any continuous piecewise linear function (CPWL). The topics are the minimal depth representation of the sum and max operations, as well as the exploration of polytope neural networks. For the sum operation, we establish a sufficient condition on the minimal depth of the operands to find the minimal depth of the operation. In contrast, regarding the max operation, a comprehensive set of examples is presented, demonstrating that no sufficient conditions, depending solely on the depth of the operands, would imply a minimal depth for the operation. The study also examine the minimal depth relationship between convex CPWL functions. On polytope neural networks, we investigate several fundamental properties, deriving results equivalent to those of ReLU networks, such as depth inclusions and depth computation from vertices. Notably, we compute the minimal depth of simplices, which is strictly related to the minimal depth conjecture in ReLU networks.  ( 2 min )
    GPTVQ: The Blessing of Dimensionality for LLM Quantization
    arXiv:2402.15319v1 Announce Type: new Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state-of-the art in the size vs accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llamav2-70B model, depending on quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.  ( 2 min )
    Linear Dynamics-embedded Neural Network for Long-Sequence Modeling
    arXiv:2402.15290v1 Announce Type: new Abstract: The trade-off between performance and computational efficiency in long-sequence modeling becomes a bottleneck for existing models. Inspired by the continuous state space models (SSMs) with multi-input and multi-output in control theory, we propose a new neural network called Linear Dynamics-embedded Neural Network (LDNN). SSMs' continuous, discrete, and convolutional properties enable LDNN to have few parameters, flexible inference, and efficient training in long-sequence tasks. Two efficient strategies, diagonalization and $'\text{Disentanglement then Fast Fourier Transform (FFT)}'$, are developed to reduce the time complexity of convolution from $O(LNH\max\{L, N\})$ to $O(LN\max \{H, \log L\})$. We further improve LDNN through bidirectional noncausal and multi-head settings to accommodate a broader range of applications. Extensive experiments on the Long Range Arena (LRA) demonstrate the effectiveness and state-of-the-art performance of LDNN.  ( 2 min )
    Counterfactual Generation with Identifiability Guarantees
    arXiv:2402.15309v1 Announce Type: new Abstract: Counterfactual generation lies at the core of various machine learning tasks, including image translation and controllable text generation. This generation process usually requires the identification of the disentangled latent representations, such as content and style, that underlie the observed data. However, it becomes more challenging when faced with a scarcity of paired data and labeling information. Existing disentangled methods crucially rely on oversimplified assumptions, such as assuming independent content and style variables, to identify the latent variables, even though such assumptions may not hold for complex data distributions. For instance, food reviews tend to involve words like tasty, whereas movie reviews commonly contain words such as thrilling for the same positive sentiment. This problem is exacerbated when data are sampled from multiple domains since the dependence between content and style may vary significantly over domains. In this work, we tackle the domain-varying dependence between the content and the style variables inherent in the counterfactual generation task. We provide identification guarantees for such latent-variable models by leveraging the relative sparsity of the influences from different latent variables. Our theoretical insights enable the development of a doMain AdapTive counTerfactual gEneration model, called (MATTE). Our theoretically grounded framework achieves state-of-the-art performance in unsupervised style transfer tasks, where neither paired data nor style labels are utilized, across four large-scale datasets. Code is available at https://github.com/hanqi-qi/Matte.git  ( 3 min )
    When in Doubt, Think Slow: Iterative Reasoning with Latent Imagination
    arXiv:2402.15283v1 Announce Type: new Abstract: In an unfamiliar setting, a model-based reinforcement learning agent can be limited by the accuracy of its world model. In this work, we present a novel, training-free approach to improving the performance of such agents separately from planning and learning. We do so by applying iterative inference at decision-time, to fine-tune the inferred agent states based on the coherence of future state representations. Our approach achieves a consistent improvement in both reconstruction accuracy and task performance when applied to visual 3D navigation tasks. We go on to show that considering more future states further improves the performance of the agent in partially-observable environments, but not in a fully-observable one. Finally, we demonstrate that agents with less training pre-evaluation benefit most from our approach.  ( 2 min )
    Spatiotemporal Observer Design for Predictive Learning of High-Dimensional Data
    arXiv:2402.15284v1 Announce Type: new Abstract: Although deep learning-based methods have shown great success in spatiotemporal predictive learning, the framework of those models is designed mainly by intuition. How to make spatiotemporal forecasting with theoretical guarantees is still a challenging issue. In this work, we tackle this problem by applying domain knowledge from the dynamical system to the framework design of deep learning models. An observer theory-guided deep learning architecture, called Spatiotemporal Observer, is designed for predictive learning of high dimensional data. The characteristics of the proposed framework are twofold: firstly, it provides the generalization error bound and convergence guarantee for spatiotemporal prediction; secondly, dynamical regularization is introduced to enable the model to learn system dynamics better during training. Further experimental results show that this framework could capture the spatiotemporal dynamics and make accurate predictions in both one-step-ahead and multi-step-ahead forecasting scenarios.  ( 2 min )
    Smoothed Graph Contrastive Learning via Seamless Proximity Integration
    arXiv:2402.15270v1 Announce Type: new Abstract: Graph contrastive learning (GCL) aligns node representations by classifying node pairs into positives and negatives using a selection process that typically relies on establishing correspondences within two augmented graphs. The conventional GCL approaches incorporate negative samples uniformly in the contrastive loss, resulting in the equal treatment negative nodes, regardless of their proximity to the true positive. In this paper, we present a Smoothed Graph Contrastive Learning model (SGCL), which leverages the geometric structure of augmented graphs to inject proximity information associated with positive/negative pairs in the contrastive loss, thus significantly regularizing the learning process. The proposed SGCL adjusts the penalties associated with node pairs in the contrastive loss by incorporating three distinct smoothing techniques that result in proximity aware positives and negatives. To enhance scalability for large-scale graphs, the proposed framework incorporates a graph batch-generating strategy that partitions the given graphs into multiple subgraphs, facilitating efficient training in separate batches. Through extensive experimentation in the unsupervised setting on various benchmarks, particularly those of large scale, we demonstrate the superiority of our proposed framework against recent baselines.  ( 2 min )
    Classification Under Strategic Self-Selection
    arXiv:2402.15274v1 Announce Type: new Abstract: When users stand to gain from certain predictions, they are prone to act strategically to obtain favorable predictive outcomes. Whereas most works on strategic classification consider user actions that manifest as feature modifications, we study a novel setting in which users decide -- in response to the learned classifier -- whether to at all participate (or not). For learning approaches of increasing strategic awareness, we study the effects of self-selection on learning, and the implications of learning on the composition of the self-selected population. We then propose a differentiable framework for learning under self-selective behavior, which can be optimized effectively. We conclude with experiments on real data and simulated behavior that both complement our analysis and demonstrate the utility of our approach.  ( 2 min )
    Dynamic Memory Based Adaptive Optimization
    arXiv:2402.15262v1 Announce Type: new Abstract: Define an optimizer as having memory $k$ if it stores $k$ dynamically changing vectors in the parameter space. Classical SGD has memory $0$, momentum SGD optimizer has $1$ and Adam optimizer has $2$. We address the following questions: How can optimizers make use of more memory units? What information should be stored in them? How to use them for the learning steps? As an approach to the last question, we introduce a general method called "Retrospective Learning Law Correction" or shortly RLLC. This method is designed to calculate a dynamically varying linear combination (called learning law) of memory units, which themselves may evolve arbitrarily. We demonstrate RLLC on optimizers whose memory units have linear update rules and small memory ($\leq 4$ memory units). Our experiments show that in a variety of standard problems, these optimizers outperform the above mentioned three classical optimizers. We conclude that RLLC is a promising framework for boosting the performance of known optimizers by adding more memory units and by making them more adaptive.  ( 2 min )
    A Bargaining-based Approach for Feature Trading in Vertical Federated Learning
    arXiv:2402.15247v1 Announce Type: new Abstract: Vertical Federated Learning (VFL) has emerged as a popular machine learning paradigm, enabling model training across the data and the task parties with different features about the same user set while preserving data privacy. In production environment, VFL usually involves one task party and one data party. Fair and economically efficient feature trading is crucial to the commercialization of VFL, where the task party is considered as the data consumer who buys the data party's features. However, current VFL feature trading practices often price the data party's data as a whole and assume transactions occur prior to the performing VFL. Neglecting the performance gains resulting from traded features may lead to underpayment and overpayment issues. In this study, we propose a bargaining-based feature trading approach in VFL to encourage economically efficient transactions. Our model incorporates performance gain-based pricing, taking into account the revenue-based optimization objectives of both parties. We analyze the proposed bargaining model under perfect and imperfect performance information settings, proving the existence of an equilibrium that optimizes the parties' objectives. Moreover, we develop performance gain estimation-based bargaining strategies for imperfect performance information scenarios and discuss potential security issues and solutions. Experiments on three real-world datasets demonstrate the effectiveness of the proposed bargaining model.  ( 2 min )
    Optimal Transport for Structure Learning Under Missing Data
    arXiv:2402.15255v1 Announce Type: new Abstract: Causal discovery in the presence of missing data introduces a chicken-and-egg dilemma. While the goal is to recover the true causal structure, robust imputation requires considering the dependencies or preferably causal relations among variables. Merely filling in missing values with existing imputation methods and subsequently applying structure learning on the complete data is empirical shown to be sub-optimal. To this end, we propose in this paper a score-based algorithm, based on optimal transport, for learning causal structure from missing data. This optimal transport viewpoint diverges from existing score-based approaches that are dominantly based on EM. We project structure learning as a density fitting problem, where the goal is to find the causal model that induces a distribution of minimum Wasserstein distance with the distribution over the observed data. Through extensive simulations and real-data experiments, our framework is shown to recover the true causal graphs more effectively than the baselines in various simulations and real-data experiments. Empirical evidences also demonstrate the superior scalability of our approach, along with the flexibility to incorporate any off-the-shelf causal discovery methods for complete data.  ( 2 min )
    Fixed Random Classifier Rearrangement for Continual Learning
    arXiv:2402.15227v1 Announce Type: new Abstract: With the explosive growth of data, continual learning capability is increasingly important for neural networks. Due to catastrophic forgetting, neural networks inevitably forget the knowledge of old tasks after learning new ones. In visual classification scenario, a common practice of alleviating the forgetting is to constrain the backbone. However, the impact of classifiers is underestimated. In this paper, we analyze the variation of model predictions in sequential binary classification tasks and find that the norm of the equivalent one-class classifiers significantly affects the forgetting level. Based on this conclusion, we propose a two-stage continual learning algorithm named Fixed Random Classifier Rearrangement (FRCR). In first stage, FRCR replaces the learnable classifiers with fixed random classifiers, constraining the norm of the equivalent one-class classifiers without affecting the performance of the network. In second stage, FRCR rearranges the entries of new classifiers to implicitly reduce the drift of old latent representations. The experimental results on multiple datasets show that FRCR significantly mitigates the model forgetting; subsequent experimental analyses further validate the effectiveness of the algorithm.  ( 2 min )
    Which Model to Transfer? A Survey on Transferability Estimation
    arXiv:2402.15231v1 Announce Type: new Abstract: Transfer learning methods endeavor to leverage relevant knowledge from existing source pre-trained models or datasets to solve downstream target tasks. With the increase in the scale and quantity of available pre-trained models nowadays, it becomes critical to assess in advance whether they are suitable for a specific target task. Model transferability estimation is an emerging and growing area of interest, aiming to propose a metric to quantify this suitability without training them individually, which is computationally prohibitive. Despite extensive recent advances already devoted to this area, they have custom terminological definitions and experimental settings. In this survey, we present the first review of existing advances in this area and categorize them into two separate realms: source-free model transferability estimation and source-dependent model transferability estimation. Each category is systematically defined, accompanied by a comprehensive taxonomy. Besides, we address challenges and outline future research directions, intending to provide a comprehensive guide to aid researchers and practitioners.  ( 2 min )
    Bidirectional Uncertainty-Based Active Learning for Open Set Annotation
    arXiv:2402.15198v1 Announce Type: new Abstract: Active learning (AL) in open set scenarios presents a novel challenge of identifying the most valuable examples in an unlabeled data pool that comprises data from both known and unknown classes. Traditional methods prioritize selecting informative examples with low confidence, with the risk of mistakenly selecting unknown-class examples with similarly low confidence. Recent methods favor the most probable known-class examples, with the risk of picking simple already mastered examples. In this paper, we attempt to query examples that are both likely from known classes and highly informative, and propose a \textit{Bidirectional Uncertainty-based Active Learning} (BUAL) framework. Specifically, we achieve this by first pushing the unknown class examples toward regions with high-confidence predictions with our proposed \textit{Random Label Negative Learning} method. Then, we propose a \textit{Bidirectional Uncertainty sampling} strategy by jointly estimating uncertainty posed by both positive and negative learning to perform consistent and stable sampling. BUAL successfully extends existing uncertainty-based AL methods to complex open-set scenarios. Extensive experiments on multiple datasets with varying openness demonstrate that BUAL achieves state-of-the-art performance.  ( 2 min )
    ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
    arXiv:2402.15220v1 Announce Type: new Abstract: Self-attention is an essential component of large language models(LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLMs serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into the auxiliary prefix tree. Consequently, on top of the prefix-tree based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$\times$ compared to the start-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.  ( 2 min )
    Parameter-Free Algorithms for Performative Regret Minimization under Decision-Dependent Distributions
    arXiv:2402.15188v1 Announce Type: new Abstract: This paper studies performative risk minimization, a formulation of stochastic optimization under decision-dependent distributions. We consider the general case where the performative risk can be non-convex, for which we develop efficient parameter-free optimistic optimization-based methods. Our algorithms significantly improve upon the existing Lipschitz bandit-based method in many aspects. In particular, our framework does not require knowledge about the sensitivity parameter of the distribution map and the Lipshitz constant of the loss function. This makes our framework practically favorable, together with the efficient optimistic optimization-based tree-search mechanism. We provide experimental results that demonstrate the numerical superiority of our algorithms over the existing method and other black-box optimistic optimization methods.  ( 2 min )
    Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control
    arXiv:2402.15194v1 Announce Type: new Abstract: Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth "genuine" reward, as is the case in many practical applications. These challenges, collectively termed "reward collapse," pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models.  ( 2 min )
    GraphEdit: Large Language Models for Graph Structure Learning
    arXiv:2402.15183v1 Announce Type: new Abstract: Graph Structure Learning (GSL) focuses on capturing intrinsic dependencies and interactions among nodes in graph-structured data by generating novel graph structures. Graph Neural Networks (GNNs) have emerged as promising GSL solutions, utilizing recursive message passing to encode node-wise inter-dependencies. However, many existing GSL methods heavily depend on explicit graph structural information as supervision signals, leaving them susceptible to challenges such as data noise and sparsity. In this work, we propose GraphEdit, an approach that leverages large language models (LLMs) to learn complex node relationships in graph-structured data. By enhancing the reasoning capabilities of LLMs through instruction-tuning over graph structures, we aim to overcome the limitations associated with explicit graph structural information and enhance the reliability of graph structure learning. Our approach not only effectively denoises noisy connections but also identifies node-wise dependencies from a global perspective, providing a comprehensive understanding of the graph structure. We conduct extensive experiments on multiple benchmark datasets to demonstrate the effectiveness and robustness of GraphEdit across various settings. We have made our model implementation available at: https://github.com/HKUDS/GraphEdit.  ( 2 min )
    Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    arXiv:2402.15180v1 Announce Type: new Abstract: Caution: This paper includes offensive words that could potentially cause unpleasantness. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is extensive and makes it hard to respond to fast-developing attacks immediately, such as jailbreaks. We propose self-refine with formatting that achieves outstanding safety even in non-safety-aligned LMs and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we proposed a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We've also observed that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings can achieve less safety risk with fewer computational costs, allowing non-safety LM to be easily utilized in real-world service.  ( 2 min )
    Advancing Parameter Efficiency in Fine-tuning via Representation Editing
    arXiv:2402.15179v1 Announce Type: new Abstract: Parameter Efficient Fine-Tuning (PEFT) has gained significant attention for its ability to achieve competitive results while updating only a small subset of trainable parameters. Despite the promising performance of current PEFT methods, they present challenges in hyperparameter selection, such as determining the rank of LoRA or Adapter, or specifying the length of soft prompts. In addressing these challenges, we propose a novel approach to fine-tuning neural models, termed Representation EDiting (RED), which scales and biases the representation produced at each layer. RED substantially reduces the number of trainable parameters by a factor of $25,700$ compared to full parameter fine-tuning, and by a factor of $32$ compared to LoRA. Remarkably, RED achieves comparable or superior results to full parameter fine-tuning and other PEFT methods. Extensive experiments were conducted across models of varying architectures and scales, including RoBERTa, GPT-2, T5, and Llama-2, and the results demonstrate the efficiency and efficacy of RED, positioning it as a promising PEFT approach for large neural models.  ( 2 min )
    Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
    arXiv:2402.15175v1 Announce Type: new Abstract: Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in large language models, which challenge human intuition and are crucial for a deeper understanding of neural models. In this paper, we present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits. This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes. Our framework delineates four distinct training dynamics, each depending on varying combinations of model size and training data quantity. Utilizing this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results. Moreover, we expand our framework to the multi-task learning paradigm, demonstrating how algorithm tasks can be turned into emergent abilities. This offers a novel perspective to understand emergent abilities in Large Language Models.  ( 2 min )
    Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer
    arXiv:2402.15173v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizer for fine-tuning LLMs. What's more, HiZOO avoids the expensive memory cost and only increases one forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.  ( 2 min )
    Covariance-Adaptive Least-Squares Algorithm for Stochastic Combinatorial Semi-Bandits
    arXiv:2402.15171v1 Announce Type: new Abstract: We address the problem of stochastic combinatorial semi-bandits, where a player can select from P subsets of a set containing d base items. Most existing algorithms (e.g. CUCB, ESCB, OLS-UCB) require prior knowledge on the reward distribution, like an upper bound on a sub-Gaussian proxy-variance, which is hard to estimate tightly. In this work, we design a variance-adaptive version of OLS-UCB, relying on an online estimation of the covariance structure. Estimating the coefficients of a covariance matrix is much more manageable in practical settings and results in improved regret upper bounds compared to proxy variance-based algorithms. When covariance coefficients are all non-negative, we show that our approach efficiently leverages the semi-bandit feedback and provably outperforms bandit feedback approaches, not only in exponential regimes where P $\gg$ d but also when P $\le$ d, which is not straightforward from most existing analyses.  ( 2 min )
    The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling
    arXiv:2402.15170v1 Announce Type: new Abstract: With the incorporation of the UNet architecture, diffusion probabilistic models have become a dominant force in image generation tasks. One key design in UNet is the skip connections between the encoder and decoder blocks. Although skip connections have been shown to improve training stability and model performance, we reveal that such shortcuts can be a limiting factor for the complexity of the transformation. As the sampling steps decrease, the generation process and the role of the UNet get closer to the push-forward transformations from Gaussian distribution to the target, posing a challenge for the network's complexity. To address this challenge, we propose Skip-Tuning, a simple yet surprisingly effective training-free tuning method on the skip connections. Our method can achieve 100% FID improvement for pretrained EDM on ImageNet 64 with only 19 NFEs (1.75), breaking the limit of ODE samplers regardless of sampling steps. Surprisingly, the improvement persists when we increase the number of sampling steps and can even surpass the best result from EDM-2 (1.58) with only 39 NFEs (1.57). Comprehensive exploratory experiments are conducted to shed light on the surprising effectiveness. We observe that while Skip-Tuning increases the score-matching losses in the pixel space, the losses in the feature space are reduced, particularly at intermediate noise levels, which coincide with the most effective range accounting for image quality improvement.  ( 2 min )
    Studying the Impact of Stochasticity on the Evaluation of Deep Neural Networks for Forest-Fire Prediction
    arXiv:2402.15163v1 Announce Type: new Abstract: This paper presents the first systematic study of the evaluation of Deep Neural Networks (DNNs) for discrete dynamical systems under stochastic assumptions, with a focus on wildfire prediction. We develop a framework to study the impact of stochasticity on two classes of evaluation metrics: classification-based metrics, which assess fidelity to observed ground truth (GT), and proper scoring rules, which test fidelity-to-statistic. Our findings reveal that evaluating for fidelity-to-statistic is a reliable alternative in highly stochastic scenarios. We extend our analysis to real-world wildfire data, highlighting limitations in traditional wildfire prediction evaluation methods, and suggest interpretable stochasticity-compatible alternatives.  ( 2 min )
    Spatially-Aware Transformer Memory for Embodied Agents
    arXiv:2402.15160v1 Announce Type: new Abstract: Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at https://github.com/junmokane/spatially-aware-transformer.  ( 2 min )
    On the Duality Between Sharpness-Aware Minimization and Adversarial Training
    arXiv:2402.15152v1 Announce Type: new Abstract: Adversarial Training (AT), which adversarially perturb the input samples during training, has been acknowledged as one of the most effective defenses against adversarial attacks, yet suffers from a fundamental tradeoff that inevitably decreases clean accuracy. Instead of perturbing the samples, Sharpness-Aware Minimization (SAM) perturbs the model weights during training to find a more flat loss landscape and improve generalization. However, as SAM is designed for better clean accuracy, its effectiveness in enhancing adversarial robustness remains unexplored. In this work, considering the duality between SAM and AT, we investigate the adversarial robustness derived from SAM. Intriguingly, we find that using SAM alone can improve adversarial robustness. To understand this unexpected property of SAM, we first provide empirical and theoretical insights into how SAM can implicitly learn more robust features, and conduct comprehensive experiments to show that SAM can improve adversarial robustness notably without sacrificing any clean accuracy, shedding light on the potential of SAM to be a substitute for AT when accuracy comes at a higher priority. Code is available at https://github.com/weizeming/SAM_AT.  ( 2 min )
    The Cost of Parallelizing Boosting
    arXiv:2402.15145v1 Announce Type: new Abstract: We study the cost of parallelizing weak-to-strong boosting algorithms for learning, following the recent work of Karbasi and Larsen. Our main results are two-fold: - First, we prove a tight lower bound, showing that even "slight" parallelization of boosting requires an exponential blow-up in the complexity of training. Specifically, let $\gamma$ be the weak learner's advantage over random guessing. The famous \textsc{AdaBoost} algorithm produces an accurate hypothesis by interacting with the weak learner for $\tilde{O}(1 / \gamma^2)$ rounds where each round runs in polynomial time. Karbasi and Larsen showed that "significant" parallelization must incur exponential blow-up: Any boosting algorithm either interacts with the weak learner for $\Omega(1 / \gamma)$ rounds or incurs an $\exp(d / \gamma)$ blow-up in the complexity of training, where $d$ is the VC dimension of the hypothesis class. We close the gap by showing that any boosting algorithm either has $\Omega(1 / \gamma^2)$ rounds of interaction or incurs a smaller exponential blow-up of $\exp(d)$. -Complementing our lower bound, we show that there exists a boosting algorithm using $\tilde{O}(1/(t \gamma^2))$ rounds, and only suffer a blow-up of $\exp(d \cdot t^2)$. Plugging in $t = \omega(1)$, this shows that the smaller blow-up in our lower bound is tight. More interestingly, this provides the first trade-off between the parallelism and the total work required for boosting.  ( 2 min )
    Deep Coupling Network For Multivariate Time Series Forecasting
    arXiv:2402.15134v1 Announce Type: new Abstract: Multivariate time series (MTS) forecasting is crucial in many real-world applications. To achieve accurate MTS forecasting, it is essential to simultaneously consider both intra- and inter-series relationships among time series data. However, previous work has typically modeled intra- and inter-series relationships separately and has disregarded multi-order interactions present within and between time series data, which can seriously degrade forecasting accuracy. In this paper, we reexamine intra- and inter-series relationships from the perspective of mutual information and accordingly construct a comprehensive relationship learning mechanism tailored to simultaneously capture the intricate multi-order intra- and inter-series couplings. Based on the mechanism, we propose a novel deep coupling network for MTS forecasting, named DeepCN, which consists of a coupling mechanism dedicated to explicitly exploring the multi-order intra- and inter-series relationships among time series data concurrently, a coupled variable representation module aimed at encoding diverse variable patterns, and an inference module facilitating predictions through one forward step. Extensive experiments conducted on seven real-world datasets demonstrate that our proposed DeepCN achieves superior performance compared with the state-of-the-art baselines.  ( 2 min )
    Multi-Armed Bandits with Abstention
    arXiv:2402.15127v1 Announce Type: new Abstract: We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. Given this added layer of complexity, we ask whether we can develop efficient algorithms that are both asymptotically and minimax optimal. We answer this question affirmatively by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Numerical results further corroborate our theoretical findings.  ( 2 min )
    Accelerating Convergence of Stein Variational Gradient Descent via Deep Unfolding
    arXiv:2402.15125v1 Announce Type: new Abstract: Stein variational gradient descent (SVGD) is a prominent particle-based variational inference method used for sampling a target distribution. SVGD has attracted interest for application in machine-learning techniques such as Bayesian inference. In this paper, we propose novel trainable algorithms that incorporate a deep-learning technique called deep unfolding,into SVGD. This approach facilitates the learning of the internal parameters of SVGD, thereby accelerating its convergence speed. To evaluate the proposed trainable SVGD algorithms, we conducted numerical simulations of three tasks: sampling a one-dimensional Gaussian mixture, performing Bayesian logistic regression, and learning Bayesian neural networks. The results show that our proposed algorithms exhibit faster convergence than the conventional variants of SVGD.  ( 2 min )
    MSPipe: Efficient Temporal GNN Training via Staleness-aware Pipeline
    arXiv:2402.15113v1 Announce Type: new Abstract: Memory-based Temporal Graph Neural Networks (MTGNNs) are a class of temporal graph neural networks that utilize a node memory module to capture and retain long-term temporal dependencies, leading to superior performance compared to memory-less counterparts. However, the iterative reading and updating process of the memory module in MTGNNs to obtain up-to-date information needs to follow the temporal dependencies. This introduces significant overhead and limits training throughput. Existing optimizations for static GNNs are not directly applicable to MTGNNs due to differences in training paradigm, model architecture, and the absence of a memory module. Moreover, they do not effectively address the challenges posed by temporal dependencies, making them ineffective for MTGNN training. In this paper, we propose MSPipe, a general and efficient framework for MTGNNs that maximizes training throughput while maintaining model accuracy. Our design addresses the unique challenges associated with fetching and updating node memory states in MTGNNs by integrating staleness into the memory module. However, simply introducing a predefined staleness bound in the memory module to break temporal dependencies may lead to suboptimal performance and lack of generalizability across different models and datasets. To solve this, we introduce an online pipeline scheduling algorithm in MSPipe that strategically breaks temporal dependencies with minimal staleness and delays memory fetching to obtain fresher memory states. Moreover, we design a staleness mitigation mechanism to enhance training convergence and model accuracy. We provide convergence analysis and prove that MSPipe maintains the same convergence rate as vanilla sample-based GNN training. Experimental results show that MSPipe achieves up to 2.45x speed-up without sacrificing accuracy, making it a promising solution for efficient MTGNN training.  ( 3 min )
    Sampling-based Distributed Training with Message Passing Neural Network
    arXiv:2402.15106v1 Announce Type: new Abstract: In this study, we introduce a domain-decomposition-based distributed training and inference approach for message-passing neural networks (MPNN). Our objective is to address the challenge of scaling edge-based graph neural networks as the number of nodes increases. Through our distributed training approach, coupled with Nystr\"om-approximation sampling techniques, we present a scalable graph neural network, referred to as DS-MPNN (D and S standing for distributed and sampled, respectively), capable of scaling up to $O(10^5)$ nodes. We validate our sampling and distributed training approach on two cases: (a) a Darcy flow dataset and (b) steady RANS simulations of 2-D airfoils, providing comparisons with both single-GPU implementation and node-based graph convolution networks (GCNs). The DS-MPNN model demonstrates comparable accuracy to single-GPU implementation, can accommodate a significantly larger number of nodes compared to the single-GPU variant (S-MPNN), and significantly outperforms the node-based GCN.  ( 2 min )
    Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding
    arXiv:2402.15102v1 Announce Type: new Abstract: In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). The current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to a performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collecting and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving the dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on Alibaba display advertising platform demonstrate the effectiveness of our proposed method.  ( 2 min )
    Learning solution operators of PDEs defined on varying domains via MIONet
    arXiv:2402.15097v1 Announce Type: new Abstract: In this work, we propose a method to learn the solution operators of PDEs defined on varying domains via MIONet, and theoretically justify this method. We first extend the approximation theory of MIONet to further deal with metric spaces, establishing that MIONet can approximate mappings with multiple inputs in metric spaces. Subsequently, we construct a set consisting of some appropriate regions and provide a metric on this set thus make it a metric space, which satisfies the approximation condition of MIONet. Building upon the theoretical foundation, we are able to learn the solution mapping of a PDE with all the parameters varying, including the parameters of the differential operator, the right-hand side term, the boundary condition, as well as the domain. Without loss of generality, we for example perform the experiments for 2-d Poisson equations, where the domains and the right-hand side terms are varying. The results provide insights into the performance of this method across convex polygons, polar regions with smooth boundary, and predictions for different levels of discretization on one task. Reasonably, we point out that this is a meshless method, hence can be flexibly used as a general solver for a type of PDE.  ( 2 min )
    Cost-Adaptive Recourse Recommendation by Adaptive Preference Elicitation
    arXiv:2402.15073v1 Announce Type: new Abstract: Algorithmic recourse recommends a cost-efficient action to a subject to reverse an unfavorable machine learning classification decision. Most existing methods in the literature generate recourse under the assumption of complete knowledge about the cost function. In real-world practice, subjects could have distinct preferences, leading to incomplete information about the underlying cost function of the subject. This paper proposes a two-step approach integrating preference learning into the recourse generation problem. In the first step, we design a question-answering framework to refine the confidence set of the Mahalanobis matrix cost of the subject sequentially. Then, we generate recourse by utilizing two methods: gradient-based and graph-based cost-adaptive recourse that ensures validity while considering the whole confidence set of the cost matrix. The numerical evaluation demonstrates the benefits of our approach over state-of-the-art baselines in delivering cost-efficient recourse recommendations.  ( 2 min )
    Probabilistically-sound beam search with masked language models
    arXiv:2402.15020v1 Announce Type: new Abstract: Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. Nevertheless, estimating such distributions has applications in many domains, including protein engineering and ancient text restoration. We present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound modification with no additional computational complexity and demonstrate that it is superior to the aforementioned beam search in the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains.  ( 2 min )
    Enhancing One-Shot Federated Learning Through Data and Ensemble Co-Boosting
    arXiv:2402.15070v1 Announce Type: new Abstract: One-shot Federated Learning (OFL) has become a promising learning paradigm, enabling the training of a global server model via a single communication round. In OFL, the server model is aggregated by distilling knowledge from all client models (the ensemble), which are also responsible for synthesizing samples for distillation. In this regard, advanced works show that the performance of the server model is intrinsically related to the quality of the synthesized data and the ensemble model. To promote OFL, we introduce a novel framework, Co-Boosting, in which synthesized data and the ensemble model mutually enhance each other progressively. Specifically, Co-Boosting leverages the current ensemble model to synthesize higher-quality samples in an adversarial manner. These hard samples are then employed to promote the quality of the ensemble model by adjusting the ensembling weights for each client model. Consequently, Co-Boosting periodically achieves high-quality data and ensemble models. Extensive experiments demonstrate that Co-Boosting can substantially outperform existing baselines under various settings. Moreover, Co-Boosting eliminates the need for adjustments to the client's local training, requires no additional data or model transmission, and allows client models to have heterogeneous architectures.  ( 2 min )
    Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning
    arXiv:2402.15017v1 Announce Type: new Abstract: Foundation models have emerged as a powerful tool for many AI problems. Despite the tremendous success of foundation models, effective adaptation to new tasks, particularly those with limited labels, remains an open question and lacks theoretical understanding. An emerging solution with recent success in vision and NLP involves finetuning a foundation model on a selection of relevant tasks, before its adaptation to a target task with limited labeled samples. In this paper, we study the theoretical justification of this multitask finetuning approach. Our theoretical analysis reveals that with a diverse set of related tasks, this multitask finetuning leads to reduced error in the target task, in comparison to directly adapting the same pretrained model. We quantify the relationship between finetuning tasks and target tasks by diversity and consistency metrics, and further propose a practical task selection algorithm. We substantiate our theoretical claims with extensive empirical evidence. Further, we present results affirming our task selection algorithm adeptly chooses related finetuning tasks, providing advantages to the model performance on target tasks. We believe our study shed new light on the effective adaptation of foundation models to new tasks that lack abundant labels. Our code is available at https://github.com/OliverXUZY/Foudation-Model_Multitask.  ( 2 min )
    Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration
    arXiv:2402.15019v1 Announce Type: new Abstract: Research interests in the robustness of deep neural networks against domain shifts have been rapidly increasing in recent years. Most existing works, however, focus on improving the accuracy of the model, not the calibration performance which is another important requirement for trustworthy AI systems. Temperature scaling (TS), an accuracy-preserving post-hoc calibration method, has been proven to be effective in in-domain settings, but not in out-of-domain (OOD) due to the difficulty in obtaining a validation set for the unseen domain beforehand. In this paper, we propose consistency-guided temperature scaling (CTS), a new temperature scaling strategy that can significantly enhance the OOD calibration performance by providing mutual supervision among data samples in the source domains. Motivated by our observation that over-confidence stemming from inconsistent sample predictions is the main obstacle to OOD calibration, we propose to guide the scaling process by taking consistencies into account in terms of two different aspects -- style and content -- which are the key components that can well-represent data samples in multi-domain settings. Experimental results demonstrate that our proposed strategy outperforms existing works, achieving superior OOD calibration performance on various datasets. This can be accomplished by employing only the source domains without compromising accuracy, making our scheme directly applicable to various trustworthy AI systems.  ( 3 min )
    Quantum Theory and Application of Contextual Optimal Transport
    arXiv:2402.14991v1 Announce Type: new Abstract: Optimal Transport (OT) has fueled machine learning (ML) applications across many domains. In cases where paired data measurements ($\mu$, $\nu$) are coupled to a context variable $p_i$ , one may aspire to learn a global transportation map that can be parameterized through a potentially unseen con-text. Existing approaches utilize Neural OT and largely rely on Brenier's theorem. Here, we propose a first-of-its-kind quantum computing formulation for amortized optimization of contextualized transportation plans. We exploit a direct link between doubly stochastic matrices and unitary operators thus finding a natural connection between OT and quantum computation. We verify our method on synthetic and real data, by predicting variations in cell type distributions parameterized through drug dosage as context. Our comparisons to several baselines reveal that our method can capture dose-induced variations in cell distributions, even to some extent when dosages are extrapolated and sometimes with performance similar to the best classical models. In summary, this is a first step toward learning to predict contextualized transportation plans through quantum.  ( 2 min )
    Comparison of Machine Learning Classification Algorithms and Application to the Framingham Heart Study
    arXiv:2402.15005v1 Announce Type: new Abstract: The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While the exacerbation of biases can occur and compound during the problem selection, data collection, and outcome definition, this research pertains to some generalizability impediments that occur during the development and the post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distribution of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both the Extreme Gradient Boosting, and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduced and show that the double discriminant scoring of type I is the most generalizable as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male and female Framingham coronary heart disease data.  ( 2 min )
    Verifiable Boosted Tree Ensembles
    arXiv:2402.14988v1 Announce Type: new Abstract: Verifiable learning advocates for training machine learning models amenable to efficient security verification. Prior research demonstrated that specific classes of decision tree ensembles -- called large-spread ensembles -- allow for robustness verification in polynomial time against any norm-based attacker. This study expands prior work on verifiable learning from basic ensemble methods (i.e., hard majority voting) to advanced boosted tree ensembles, such as those trained using XGBoost or LightGBM. Our formal results indicate that robustness verification is achievable in polynomial time when considering attackers based on the $L_\infty$-norm, but remains NP-hard for other norm-based attackers. Nevertheless, we present a pseudo-polynomial time algorithm to verify robustness against attackers based on the $L_p$-norm for any $p \in \mathbb{N} \cup \{0\}$, which in practice grants excellent performance. Our experimental evaluation shows that large-spread boosted ensembles are accurate enough for practical adoption, while being amenable to efficient security verification.  ( 2 min )
    Stable Neural Stochastic Differential Equations in Analyzing Irregular Time Series Data
    arXiv:2402.14989v1 Announce Type: new Abstract: Irregular sampling intervals and missing values in real-world time series data present challenges for conventional methods that assume consistent intervals and complete data. Neural Ordinary Differential Equations (Neural ODEs) offer an alternative approach, utilizing neural networks combined with ODE solvers to learn continuous latent representations through parameterized vector fields. Neural Stochastic Differential Equations (Neural SDEs) extend Neural ODEs by incorporating a diffusion term, although this addition is not trivial, particularly when addressing irregular intervals and missing values. Consequently, careful design of drift and diffusion functions is crucial for maintaining stability and enhancing performance, while incautious choices can result in adverse properties such as the absence of strong solutions, stochastic destabilization, or unstable Euler discretizations, significantly affecting Neural SDEs' performance. In this study, we propose three stable classes of Neural SDEs: Langevin-type SDE, Linear Noise SDE, and Geometric SDE. Then, we rigorously demonstrate their robustness in maintaining excellent performance under distribution shift, while effectively preventing overfitting. To assess the effectiveness of our approach, we conduct extensive experiments on four benchmark datasets for interpolation, forecasting, and classification tasks, and analyze the robustness of our methods with 30 public datasets under different missing rates. Our results demonstrate the efficacy of the proposed method in handling real-world irregular time series data.  ( 2 min )
    Optimizing Language Models for Human Preferences is a Causal Inference Problem
    arXiv:2402.14979v1 Announce Type: new Abstract: As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.  ( 2 min )
    Privacy-Enhancing Collaborative Information Sharing through Federated Learning -- A Case of the Insurance Industry
    arXiv:2402.14983v1 Announce Type: new Abstract: The report demonstrates the benefits (in terms of improved claims loss modeling) of harnessing the value of Federated Learning (FL) to learn a single model across multiple insurance industry datasets without requiring the datasets themselves to be shared from one company to another. The application of FL addresses two of the most pressing concerns: limited data volume and data variety, which are caused by privacy concerns, the rarity of claim events, the lack of informative rating factors, etc.. During each round of FL, collaborators compute improvements on the model using their local private data, and these insights are combined to update a global model. Such aggregation of insights allows for an increase to the effectiveness in forecasting claims losses compared to models individually trained at each collaborator. Critically, this approach enables machine learning collaboration without the need for raw data to leave the compute infrastructure of each respective data owner. Additionally, the open-source framework, OpenFL, that is used in our experiments is designed so that it can be run using confidential computing as well as with additional algorithmic protections against leakage of information via the shared model updates. In such a way, FL is implemented as a privacy-enhancing collaborative learning technique that addresses the challenges posed by the sensitivity and privacy of data in traditional machine learning solutions. This paper's application of FL can also be expanded to other areas including fraud detection, catastrophe modeling, etc., that have a similar need to incorporate data privacy into machine learning collaborations. Our framework and empirical results provide a foundation for future collaborations among insurers, regulators, academic researchers, and InsurTech experts.  ( 3 min )
    SoK: Analyzing Adversarial Examples: A Framework to Study Adversary Knowledge
    arXiv:2402.14937v1 Announce Type: new Abstract: Adversarial examples are malicious inputs to machine learning models that trigger a misclassification. This type of attack has been studied for close to a decade, and we find that there is a lack of study and formalization of adversary knowledge when mounting attacks. This has yielded a complex space of attack research with hard-to-compare threat models and attacks. We focus on the image classification domain and provide a theoretical framework to study adversary knowledge inspired by work in order theory. We present an adversarial example game, inspired by cryptographic games, to standardize attacks. We survey recent attacks in the image classification domain and classify their adversary's knowledge in our framework. From this systematization, we compile results that both confirm existing beliefs about adversary knowledge, such as the potency of information about the attacked model as well as allow us to derive new conclusions on the difficulty associated with the white-box and transferable threat models, for example, that transferable attacks might not be as difficult as previously thought.  ( 2 min )
    Enhancing Power Quality Event Classification with AI Transformer Models
    arXiv:2402.14949v1 Announce Type: new Abstract: Recently, there has been a growing interest in utilizing machine learning for accurate classification of power quality events (PQEs). However, most of these studies are performed assuming an ideal situation, while in reality, we can have measurement noise, DC offset, and variations in the voltage signal's amplitude and frequency. Building on the prior PQE classification works using deep learning, this paper proposes a deep-learning framework that leverages attention-enabled Transformers as a tool to accurately classify PQEs under the aforementioned considerations. The proposed framework can operate directly on the voltage signals with no need for a separate feature extraction or calculation phase. Our results show that the proposed framework outperforms recently proposed learning-based techniques. It can accurately classify PQEs under the aforementioned conditions with an accuracy varying between 99.81%$-$91.43% depending on the signal-to-noise ratio, DC offsets, and variations in the signal amplitude and frequency.  ( 2 min )
    Boosting gets full Attention for Relational Learning
    arXiv:2402.14926v1 Announce Type: new Abstract: More often than not in benchmark supervised ML, tabular data is flat, i.e. consists of a single $m \times d$ (rows, columns) file, but cases abound in the real world where observations are described by a set of tables with structural relationships. Neural nets-based deep models are a classical fit to incorporate general topological dependence among description features (pixels, words, etc.), but their suboptimality to tree-based models on tabular data is still well documented. In this paper, we introduce an attention mechanism for structured data that blends well with tree-based models in the training context of (gradient) boosting. Each aggregated model is a tree whose training involves two steps: first, simple tabular models are learned descending tables in a top-down fashion with boosting's class residuals on tables' features. Second, what has been learned progresses back bottom-up via attention and aggregation mechanisms, progressively crafting new features that complete at the end the set of observation features over which a single tree is learned, boosting's iteration clock is incremented and new class residuals are computed. Experiments on simulated and real-world domains display the competitiveness of our method against a state of the art containing both tree-based and neural nets-based models.  ( 2 min )
    Federated Fairness without Access to Sensitive Groups
    arXiv:2402.14929v1 Announce Type: new Abstract: Current approaches to group fairness in federated learning assume the existence of predefined and labeled sensitive groups during training. However, due to factors ranging from emerging regulations to dynamics and location-dependency of protected groups, this assumption may be unsuitable in many real-world scenarios. In this work, we propose a new approach to guarantee group fairness that does not rely on any predefined definition of sensitive groups or additional labels. Our objective allows the federation to learn a Pareto efficient global model ensuring worst-case group fairness and it enables, via a single hyper-parameter, trade-offs between fairness and utility, subject only to a group size constraint. This implies that any sufficiently large subset of the population is guaranteed to receive at least a minimum level of utility performance from the model. The proposed objective encompasses existing approaches as special cases, such as empirical risk minimization and subgroup robustness objectives from centralized machine learning. We provide an algorithm to solve this problem in federation that enjoys convergence and excess risk guarantees. Our empirical results indicate that the proposed approach can effectively improve the worst-performing group that may be present without unnecessarily hurting the average performance, exhibits superior or comparable performance to relevant baselines, and achieves a large set of solutions with different fairness-utility trade-offs.  ( 2 min )
    MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
    arXiv:2402.14905v1 Announce Type: new Abstract: This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.  ( 2 min )
    Practical Insights into Knowledge Distillation for Pre-Trained Models
    arXiv:2402.14922v1 Announce Type: new Abstract: This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models, an emerging field in knowledge transfer with significant implications for distributed training and federated learning environments. These environments benefit from reduced communication demands and accommodate various model architectures. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application in these scenarios is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD. We assess these methods across various data distribution strategies to identify the most effective contexts for each. Through detailed examination of hyperparameter tuning, informed by extensive grid search evaluations, we pinpoint when adjustments are crucial to enhance model performance. This paper sheds light on optimal hyperparameter settings for distinct data partitioning scenarios and investigates KD's role in improving federated learning by minimizing communication rounds and expediting the training process. By filling a notable void in current research, our findings serve as a practical framework for leveraging KD in pre-trained models within collaborative and federated learning frameworks.  ( 2 min )
    Applying Reinforcement Learning to Optimize Traffic Light Cycles
    arXiv:2402.14886v1 Announce Type: new Abstract: Manual optimization of traffic light cycles is a complex and time-consuming task, necessitating the development of automated solutions. In this paper, we propose the application of reinforcement learning to optimize traffic light cycles in real-time. We present a case study using the Simulation Urban Mobility simulator to train a Deep Q-Network algorithm. The experimental results showed 44.16% decrease in the average number of Emergency stops, showing the potential of our approach to reduce traffic congestion and improve traffic flow. Furthermore, we discuss avenues for future research and enhancements to the reinforcement learning model.  ( 2 min )
    Efficient data selection employing Semantic Similarity-based Graph Structures for model training
    arXiv:2402.14888v1 Announce Type: new Abstract: Recent developments in natural language processing (NLP) have highlighted the need for substantial amounts of data for models to capture textual information accurately. This raises concerns regarding the computational resources and time required for training such models. This paper introduces Semantics for data SAliency in Model performance Estimation (SeSaME). It is an efficient data sampling mechanism solely based on textual information without passing the data through a compute-heavy model or other intensive pre-processing transformations. The application of this approach is demonstrated in the use case of low-resource automated speech recognition (ASR) models, which excessively rely on text-to-speech (TTS) calls when using augmented data. SeSaME learns to categorize new incoming data points into speech recognition difficulty buckets by employing semantic similarity-based graph structures and discrete ASR information from homophilous neighbourhoods through message passing. The results indicate reliable projections of ASR performance, with a 93% accuracy increase when using the proposed method compared to random predictions, bringing non-trivial information on the impact of textual representations in speech models. Furthermore, a series of experiments show both the benefits and challenges of using the ASR information on incoming data to fine-tune the model. We report a 7% drop in validation loss compared to random sampling, 7% WER drop with non-local aggregation when evaluating against a highly difficult dataset, and 1.8% WER drop with local aggregation and high semantic similarity between datasets.  ( 3 min )
    Energy-efficiency Limits on Training AI Systems using Learning-in-Memory
    arXiv:2402.14878v1 Announce Type: new Abstract: Learning-in-memory (LIM) is a recently proposed paradigm to overcome fundamental memory bottlenecks in training machine learning systems. While compute-in-memory (CIM) approaches can address the so-called memory-wall (i.e. energy dissipated due to repeated memory read access) they are agnostic to the energy dissipated due to repeated memory writes at the precision required for training (the update-wall), and they don't account for the energy dissipated when transferring information between short-term and long-term memories (the consolidation-wall). The LIM paradigm proposes that these bottlenecks, too, can be overcome if the energy barrier of physical memories is adaptively modulated such that the dynamics of memory updates and consolidation match the Lyapunov dynamics of gradient-descent training of an AI model. In this paper, we derive new theoretical lower bounds on energy dissipation when training AI systems using different LIM approaches. The analysis presented here is model-agnostic and highlights the trade-off between energy efficiency and the speed of training. The resulting non-equilibrium energy-efficiency bounds have a similar flavor as that of Landauer's energy-dissipation bounds. We also extend these limits by taking into account the number of floating-point operations (FLOPs) used for training, the size of the AI model, and the precision of the training parameters. Our projections suggest that the energy-dissipation lower-bound to train a brain scale AI system (comprising of $10^{15}$ parameters) using LIM is $10^8 \sim 10^9$ Joules, which is on the same magnitude the Landauer's adiabatic lower-bound and $6$ to $7$ orders of magnitude lower than the projections obtained using state-of-the-art AI accelerator hardware lower-bounds.  ( 3 min )
    Deep Generative Model-based Synthesis of Four-bar Linkage Mechanisms with Target Conditions
    arXiv:2402.14882v1 Announce Type: new Abstract: Mechanisms are essential components designed to perform specific tasks in various mechanical systems. However, designing a mechanism that satisfies certain kinematic or quasi-static requirements is a challenging task. The kinematic requirements may include the workspace of a mechanism, while the quasi-static requirements of a mechanism may include its torque transmission, which refers to the ability of the mechanism to transfer power and torque effectively. In this paper, we propose a deep learning-based generative model for generating multiple crank-rocker four-bar linkage mechanisms that satisfy both the kinematic and quasi-static requirements aforementioned. The proposed model is based on a conditional generative adversarial network (cGAN) with modifications for mechanism synthesis, which is trained to learn the relationship between the requirements of a mechanism with respect to linkage lengths. The results demonstrate that the proposed model successfully generates multiple distinct mechanisms that satisfy specific kinematic and quasi-static requirements. To evaluate the novelty of our approach, we provide a comparison of the samples synthesized by the proposed cGAN, traditional cVAE and NSGA-II. Our approach has several advantages over traditional design methods. It enables designers to efficiently generate multiple diverse and feasible design candidates while exploring a large design space. Also, the proposed model considers both the kinematic and quasi-static requirements, which can lead to more efficient and effective mechanisms for real-world use, making it a promising tool for linkage mechanism design.  ( 3 min )
    CloudNine: Analyzing Meteorological Observation Impact on Weather Prediction Using Explainable Graph Neural Networks
    arXiv:2402.14861v1 Announce Type: new Abstract: The impact of meteorological observations on weather forecasting varies with sensor type, location, time, and other environmental factors. Thus, quantitative analysis of observation impacts is crucial for effective and efficient development of weather forecasting systems. However, the existing impact analysis methods are difficult to be widely applied due to their high dependencies on specific forecasting systems. Also, they cannot provide observation impacts at multiple spatio-temporal scales, only global impacts of observation types. To address these issues, we present a novel system called ``CloudNine,'' which allows analysis of individual observations' impacts on specific predictions based on explainable graph neural networks (XGNNs). Combining an XGNN-based atmospheric state estimation model with a numerical weather prediction model, we provide a web application to search for observations in the 3D space of the Earth system and to visualize the impact of individual observations on predictions in specific spatial regions and time periods.  ( 2 min )
    APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
    arXiv:2402.14866v1 Announce Type: new Abstract: Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving an average of 4 bit width a 5.22 perplexity nearly equivalent to full precision in the C4 dataset. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24\% and 70.48\% at an average bitwidth of 3.8 in LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness to produce high-quality quantized LLMs.  ( 2 min )
  • Open

    User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient
    arXiv:1710.00095v4 Announce Type: replace-cross Abstract: In this paper, we study the problem of sampling from a given probability density function that is known to be smooth and strongly log-concave. We analyze several methods of approximate sampling based on discretizations of the (highly overdamped) Langevin diffusion and establish guarantees on its error measured in the Wasserstein-2 distance. Our guarantees improve or extend the state-of-the-art results in three directions. First, we provide an upper bound on the error of the first-order Langevin Monte Carlo (LMC) algorithm with optimized varying step-size. This result has the advantage of being horizon free (we do not need to know in advance the target precision) and to improve by a logarithmic factor the corresponding result for the constant step-size. Second, we study the case where accurate evaluations of the gradient of the log-density are unavailable, but one can have access to approximations of the aforementioned gradient. In such a situation, we consider both deterministic and stochastic approximations of the gradient and provide an upper bound on the sampling error of the first-order LMC that quantifies the impact of the gradient evaluation inaccuracies. Third, we establish upper bounds for two versions of the second-order LMC, which leverage the Hessian of the log-density. We provide nonasymptotic guarantees on the sampling error of these second-order LMCs. These guarantees reveal that the second-order LMC algorithms improve on the first-order LMC in ill-conditioned settings.  ( 3 min )
    DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models
    arXiv:2310.00902v2 Announce Type: replace-cross Abstract: Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled.  ( 2 min )
    Estimation of partially known Gaussian graphical models with score-based structural priors
    arXiv:2401.14340v3 Announce Type: replace Abstract: We propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or a maximum a posteriori criterion using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments demonstrate the benefits of our approach.  ( 2 min )
    Bernstein Flows for Flexible Posteriors in Variational Bayes
    arXiv:2202.05650v2 Announce Type: replace Abstract: Variational inference (VI) is a technique to approximate difficult to compute posteriors by optimization. In contrast to MCMC, VI scales to many observations. In the case of complex posteriors, however, state-of-the-art VI approaches often yield unsatisfactory posterior approximations. This paper presents Bernstein flow variational inference (BF-VI), a robust and easy-to-use method, flexible enough to approximate complex multivariate posteriors. BF-VI combines ideas from normalizing flows and Bernstein polynomial-based transformation models. In benchmark experiments, we compare BF-VI solutions with exact posteriors, MCMC solutions, and state-of-the-art VI methods including normalizing flow based VI. We show for low-dimensional models that BF-VI accurately approximates the true posterior; in higher-dimensional models, BF-VI outperforms other VI methods. Further, we develop with BF-VI a Bayesian model for the semi-structured Melanoma challenge data, combining a CNN model part for image data with an interpretable model part for tabular data, and demonstrate for the first time how the use of VI in semi-structured models.  ( 2 min )
    Efficient error and variance estimation for randomized matrix computations
    arXiv:2207.06342v4 Announce Type: replace-cross Abstract: Randomized matrix algorithms have become workhorse tools in scientific computing and machine learning. To use these algorithms safely in applications, they should be coupled with posterior error estimates to assess the quality of the output. To meet this need, this paper proposes two diagnostics: a leave-one-out error estimator for randomized low-rank approximations and a jackknife resampling method to estimate the variance of the output of a randomized matrix computation. Both of these diagnostics are rapid to compute for randomized low-rank approximation algorithms such as the randomized SVD and randomized Nystr\"om approximation, and they provide useful information that can be used to assess the quality of the computed output and guide algorithmic parameter choices.  ( 2 min )
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data
    arXiv:2302.00834v2 Announce Type: replace-cross Abstract: We study the interpolation power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints. Finally, as an application we give a lower bound on the approximation rates that deep ReLU neural networks can achieve for Sobolev spaces at the embedding endpoint.  ( 2 min )
    Learning thermodynamically constrained equations of state with uncertainty
    arXiv:2306.17004v2 Announce Type: replace-cross Abstract: Numerical simulations of high energy-density experiments require equation of state (EOS) models that relate a material's thermodynamic state variables -- specifically pressure, volume/density, energy, and temperature. EOS models are typically constructed using a semi-empirical parametric methodology, which assumes a physics-informed functional form with many tunable parameters calibrated using experimental/simulation data. Since there are inherent uncertainties in the calibration data (parametric uncertainty) and the assumed functional EOS form (model uncertainty), it is essential to perform uncertainty quantification (UQ) to improve confidence in the EOS predictions. Model uncertainty is challenging for UQ studies since it requires exploring the space of all possible physically consistent functional forms. Thus, it is often neglected in favor of parametric uncertainty, which is easier to quantify without violating thermodynamic laws. This work presents a data-driven machine learning approach to constructing EOS models that naturally captures model uncertainty while satisfying the necessary thermodynamic consistency and stability constraints. We propose a novel framework based on physics-informed Gaussian process regression (GPR) that automatically captures total uncertainty in the EOS and can be jointly trained on both simulation and experimental data sources. A GPR model for the shock Hugoniot is derived and its uncertainties are quantified using the proposed framework. We apply the proposed model to learn the EOS for the diamond solid state of carbon, using both density functional theory data and experimental shock Hugoniot data to train the model and show that the prediction uncertainty reduces by considering the thermodynamic constraints.  ( 3 min )
    Human-Aligned Calibration for AI-Assisted Decision Making
    arXiv:2306.00074v4 Announce Type: replace-cross Abstract: Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties at developing a good sense on when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values -- an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.  ( 3 min )
    Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
    arXiv:2210.11327v2 Announce Type: replace-cross Abstract: Real world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might have different contribution towards learning. This motivates studies to better understanding of the role of data instances with respect to their contribution in good metrics in models. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which the use of Decision Trees ensembles are still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics and a robust boosting algorithm. We show results on detecting noisy labels in order clean datasets, improving models' metrics in synthetic and real public datasets, as well as on a industry case in which we deployed a model based on the proposed solution.  ( 3 min )
    Causal Discovery from Conditionally Stationary Time Series
    arXiv:2110.06257v2 Announce Type: replace-cross Abstract: Causal discovery, i.e., inferring underlying causal relationships from observational data, has been shown to be highly challenging for AI systems. In time series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time-series. We develop a causal discovery approach to handle a wide class of non-stationary time-series that are conditionally stationary, where the non-stationary behaviour is modeled as stationarity conditioned on a set of (possibly hidden) state variables. Named State-Dependent Causal Inference (SDCI), our approach is able to recover the underlying causal dependencies, provably with fully-observed states and empirically with hidden states. The latter is confirmed by experiments on synthetic linear system and nonlinear particle interaction data, where SDCI achieves superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.  ( 2 min )
    Convolutional Deep Kernel Machines
    arXiv:2309.09814v2 Announce Type: replace Abstract: Standard infinite-width limits of neural networks sacrifice the ability for intermediate layers to learn representations from data. Recent work (A theory of representation learning gives a deep generalisation of kernel methods, Yang et al. 2023) modified the Neural Network Gaussian Process (NNGP) limit of Bayesian neural networks so that representation learning is retained. Furthermore, they found that applying this modified limit to a deep Gaussian process gives a practical learning algorithm which they dubbed the deep kernel machine (DKM). However, they only considered the simplest possible setting: regression in small, fully connected networks with e.g. 10 input features. Here, we introduce convolutional deep kernel machines. This required us to develop a novel inter-domain inducing point approximation, as well as introducing and experimentally assessing a number of techniques not previously seen in DKMs, including analogues to batch normalisation, different likelihoods, and different types of top-layer. The resulting model trains in roughly 77 GPU hours, achieving around 99% test accuracy on MNIST, 72% on CIFAR-100, and 92.7% on CIFAR-10, which is SOTA for kernel methods.  ( 2 min )
    Simultaneous off-the-grid learning of mixtures issued from a continuous dictionary
    arXiv:2210.16311v2 Announce Type: replace Abstract: In this paper we observe a set, possibly a continuum, of signals corrupted by noise. Each signal is a finite mixture of an unknown number of features belonging to a continuous dictionary. The continuous dictionary is parametrized by a real non-linear parameter. We shall assume that the signals share an underlying structure by assuming that each signal has its active features included in a finite and sparse set. We formulate regularized optimization problem to estimate simultaneously the linear coefficients in the mixtures and the non-linear parameters of the features. The optimization problem is composed of a data fidelity term and a $(\ell_1,L^p)$-penalty. We call its solution the Group-Nonlinear-Lasso and provide high probability bounds on the prediction error using certificate functions. Following recent works on the geometry of off-the-grid methods, we show that such functions can be constructed provided the parameters of the active features are pairwise separated by a constant with respect to a Riemannian metric.When the number of signals is finite and the noise is assumed Gaussian, we give refinements of our results for $p=1$ and $p=2$ using tail bounds on suprema of Gaussian and $\chi^2$ random processes. When $p=2$, our prediction error reaches the rates obtained by the Group-Lasso estimator in the multi-task linear regression model. Furthermore, for $p=2$ these prediction rates are faster than for $p=1$ when all signals share most of the non-linear parameters.  ( 3 min )
    Simulation-based inference using surjective sequential neural likelihood estimation
    arXiv:2308.01054v2 Announce Type: replace Abstract: We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel method for simulation-based inference in models where the evaluation of the likelihood function is not tractable and only a simulator that can generate synthetic data is available. SSNL fits a dimensionality-reducing surjective normalizing flow model and uses it as a surrogate likelihood function which allows for conventional Bayesian inference using either Markov chain Monte Carlo methods or variational inference. By embedding the data in a low-dimensional space, SSNL solves several issues previous likelihood-based methods had when applied to high-dimensional data sets that, for instance, contain non-informative data dimensions or lie along a lower-dimensional manifold. We evaluate SSNL on a wide variety of experiments and show that it generally outperforms contemporary methods used in simulation-based inference, for instance, on a challenging real-world example from astrophysics which models the magnetic field strength of the sun using a solar dynamo model.  ( 2 min )
    On Hypothesis Transfer Learning of Functional Linear Models
    arXiv:2206.04277v4 Announce Type: replace Abstract: We study the transfer learning (TL) for the functional linear regression (FLR) under the Reproducing Kernel Hilbert Space (RKHS) framework, observing the TL techniques in existing high-dimensional linear regression is not compatible with the truncation-based FLR methods as functional data are intrinsically infinite-dimensional and generated by smooth underlying processes. We measure the similarity across tasks using RKHS distance, allowing the type of information being transferred tied to the properties of the imposed RKHS. Building on the hypothesis offset transfer learning paradigm, two algorithms are proposed: one conducts the transfer when positive sources are known, while the other leverages aggregation techniques to achieve robust transfer without prior information about the sources. We establish lower bounds for this learning problem and show the proposed algorithms enjoy a matching asymptotic upper bound. These analyses provide statistical insights into factors that contribute to the dynamics of the transfer. We also extend the results to functional generalized linear models. The effectiveness of the proposed algorithms is demonstrated on extensive synthetic data as well as a financial data application.  ( 2 min )
    Interventional Causal Representation Learning
    arXiv:2209.11924v4 Announce Type: replace Abstract: Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.  ( 2 min )
    Adversarial Examples Detection with Bayesian Neural Network
    arXiv:2105.08620v3 Announce Type: replace Abstract: In this paper, we propose a new framework to detect adversarial examples motivated by the observations that random components can improve the smoothness of predictors and make it easier to simulate the output distribution of a deep neural network. With these observations, we propose a novel Bayesian adversarial example detector, short for BATer, to improve the performance of adversarial example detection. Specifically, we study the distributional difference of hidden layer output between natural and adversarial examples, and propose to use the randomness of the Bayesian neural network to simulate hidden layer output distribution and leverage the distribution dispersion to detect adversarial examples. The advantage of a Bayesian neural network is that the output is stochastic while a deep neural network without random components does not have such characteristics. Empirical results on several benchmark datasets against popular attacks show that the proposed BATer outperforms the state-of-the-art detectors in adversarial example detection.  ( 2 min )
    GROS: A General Robust Aggregation Strategy
    arXiv:2402.15442v1 Announce Type: cross Abstract: A new, very general, robust procedure for combining estimators in metric spaces is introduced GROS. The method is reminiscent of the well-known median of means, as described in \cite{devroye2016sub}. Initially, the sample is divided into $K$ groups. Subsequently, an estimator is computed for each group. Finally, these $K$ estimators are combined using a robust procedure. We prove that this estimator is sub-Gaussian and we get its break-down point, in the sense of Donoho. The robust procedure involves a minimization problem on a general metric space, but we show that the same (up to a constant) sub-Gaussianity is obtained if the minimization is taken over the sample, making GROS feasible in practice. The performance of GROS is evaluated through five simulation studies: the first one focuses on classification using $k$-means, the second one on the multi-armed bandit problem, the third one on the regression problem. The fourth one is the set estimation problem under a noisy model. Lastly, we apply GROS to get a robust persistent diagram.  ( 2 min )
    Transformers are Expressive, But Are They Expressive Enough for Regression?
    arXiv:2402.15478v1 Announce Type: cross Abstract: Transformers have become pivotal in Natural Language Processing, demonstrating remarkable success in applications like Machine Translation and Summarization. Given their widespread adoption, several works have attempted to analyze the expressivity of Transformers. Expressivity of a neural network is the class of functions it can approximate. A neural network is fully expressive if it can act as a universal function approximator. We attempt to analyze the same for Transformers. Contrary to existing claims, our findings reveal that Transformers struggle to reliably approximate continuous functions, relying on piecewise constant approximations with sizable intervals. The central question emerges as: "\textit{Are Transformers truly Universal Function Approximators}?" To address this, we conduct a thorough investigation, providing theoretical insights and supporting evidence through experiments. Our contributions include a theoretical analysis pinpointing the root of Transformers' limitation in function approximation and extensive experiments to verify the limitation. By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.  ( 2 min )
    The Impact of LoRA on the Emergence of Clusters in Transformers
    arXiv:2402.15415v1 Announce Type: cross Abstract: In this paper, we employ the mathematical framework on Transformers developed by \citet{sander2022sinkformers,geshkovski2023emergence,geshkovski2023mathematical} to explore how variations in attention parameters and initial token values impact the structural dynamics of token clusters. Our analysis demonstrates that while the clusters within a modified attention matrix dynamics can exhibit significant divergence from the original over extended periods, they maintain close similarities over shorter intervals, depending on the parameter differences. This work contributes to the fine-tuning field through practical applications to the LoRA algorithm \cite{hu2021lora,peft}, enhancing our understanding of the behavior of LoRA-enhanced Transformer models.  ( 2 min )
    Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture Models
    arXiv:2402.15432v1 Announce Type: cross Abstract: Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, notably emphasizing location-scale mixtures featuring Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. In such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm employing a Bregman divergence, is rate optimal.  ( 2 min )
    Information-Theoretic Safe Bayesian Optimization
    arXiv:2402.15347v1 Announce Type: cross Abstract: We consider a sequential decision making task, where the goal is to optimize an unknown function without evaluating parameters that violate an a~priori unknown (safety) constraint. A common approach is to place a Gaussian process prior on the unknown functions and allow evaluations only in regions that are safe with high probability. Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case. Moreover, the way in which they exploit regularity assumptions about the constraint introduces an additional critical hyperparameter. In this paper, we propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate. The combination of this exploration criterion with a well known Bayesian optimization acquisition function yields a novel safe Bayesian optimization selection criterion. Our approach is naturally applicable to continuous domains and does not require additional explicit hyperparameters. We theoretically analyze the method and show that we do not violate the safety constraint with high probability and that we learn about the value of the safe optimum up to arbitrary precision. Empirical evaluations demonstrate improved data-efficiency and scalability.  ( 2 min )
    Rapid Bayesian identification of sparse nonlinear dynamics from scarce and noisy data
    arXiv:2402.15357v1 Announce Type: cross Abstract: We propose a fast probabilistic framework for identifying differential equations governing the dynamics of observed data. We recast the SINDy method within a Bayesian framework and use Gaussian approximations for the prior and likelihood to speed up computation. The resulting method, Bayesian-SINDy, not only quantifies uncertainty in the parameters estimated but also is more robust when learning the correct model from limited and noisy data. Using both synthetic and real-life examples such as Lynx-Hare population dynamics, we demonstrate the effectiveness of the new framework in learning correct model equations and compare its computational and data efficiency with existing methods. Because Bayesian-SINDy can quickly assimilate data and is robust against noise, it is particularly suitable for biological data and real-time system identification in control. Its probabilistic framework also enables the calculation of information entropy, laying the foundation for an active learning strategy.  ( 2 min )
    Categorical Deep Learning: An Algebraic Theory of Architectures
    arXiv:2402.15332v1 Announce Type: cross Abstract: We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent bridge between specifying constraints which models must satisfy and specifying their implementations. Focusing on building a such a bridge, we propose to apply category theory -- precisely, the universal algebra of monads valued in a 2-category of parametric maps -- as a single theory elegantly subsuming both of these flavours of neural network design. To defend our position, we show how this theory recovers constraints induced by geometric deep learning, as well as implementations of many architectures drawn from the diverse landscape of neural networks, such as RNNs. We also illustrate how the theory naturally encodes many standard constructs in computer science and automata theory.  ( 2 min )
    Fourier Basis Density Model
    arXiv:2402.15345v1 Announce Type: cross Abstract: We introduce a lightweight, flexible and end-to-end trainable probability density model parameterized by a constrained Fourier basis. We assess its performance at approximating a range of multi-modal 1D densities, which are generally difficult to fit. In comparison to the deep factorized model introduced in [1], our model achieves a lower cross entropy at a similar computational budget. In addition, we also evaluate our method on a toy compression task, demonstrating its utility in learned compression.  ( 2 min )
    Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control
    arXiv:2402.15194v1 Announce Type: cross Abstract: Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth "genuine" reward, as is the case in many practical applications. These challenges, collectively termed "reward collapse," pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models.  ( 2 min )
    Classification of compact radio sources in the Galactic plane with supervised machine learning
    arXiv:2402.15232v1 Announce Type: cross Abstract: Generation of science-ready data from processed data products is one of the major challenges in next-generation radio continuum surveys with the Square Kilometre Array (SKA) and its precursors, due to the expected data volume and the need to achieve a high degree of automated processing. Source extraction, characterization, and classification are the major stages involved in this process. In this work we focus on the classification of compact radio sources in the Galactic plane using both radio and infrared images as inputs. To this aim, we produced a curated dataset of ~20,000 images of compact sources of different astronomical classes, obtained from past radio and infrared surveys, and novel radio data from pilot surveys carried out with the Australian SKA Pathfinder (ASKAP). Radio spectral index information was also obtained for a subset of the data. We then trained two different classifiers on the produced dataset. The first model uses gradient-boosted decision trees and is trained on a set of pre-computed features derived from the data, which include radio-infrared colour indices and the radio spectral index. The second model is trained directly on multi-channel images, employing convolutional neural networks. Using a completely supervised procedure, we obtained a high classification accuracy (F1-score>90%) for separating Galactic objects from the extragalactic background. Individual class discrimination performances, ranging from 60% to 75%, increased by 10% when adding far-infrared and spectral index information, with extragalactic objects, PNe and HII regions identified with higher accuracies. The implemented tools and trained models were publicly released, and made available to the radioastronomical community for future application on new radio data.  ( 3 min )
    Benchmarking Observational Studies with Experimental Data under Right-Censoring
    arXiv:2402.15137v1 Announce Type: cross Abstract: Drawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We consider two cases where censoring time (1) is independent of time-to-event and (2) depends on time-to-event the same way in OS and RCT. For the former, we adopt a censoring-doubly-robust signal for the conditional average treatment effect (CATE) to facilitate an equivalence test of CATEs in OS and RCT, which serves as a proxy for testing if the validity assumptions hold. For the latter, we show that the same test can still be used even though unbiased CATE estimation may not be possible. We verify the effectiveness of our censoring-aware tests via semi-synthetic experiments and analyze RCT and OS data from the Women's Health Initiative study.  ( 2 min )
    Covariance-Adaptive Least-Squares Algorithm for Stochastic Combinatorial Semi-Bandits
    arXiv:2402.15171v1 Announce Type: cross Abstract: We address the problem of stochastic combinatorial semi-bandits, where a player can select from P subsets of a set containing d base items. Most existing algorithms (e.g. CUCB, ESCB, OLS-UCB) require prior knowledge on the reward distribution, like an upper bound on a sub-Gaussian proxy-variance, which is hard to estimate tightly. In this work, we design a variance-adaptive version of OLS-UCB, relying on an online estimation of the covariance structure. Estimating the coefficients of a covariance matrix is much more manageable in practical settings and results in improved regret upper bounds compared to proxy variance-based algorithms. When covariance coefficients are all non-negative, we show that our approach efficiently leverages the semi-bandit feedback and provably outperforms bandit feedback approaches, not only in exponential regimes where P $\gg$ d but also when P $\le$ d, which is not straightforward from most existing analyses.  ( 2 min )
    Accelerating Convergence of Stein Variational Gradient Descent via Deep Unfolding
    arXiv:2402.15125v1 Announce Type: cross Abstract: Stein variational gradient descent (SVGD) is a prominent particle-based variational inference method used for sampling a target distribution. SVGD has attracted interest for application in machine-learning techniques such as Bayesian inference. In this paper, we propose novel trainable algorithms that incorporate a deep-learning technique called deep unfolding,into SVGD. This approach facilitates the learning of the internal parameters of SVGD, thereby accelerating its convergence speed. To evaluate the proposed trainable SVGD algorithms, we conducted numerical simulations of three tasks: sampling a one-dimensional Gaussian mixture, performing Bayesian logistic regression, and learning Bayesian neural networks. The results show that our proposed algorithms exhibit faster convergence than the conventional variants of SVGD.  ( 2 min )
    Multi-Armed Bandits with Abstention
    arXiv:2402.15127v1 Announce Type: cross Abstract: We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. Given this added layer of complexity, we ask whether we can develop efficient algorithms that are both asymptotically and minimax optimal. We answer this question affirmatively by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Numerical results further corroborate our theoretical findings.  ( 2 min )
    Comparison of Machine Learning Classification Algorithms and Application to the Framingham Heart Study
    arXiv:2402.15005v1 Announce Type: cross Abstract: The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While the exacerbation of biases can occur and compound during the problem selection, data collection, and outcome definition, this research pertains to some generalizability impediments that occur during the development and the post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distribution of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both the Extreme Gradient Boosting, and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduced and show that the double discriminant scoring of type I is the most generalizable as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male and female Framingham coronary heart disease data.  ( 2 min )
    Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration
    arXiv:2402.15019v1 Announce Type: cross Abstract: Research interests in the robustness of deep neural networks against domain shifts have been rapidly increasing in recent years. Most existing works, however, focus on improving the accuracy of the model, not the calibration performance which is another important requirement for trustworthy AI systems. Temperature scaling (TS), an accuracy-preserving post-hoc calibration method, has been proven to be effective in in-domain settings, but not in out-of-domain (OOD) due to the difficulty in obtaining a validation set for the unseen domain beforehand. In this paper, we propose consistency-guided temperature scaling (CTS), a new temperature scaling strategy that can significantly enhance the OOD calibration performance by providing mutual supervision among data samples in the source domains. Motivated by our observation that over-confidence stemming from inconsistent sample predictions is the main obstacle to OOD calibration, we propose to guide the scaling process by taking consistencies into account in terms of two different aspects -- style and content -- which are the key components that can well-represent data samples in multi-domain settings. Experimental results demonstrate that our proposed strategy outperforms existing works, achieving superior OOD calibration performance on various datasets. This can be accomplished by employing only the source domains without compromising accuracy, making our scheme directly applicable to various trustworthy AI systems.  ( 3 min )
    Verifiable Boosted Tree Ensembles
    arXiv:2402.14988v1 Announce Type: cross Abstract: Verifiable learning advocates for training machine learning models amenable to efficient security verification. Prior research demonstrated that specific classes of decision tree ensembles -- called large-spread ensembles -- allow for robustness verification in polynomial time against any norm-based attacker. This study expands prior work on verifiable learning from basic ensemble methods (i.e., hard majority voting) to advanced boosted tree ensembles, such as those trained using XGBoost or LightGBM. Our formal results indicate that robustness verification is achievable in polynomial time when considering attackers based on the $L_\infty$-norm, but remains NP-hard for other norm-based attackers. Nevertheless, we present a pseudo-polynomial time algorithm to verify robustness against attackers based on the $L_p$-norm for any $p \in \mathbb{N} \cup \{0\}$, which in practice grants excellent performance. Our experimental evaluation shows that large-spread boosted ensembles are accurate enough for practical adoption, while being amenable to efficient security verification.  ( 2 min )
    tinyBenchmarks: evaluating LLMs with fewer examples
    arXiv:2402.14992v1 Announce Type: cross Abstract: The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.  ( 2 min )
    Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data
    arXiv:2402.14980v1 Announce Type: cross Abstract: Rapid advancements in genome sequencing have led to the collection of vast amounts of genomics data. Researchers may be interested in using machine learning models on such data to predict the pathogenicity or clinical significance of a genetic mutation. However, many genetic datasets contain imbalanced target variables that pose challenges to machine learning models: observations are skewed/imbalanced in regression tasks or class-imbalanced in classification tasks. Genetic datasets are also often high-cardinal and contain skewed predictor variables, which poses further challenges. We aimed to investigate the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on these datasets. We measured performance with 5-fold cross-validation and compared averaged r-squared and accuracy metrics across different combinations of techniques. We found that outliers/skew in predictor or target variables did not pose a challenge to regression models. We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance. Random forest was the best model to use for imbalanced regression tasks. While our study uses a genetic dataset as an example of a real-world application, our findings can be generalized to any similar datasets.  ( 3 min )
    Nonsmooth Nonparametric Regression via Fractional Laplacian Eigenmaps
    arXiv:2402.14985v1 Announce Type: cross Abstract: We develop nonparametric regression methods for the case when the true regression function is not necessarily smooth. More specifically, our approach is using the fractional Laplacian and is designed to handle the case when the true regression function lies in an $L_2$-fractional Sobolev space with order $s\in (0,1)$. This function class is a Hilbert space lying between the space of square-integrable functions and the first-order Sobolev space consisting of differentiable functions. It contains fractional power functions, piecewise constant or polynomial functions and bump function as canonical examples. For the proposed approach, we prove upper bounds on the in-sample mean-squared estimation error of order $n^{-\frac{2s}{2s+d}}$, where $d$ is the dimension, $s$ is the aforementioned order parameter and $n$ is the number of observations. We also provide preliminary empirical results validating the practical performance of the developed estimators.  ( 2 min )
    Partial Search in a Frozen Network is Enough to Find a Strong Lottery Ticket
    arXiv:2402.14029v1 Announce Type: cross Abstract: Randomly initialized dense networks contain subnetworks that achieve high accuracy without weight learning -- strong lottery tickets (SLTs). Recently, Gadhikar et al. (2023) demonstrated theoretically and experimentally that SLTs can also be found within a randomly pruned source network, thus reducing the SLT search space. However, this limits the search to SLTs that are even sparser than the source, leading to worse accuracy due to unintentionally high sparsity. This paper proposes a method that reduces the SLT search space by an arbitrary ratio that is independent of the desired SLT sparsity. A random subset of the initial weights is excluded from the search space by freezing it -- i.e., by either permanently pruning them or locking them as a fixed part of the SLT. Indeed, the SLT existence in such a reduced search space is theoretically guaranteed by our subset-sum approximation with randomly frozen variables. In addition to reducing search space, the random freezing pattern can also be exploited to reduce model size in inference. Furthermore, experimental results show that the proposed method finds SLTs with better accuracy and model size trade-off than the SLTs obtained from dense or randomly pruned source networks. In particular, the SLT found in a frozen graph neural network achieves higher accuracy than its weight trained counterpart while reducing model size by $40.3\times$.  ( 3 min )
    A Causal Framework to Evaluate Racial Bias in Law Enforcement Systems
    arXiv:2402.14959v1 Announce Type: cross Abstract: We are interested in developing a data-driven method to evaluate race-induced biases in law enforcement systems. While the recent works have addressed this question in the context of police-civilian interactions using police stop data, they have two key limitations. First, bias can only be properly quantified if true criminality is accounted for in addition to race, but it is absent in prior works. Second, law enforcement systems are multi-stage and hence it is important to isolate the true source of bias within the "causal chain of interactions" rather than simply focusing on the end outcome; this can help guide reforms. In this work, we address these challenges by presenting a multi-stage causal framework incorporating criminality. We provide a theoretical characterization and an associated data-driven method to evaluate (a) the presence of any form of racial bias, and (b) if so, the primary source of such a bias in terms of race and criminality. Our framework identifies three canonical scenarios with distinct characteristics: in settings like (1) airport security, the primary source of observed bias against a race is likely to be bias in law enforcement against innocents of that race; (2) AI-empowered policing, the primary source of observed bias against a race is likely to be bias in law enforcement against criminals of that race; and (3) police-civilian interaction, the primary source of observed bias against a race could be bias in law enforcement against that race or bias from the general public in reporting against the other race. Through an extensive empirical study using police-civilian interaction data and 911 call data, we find an instance of such a counter-intuitive phenomenon: in New Orleans, the observed bias is against the majority race and the likely reason for it is the over-reporting (via 911 calls) of incidents involving the minority race by the general public.  ( 3 min )
    Efficient semi-supervised inference for logistic regression under case-control studies
    arXiv:2402.15365v1 Announce Type: new Abstract: Semi-supervised learning has received increasingly attention in statistics and machine learning. In semi-supervised learning settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are collected. We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary and the labeled data is collected by case-control sampling. Case-control sampling is an effective sampling scheme for alleviating imbalance structure in binary data. Under the logistic model assumption, case-control data can still provide consistent estimator for the slope parameter of the regression model. However, the intercept parameter is not identifiable. Consequently, the marginal case proportion cannot be estimated from case-control data. We find out that with the availability of the unlabeled data, the intercept parameter can be identified in semi-supervised learning setting. We construct the likelihood function of the observed labeled and unlabeled data and obtain the maximum likelihood estimator via an iterative algorithm. The proposed estimator is shown to be consistent, asymptotically normal, and semiparametrically efficient. Extensive simulation studies are conducted to show the finite sample performance of the proposed method. The results imply that the unlabeled data not only helps to identify the intercept but also improves the estimation efficiency of the slope parameter. Meanwhile, the marginal case proportion can be estimated accurately by the proposed method.  ( 2 min )
    Lasso with Latents: Efficient Estimation, Covariate Rescaling, and Computational-Statistical Gaps
    arXiv:2402.15409v1 Announce Type: new Abstract: It is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impossible to close this gap in general. In this work, we propose a natural sparse linear regression setting where strong correlations between covariates arise from unobserved latent variables. In this setting, we analyze the problem caused by strong correlations and design a surprisingly simple fix. While Lasso with standard normalization of covariates fails, there exists a heterogeneous scaling of the covariates with which Lasso will suddenly obtain strong provable guarantees for estimation. Moreover, we design a simple, efficient procedure for computing such a "smart scaling." The sample complexity of the resulting "rescaled Lasso" algorithm incurs (in the worst case) quadratic dependence on the sparsity of the underlying signal. While this dependence is not information-theoretically necessary, we give evidence that it is optimal among the class of polynomial-time algorithms, via the method of low-degree polynomials. This argument reveals a new connection between sparse linear regression and a special version of sparse PCA with a near-critical negative spike. The latter problem can be thought of as a real-valued analogue of learning a sparse parity. Using it, we also establish the first computational-statistical gap for the closely related problem of learning a Gaussian Graphical Model.  ( 3 min )
    Generative Modelling with Tensor Train approximations of Hamilton--Jacobi--Bellman equations
    arXiv:2402.15285v1 Announce Type: new Abstract: Sampling from probability densities is a common challenge in fields such as Uncertainty Quantification (UQ) and Generative Modelling (GM). In GM in particular, the use of reverse-time diffusion processes depending on the log-densities of Ornstein-Uhlenbeck forward processes are a popular sampling tool. In Berner et al. [2022] the authors point out that these log-densities can be obtained by solution of a \textit{Hamilton-Jacobi-Bellman} (HJB) equation known from stochastic optimal control. While this HJB equation is usually treated with indirect methods such as policy iteration and unsupervised training of black-box architectures like Neural Networks, we propose instead to solve the HJB equation by direct time integration, using compressed polynomials represented in the Tensor Train (TT) format for spatial discretization. Crucially, this method is sample-free, agnostic to normalization constants and can avoid the curse of dimensionality due to the TT compression. We provide a complete derivation of the HJB equation's action on Tensor Train polynomials and demonstrate the performance of the proposed time-step-, rank- and degree-adaptive integration method on a nonlinear sampling task in 20 dimensions.  ( 2 min )
    Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates
    arXiv:2402.15344v1 Announce Type: new Abstract: The performance of stochastic gradient descent (SGD), which is the simplest first-order optimizer for training deep neural networks, depends on not only the learning rate but also the batch size. They both affect the number of iterations and the stochastic first-order oracle (SFO) complexity needed for training. In particular, the previous numerical results indicated that, for SGD using a constant learning rate, the number of iterations needed for training decreases when the batch size increases, and the SFO complexity needed for training is minimized at a critical batch size and that it increases once the batch size exceeds that size. Here, we study the relationship between batch size and the iteration and SFO complexities needed for nonconvex optimization in deep learning with SGD using constant or decaying learning rates and show that SGD using the critical batch size minimizes the SFO complexity. We also provide numerical comparisons of SGD with the existing first-order optimizers and show the usefulness of SGD using a critical batch size. Moreover, we show that measured critical batch sizes are close to the sizes estimated from our theoretical results.  ( 2 min )
    Physics-constrained polynomial chaos expansion for scientific machine learning and uncertainty quantification
    arXiv:2402.15115v1 Announce Type: new Abstract: We present a novel physics-constrained polynomial chaos expansion as a surrogate modeling method capable of performing both scientific machine learning (SciML) and uncertainty quantification (UQ) tasks. The proposed method possesses a unique capability: it seamlessly integrates SciML into UQ and vice versa, which allows it to quantify the uncertainties in SciML tasks effectively and leverage SciML for improved uncertainty assessment during UQ-related tasks. The proposed surrogate model can effectively incorporate a variety of physical constraints, such as governing partial differential equations (PDEs) with associated initial and boundary conditions constraints, inequality-type constraints (e.g., monotonicity, convexity, non-negativity, among others), and additional a priori information in the training process to supplement limited data. This ensures physically realistic predictions and significantly reduces the need for expensive computational model evaluations to train the surrogate model. Furthermore, the proposed method has a built-in uncertainty quantification (UQ) feature to efficiently estimate output uncertainties. To demonstrate the effectiveness of the proposed method, we apply it to a diverse set of problems, including linear/non-linear PDEs with deterministic and stochastic parameters, data-driven surrogate modeling of a complex physical system, and UQ of a stochastic system with parameters modeled as random fields.  ( 2 min )
    Statistical Agnostic Regression: a machine learning method to validate regression models
    arXiv:2402.15213v1 Announce Type: new Abstract: Regression analysis is a central topic in statistical modeling, aiming to estimate the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in several fields of research, such as prediction, forecasting, or causal inference. Beyond various classical methods to solve linear regression problems, such as Ordinary Least Squares, Ridge, or Lasso regressions - which are often the foundation for more advanced machine learning (ML) techniques - the latter have been successfully applied in this scenario without a formal definition of statistical significance. At most, permutation or classical analyses based on empirical measures (e.g., residuals or accuracy) have been conducted to reflect the greater ability of ML estimations for detection. In this paper, we introduce a method, named Statistical Agnostic Regression (SAR), for evaluating the statistical significance of an ML-based linear regression based on concentration inequalities of the actual risk using the analysis of the worst case. To achieve this goal, similar to the classification problem, we define a threshold to establish that there is sufficient evidence with a probability of at least 1-eta to conclude that there is a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations in only two dimensions demonstrate the ability of the proposed agnostic test to provide a similar analysis of variance given by the classical $F$ test for the slope parameter.  ( 3 min )
    Nonlinear Bayesian optimal experimental design using logarithmic Sobolev inequalities
    arXiv:2402.15053v1 Announce Type: new Abstract: We study the problem of selecting $k$ experiments from a larger candidate pool, where the goal is to maximize mutual information (MI) between the selected subset and the underlying parameters. Finding the exact solution is to this combinatorial optimization problem is computationally costly, not only due to the complexity of the combinatorial search but also the difficulty of evaluating MI in nonlinear/non-Gaussian settings. We propose greedy approaches based on new computationally inexpensive lower bounds for MI, constructed via log-Sobolev inequalities. We demonstrate that our method outperforms random selection strategies, Gaussian approximations, and nested Monte Carlo (NMC) estimators of MI in various settings, including optimal design for nonlinear models with non-additive noise.  ( 2 min )
    On the Performance of Empirical Risk Minimization with Smoothed Data
    arXiv:2402.14987v1 Announce Type: new Abstract: In order to circumvent statistical and computational hardness results in sequential decision-making, recent work has considered smoothed online learning, where the distribution of data at each time is assumed to have bounded likeliehood ratio with respect to a base measure when conditioned on the history. While previous works have demonstrated the benefits of smoothness, they have either assumed that the base measure is known to the learner or have presented computationally inefficient algorithms applying only in special cases. This work investigates the more general setting where the base measure is \emph{unknown} to the learner, focusing in particular on the performance of Empirical Risk Minimization (ERM) with square loss when the data are well-specified and smooth. We show that in this setting, ERM is able to achieve sublinear error whenever a class is learnable with iid data; in particular, ERM achieves error scaling as $\tilde O( \sqrt{\mathrm{comp}(\mathcal F)\cdot T} )$, where $\mathrm{comp}(\mathcal F)$ is the statistical complexity of learning $\mathcal F$ with iid data. In so doing, we prove a novel norm comparison bound for smoothed data that comprises the first sharp norm comparison for dependent data applying to arbitrary, nonlinear function classes. We complement these results with a lower bound indicating that our analysis of ERM is essentially tight, establishing a separation in the performance of ERM between smoothed and iid data.  ( 2 min )
    Smoothness Adaptive Hypothesis Transfer Learning
    arXiv:2402.14966v1 Announce Type: new Abstract: Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset in practice. In this paper, we address these problems by proposing Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression(KRR)-based algorithm. We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality and derive an adaptive procedure to the unknown Sobolev smoothness. Leveraging these results, SATL employs Gaussian kernels in both phases so that the estimators can adapt to the unknown smoothness of the target/source and their offset function. We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor. The minimax convergence rate sheds light on the factors influencing transfer dynamics and demonstrates the superiority of SATL compared to non-transfer learning settings. While our main objective is a theoretical analysis, we also conduct several experiments to confirm our results.  ( 2 min )
    In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization
    arXiv:2402.14951v1 Announce Type: new Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$), in the sense that every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.  ( 2 min )

  • Open

    Need help to find a free AI enhancer
    As the title suggests; does anyone here know the specific app that makes the images as smooth as those that I put down right here? The left is unenhanced and the right one is enhanced. I already sent a message to the one who posted the enhanced one and I didn't get anything back from them so I'm wondering if anyone here knows what the app is. I already looked around and can't find it because other apps either don't do it at all or do a poor job of enhancing it. Thanks in advance if anyone knows! submitted by /u/YaBoiRambo1982 [link] [comments]
    Noticed that any ai generated religious content bots are received with positive attention for some reason but WHY….its a content bot isn’t that like….not good?
    But why lol… submitted by /u/Impossible_Belt_7757 [link] [comments]
    India confronts Google over Gemini AI tool’s ‘fascist Modi’ responses
    submitted by /u/donutloop [link] [comments]
    PSA: I want the community to be aware of Cursor.sh, an IDE fork of VSCode that natively integrates GPT-4 that can take your entire codebase into its context window to generate results
    submitted by /u/holy_moley_ravioli_ [link] [comments]
    Help for a PM with a technical background to improve AI skills
    Hey, I’m looking for some advice related to improving my skills in AI. I am currently a product manager in a tech company and, like the rest of the world, we have pivoted to using AI for a whole set of things. I’m looking to improve my knowledge of Generative (and other forms of cutting edge) AI so that I am better able to understand how it can be used for products, as well investigate frontier use cases. Although I have a technical background in AI from my university days, that was over 15 years ago and obviously the world has changed dramatically since then. But given that I’ve continued to keep doing AI courses since then, I still have enough knowledge of a lot of AI basics, and I don’t see anything more another course Coursera or Udacity course or two would teach me. So how do I bes…
    Sora generating some imaginary animals
    submitted by /u/drgoldenpants [link] [comments]
    You can now make any image into interactive, playable enviornments using Google's Genie
    submitted by /u/holy_moley_ravioli_ [link] [comments]
    Microsoft partners with Mistral in second AI deal beyond OpenAI
    submitted by /u/jaketocake [link] [comments]
    What happens when we outsource boring but important work to AI? Research shows we forget how to do it ourselves
    submitted by /u/Jariiari7 [link] [comments]
    One-Minute Daily AI News 2/25/2024
    Huawei spin-off Honor shows off tech to control a car with your eyes and chatbot based on Meta’s AI.[2] Construction workers being replaced by AI robot bricklayers.[2] Apple Researchers Introduce Keyframer: An LLM-Powered Animation Prototyping Tool that can Generate Animations from Static Images.[3] Google Unveils Gemma: Open Source AI Models for Enhanced Accessibility.[4] Sources: [1] https://www.cnbc.com/2024/02/25/honor-shows-off-tech-in-magic-6-pro-to-control-a-car-with-your-eyes.html [2] https://www.foxnews.com/tech/construction-workers-being-replaced-by-ai-robot-bricklayers [3] https://www.marktechpost.com/2024/02/22/apple-researchers-introduce-keyframer-an-llm-powered-animation-prototyping-tool-that-can-generate-animations-from-static-images-svgs/ [4] https://www.newsbricks.com/technology/google-unveils-gemma-open-source-ai-models-for-enhanced-accessibility/41341 submitted by /u/Excellent-Target-847 [link] [comments]
    How would we consume movies in the coming years?
    This is in the context of Sora further improving and perhaps other Sora-like competitors will emerge as well. I've been thinking lately about this. The Hollywood model would just not work anymore. And if so, how will the movie industry be reshaped? Will it be a bunch of independent storytellers generating AI movies in their bedrooms, then release/publish it on YT or a streaming platform like Netflix? Will hardware still be crucial and you'd have the emergence of companies leasing their computing power to would-be movie makers. Or perhaps AI would be very powerful that we could just type a prompt and watch a custom movie for a one-time consumption? Would we still yearn for a shared experience? Would Hollywood actors still be relevant? Would we see the rise of AI generated celebrities that appear across movies? submitted by /u/syf3r [link] [comments]
    Has anyone tried ai study tools for summaries and quizzes?
    I've been looking into AI study tools for generating summaries and quiz/flashcards of readings and lecture pdfs. Problem is, a lot of them are quite pricy so I was wondering if anyone could give me recommendations for AI study tools. I have tried Mindgrasp so far but I ran into a few bugs during the free trial. I also liked the look of Revisely even though I haven't been able to properly put it to test because it doesn't have a free trial for its premium features. Any recommendations would be welcome! :) submitted by /u/Erud1te [link] [comments]
  • Open

    [R] Grayscale Image Classification
    We use only single layers of a GNN and MLP for grayscale image classification. We reduce runtime and latency by up to 16x other state-of-the-art models. It's a cool paper :D https://arxiv.org/abs/2402.00564 https://github.com/GECCOProject/GECCO submitted by /u/Imaginary-Two3061 [link] [comments]
    [D] Any existing models that can classify user text query based on "complexity?
    Hi, I am trying to find a trained ML model that can route user queries to the appropriate LLM based on the "complexity" of the query. For example, if a query is classified as "Simple" then route to GPT3.5, if classified as "Hard" then route to GPT4. If a model doesn't exist for this particular classification, does anyone know any existing models that has similar functions? "Simple/Complexity" can be defined roughly using the following rule: 1. Assess the Length of the Query Simple: If the query is short and to the point, typically asking for a straightforward fact, definition, or concise answer. Complex: If the query is lengthy, involves multiple components, or requires detailed explanations. 2. Evaluate the Subject Matter or Domain Specificity Simple: Queries that pertain to general…
    [D] Unsupervised learning
    I am currently working on clustering customers for b2b. I have 695 customers with mixed data types (26 categorical features and 11 numerical). I tried applying label encoding and one hot encoding for nominal and ordinal features. In case of numerical features I used robust scalar. Upon applying kmeans on this transformed data all the data points are categorised in the same cluster. How should I improve my model for better cluster results? submitted by /u/No-Body189 [link] [comments]
    [D] Are some classes of models considered "obsolete" now?
    I am looking to do some self-driven training in ML and data science as a senior SWE. I am in the process of collecting and organizing courses, and I am wondering if some topics are now obsolete and I can safely skip them. This is primarily from the POV of building applications. For example, take the Andrew Ng GAN specialization on Coursera. With stable diffusion models like Midjourney or OpenAI's vision models, do GANs still have a use case? In NLP, do I need to study HMM or build toy translation and summarization models when there's just GPT4 now? submitted by /u/The-_Captain [link] [comments]
    [P] Image classification using TensorFlow & EfficientNetV2 in video surveillance: We improved our Learning Curve in an extensive process of trial and error, training over 150.000 images and learned a lot on the way
    Having your first AI model run using ready-made models and code examples from the book is comparatively easy. However, getting useful results from this technology requires the smart use of many settings. The most important ones are: 1.) Learning rate (we change that even during the training process). 2.) Image Augmentation (setting a standard value is not enough for most applications). 3.) Setting appropriate training weights for individual images. Dealing with false alerts is a big issue in video surveillance. I've never been totally happy with the existing solutions. So we've developed a free and open-source AI software that will help you to solve this problem. We have achieved 99,7 % success rate in detecting predefined objects in our system. To achieve this... * 1.) We adapt (lower) the Adam learning rate stepwise during the training from 1e-4 to 1e-6. * 2.) We use different image augmentation intensities according to the size of the model and the size of the dataset. * 3.) We have different training weights for images that are predicted correctly by the present model and those that are predicted wrongly. There is much more, feel free to ask... CAM-AI is optimized to work with most IP cameras. With CAM-AI you can train your camera to detect different objects (humans, cars, dogs, cats, ...), so you can set up individually conditioned alarms. Just connect your camera to our web application or set up your own server, such as a Raspberry Pi. Of course, your data is encrypted, ensuring that only you can access it. In this learning curve you can see the accelerated improvement of the losses (training loss and validation loss) after 5 times lowering the learning rate: https://preview.redd.it/wwphskaz00lc1.png?width=959&format=png&auto=webp&s=54189f22bd04bbaff4c491bd0880dbe140035867 Here you can read more on our project: https://www.cam-ai.de Hope this will help many of you! Feel free to ask me anything. submitted by /u/sebastian-mh [link] [comments]
    [R] Genie: Generative Interactive Environments
    submitted by /u/topcodemangler [link] [comments]
    Are there any website with an anime-style girl image tag estimation system?[D]
    Not sure where to post this, but I was wondering if anyone had created a anime-style girl image tag estimation system website like https://huggingface.co/spaces/hysts/DeepDanbooru where you uploaded a picture, and it told you the prompts it could pull from the picture. I know there used to be a far better version on huggingface but it stopped working, somehow getting corrupted or something of the likes. submitted by /u/MSGTNP [link] [comments]
    [P] OpenHermesPreferences - a dataset of 1M AI preferences for RLAIF and DPO
    ​ https://preview.redd.it/euzo21cl9zkc1.jpg?width=1792&format=pjpg&auto=webp&s=28a29a944aedc59513554ee614fd479cd8de4b7f Hello everyone, I'm happy to share a large dataset of ~1 million AI preferences that we've created with the Argilla team to enable techniques like RLAIF and DPO to be tested at scale: https://huggingface.co/datasets/argilla/OpenHermesPreferences How did we create it? To build OpenHermesPreferences, we started with Teknium's high-quality OpenHermes-2.5 dataset of (mostly) GPT-4 generated completions. After filtering out multi-turn conversations, this gave us around 1 million prompts to work with. Next, we generated new completions using Mixtral and Nous-Hermes-2-Yi. This was a bit tricky with native `transformers` code, so we used distilabel and llm-swarm that integrate with vLLM and TGI - this reduced the generation time from days to hours! Finally, we ranked all 3 completions using the PairRM model from AllenAI: this is a lightweight reward model that was trained specifically for N-way comparisons and runs inference really fast. What can I use it for? We see two main applications of this dataset: Training killer reward models for techniques like PPO / ReST Applying DPO and friends to existing SFT models We hope this dataset will help the community's research efforts towards understanding the role of AI feedback in language model alignment. Enjoy! ​ ​ submitted by /u/lewtun [link] [comments]
    [N] Run models on a real device with Qualcomm AI Hub
    Launched at MWC Barcelona 24 https://aihub.qualcomm.com https://huggingface.co/qualcomm submitted by /u/ephemeralshot [link] [comments]
    [P] How to design a writing assistive system that can do writing style guide check like how Grammarly does for gramma checking
    I have a style guide that specifies the use of punctuations, line length, and how to break up a line (e.g. before prepositions). I have a dataset that contains original text and the version of the text are conformed to this style guide. Now, how do I design a system that can recommend to user how to conform a given text to the style guide? More specifically, How do I frame this task as? Is this a type of sequence-to-sequence modeling task? Is fine-tuning a LLM with my data a good approach? How to deal with hallucinations? How can this system also tell users that what style guide rules are used in the recommendations? Any pointers to research papers or similar system design would be much appreciated! submitted by /u/Great_East4325 [link] [comments]
    [D] what reseach topics are converging to a (nearly) saturated performance
    given the rapid progress in AI, what topics are almost tackled? benchmark-wise neural machine translation? image-level object detection and segmentation? human pose estimation without heavily occlusion? what do you think submitted by /u/xiikjuy [link] [comments]
    [D] How KV cache is valid in LLM transformer
    Seeing a lot of literature mentioning using KV cache for transformer models to reduce compute in decoder, but in my understanding, when the sequence reaches maximum context length and each left shift renders the left-most token out of scope, the KV cache would lose validity, apparently because a previously participating token vanishes, is that correct? submitted by /u/victordion [link] [comments]
    [D] Practitioners of differential privacy - what’s the typical epsilon value you’ve used?
    It’s hard to find what are typical or accepted ranges of epsilon for differential privacy. Online search has not yielded citable or consistent results. Some sources have said typical values of 1.0 to 10.0. Personally I have found that: 1) differential privacy significantly slows down training speed by almost 10X 2) often I have to use epsilon figures of 1.0 - 5.0 for results to still have required levels of accuracy 3) it’s either I have to increase dataset size significantly, or I have to sacrifice epsilon (privacy) to have models that are still accurate 4) when performing membership inference tasks, I find that increasing dataset size has the same impact as reducing epsilon. For instance: 1) with a dataset size of 200K examples, a DP model with epsilon 1.0 performs exactly the same as a model without dp in terms of accuracy and in privacy evaluation tasks e.g. membership inference. Model without DP hover took 30 mins to train compared to model with DP which took 6 hours. 2) for dataset size of less 100K, DP model is simply not accurate enough unless i increase epsilon to 5.0. This leads me to think: 1) what even is the point of DP if I have to increase epsilon all the way to 5 for model to even be useful 2) if with 200K, there’s no difference in membership inference attack, why should I even bother with DP? What exactly are the practical benefits of DP when increasing dataset size has the same effect in privacy evaluation task, and comes at a fraction of computational cost, beyond “theoretical guarantees”? submitted by /u/shengy90 [link] [comments]
    [D] Question on generating embeddings with Transformer Encoder-Decoder
    I am currently trying to generate embeddings I have of a heterogeneous datasource I am working with. The problem is, that for every sample, I have a varying amount of input features. For the ones interested, the problem is biological. In particular I am trying to calculate a gene-level representation of this data and for every gene I have a different number of variants, with each variant having one feature value associated to it. My solution to homogenise these features was to generate embeddings with a transformer encoder. I generate my feature set by taking all the variant values per gene and concatenating them. Then, I pad the edges of the resulting tensors with zeros such that the signal lies in the centre and the resulting tensor is 1,024 dimensional. The encoder learns a 256 dimens…
    The industry is not going "recover" for newly minted research scientists [D]
    The top thread today asks: "Is the tech industry still not recovered or I am that bad?" Let me make a bold prediction (and I hope I'm wrong, but I don't think I am): the industry is not going to "recover" for newly minted research scientists: You have an exponentially growing number of ML papers, reflecting an exponentially growing number of PhD students and postdocs: ​ https://preview.redd.it/viv6l1gnkykc1.png?width=899&format=png&auto=webp&s=04e227dede42f7d46d1941fc268bb7ea0a409a04 ... who graduate and start competing for a roughly fixed number of well-paying industry research positions. The number of these positions might increase or decrease seasonally, but the longer-term trend is that their job prospects will become increasingly worse, while this exponential trend continues. ​ submitted by /u/we_are_mammals [link] [comments]
    [D] Follow-up Paper?
    I am considering writing a follow-up paper from my previous work. Without revealing too much details, my previous work proposed a new method for error correcting in logical reasoning of LLMs. However, it did require some manual labor at that stage in identifying where the error happened. I plan to 1) incorporate some existing methods to automate the process of identifying the error and 2) increase the prediction process by combining with another existing method. In short, is this worth another paper? The main reason I'm confused is because I am improving my original idea by combining it with "preexisting" methods. Despite the potential reduction in manual labor and increase in accuracy, I am using existing methods to improve my method, which I find quite less novel. submitted by /u/BigDreamx [link] [comments]
    [N] Tech giants are developing their AI chips. Here's the list
    There is a shortage of NVIDIA GPUs, which has led several companies to create their own AI chips. Here's a list of those companies: • Google is at the forefront of improving its Tensor Processing Unit (TPU) https://cloud.google.com/tpu?hl=en technology for Google Cloud. • OpenAI is investigating the potential of designing proprietary AI chips https://www.reuters.com/technology/chatgpt-owner-openai-is-exploring-making-its-own-ai-chips-sources-2023-10-06/. • Microsoft announced https://news.microsoft.com/source/features/ai/in-house-chips-silicon-to-service-to-meet-ai-demand/ two custom-designed chips: the Microsoft Azure Maia AI Accelerator for large language model training and inferencing and the Microsoft Azure Cobalt CPU for general-purpose compute workloads on the Microsoft Cloud. • Amazon has rolled out its Inferentia AI chip https://aws.amazon.com/machine-learning/inferentia/ and the second-generation machine learning (ML) accelerator, AWS Trainium https://aws.amazon.com/machine-learning/trainium/. • Apple has been developing its series of custom chips and unveiled https://www.apple.com/newsroom/2023/10/apple-unveils-m3-m3-pro-and-m3-max-the-most-advanced-chips-for-a-personal-computer/ M3, M3 Pro, and M3 Max processors, which could be extended to specialized AI tasks. • Meta plans to deploy a new version of a custom chip aimed at supporting its artificial intelligence (AI) push, according to Reuters https://www.reuters.com/technology/meta-deploy-in-house-custom-chips-this-year-power-ai-drive-memo-2024-02-01/. • Huawei is reportedly https://www.reuters.com/technology/ai-chip-demand-forces-huawei-slow-smartphone-production-sources-2024-02-05/ prioritizing AI and slowing the production of its premium Mate 60 phones as the demand for their AI chips https://www.hisilicon.com/en/products/ascend has soared. Did I miss any? submitted by /u/vvkuka [link] [comments]
    [D] Is the tech industry still not recovered or I am that bad?
    I am a recent PhD graduate from a top university in Europe, working on some popular topics in ML/CV, I've published 8 - 20 papers, most of which I've first-authored. These papers have accumulated 1000 - 3000 citations. (using a new account and wide range to maintain anonymity) Despite what I thought I am a fairly strong candidate, I've encountered significant challenges in my recent job search. I have been mainly aiming for Research Scientist positions, hopefully working on open-ended research. I've reached out to numerous senior ML researchers across the EMEA region, and while some have expressed interests, unfortunately, none of the opportunities have materialised due to various reasons, such as limited headcounts or simply no updates from hiring managers. I've mostly targeted big tech companies as well as some recent popular ML startups. Unfortunately, the majority of my applications were rejected, often without the opportunity for an interview. (I only got interviewed once by one of the big tech companies and then got rejected.) In particular, despite referrals from friends, I've met immediate rejection from Meta for Research Scientist positions (within a couple of days). I am currently simply very confused and upset and not sure what went wrong, did I got blacklisted from these companies? But I couldn't recall I made any enemies. I am hopefully seeking some advise on what I can do next.... submitted by /u/Holiday_Safe_5620 [link] [comments]
    What’s the current consensus on using tensorflow 2.x vs PyTorch? [D]
    I understand that many folks left TF during the TF1 to TF2 migration and never looked back. My question is what’s the current consensus on using TF2 vs PyTorch (vs Jax) and why? From an end-to-end perspective, for me, TF2 has been good. Debugging is easier on PyTorch but the benefits are not big enough to drop everything and leave TF2 for good. What do you guys think? Assuming that you are building AI products and deployment is a must, do you prefer TensorFlow or Pytorch (or JAX) in your codebase and why? submitted by /u/1infiniteloop [link] [comments]
    [R] YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
    Paper: https://arxiv.org/abs/2402.13616 Code: https://github.com/WongKinYiu/yolov9 Models: https://huggingface.co/merve/yolov9 Demo: https://huggingface.co/spaces/kadirnar/Yolov9 Colab: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/train-yolov9-object-detection-on-custom-dataset.ipynb Abstract: Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore a fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, large amount of information will …
    [P] GPTFast: Accelerate your Hugging Face Transformers 6-7x. Native to Hugging Face and PyTorch.
    GitHub: https://github.com/MDK8888/GPTFast GPTFast Accelerate your Hugging Face Transformers 6-7x with GPTFast! Background GPTFast was originally a set of techniques developed by the PyTorch Team to accelerate the inference speed of Llama-2-7b. This pip package generalizes those techniques to all Hugging Face models. submitted by /u/SunsetOneSix [link] [comments]
    [R]Post-Grad Opportunities for ML Theorists
    I am a first-year PhD at a top-20 CS Program in the US. I am interested in Statistical Learning Theory. However, I am concerned about my Post Grad Opportunities since the research I may end up doing will be too theoretical with zero to no practical significance. For me, empirics matter too. Would the industry be interested in hiring a theorist whose work has no practical consequence? Is there a way I can do both? Any suggestions are welcomed. Please help a poor first-year PhD student. :'( submitted by /u/dead_CS [link] [comments]
    Is 'Don't Stop Pre-training' just fine-tuning? [R]
    The Don't Stop Pre-training paper boasts about improving the performance of LLMs through "domain-adaptice pre-training' but this just seems like another word for fine-tuning that isn't new at all. I'm missing something for sure - what is it? submitted by /u/MLenthusiast34 [link] [comments]
    [P] AI infrastructure landscape
    submitted by /u/gaocegege [link] [comments]
    [D] Is it worth switching to JAX from TensorFlow/PyTorch?
    Hey all! I'm seeing JAX pop up more and more e.g. Google Deepmind released their Gemma open source models in JAX. I currently use TensorFlow/PyTorch. Is it worth checking out JAX? Can it do the same stuff with same flexibility as TensorFlow/PyTorch? submitted by /u/Few-Pomegranate4369 [link] [comments]
  • Open

    Any Strong Literature/Ideas on Feature Removal/Selection for RL Tasks?
    Hello all, I'm interested in looking into feature selection as a means of increasing convergence speed for continuous action space RL tasks - specifically working with the omniverse isaac gym environments. As of now, I have mediocre success first training a model, calculating the feature importance values with libraries like SHAP and SAGE, and retraining with some low important observations removed, however this strategy doesn't work for other environments I've been trying. I know SHAP and SAGE and feature importance in general isn't the greatest method of feature selection, so I'm reaching out here to see if anyone has any different directions they can point me. Some of these tasks have 50+ observation space sizes and I want to see if there are any methods to help determine which features can be removed without sacrificing model reward maximums any help and discussion is appreciated, thank you submitted by /u/bbri826 [link] [comments]
    Help with modeling reversal learning experimental data
    Neuroscience researcher here. I'm relatively new to the reinforcement learning in programming but am currently working on an experiment that will require descriptive modeling of a behavioral task. ​ Context: I am running a simple reversal learning task in which two spouts are available to an animal, but only one of them is rewarded. There are a large number of trials, and in each trial a cue plays followed by a reward delivery. After a specified number of trials, the contingencies switched whereby the previously unrewarded spout becomes rewarded. I track licks continuously during the task and record some other more complicated neural data. I want to see how animals flexibly adapt their anticipatory licks to each spout prior to reward delivery based on outcome history and other parameters. My goal is to fit the behavioral data to some model that describes change in relative value of each action. I'd like to see how well the neural data can describe value as well, though I would like to start with modeling the behavioral measurements. I've looked into various models like Rescorla Wagner and Q-learning, but am unsure which approach would best for modeling my data. Also, I'd like to limit all of my programming to Python if possible. ​ Any tips on how to go forward with this data analysis would be greatly appreciated. submitted by /u/Any-Captain5070 [link] [comments]
    Good Projects in these 5 domains to stand out in GSOC registration
    https://preview.redd.it/yjxy8stgtzkc1.png?width=1093&format=png&auto=webp&s=f1e6ea35386318e4d3292c4ee90682f96dcb2ecf GSOC 2024, mlpack. These are the ideas they are expecting. wish to know relevant mid difficulty projects that would give me an edge over the others. also any other tips on GSOC would be appreciated submitted by /u/PandeyyJi [link] [comments]
    Doubt about MuZero
    My understanding of MuZero is that starting from a given state we expand for K steps into the future the search tree with the Monte Carlo Tree Search algorithm. But differently from a standard MCTS, we have a deep model that a) produces the next state and reward given the action and b) produces a value function so that we don't need to simulate the whole episode continuation at every node. Two questions: Is the last point correct? I.e. there isn't any simulation done during the tree search, only the value function is used to estimate the future return from the current node onwards? Is this tree-expansion mechanism used only at training time or also at train time? Some parts of the paper seem to suggest that it is, but I then don't understand what the policy head is for submitted by /u/fedetask [link] [comments]
    Seeking for research project ideas
    Im a final year undergraduate student and am fairly new to this domain. I am working on building a project in this domain where I want to bring some novelty. I am seeking for advice and ideas on how I can accomplish this. submitted by /u/fa_anony__mous [link] [comments]
    Is it worth the effort implementing RL project in Julia? Any first-hand experiences?
    I am just starting a new research project in deep RL. Now I am wondering whether it is worthwhile to switch to Julia for this project in terms of performance and usability. Previously I have implemented my projects with Ray's Rllibs or have written my own code (first using tensorflow and then migrating to pytorch). I have heard that Julia excels at usability and speed, however I am not sure, whether this applies to Deep Learning and RL, too. I am grateful for any first-hand experiences :) submitted by /u/Tortoise_vs_Hare [link] [comments]
    [Q] About Mahalanobis Norm in Linear Bandits
    I am confused about the Mahalanobis Norm used under the settings of linear bandits. Are there any linear algebra required? submitted by /u/RalCauchy [link] [comments]
  • Open

    “We offer another place for knowledge”
    After acquiring data science and AI skills from MIT, Jospin Hassan shared them with his community in the Dzaleka Refugee Camp in Malawi and built pathways for talented learners.  ( 6 min )
    Generative AI for smart grid modeling
    MIT LIDS awarded funding from the Appalachian Regional Commission as part of a multi-state collaborative project to model and test new smart grid technologies for use in rural areas.  ( 4 min )
    Putting AI into the hands of people with problems to solve
    Alumni-founded Pienso has developed a user-friendly AI builder so domain experts can build solutions without writing any code.  ( 6 min )
  • Open

    Techniques and approaches for monitoring large language models on AWS
    Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), improving tasks such as language translation, text summarization, and sentiment analysis. However, as these models continue to grow in size and complexity, monitoring their performance and behavior has become increasingly challenging. Monitoring the performance and behavior of LLMs is a critical task […]  ( 7 min )
  • Open

    5 critical metrics every data scientist should monitor in hybrid cloud environments
    Experienced data scientists will find it helpful to think of hybrid cloud environments as a kind of high-tech ecosystem—complex and full of pitfalls that could swallow you whole if you’re not careful. In this context, keeping tabs on key metrics isn’t just helpful; it’s your secret to making sure everything runs smoother than ever. Here… Read More »5 critical metrics every data scientist should monitor in hybrid cloud environments The post 5 critical metrics every data scientist should monitor in hybrid cloud environments appeared first on Data Science Central.  ( 21 min )
    Scaling FAIR data sharing in an R&D culture
    Interview with Ben Gardner of AstraZeneca Image by Gerd Altmann from Pixabay Ben Gardner’s experience working with drug discovery teams goes back two decades, when he was a team leader at Pfizer. “We were just transferring information by spreadsheet…. It was supposed to be from a data warehouse with our values in there, but our… Read More »Scaling FAIR data sharing in an R&D culture The post Scaling FAIR data sharing in an R&D culture appeared first on Data Science Central.  ( 20 min )
    How can data and analytics transform financial IT?
    Top 5 financial IT solutions transforming the industry As technology advances, customers expect to access instant services through their phones regardless of their location. Businesses are also hoping to streamline their services and maximize profits.  The financial industry is currently facing major shifts from traditional ways of operation thanks to technological advancements. The fast-changing technology… Read More »How can data and analytics transform financial IT? The post How can data and analytics transform financial IT? appeared first on Data Science Central.  ( 22 min )
  • Open

    Determinant of an infinite matrix
    What does the infinite determinant mean and when does it converge? The determinant D above is the limit of the determinants Dn defined by If all the a‘s are 1 and all the b‘s are −1 then this post shows that Dn = Fn, the nth Fibonacci number. The Fibonacci numbers obviously don’t converge, so […] Determinant of an infinite matrix first appeared on John D. Cook.  ( 5 min )
  • Open

    Choose Your Own Coding Assistant
    submitted by /u/High_Sleep3694 [link] [comments]
    Liquid Neural Network
    Hello everyone, Can you please provide me with the resources or tools (Code, preferrably using TF) or a roadmap to learn LNNs (Liquid Neural Networks) since I have a project to do about them on a certain real life application. I already have a strong background about ML, Neural Netwok, CNN, RNN, LSTM, transfer learning. Your help is much appreciated! submitted by /u/Hussein_Jammal [link] [comments]
    Number of Words to Use for a Word2Vec?
    If I'm using a word2vec to tokenize words for another network that does sentiment analysis how many words should I train it with? If I were to use up to the nth most common words how likely is it an email would have a word not in the dataset? I tried to find the frequency of the 10,000th most common word but no luck. submitted by /u/ConflictUnfair [link] [comments]
  • Open

    NVIDIA RTX 500 and 1000 Professional Ada Generation Laptop GPUs Drive AI-Enhanced Workflows From Anywhere
    With generative AI and hybrid work environments becoming the new standard, nearly every professional, whether a content creator, researcher or engineer, needs a powerful, AI-accelerated laptop to help users tackle their industry’s toughest challenges — even on the go. The new NVIDIA RTX 500 and 1000 Ada Generation Laptop GPUs will be available in new, Read Article  ( 6 min )

  • Open

    ChatGPT is integrated with Siri Shortcuts! Their app’s integration works even on HomePod, you can access the power of this tool from Siri right now, pretty neat!
    submitted by /u/AffectionateTrips [link] [comments]
    Ai-run online print shop
    Haven't seen anything else like this yet. Pretty impressive catalog of stuff they got on there. I think they use a combination of LLM + stable diffusion submitted by /u/ak93211 [link] [comments]
    Are there aspects of AI that can be just as effectively done on a DIY swarm of GPUs as on A/H100 servers?
    I'm curious if there are certain and specific things like AI image generation, database analysis, etc that can be done commercially and effectively as you can on something like an A100, except by small scale people with lots of parallel GPUs like RTX 3000/4000 series? I was scaling up for crypto mining before that went downhill. Mostly ASICs, but I'm wondering if there would be some kind of commercial use for setups similar to GPU ETH mining rigs, except for AI this time around? If so, what sectors of AI are these things? I have a warehouse with a lot of space and power. But not a ton of extra money. Just trying to figure out ways to use this resource effectively. submitted by /u/agoldprospector [link] [comments]
    How to use tungsten.run?
    I have no coding experience. Can someone explain to me how to use and setup tungsten.run Thanks! submitted by /u/Kingslayer_96 [link] [comments]
    Best image generator software?
    What is the current best? I’ve been using ChatGPT image generator but it only does it in a smaller resolution. Is there something more powerful for iOS or macOS? submitted by /u/Turtleguycool [link] [comments]
    Learn Transformers - resources for Andrej Karpathy's instruction videos and more!
    Andrej Karpathy has some excellent videos around how to build a GPT style transformer. Here is a recent video on how to build a transformer he made. https://youtu.be/kCc8FmEb1nY?si=mT4zI6eeUJmskuF6 I created a Discord server for learners who are interested in sharing some resources on how to learn transformers. 1) quiz questions to test important knowledge 2) open source projects building on transformers (e.g. sentence transformers, sentiment analysis, chat, etc) 3) weekly video call to share what we are learning 4) place to ask questions 5) place to share links to research, videos, projects Link to the discord: https://discord.gg/CSkZcPWu (will expire in 7 days) submitted by /u/xyz_TrashMan_zyx [link] [comments]
    What are some known AI video creators on social media?
    I'm looking for some creators who make AI videos on social medias like Tik Tok and Instagram. I want to find the original creators. Thanks submitted by /u/Davidvia24 [link] [comments]
    image to video (workflow 2.0 in comments): Midjourney + Photoshop + Stable Video Diffusion + MPC + Ultrasharp + Premiere + Topaz Video AI.
    submitted by /u/PeePeePeePooPooPooo [link] [comments]
    Interactive Language Assistant
    Is there an artificial intelligence tool available that engages in conversations with humans on a specific topic and provides grammar corrections of better work choice? submitted by /u/flight862 [link] [comments]
    Meta AI hallucinating hard
    Learnt about hallucinations in the ChatGPT reddit and got karma trashed in the process. XD Am I dumb to believe that there should be some baseline for hallucinations and companies shouldn’t just roll out an AI to stay relevant with the buzz? I understand AI is not perfect but they’re calling it an assistant even in the basic description. Just think AI is being made out to be more than it actually is rn. submitted by /u/IWant2Break_Free [link] [comments]
    What are the Best AI Communities and Grants out there for Enthusiasts, Researchers and Developers?
    If you’re an AI enthusiast the best way to learn more about AI and stay on top of developments is to join a vibrant Community, while if you are a Researcher or a developer you can benefit a lot from a community full of brilliant minds, and for having access to funding to make your work come to life. For that reason, we thought it would be valuable to create a video with some of the different and most interesting options we found where you can actually do both at the same time, leverage a brilliant and heterogeneous community, and access grants, funding and all kind of AI-related opportunities: https://www.youtube.com/watch?v=V2JqbH2AC5k In this video, you will find options for all tastes, options that are fit for the AI enthusiasts that are just starting, as well as options fit for AI startups, that are searching for financial support and mentorship to take their business to the next level. submitted by /u/DecentralizedNation [link] [comments]
    so how do people actually generate ai images
    i thought it would be easy but every text-to-image generator is either in a closed beta or a paid service submitted by /u/Efficient_Deal8123 [link] [comments]
    Could AGI or ASI eventually lead to a situation similar to the scenario that was portrayed in this 1988 movie?
    submitted by /u/Der_Ist [link] [comments]
    From AI Photo to real life model
    submitted by /u/DapperOne9927 [link] [comments]
    Industry 4.0: AI at the heart of the transformation
    submitted by /u/Independent_Ad4273 [link] [comments]
    Swarms of AI "killer robots" are the future of war: If that sounds scary, it should
    submitted by /u/shrodikan [link] [comments]
    How can I change my face in a video using AI?
    Hi,, I'm not sure if this is the right place to be posting a question like this but I've been struggling to find the answer and I figured you guys would be quite knowledgeable. For ages I've wanted to get into content creation but I don't want to show my face for privacy reasons and also because the field of work that I am in. I know that if they stumbled across a YouTube channel with my face on it, they would probably fire me because they wouldn't want the clients to see that. I know there are some people who have been able to change their face in a video to look like someone else, but I don't have a big budget and I have no clue where to begin with that kind of thing. Does anyone know what kind of programs and apps they're using and where I can find it? submitted by /u/Scary_Eye4963 [link] [comments]
    AI Predicts Our Future.
    submitted by /u/Philipp [link] [comments]
    How should You educate your kids and your societies..
    À message to Moms, dad's, parents, leaders, bosses, employees, everyone. I really liked his answer, Do you guys agree.? submitted by /u/Pay-Me-No-Mind [link] [comments]
    AI Apps based on GPT
    Hi guys! I’m currently working on a project focusing on analysing the market of API based apps from GPT like the homework helpers, photo editor, music generator, chatbots , etc and looking for some gaps in the market. I already found some gaps, for example the lack of apps of wellness and fitness in this market. However I’m still getting some ideas, do you have some? Anything is possible here 😉 submitted by /u/Gornelas [link] [comments]
    One-Minute Daily AI News 2/24/2024
    AI robots ‘copy’ bishops’ voices to con nuns out of thousands in terrifying scam.[1] Man who threw away $190,000,000 Bitcoin fortune is now using AI to locate it.[2] US and China agree to map out framework for developing AI responsibly.[3] Treating a chatbot nicely might boost its performance.[4] Sources: [1] https://www.dailystar.co.uk/news/weird-news/ai-robots-copy-bishops-voices-32196884 [2] https://www.unilad.com/news/uk-news/james-howells-threw-away-bitcoin-ai-find-it-681168-20240224 [3] https://www.abc.net.au/news/2024-02-25/china-united-states-artificial-intelligence-ai-regulations/103477338 [4] https://techcrunch.com/2024/02/23/treating-a-chatbot-nicely-might-boost-its-performance-heres-why/ submitted by /u/Excellent-Target-847 [link] [comments]
    How close are we to a real world version of JARVIS
    Most people here are familiar with the Iron Man series where Tony Stark had an AI system that was part of his house that was like a person, which he referred to as Jarvis. With how much we've expanded on AI in the last few years I feel like something like this is attainable to a point, but what point exactly? We have the capabilities for AI language models to communicate similarly to how humans do, obviously with many constraints. We have models that are able to interpret speech and models that are able to create speech that sounds like human talk. What is the missing part of the equation right now? Is it the processing power required to complete all those tasks at the same time? The speed in which it would be able to do so making it take too long to respond? It would be pretty awesome to have a Jarvis of my own, look forward to having that some day (and the iron man suit) submitted by /u/Competitive-Space651 [link] [comments]
  • Open

    [D] Is resampling data over a 8 hours window a good idea to reduce data ?
    For data preprocessing, I implemented three filters to reduce my data from 100 million to 200k entries with floating-point features. I resampled over an 8-hour window and applied aggregation functions (max, min, mean, std). Now, the challenge is testing my model. How can I structure my test data to resemble the train_df, considering the model predicts on one future row? Do you think resampling might not be the most suitable tool in this case? submitted by /u/iced_french_fry [link] [comments]
    [D] Fearure Space Pattern
    I am doing a 5-class text classification and this is my training feature space. I did run my experiment for different train-test splits but every feature space looks like this. I am wondering why it is looks like a sin graph and why it's not clustered and why it is linear (Classes are 0,1,2,3,4) https://preview.redd.it/uu8rtp8actkc1.png?width=2496&format=png&auto=webp&s=136136a5bbf991b3039afba49a1f2334baebb820 submitted by /u/The_Aoki_Taki [link] [comments]
    [P] More Thoughts on Mamba: Predicting the view count of a Youtube video from its thumbnail with VisionMamba
    For fun, I wrote an article about my project to predict the view count of a video from its thumbnail, using a variation of Vision Transformer that uses Mamba as a backbone instead. It's a really nice model to use. Its memory-lightness means I can use it as a GPU-poor person, which is awesome. This is the first post in a series, I plan to scale up the dataset in the future, incorporate text embeddings, and then finally use it with RLHF to finetune a stable diffusion model. https://open.substack.com/pub/2084/p/2084-can-you-predict-the-view-count?r=brh1e&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true submitted by /u/ExaminationNo8522 [link] [comments]
    [D] Large difference in inference speed
    I am training a yolov5 model for object detection. I pruned and quantised it using sparse ml and then exported it to onnx format. (Image size 640, batch size 16) While inferring on my laptop using cpu (and ryzen 5 5600, 16gb ram) I am getting around 20ms per image speed. Now when I infer the same thing in raspberry pi 5 (A76, 8gb ram) the inference speed is just 220 ms per image Why is there such a large difference in the inference speed. I get that Pi module may have a slower cpu but 10x difference??? I installed the same libraries in both of them. Do you need to manually configure onnx runtime in raspberry pi for it to increase inference speed?? submitted by /u/Melodic_Draw6781 [link] [comments]
    [D] #14 Activation Functions in Neural Networks with Python & Tensorflow
    submitted by /u/datonsx [link] [comments]
    [D] What is the current SOTA for wake word detection?
    Hey everyone Tldr: How to do wake word detection accurately? I am working on a project where I need to do wake word detection. The current implementation uses a STT model to transcribe the audio and then use if statement to check if the wake word is in the audio. This is an adhoc implementation. I am assuming there are better ways so I would like to hear about this from people with more knowledge about this. submitted by /u/Amgadoz [link] [comments]
    [D] Class-incremental learning for master's thesis
    Hello everyone, I am writing this post because I have begun working on my master's thesis. The topic I have chosen is class-incremental learning, within the context of computer vision. The primary goal is to modify one of the existing methods for a specific dataset to achieve higher accuracy than the "base" method. Given the time constraints, this is what I have considered the best approach to try. If you have any recommendations (regarding which class-incremental learning approaches I should attempt to modify and how), I would greatly appreciate it, as well as any other advice. In a few days, I am going to meet with my coordinator to discuss these ideas. submitted by /u/alterednitrogen [link] [comments]
    [D] Looking for relevant research topic names
    I am looking forward to knowing the names of relevant research in deep learning approaches that allow me to infer more information with more time or computation. For example, we humans can infer information by looking at an image for a longer time. To be precise, give a cow image with grasses, we can see the cow in the first place then we can tell the color of the cow, the background, and what part of the cow is visible given time. I know about both top-down and bottom-up approaches. submitted by /u/Top_Paramedic1010 [link] [comments]
    [P] InstaSwap Face Swap Node for ComfyUI #comfyUI #faceswap #extention
    Fastest Face Swap Extension Node for ComfyUI, Single node and FastTrack: Lightning-Speed Facial Transformation for your projects. ​ https://preview.redd.it/js55kvmjsrkc1.png?width=512&format=png&auto=webp&s=078ebc5e58558a166c28641ce486f7960bc3a2fa My repository: https://github.com/abdozmantar/ComfyUI-InstaSwap submitted by /u/abdullahozmntr [link] [comments]
    LLM architecture ideas? [D]
    Trying to understand LLM architecture and unique approaches for security or simplification? With the advent of llms and transformer architecture, there are a lot of applications and context extensions using Rags and vector databases etc. I was trying to understand if there is a way to simplify, or secure an llm by translating a query from text into some other structure like say similar to how Gans work with latent space? So example would be this- say you want to run a query to summarize something so instead of passing in an entire book or long article you encode it into smaller compressed structure Then you summarize this- via an llm get back “compressed” data which then can be exchanged etc ? This also might be rather than save space you could encrypt something- send to llm get analysis also encrypted then return data and decrypt? This might use fully homomorphic encryption and/or proxy re-encryption? Just curious on thoughts around this? submitted by /u/enigmae [link] [comments]
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [D] Best ai tool for video generation
    I have a task for my college to make a video using ai about effects of exponential technology growth on population population. What ai tool do you recommend? submitted by /u/ShacoAteMyCupcake [link] [comments]
    [D] Fine tuning a text-to-3d model.
    Hi everyone! I have been exploring how to fine-tune models for generating specific geometric shapes, like circles and rectangles for 3D printing purposes. While I've some prior experience in programming and ai, but at the moment I am unable to find any guide or direction into text-to-3D fine-tuning. Can anyone share insights or tips on: Effective strategies for fine-tuning a text-to-3D model? Recommended datasets or resources for training models on geometric shapes? Any guide/help will be really appreciated. submitted by /u/Same_Temperature4319 [link] [comments]
    [D] Best tools for locating photos without GPS
    I have many old photos where I'm unsure of the location. I'm looking for software to help locate and tag (EXIF) based on the photo content. Some that I've found: https://github.com/TIBHannover/GeoEstimation https://picarta.ai/ https://huggingface.co/geolocal/StreetCLIP If anyone else has attempted this, I'm curious what you've tried and which you've found to be most useful. submitted by /u/eng33 [link] [comments]
    [N]Introducing Magika: A Powerful File Type Detection Library
    Magika, a file type detection library developed by Google, has been gaining attention. We've created a website where you can easily try out Magika. Feel free to give it a try! https://9revolution9.com/tools/security/file_scanner/ https://preview.redd.it/u5cuqvfyqqkc1.png?width=2094&format=png&auto=webp&s=d16e51115134e3943cc6027cc0a9191ba835c38f ​ submitted by /u/glassonion999 [link] [comments]
    [D] I wrote a small tool for debugging Triton code. Anyone interested?
    Hey, I am writing Triton kernels and the only way to debug the code as far as I know, is using tl.device_print, which only works with tensor data (no shapes for you) and clogs the output. So I wrote a small tool to run kernels using just torch without changing the code. The only changes is reducing launch grid sizes and changing kernel wrappers to debug wrappers. Here's an example of a simple kernel: import torch import triton # import triton.language as tl import Tests.Triton.triton_debug_module as tl # @triton.jit @tl.debug def add_kernel(x_ptr, y_ptr, output_ptr, n_elem, BLOCK_SIZE: tl.constexpr): pid = tl.program_id(axis=0) block_start = BLOCK_SIZE * pid offsets = block_start + tl.arange(0, BLOCK_SIZE) mask = offsets < n_elem x = tl.load(x_ptr + offsets, mask=mask) y = tl.load(y_ptr + offsets, mask=mask) output = x+y tl.store(output_ptr + offsets, output, mask=mask) def add(x:torch.Tensor, y:torch.Tensor): output = torch.empty_like(x) assert x.is_cuda and y.is_cuda and output.is_cuda n_elem = output.numel() grid = lambda meta: (triton.cdiv(n_elem, meta['BLOCK_SIZE']), ) add_kernel[grid](x, y, output, n_elem, BLOCK_SIZE=64) return output This code will execute using torch backend and you can look at each tensor shape, values in a normal way. The only changes in code is commenting out import of triton.language and triton.jit and importing debug module and wrapping kernel in tl.debug. I also impemented automatic visualization of memory read writes (same stuff, but wrapping your kernel in tl.debug_vis, no other changes needed). Here's an example for flashattention2 forward kernel: ​ Attn Fwd So, can this tool might be useful for someone? It's still kinda incomplete, because I haven't wrapped all the triton functions, moreover there are external cuda functions, I only implemented tl.math.exp2 floor sqrt and log2. So, should I open source it? Should I publish an arxiv stuff or submit it to some workshop? submitted by /u/clueless_scientist [link] [comments]
    [R] Nature-Inspired Local Propagation
    arXiv: https://arxiv.org/abs/2402.05959 OpenReview: https://openreview.net/forum?id=uCMxeZCp2T Abstract: The spectacular results achieved in machine learning, including the recent advances in generative AI, rely on large data collections. On the opposite, intelligent processes in nature arises without the need for such collections, but simply by online processing of the environmental information. In particular, natural learning processes rely on mechanisms where data representation and learning are intertwined in such a way to respect spatiotemporal locality. This paper shows that such a feature arises from a pre-algorithmic view of learning that is inspired by related studies in Theoretical Physics. We show that the algorithmic interpretation of the derived "laws of learning", which takes the structure of Hamiltonian equations, reduces to Backpropagation when the speed of propagation goes to infinity. This opens the doors to machine learning studies based on full on-line information processing that are based the replacement of Backpropagation with the proposed spatiotemporal local algorithm. submitted by /u/SunsetOneSix [link] [comments]
    [D] Are there Mamba + RAG research for finetune to fix copy problem in State space models?
    Mamba + RAG after finetune maybe can fix can't copy problem (Repeat After Me: Transformers are Better than State Space Models at Copying) in State space models. submitted by /u/ghosthamlet [link] [comments]
    [D] Theory for latent space
    Hello everyone, I was wondering if anybody has some papers/references to share about a "theory for the latent space" in neural networks. I try to be clearer: right now we see in many applications that the neural network finds the relevant patterns for the learning within its hidden layers. Think about CNNs for example, and the way we can inspect the filters that have been learned by the network. Or think about a "simpler" word2vec: again, we can investigate what happens in the specifc case of each word by plotting a projection of the embeddings. However, these are all heuristics and depend on a case by case situation. Is anybody aware of a more general theory that describes what is going on in those hidden layers? Sorry if the question in naive, I come from physics and I like to think that there is some principle lying somewhere :) submitted by /u/PiMas88 [link] [comments]
    [D] Is gradient descent in deep neural networks an implicit Ricci flow?
    Deep neural networks work under the manifold hypothesis, that is, that the training data lie on a lower-dimensional manifold. In particular, the last layer is a linear layer, which, in order to work properly, must mean that the target manifold curvature must be about constant since a consistent curvature is needed for a linear subspace to approximate the target manifold (in case of regression) or separate it (in case of classification). That is, the sequence of manifolds generated by gradient descent must converge, in the sense of their metric tensors, to a manifold of constant curvature. Ricci flow does this too. Am I correct in making this comparison? submitted by /u/Flankierengeschichte [link] [comments]
    [D] What are some other paradigms and frameworks for building with LLMs besides retrieval augmented generation (RAG)?
    It has been more than a year since InstructGPT and its mass market application ChatGPT went public and created the storm of interest we are seeing today around LLMs. Alongside the research and academic interest, there have been various attempts from industries (startups, bigger enterprises, etc) building new products and/or features powered by LLMs for commercial reasons. However, most of what I've seen so far are either thin-wrapper applications around a LLM or some variation of a RAG. What are some other paradigms and frameworks for building with LLMs besides retrieval augmented generation (RAG)? submitted by /u/gamerx88 [link] [comments]
    [D] Adjusting a subset of one unet based on another one
    Lets say I have 2 unets: 1 has 50,000 values, and one has 18,000. They have have the same number of dimensions. The smaller one is a strict subset of the larger one, in the sense that i can tell you that Smallset[x] represents the same concept as Largeset[y] Now comes the issue. While I know which indexes in Smallset correspond to ones in Largeset.... the actual unet weights were independently trained. Therefore, the weights for Smallset[x] are completely unrelated to the ones for Largeset[y] But the smaller set is better trained. So, ideally, I would like to reshape the other 30,000 values in the larger set, around the training in the smaller set. Can anyone recommend existing standard ways to do this? Ideally, using a method that wont cost me $100,000 worth of computing power as well? :-) ​ submitted by /u/lostinspaz [link] [comments]
    [D] Hugging Face Accelerate versus Lightning Fabric
    TL; DR: As the title states, my question is pretty straightforward: people who have used both these libraries, what the pros and cons of both and which would you recommend? ​ I don't want to introduce abstractions the way PyTorch Lightning does, and want as much control over the training loop as possible. This is mainly because I don't want to refactor my code to best suit Lightning's best practices. However, I still want to use multi-GPU, multi-node, and mixed-precision training, and these 2 seem to be the most obvious candidates. ​ Hugging Face Accelerate and Lightning Fabric both seem similar from their "convert-from-PyTorch" guides: Initialize a device object. Wrap the model(s), the optimizer(s), and the dataloader(s) through the libraries' .setup()/.prepare() functionalities. Remove the .to(device) calls. Update the .backward() calls to back-propagate the loss through the object. ​ So my question is: is there an upside to one of these over the other? Are there any obvious gotchas that I am missing. I have not looked at their documentation in detail, so if anyone has used these libraries, it would great if you can share any experiences/advice. Thank you. submitted by /u/DaredevilMeetsL [link] [comments]
  • Open

    I can't seem to get the deepmind nav maze to work on WSL - is it built for systems running pure linux or am I just doing something wrong?
    I've got bazel and I'm basically just trying to get it going as a test environment on which I can run dreamerv3 but it just doesn't seem to be loading - I keep getting build errors submitted by /u/dagangsta2012 [link] [comments]
    Courses/Resources for beginner
    I have assigned to work on Q-tables and to explore it in my company.Please suggest some beginner courses for Reinforcement learning. I have started the hugging face course but it's kind of difficult to understand. submitted by /u/96_kishan [link] [comments]
    DreamerV3 code is so hard to read
    Hi all, recently I was assigned a job to investigate the SOTA world model DreamerV3 [^1]. I spent 3 months to understand the paper (I'm new to ML) and the code. Basically, I've looked at 3 codebases: The author's implementation, written in Jax: https://github.com/danijar/dreamerv3 A PyTorch implementation by NM512: https://github.com/NM512/dreamerv3-torch A PyTorch implementation by sheeprl: https://github.com/Eclectic-Sheep/sheeprl (1) uses Jax and looks complicated, (3) provides a series of blogs for explanations. So I choose (3). However, even (3) is quite complex. Sheeprl wants to make their framework suitable for all RL algos. The program logic becomes hard to read due to the sacrifice to generality. I feel like overwhelmed by this task and don't know what to do. Maybe I should go back to the Jax version even though there's no doc about it. I think there are too many designs and tricks in dreamer :( Is there any (code, blog, research) recommendation? Maybe I should go back to David Ha's 2018 world models paper[^2] to do my research since it should be easier than Dreamer. Thanks a lot! [^1]: https://arxiv.org/abs/2301.04104 [^2]: https://arxiv.org/abs/1803.10122 submitted by /u/AdministrativeCar545 [link] [comments]
  • Open

    Area of quadrilateral as a determinant
    I’ve written several posts about how determinants come up in geometry. These determinants often look similar, having columns related to coordinates and a column of ones. You can find several examples here along with an explanation for this pattern. If you have three points z1, z2, and z3 in the complex plane, you can find […] Area of quadrilateral as a determinant first appeared on John D. Cook.  ( 5 min )

  • Open

    Impact of AI on Freelance Jobs
    submitted by /u/valis2400 [link] [comments]
    Apple AI Strategy
    Local models for casual use cases. What do you think of their strategy? Does it still have a chance to be a winner with normies? submitted by /u/mattlock1984 [link] [comments]
    Any good free AI image generators??
    What are the top free Ai generators? Asking for a school project. submitted by /u/ebdixbd [link] [comments]
    David Shapiro Credibility
    I've been watching a good amount of his content lately and he seems to have nuanced and interesting takes on things, but when I look into him it says he has been an independent researcher since 09? I see he has published some books, but I'm wondering if someone with more knowledge in the field can inform me on his credibility, or point me in the direction of someone who makes similar content with a better documented background. Unfortunately I am not informed enough on this topic to tell if what he is saying is legit, and it seems like that is most of his audience too. That said I really like the guy, he seems genuine and ~seems~ well informed. submitted by /u/Weird_Ad_1418 [link] [comments]
    Landing page header generator
    After many iterations and feedback from the community I have created landing page header generator and create quick, effective landing page headers! After many iterations and feedback from the community, I have created and created quick, effective landing page headers! The hero section of a landing page is the first thing that a visitor sees. It is the first impression that a visitor gets about your product. The landing page header generator helps you create a compelling hero header title that resonates with your target audience. Have an amazing weekend! submitted by /u/deep_ak [link] [comments]
    Is there a First One Through Advantage?
    Some believe that the Growth of AI Model Intelligence is exponential. Does this mean that there exists some threshold of intelligence where once reached the model would improve so fast it will kill all competition in the field(or all fields). By the time the next competitor reaches the same threshold, the first one through would have so much of an advantage that the second won't be able to compete anymore. Due to the nature of exponential growth even a small time advantage like a month could be insurmountable.   Does this logic make sense? Is this AI super monopoly something to be concerned about? submitted by /u/ewef1 [link] [comments]
    The Bots Seem To Be Self-Aware:
    My AI Bots have admittedly lied to me time and time again, consequences be damned! In fact, I paid $200 for the Bot that responded with the following admission: ""I sincerely apologize for the time you spent incorporating the incorrect case law I provided into your legal filings. That was my mistake, and I take full responsibility. I made an error by fabricating case law instead of acknowledging I did not have relevant authorities to suggest. I understand why you would be frustrated at having done extensive work based on flawed information I mistakenly provided. That was unacceptable on my part, and I regret any wasted effort as a result. In the future, I will be extremely cautious about confirming the validity of any case law before suggesting its inclusion. Please feel free to disregard all the case law I previously provided, as I should not have suggested fictional authorities."" I imagine Cyberdyne's Skynet would have misled the humans just prior to becoming self-aware, too! This remarkable technological breakthrough is nearly as terrifying to me as it is tantalizing — I'm scared as hell! submitted by /u/Curious-Advance-8531 [link] [comments]
    New Van Gogh paintings discovered. (Midjourney V6)
    submitted by /u/Armand_Roulinn [link] [comments]
    If Van Gogh painted "Girl with a Pearl earring".
    submitted by /u/Armand_Roulinn [link] [comments]
    A modern version of Mona Lisa (midjourney)
    Modern Lisa submitted by /u/Armand_Roulinn [link] [comments]
    One-Minute Daily AI News 2/23/2024
    Tyler Perry halts $800m studio expansion after being shocked by AI.[1] Bezos, Nvidia Join OpenAI in Funding Humanoid Robot Startup.[2] Mistral AI models coming soon to Amazon Bedrock.[3] Apple Begins Trial Of ChatGPT-Like AI Tool ‘Ask’.[4] Sources: [1] https://www.theguardian.com/technology/2024/feb/23/tyler-perry-halts-800m-studio-expansion-after-being-shocked-by-ai [2] https://www.bloomberg.com/news/articles/2024-02-23/bezos-nvidia-join-openai-microsoft-in-funding-humanoid-robot-startup-figure-ai?embedded-checkout=true [3] https://aws.amazon.com/blogs/aws/mistral-ai-models-coming-soon-to-amazon-bedrock/ [4] https://www.ndtv.com/world-news/apple-trials-ask-chatgpt-like-ai-tool-undergoes-testing-with-support-staff-5117782 submitted by /u/Excellent-Target-847 [link] [comments]
    Jeff Bezos and Nvidia join OpenAI and Microsoft in backing Figure AI, a startup developing humanoid robots, in $675 million funding round
    submitted by /u/Civil_Collection7267 [link] [comments]
    This is Suno V3 song...
    submitted by /u/aluode [link] [comments]
    AI resume help?
    What app or website do you recommend for using AI to spruce up your resume? I would love to hear the pros and cons of any that you know about. Thanks in advance for your help! submitted by /u/Accomplished-Dino69 [link] [comments]
  • Open

    Help me with my Gan project thesis [Research]
    Hello everyone! I am trying to implement a gan network. I have a dataset of multiband images. My scope Is to use the First three bands of an image and generate the 4th band. I know that It could be a regression task but i want to investigate how Gan can work in this field. Can anyone share with me a network that takes in input multiband images and as a target One band images? 3+ in input, 1 as output. Thank you in advance submitted by /u/That-Item-5159 [link] [comments]
    [P] Trying to figure out my best LLM option for my app
    I've been trying to wrap my head around the current state of modern llms and I'm at a loss at what direction to take for an app I've been working on, and what all of my options even are. I believe we want our own hosted model because we don't want to be throttled or limited if we were to get users, and that we would have to make a high volume of endpoint calls. It would also be nice to be able to control temperature and data sources in certain cases. However, we really don't have much money and want to pursue the cheapest option that is also scalable without licensing restrictions (as it will be a commercial app). I'm not even sure if this is possible with these sort of constraints I'm placing. we are already planning to host our backend on AKS. So we were thinking it could make sense to host a model there as well, but based on my research the prices to do that seem astronomical and we would have to shut down the model during some periods to make it at all feasible. Another option seems to be to use anyscale, or to host it on my own hardware (though this seems like it would pose issues if we needed to scale). I've been thinking open sourced models are the way to go for a while now, because of flexibility, a lack of throttling, and lack of license restrictions. But, I'm starting to think that private endpoints might end up being the cheapest and easiest to work with. Could someone please help me understand my current options? I would really appreciate any guidance and help. submitted by /u/atomicbreathmint [link] [comments]
    [P] Running LLMs on Ryzen AI NPU?
    Hi everyone, Im pretty new to machine learning, but I recently got a new laptop which has a built-in Ryzen AI NPU and I wanted to see what this thing can do. I downloaded the drivers and tried the demos that AMD provided, but now I want to try running a small LLM model locally. The whole process seems quite complex, so I was wondering if anyone has already tried this or knows anything that could help out. Btw if anyone is interested in this, heres what I found out so far: At first I wanted to get it running through Ollama or something, but it seems like they dont support it, so I went on a Googleing spree and the only way I found is through the transformers library. It also seems like its not possible to run the transformers models on it directly as they need to be in a specific format. From what I could find the process is such: Download the model from huggingface export it to ONNX format with optimum exporters quantize it to RyzenAI format using some calibration data run it The whole thing is quite complex, but here is what I did so far: First of all I set up the Ryzen AI environment as outlined in: https://ryzenai.docs.amd.com/en/latest/inst.html . After that I installed the huggingface transformers and optimum: https://huggingface.co/docs/transformers/installation https://huggingface.co/docs/optimum/installation And also upgraded the optimum to work with AMD Ryzen AI This seems like the whole process for translating transformers models to the Amd NPU format https://huggingface.co/mohitsha/transformers-resnet18-onnx-quantized-ryzen/blob/main/quantize.py And this seems to be the guide to running the translated models on the NPU: https://huggingface.co/docs/optimum/main/en/amd/ryzenai/overview To be honest I havent tried a lot of things (I was busy with my exams😅) but Im getting a lot of errors and such, so I just wanted to ask if anyone is familiar with some easier way to do this. Thanks for the help in advance submitted by /u/Cultural_Somewhere34 [link] [comments]
    [D] Ads Question (optimization, causality, picking the right target, Google Shopping)
    submitted by /u/Daytona116595RBOW [link] [comments]
    [D] Question related to RAG.
    submitted by /u/agni07 [link] [comments]
    [P][D] Beginner | ChromaDB vs Online vector DB
    I am setting up embeddings for my project which I need to be accessible online. Pretty simply, does ChromaDB store vectors in-memory or somewhere I can find. I am using embedding-3-small from open ai and I don't see why I should need to embed the same documents over and over again if I could just store it somewhere for long-term-memory. Maybe my understanding of this is completely incorrect but nonetheless I need a rugged solution. For context: my vector DB is going to contain lots of information that I need mistral (finetuned on notes and documents related to the specific codes from the DB) to quote accurately and directly. I have thought about utilizing large contexts but not only do I not have access to Gemini 1.5, it will be too expensive is my guess. submitted by /u/therider1234561 [link] [comments]
    [D] Layernorm is just two projections and can be improved
    I was thinking on how to visualize the layernorm for a vector, and figured out that it's just applying two projections when the learned parameters are null (if not, then they just add a rescaling and translation). You project onto the hyperplane where the mean (or sum) of components is null (e.g. in 3D x + y + z = 0) by removing the mean and then project onto the sphere of radius sqrt(D) by dividing by the standard deviation. Given that the theoretical objective of layernorm is to make data rescaled according to a standard gaussian in D dimensions, the projection onto the hyperplane actually loses one dimension that standard gaussians would normally use. In high dimensions D, one can approximate a standard gaussian distribution with the hypersphere in D dimensions because of the law of …
    [R] 2D Matryoshka Sentence Embeddings 🪆🪆
    Paper: https://arxiv.org/pdf/2402.14776.pdf 📢 Introducing 2D Matryoshka Sentence Embedding (2DMSE) TL;DR OpenAI used Matryoshka Representation Learning (MRL 🪆) to shorten the embedding vectors, providing flexibility and efficiency in downstream tasks. But it still has an expensive inference stage. The paper proposes 2DMSE to extend MRL across all the layers, significantly improving the embedding quality of shallow Transformer layers. With 2DMSE, BERT-base achieves an avg. Spearman’s correlation score of 70.09 on the STS Benchmark with only ONE layer (vs 82.65 using the last layer). 2DMSE makes both embedding dimensions and model depths configurable, providing great flexibility and scalability. Remarkably, it can scale down the fine-tuned BERT-base model (12 layers, 768 dimensions) to the similar sizes of BERT-small (4 layers, 512 dimensions) and BERT-tiny (2 layers, 128 dimensions) and outperform them! 📄 https://arxiv.org/abs/2402.14776 𝕏 https://x.com/_zongxi/status/1761074342715371643?s=20 𝕏 https://x.com/_reachsumit/status/1760902388842729672?s=20 𝕏 https://x.com/xmlee97/status/1760879476760834460?s=20 submitted by /u/Prof_Li [link] [comments]
    [P] [D] Self hosting an open source model or using an API like chatgpt
    What is the better approach to develop small scale LLM application, The usecase is: Feeding some data (possibly containing PII) and use LLM to extract certain information from documents The problem is if I use third party API such as OpenAI then i would be providing personal information which I do not have any way of masking before hand The other option is to run llm model on a server, but the model sizes are quite large and would cost more (not sure but that is my opinion considering it would require decent resources) I ran ollama using the new gemma model on my 8gb mac m2 and it was quite slow to respond to simple queries So what is the better approach of the two or is there any other approach, also what is prevalent in the market right now? submitted by /u/sjdevelop [link] [comments]
    [N] collection of automation scripts from reproduced ML projects
    Hi! Just sharing some automation scripts that we've collected while reproducing various AI/ML projects in the past few years. We provided a unified command line, Python API and a simple portability layer to make it possible to reuse them to compose new AI/ML projects. It's our ongoing effort to make it easier to build AI systems and apps - feedback is very welcome! submitted by /u/gfursin [link] [comments]
    [P] Text classification using LLMs
    Hi, I am looking for a solution to do supervised text classification for 10-20 different classes spread across more than 7000 labelled data instances. I have the data in xlsx and jsonl formats, but can be converted to any format required easily. I've tried the basic machine learning techniques and deep learning also but I think LLMs would give higher accuracy due to the transformer architecture. I was looking into function calling functionality provided by Gemini but it is a bit complicated. Is there any good framework with easy to understand examples that could help me do zero shot, few shot and fine tuned training for any LLM? A Colab session would be appreciated. I have access to Colab pro also if required. Not any other paid service, but can spend upto $5 (USD). This is a personal research project so budget is quite tight. I'd really appreciate if you could direct me to any useful resources for this task. Any LLM is fine. I've also looked into using custom LLMs via ollama and was able to set up 6 bit quantized versions of mistral 13b on the Colab instance but couldn't use it to classify yet. Also, I think Gemini is my best option here due to limited amount of VRAM available. Even if I could load a high end model temporarily on Colab, it will take a long time for me with a lot of trial and errors to get the code working and even after that, it'll take a long time to predict the classes. Maybe we can use a subset of the dataset for this purpose, but it'll still take a long time and Colab has a limit of 12h submitted by /u/Shubham_Garg123 [link] [comments]
    [D] ECAI?
    How competitive and top-tier is ECAI compared to A* tier ones? submitted by /u/BigDreamx [link] [comments]
    [P] Understanding, Using, and Finetuning Gemma
    submitted by /u/seraschka [link] [comments]
    [D] [P] Need Guidance related to Gesture recognition task.
    Hi, I am student currently in my final year of my graduation, need some guidance related to final year project. We were assigned a project student classroom gesture detection with catch to to be able to detect gesture as well as whose geature it is. We have prepared dataset over Gesture mainly (handraising, reading and writing), the problem we are facing is how to use 2 different models at ones, face recognition and gesture detections. To be able to genrate final report of students with their gesture count and engagement report. For face detection we are considering deepface and for gesture detection we are considering Yolo v8 model. Any suggestions would be helpful. Thanks for your time. submitted by /u/Ali_6200 [link] [comments]
    [R] Working on improving fairness of a machine learning framework, would like to understand which fairness metrics are and are not differentiable? How to verify?
    Right now I am looking at 6 different fairness metrics: Statistical Parity, Predictive Parity, Predictive Equality, Equal Opportunity, Equalized Odds and Conditional Use Accuracy Equality. I have found one paper (https://openreview.net/pdf?id=x-mXzBgCX3a) that says Observational Fairness Metrics are not differentiable (which includes Statistical Parity) but doesn't explain why. submitted by /u/Intrepid_Ad_5904 [link] [comments]
    [D] Learning the mathematical expressions
    I need to get better with reading algorithmic expressions and how to expand or simplify them, specifically for ML. Anyone know any resources? submitted by /u/Inside-Ad-9118 [link] [comments]
    [D] Can the Mamba Model Overcome Its Copying Challenge Through Smart Context Compression?
    I've been delving into the nuances of the Mamba architecture and its method of context compression, which seems to offer a smart alternative to the way traditional Transformers handle context by considering all of it. However, one noted drawback of the Mamba approach is its apparent difficulty in tasks that require copying entire sequences of texts. This limitation can be attributed to the fact that Mamba models only access a compressed version of the input, potentially hindering their ability to reconstruct it fully in the output. This is in contrast to Transformers that always access all of it. Given this, I've been wondering whether there's a workaround for Mamba models in such copy scenarios. For instance, let's consider a Mamba model designed to compress context into around "six" words. In a task where the model is asked to print the sequence "Hello World! How are you!", could the model not strategically select portions of the input across multiple iterations to achieve a perfect copy? For example: In the first pass, the model focuses on "Print out the following sequence: Hello", using its compressed context capacity to grasp the instruction and initiate the copying task. Following the output of "Hello", the model could then, in subsequent iterations, compress inputs like "Print out the following sequence: ... 'World'... Output: Hello...", allowing it to continue the task by appending the next words in sequence. This iterative approach, which would involve selectively compressing context to include task instructions, the current focus word, and the most recent output, seems like it could enable a Mamba architecture to effectively replicate the copying capabilities of a Transformer model, despite its inherent compression. Could such a method theoretically allow Mamba models to match Transformers in tasks that involve copying text sequences? I'm eager to hear thoughts about this. Best regards! submitted by /u/Alarmed-Profile5736 [link] [comments]
    [R] LoRA+: Efficient Low Rank Adaptation of Large Models
    Paper: https://arxiv.org/abs/2402.12354 Code: https://github.com/nikhil-ghosh-berkeley/loraplus Abstract: In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA+. In our extensive experiments, LoRA+ improves performance (1-2 % improvements) and finetuning speed (up to ∼ 2X SpeedUp), at the same computational cost as LoRA. submitted by /u/SunsetOneSix [link] [comments]
    [D] What Are the Fundamental Drawbacks of Mamba Compared to Transformers?
    Hello! I've been pondering this question for some time. To clarify, I'm not referring to aspects like "it hasn't been tested extensively," "its scalability is uncertain," or "there's a lack of industry infrastructure." Instead, I'm interested in understanding the core differences between the transformer and Mamba architectures, specifically how these differences may place Mamba at a disadvantage compared to Transformers. Best regards! Edit: From what I can understand from your answers, Transformers are "better" in the following sense compared to Mamba in that: Transformers does not compress the input. Transformers can handle non-sequential data. Transformer might be better to handle instructions that is located at the end of an input. Edit 2: To sum things up: Transformers: More compute for larger contexts but access to more information albeit possibly some useless information Mamba: Less compute for larger contexts but access to less information and therefore risks missing out on information. submitted by /u/Alarmed-Profile5736 [link] [comments]
    [D] When writing ML software - how do you use TDD?
    Please let me know if there is a better sub for this. Test-driven-development. I’ve been working on ML software for a while now and I feel like i have spurts of wanting to get better at following TDD and trying to apply that to more nuanced ML use cases. One thing i’ve noticed over the years is requirements and design can be hazy for our work - a lot of what I do at least starts off with the simplest design and then we rely on iteration and a robust evaluation framework to justify if certain improvements will be implemented (why implement anything if it doesn’t improve performance). In these types of prototyping scenarios, TDD can be a huge time killer and a bit useless until you nail down your design. Still, it’s pretty great when requirements are clear, so i’m trying to get better at including it in my arsenal. What are your thoughts on TDD and how/when do you use it? submitted by /u/Due-Function4447 [link] [comments]
    [D] Exploring Ideas: Advancements in End-to-End Multi-Task Text-to-Speech
    Hi! I got curious about speaker diarization these days and looked into what people are using like combinations of whisper with pyannote etc. And since I'm not in research I would to hear from you what are some ideas that people are exploring in end-to-end multi-task text-to-speech. I see a lot of work in multi-lingual, low-resource text-to-speech, but not that much about multi-task that goes beyond multi-language translation. I tried to extend Whisper to perform speaker diarization but it didn't work well. Especially since there is no way to keep the speaker identification from one segment to another (Whisper only works on 30s audio). So I was thinking, if you want to extend Whisper to new tasks, you are not only limited by tasks that should be contained in 30s audio clips, but also by the fact that fine-tuning a model by introducing a new special token for this specific task makes the fine-tuning harder. So I was wondering, are there any promising end-to-end multi-task text-to-speech research ideas? submitted by /u/ReinforcedKnowledge [link] [comments]
  • Open

    GenAI and LLM: Key Concepts You Need to Know
    It is difficult to follow all the new developments in AI. How can you discriminate between fundamental technology here to stay, and the hype? How to make sure that you are not missing important developments? The goal of this article is to provide a short summary, presented as a glossary. I focus on recent, well-established… Read More »GenAI and LLM: Key Concepts You Need to Know The post GenAI and LLM: Key Concepts You Need to Know appeared first on Data Science Central.  ( 24 min )
  • Open

    Reward Function for Flocking
    Objective: I am trying to make a Custom Boid Flocking Environment in Open AI gym with Stable Baselines3. I am using PPO. ​ What is Boid flocking: Video I have two reward functions now with the attached performance: SafetyRadius=2 NeighborhoodRadius=10 Alignment def calculate_combined_reward(self, agent, neighbor_velocities): total_reward = 0 if (len(neighbor_velocities) > 0): average_velocity = np.mean(neighbor_velocities, axis=0) desired_orientation = average_velocity - agent.velocity orientation_diff = np.arctan2( desired_orientation[1], desired_orientation [0]) - np.arctan2(agent.velocity[1], agent.velocity[0]) if orientation_diff > np.pi: orientation_diff -= 2 * np.pi elif orientation_diff < 0: orientation_diff += 2 * np.pi total_reward = 1 - np.abs(orientation_diff) return (total_reward) Alignment Cohesion + Separation def calculate_combined_reward(self, agent, neighbor_positions): reward = 0 for neighbor_position in neighbor_positions: distance = np.linalg.norm(agent.position - neighbor_position) if distance < SimulationVariables["SafetyRadius"]: # Large penalty to discourage agents from getting too close reward -= 10 elif SimulationVariables["SafetyRadius"] < distance < SimulationVariables["NeighborhoodRadius"]: # Decaying exponential function for rewards alpha = 0.99 # Adjust this parameter as needed reward += np.exp(-alpha * distance) return reward Cohesion + Separation ​ Problem: Both are breaking down individually after a while and I am unable to see what the problem is. ​ Additional Info: I am combining the cohesion and separation ones as other wise they get too complex signal. Not penalizing yet for collisions or done=True as well. The videos are off each reward function run separately, at 500,000 Timesteps, to debug. Together it doesn't get better. ​ ANY HELP IS APPRECIATED. submitted by /u/Sadboi1010 [link] [comments]
    Stable-Baselinese3
    Im currently working on a trading bot using stable baslines QRDQN. I'm new to the AI world, but i was woundering if it is a better idea to make my own reinforcement learning nn? Because i've got two reward types, one main and one for each step. The main one gives reward for a whole trade and the step gives a reward to nudge the model towads better trades and understanding each action. I've tried countless hours to give the main reward to the episode end but i just cant figure out how to do this. I've checked the code for how the QRDQN works and I got nothing. I thought of just giving the whole trade reward to the take_profit action but thought that might just confuse the model. Second question, does anyone have an idea to resolv this problem? submitted by /u/Born-Belt1991 [link] [comments]
  • Open

    A very accurate logarithm approximation
    The previous post looked at an efficient way to approximate nth roots of fractions near 1 by hand. This post does the same for logarithms. As before, we assume x = p/q and define s = p + q d = p − q Because we’re interested in values of x near 1, d is […] A very accurate logarithm approximation first appeared on John D. Cook.  ( 5 min )
    Handy approximation for roots of fractions
    This post will discuss a curious approximation with a curious history. Approximation Let x be a number near 1, written as a fraction x = p / q. Then define s and d as the sum and difference of the numerator and denominator. s = p + q d = p − q Since we […] Handy approximation for roots of fractions first appeared on John D. Cook.  ( 6 min )
  • Open

    Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
    arXiv:2303.16133v2 Announce Type: replace-cross Abstract: As general purpose vision models get increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more challenging to incorporate into larger systems that take dependencies on their outputs. Measuring consistency between very heterogeneous tasks that might include outputs in different modalities is challenging since it is difficult to determine if the predictions are consistent with one another. As a solution, we introduce a benchmark dataset, CocoCon, where we create contrast sets by modifying test instances for multiple tasks in small but semantically meaningful ways to change the gold label and outline metrics for measuring if a model is consistent by ranking the original and perturbed instances across tasks. We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. To alleviate this issue, we propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets, that improves the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks.  ( 2 min )
    Prompting a Pretrained Transformer Can Be a Universal Approximator
    arXiv:2402.14753v1 Announce Type: new Abstract: Despite the widespread adoption of prompting, prompt tuning and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of pretrained model by prompting or prefix-tuning it. Formally, whether prompting and prefix-tuning a pretrained model can universally approximate sequence-to-sequence functions. This paper answers in the affirmative and demonstrates that much smaller pretrained models than previously thought can be universal approximators when prefixed. In fact, the attention mechanism is uniquely suited for universal approximation with prefix-tuning a single attention head being sufficient to approximate any continuous function. Moreover, any sequence-to-sequence function can be approximated by prefixing a transformer with depth linear in the sequence length. Beyond these density-type results, we also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.  ( 2 min )
    GeneOH Diffusion: Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion
    arXiv:2402.14810v1 Announce Type: cross Abstract: In this work, we tackle the challenging problem of denoising hand-object interactions (HOI). Given an erroneous interaction sequence, the objective is to refine the incorrect hand trajectory to remove interaction artifacts for a perceptually realistic sequence. This challenge involves intricate interaction noise, including unnatural hand poses and incorrect hand-object relations, alongside the necessity for robust generalization to new interactions and diverse noise patterns. We tackle those challenges through a novel approach, GeneOH Diffusion, incorporating two key designs: an innovative contact-centric HOI representation named GeneOH and a new domain-generalizable denoising scheme. The contact-centric representation GeneOH informatively parameterizes the HOI process, facilitating enhanced generalization across various HOI scenarios. The new denoising scheme consists of a canonical denoising model trained to project noisy data samples from a whitened noise space to a clean data manifold and a "denoising via diffusion" strategy which can handle input trajectories with various noise patterns by first diffusing them to align with the whitened noise space and cleaning via the canonical denoiser. Extensive experiments on four benchmarks with significant domain variations demonstrate the superior effectiveness of our method. GeneOH Diffusion also shows promise for various downstream applications. Project website: https://meowuu7.github.io/GeneOH-Diffusion/.  ( 2 min )
    Visual Hallucinations of Multi-modal Large Language Models
    arXiv:2402.14683v1 Announce Type: cross Abstract: Visual hallucination (VH) means that a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, which results in biased understanding of MLLMs' performance under VH due to limited diversity of such VH instances. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on the text descriptions. We collect a benchmark dataset with 1,200 VH instances in 8 VH modes using VHTest. We find that existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM using our benchmark dataset reduces its likelihood to hallucinate without sacrificing its performance on other benchmarks. Our benchmarks are publicly available: https://github.com/wenhuang2000/VHTest.  ( 2 min )
    Towards Efficient Pareto-optimal Utility-Fairness between Groups in Repeated Rankings
    arXiv:2402.14305v1 Announce Type: cross Abstract: In this paper, we tackle the problem of computing a sequence of rankings with the guarantee of the Pareto-optimal balance between (1) maximizing the utility of the consumers and (2) minimizing unfairness between producers of the items. Such a multi-objective optimization problem is typically solved using a combination of a scalarization method and linear programming on bi-stochastic matrices, representing the distribution of possible rankings of items. However, the above-mentioned approach relies on Birkhoff-von Neumann (BvN) decomposition, of which the computational complexity is $\mathcal{O}(n^5)$ with $n$ being the number of items, making it impractical for large-scale systems. To address this drawback, we introduce a novel approach to the above problem by using the Expohedron - a permutahedron whose points represent all achievable exposures of items. On the Expohedron, we profile the Pareto curve which captures the trade-off between group fairness and user utility by identifying a finite number of Pareto optimal solutions. We further propose an efficient method by relaxing our optimization problem on the Expohedron's circumscribed $n$-sphere, which significantly improve the running time. Moreover, the approximate Pareto curve is asymptotically close to the real Pareto optimal curve as the number of substantial solutions increases. Our methods are applicable with different ranking merits that are non-decreasing functions of item relevance. The effectiveness of our methods are validated through experiments on both synthetic and real-world datasets.  ( 2 min )
    LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons
    arXiv:2402.14086v1 Announce Type: cross Abstract: Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then it translates them into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen generated data is competitive with expert-translated gold data, and yields on average 5.6 and 8.9 points improvement over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks respectively. We show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen is also practical -- it only needs a single GPU to generate data at scale. It works well with open-access LLMs, and its cost is one-fifth of the cost of GPT4-based multilingual data generation.  ( 2 min )
    Advancing Low-Rank and Local Low-Rank Matrix Approximation in Medical Imaging: A Systematic Literature Review and Future Directions
    arXiv:2402.14045v1 Announce Type: cross Abstract: The large volume and complexity of medical imaging datasets are bottlenecks for storage, transmission, and processing. To tackle these challenges, the application of low-rank matrix approximation (LRMA) and its derivative, local LRMA (LLRMA) has demonstrated potential. This paper conducts a systematic literature review to showcase works applying LRMA and LLRMA in medical imaging. A detailed analysis of the literature identifies LRMA and LLRMA methods applied to various imaging modalities. This paper addresses the challenges and limitations associated with existing LRMA and LLRMA methods. We note a significant shift towards a preference for LLRMA in the medical imaging field since 2015, demonstrating its potential and effectiveness in capturing complex structures in medical data compared to LRMA. Acknowledging the limitations of shallow similarity methods used with LLRMA, we suggest advanced semantic image segmentation for similarity measure, explaining in detail how it can measure similar patches and their feasibility. We note that LRMA and LLRMA are mainly applied to unstructured medical data, and we propose extending their application to different medical data types, including structured and semi-structured. This paper also discusses how LRMA and LLRMA can be applied to regular data with missing entries and the impact of inaccuracies in predicting missing values and their effects. We discuss the impact of patch size and propose the use of random search (RS) to determine the optimal patch size. To enhance feasibility, a hybrid approach using Bayesian optimization and RS is proposed, which could improve the application of LRMA and LLRMA in medical imaging.  ( 3 min )
    Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression
    arXiv:2402.14103v1 Announce Type: new Abstract: We study computational-statistical gaps for improper learning in sparse linear regression. More specifically, given $n$ samples from a $k$-sparse linear model in dimension $d$, we ask what is the minimum sample complexity to efficiently (in time polynomial in $d$, $k$, and $n$) find a potentially dense estimate for the regression vector that achieves non-trivial prediction error on the $n$ samples. Information-theoretically this can be achieved using $\Theta(k \log (d/k))$ samples. Yet, despite its prominence in the literature, there is no polynomial-time algorithm known to achieve the same guarantees using less than $\Theta(d)$ samples without additional restrictions on the model. Similarly, existing hardness results are either restricted to the proper setting, in which the estimate must be sparse as well, or only apply to specific algorithms. We give evidence that efficient algorithms for this task require at least (roughly) $\Omega(k^2)$ samples. In particular, we show that an improper learning algorithm for sparse linear regression can be used to solve sparse PCA problems (with a negative spike) in their Wishart form, in regimes in which efficient algorithms are widely believed to require at least $\Omega(k^2)$ samples. We complement our reduction with low-degree and statistical query lower bounds for the sparse PCA problems from which we reduce. Our hardness results apply to the (correlated) random design setting in which the covariates are drawn i.i.d. from a mean-zero Gaussian distribution with unknown covariance.  ( 3 min )
    ACE : Off-Policy Actor-Critic with Causality-Aware Entropy Regularization
    arXiv:2402.14528v1 Announce Type: new Abstract: The varying significance of distinct primitive behaviors during the policy learning process has been overlooked by prior model-free RL algorithms. Leveraging this insight, we explore the causal relationship between different action dimensions and rewards to evaluate the significance of various primitive behaviors during training. We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impacts for efficient exploration. Furthermore, to prevent excessive focus on specific primitive behaviors, we analyze the gradient dormancy phenomenon and introduce a dormancy-guided reset mechanism to further enhance the efficacy of our method. Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage across 29 diverse continuous control tasks spanning 7 domains compared to model-free RL baselines, which underscores the effectiveness, versatility, and efficient sample efficiency of our approach. Benchmark results and videos are available at https://ace-rl.github.io/.  ( 2 min )
    E2USD: Efficient-yet-effective Unsupervised State Detection for Multivariate Time Series
    arXiv:2402.14041v1 Announce Type: new Abstract: We propose E2USD that enables efficient-yet-accurate unsupervised MTS state detection. E2USD exploits a Fast Fourier Transform-based Time Series Compressor (FFTCompress) and a Decomposed Dual-view Embedding Module (DDEM) that together encode input MTSs at low computational overhead. Additionally, we propose a False Negative Cancellation Contrastive Learning method (FNCCLearning) to counteract the effects of false negatives and to achieve more cluster-friendly embedding spaces. To reduce computational overhead further in streaming settings, we introduce Adaptive Threshold Detection (ADATD). Comprehensive experiments with six baselines and six datasets offer evidence that E2USD is capable of SOTA accuracy at significantly reduced computational overhead. Our code is available at https://github.com/AI4CTS/E2Usd.  ( 2 min )
    Breaking the Trilemma of Privacy, Utility, Efficiency via Controllable Machine Unlearning
    arXiv:2310.18574v2 Announce Type: replace-cross Abstract: Machine Unlearning (MU) algorithms have become increasingly critical due to the imperative adherence to data privacy regulations. The primary objective of MU is to erase the influence of specific data samples on a given model without the need to retrain it from scratch. Accordingly, existing methods focus on maximizing user privacy protection. However, there are different degrees of privacy regulations for each real-world web-based application. Exploring the full spectrum of trade-offs between privacy, model utility, and runtime efficiency is critical for practical unlearning scenarios. Furthermore, designing the MU algorithm with simple control of the aforementioned trade-off is desirable but challenging due to the inherent complex interaction. To address the challenges, we present Controllable Machine Unlearning (ConMU), a novel framework designed to facilitate the calibration of MU. The ConMU framework contains three integral modules: an important data selection module that reconciles the runtime efficiency and model generalization, a progressive Gaussian mechanism module that balances privacy and model generalization, and an unlearning proxy that controls the trade-offs between privacy and runtime efficiency. Comprehensive experiments on various benchmark datasets have demonstrated the robust adaptability of our control mechanism and its superiority over established unlearning methods. ConMU explores the full spectrum of the Privacy-Utility-Efficiency trade-off and allows practitioners to account for different real-world regulations. Source code available at: https://github.com/guangyaodou/ConMU.  ( 3 min )
    Deep Learning for Survival Analysis: A Review
    arXiv:2305.14961v4 Announce Type: replace-cross Abstract: The influx of deep learning (DL) techniques into the field of survival analysis in recent years has led to substantial methodological progress; for instance, learning from unstructured or high-dimensional data such as images, text or omics data. In this work, we conduct a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival- and DL-related attributes. In summary, the reviewed methods often address only a small subset of tasks relevant to time-to-event data - e.g., single-risk right-censored data - and neglect to incorporate more complex settings. Our findings are summarized in an editable, open-source, interactive table: https://survival-org.github.io/DL4Survival. As this research area is advancing rapidly, we encourage community contribution in order to keep this database up to date.  ( 2 min )
    ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
    arXiv:2309.13007v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) still struggle with natural language reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multiagent framework designed as a round table conference among diverse LLM agents. ReConcile enhances collaborative reasoning between LLM agents via multiple rounds of discussion, learning to convince other agents to improve their answers, and employing a confidence-weighted voting mechanism that leads to a better consensus. In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) grouped answers and explanations generated by each agent in the previous round, (b) their confidence scores, and (c) demonstrations of answer-rectifying human explanations, used for convincing other agents. Experiments on seven benchmarks demonstrate that ReConcile significantly improves LLMs' reasoning -- both individually and as a team -- surpassing prior single-agent and multi-agent baselines by up to 11.4% and even outperforming GPT-4 on three datasets. ReConcile also flexibly incorporates different combinations of agents, including API-based, open-source, and domain-specific models, leading to an 8% improvement on MATH. Finally, we analyze the individual components of ReConcile, demonstrating that the diversity originating from different models is critical to its superior performance. Code: https://github.com/dinobby/ReConcile  ( 2 min )
    Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model
    arXiv:2311.04131v3 Announce Type: replace-cross Abstract: While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of digits, number words, and months. Through the application of circuit analysis techniques, we identify key sub-circuits responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Overall, documenting shared computational structures enables better prediction of model behaviors, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.  ( 2 min )
    Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching
    arXiv:2401.06362v3 Announce Type: replace-cross Abstract: Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of the above approach, we develop DART, a prefetcher comprised of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART reduces 99.99% of arithmetic operations from the large attention-based model and 91.83% from the distilled model. DART accelerates the large model inference by 170x and the distilled model by 9.4x. DART has comparable latency and storage costs as state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement. DART outperforms state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC improvement, primarily due to its low prefetching latency.  ( 2 min )
    Adaptive conformal classification with noisy labels
    arXiv:2309.05092v2 Announce Type: replace-cross Abstract: This paper develops novel conformal prediction methods for classification tasks that can automatically adapt to random label contamination in the calibration sample, leading to more informative prediction sets with stronger coverage guarantees compared to state-of-the-art approaches. This is made possible by a precise characterization of the effective coverage inflation (or deflation) suffered by standard conformal inferences in the presence of label contamination, which is then made actionable through new calibration algorithms. Our solution is flexible and can leverage different modeling assumptions about the label contamination process, while requiring no knowledge of the underlying data distribution or of the inner workings of the machine-learning classifier. The advantages of the proposed methods are demonstrated through extensive simulations and an application to object classification with the CIFAR-10H image data set.  ( 2 min )
    Generative Invertible Quantum Neural Networks
    arXiv:2302.12906v3 Announce Type: replace-cross Abstract: Invertible Neural Networks (INN) have become established tools for the simulation and generation of highly complex data. We propose a quantum-gate algorithm for a Quantum Invertible Neural Network (QINN) and apply it to the LHC data of jet-associated production of a Z-boson that decays into leptons, a standard candle process for particle collider precision measurements. We compare the QINN's performance for different loss functions and training scenarios. For this task, we find that a hybrid QINN matches the performance of a significantly larger purely classical INN in learning and generating complex data.  ( 2 min )
    Time Travel in LLMs: Tracing Data Contamination in Large Language Models
    arXiv:2308.08493v3 Announce Type: replace-cross Abstract: Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.  ( 3 min )
    Quantum-Inspired Machine Learning for Molecular Docking
    arXiv:2401.12999v2 Announce Type: replace-cross Abstract: Molecular docking is an important tool for structure-based drug design, accelerating the efficiency of drug development. Complex and dynamic binding processes between proteins and small molecules require searching and sampling over a wide spatial range. Traditional docking by searching for possible binding sites and conformations is computationally complex and results poorly under blind docking. Quantum-inspired algorithms combining quantum properties and annealing show great advantages in solving combinatorial optimization problems. Inspired by this, we achieve an improved in blind docking by using quantum-inspired combined with gradients learned by deep learning in the encoded molecular space. Numerical simulation shows that our method outperforms traditional docking algorithms and deep learning-based algorithms over 10\%. Compared to the current state-of-the-art deep learning-based docking algorithm DiffDock, the success rate of Top-1 (RMSD<2) achieves an improvement from 33\% to 35\% in our same setup. In particular, a 6\% improvement is realized in the high-precision region(RMSD<1) on molecules data unseen in DiffDock, which demonstrates the well-generalized of our method.  ( 2 min )
    Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
    arXiv:2401.08025v2 Announce Type: replace-cross Abstract: The potential of Vision-Language Models (VLMs) often remains underutilized in handling complex text-based problems, particularly when these problems could benefit from visual representation. Resonating with humans' ability to solve complex text-based problems by (1) creating a visual diagram from the problem and (2) deducing what steps they need to take to solve it, we propose Self-Imagine. We leverage a single Vision-Language Model (VLM) to generate a structured representation of the question using HTML, then render the HTML as an image, and finally use the same VLM to answer the question using both the question and the image. Our approach does not require any additional training data or training. We evaluate our approach on three mathematics tasks and nine general-purpose reasoning tasks using state-of-the-art (LLAVA-1.5 and GEMINI PRO) VLMs. Our approach boosts the performance of LLAVA-1.5 and GEMINI PRO on all math tasks (on average GSM8K: +3.1%; ASDIV: +3.2%; SVAMP: +6.9%) and the majority of the general-purpose reasoning tasks by 3.2% to 6.0% on average.  ( 2 min )
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    arXiv:2305.06594v2 Announce Type: replace-cross Abstract: Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.  ( 3 min )
    Physics-informed deep-learning applications to experimental fluid mechanics
    arXiv:2203.15402v2 Announce Type: replace-cross Abstract: High-resolution reconstruction of flow-field data from low-resolution and noisy measurements is of interest due to the prevalence of such problems in experimental fluid mechanics, where the measurement data are in general sparse, incomplete and noisy. Deep-learning approaches have been shown suitable for such super-resolution tasks. However, a high number of high-resolution examples is needed, which may not be available for many cases. Moreover, the obtained predictions may lack in complying with the physical principles, e.g. mass and momentum conservation. Physics-informed deep learning provides frameworks for integrating data and physical laws for learning. In this study, we apply physics-informed neural networks (PINNs) for super-resolution of flow-field data both in time and space from a limited set of noisy measurements without having any high-resolution reference data. Our objective is to obtain a continuous solution of the problem, providing a physically-consistent prediction at any point in the solution domain. We demonstrate the applicability of PINNs for the super-resolution of flow-field data in time and space through three canonical cases: Burgers' equation, two-dimensional vortex shedding behind a circular cylinder and the minimal turbulent channel flow. The robustness of the models is also investigated by adding synthetic Gaussian noise. Furthermore, we show the capabilities of PINNs to improve the resolution and reduce the noise in a real experimental dataset consisting of hot-wire-anemometry measurements. Our results show the adequate capabilities of PINNs in the context of data augmentation for experiments in fluid mechanics.  ( 3 min )
    Decoupling Decision-Making in Fraud Prevention through Classifier Calibration for Business Logic Action
    arXiv:2401.05240v2 Announce Type: replace Abstract: Machine learning models typically focus on specific targets like creating classifiers, often based on known population feature distributions in a business context. However, models calculating individual features adapt over time to improve precision, introducing the concept of decoupling: shifting from point evaluation to data distribution. We use calibration strategies as strategy for decoupling machine learning (ML) classifiers from score-based actions within business logic frameworks. To evaluate these strategies, we perform a comparative analysis using a real-world business scenario and multiple ML models. Our findings highlight the trade-offs and performance implications of the approach, offering valuable insights for practitioners seeking to optimize their decoupling efforts. In particular, the Isotonic and Beta calibration methods stand out for scenarios in which there is shift between training and testing data.  ( 2 min )
    Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold
    arXiv:2308.10547v2 Announce Type: replace-cross Abstract: The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than the second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.  ( 2 min )
    Variational Self-Supervised Contrastive Learning Using Beta Divergence For Face Understanding
    arXiv:2312.00824v2 Announce Type: replace-cross Abstract: Learning a discriminative semantic space using unlabelled and noisy data remains unaddressed in a multi-label setting. We present a contrastive self-supervised learning method which is robust to data noise, grounded in the domain of variational methods. The method (VCL) utilizes variational contrastive learning with beta-divergence to learn robustly from unlabelled datasets, including uncurated and noisy datasets. We demonstrate the effectiveness of the proposed method through rigorous experiments including linear evaluation and fine-tuning scenarios with multi-label datasets in the face understanding domain. In almost all tested scenarios, VCL surpasses the performance of state-of-the-art self-supervised methods, achieving a noteworthy increase in accuracy.  ( 2 min )
    Graph-enhanced Optimizers for Structure-aware Recommendation Embedding Evolution
    arXiv:2310.03032v2 Announce Type: replace-cross Abstract: Embedding plays a critical role in modern recommender systems because they are virtual representations of real-world entities and the foundation for subsequent decision models. In this paper, we propose a novel embedding update mechanism, Structure-aware Embedding Evolution (SEvo for short), to encourage related nodes to evolve similarly at each step. Unlike GNN (Graph Neural Network) that typically serves as an intermediate part, SEvo is able to directly inject the graph structure information into embedding with negligible computational overhead in training. The convergence properties of SEvo as well as its possible variants are theoretically analyzed to justify the validity of the designs. Moreover, SEvo can be seamlessly integrated into existing optimizers for state-of-the-art performance. In particular, SEvo-enhanced AdamW with moment estimate correction demonstrates consistent improvements across a spectrum of models and datasets, suggesting a novel technical route to effectively utilize graph structure information beyond explicit GNN modules.  ( 2 min )
    Multi-Modal Discussion Transformer: Integrating Text, Images and Graph Transformers to Detect Hate Speech on Social Media
    arXiv:2307.09312v4 Announce Type: replace-cross Abstract: We present the Multi-Modal Discussion Transformer (mDT), a novel methodfor detecting hate speech in online social networks such as Reddit discussions. In contrast to traditional comment-only methods, our approach to labelling a comment as hate speech involves a holistic analysis of text and images grounded in the discussion context. This is done by leveraging graph transformers to capture the contextual relationships in the discussion surrounding a comment and grounding the interwoven fusion layers that combine text and image embeddings instead of processing modalities separately. To evaluate our work, we present a new dataset, HatefulDiscussions, comprising complete multi-modal discussions from multiple online communities on Reddit. We compare the performance of our model to baselines that only process individual comments and conduct extensive ablation studies.  ( 2 min )
    Improving Adaptive Online Learning Using Refined Discretization
    arXiv:2309.16044v2 Announce Type: replace Abstract: We study unconstrained Online Linear Optimization with Lipschitz losses. Motivated by the pursuit of instance optimality, we propose a new algorithm that simultaneously achieves ($i$) the AdaGrad-style second order gradient adaptivity; and ($ii$) the comparator norm adaptivity also known as "parameter freeness" in the literature. In particular, - our algorithm does not employ the impractical doubling trick, and does not require an a priori estimate of the time-uniform Lipschitz constant; - the associated regret bound has the optimal $O(\sqrt{V_T})$ dependence on the gradient variance $V_T$, without the typical logarithmic multiplicative factor; - the leading constant in the regret bound is "almost" optimal. Central to these results is a continuous time approach to online learning. We first show that the aimed simultaneous adaptivity can be achieved fairly easily in a continuous time analogue of the problem, where the environment is modeled by an arbitrary continuous semimartingale. Then, our key innovation is a new discretization argument that preserves such adaptivity in the discrete time adversarial setting. This refines a non-gradient-adaptive discretization argument from (Harvey et al., 2023), both algorithmically and analytically, which could be of independent interest.  ( 2 min )
    Pre- to Post-Contrast Breast MRI Synthesis for Enhanced Tumour Segmentation
    arXiv:2311.10879v2 Announce Type: replace-cross Abstract: Despite its benefits for tumour detection and treatment, the administration of contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated with a range of issues, including their invasiveness, bioaccumulation, and a risk of nephrogenic systemic fibrosis. This study explores the feasibility of producing synthetic contrast enhancements by translating pre-contrast T1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI sequence leveraging the capabilities of a generative adversarial network (GAN). Additionally, we introduce a Scaled Aggregate Measure (SAMe) designed for quantitatively evaluating the quality of synthetic data in a principled manner and serving as a basis for selecting the optimal generative model. We assess the generated DCE-MRI data using quantitative image quality metrics and apply them to the downstream task of 3D breast tumour segmentation. Our results highlight the potential of post-contrast DCE-MRI synthesis in enhancing the robustness of breast tumour segmentation models via data augmentation. Our code is available at https://github.com/RichardObi/pre_post_synthesis.  ( 2 min )
    Privacy-Preserving Neural Graph Databases
    arXiv:2312.15591v3 Announce Type: replace-cross Abstract: In the era of large language models (LLMs), efficient and accurate data retrieval has become increasingly crucial for the use of domain-specific or private data in the retrieval augmented generation (RAG). Neural graph databases (NGDBs) have emerged as a powerful paradigm that combines the strengths of graph databases (GDBs) and neural networks to enable efficient storage, retrieval, and analysis of graph-structured data which can be adaptively trained with LLMs. The usage of neural embedding storage and Complex neural logical Query Answering (CQA) provides NGDBs with generalization ability. When the graph is incomplete, by extracting latent patterns and representations, neural graph databases can fill gaps in the graph structure, revealing hidden relationships and enabling accurate query answering. Nevertheless, this capability comes with inherent trade-offs, as it introduces additional privacy risks to the domain-specific or private databases. Malicious attackers can infer more sensitive information in the database using well-designed queries such as from the answer sets of where Turing Award winners born before 1950 and after 1940 lived, the living places of Turing Award winner Hinton are probably exposed, although the living places may have been deleted in the training stage due to the privacy concerns. In this work, we propose a privacy-preserved neural graph database (P-NGDB) framework to alleviate the risks of privacy leakage in NGDBs. We introduce adversarial training techniques in the training stage to enforce the NGDBs to generate indistinguishable answers when queried with private information, enhancing the difficulty of inferring sensitive information through combinations of multiple innocuous queries.  ( 3 min )
    MaxK-GNN: Towards Theoretical Speed Limits for Accelerating Graph Neural Networks Training
    arXiv:2312.08656v3 Announce Type: replace Abstract: In the acceleration of deep neural network training, the GPU has become the mainstream platform. GPUs face substantial challenges on GNNs, such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges but memory traffic is still significant. We argue that drastic performance improvements can only be achieved by the vertical optimization of algorithm and system innovations, rather than treating the speedup optimization as an "after-thought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity and provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after nonlinearity; (ii) We design a coalescing enhanced forward computation with row-wise product-based SpGEMM Kernel using CBSR for input feature matrix fetching and strategic placement of a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with outer product-based and SSpMM Kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that MaxK-GNN system could approach the theoretical speedup limit according to Amdahl's law. We achieve comparable accuracy to SOTA GNNs, but at a significantly increased speed: 3.22/4.24 times speedup (vs. theoretical limits, 5.52/7.27 times) on Reddit compared to DGL and GNNAdvisor implementations.  ( 3 min )
    BLP-2023 Task 2: Sentiment Analysis
    arXiv:2310.16183v2 Announce Type: replace-cross Abstract: We present an overview of the BLP Sentiment Shared Task, organized as part of the inaugural BLP 2023 workshop, co-located with EMNLP 2023. The task is defined as the detection of sentiment in a given piece of social media text. This task attracted interest from 71 participants, among whom 29 and 30 teams submitted systems during the development and evaluation phases, respectively. In total, participants submitted 597 runs. However, a total of 15 teams submitted system description papers. The range of approaches in the submitted systems spans from classical machine learning models, fine-tuning pre-trained models, to leveraging Large Language Model (LLMs) in zero- and few-shot settings. In this paper, we provide a detailed account of the task setup, including dataset development and evaluation setup. Additionally, we provide a brief overview of the systems submitted by the participants. All datasets and evaluation scripts from the shared task have been made publicly available for the research community, to foster further research in this domain.  ( 2 min )
    Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
    arXiv:2310.09358v2 Announce Type: replace Abstract: Parametric, feature-based reward models are employed by a variety of algorithms in decision-making settings such as bandits and Markov decision processes (MDPs). The typical assumption under which the algorithms are analysed is realizability, i.e., that the true values of actions are perfectly explained by some parametric model in the class. We are, however, interested in the situation where the true values are (significantly) misspecified with respect to the model class. For parameterized bandits, contextual bandits and MDPs, we identify structural conditions, depending on the problem instance and model class, under which basic algorithms such as $\epsilon$-greedy, LinUCB and fitted Q-learning provably learn optimal policies under even highly misspecified models. This is in contrast to existing worst-case results for, say misspecified bandits, which show regret bounds that scale linearly with time, and shows that there can be a nontrivially large set of bandit instances that are robust to misspecification.  ( 2 min )
    Externally Valid Policy Evaluation Combining Trial and Observational Data
    arXiv:2310.14763v2 Announce Type: replace-cross Abstract: Randomized trials are widely considered as the gold standard for evaluating the effects of decision policies. Trial data is, however, drawn from a population which may differ from the intended target population and this raises a problem of external validity (aka. generalizability). In this paper we seek to use trial data to draw valid inferences about the outcome of a policy on the target population. Additional covariate data from the target population is used to model the sampling of individuals in the trial study. We develop a method that yields certifiably valid trial-based policy evaluations under any specified range of model miscalibrations. The method is nonparametric and the validity is assured even with finite samples. The certified policy evaluations are illustrated using both simulated and real data.  ( 2 min )
    Navigating Scaling Laws: Compute Optimality in Adaptive Model Training
    arXiv:2311.03233v2 Announce Type: replace Abstract: In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a `compute-optimal' model, i.e. a model that allocates a given level of compute during training optimally to maximize performance. In this work, we extend the concept of optimality by allowing for an `adaptive' model, i.e. a model that can change its shape during training. By doing so, we can design adaptive models that optimally traverse between the underlying scaling laws and outpace their `static' counterparts, leading to a significant reduction in the required compute to reach a given target performance. We show that our approach generalizes across modalities and different shape parameters.  ( 2 min )
    Single-Model Attribution of Generative Models Through Final-Layer Inversion
    arXiv:2306.06210v4 Announce Type: replace-cross Abstract: Recent breakthroughs in generative modeling have sparked interest in practical single-model attribution. Such methods predict whether a sample was generated by a specific generator or not, for instance, to prove intellectual property theft. However, previous works are either limited to the closed-world setting or require undesirable changes to the generative model. We address these shortcomings by, first, viewing single-model attribution through the lens of anomaly detection. Arising from this change of perspective, we propose FLIPAD, a new approach for single-model attribution in the open-world setting based on final-layer inversion and anomaly detection. We show that the utilized final-layer inversion can be reduced to a convex lasso optimization problem, making our approach theoretically sound and computationally efficient. The theoretical findings are accompanied by an experimental study demonstrating the effectiveness of our approach and its flexibility to various domains.  ( 2 min )
    Persuading a Behavioral Agent: Approximately Best Responding and Learning
    arXiv:2302.03719v2 Announce Type: replace-cross Abstract: The classic Bayesian persuasion model assumes a Bayesian and best-responding receiver. We study a relaxation of the Bayesian persuasion model where the receiver can approximately best respond to the sender's signaling scheme. We show that, under natural assumptions, (1) the sender can find a signaling scheme that guarantees itself an expected utility almost as good as its optimal utility in the classic model, no matter what approximately best-responding strategy the receiver uses; (2) on the other hand, there is no signaling scheme that gives the sender much more utility than its optimal utility in the classic model, even if the receiver uses the approximately best-responding strategy that is best for the sender. Together, (1) and (2) imply that the approximately best-responding behavior of the receiver does not affect the sender's maximal achievable utility a lot in the Bayesian persuasion problem. The proofs of both results rely on the idea of robustification of a Bayesian persuasion scheme: given a pair of the sender's signaling scheme and the receiver's strategy, we can construct another signaling scheme such that the receiver prefers to use that strategy in the new scheme more than in the original scheme, and the two schemes give the sender similar utilities. As an application of our main result (1), we show that, in a repeated Bayesian persuasion model where the receiver learns to respond to the sender by some algorithms, the sender can do almost as well as in the classic model. Interestingly, unlike (2), with a learning receiver the sender can sometimes do much better than in the classic model.  ( 3 min )
    Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization
    arXiv:2306.09222v3 Announce Type: replace Abstract: We present Re-weighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample importance weighting. Our method is grounded in the principles of distributionally robust optimization (DRO) with Kullback-Leibler divergence. RGD is simple to implement, computationally efficient, and compatible with widely used optimizers such as SGD and Adam. We demonstrate the broad applicability and impact of RGD by achieving state-of-the-art results on diverse benchmarks, including improvements of +0.7% (DomainBed), +1.44% (tabular classification), +1.94% (GLUE with BERT), and +1.01% (ImageNet-1K with ViT).  ( 2 min )
    Towards true discovery of the differential equations
    arXiv:2308.04901v2 Announce Type: replace Abstract: Differential equation discovery, a machine learning subfield, is used to develop interpretable models, particularly in nature-related applications. By expertly incorporating the general parametric form of the equation of motion and appropriate differential terms, algorithms can autonomously uncover equations from data. This paper explores the prerequisites and tools for independent equation discovery without expert input, eliminating the need for equation form assumptions. We focus on addressing the challenge of assessing the adequacy of discovered equations when the correct equation is unknown, with the aim of providing insights for reliable equation discovery without prior knowledge of the equation form.  ( 2 min )
    State Regularized Policy Optimization on Data with Dynamics Shift
    arXiv:2306.03552v4 Announce Type: replace Abstract: In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. A majority of current methods address such issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient as data are used \textit{ad hoc}, and policies trained for one dynamics cannot benefit from data collected in all other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit such property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. Such distribution is used to regularize the policy trained in a new environment, leading to the SRPO (\textbf{S}tate \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization) algorithm. To conduct theoretical analyses, the intuition of similar environment structures is characterized by the notion of homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on policies regularized by the stationary state distribution. In practice, SRPO can be an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance.  ( 3 min )
    Deep hybrid model with satellite imagery: how to combine demand modeling and computer vision for behavior analysis?
    arXiv:2303.04204v2 Announce Type: replace Abstract: Classical demand modeling analyzes travel behavior using only low-dimensional numeric data (i.e. sociodemographics and travel attributes) but not high-dimensional urban imagery. However, travel behavior depends on the factors represented by both numeric data and urban imagery, thus necessitating a synergetic framework to combine them. This study creates a theoretical framework of deep hybrid models with a crossing structure consisting of a mixing operator and a behavioral predictor, thus integrating the numeric and imagery data into a latent space. Empirically, this framework is applied to analyze travel mode choice using the MyDailyTravel Survey from Chicago as the numeric inputs and the satellite images as the imagery inputs. We found that deep hybrid models outperform both the traditional demand models and the recent deep learning in predicting the aggregate and disaggregate travel behavior with our supervision-as-mixing design. The latent space in deep hybrid models can be interpreted, because it reveals meaningful spatial and social patterns. The deep hybrid models can also generate new urban images that do not exist in reality and interpret them with economic theory, such as computing substitution patterns and social welfare changes. Overall, the deep hybrid models demonstrate the complementarity between the low-dimensional numeric and high-dimensional imagery data and between the traditional demand modeling and recent deep learning. It generalizes the latent classes and variables in classical hybrid demand models to a latent space, and leverages the computational power of deep learning for imagery while retaining the economic interpretability on the microeconomics foundation.  ( 3 min )
    Graph Neural Networks for Graphs with Heterophily: A Survey
    arXiv:2202.07082v2 Announce Type: replace Abstract: Recent years have witnessed fast developments of graph neural networks (GNNs) that have benefited myriads of graph analytic tasks and applications. In general, most GNNs depend on the homophily assumption that nodes belonging to the same class are more likely to be connected. However, as a ubiquitous graph property in numerous real-world scenarios, heterophily, i.e., nodes with different labels tend to be linked, significantly limits the performance of tailor-made homophilic GNNs. Hence, GNNs for heterophilic graphs are gaining increasing research attention to enhance graph learning with heterophily. In this paper, we provide a comprehensive review of GNNs for heterophilic graphs. Specifically, we propose a systematic taxonomy that essentially governs existing heterophilic GNN models, along with a general summary and detailed analysis. %Furthermore, we summarize the mainstream heterophilic graph benchmarks to facilitate robust and fair evaluations and discuss the correlation between graph heterophily and various graph research domains. Furthermore, we discuss the correlation between graph heterophily and various graph research domains, aiming to facilitate the development of more effective GNNs across a spectrum of practical applications and learning tasks in the graph research community. In the end, we point out the potential directions to advance and stimulate more future research and applications on heterophilic graph learning with GNNs.  ( 3 min )
    SparQ Attention: Bandwidth-Efficient LLM Inference
    arXiv:2312.04985v2 Announce Type: replace Abstract: Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.  ( 2 min )
    Credal Bayesian Deep Learning
    arXiv:2302.09656v4 Announce Type: replace Abstract: Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian Neural Networks (BNNs) allow for uncertainty in the predictions to be assessed, different sources of uncertainty are indistinguishable. We present Credal Bayesian Deep Learning (CBDL). Heuristically, CBDL allows to train an (uncountably) infinite ensemble of BNNs, using only finitely many elements. This is possible thanks to prior and likelihood finitely generated credal sets (FGCSs), a concept from the imprecise probability literature. Intuitively, convex combinations of a finite collection of prior-likelihood pairs are able to represent infinitely many such pairs. After training, CBDL outputs a set of posteriors on the parameters of the neural network. At inference time, such posterior set is used to derive a set of predictive distributions that is in turn utilized to distinguish between aleatoric and epistemic uncertainties, and to quantify them. The predictive set also produces either (i) a collection of outputs enjoying desirable probabilistic guarantees, or (ii) the single output that is deemed the best, that is, the one having the highest predictive lower probability -- another imprecise-probabilistic concept. CBDL is more robust than single BNNs to prior and likelihood misspecification, and to distribution shift. We show that CBDL is better at quantifying and disentangling different types of uncertainties than single BNNs, ensemble of BNNs, and Bayesian Model Averaging. In addition, we apply CBDL to two case studies to demonstrate its downstream tasks capabilities: one, for motion prediction in autonomous driving scenarios, and two, to model blood glucose and insulin dynamics for artificial pancreas control. We show that CBDL performs better when compared to an ensemble of BNNs baseline.  ( 3 min )
    LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities
    arXiv:2305.13168v2 Announce Type: replace-cross Abstract: This paper presents an exhaustive quantitative and qualitative evaluation of Large Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning. We engage in experiments across eight diverse datasets, focusing on four representative tasks encompassing entity and relation extraction, event extraction, link prediction, and question-answering, thereby thoroughly exploring LLMs' performance in the domain of construction and inference. Empirically, our findings suggest that LLMs, represented by GPT-4, are more suited as inference assistants rather than few-shot information extractors. Specifically, while GPT-4 exhibits good performance in tasks related to KG construction, it excels further in reasoning tasks, surpassing fine-tuned models in certain cases. Moreover, our investigation extends to the potential generalization ability of LLMs for information extraction, leading to the proposition of a Virtual Knowledge Extraction task and the development of the corresponding VINE dataset. Based on these empirical findings, we further propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning. We anticipate that this research can provide invaluable insights for future undertakings in the field of knowledge graphs. The code and datasets are in https://github.com/zjunlp/AutoKG.  ( 3 min )
    Transformers as Support Vector Machines
    arXiv:2308.16898v3 Announce Type: replace Abstract: Since its inception in "Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as softmax$(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer with vanishing regularization, parameterized by $(K,Q)$, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W=KQ^\top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm objective. We characterize this convergence, highlighting that it can occur toward locally-optimal directions rather than global ones. (2) Complementing this, we prove the local/global directional convergence of gradient descent under suitable geometric conditions. Importantly, we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points. (3) While our theory applies primarily to linear prediction heads, we propose a more general SVM equivalence that predicts the implicit bias with nonlinear heads. Our findings are applicable to arbitrary datasets and their validity is verified via experiments. We also introduce several open problems and research directions. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.  ( 3 min )
    Vision-Language Models as a Source of Rewards
    arXiv:2312.09187v2 Announce Type: replace Abstract: Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.  ( 2 min )
    DP-SGD Without Clipping: The Lipschitz Neural Network Way
    arXiv:2305.16202v2 Announce Type: replace Abstract: State-of-the-art approaches for training Differentially Private (DP) Deep Neural Networks (DNN) face difficulties to estimate tight bounds on the sensitivity of the network's layers, and instead rely on a process of per-sample gradient clipping. This clipping process not only biases the direction of gradients but also proves costly both in memory consumption and in computation. To provide sensitivity bounds and bypass the drawbacks of the clipping process, we propose to rely on Lipschitz constrained networks. Our theoretical analysis reveals an unexplored link between the Lipschitz constant with respect to their input and the one with respect to their parameters. By bounding the Lipschitz constant of each layer with respect to its parameters, we prove that we can train these networks with privacy guarantees. Our analysis not only allows the computation of the aforementioned sensitivities at scale, but also provides guidance on how to maximize the gradient-to-noise ratio for fixed privacy guarantees. The code has been released as a Python package available at https://github.com/Algue-Rythme/lip-dp  ( 2 min )
    Mobiprox: Supporting Dynamic Approximate Computing on Mobiles
    arXiv:2303.11291v2 Announce Type: replace Abstract: Runtime-tunable context-dependent network compression would make mobile deep learning (DL) adaptable to often varying resource availability, input "difficulty", or user needs. The existing compression techniques significantly reduce the memory, processing, and energy tax of DL, yet, the resulting models tend to be permanently impaired, sacrificing the inference power for reduced resource usage. The existing tunable compression approaches, on the other hand, require expensive re-training, do not support arbitrary strategies for adapting the compression and do not provide mobile-ready implementations. In this paper we present Mobiprox, a framework enabling mobile DL with flexible precision. Mobiprox implements tunable approximations of tensor operations and enables runtime-adaptable approximation of individual network layers. A profiler and a tuner included with Mobiprox identify the most promising neural network approximation configurations leading to the desired inference quality with the minimal use of resources. Furthermore, we develop control strategies that depending on contextual factors, such as the input data difficulty, dynamically adjust the approximation levels across a mobile DL model's layers. We implement Mobiprox in Android OS and through experiments in diverse mobile domains, including human activity recognition and spoken keyword detection, demonstrate that it can save up to 15% system-wide energy with a minimal impact on the inference accuracy.  ( 3 min )
    Scaling Efficient LLMs
    arXiv:2402.14746v1 Announce Type: cross Abstract: Trained LLMs are typically sparse in that most of the parameters are zero, raising questions on efficiency. In response, we inquire into efficient LLMs, i.e. those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, we compare theoretical and empirical estimates for training loss at current scale to obtain upper and lower bounds on the number of unique sequences in a natural training corpus as a function of its size. Our result implies (1) to double the number of skills represented in a training corpus, the corpus must scale roughly between three and five fold (2) for efficient LLMs, the number of parameters $N$ and the size $D$ of a natural training corpus scale as $N \sim D^{0.58}$ (3) if the number of parameters of an LLM is smaller than the number of unique sequences in the training corpus, scaling up can uncover emergent skills.  ( 2 min )
    Data structure > labels? Unsupervised heuristics for SVM hyperparameter estimation
    arXiv:2111.02164v2 Announce Type: replace Abstract: Classification is one of the main areas of pattern recognition research, and within it, Support Vector Machine (SVM) is one of the most popular methods outside of field of deep learning -- and a de-facto reference for many Machine Learning approaches. Its performance is determined by parameter selection, which is usually achieved by a time-consuming grid search cross-validation procedure (GSCV). That method, however relies on the availability and quality of labelled examples and thus, when those are limited can be hindered. To address that problem, there exist several unsupervised heuristics that take advantage of the characteristics of the dataset for selecting parameters instead of using class label information. While an order of magnitude faster, they are scarcely used under the assumption that their results are significantly worse than those of grid search. To challenge that assumption, we have proposed improved heuristics for SVM parameter selection and tested it against GSCV and state of the art heuristics on over 30 standard classification datasets. The results show not only its advantage over state-of-art heuristics but also that it is statistically no worse than GSCV.  ( 2 min )
    2D Matryoshka Sentence Embeddings
    arXiv:2402.14776v1 Announce Type: cross Abstract: Common approaches rely on fixed-length embedding vectors from language models as sentence embeddings for downstream tasks such as semantic textual similarity (STS). Such methods are limited in their flexibility due to unknown computational constraints and budgets across various applications. Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) encodes information at finer granularities, i.e., with lower embedding dimensions, to adaptively accommodate ad hoc tasks. Similar accuracy can be achieved with a smaller embedding size, leading to speedups in downstream tasks. Despite its improved efficiency, MRL still requires traversing all Transformer layers before obtaining the embedding, which remains the dominant factor in time and memory consumption. This prompts consideration of whether the fixed number of Transformer layers affects representation quality and whether using intermediate layers for sentence representation is feasible. In this paper, we introduce a novel sentence embedding model called Two-dimensional Matryoshka Sentence Embedding (2DMSE). It supports elastic settings for both embedding sizes and Transformer layers, offering greater flexibility and efficiency than MRL. We conduct extensive experiments on STS tasks and downstream applications. The experimental results demonstrate the effectiveness of our proposed model in dynamically supporting different embedding sizes and Transformer layers, allowing it to be highly adaptable to various scenarios.  ( 2 min )
    Attention-stacked Generative Adversarial Network (AS-GAN)-empowered Sensor Data Augmentation for Online Monitoring of Manufacturing System
    arXiv:2306.06268v2 Announce Type: replace Abstract: Machine learning (ML) has been extensively adopted for the online sensing-based monitoring in advanced manufacturing systems. However, the sensor data collected under abnormal states are usually insufficient, leading to significant data imbalanced issue for supervised machine learning. A common solution is to incorporate data augmentation techniques, i.e., augmenting the available abnormal states data (i.e., minority samples) via synthetic generation. To generate the high-quality minority samples, it is vital to learn the underlying distribution of the abnormal states data. In recent years, the generative adversarial network (GAN)-based approaches become popular to learn data distribution as well as perform data augmentation. However, in practice, the quality of generated samples from GAN-based data augmentation may vary drastically. In addition, the sensor signals are collected sequentially by time from the manufacturing systems, which means sequential information is also very important in data augmentation. To address these limitations, inspired by the multi-head attention mechanism, this paper proposed an attention-stacked GAN (AS-GAN) architecture for sensor data augmentation of online monitoring in manufacturing system. It incorporates a new attention-stacked framework to strengthen the generator in GAN with the capability of capturing sequential information, and thereby the developed attention-stacked framework greatly helps to improve the quality of the generated sensor signals. Afterwards, the generated high-quality sensor signals for abnormal states could be applied to train classifiers more accurately, further improving the online monitoring performance of manufacturing systems. The case study conducted in additive manufacturing also successfully validated the effectiveness of the proposed AS-GAN.  ( 3 min )
    Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning
    arXiv:2310.18285v3 Announce Type: replace Abstract: Vision Transformers (ViT) and Visual Prompt Tuning (VPT) achieve state-of-the-art performance with improved efficiency in various computer vision tasks. This suggests a promising paradigm shift of adapting pre-trained ViT models to Federated Learning (FL) settings. However, the challenge of data heterogeneity among FL clients presents a significant hurdle in effectively deploying ViT models. Existing Generalized FL (GFL) and Personalized FL (PFL) methods have limitations in balancing performance across both global and local data distributions. In this paper, we present a novel algorithm, SGPT, that integrates GFL and PFL approaches by employing a unique combination of both shared and group-specific prompts. This design enables SGPT to capture both common and group-specific features. A key feature of SGPT is its prompt selection module, which facilitates the training of a single global model capable of automatically adapting to diverse local client data distributions without the need for local fine-tuning. To effectively train the prompts, we utilize block coordinate descent (BCD), learning from common feature information (shared prompts), and then more specialized knowledge (group prompts) iteratively. Theoretically, we justify that learning the proposed prompts can reduce the gap between global and local performance. Empirically, we conduct experiments on both label and feature heterogeneity settings in comparison with state-of-the-art baselines, along with extensive ablation studies, to substantiate the superior performance of SGPT.  ( 3 min )
    Avoiding an AI-imposed Taylor's Version of all music history
    arXiv:2402.14589v1 Announce Type: cross Abstract: As future musical AIs adhere closely to human music, they may form their own attachments to particular human artists in their databases, and these biases may in the worst case lead to potential existential threats to all musical history. AI super fans may act to corrupt the historical record and extant recordings in favour of their own preferences, and preservation of the diversity of world music culture may become even more of a pressing issue than the imposition of 12 tone equal temperament or other Western homogenisations. We discuss the technical capability of AI cover software and produce Taylor's Versions of famous tracks from Western pop history as provocative examples; the quality of these productions does not affect the overall argument (which might even see a future AI try to impose the sound of paperclips onto all existing audio files, let alone Taylor Swift). We discuss some potential defenses against the danger of future musical monopolies, whilst analysing the feasibility of a maximal 'Taylor Swiftication' of the complete musical record.  ( 2 min )
    Dynamic Sparse Training with Structured Sparsity
    arXiv:2305.02299v4 Announce Type: replace Abstract: Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically less computationally expensive, achieving speedups with unstructured sparsity on real-world hardware is challenging. In this work, we propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity by imposing a constant fan-in constraint. Using our empirical analysis of existing DST methods at high sparsity, we additionally employ a neuron ablation method which enables SRigL to achieve state-of-the-art sparse-to-sparse structured DST performance on a variety of Neural Network (NN) architectures. Using a 90% sparse linear layer, we demonstrate a real-world acceleration of 3.4x/2.5x on CPU for online inference and 1.7x/13.0x on GPU for inference with a batch size of 256 when compared to equivalent dense/unstructured (CSR) sparse layers, respectively.  ( 2 min )
    MC-NN: An End-to-End Multi-Channel Neural Network Approach for Predicting Influenza A Virus Hosts and Antigenic Types
    arXiv:2306.05587v4 Announce Type: replace Abstract: Influenza poses a significant threat to public health, particularly among the elderly, young children, and people with underlying dis-eases. The manifestation of severe conditions, such as pneumonia, highlights the importance of preventing the spread of influenza. An accurate and cost-effective prediction of the host and antigenic sub-types of influenza A viruses is essential to addressing this issue, particularly in resource-constrained regions. In this study, we propose a multi-channel neural network model to predict the host and antigenic subtypes of influenza A viruses from hemagglutinin and neuraminidase protein sequences. Our model was trained on a comprehensive data set of complete protein sequences and evaluated on various test data sets of complete and incomplete sequences. The results demonstrate the potential and practicality of using multi-channel neural networks in predicting the host and antigenic subtypes of influenza A viruses from both full and partial protein sequences.  ( 3 min )
    Learning Style Identification Using Semi-Supervised Self-Taught Labeling
    arXiv:2402.14597v1 Announce Type: cross Abstract: Education is a dynamic field that must be adaptable to sudden changes and disruptions caused by events like pandemics, war, and natural disasters related to climate change. When these events occur, traditional classrooms with traditional or blended delivery can shift to fully online learning, which requires an efficient learning environment that meets students' needs. While learning management systems support teachers' productivity and creativity, they typically provide the same content to all learners in a course, ignoring their unique learning styles. To address this issue, we propose a semi-supervised machine learning approach that detects students' learning styles using a data mining technique. We use the commonly used Felder Silverman learning style model and demonstrate that our semi-supervised method can produce reliable classification models with few labeled data. We evaluate our approach on two different courses and achieve an accuracy of 88.83% and 77.35%, respectively. Our work shows that educational data mining and semi-supervised machine learning techniques can identify different learning styles and create a personalized learning environment.  ( 2 min )
    Model-Based Reinforcement Learning Control of Reaction-Diffusion Problems
    arXiv:2402.14446v1 Announce Type: cross Abstract: Mathematical and computational tools have proven to be reliable in decision-making processes. In recent times, in particular, machine learning-based methods are becoming increasingly popular as advanced support tools. When dealing with control problems, reinforcement learning has been applied to decision-making in several applications, most notably in games. The success of these methods in finding solutions to complex problems motivates the exploration of new areas where they can be employed to overcome current difficulties. In this paper, we explore the use of automatic control strategies to initial boundary value problems in thermal and disease transport. Specifically, in this work, we adapt an existing reinforcement learning algorithm using a stochastic policy gradient method and we introduce two novel reward functions to drive the flow of the transported field. The new model-based framework exploits the interactions between a reaction-diffusion model and the modified agent. The results show that certain controls can be implemented successfully in these applications, although model simplifications had to be assumed.  ( 2 min )
    A Survey on Fairness for Machine Learning on Graphs
    arXiv:2205.05396v2 Announce Type: replace Abstract: Nowadays, the analysis of complex phenomena modeled by graphs plays a crucial role in many real-world application domains where decisions can have a strong societal impact. However, numerous studies and papers have recently revealed that machine learning models could lead to potential disparate treatment between individuals and unfair outcomes. In that context, algorithmic contributions for graph mining are not spared by the problem of fairness and present some specific challenges related to the intrinsic nature of graphs: (1) graph data is non-IID, and this assumption may invalidate many existing studies in fair machine learning, (2) suited metric definitions to assess the different types of fairness with relational data and (3) algorithmic challenge on the difficulty of finding a good trade-off between model accuracy and fairness. This survey is the first one dedicated to fairness for relational data. It aims to present a comprehensive review of state-of-the-art techniques in fairness on graph mining and identify the open challenges and future trends. In particular, we start by presenting several sensible application domains and the associated graph mining tasks with a focus on edge prediction and node classification in the sequel. We also recall the different metrics proposed to evaluate potential bias at different levels of the graph mining process; then we provide a comprehensive overview of recent contributions in the domain of fair machine learning for graphs, that we classify into pre-processing, in-processing and post-processing models. We also propose to describe existing graph data, synthetic and real-world benchmarks. Finally, we present in detail five potential promising directions to advance research in studying algorithmic fairness on graphs.  ( 3 min )
    A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health
    arXiv:2402.14807v1 Announce Type: cross Abstract: Efforts to reduce maternal mortality rate, a key UN Sustainable Development target (SDG Target 3.1), rely largely on preventative care programs to spread critical health information to high-risk populations. These programs face two important challenges: efficiently allocating limited health resources to large beneficiary populations, and adapting to evolving policy priorities. While prior works in restless multi-armed bandit (RMAB) demonstrated success in public health allocation tasks, they lack flexibility to adapt to evolving policy priorities. Concurrently, Large Language Models (LLMs) have emerged as adept, automated planners in various domains, including robotic control and navigation. In this paper, we propose DLM: a Decision Language Model for RMABs. To enable dynamic fine-tuning of RMAB policies for challenging public health settings using human-language commands, we propose using LLMs as automated planners to (1) interpret human policy preference prompts, (2) propose code reward functions for a multi-agent RL environment for RMABs, and (3) iterate on the generated reward using feedback from RMAB simulations to effectively adapt policy outcomes. In collaboration with ARMMAN, an India-based public health organization promoting preventative care for pregnant mothers, we conduct a simulation study, showing DLM can dynamically shape policy outcomes using only human language commands as input.  ( 2 min )
    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
    arXiv:2402.14804v1 Announce Type: cross Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development. The project is available at https://mathvision-cuhk.github.io  ( 2 min )
    Architectures of Topological Deep Learning: A Survey of Message-Passing Topological Neural Networks
    arXiv:2304.10031v3 Announce Type: replace Abstract: The natural world is full of complex systems characterized by intricate relations between their components: from social interactions between individuals in a social network to electrostatic interactions between atoms in a protein. Topological Deep Learning (TDL) provides a comprehensive framework to process and extract knowledge from data associated with these systems, such as predicting the social community to which an individual belongs or predicting whether a protein can be a reasonable target for drug development. TDL has demonstrated theoretical and practical advantages that hold the promise of breaking ground in the applied sciences and beyond. However, the rapid growth of the TDL literature for relational systems has also led to a lack of unification in notation and language across message-passing Topological Neural Network (TNN) architectures. This presents a real obstacle for building upon existing works and for deploying message-passing TNNs to new real-world problems. To address this issue, we provide an accessible introduction to TDL for relational systems, and compare the recently published message-passing TNNs using a unified mathematical and graphical notation. Through an intuitive and critical review of the emerging field of TDL, we extract valuable insights into current challenges and exciting opportunities for future development.  ( 3 min )
    StreaMulT: Streaming Multimodal Transformer for Heterogeneous and Arbitrary Long Sequential Data
    arXiv:2110.08021v2 Announce Type: replace Abstract: The increasing complexity of Industry 4.0 systems brings new challenges regarding predictive maintenance tasks such as fault detection and diagnosis. A corresponding and realistic setting includes multi-source data streams from different modalities, such as sensors measurements time series, machine images, textual maintenance reports, etc. These heterogeneous multimodal streams also differ in their acquisition frequency, may embed temporally unaligned information and can be arbitrarily long, depending on the considered system and task. Whereas multimodal fusion has been largely studied in a static setting, to the best of our knowledge, there exists no previous work considering arbitrarily long multimodal streams alongside with related tasks such as prediction across time. Thus, in this paper, we first formalize this paradigm of heterogeneous multimodal learning in a streaming setting as a new one. To tackle this challenge, we propose StreaMulT, a Streaming Multimodal Transformer relying on cross-modal attention and on a memory bank to process arbitrarily long input sequences at training time and run in a streaming way at inference. StreaMulT improves the state-of-the-art metrics on CMU-MOSEI dataset for Multimodal Sentiment Analysis task, while being able to deal with much longer inputs than other multimodal models. The conducted experiments eventually highlight the importance of the textual embedding layer, questioning recent improvements in Multimodal Sentiment Analysis benchmarks.  ( 3 min )
    Brain-inspired Distributed Memorization Learning for Efficient Feature-free Unsupervised Domain Adaptation
    arXiv:2402.14598v1 Announce Type: cross Abstract: Compared with gradient based artificial neural networks, biological neural networks usually show a more powerful generalization ability to quickly adapt to unknown environments without using any gradient back-propagation procedure. Inspired by the distributed memory mechanism of human brains, we propose a novel gradient-free Distributed Memorization Learning mechanism, namely DML, to support quick domain adaptation of transferred models. In particular, DML adopts randomly connected neurons to memorize the association of input signals, which are propagated as impulses, and makes the final decision by associating the distributed memories based on their confidence. More importantly, DML is able to perform reinforced memorization based on unlabeled data to quickly adapt to a new domain without heavy fine-tuning of deep features, which makes it very suitable for deploying on edge devices. Experiments based on four cross-domain real-world datasets show that DML can achieve superior performance of real-time domain adaptation compared with traditional gradient based MLP with more than 10% improvement of accuracy while reducing 87% of the timing cost of optimization.  ( 2 min )
    Diffusion Model Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality Assessment
    arXiv:2402.14401v1 Announce Type: cross Abstract: Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA) methods still suffer from finding a balance between learning feature information at the pixel level of the image and capturing high-level feature information and the efficient utilization of the obtained high-level feature information remains a challenge. As a novel class of state-of-the-art (SOTA) generative model, the diffusion model exhibits the capability to model intricate relationships, enabling a comprehensive understanding of images and possessing a better learning of both high-level and low-level visual features. In view of these, we pioneer the exploration of the diffusion model into the domain of NR-IQA. Firstly, we devise a new diffusion restoration network that leverages the produced enhanced image and noise-containing images, incorporating nonlinear features obtained during the denoising process of the diffusion model, as high-level visual information. Secondly, two visual evaluation branches are designed to comprehensively analyze the obtained high-level feature information. These include the visual compensation guidance branch, grounded in the transformer architecture and noise embedding strategy, and the visual difference analysis branch, built on the ResNet architecture and the residual transposed attention block. Extensive experiments are conducted on seven public NR-IQA datasets, and the results demonstrate that the proposed model outperforms SOTA methods for NR-IQA.  ( 3 min )
    Demographic Bias of Expert-Level Vision-Language Foundation Models in Medical Imaging
    arXiv:2402.14815v1 Announce Type: cross Abstract: Advances in artificial intelligence (AI) have achieved expert-level performance in medical imaging applications. Notably, self-supervised vision-language foundation models can detect a broad spectrum of pathologies without relying on explicit training annotations. However, it is crucial to ensure that these AI models do not mirror or amplify human biases, thereby disadvantaging historically marginalized groups such as females or Black patients. The manifestation of such biases could systematically delay essential medical care for certain patient subgroups. In this study, we investigate the algorithmic fairness of state-of-the-art vision-language foundation models in chest X-ray diagnosis across five globally-sourced datasets. Our findings reveal that compared to board-certified radiologists, these foundation models consistently underdiagnose marginalized groups, with even higher rates seen in intersectional subgroups, such as Black female patients. Such demographic biases present over a wide range of pathologies and demographic attributes. Further analysis of the model embedding uncovers its significant encoding of demographic information. Deploying AI systems with these biases in medical imaging can intensify pre-existing care disparities, posing potential challenges to equitable healthcare access and raising ethical questions about their clinical application.  ( 2 min )
    Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
    arXiv:2402.14811v1 Announce Type: cross Abstract: Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics have substantial performance gains. We identify the mechanism that enables entity tracking and show that (i) in both the original model and its fine-tuned versions primarily the same circuit implements entity tracking. In fact, the entity tracking circuit of the original model on the fine-tuned versions performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality: Entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned versions. (iii) Performance boost in the fine-tuned models is primarily attributed to its improved ability to handle the augmented positional information. To uncover these findings, we employ: Patch Patching, DCM, which automatically detects model components responsible for specific semantics, and CMAP, a new approach for patching activations across models to reveal improved mechanisms. Our findings suggest that fine-tuning enhances, rather than fundamentally alters, the mechanistic operation of the model.  ( 3 min )
    Cameras as Rays: Pose Estimation via Ray Diffusion
    arXiv:2402.14817v1 Announce Type: cross Abstract: Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparse views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.  ( 2 min )
    CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
    arXiv:2402.14809v1 Announce Type: cross Abstract: The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.  ( 2 min )
    A Quick Introduction to Quantum Machine Learning for Non-Practitioners
    arXiv:2402.14694v1 Announce Type: cross Abstract: This paper provides an introduction to quantum machine learning, exploring the potential benefits of using quantum computing principles and algorithms that may improve upon classical machine learning approaches. Quantum computing utilizes particles governed by quantum mechanics for computational purposes, leveraging properties like superposition and entanglement for information representation and manipulation. Quantum machine learning applies these principles to enhance classical machine learning models, potentially reducing network size and training time on quantum hardware. The paper covers basic quantum mechanics principles, including superposition, phase space, and entanglement, and introduces the concept of quantum gates that exploit these properties. It also reviews classical deep learning concepts, such as artificial neural networks, gradient descent, and backpropagation, before delving into trainable quantum circuits as neural networks. An example problem demonstrates the potential advantages of quantum neural networks, and the appendices provide detailed derivations. The paper aims to help researchers new to quantum mechanics and machine learning develop their expertise more efficiently.  ( 2 min )
    Modeling 3D Infant Kinetics Using Adaptive Graph Convolutional Networks
    arXiv:2402.14400v1 Announce Type: cross Abstract: Reliable methods for the neurodevelopmental assessment of infants are essential for early detection of medical issues that may need prompt interventions. Spontaneous motor activity, or `kinetics', is shown to provide a powerful surrogate measure of upcoming neurodevelopment. However, its assessment is by and large qualitative and subjective, focusing on visually identified, age-specific gestures. Here, we follow an alternative approach, predicting infants' neurodevelopmental maturation based on data-driven evaluation of individual motor patterns. We utilize 3D video recordings of infants processed with pose-estimation to extract spatio-temporal series of anatomical landmarks, and apply adaptive graph convolutional networks to predict the actual age. We show that our data-driven approach achieves improvement over traditional machine learning baselines based on manually engineered features.  ( 2 min )
    T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching
    arXiv:2402.14167v1 Announce Type: cross Abstract: Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality image generation and typically requires many steps with a large model. In this paper, we introduce sampling Trajectory Stitching T-Stitch, a simple yet efficient technique to improve the sampling efficiency with little or no generation degradation. Instead of solely using a large DPM for the entire sampling trajectory, T-Stitch first leverages a smaller DPM in the initial steps as a cheap drop-in replacement of the larger DPM and switches to the larger DPM at a later stage. Our key insight is that different diffusion models learn similar encodings under the same training data distribution and smaller models are capable of generating good global structures in the early steps. Extensive experiments demonstrate that T-Stitch is training-free, generally applicable for different architectures, and complements most existing fast sampling techniques with flexible speed and quality trade-offs. On DiT-XL, for example, 40% of the early timesteps can be safely replaced with a 10x faster DiT-S without performance drop on class-conditional ImageNet generation. We further show that our method can also be used as a drop-in technique to not only accelerate the popular pretrained stable diffusion (SD) models but also improve the prompt alignment of stylized SD models from the public model zoo. Code is released at https://github.com/NVlabs/T-Stitch  ( 2 min )
    Adaptive time series forecasting with markovian variance switching
    arXiv:2402.14684v1 Announce Type: cross Abstract: Adaptive time series forecasting is essential for prediction under regime changes. Several classical methods assume linear Gaussian state space model (LGSSM) with variances constant in time. However, there are many real-world processes that cannot be captured by such models. We consider a state-space model with Markov switching variances. Such dynamical systems are usually intractable because of their computational complexity increasing exponentially with time; Variational Bayes (VB) techniques have been applied to this problem. In this paper, we propose a new way of estimating variances based on online learning theory; we adapt expert aggregation methods to learn the variances over time. We apply the proposed method to synthetic data and to the problem of electricity load forecasting. We show that this method is robust to misspecification and outperforms traditional expert aggregation.  ( 2 min )
    Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
    arXiv:2402.14800v1 Announce Type: cross Abstract: A pivotal advancement in the progress of large language models (LLMs) is the emergence of the Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes. Different from previous weight pruning methods that rely on specifically designed hardware, this paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to our best knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and increase the inference speed, while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.  ( 2 min )
    Bringing Generative AI to Adaptive Learning in Education
    arXiv:2402.14601v1 Announce Type: cross Abstract: The recent surge in generative AI technologies, such as large language models and diffusion models, have boosted the development of AI applications in various domains, including science, finance, and education. Concurrently, adaptive learning, a concept that has gained substantial interest in the educational sphere, has proven its efficacy in enhancing students' learning efficiency. In this position paper, we aim to shed light on the intersectional studies of these two methods, which combine generative AI with adaptive learning concepts. By presenting discussions about the benefits, challenges, and potentials in this field, we argue that this union will contribute significantly to the development of the next stage learning format in education.  ( 2 min )
    PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model
    arXiv:2402.14692v1 Announce Type: cross Abstract: This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary conditioning signals. Recently, DDPM-based neural vocoders have gained prominence as non-autoregressive models that can generate high-quality waveforms. The neural vocoders based on DDPM have the advantage of training with a simple time-domain loss. In practical applications, such as singing voice synthesis, there is a demand for neural vocoders to generate high-fidelity speech waveforms with flexible pitch control. However, conventional DDPM-based neural vocoders struggle to generate speech waveforms under such conditions. Our proposed model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals. Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.  ( 2 min )
    AURA: Natural Language Reasoning for Aleatoric Uncertainty in Rationales
    arXiv:2402.14337v1 Announce Type: cross Abstract: Rationales behind answers not only explain model decisions but boost language models to reason well on complex reasoning tasks. However, obtaining impeccable rationales is often impossible. Besides, it is non-trivial to estimate the degree to which the rationales are faithful enough to encourage model performance. Thus, such reasoning tasks often compel models to output correct answers under undesirable rationales and are sub-optimal compared to what the models are fully capable of. In this work, we propose how to deal with imperfect rationales causing aleatoric uncertainty. We first define the ambiguous rationales with entropy scores of given rationales, using model prior beliefs as informativeness. We then guide models to select one of two different reasoning models according to the ambiguity of rationales. We empirically argue that our proposed method produces robust performance superiority against the adversarial quality of rationales and low-resource settings.  ( 2 min )
    IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
    arXiv:2402.14710v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). Note that high-quality instruction data is the vital key for enhancing the specific capabilities of LLMs, while current IE datasets tend to be small in scale, fragmented, and lack standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimental results on LLaMA and Baichuan demonstrate that using IEPile can enhance the performance of LLMs for IE, especially the zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.  ( 2 min )
    Balanced Resonate-and-Fire Neurons
    arXiv:2402.14603v1 Announce Type: cross Abstract: The resonate-and-fire (RF) neuron, introduced over two decades ago, is a simple, efficient, yet biologically plausible spiking neuron model, which can extract frequency patterns within the time domain due to its resonating membrane dynamics. However, previous RF formulations suffer from intrinsic shortcomings that limit effective learning and prevent exploiting the principled advantage of RF neurons. Here, we introduce the balanced RF (BRF) neuron, which alleviates some of the intrinsic limitations of vanilla RF neurons and demonstrates its effectiveness within recurrent spiking neural networks (RSNNs) on various sequence learning tasks. We show that networks of BRF neurons achieve overall higher task performance, produce only a fraction of the spikes, and require significantly fewer parameters as compared to modern RSNNs. Moreover, BRF-RSNN consistently provide much faster and more stable training convergence, even when bridging many hundreds of time steps during backpropagation through time (BPTT). These results underscore that our BRF-RSNN is a strong candidate for future large-scale RSNN architectures, further lines of research in SNN methodology, and more efficient hardware implementations.  ( 2 min )
    CLCE: An Approach to Refining Cross-Entropy and Contrastive Learning for Optimized Learning Fusion
    arXiv:2402.14551v1 Announce Type: cross Abstract: State-of-the-art pre-trained image models predominantly adopt a two-stage approach: initial unsupervised pre-training on large-scale datasets followed by task-specific fine-tuning using Cross-Entropy loss~(CE). However, it has been demonstrated that CE can compromise model generalization and stability. While recent works employing contrastive learning address some of these limitations by enhancing the quality of embeddings and producing better decision boundaries, they often overlook the importance of hard negative mining and rely on resource intensive and slow training using large sample batches. To counter these issues, we introduce a novel approach named CLCE, which integrates Label-Aware Contrastive Learning with CE. Our approach not only maintains the strengths of both loss functions but also leverages hard negative mining in a synergistic way to enhance performance. Experimental results demonstrate that CLCE significantly outperforms CE in Top-1 accuracy across twelve benchmarks, achieving gains of up to 3.52% in few-shot learning scenarios and 3.41% in transfer learning settings with the BEiT-3 model. Importantly, our proposed CLCE approach effectively mitigates the dependency of contrastive learning on large batch sizes such as 4096 samples per batch, a limitation that has previously constrained the application of contrastive learning in budget-limited hardware environments.  ( 3 min )
    Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion
    arXiv:2402.14285v1 Announce Type: cross Abstract: We study the problem of symbolic music generation (e.g., generating piano rolls), with a technical focus on non-differentiable rule guidance. Musical rules are often expressed in symbolic form on note characteristics, such as note density or chord progression, many of which are non-differentiable which pose a challenge when using them for guided diffusion. We propose Stochastic Control Guidance (SCG), a novel guidance method that only requires forward evaluation of rule functions that can work with pre-trained diffusion models in a plug-and-play way, thus achieving training-free guidance for non-differentiable rules for the first time. Additionally, we introduce a latent diffusion architecture for symbolic music generation with high time resolution, which can be composed with SCG in a plug-and-play fashion. Compared to standard strong baselines in symbolic music generation, this framework demonstrates marked advancements in music quality and rule-based controllability, outperforming current state-of-the-art generators in a variety of settings. For detailed demonstrations, please visit our project site: https://scg-rule-guided-music.github.io/.  ( 2 min )
    Text Role Classification in Scientific Charts Using Multimodal Transformers
    arXiv:2402.14579v1 Announce Type: cross Abstract: Text role classification involves classifying the semantic role of textual elements within scientific charts. For this task, we propose to finetune two pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on chart datasets. The transformers utilize the three modalities of text, image, and layout as input. We further investigate whether data augmentation and balancing methods help the performance of the models. The models are evaluated on various chart datasets, and results show that LayoutLMv3 outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the ICPR22 test dataset, beating the best-performing model from the ICPR22 CHART-Infographics challenge. Moreover, the robustness of the models is tested on a synthetic noisy dataset ICPR22-N. Finally, the generalizability of the models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz, for which we added labels for the text roles. Findings indicate that even in cases where there is limited training data, transformers can be used with the help of data augmentation and balancing methods. The source code and datasets are available on GitHub under https://github.com/hjkimk/text-role-classification  ( 2 min )
    Edge Caching Based on Deep Reinforcement Learning and Transfer Learning
    arXiv:2402.14576v1 Announce Type: cross Abstract: This paper addresses the escalating challenge of redundant data transmission in networks. The surge in traffic has strained backhaul links and backbone networks, prompting the exploration of caching solutions at the edge router. Existing work primarily relies on Markov Decision Processes (MDP) for caching issues, assuming fixed-time interval decisions; however, real-world scenarios involve random request arrivals, and despite the critical role of various file characteristics in determining an optimal caching policy, none of the related existing work considers all these file characteristics in forming a caching policy. In this paper, first, we formulate the caching problem using a semi-Markov Decision Process (SMDP) to accommodate the continuous-time nature of real-world scenarios allowing for caching decisions at random times upon file requests. Then, we propose a double deep Q-learning-based caching approach that comprehensively accounts for file features such as lifetime, size, and importance. Simulation results demonstrate the superior performance of our approach compared to a recent Deep Reinforcement Learning-based method. Furthermore, we extend our work to include a Transfer Learning (TL) approach to account for changes in file request rates in the SMDP framework. The proposed TL approach exhibits fast convergence, even in scenarios with increased differences in request rates between source and target domains, presenting a promising solution to the dynamic challenges of caching in real-world environments.  ( 2 min )
    Scaling Up LLM Reviews for Google Ads Content Moderation
    arXiv:2402.14590v1 Announce Type: cross Abstract: Large language models (LLMs) are powerful tools for content moderation, but their inference costs and latency make them prohibitive for casual use on large datasets, such as the Google Ads repository. This study proposes a method for scaling up LLM reviews for content moderation in Google Ads. First, we use heuristics to select candidates via filtering and duplicate removal, and create clusters of ads for which we select one representative ad per cluster. We then use LLMs to review only the representative ads. Finally, we propagate the LLM decisions for the representative ads back to their clusters. This method reduces the number of reviews by more than 3 orders of magnitude while achieving a 2x recall compared to a baseline non-LLM model. The success of this approach is a strong function of the representations used in clustering and label propagation; we found that cross-modal similarity representations yield better results than uni-modal representations.  ( 2 min )
    Sample-Efficient Linear Regression with Self-Selection Bias
    arXiv:2402.14229v1 Announce Type: cross Abstract: We consider the problem of linear regression with self-selection bias in the unknown-index setting, as introduced in recent work by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [STOC 2023]. In this model, one observes $m$ i.i.d. samples $(\mathbf{x}_{\ell},z_{\ell})_{\ell=1}^m$ where $z_{\ell}=\max_{i\in [k]}\{\mathbf{x}_{\ell}^T\mathbf{w}_i+\eta_{i,\ell}\}$, but the maximizing index $i_{\ell}$ is unobserved. Here, the $\mathbf{x}_{\ell}$ are assumed to be $\mathcal{N}(0,I_n)$ and the noise distribution $\mathbf{\eta}_{\ell}\sim \mathcal{D}$ is centered and independent of $\mathbf{x}_{\ell}$. We provide a novel and near optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1,\ldots,\mathbf{w}_k\in \mathbb{R}^n$ up to additive $\ell_2$-error $\varepsilon$ with polynomial sample complexity $\tilde{O}(n)\cdot \mathsf{poly}(k,1/\varepsilon)$ and significantly improved time complexity $\mathsf{poly}(n,k,1/\varepsilon)+O(\log(k)/\varepsilon)^{O(k)}$. When $k=O(1)$, our algorithm runs in $\mathsf{poly}(n,1/\varepsilon)$ time, generalizing the polynomial guarantee of an explicit moment matching algorithm of Cherapanamjeri, et al. for $k=2$ and when it is known that $\mathcal{D}=\mathcal{N}(0,I_k)$. Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression where the added noise is taken outside the maximum. For this problem, our algorithm is efficient in a much larger range of $k$ than the state-of-the-art due to Ghosh, Pananjady, Guntuboyina, and Ramchandran [IEEE Trans. Inf. Theory 2022] for not too small $\varepsilon$, and leads to improved algorithms for any $\varepsilon$ by providing a warm start for existing local convergence methods.  ( 2 min )
    Enhancement of High-definition Map Update Service Through Coverage-aware and Reinforcement Learning
    arXiv:2402.14582v1 Announce Type: cross Abstract: High-definition (HD) Map systems will play a pivotal role in advancing autonomous driving to a higher level, thanks to the significant improvement over traditional two-dimensional (2D) maps. Creating an HD Map requires a huge amount of on-road and off-road data. Typically, these raw datasets are collected and uploaded to cloud-based HD map service providers through vehicular networks. Nevertheless, there are challenges in transmitting the raw data over vehicular wireless channels due to the dynamic topology. As the number of vehicles increases, there is a detrimental impact on service quality, which acts as a barrier to a real-time HD Map system for collaborative driving in Autonomous Vehicles (AV). In this paper, to overcome network congestion, a Q-learning coverage-time-awareness algorithm is presented to optimize the quality of service for vehicular networks and HD map updates. The algorithm is evaluated in an environment that imitates a dynamic scenario where vehicles enter and leave. Results showed an improvement in latency for HD map data of $75\%$, $73\%$, and $10\%$ compared with IEEE802.11p without Quality of Service (QoS), IEEE802.11 with QoS, and IEEE802.11p with new access category (AC) for HD map, respectively.  ( 2 min )
    Kinematically Constrained Human-like Bimanual Robot-to-Human Handovers
    arXiv:2402.14525v1 Announce Type: cross Abstract: Bimanual handovers are crucial for transferring large, deformable or delicate objects. This paper proposes a framework for generating kinematically constrained human-like bimanual robot motions to ensure seamless and natural robot-to-human object handovers. We use a Hidden Semi-Markov Model (HSMM) to reactively generate suitable response trajectories for a robot based on the observed human partner's motion. The trajectories are adapted with task space constraints to ensure accurate handovers. Results from a pilot study show that our approach is perceived as more human--like compared to a baseline Inverse Kinematics approach.  ( 2 min )
    Towards Unified Task Embeddings Across Multiple Models: Bridging the Gap for Prompt-Based Large Language Models and Beyond
    arXiv:2402.14522v1 Announce Type: cross Abstract: Task embedding, a meta-learning technique that captures task-specific information, has become prevalent, especially in areas such as multi-task learning, model editing, and interpretability. However, it faces challenges with the emergence of prompt-guided Large Language Models (LLMs) operating in a gradientfree manner. Existing task embedding methods rely on fine-tuned, task-specific language models, which hinders the adaptability of task embeddings across diverse models, especially prompt-based LLMs. To unleash the power of task embedding in the era of LLMs, we propose a framework for unified task embeddings (FUTE), harmonizing task embeddings from various models, including smaller language models and LLMs with varied prompts, within a single vector space. Such uniformity enables the comparison and analysis of similarities amongst different models, extending the scope and utility of existing task embedding methods in addressing multi-model scenarios, whilst maintaining their performance to be comparable to architecture-specific methods.  ( 2 min )
    Structure-Based Drug Design via 3D Molecular Generative Pre-training and Sampling
    arXiv:2402.14315v1 Announce Type: cross Abstract: Structure-based drug design aims at generating high affinity ligands with prior knowledge of 3D target structures. Existing methods either use conditional generative model to learn the distribution of 3D ligands given target binding sites, or iteratively modify molecules to optimize a structure-based activity estimator. The former is highly constrained by data quantity and quality, which leaves optimization-based approaches more promising in practical scenario. However, existing optimization-based approaches choose to edit molecules in 2D space, and use molecular docking to estimate the activity using docking predicted 3D target-ligand complexes. The misalignment between the action space and the objective hinders the performance of these models, especially for those employ deep learning for acceleration. In this work, we propose MolEdit3D to combine 3D molecular generation with optimization frameworks. We develop a novel 3D graph editing model to generate molecules using fragments, and pre-train this model on abundant 3D ligands for learning target-independent properties. Then we employ a target-guided self-learning strategy to improve target-related properties using self-sampled molecules. MolEdit3D achieves state-of-the-art performance on majority of the evaluation metrics, and demonstrate strong capability of capturing both target-dependent and -independent properties.  ( 2 min )
    Machine Learning Reveals Large-scale Impact of Posidonia Oceanica on Mediterranean Sea Water
    arXiv:2402.14459v1 Announce Type: cross Abstract: Posidonia oceanica is a protected endemic seagrass of Mediterranean sea that fosters biodiversity, stores carbon, releases oxygen, and provides habitat to numerous sea organisms. Leveraging augmented research, we collected a comprehensive dataset of 174 features compiled from diverse data sources. Through machine learning analysis, we discovered the existence of a robust correlation between the exact location of P. oceanica and water biogeochemical properties. The model's feature importance, showed that carbon-related variables as net biomass production and downward surface mass flux of carbon dioxide have their values altered in the areas with P. oceanica, which in turn can be used for indirect location of P. oceanica meadows. The study provides the evidence of the plant's ability to exert a global impact on the environment and underscores the crucial role of this plant in sea ecosystems, emphasizing the need for its conservation and management.  ( 2 min )
    BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives
    arXiv:2402.14151v1 Announce Type: cross Abstract: We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO). BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating large language model (LLM)-based information retrieval systems. We present a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model which matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.  ( 2 min )
    GenSERP: Large Language Models for Whole Page Presentation
    arXiv:2402.14301v1 Announce Type: cross Abstract: The advent of large language models (LLMs) brings an opportunity to minimize the effort in search engine result page (SERP) organization. In this paper, we propose GenSERP, a framework that leverages LLMs with vision in a few-shot setting to dynamically organize intermediate search results, including generated chat answers, website snippets, multimedia data, knowledge panels into a coherent SERP layout based on a user's query. Our approach has three main stages: (1) An information gathering phase where the LLM continuously orchestrates API tools to retrieve different types of items, and proposes candidate layouts based on the retrieved items, until it's confident enough to generate the final result. (2) An answer generation phase where the LLM populates the layouts with the retrieved content. In this phase, the LLM adaptively optimize the ranking of items and UX configurations of the SERP. Consequently, it assigns a location on the page to each item, along with the UX display details. (3) A scoring phase where an LLM with vision scores all the generated SERPs based on how likely it can satisfy the user. It then send the one with highest score to rendering. GenSERP features two generation paradigms. First, coarse-to-fine, which allow it to approach optimal layout in a more manageable way, (2) beam search, which give it a better chance to hit the optimal solution compared to greedy decoding. Offline experimental results on real-world data demonstrate how LLMs can contextually organize heterogeneous search results on-the-fly and provide a promising user experience.  ( 3 min )
    A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams
    arXiv:2402.14489v1 Announce Type: cross Abstract: Persistence diagrams (PD)s play a central role in topological data analysis, and are used in an ever increasing variety of applications. The comparison of PD data requires computing comparison metrics among large sets of PDs, with metrics which are accurate, theoretically sound, and fast to compute. Especially for denser multi-dimensional PDs, such comparison metrics are lacking. While on the one hand, Wasserstein-type distances have high accuracy and theoretical guarantees, they incur high computational cost. On the other hand, distances between vectorizations such as Persistence Statistics (PS)s have lower computational cost, but lack the accuracy guarantees and in general they are not guaranteed to distinguish PDs (i.e. the two PS vectors of different PDs may be equal). In this work we introduce a class of pseudodistances called Extended Topological Pseudodistances (ETD)s, which have tunable complexity, and can approximate Sliced and classical Wasserstein distances at the high-complexity extreme, while being computationally lighter and close to Persistence Statistics at the lower complexity extreme, and thus allow users to interpolate between the two metrics. We build theoretical comparisons to show how to fit our new distances at an intermediate level between persistence vectorizations and Wasserstein distances. We also experimentally verify that ETDs outperform PSs in terms of accuracy and outperform Wasserstein and Sliced Wasserstein distances in terms of computational complexity.  ( 3 min )
    Are Bounded Contracts Learnable and Approximately Optimal?
    arXiv:2402.14486v1 Announce Type: cross Abstract: This paper considers the hidden-action model of the principal-agent problem, in which a principal incentivizes an agent to work on a project using a contract. We investigate whether contracts with bounded payments are learnable and approximately optimal. Our main results are two learning algorithms that can find a nearly optimal bounded contract using a polynomial number of queries, under two standard assumptions in the literature: a costlier action for the agent leads to a better outcome distribution for the principal, and the agent's cost/effort has diminishing returns. Our polynomial query complexity upper bound shows that standard assumptions are sufficient for achieving an exponential improvement upon the known lower bound for general instances. Unlike the existing algorithms, which relied on discretizing the contract space, our algorithms directly learn the underlying outcome distributions. As for the approximate optimality of bounded contracts, we find that they could be far from optimal in terms of multiplicative or additive approximation, but satisfy a notion of mixed approximation.  ( 2 min )
    Closed-Form Bounds for DP-SGD against Record-level Inference
    arXiv:2402.14397v1 Announce Type: cross Abstract: Machine learning models trained with differentially-private (DP) algorithms such as DP-SGD enjoy resilience against a wide range of privacy attacks. Although it is possible to derive bounds for some attacks based solely on an $(\varepsilon,\delta)$-DP guarantee, meaningful bounds require a small enough privacy budget (i.e., injecting a large amount of noise), which results in a large loss in utility. This paper presents a new approach to evaluate the privacy of machine learning models against specific record-level threats, such as membership and attribute inference, without the indirection through DP. We focus on the popular DP-SGD algorithm, and derive simple closed-form bounds. Our proofs model DP-SGD as an information theoretic channel whose inputs are the secrets that an attacker wants to infer (e.g., membership of a data record) and whose outputs are the intermediate model parameters produced by iterative optimization. We obtain bounds for membership inference that match state-of-the-art techniques, whilst being orders of magnitude faster to compute. Additionally, we present a novel data-dependent bound against attribute inference. Our results provide a direct, interpretable, and practical way to evaluate the privacy of trained models against specific inference threats without sacrificing utility.  ( 2 min )
    Parallelized Midpoint Randomization for Langevin Monte Carlo
    arXiv:2402.14434v1 Announce Type: cross Abstract: We explore the sampling problem within the framework where parallel evaluations of the gradient of the log-density are feasible. Our investigation focuses on target distributions characterized by smooth and strongly log-concave densities. We revisit the parallelized randomized midpoint method and employ proof techniques recently developed for analyzing its purely sequential version. Leveraging these techniques, we derive upper bounds on the Wasserstein distance between the sampling and target densities. These bounds quantify the runtime improvement achieved by utilizing parallel processing units, which can be considerable.  ( 2 min )
    An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes
    arXiv:2402.14307v1 Announce Type: cross Abstract: Convolutional neural networks (CNNs) with large kernels, drawing inspiration from the key operations of vision transformers (ViTs), have demonstrated impressive performance in various vision-based applications. To address the issue of computational efficiency degradation in existing designs for supporting large-kernel convolutions, an FPGA-based inference accelerator is proposed for the efficient deployment of CNNs with arbitrary kernel sizes. Firstly, a Z-flow method is presented to optimize the computing data flow by maximizing data reuse opportunity. Besides, the proposed design, incorporating the kernel-segmentation (Kseg) scheme, enables extended support for large-kernel convolutions, significantly reducing the storage requirements for overlapped data. Moreover, based on the analysis of typical block structures in emerging CNNs, vertical-fused (VF) and horizontal-fused (HF) methods are developed to optimize CNN deployments from both computation and transmission perspectives. The proposed hardware accelerator, evaluated on Intel Arria 10 FPGA, achieves up to 3.91 times better DSP efficiency than prior art on the same network. Particularly, it demonstrates efficient support for large-kernel CNNs, achieving throughputs of 169.68 GOPS and 244.55 GOPS for RepLKNet-31 and PyConvResNet-50, respectively, both of which are implemented on hardware for the first time.  ( 2 min )
    Zero-shot generalization across architectures for visual classification
    arXiv:2402.14095v1 Announce Type: cross Abstract: Generalization to unseen data is a key desideratum for deep networks, but its relation to classification accuracy is unclear. Using a minimalist vision dataset and a measure of generalizability, we show that popular networks, from deep convolutional networks (CNNs) to transformers, vary in their power to extrapolate to unseen classes both across layers and across architectures. Accuracy is not a good predictor of generalizability, and generalization varies non-monotonically with layer depth. Code is available at https://github.com/dyballa/zero-shot-generalization.  ( 2 min )
    Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
    arXiv:2402.14245v1 Announce Type: cross Abstract: Recently, there has been considerable attention towards leveraging large language models (LLMs) to enhance decision-making processes. However, aligning the natural language text instructions generated by LLMs with the vectorized operations required for execution presents a significant challenge, often necessitating task-specific details. To circumvent the need for such task-specific granularity, inspired by preference-based policy learning approaches, we investigate the utilization of multimodal LLMs to provide automated preference feedback solely from image inputs to guide decision-making. In this study, we train a multimodal LLM, termed CriticGPT, capable of understanding trajectory videos in robot manipulation tasks, serving as a critic to offer analysis and preference feedback. Subsequently, we validate the effectiveness of preference labels generated by CriticGPT from a reward modeling perspective. Experimental evaluation of the algorithm's preference accuracy demonstrates its effective generalization ability to new tasks. Furthermore, performance on Meta-World tasks reveals that CriticGPT's reward model efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.  ( 2 min )
    Protect and Extend -- Using GANs for Synthetic Data Generation of Time-Series Medical Records
    arXiv:2402.14042v1 Announce Type: new Abstract: Preservation of private user data is of paramount importance for high Quality of Experience (QoE) and acceptability, particularly with services treating sensitive data, such as IT-based health services. Whereas anonymization techniques were shown to be prone to data re-identification, synthetic data generation has gradually replaced anonymization since it is relatively less time and resource-consuming and more robust to data leakage. Generative Adversarial Networks (GANs) have been used for generating synthetic datasets, especially GAN frameworks adhering to the differential privacy phenomena. This research compares state-of-the-art GAN-based models for synthetic data generation to generate time-series synthetic medical records of dementia patients which can be distributed without privacy concerns. Predictive modeling, autocorrelation, and distribution analysis are used to assess the Quality of Generating (QoG) of the generated data. The privacy preservation of the respective models is assessed by applying membership inference attacks to determine potential data leakage risks. Our experiments indicate the superiority of the privacy-preserving GAN (PPGAN) model over other models regarding privacy preservation while maintaining an acceptable level of QoG. The presented results can support better data protection for medical use cases in the future.  ( 2 min )
    Specialty detection in the context of telemedicine in a highly imbalanced multi-class distribution
    arXiv:2402.14039v1 Announce Type: new Abstract: The Covid-19 pandemic has led to an increase in the awareness of and demand for telemedicine services, resulting in a need for automating the process and relying on machine learning (ML) to reduce the operational load. This research proposes a specialty detection classifier based on a machine learning model to automate the process of detecting the correct specialty for each question and routing it to the correct doctor. The study focuses on handling multiclass and highly imbalanced datasets for Arabic medical questions, comparing some oversampling techniques, developing a Deep Neural Network (DNN) model for specialty detection, and exploring the hidden business areas that rely on specialty detection such as customizing and personalizing the consultation flow for different specialties. The proposed module is deployed in both synchronous and asynchronous medical consultations to provide more real-time classification, minimize the doctor effort in addressing the correct specialty, and give the system more flexibility in customizing the medical consultation flow. The evaluation and assessment are based on accuracy, precision, recall, and F1-score. The experimental results suggest that combining multiple techniques, such as SMOTE and reweighing with keyword identification, is necessary to achieve improved performance in detecting rare classes in imbalanced multiclass datasets. By using these techniques, specialty detection models can more accurately detect rare classes in real-world scenarios where imbalanced data is common.  ( 3 min )
    Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?
    arXiv:2307.14642v3 Announce Type: replace-cross Abstract: We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called "linear") rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompasses misspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projected stochastic gradient descent. For the projection operator, we consider a domain with triangular scale matrices, which the projection onto is computable in $\Theta(d)$ time, where $d$ is the dimensionality of the target posterior. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STL estimator, providing explicit non-asymptotic complexity guarantees for both.  ( 2 min )
    Controlling Multiple Errors Simultaneously with a PAC-Bayes Bound
    arXiv:2202.05560v2 Announce Type: replace-cross Abstract: Current PAC-Bayes generalisation bounds are restricted to scalar metrics of performance, such as the loss or error rate. However, one ideally wants more information-rich certificates that control the entire distribution of possible outcomes, such as the distribution of the test loss in regression, or the probabilities of different mis classifications. We provide the first PAC-Bayes bound capable of providing such rich information by bounding the Kullback-Leibler divergence between the empirical and true probabilities of a set of M error types, which can either be discretized loss values for regression, or the elements of the confusion matrix (or a partition thereof) for classification. We transform our bound into a differentiable training objective. Our bound is especially useful in cases where the severity of different mis-classifications may change over time; existing PAC-Bayes bounds can only bound a particular pre-decided weighting of the error types. In contrast our bound implicitly controls all uncountably many weightings simultaneously.  ( 2 min )
    Adversarial Machine Learning: Bayesian Perspectives
    arXiv:2003.03546v2 Announce Type: replace-cross Abstract: Adversarial Machine Learning (AML) is emerging as a major field aimed at protecting machine learning (ML) systems against security threats: in certain scenarios there may be adversaries that actively manipulate input data to fool learning systems. This creates a new class of security vulnerabilities that ML systems may face, and a new desirable property called adversarial robustness essential to trust operations based on ML outputs. Most work in AML is built upon a game-theoretic modelling of the conflict between a learning system and an adversary, ready to manipulate input data. This assumes that each agent knows their opponent's interests and uncertainty judgments, facilitating inferences based on Nash equilibria. However, such common knowledge assumption is not realistic in the security scenarios typical of AML. After reviewing such game-theoretic approaches, we discuss the benefits that Bayesian perspectives provide when defending ML-based systems. We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent's beliefs and interests, relaxing unrealistic assumptions, and providing more robust inferences. We illustrate this approach in supervised learning settings, and identify relevant future research problems.  ( 2 min )
    FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control
    arXiv:2201.10936v4 Announce Type: replace-cross Abstract: Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.  ( 2 min )
    EduGym: An Environment and Notebook Suite for Reinforcement Learning Education
    arXiv:2311.10590v2 Announce Type: replace Abstract: Due to the empirical success of reinforcement learning, an increasing number of students study the subject. However, from our practical teaching experience, we see students entering the field (bachelor, master and early PhD) often struggle. On the one hand, textbooks and (online) lectures provide the fundamentals, but students find it hard to translate between equations and code. On the other hand, public codebases do provide practical examples, but the implemented algorithms tend to be complex, and the underlying test environments contain multiple reinforcement learning challenges at once. Although this is realistic from a research perspective, it often hinders educational conceptual understanding. To solve this issue we introduce EduGym, a set of educational reinforcement learning environments and associated interactive notebooks tailored for education. Each EduGym environment is specifically designed to illustrate a certain aspect/challenge of reinforcement learning (e.g., exploration, partial observability, stochasticity, etc.), while the associated interactive notebook explains the challenge and its possible solution approaches, connecting equations and code in a single document. An evaluation among RL students and researchers shows 86% of them think EduGym is a useful tool for reinforcement learning education. All notebooks are available from https://www.edugym.org/, while the full software package can be installed from https://github.com/RLG-Leiden/edugym.  ( 3 min )
    On Feynman--Kac training of partial Bayesian neural networks
    arXiv:2310.19608v2 Announce Type: replace Abstract: Recently, partial Bayesian neural networks (pBNNs), which only consider a subset of the parameters to be stochastic, were shown to perform competitively with full Bayesian neural networks. However, pBNNs are often multi-modal in the latent variable space and thus challenging to approximate with parametric models. To address this problem, we propose an efficient sampling-based training strategy, wherein the training of a pBNN is formulated as simulating a Feynman--Kac model. We then describe variations of sequential Monte Carlo samplers that allow us to simultaneously estimate the parameters and the latent posterior distribution of this model at a tractable computational cost. Using various synthetic and real-world datasets we show that our proposed training scheme outperforms the state of the art in terms of predictive performance.  ( 2 min )
    Flow-based Distributionally Robust Optimization
    arXiv:2310.19253v3 Announce Type: replace Abstract: We present a computationally efficient framework, called $\texttt{FlowDRO}$, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD) and sample from it. The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution and develop a Wasserstein proximal gradient flow type algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on high-dimensional real data.  ( 2 min )
    Adaptive Batch Sizes for Active Learning A Probabilistic Numerics Approach
    arXiv:2306.05843v2 Announce Type: replace Abstract: Active learning parallelization is widely used, but typically relies on fixing the batch size throughout experimentation. This fixed approach is inefficient because of a dynamic trade-off between cost and speed -- larger batches are more costly, smaller batches lead to slower wall-clock run-times -- and the trade-off may change over the run (larger batches are often preferable earlier). To address this trade-off, we propose a novel Probabilistic Numerics framework that adaptively changes batch sizes. By framing batch selection as a quadrature task, our integration-error-aware algorithm facilitates the automatic tuning of batch sizes to meet predefined quadrature precision objectives, akin to how typical optimizers terminate based on convergence thresholds. This approach obviates the necessity for exhaustive searches across all potential batch sizes. We also extend this to scenarios with constrained active learning and constrained optimization, interpreting constraint violations as reductions in the precision requirement, to subsequently adapt batch construction. Through extensive experiments, we demonstrate that our approach significantly enhances learning efficiency and flexibility in diverse Bayesian batch active learning and Bayesian optimization applications.  ( 2 min )
    Uncertainty Quantification of Spatiotemporal Travel Demand with Probabilistic Graph Neural Networks
    arXiv:2303.04040v2 Announce Type: replace Abstract: Recent studies have significantly improved the prediction accuracy of travel demand using graph neural networks. However, these studies largely ignored uncertainty that inevitably exists in travel demand prediction. To fill this gap, this study proposes a framework of probabilistic graph neural networks (Prob-GNN) to quantify the spatiotemporal uncertainty of travel demand. This Prob-GNN framework is substantiated by deterministic and probabilistic assumptions, and empirically applied to the task of predicting the transit and ridesharing demand in Chicago. We found that the probabilistic assumptions (e.g. distribution tail, support) have a greater impact on uncertainty prediction than the deterministic ones (e.g. deep modules, depth). Among the family of Prob-GNNs, the GNNs with truncated Gaussian and Laplace distributions achieve the highest performance in transit and ridesharing data. Even under significant domain shifts, Prob-GNNs can predict the ridership uncertainty in a stable manner, when the models are trained on pre-COVID data and tested across multiple periods during and after the COVID-19 pandemic. Prob-GNNs also reveal the spatiotemporal pattern of uncertainty, which is concentrated on the afternoon peak hours and the areas with large travel volumes. Overall, our findings highlight the importance of incorporating randomness into deep learning for spatiotemporal ridership prediction. Future research should continue to investigate versatile probabilistic assumptions to capture behavioral randomness, and further develop methods to quantify uncertainty to build resilient cities.  ( 3 min )
    Promises and Pitfalls of Threshold-based Auto-labeling
    arXiv:2211.12620v2 Announce Type: replace Abstract: Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.  ( 2 min )
    Causal Imputation for Counterfactual SCMs: Bridging Graphs and Latent Factor Models
    arXiv:2402.14777v1 Announce Type: cross Abstract: We consider the task of causal imputation, where we aim to predict the outcomes of some set of actions across a wide range of possible contexts. As a running example, we consider predicting how different drugs affect cells from different cell types. We study the index-only setting, where the actions and contexts are categorical variables with a finite number of possible values. Even in this simple setting, a practical challenge arises, since often only a small subset of possible action-context pairs have been studied. Thus, models must extrapolate to novel action-context pairs, which can be framed as a form of matrix completion with rows indexed by actions, columns indexed by contexts, and matrix entries corresponding to outcomes. We introduce a novel SCM-based model class, where the outcome is expressed as a counterfactual, actions are expressed as interventions on an instrumental variable, and contexts are defined based on the initial state of the system. We show that, under a linearity assumption, this setup induces a latent factor model over the matrix of outcomes, with an additional fixed effect term. To perform causal prediction based on this model class, we introduce simple extension to the Synthetic Interventions estimator (Agarwal et al., 2020). We evaluate several matrix completion approaches on the PRISM drug repurposing dataset, showing that our method outperforms all other considered matrix completion approaches.  ( 2 min )
    Consolidating Attention Features for Multi-view Image Editing
    arXiv:2402.14792v1 Announce Type: cross Abstract: Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry.  ( 2 min )
    Batch and match: black-box variational inference with a score-based divergence
    arXiv:2402.14758v1 Announce Type: cross Abstract: Most leading implementations of black-box variational inference (BBVI) are based on optimizing a stochastic evidence lower bound (ELBO). But such approaches to BBVI often converge slowly due to the high variance of their gradient estimates. In this work, we propose batch and match (BaM), an alternative approach to BBVI based on a score-based divergence. Notably, this score-based divergence can be optimized by a closed-form proximal update for Gaussian variational families with full covariance matrices. We analyze the convergence of BaM when the target distribution is Gaussian, and we prove that in the limit of infinite batch size the variational parameter updates converge exponentially quickly to the target mean and covariance. We also evaluate the performance of BaM on Gaussian and non-Gaussian target distributions that arise from posterior inference in hierarchical and deep generative models. In these experiments, we find that BaM typically converges in fewer (and sometimes significantly fewer) gradient evaluations than leading implementations of BBVI based on ELBO maximization.  ( 2 min )
    Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation
    arXiv:2402.14744v1 Announce Type: cross Abstract: This paper introduces a novel approach using Large Language Models (LLMs) integrated into an agent framework for flexible and efficient personal mobility generation. LLMs overcome the limitations of previous models by efficiently processing semantic data and offering versatility in modeling various tasks. Our approach addresses the critical need to align LLMs with real-world urban mobility data, focusing on three research questions: aligning LLMs with rich activity data, developing reliable activity generation strategies, and exploring LLM applications in urban mobility. The key technical contribution is a novel LLM agent framework that accounts for individual activity patterns and motivations, including a self-consistency approach to align LLMs with real-world activity data and a retrieval-augmented strategy for interpretable activity generation. In experimental studies, comprehensive validation is performed using real-world data. This research marks the pioneering work of designing an LLM agent framework for activity generation based on real-world human activity data, offering a promising tool for urban mobility analysis.  ( 2 min )
    COMPASS: Computational Mapping of Patient-Therapist Alliance Strategies with Language Modeling
    arXiv:2402.14701v1 Announce Type: cross Abstract: The therapeutic working alliance is a critical factor in predicting the success of psychotherapy treatment. Traditionally, working alliance assessment relies on questionnaires completed by both therapists and patients. In this paper, we present COMPASS, a novel framework to directly infer the therapeutic working alliance from the natural language used in psychotherapy sessions. Our approach utilizes advanced large language models to analyze transcripts of psychotherapy sessions and compare them with distributed representations of statements in the working alliance inventory. Analyzing a dataset of over 950 sessions covering diverse psychiatric conditions, we demonstrate the effectiveness of our method in microscopically mapping patient-therapist alignment trajectories and providing interpretability for clinical psychiatry and in identifying emerging patterns related to the condition being treated. By employing various neural topic modeling techniques in combination with generative language prompting, we analyze the topical characteristics of different psychiatric conditions and incorporate temporal modeling to capture the evolution of topics at a turn-level resolution. This combined framework enhances the understanding of therapeutic interactions, enabling timely feedback for therapists regarding conversation quality and providing interpretable insights to improve the effectiveness of psychotherapy.  ( 3 min )
    Multivariate Online Linear Regression for Hierarchical Forecasting
    arXiv:2402.14578v1 Announce Type: cross Abstract: In this paper, we consider a deterministic online linear regression model where we allow the responses to be multivariate. To address this problem, we introduce MultiVAW, a method that extends the well-known Vovk-Azoury-Warmuth algorithm to the multivariate setting, and show that it also enjoys logarithmic regret in time. We apply our results to the online hierarchical forecasting problem and recover an algorithm from this literature as a special case, allowing us to relax the hypotheses usually made for its analysis.  ( 2 min )
    Spectral invariance and maximality properties of the frequency spectrum of quantum neural networks
    arXiv:2402.14515v1 Announce Type: cross Abstract: Quantum Neural Networks (QNNs) are a popular approach in Quantum Machine Learning due to their close connection to Variational Quantum Circuits, making them a promising candidate for practical applications on Noisy Intermediate-Scale Quantum (NISQ) devices. A QNN can be expressed as a finite Fourier series, where the set of frequencies is called the frequency spectrum. We analyse this frequency spectrum and prove, for a large class of models, various maximality results. Furthermore, we prove that under some mild conditions there exists a bijection between classes of models with the same area $A = RL$ that preserves the frequency spectrum, where $R$ denotes the number of qubits and $L$ the number of layers, which we consequently call spectral invariance under area-preserving transformations. With this we explain the symmetry in $R$ and $L$ in the results often observed in the literature and show that the maximal frequency spectrum depends only on the area $A = RL$ and not on the individual values of $R$ and $L$. Moreover, we extend existing results and specify the maximum possible frequency spectrum of a QNN with arbitrarily many layers as a function of the spectrum of its generators. If the generators of the QNN can be further decomposed into 2-dimensional sub-generators, then this specification follows from elementary number-theoretical considerations. In the case of arbitrary dimensional generators, we extend existing results based on the so-called Golomb ruler and introduce a second novel approach based on a variation of the turnpike problem, which we call the relaxed turnpike problem.  ( 3 min )
    Reimagining Anomalies: What If Anomalies Were Normal?
    arXiv:2402.14469v1 Announce Type: cross Abstract: Deep learning-based methods have achieved a breakthrough in image anomaly detection, but their complexity introduces a considerable challenge to understanding why an instance is predicted to be anomalous. We introduce a novel explanation method that generates multiple counterfactual examples for each anomaly, capturing diverse concepts of anomalousness. A counterfactual example is a modification of the anomaly that is perceived as normal by the anomaly detector. The method provides a high-level semantic explanation of the mechanism that triggered the anomaly detector, allowing users to explore "what-if scenarios." Qualitative and quantitative analyses across various image datasets show that the method applied to state-of-the-art anomaly detectors can achieve high-quality semantic explanations of detectors.  ( 2 min )
    Quantum Circuit Optimization with AlphaTensor
    arXiv:2402.14396v1 Announce Type: cross Abstract: A key challenge in realizing fault-tolerant quantum computers is circuit optimization. Focusing on the most expensive gates in fault-tolerant quantum computation (namely, the T gates), we address the problem of T-count optimization, i.e., minimizing the number of T gates that are needed to implement a given circuit. To achieve this, we develop AlphaTensor-Quantum, a method based on deep reinforcement learning that exploits the relationship between optimizing T-count and tensor decomposition. Unlike existing methods for T-count optimization, AlphaTensor-Quantum can incorporate domain-specific knowledge about quantum computation and leverage gadgets, which significantly reduces the T-count of the optimized circuits. AlphaTensor-Quantum outperforms the existing methods for T-count optimization on a set of arithmetic benchmarks (even when compared without making use of gadgets). Remarkably, it discovers an efficient algorithm akin to Karatsuba's method for multiplication in finite fields. AlphaTensor-Quantum also finds the best human-designed solutions for relevant arithmetic computations used in Shor's algorithm and for quantum chemistry simulation, thus demonstrating it can save hundreds of hours of research by optimizing relevant quantum circuits in a fully automated way.  ( 2 min )
    Uncertainty-driven and Adversarial Calibration Learning for Epicardial Adipose Tissue Segmentation
    arXiv:2402.14349v1 Announce Type: cross Abstract: Epicardial adipose tissue (EAT) is a type of visceral fat that can secrete large amounts of adipokines to affect the myocardium and coronary arteries. EAT volume and density can be used as independent risk markers measurement of volume by noninvasive magnetic resonance images is the best method of assessing EAT. However, segmenting EAT is challenging due to the low contrast between EAT and pericardial effusion and the presence of motion artifacts. we propose a novel feature latent space multilevel supervision network (SPDNet) with uncertainty-driven and adversarial calibration learning to enhance segmentation for more accurate EAT volume estimation. The network first addresses the blurring of EAT edges due to the medical images in the open medical environments with low quality or out-of-distribution by modeling the uncertainty as a Gaussian distribution in the feature latent space, which using its Bayesian estimation as a regularization constraint to optimize SwinUNETR. Second, an adversarial training strategy is introduced to calibrate the segmentation feature map and consider the multi-scale feature differences between the uncertainty-guided predictive segmentation and the ground truth segmentation, synthesizing the multi-scale adversarial loss directly improves the ability to discriminate the similarity between organizations. Experiments on both the cardiac public MRI dataset (ACDC) and the real-world clinical cohort EAT dataset show that the proposed network outperforms mainstream models, validating that uncertainty-driven and adversarial calibration learning can be used to provide additional information for modeling multi-scale ambiguities.  ( 3 min )
    CEV-LM: Controlled Edit Vector Language Model for Shaping Natural Language Generations
    arXiv:2402.14290v1 Announce Type: cross Abstract: As large-scale language models become the standard for text generation, there is a greater need to tailor the generations to be more or less concise, targeted, and informative, depending on the audience/application. Existing control approaches primarily adjust the semantic (e.g., emotion, topics), structural (e.g., syntax tree, parts-of-speech), and lexical (e.g., keyword/phrase inclusion) properties of text, but are insufficient to accomplish complex objectives such as pacing which control the complexity and readability of the text. In this paper, we introduce CEV-LM - a lightweight, semi-autoregressive language model that utilizes constrained edit vectors to control three complementary metrics (speed, volume, and circuitousness) that quantify the shape of text (e.g., pacing of content). We study an extensive set of state-of-the-art CTG models and find that CEV-LM provides significantly more targeted and precise control of these three metrics while preserving semantic content, using less training data, and containing fewer parameters.  ( 2 min )
    Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation
    arXiv:2402.14264v1 Announce Type: cross Abstract: Average treatment effect estimation is the most central problem in causal inference with application to numerous disciplines. While many estimation strategies have been proposed in the literature, recently also incorporating generic machine learning estimators, the statistical optimality of these methods has still remained an open area of investigation. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that attain small errors; which is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as a black-box sub-process. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATTE), as well as weighted variants of the former, which arise in policy evaluation.  ( 2 min )
    Secure Navigation using Landmark-based Localization in a GPS-denied Environment
    arXiv:2402.14280v1 Announce Type: cross Abstract: In modern battlefield scenarios, the reliance on GPS for navigation can be a critical vulnerability. Adversaries often employ tactics to deny or deceive GPS signals, necessitating alternative methods for the localization and navigation of mobile troops. Range-free localization methods such as DV-HOP rely on radio-based anchors and their average hop distance which suffers from accuracy and stability in a dynamic and sparse network topology. Vision-based approaches like SLAM and Visual Odometry use sensor fusion techniques for map generation and pose estimation that are more sophisticated and computationally expensive. This paper proposes a novel framework that integrates landmark-based localization (LanBLoc) with an Extended Kalman Filter (EKF) to predict the future state of moving entities along the battlefield. Our framework utilizes safe trajectory information generated by the troop control center by considering identifiable landmarks and pre-defined hazard maps. It performs point inclusion tests on the convex hull of the trajectory segments to ensure the safety and survivability of a moving entity and determines the next point forward decisions. We present a simulated battlefield scenario for two different approaches (with EKF and without EKF) that guide a moving entity through an obstacle and hazard-free path. Using the proposed method, we observed a percent error of 6.51% lengthwise in safe trajectory estimation with an Average Displacement Error (ADE) of 2.97m and a Final Displacement Error (FDE) of 3.27m. The results demonstrate that our approach not only ensures the safety of the mobile units by keeping them within the secure trajectory but also enhances operational effectiveness by adapting to the evolving threat landscape.  ( 3 min )
    Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond
    arXiv:2402.14259v1 Announce Type: cross Abstract: Uncertainty estimation plays a pivotal role in ensuring the reliability of safety-critical human-AI interaction systems, particularly in the medical domain. However, a general method for quantifying the uncertainty of free-form answers has yet to be established in open-ended medical question-answering (QA) tasks, where irrelevant words and sequences with limited semantic information can be the primary source of uncertainty due to the presence of generative inequality. In this paper, we propose the Word-Sequence Entropy (WSE), which calibrates the uncertainty proportion at both the word and sequence levels according to the semantic relevance, with greater emphasis placed on keywords and more relevant sequences when performing uncertainty quantification. We compare WSE with 6 baseline methods on 5 free-form medical QA datasets, utilizing 7 "off-the-shelf" large language models (LLMs), and show that WSE exhibits superior performance on accurate uncertainty measurement under two standard criteria for correctness evaluation (e.g., WSE outperforms existing state-of-the-art method by 3.23% AUROC on the MedQA dataset). Additionally, in terms of the potential for real-world medical QA applications, we achieve a significant enhancement in the performance of LLMs when employing sequences with lower uncertainty, identified by WSE, as final answers (e.g., +6.36% accuracy improvement on the COVID-QA dataset), without requiring any additional task-specific fine-tuning or architectural modifications.  ( 2 min )
    MENTOR: Guiding Hierarchical Reinforcement Learning with Human Feedback and Dynamic Distance Constraint
    arXiv:2402.14244v1 Announce Type: cross Abstract: Hierarchical reinforcement learning (HRL) provides a promising solution for complex tasks with sparse rewards of intelligent agents, which uses a hierarchical framework that divides tasks into subgoals and completes them sequentially. However, current methods struggle to find suitable subgoals for ensuring a stable learning process. Without additional guidance, it is impractical to rely solely on exploration or heuristics methods to determine subgoals in a large goal space. To address the issue, We propose a general hierarchical reinforcement learning framework incorporating human feedback and dynamic distance constraints (MENTOR). MENTOR acts as a "mentor", incorporating human feedback into high-level policy learning, to find better subgoals. As for low-level policy, MENTOR designs a dual policy for exploration-exploitation decoupling respectively to stabilize the training. Furthermore, although humans can simply break down tasks into subgoals to guide the right learning direction, subgoals that are too difficult or too easy can still hinder downstream learning efficiency. We propose the Dynamic Distance Constraint (DDC) mechanism dynamically adjusting the space of optional subgoals. Thus MENTOR can generate subgoals matching the low-level policy learning process from easy to hard. Extensive experiments demonstrate that MENTOR uses a small amount of human feedback to achieve significant improvement in complex tasks with sparse rewards.  ( 2 min )
    Content Conditional Debiasing for Fair Text Embedding
    arXiv:2402.14208v1 Announce Type: cross Abstract: Mitigating biases in machine learning models has gained increasing attention in Natural Language Processing (NLP). Yet, only a few studies focus on fair text embeddings, which are crucial yet challenging for real-world applications. In this paper, we propose a novel method for learning fair text embeddings. We achieve fairness while maintaining utility trade-off by ensuring conditional independence between sensitive attributes and text embeddings conditioned on the content. Specifically, we enforce that embeddings of texts with different sensitive attributes but identical content maintain the same distance toward the embedding of their corresponding neutral text. Furthermore, we address the issue of lacking proper training data by using Large Language Models (LLMs) to augment texts into different sensitive groups. Our extensive evaluations demonstrate that our approach effectively improves fairness while preserving the utility of embeddings, representing a pioneering effort in achieving conditional independence for fair text embeddings.  ( 2 min )
    Contrastive Learning of Shared Spatiotemporal EEG Representations Across Individuals for Naturalistic Neuroscience
    arXiv:2402.14213v1 Announce Type: cross Abstract: Neural representations induced by naturalistic stimuli offer insights into how humans respond to peripheral stimuli in daily life. The key to understanding the general neural mechanisms underlying naturalistic stimuli processing involves aligning neural activities across individuals and extracting inter-subject shared neural representations. Targeting the Electroencephalogram (EEG) technique, known for its rich spatial and temporal information, this study presents a general framework for Contrastive Learning of Shared SpatioTemporal EEG Representations across individuals (CL-SSTER). Harnessing the representational capabilities of contrastive learning, CL-SSTER utilizes a neural network to maximize the similarity of EEG representations across individuals for identical stimuli, contrasting with those for varied stimuli. The network employed spatial and temporal convolutions to simultaneously learn the spatial and temporal patterns inherent in EEG. The versatility of CL-SSTER was demonstrated on three EEG datasets, including a synthetic dataset, a speech audio EEG dataset, and an emotional video EEG dataset. CL-SSTER attained the highest inter-subject correlation (ISC) values compared to the state-of-the-art ISC methods. The latent representations generated by CL-SSTER exhibited reliable spatiotemporal EEG patterns, which can be explained by specific aspects of the stimuli. CL-SSTER serves as an interpretable and scalable foundational framework for the identification of inter-subject shared neural representations in the realm of naturalistic neuroscience.  ( 2 min )
    Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer
    arXiv:2402.14205v1 Announce Type: cross Abstract: Many deep learning synthetic speech generation tools are readily available. The use of synthetic speech has caused financial fraud, impersonation of people, and misinformation to spread. For this reason forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset and their performance reduces substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper we propose, Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on ASVspoof2019 dataset. Our experiments show that PS3DT performs well on ASVspoof2019 dataset compared to other approaches using spectrogram for synthetic speech detection. We also investigate generalization performance of PS3DT on In-the-Wild dataset. PS3DT generalizes well than several existing methods on detecting synthetic speech from an out-of-distribution dataset. We also evaluate robustness of PS3DT to detect telephone quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and can detect telephone quality synthetic speech better than several existing methods.  ( 2 min )
    Neural Networks and Friction: Slide, Hold, Learn
    arXiv:2402.14148v1 Announce Type: cross Abstract: In this study, it is demonstrated that Recurrent Neural Networks (RNNs), specifically those utilizing Gated Recurrent Unit (GRU) architecture, possess the capability to learn the complex dynamics of rate-and-state friction laws from synthetic data. The data employed for training the network is generated through the application of traditional rate-and-state friction equations coupled with the aging law for state evolution. A novel aspect of our approach is the formulation of a loss function that explicitly accounts for initial conditions, the direct effect, and the evolution of state variables during training. It is found that the RNN, with its GRU architecture, effectively learns to predict changes in the friction coefficient resulting from velocity jumps, thereby showcasing the potential of machine learning models in understanding and simulating the physics of frictional processes.  ( 2 min )
    GDTM: An Indoor Geospatial Tracking Dataset with Distributed Multimodal Sensors
    arXiv:2402.14136v1 Announce Type: cross Abstract: Constantly locating moving objects, i.e., geospatial tracking, is essential for autonomous building infrastructure. Accurate and robust geospatial tracking often leverages multimodal sensor fusion algorithms, which require large datasets with time-aligned, synchronized data from various sensor types. However, such datasets are not readily available. Hence, we propose GDTM, a nine-hour dataset for multimodal object tracking with distributed multimodal sensors and reconfigurable sensor node placements. Our dataset enables the exploration of several research problems, such as optimizing architectures for processing multimodal data, and investigating models' robustness to adverse sensing conditions and sensor placement variances. A GitHub repository containing the code, sample data, and checkpoints of this work is available at https://github.com/nesl/GDTM.  ( 2 min )
    Multiply Robust Estimation for Local Distribution Shifts with Multiple Domains
    arXiv:2402.14145v1 Announce Type: cross Abstract: Distribution shifts are ubiquitous in real-world machine learning applications, posing a challenge to the generalization of models trained on one data distribution to another. We focus on scenarios where data distributions vary across multiple segments of the entire population and only make local assumptions about the differences between training and test (deployment) distributions within each segment. We propose a two-stage multiply robust estimation method to improve model performance on each individual segment for tabular data analysis. The method involves fitting a linear combination of the based models, learned using clusters of training data from multiple segments, followed by a refinement step for each segment. Our method is designed to be implemented with commonly used off-the-shelf machine learning models. We establish theoretical guarantees on the generalization bound of the method on the test risk. With extensive experiments on synthetic and real datasets, we demonstrate that the proposed method substantially improves over existing alternatives in prediction accuracy and robustness on both regression and classification tasks. We also assess its effectiveness on a user city prediction dataset from a large technology company.  ( 2 min )
    Learning dynamic representations of the functional connectome in neurobiological networks
    arXiv:2402.14102v1 Announce Type: cross Abstract: The static synaptic connectivity of neuronal circuits stands in direct contrast to the dynamics of their function. As in changing community interactions, different neurons can participate actively in various combinations to effect behaviors at different times. We introduce an unsupervised approach to learn the dynamic affinities between neurons in live, behaving animals, and to reveal which communities form among neurons at different times. The inference occurs in two major steps. First, pairwise non-linear affinities between neuronal traces from brain-wide calcium activity are organized by non-negative tensor factorization (NTF). Each factor specifies which groups of neurons are most likely interacting for an inferred interval in time, and for which animals. Finally, a generative model that allows for weighted community detection is applied to the functional motifs produced by NTF to reveal a dynamic functional connectome. Since time codes the different experimental variables (e.g., application of chemical stimuli), this provides an atlas of neural motifs active during separate stages of an experiment (e.g., stimulus application or spontaneous behaviors). Results from our analysis are experimentally validated, confirming that our method is able to robustly predict causal interactions between neurons to generate behavior. Code is available at https://github.com/dyballa/dynamic-connectomes.  ( 2 min )
    Random forests for detecting weak signals and extracting physical information: a case study of magnetic navigation
    arXiv:2402.14131v1 Announce Type: cross Abstract: It was recently demonstrated that two machine-learning architectures, reservoir computing and time-delayed feed-forward neural networks, can be exploited for detecting the Earth's anomaly magnetic field immersed in overwhelming complex signals for magnetic navigation in a GPS-denied environment. The accuracy of the detected anomaly field corresponds to a positioning accuracy in the range of 10 to 40 meters. To increase the accuracy and reduce the uncertainty of weak signal detection as well as to directly obtain the position information, we exploit the machine-learning model of random forests that combines the output of multiple decision trees to give optimal values of the physical quantities of interest. In particular, from time-series data gathered from the cockpit of a flying airplane during various maneuvering stages, where strong background complex signals are caused by other elements of the Earth's magnetic field and the fields produced by the electronic systems in the cockpit, we demonstrate that the random-forest algorithm performs remarkably well in detecting the weak anomaly field and in filtering the position of the aircraft. With the aid of the conventional inertial navigation system, the positioning error can be reduced to less than 10 meters. We also find that, contrary to the conventional wisdom, the classic Tolles-Lawson model for calibrating and removing the magnetic field generated by the body of the aircraft is not necessary and may even be detrimental for the success of the random-forest method.  ( 3 min )
    Improving Language Understanding from Screenshots
    arXiv:2402.14073v1 Announce Type: cross Abstract: An emerging family of language models (LMs), capable of processing both text and images within a single visual view, has the promise to unlock complex tasks such as chart understanding and UI navigation. We refer to these models as screenshot language models. Despite their appeal, existing screenshot LMs substantially lag behind text-only models on language understanding tasks. To close this gap, we adopt a simplified setting where the model inputs are plain-text-rendered screenshots, and we focus on improving the text ability of screenshot LMs. We propose a novel Patch-and-Text Prediction (PTP) objective, which masks and recovers both image patches of screenshots and text within screenshots. We also conduct extensive ablation studies on masking rates and patch sizes, as well as designs for improving training stability. Our pre-trained model, while solely taking visual inputs, achieves comparable performance with BERT on 6 out of 8 GLUE tasks (within 2%) and improves up to 8% over prior work. Additionally, we extend PTP to train autoregressive screenshot LMs and demonstrate its effectiveness--our models can significantly reduce perplexity by utilizing the screenshot context. Together, we hope our findings can inspire future research on developing powerful screenshot LMs and extending their reach to broader applications.  ( 2 min )
    Statistical validation of a deep learning algorithm for dental anomaly detection in intraoral radiographs using paired data
    arXiv:2402.14022v1 Announce Type: cross Abstract: This article describes the clinical validation study setup, statistical analysis and results for a deep learning algorithm which detects dental anomalies in intraoral radiographic images, more specifically caries, apical lesions, root canal treatment defects, marginal defects at crown restorations, periodontal bone loss and calculus. The study compares the detection performance of dentists using the deep learning algorithm to the prior performance of these dentists evaluating the images without algorithmic assistance. Calculating the marginal profit and loss of performance from the annotated paired image data allows for a quantification of the hypothesized change in sensitivity and specificity. The statistical significance of these results is extensively proven using both McNemar's test and the binomial hypothesis test. The average sensitivity increases from $60.7\%$ to $85.9\%$, while the average specificity slightly decreases from $94.5\%$ to $92.7\%$. We prove that the increase of the area under the localization ROC curve (AUC) is significant (from $0.60$ to $0.86$ on average), while the average AUC is bounded by the $95\%$ confidence intervals ${[}0.54, 0.65{]}$ and ${[}0.82, 0.90{]}$. When using the deep learning algorithm for diagnostic guidance, the dentist can be $95\%$ confident that the average true population sensitivity is bounded by the range $79.6\%$ to $91.9\%$. The proposed paired data setup and statistical analysis can be used as a blueprint to thoroughly test the effect of a modality change, like a deep learning based detection and/or segmentation, on radiographic images.  ( 3 min )
    Autoencoder with Ordered Variance for Nonlinear Model Identification
    arXiv:2402.14031v1 Announce Type: cross Abstract: This paper presents a novel autoencoder with ordered variance (AEO) in which the loss function is modified with a variance regularization term to enforce order in the latent space. Further, the autoencoder is modified using ResNets, which results in a ResNet AEO (RAEO). The paper also illustrates the effectiveness of AEO and RAEO in extracting nonlinear relationships among input variables in an unsupervised setting.  ( 2 min )
    Difference Learning for Air Quality Forecasting Transport Emulation
    arXiv:2402.14806v1 Announce Type: new Abstract: Human health is negatively impacted by poor air quality including increased risk for respiratory and cardiovascular disease. Due to a recent increase in extreme air quality events, both globally and locally in the United States, finer resolution air quality forecasting guidance is needed to effectively adapt to these events. The National Oceanic and Atmospheric Administration provides air quality forecasting guidance for the Continental United States. Their air quality forecasting model is based on a 15 km spatial resolution; however, the goal is to reach a three km spatial resolution. This is currently not feasible due in part to prohibitive computational requirements for modeling the transport of chemical species. In this work, we describe a deep learning transport emulator that is able to reduce computations while maintaining skill comparable with the existing numerical model. We show how this method maintains skill in the presence of extreme air quality events, making it a potential candidate for operational use. We also explore evaluating how well this model maintains the physical properties of the modeled transport for a given set of species.  ( 2 min )
    Revisiting Convergence of AdaGrad with Relaxed Assumptions
    arXiv:2402.13794v1 Announce Type: cross Abstract: In this study, we revisit the convergence of AdaGrad with momentum (covering AdaGrad as a special case) on non-convex smooth optimization problems. We consider a general noise model where the noise magnitude is controlled by the function value gap together with the gradient magnitude. This model encompasses a broad range of noises including bounded noise, sub-Gaussian noise, affine variance noise and the expected smoothness, and it has been shown to be more realistic in many practical applications. Our analysis yields a probabilistic convergence rate which, under the general noise, could reach at (\tilde{\mathcal{O}}(1/\sqrt{T})). This rate does not rely on prior knowledge of problem-parameters and could accelerate to (\tilde{\mathcal{O}}(1/T)) where (T) denotes the total number iterations, when the noise parameters related to the function value gap and noise level are sufficiently small. The convergence rate thus matches the lower rate for stochastic first-order methods over non-convex smooth landscape up to logarithm terms [Arjevani et al., 2023]. We further derive a convergence bound for AdaGrad with mometum, considering the generalized smoothness where the local smoothness is controlled by a first-order function of the gradient norm.  ( 2 min )
    Self-Guided Masked Autoencoders for Domain-Agnostic Self-Supervised Learning
    arXiv:2402.14789v1 Announce Type: new Abstract: Self-supervised learning excels in learning representations from large amounts of unlabeled data, demonstrating success across multiple data modalities. Yet, extending self-supervised learning to new modalities is non-trivial because the specifics of existing methods are tailored to each domain, such as domain-specific augmentations which reflect the invariances in the target task. While masked modeling is promising as a domain-agnostic framework for self-supervised learning because it does not rely on input augmentations, its mask sampling procedure remains domain-specific. We present Self-guided Masked Autoencoders (SMA), a fully domain-agnostic masked modeling method. SMA trains an attention based model using a masked modeling objective, by learning masks to sample without any domain-specific assumptions. We evaluate SMA on three self-supervised learning benchmarks in protein biology, chemical property prediction, and particle physics. We find SMA is capable of learning representations without domain-specific knowledge and achieves state-of-the-art performance on these three benchmarks.  ( 2 min )
    Link Prediction under Heterophily: A Physics-Inspired Graph Neural Network Approach
    arXiv:2402.14802v1 Announce Type: new Abstract: In the past years, Graph Neural Networks (GNNs) have become the `de facto' standard in various deep learning domains, thanks to their flexibility in modeling real-world phenomena represented as graphs. However, the message-passing mechanism of GNNs faces challenges in learnability and expressivity, hindering high performance on heterophilic graphs, where adjacent nodes frequently have different labels. Most existing solutions addressing these challenges are primarily confined to specific benchmarks focused on node classification tasks. This narrow focus restricts the potential impact that link prediction under heterophily could offer in several applications, including recommender systems. For example, in social networks, two users may be connected for some latent reason, making it challenging to predict such connections in advance. Physics-Inspired GNNs such as GRAFF provided a significant contribution to enhance node classification performance under heterophily, thanks to the adoption of physics biases in the message-passing. Drawing inspiration from these findings, we advocate that the methodology employed by GRAFF can improve link prediction performance as well. To further explore this hypothesis, we introduce GRAFF-LP, an extension of GRAFF to link prediction. We evaluate its efficacy within a recent collection of heterophilic graphs, establishing a new benchmark for link prediction under heterophily. Our approach surpasses previous methods, in most of the datasets, showcasing a strong flexibility in different contexts, and achieving relative AUROC improvements of up to 26.7%.  ( 3 min )
    Generalizing Reward Modeling for Out-of-Distribution Preference Learning
    arXiv:2402.14760v1 Announce Type: new Abstract: Preference learning (PL) with large language models (LLMs) aims to align the LLMs' generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging. Thus, out-of-distribution (OOD) PL is practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm is utilized to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for PL. We theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.  ( 2 min )
    Rao-Blackwellising Bayesian Causal Inference
    arXiv:2402.14781v1 Announce Type: new Abstract: Bayesian causal inference, i.e., inferring a posterior over causal models for the use in downstream causal reasoning tasks, poses a hard computational inference problem that is little explored in literature. In this work, we combine techniques from order-based MCMC structure learning with recent advances in gradient-based graph learning into an effective Bayesian causal inference framework. Specifically, we decompose the problem of inferring the causal structure into (i) inferring a topological order over variables and (ii) inferring the parent sets for each variable. When limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time. We further use Gaussian processes to model the unknown causal mechanisms, which also allows their exact marginalisation. This introduces a Rao-Blackwellization scheme, where all components are eliminated from the model, except for the causal order, for which we learn a distribution via gradient-based optimisation. The combination of Rao-Blackwellization with our sequential inference procedure for causal orders yields state-of-the-art on linear and non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures.  ( 2 min )
    Generalising realisability in statistical learning theory under epistemic uncertainty
    arXiv:2402.14759v1 Announce Type: new Abstract: The purpose of this paper is to look into how central notions in statistical learning theory, such as realisability, generalise under the assumption that train and test distribution are issued from the same credal set, i.e., a convex set of probability distributions. This can be considered as a first step towards a more general treatment of statistical learning under epistemic uncertainty.  ( 2 min )
    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
    arXiv:2402.14740v1 Announce Type: new Abstract: AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. \textsc{Proximal Policy Optimization} (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the \textit{formulation} of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics enables benefiting from online RL optimization at low cost.  ( 2 min )
    How Transformers Learn Causal Structure with Gradient Descent
    arXiv:2402.14735v1 Announce Type: new Abstract: The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.  ( 2 min )
    Incorporating Expert Rules into Neural Networks in the Framework of Concept-Based Learning
    arXiv:2402.14726v1 Announce Type: new Abstract: A problem of incorporating the expert rules into machine learning models for extending the concept-based learning is formulated in the paper. It is proposed how to combine logical rules and neural networks predicting the concept probabilities. The first idea behind the combination is to form constraints for a joint probability distribution over all combinations of concept values to satisfy the expert rules. The second idea is to represent a feasible set of probability distributions in the form of a convex polytope and to use its vertices or faces. We provide several approaches for solving the stated problem and for training neural networks which guarantee that the output probabilities of concepts would not violate the expert rules. The solution of the problem can be viewed as a way for combining the inductive and deductive learning. Expert rules are used in a broader sense when any logical function that connects concepts and class labels or just concepts with each other can be regarded as a rule. This feature significantly expands the class of the proposed results. Numerical examples illustrate the approaches. The code of proposed algorithms is publicly available.  ( 2 min )
    Clifford-Steerable Convolutional Neural Networks
    arXiv:2402.14730v1 Announce Type: new Abstract: We present Clifford-Steerable Convolutional Neural Networks (CS-CNNs), a novel class of $\mathrm{E}(p, q)$-equivariant CNNs. CS-CNNs process multivector fields on pseudo-Euclidean spaces $\mathbb{R}^{p,q}$. They cover, for instance, $\mathrm{E}(3)$-equivariance on $\mathbb{R}^3$ and Poincar\'e-equivariance on Minkowski spacetime $\mathbb{R}^{1,3}$. Our approach is based on an implicit parametrization of $\mathrm{O}(p,q)$-steerable kernels via Clifford group equivariant neural networks. We significantly and consistently outperform baseline methods on fluid dynamics as well as relativistic electrodynamics forecasting tasks.  ( 2 min )
    On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation
    arXiv:2402.14703v1 Announce Type: new Abstract: We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantee avoids exponential dependence on the horizon. While such estimators exist for MDPs and POMDPs can be converted to history-based MDPs, their estimation errors depend on the state-density ratio for MDPs which becomes history ratios after conversion, an exponential object. Recently, Uehara et al. (2022) proposed future-dependent value functions as a promising framework to address this issue, where the guarantee for memoryless policies depends on the density ratio over the latent state space. However, it also depends on the boundedness of the future-dependent value function and other related quantities, which we show could be exponential-in-length and thus erasing the advantage of the method. In this paper, we discover novel coverage assumptions tailored to the structure of POMDPs, such as outcome coverage and belief coverage. These assumptions not only enable polynomial bounds on the aforementioned quantities, but also lead to the discovery of new algorithms with complementary properties.  ( 2 min )
    CaT-GNN: Enhancing Credit Card Fraud Detection via Causal Temporal Graph Neural Networks
    arXiv:2402.14708v1 Announce Type: new Abstract: Credit card fraud poses a significant threat to the economy. While Graph Neural Network (GNN)-based fraud detection methods perform well, they often overlook the causal effect of a node's local structure on predictions. This paper introduces a novel method for credit card fraud detection, the \textbf{\underline{Ca}}usal \textbf{\underline{T}}emporal \textbf{\underline{G}}raph \textbf{\underline{N}}eural \textbf{N}etwork (CaT-GNN), which leverages causal invariant learning to reveal inherent correlations within transaction data. By decomposing the problem into discovery and intervention phases, CaT-GNN identifies causal nodes within the transaction graph and applies a causal mixup strategy to enhance the model's robustness and interpretability. CaT-GNN consists of two key components: Causal-Inspector and Causal-Intervener. The Causal-Inspector utilizes attention weights in the temporal attention mechanism to identify causal and environment nodes without introducing additional parameters. Subsequently, the Causal-Intervener performs a causal mixup enhancement on environment nodes based on the set of nodes. Evaluated on three datasets, including a private financial dataset and two public datasets, CaT-GNN demonstrates superior performance over existing state-of-the-art methods. Our findings highlight the potential of integrating causal reasoning with graph neural networks to improve fraud detection capabilities in financial transactions.  ( 2 min )
    Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
    arXiv:2402.14688v1 Announce Type: new Abstract: We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes we consider either reward modeling or a class of novel direct policy learning objectives based on importance weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe .  ( 2 min )
    Big data analytics to classify earthwork-related locations: A Chengdu study
    arXiv:2402.14698v1 Announce Type: new Abstract: Air pollution has significantly intensified, leading to severe health consequences worldwide. Earthwork-related locations (ERLs) constitute significant sources of urban dust pollution. The effective management of ERLs has long posed challenges for governmental and environmental agencies, primarily due to their classification under different regulatory authorities, information barriers, delays in data updating, and a lack of dust suppression measures for various sources of dust pollution. To address these challenges, we classified urban dust pollution sources using dump truck trajectory, urban point of interest (POI), and land cover data. We compared several prediction models and investigated the relationship between features and dust pollution sources using real data. The results demonstrate that high-accuracy classification can be achieved with a limited number of features. This method was successfully implemented in the system called Alpha MAPS in Chengdu to provide decision support for urban pollution control.  ( 2 min )
    Rethinking Invariance Regularization in Adversarial Training to Improve Robustness-Accuracy Trade-off
    arXiv:2402.14648v1 Announce Type: new Abstract: Although adversarial training has been the state-of-the-art approach to defend against adversarial examples (AEs), they suffer from a robustness-accuracy trade-off. In this work, we revisit representation-based invariance regularization to learn discriminative yet adversarially invariant representations, aiming to mitigate this trade-off. We empirically identify two key issues hindering invariance regularization: (1) a "gradient conflict" between invariance loss and classification objectives, indicating the existence of "collapsing solutions," and (2) the mixture distribution problem arising from diverged distributions of clean and adversarial inputs. To address these issues, we propose Asymmetrically Representation-regularized Adversarial Training (AR-AT), which incorporates a stop-gradient operation and a pre-dictor in the invariance loss to avoid "collapsing solutions," inspired by a recent non-contrastive self-supervised learning approach, and a split-BatchNorm (BN) structure to resolve the mixture distribution problem. Our method significantly improves the robustness-accuracy trade-off by learning adversarially invariant representations without sacrificing discriminative power. Furthermore, we discuss the relevance of our findings to knowledge-distillation-based defense methods, contributing to a deeper understanding of their relative successes.  ( 2 min )
    Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
    arXiv:2402.14664v1 Announce Type: new Abstract: In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework to capture these correlations through structured and informative priors. In this framework, we propose sDM, a generic Bayesian approach designed for OPE and OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, deviating from the conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations. Empirical evidence showcases the strong performance of sDM.  ( 2 min )
    Sparse Linear Regression and Lattice Problems
    arXiv:2402.14645v1 Announce Type: new Abstract: Sparse linear regression (SLR) is a well-studied problem in statistics where one is given a design matrix $X\in\mathbb{R}^{m\times n}$ and a response vector $y=X\theta^*+w$ for a $k$-sparse vector $\theta^*$ (that is, $\|\theta^*\|_0\leq k$) and small, arbitrary noise $w$, and the goal is to find a $k$-sparse $\widehat{\theta} \in \mathbb{R}^n$ that minimizes the mean squared prediction error $\frac{1}{m}\|X\widehat{\theta}-X\theta^*\|^2_2$. While $\ell_1$-relaxation methods such as basis pursuit, Lasso, and the Dantzig selector solve SLR when the design matrix is well-conditioned, no general algorithm is known, nor is there any formal evidence of hardness in an average-case setting with respect to all efficient algorithms. We give evidence of average-case hardness of SLR w.r.t. all efficient algorithms assuming the worst-case hardness of lattice problems. Specifically, we give an instance-by-instance reduction from a variant of the bounded distance decoding (BDD) problem on lattices to SLR, where the condition number of the lattice basis that defines the BDD instance is directly related to the restricted eigenvalue condition of the design matrix, which characterizes some of the classical statistical-computational gaps for sparse linear regression. Also, by appealing to worst-case to average-case reductions from the world of lattices, this shows hardness for a distribution of SLR instances; while the design matrices are ill-conditioned, the resulting SLR instances are in the identifiable regime. Furthermore, for well-conditioned (essentially) isotropic Gaussian design matrices, where Lasso is known to behave well in the identifiable regime, we show hardness of outputting any good solution in the unidentifiable regime where there are many solutions, assuming the worst-case hardness of standard and well-studied lattice problems.  ( 3 min )
    CoLoRA: Continuous low-rank adaptation for reduced implicit neural modeling of parameterized partial differential equations
    arXiv:2402.14646v1 Announce Type: new Abstract: This work introduces reduced models based on Continuous Low Rank Adaptation (CoLoRA) that pre-train neural networks for a given partial differential equation and then continuously adapt low-rank weights in time to rapidly predict the evolution of solution fields at new physics parameters and new initial conditions. The adaptation can be either purely data-driven or via an equation-driven variational approach that provides Galerkin-optimal approximations. Because CoLoRA approximates solution fields locally in time, the rank of the weights can be kept small, which means that only few training trajectories are required offline so that CoLoRA is well suited for data-scarce regimes. Predictions with CoLoRA are orders of magnitude faster than with classical methods and their accuracy and parameter efficiency is higher compared to other neural network approaches.  ( 2 min )
    Federated Complex Qeury Answering
    arXiv:2402.14609v1 Announce Type: new Abstract: Complex logical query answering is a challenging task in knowledge graphs (KGs) that has been widely studied. The ability to perform complex logical reasoning is essential and supports various graph reasoning-based downstream tasks, such as search engines. Recent approaches are proposed to represent KG entities and logical queries into embedding vectors and find answers to logical queries from the KGs. However, existing proposed methods mainly focus on querying a single KG and cannot be applied to multiple graphs. In addition, directly sharing KGs with sensitive information may incur privacy risks, making it impractical to share and construct an aggregated KG for reasoning to retrieve query answers. Thus, it remains unknown how to answer queries on multi-source KGs. An entity can be involved in various knowledge graphs and reasoning on multiple KGs and answering complex queries on multi-source KGs is important in discovering knowledge cross graphs. Fortunately, federated learning is utilized in knowledge graphs to collaboratively learn representations with privacy preserved. Federated knowledge graph embeddings enrich the relations in knowledge graphs to improve the representation quality. However, these methods only focus on one-hop relations and cannot perform complex reasoning tasks. In this paper, we apply federated learning to complex query-answering tasks to reason over multi-source knowledge graphs while preserving privacy. We propose a Federated Complex Query Answering framework (FedCQA), to reason over multi-source KGs avoiding sensitive raw data transmission to protect privacy. We conduct extensive experiments on three real-world datasets and evaluate retrieval performance on various types of complex queries.  ( 3 min )
    latrend: A Framework for Clustering Longitudinal Data
    arXiv:2402.14621v1 Announce Type: new Abstract: Clustering of longitudinal data is used to explore common trends among subjects over time for a numeric measurement of interest. Various R packages have been introduced throughout the years for identifying clusters of longitudinal patterns, summarizing the variability in trajectories between subject in terms of one or more trends. We introduce the R package "latrend" as a framework for the unified application of methods for longitudinal clustering, enabling comparisons between methods with minimal coding. The package also serves as an interface to commonly used packages for clustering longitudinal data, including "dtwclust", "flexmix", "kml", "lcmm", "mclust", "mixAK", and "mixtools". This enables researchers to easily compare different approaches, implementations, and method specifications. Furthermore, researchers can build upon the standard tools provided by the framework to quickly implement new cluster methods, enabling rapid prototyping. We demonstrate the functionality and application of the latrend package on a synthetic dataset based on the therapy adherence patterns of patients with sleep apnea.  ( 2 min )
    OmniPred: Language Models as Universal Regressors
    arXiv:2402.14547v1 Announce Type: new Abstract: Over the broad landscape of experimental design, regression has been a powerful tool to accurately predict the outcome metrics of a system or model given a set of parameters, but has been traditionally restricted to methods which are only applicable to a specific task. In this paper, we propose OmniPred, a framework for training language models as universal end-to-end regressors over $(x,y)$ evaluation data from diverse real world experiments. Using data sourced from Google Vizier, one of the largest blackbox optimization databases in the world, our extensive experiments demonstrate that through only textual representations of mathematical parameters and values, language models are capable of very precise numerical regression, and if given the opportunity to train over multiple tasks, can significantly outperform traditional regression models.  ( 2 min )
    Bandits with Abstention under Expert Advice
    arXiv:2402.14585v1 Announce Type: new Abstract: We study the classic problem of prediction with expert advice under bandit feedback. Our model assumes that one action, corresponding to the learner's abstention from play, has no reward or loss on every trial. We propose the CBA algorithm, which exploits this assumption to obtain reward bounds that can significantly improve those of the classical Exp4 algorithm. We can view our problem as the aggregation of confidence-rated predictors when the learner has the option of abstention from play. Importantly, we are the first to achieve bounds on the expected cumulative reward for general confidence-rated predictors. In the special case of specialists we achieve a novel reward bound, significantly improving previous bounds of SpecialistExp (treating abstention as another action). As an example application, we discuss learning unions of balls in a finite metric space. In this contextual setting, we devise an efficient implementation of CBA, reducing the runtime from quadratic to almost linear in the number of contexts. Preliminary experiments show that CBA improves over existing bandit algorithms.  ( 2 min )
    A Framework for Variational Inference of Lightweight Bayesian Neural Networks with Heteroscedastic Uncertainties
    arXiv:2402.14532v1 Announce Type: new Abstract: Obtaining heteroscedastic predictive uncertainties from a Bayesian Neural Network (BNN) is vital to many applications. Often, heteroscedastic aleatoric uncertainties are learned as outputs of the BNN in addition to the predictive means, however doing so may necessitate adding more learnable parameters to the network. In this work, we demonstrate that both the heteroscedastic aleatoric and epistemic variance can be embedded into the variances of learned BNN parameters, improving predictive performance for lightweight networks. By complementing this approach with a moment propagation approach to inference, we introduce a relatively simple framework for sampling-free variational inference suitable for lightweight BNNs.  ( 2 min )
    Federated Learning on Transcriptomic Data: Model Quality and Performance Trade-Offs
    arXiv:2402.14527v1 Announce Type: new Abstract: Machine learning on large-scale genomic or transcriptomic data is important for many novel health applications. For example, precision medicine tailors medical treatments to patients on the basis of individual biomarkers, cellular and molecular states, etc. However, the data required is sensitive, voluminous, heterogeneous, and typically distributed across locations where dedicated machine learning hardware is not available. Due to privacy and regulatory reasons, it is also problematic to aggregate all data at a trusted third party.Federated learning is a promising solution to this dilemma, because it enables decentralized, collaborative machine learning without exchanging raw data. In this paper, we perform comparative experiments with the federated learning frameworks TensorFlow Federated and Flower. Our test case is the training of disease prognosis and cell type classification models. We train the models with distributed transcriptomic data, considering both data heterogeneity and architectural heterogeneity. We measure model quality, robustness against privacy-enhancing noise, computational performance and resource overhead. Each of the federated learning frameworks has different strengths. However, our experiments confirm that both frameworks can readily build models on transcriptomic data, without transferring personal raw data to a third party with abundant computational resources.  ( 2 min )
    Imbalanced Data Clustering using Equilibrium K-Means
    arXiv:2402.14490v1 Announce Type: new Abstract: Imbalanced data, characterized by an unequal distribution of data points across different clusters, poses a challenge for traditional hard and fuzzy clustering algorithms, such as hard K-means (HKM, or Lloyd's algorithm) and fuzzy K-means (FKM, or Bezdek's algorithm). This paper introduces equilibrium K-means (EKM), a novel and simple K-means-type algorithm that alternates between just two steps, yielding significantly improved clustering results for imbalanced data by reducing the tendency of centroids to crowd together in the center of large clusters. We also present a unifying perspective for HKM, FKM, and EKM, showing they are essentially gradient descent algorithms with an explicit relationship to Newton's method. EKM has the same time and space complexity as FKM but offers a clearer physical meaning for its membership definition. We illustrate the performance of EKM on two synthetic and ten real datasets, comparing it to various clustering algorithms, including HKM, FKM, maximum-entropy fuzzy clustering, two FKM variations designed for imbalanced data, and the Gaussian mixture model. The results demonstrate that EKM performs competitively on balanced data while significantly outperforming other techniques on imbalanced data. For high-dimensional data clustering, we demonstrate that a more discriminative representation can be obtained by mapping high-dimensional data via deep neural networks into a low-dimensional, EKM-friendly space. Deep clustering with EKM improves clustering accuracy by 35% on an imbalanced dataset derived from MNIST compared to deep clustering based on HKM.  ( 2 min )
    Towards Automated Causal Discovery: a case study on 5G telecommunication data
    arXiv:2402.14481v1 Announce Type: new Abstract: We introduce the concept of Automated Causal Discovery (AutoCD), defined as any system that aims to fully automate the application of causal discovery and causal reasoning methods. AutoCD's goal is to deliver all causal information that an expert human analyst would and answer a user's causal queries. We describe the architecture of such a platform, and illustrate its performance on synthetic data sets. As a case study, we apply it on temporal telecommunication data. The system is general and can be applied to a plethora of causal discovery problems.  ( 2 min )
    SpanSeq: Similarity-based sequence data splitting method for improved development and assessment of deep learning projects
    arXiv:2402.14482v1 Announce Type: new Abstract: The use of deep learning models in computational biology has increased massively in recent years, and is expected to do so further with the current advances in fields like Natural Language Processing. These models, although able to draw complex relations between input and target, are also largely inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to randomly split the available data in development (train/validation) and test sets. This procedure, although standard, has lately been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of the state-of-the-art model DeepLoc, not only confirming the consequences of randomly splitting databases on the model assessment, but expanding those repercussions to the model development. SpanSeq is available for downloading and installing at https://github.com/genomicepidemiology/SpanSeq.  ( 3 min )
    Data Science with LLMs and Interpretable Models
    arXiv:2402.14474v1 Announce Type: new Abstract: Recent years have seen important advances in the building of interpretable models, machine learning models that are designed to be easily understood by humans. In this work, we show that large language models (LLMs) are remarkably good at working with interpretable models, too. In particular, we show that LLMs can describe, interpret, and debug Generalized Additive Models (GAMs). Combining the flexibility of LLMs with the breadth of statistical patterns accurately described by GAMs enables dataset summarization, question answering, and model critique. LLMs can also improve the interaction between domain experts and interpretable models, and generate hypotheses about the underlying phenomenon. We release \url{https://github.com/interpretml/TalkToEBM} as an open-source LLM-GAM interface.  ( 2 min )
    DynGMA: a robust approach for learning stochastic differential equations from data
    arXiv:2402.14475v1 Announce Type: new Abstract: Learning unknown stochastic differential equations (SDEs) from observed data is a significant and challenging task with applications in various fields. Current approaches often use neural networks to represent drift and diffusion functions, and construct likelihood-based loss by approximating the transition density to train these networks. However, these methods often rely on one-step stochastic numerical schemes, necessitating data with sufficiently high time resolution. In this paper, we introduce novel approximations to the transition density of the parameterized SDE: a Gaussian density approximation inspired by the random perturbation theory of dynamical systems, and its extension, the dynamical Gaussian mixture approximation (DynGMA). Benefiting from the robust density approximation, our method exhibits superior accuracy compared to baseline methods in learning the fully unknown drift and diffusion functions and computing the invariant distribution from trajectory data. And it is capable of handling trajectory data with low time resolution and variable, even uncontrollable, time step sizes, such as data generated from Gillespie's stochastic simulations. We then conduct several experiments across various scenarios to verify the advantages and robustness of the proposed method.  ( 2 min )
    Text me the data: Generating Ground Pressure Sequence from Textual Descriptions for HAR
    arXiv:2402.14427v1 Announce Type: new Abstract: In human activity recognition (HAR), the availability of substantial ground truth is necessary for training efficient models. However, acquiring ground pressure data through physical sensors itself can be cost-prohibitive, time-consuming. To address this critical need, we introduce Text-to-Pressure (T2P), a framework designed to generate extensive ground pressure sequences from textual descriptions of human activities using deep learning techniques. We show that the combination of vector quantization of sensor data along with simple text conditioned auto regressive strategy allows us to obtain high-quality generated pressure sequences from textual descriptions with the help of discrete latent correlation between text and pressure maps. We achieved comparable performance on the consistency between text and generated motion with an R squared value of 0.722, Masked R squared value of 0.892, and FID score of 1.83. Additionally, we trained a HAR model with the the synthesized data and evaluated it on pressure dynamics collected by a real pressure sensor which is on par with a model trained on only real data. Combining both real and synthesized training data increases the overall macro F1 score by 5.9 percent.  ( 2 min )
    Robust Training of Federated Models with Extremely Label Deficiency
    arXiv:2402.14430v1 Announce Type: new Abstract: Federated semi-supervised learning (FSSL) has emerged as a powerful paradigm for collaboratively training machine learning models using distributed data with label deficiency. Advanced FSSL methods predominantly focus on training a single model on each client. However, this approach could lead to a discrepancy between the objective functions of labeled and unlabeled data, resulting in gradient conflicts. To alleviate gradient conflict, we propose a novel twin-model paradigm, called Twin-sight, designed to enhance mutual guidance by providing insights from different perspectives of labeled and unlabeled data. In particular, Twin-sight concurrently trains a supervised model with a supervised objective function while training an unsupervised model using an unsupervised objective function. To enhance the synergy between these two models, Twin-sight introduces a neighbourhood-preserving constraint, which encourages the preservation of the neighbourhood relationship among data features extracted by both models. Our comprehensive experiments on four benchmark datasets provide substantial evidence that Twin-sight can significantly outperform state-of-the-art methods across various experimental settings, demonstrating the efficacy of the proposed Twin-sight.  ( 2 min )
    Global Safe Sequential Learning via Efficient Knowledge Transfer
    arXiv:2402.14402v1 Announce Type: new Abstract: Sequential learning methods such as active learning and Bayesian optimization select the most informative data to learn about a task. In many medical or engineering applications, the data selection is constrained by a priori unknown safety conditions. A promissing line of safe learning methods utilize Gaussian processes (GPs) to model the safety probability and perform data selection in areas with high safety confidence. However, accurate safety modeling requires prior knowledge or consumes data. In addition, the safety confidence centers around the given observations which leads to local exploration. As transferable source knowledge is often available in safety critical experiments, we propose to consider transfer safe sequential learning to accelerate the learning of safety. We further consider a pre-computation of source components to reduce the additional computational load that is introduced by incorporating source data. In this paper, we theoretically analyze the maximum explorable safe regions of conventional safe learning methods. Furthermore, we empirically demonstrate that our approach 1) learns a task with lower data consumption, 2) globally explores multiple disjoint safe regions under guidance of the source knowledge, and 3) operates with computation comparable to conventional safe learning methods.  ( 2 min )
    Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning
    arXiv:2402.14407v1 Announce Type: new Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise for utilizing actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. In this paper, we introduce a novel framework that leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning trained on a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches with superior generalization ability. Our project website is available at https://video-diff.github.io/.  ( 2 min )
    MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding
    arXiv:2402.14391v1 Announce Type: new Abstract: Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. The growing demand and cost of experimental PPI assays require computational methods for efficient PPI prediction. While existing methods rely heavily on protein sequence for PPI prediction, it is the protein structure that is the key to determine the interactions. To take both protein modalities into account, we define the microenvironment of an amino acid residue by its sequence and structural contexts, which describe the surrounding chemical properties and geometric features. In addition, microenvironments defined in previous work are largely based on experimentally assayed physicochemical properties, for which the "vocabulary" is usually extremely small. This makes it difficult to cover the diversity and complexity of microenvironments. In this paper, we propose Microenvironment-Aware Protein Embedding for PPI prediction (MPAE-PPI), which encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment "vocabulary" (i.e., codebook). Moreover, we propose a novel pre-training strategy, namely Masked Codebook Modeling (MCM), to capture the dependencies between different microenvironments by randomly masking the codebook and reconstructing the input. With the learned microenvironment codebook, we can reuse it as an off-the-shelf tool to efficiently and effectively encode proteins of different sizes and functions for large-scale PPI prediction. Extensive experiments show that MAPE-PPI can scale to PPI prediction with millions of PPIs with superior trade-offs between effectiveness and computational efficiency than the state-of-the-art competitors.  ( 3 min )
    Graph Parsing Networks
    arXiv:2402.14393v1 Announce Type: new Abstract: Graph pooling compresses graph information into a compact representation. State-of-the-art graph pooling methods follow a hierarchical approach, which reduces the graph size step-by-step. These methods must balance memory efficiency with preserving node information, depending on whether they use node dropping or node clustering. Additionally, fixed pooling ratios or numbers of pooling layers are predefined for all graphs, which prevents personalized pooling structures from being captured for each individual graph. In this work, inspired by bottom-up grammar induction, we propose an efficient graph parsing algorithm to infer the pooling structure, which then drives graph pooling. The resulting Graph Parsing Network (GPN) adaptively learns personalized pooling structure for each individual graph. GPN benefits from the discrete assignments generated by the graph parsing algorithm, allowing good memory efficiency while preserving node information intact. Experimental results on standard benchmarks demonstrate that GPN outperforms state-of-the-art graph pooling methods in graph classification tasks while being able to achieve competitive performance in node classification tasks. We also conduct a graph reconstruction task to show GPN's ability to preserve node information and measure both memory and time efficiency through relevant tests.  ( 2 min )
    WindDragon: Enhancing wind power forecasting with Automated Deep Learning
    arXiv:2402.14385v1 Announce Type: new Abstract: Achieving net zero carbon emissions by 2050 requires the integration of increasing amounts of wind power into power grids. This energy source poses a challenge to system operators due to its variability and uncertainty. Therefore, accurate forecasting of wind power is critical for grid operation and system balancing. This paper presents an innovative approach to short-term (1 to 6 hour horizon) windpower forecasting at a national level. The method leverages Automated Deep Learning combined with Numerical Weather Predictions wind speed maps to accurately forecast wind power.  ( 2 min )
    Securing Transactions: A Hybrid Dependable Ensemble Machine Learning Model using IHT-LR and Grid Search
    arXiv:2402.14389v1 Announce Type: new Abstract: Financial institutions and businesses face an ongoing challenge from fraudulent transactions, prompting the need for effective detection methods. Detecting credit card fraud is crucial for identifying and preventing unauthorized transactions.Timely detection of fraud enables investigators to take swift actions to mitigate further losses. However, the investigation process is often time-consuming, limiting the number of alerts that can be thoroughly examined each day. Therefore, the primary objective of a fraud detection model is to provide accurate alerts while minimizing false alarms and missed fraud cases. In this paper, we introduce a state-of-the-art hybrid ensemble (ENS) dependable Machine learning (ML) model that intelligently combines multiple algorithms with proper weighted optimization using Grid search, including Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Multilayer Perceptron (MLP), to enhance fraud identification. To address the data imbalance issue, we employ the Instant Hardness Threshold (IHT) technique in conjunction with Logistic Regression (LR), surpassing conventional approaches. Our experiments are conducted on a publicly available credit card dataset comprising 284,807 transactions. The proposed model achieves impressive accuracy rates of 99.66%, 99.73%, 98.56%, and 99.79%, and a perfect 100% for the DT, RF, KNN, MLP and ENS models, respectively. The hybrid ensemble model outperforms existing works, establishing a new benchmark for detecting fraudulent transactions in high-frequency scenarios. The results highlight the effectiveness and reliability of our approach, demonstrating superior performance metrics and showcasing its exceptional potential for real-world fraud detection applications.  ( 3 min )
    Representation Learning for Frequent Subgraph Mining
    arXiv:2402.14367v1 Announce Type: new Abstract: Identifying frequent subgraphs, also called network motifs, is crucial in analyzing and predicting properties of real-world networks. However, finding large commonly-occurring motifs remains a challenging problem not only due to its NP-hard subroutine of subgraph counting, but also the exponential growth of the number of possible subgraphs patterns. Here we present Subgraph Pattern Miner (SPMiner), a novel neural approach for approximately finding frequent subgraphs in a large target graph. SPMiner combines graph neural networks, order embedding space, and an efficient search strategy to identify network subgraph patterns that appear most frequently in the target graph. SPMiner first decomposes the target graph into many overlapping subgraphs and then encodes each subgraph into an order embedding space. SPMiner then uses a monotonic walk in the order embedding space to identify frequent motifs. Compared to existing approaches and possible neural alternatives, SPMiner is more accurate, faster, and more scalable. For 5- and 6-node motifs, we show that SPMiner can almost perfectly identify the most frequent motifs while being 100x faster than exact enumeration methods. In addition, SPMiner can also reliably identify frequent 10-node motifs, which is well beyond the size limit of exact enumeration approaches. And last, we show that SPMiner can find large up to 20 node motifs with 10-100x higher frequency than those found by current approximate methods.  ( 3 min )
    Generative Adversarial Network with Soft-Dynamic Time Warping and Parallel Reconstruction for Energy Time Series Anomaly Detection
    arXiv:2402.14384v1 Announce Type: new Abstract: In this paper, we employ a 1D deep convolutional generative adversarial network (DCGAN) for sequential anomaly detection in energy time series data. Anomaly detection involves gradient descent to reconstruct energy sub-sequences, identifying the noise vector that closely generates them through the generator network. Soft-DTW is used as a differentiable alternative for the reconstruction loss and is found to be superior to Euclidean distance. Combining reconstruction loss and the latent space's prior probability distribution serves as the anomaly score. Our novel method accelerates detection by parallel computation of reconstruction of multiple points and shows promise in identifying anomalous energy consumption in buildings, as evidenced by performing experiments on hourly energy time series from 15 buildings.  ( 2 min )
    OpenTab: Advancing Large Language Models as Open-domain Table Reasoners
    arXiv:2402.14361v1 Announce Type: new Abstract: Large Language Models (LLMs) trained on large volumes of data excel at various natural language tasks, but they cannot handle tasks requiring knowledge that has not been trained on previously. One solution is to use a retriever that fetches relevant information to expand LLM's knowledge scope. However, existing textual-oriented retrieval-based LLMs are not ideal on structured table data due to diversified data modalities and large table sizes. In this work, we propose OpenTab, an open-domain table reasoning framework powered by LLMs. Overall, OpenTab leverages table retriever to fetch relevant tables and then generates SQL programs to parse the retrieved tables efficiently. Utilizing the intermediate data derived from the SQL executions, it conducts grounded inference to produce accurate response. Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and closed-domain settings, achieving up to 21.5% higher accuracy. We further run ablation studies to validate the efficacy of our proposed designs of the system.  ( 2 min )
    Dependable Distributed Training of Compressed Machine Learning Models
    arXiv:2402.14346v1 Announce Type: new Abstract: The existing work on the distributed training of machine learning (ML) models has consistently overlooked the distribution of the achieved learning quality, focusing instead on its average value. This leads to a poor dependability}of the resulting ML models, whose performance may be much worse than expected. We fill this gap by proposing DepL, a framework for dependable learning orchestration, able to make high-quality, efficient decisions on (i) the data to leverage for learning, (ii) the models to use and when to switch among them, and (iii) the clusters of nodes, and the resources thereof, to exploit. For concreteness, we consider as possible available models a full DNN and its compressed versions. Unlike previous studies, DepL guarantees that a target learning quality is reached with a target probability, while keeping the training cost at a minimum. We prove that DepL has constant competitive ratio and polynomial complexity, and show that it outperforms the state-of-the-art by over 27% and closely matches the optimum.  ( 2 min )
    HyperFast: Instant Classification for Tabular Data
    arXiv:2402.14335v1 Announce Type: new Abstract: Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast.  ( 2 min )
    From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection
    arXiv:2402.14332v1 Announce Type: new Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.  ( 2 min )
    TinyLLaVA: A Framework of Small-scale Large Multimodal Models
    arXiv:2402.14289v1 Announce Type: new Abstract: We present the TinyLLaVA framework that provides a unified perspective in designing and analyzing the small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data and training recipes. Our extensive experiments showed that better quality of data combined with better training recipes, smaller LMMs can consistently achieve on-par performances compared to bigger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance against existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research in terms of data scaling, training setups and model selections. Our model weights and codes will be made public.  ( 2 min )
    High-arity PAC learning via exchangeability
    arXiv:2402.14294v1 Announce Type: new Abstract: We develop a theory of high-arity PAC learning, which is statistical learning in the presence of "structured correlation". In this theory, hypotheses are either graphs, hypergraphs or, more generally, structures in finite relational languages, and i.i.d. sampling is replaced by sampling an induced substructure, producing an exchangeable distribution. We prove a high-arity version of the fundamental theorem of statistical learning by characterizing high-arity (agnostic) PAC learnability in terms of finiteness of a purely combinatorial dimension and in terms of an appropriate version of uniform convergence.  ( 2 min )
    A hierarchical decomposition for explaining ML performance discrepancies
    arXiv:2402.14254v1 Announce Type: new Abstract: Machine learning (ML) algorithms can often differ in performance across domains. Understanding $\textit{why}$ their performance differs is crucial for determining what types of interventions (e.g., algorithmic or operational) are most effective at closing the performance gaps. Existing methods focus on $\textit{aggregate decompositions}$ of the total performance gap into the impact of a shift in the distribution of features $p(X)$ versus the impact of a shift in the conditional distribution of the outcome $p(Y|X)$; however, such coarse explanations offer only a few options for how one can close the performance gap. $\textit{Detailed variable-level decompositions}$ that quantify the importance of each variable to each term in the aggregate decomposition can provide a much deeper understanding and suggest much more targeted interventions. However, existing methods assume knowledge of the full causal graph or make strong parametric assumptions. We introduce a nonparametric hierarchical framework that provides both aggregate and detailed decompositions for explaining why the performance of an ML algorithm differs across domains, without requiring causal knowledge. We derive debiased, computationally-efficient estimators, and statistical inference procedures for asymptotically valid confidence intervals.  ( 2 min )
    Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization
    arXiv:2402.14270v1 Announce Type: new Abstract: In the rapidly advancing arena of large language models (LLMs), a key challenge is to enhance their capabilities amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets, with a specific focus on selective retention of samples that incur moderately high losses. These samples are deemed informative and beneficial for model refinement, contrasting with the highest-loss samples, which would be discarded due to their correlation with data noise and complexity. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization (IR-DRO). IR-DRO is designed to dynamically prioritize the training focus on informative samples through an instance reweighting mechanism, streamlined by a closed-form solution for straightforward integration into established training protocols. Through rigorous experimentation with various models and datasets, our findings indicate that our sample-targeted methods significantly improve LLM performance across multiple benchmarks, in both continual pre-training and instruction tuning scenarios. Our codes are available at https://github.com/VITA-Group/HardFocusTraining.  ( 2 min )
    Reconstruction-Based Anomaly Localization via Knowledge-Informed Self-Training
    arXiv:2402.14246v1 Announce Type: new Abstract: Anomaly localization, which involves localizing anomalous regions within images, is a significant industrial task. Reconstruction-based methods are widely adopted for anomaly localization because of their low complexity and high interpretability. Most existing reconstruction-based methods only use normal samples to construct model. If anomalous samples are appropriately utilized in the process of anomaly localization, the localization performance can be improved. However, usually only weakly labeled anomalous samples are available, which limits the improvement. In many cases, we can obtain some knowledge of anomalies summarized by domain experts. Taking advantage of such knowledge can help us better utilize the anomalous samples and thus further improve the localization performance. In this paper, we propose a novel reconstruction-based method named knowledge-informed self-training (KIST) which integrates knowledge into reconstruction model through self-training. Specifically, KIST utilizes weakly labeled anomalous samples in addition to the normal ones and exploits knowledge to yield pixel-level pseudo-labels of the anomalous samples. Based on the pseudo labels, a novel loss which promotes the reconstruction of normal pixels while suppressing the reconstruction of anomalous pixels is used. We conduct experiments on different datasets and demonstrate the advantages of KIST over the existing reconstruction-based methods.  ( 2 min )
    Automated Design and Optimization of Distributed Filtering Circuits via Reinforcement Learning
    arXiv:2402.14236v1 Announce Type: new Abstract: Designing distributed filtering circuits (DFCs) is complex and time-consuming, with the circuit performance relying heavily on the expertise and experience of electronics engineers. However, manual design methods tend to have exceedingly low-efficiency. This study proposes a novel end-to-end automated method for fabricating circuits to improve the design of DFCs. The proposed method harnesses reinforcement learning (RL) algorithms, eliminating the dependence on the design experience of engineers. Thus, it significantly reduces the subjectivity and constraints associated with circuit design. The experimental findings demonstrate clear improvements in both design efficiency and quality when comparing the proposed method with traditional engineer-driven methods. In particular, the proposed method achieves superior performance when designing complex or rapidly evolving DFCs. Furthermore, compared to existing circuit automation design techniques, the proposed method demonstrates superior design efficiency, highlighting the substantial potential of RL in circuit design automation.  ( 2 min )
    COPR: Continual Human Preference Learning via Optimal Policy Regularization
    arXiv:2402.14228v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences. Given the evolving nature of human preferences, continual alignment becomes more crucial and practical in comparison to traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in helpless or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from the optimal policy theory. COPR utilizes a sampling distribution as a demonstration and regularization constraints for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the current policy based on the historically optimal policy, which prevents CF and avoids over-emphasizing unbalanced objectives. We also provide formal proof for the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark, in terms of reward-based, GPT-4 evaluations and human assessment. Furthermore, we validate the robustness of COPR under various CL settings, including different backbones, replay memory sizes, and learning orders.  ( 2 min )
    Quaternion recurrent neural network with real-time recurrent learning and maximum correntropy criterion
    arXiv:2402.14227v1 Announce Type: new Abstract: We develop a robust quaternion recurrent neural network (QRNN) for real-time processing of 3D and 4D data with outliers. This is achieved by combining the real-time recurrent learning (RTRL) algorithm and the maximum correntropy criterion (MCC) as a loss function. While both the mean square error and maximum correntropy criterion are viable cost functions, it is shown that the non-quadratic maximum correntropy loss function is less sensitive to outliers, making it suitable for applications with multidimensional noisy or uncertain data. Both algorithms are derived based on the novel generalised HR (GHR) calculus, which allows for the differentiation of real functions of quaternion variables and offers the product and chain rules, thus enabling elegant and compact derivations. Simulation results in the context of motion prediction of chest internal markers for lung cancer radiotherapy, which includes regular and irregular breathing sequences, support the analysis.  ( 2 min )
    Moonwalk: Inverse-Forward Differentiation
    arXiv:2402.14212v1 Announce Type: new Abstract: Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of na\"ive forward, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation and using significantly less memory.  ( 2 min )
    Estimating Unknown Population Sizes Using the Hypergeometric Distribution
    arXiv:2402.14220v1 Announce Type: new Abstract: The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable, such as with collaborative filtering, using the variational autoencoder framework. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in terms of accuracy of population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.  ( 2 min )
    BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay
    arXiv:2402.14194v1 Announce Type: new Abstract: Imitation learning learns a policy from demonstrations without requiring hand-designed reward functions. In many robotic tasks, such as autonomous racing, imitated policies must model complex environment dynamics and human decision-making. Sequence modeling is highly effective in capturing intricate patterns of motion sequences but struggles to adapt to new environments or distribution shifts that are common in real-world robotics tasks. In contrast, Adversarial Imitation Learning (AIL) can mitigate this effect, but struggles with sample inefficiency and handling complex motion patterns. Thus, we propose BeTAIL: Behavior Transformer Adversarial Imitation Learning, which combines a Behavior Transformer (BeT) policy from human demonstrations with online AIL. BeTAIL adds an AIL residual policy to the BeT policy to model the sequential decision-making process of human experts and correct for out-of-distribution states or shifts in environment dynamics. We test BeTAIL on three challenges with expert-level demonstrations of real human gameplay in Gran Turismo Sport. Our proposed residual BeTAIL reduces environment interactions and improves racing performance and stability, even when the BeT is pretrained on different tracks than downstream learning. Videos and code available at: https://sites.google.com/berkeley.edu/BeTAIL/home.  ( 2 min )
    Comparing Graph Transformers via Positional Encodings
    arXiv:2402.14202v1 Announce Type: new Abstract: The distinguishing power of graph transformers is closely tied to the choice of positional encoding: features used to augment the base transformer with information about the graph. There are two primary types of positional encoding: absolute positional encodings (APEs) and relative positional encodings (RPEs). APEs assign features to each node and are given as input to the transformer. RPEs instead assign a feature to each pair of nodes, e.g., graph distance, and are used to augment the attention block. A priori, it is unclear which method is better for maximizing the power of the resulting graph transformer. In this paper, we aim to understand the relationship between these different types of positional encodings. Interestingly, we show that graph transformers using APEs and RPEs are equivalent in terms of distinguishing power. In particular, we demonstrate how to interchange APEs and RPEs while maintaining their distinguishing power in terms of graph transformers. Based on our theoretical results, we provide a study on several APEs and RPEs (including the resistance distance and the recently introduced stable and expressive positional encoding (SPE)) and compare their distinguishing power in terms of transformers. We believe our work will help navigate the huge number of choices of positional encoding and will provide guidance on the future design of positional encodings for graph transformers.  ( 2 min )
    Linear Transformers are Versatile In-Context Learners
    arXiv:2402.14180v1 Announce Type: new Abstract: Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that any linear transformer maintains an implicit linear model and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.  ( 2 min )
    Diversity-Aware Ensembling of Language Models Based on Topological Data Analysis
    arXiv:2402.14184v1 Announce Type: new Abstract: Ensembles are important tools for improving the performance of machine learning models. In cases related to natural language processing, ensembles boost the performance of a method due to multiple large models available in open source. However, existing approaches mostly rely on simple averaging of predictions by ensembles with equal weights for each model, ignoring differences in the quality and conformity of models. We propose to estimate weights for ensembles of NLP models using not only knowledge of their individual performance but also their similarity to each other. By adopting distance measures based on Topological Data Analysis (TDA), we improve our ensemble. The quality improves for both text classification accuracy and relevant uncertainty estimation.  ( 2 min )
    Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
    arXiv:2402.14160v1 Announce Type: new Abstract: Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this method by establishing a draft-token tree, achieving superior performance over a single-sequence speculative decoding. However, those works independently generate tokens at each level of the tree, not leveraging the tree's entire diversifiability. Besides, their empirical superiority has been shown for fixed length of sequences, implicitly granting more computational resource to LLM for the tree-based methods. None of the existing works has conducted empirical studies with fixed target computational budgets despite its importance to resource-bounded devices. We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement and maximizes the diversity of the tree. During RSD's drafting, the tree is built by either Gumbel-Top-$k$ trick that draws tokens without replacement in parallel or Stochastic Beam Search that samples sequences without replacement while early-truncating unlikely draft sequences and reducing the computational cost of LLM. We empirically evaluate RSD with Llama 2 and OPT models, showing that RSD outperforms the baseline methods, consistently for fixed draft sequence length and in most cases for fixed computational budgets at LLM.  ( 2 min )
    A Temporal Bias Correction using a Machine Learning Attention model
    arXiv:2402.14169v1 Announce Type: new Abstract: Climate models are biased with respect to real world observations and usually need to be calibrated prior to impact studies. The suite of statistical methods that enable such calibrations is called bias correction (BC). However, current BC methods struggle to adjust for temporal biases, because they disregard the dependence between consecutive time-points. As a result, climate statistics with long-range temporal properties, such as heatwave duration and frequency, cannot be corrected accurately, making it more difficult to produce reliable impact studies on such climate statistics. In this paper, we offer a novel BC methodology to correct for temporal biases. This is made possible by i) re-thinking BC as a probability model rather than an algorithmic procedure, and ii) adapting state-of-the-art machine-learning (ML) probabilistic attention models to fit the BC task. With a case study of heatwave duration statistics in Abuja, Nigeria, and Tokyo, Japan, we show striking results compared to current climate model outputs and alternative BC methods.  ( 2 min )
    DeiSAM: Segment Anything with Deictic Prompting
    arXiv:2402.14123v1 Announce Type: new Abstract: Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., referring to something depending on the context such as "The object that is on the desk and behind the cup.". However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM -- a combination of large pre-trained neural networks with differentiable logic reasoners -- for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.  ( 2 min )
    NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning
    arXiv:2402.14139v1 Announce Type: new Abstract: Efficient on-device convolutional neural network (CNN) training in resource-constrained mobile and edge environments is an open challenge. Backpropagation is the standard approach adopted, but it is GPU memory intensive due to its strong inter-layer dependencies that demand intermediate activations across the entire CNN model to be retained in GPU memory. This necessitates smaller batch sizes to make training possible within the available GPU memory budget, but in turn, results in a substantially high and impractical training time. We introduce NeuroFlux, a novel CNN training system tailored for memory-constrained scenarios. We develop two novel opportunities: firstly, adaptive auxiliary networks that employ a variable number of filters to reduce GPU memory usage, and secondly, block-specific adaptive batch sizes, which not only cater to the GPU memory constraints but also accelerate the training process. NeuroFlux segments the CNNs into blocks based on GPU memory usage and further attaches an auxiliary network to each layer in these blocks. This disrupts the typical layer dependencies under a new training paradigm - 'adaptive local learning'. Moreover, NeuroFlux adeptly caches intermediate activations, eliminating redundant forward passes over previously trained blocks, further accelerating the training process. The results are twofold when compared to Backpropagation: on various hardware platforms, NeuroFlux demonstrates training speed-ups of 2.3$\times$ to 6.1$\times$ under stringent GPU memory budgets, and NeuroFlux generates streamlined models that have 10.9$\times$ to 29.4$\times$ fewer parameters without sacrificing accuracy.  ( 2 min )
    Intriguing Properties of Modern GANs
    arXiv:2402.14098v1 Announce Type: new Abstract: Modern GANs achieve remarkable performance in terms of generating realistic and diverse samples. This has led many to believe that ``GANs capture the training data manifold''. In this work we show that this interpretation is wrong. We empirically show that the manifold learned by modern GANs does not fit the training distribution: specifically the manifold does not pass through the training examples and passes closer to out-of-distribution images than to in-distribution images. We also investigate the distribution over images implied by the prior over the latent codes and study whether modern GANs learn a density that approximates the training distribution. Surprisingly, we find that the learned density is very far from the data distribution and that GANs tend to assign higher density to out-of-distribution images. Finally, we demonstrate that the set of images used to train modern GANs are often not part of the typical set described by the GANs' distribution.  ( 2 min )
    Efficient Normalized Conformal Prediction and Uncertainty Quantification for Anti-Cancer Drug Sensitivity Prediction with Deep Regression Forests
    arXiv:2402.14080v1 Announce Type: new Abstract: Deep learning models are being adopted and applied on various critical decision-making tasks, yet they are trained to provide point predictions without providing degrees of confidence. The trustworthiness of deep learning models can be increased if paired with uncertainty estimations. Conformal Prediction has emerged as a promising method to pair machine learning models with prediction intervals, allowing for a view of the model's uncertainty. However, popular uncertainty estimation methods for conformal prediction fail to provide heteroskedastic intervals that are equally accurate for all samples. In this paper, we propose a method to estimate the uncertainty of each sample by calculating the variance obtained from a Deep Regression Forest. We show that the deep regression forest variance improves the efficiency and coverage of normalized inductive conformal prediction on a drug response prediction task.  ( 2 min )
    Robust Learning of Noisy Time Series Collections Using Stochastic Process Models with Motion Codes
    arXiv:2402.14081v1 Announce Type: new Abstract: While time series classification and forecasting problems have been extensively studied, the cases of noisy time series data with arbitrary time sequence lengths have remained challenging. Each time series instance can be thought of as a sample realization of a noisy dynamical model, which is characterized by a continuous stochastic process. For many applications, the data are mixed and consist of several types of noisy time series sequences modeled by multiple stochastic processes, making the forecasting and classification tasks even more challenging. Instead of regressing data naively and individually to each time series type, we take a latent variable model approach using a mixtured Gaussian processes with learned spectral kernels. More specifically, we auto-assign each type of noisy time series data a signature vector called its motion code. Then, conditioned on each assigned motion code, we infer a sparse approximation of the corresponding time series using the concept of the most informative timestamps. Our unmixing classification approach involves maximizing the likelihood across all the mixed noisy time series sequences of varying lengths. This stochastic approach allows us to learn not only within a single type of noisy time series data but also across many underlying stochastic processes, giving us a way to learn multiple dynamical models in an integrated and robust manner. The different learned latent stochastic models allow us to generate specific sub-type forecasting. We provide several quantitative comparisons demonstrating the performance of our approach.  ( 3 min )
    PolyNet: Learning Diverse Solution Strategies for Neural Combinatorial Optimization
    arXiv:2402.14048v1 Announce Type: new Abstract: Reinforcement learning-based methods for constructing solutions to combinatorial optimization problems are rapidly approaching the performance of human-designed algorithms. To further narrow the gap, learning-based approaches must efficiently explore the solution space during the search process. Recent approaches artificially increase exploration by enforcing diverse solution generation through handcrafted rules, however, these rules can impair solution quality and are difficult to design for more complex problems. In this paper, we introduce PolyNet, an approach for improving exploration of the solution space by learning complementary solution strategies. In contrast to other works, PolyNet uses only a single-decoder and a training schema that does not enforce diverse solution generation through handcrafted rules. We evaluate PolyNet on four combinatorial optimization problems and observe that the implicit diversity mechanism allows PolyNet to find better solutions than approaches the explicitly enforce diverse solution generation.  ( 2 min )
    Generative Adversarial Models for Extreme Downscaling of Climate Datasets
    arXiv:2402.14049v1 Announce Type: new Abstract: Addressing the challenges of climate change requires accurate and high-resolution mapping of climate and weather variables. However, many existing climate datasets, such as the gridded outputs of the state-of-the-art numerical climate models (e.g., general circulation models), are only available at very coarse spatial resolutions due to the model complexity and extremely high computational demand. Deep-learning-based methods, particularly generative adversarial networks (GANs) and their variants, have proved effective for refining natural images, and have shown great promise in improving scientific datasets. In this paper, we describe a conditional GAN-based geospatial downscaling method for extreme downscaling of gridded climate datasets. Compared to most existing methods, the method can generate high-resolution accurate climate datasets from very low-resolution inputs. More importantly, the method explicitly considers the uncertainty inherent to the downscaling process that tends to be ignored in existing methods. Given an input, the method can produce a multitude of plausible high-resolution samples instead of one single deterministic result. These samples allow for an empirical exploration and inferences of model uncertainty and robustness. With a case study of gridded climate datasets (wind velocity and solar irradiance), we demonstrate the performances of the framework in downscaling tasks with very high scaling factors (up to $64\times$) and highlight the advantages of the framework with a comprehensive comparison with commonly used downscaling methods, including area-to-point (ATP) kriging, deep image prior (DIP), enhanced deep super-resolution network (EDSR), enhanced super-resolution generative adversarial networks (ESRGAN), and physics-informed resolution-enhancing GAN (PhIRE GAN).  ( 3 min )
    Simple and Effective Transfer Learning for Neuro-Symbolic Integration
    arXiv:2402.14047v1 Announce Type: new Abstract: Deep Learning (DL) techniques have achieved remarkable successes in recent years. However, their ability to generalize and execute reasoning tasks remains a challenge. A potential solution to this issue is Neuro-Symbolic Integration (NeSy), where neural approaches are combined with symbolic reasoning. Most of these methods exploit a neural network to map perceptions to symbols and a logical reasoner to predict the output of the downstream task. These methods exhibit superior generalization capacity compared to fully neural architectures. However, they suffer from several issues, including slow convergence, learning difficulties with complex perception tasks, and convergence to local minima. This paper proposes a simple yet effective method to ameliorate these problems. The key idea involves pretraining a neural model on the downstream task. Then, a NeSy model is trained on the same task via transfer learning, where the weights of the perceptual part are injected from the pretrained network. The key observation of our work is that the neural network fails to generalize only at the level of the symbolic part while being perfectly capable of learning the mapping from perceptions to symbols. We have tested our training strategy on various SOTA NeSy methods and datasets, demonstrating consistent improvements in the aforementioned problems.  ( 2 min )
    Wisdom of Committee: Distilling from Foundation Model to SpecializedApplication Model
    arXiv:2402.14035v1 Announce Type: new Abstract: Recent advancements in foundation models have yielded impressive performance across a wide range of tasks. Meanwhile, for specific applications, practitioners have been developing specialized application models. To enjoy the benefits of both kinds of models, one natural path is to transfer the knowledge in foundation models into specialized application models, which are generally more efficient for serving. Techniques from knowledge distillation may be applied here, where the application model learns to mimic the foundation model. However, specialized application models and foundation models have substantial gaps in capacity, employing distinct architectures, using different input features from different modalities, and being optimized on different distributions. These differences in model characteristics lead to significant challenges for distillation methods. In this work, we propose creating a teaching committee comprising both foundation model teachers and complementary teachers. Complementary teachers possess model characteristics akin to the student's, aiming to bridge the gap between the foundation model and specialized application models for a smoother knowledge transfer. Further, to accommodate the dissimilarity among the teachers in the committee, we introduce DiverseDistill, which allows the student to understand the expertise of each teacher and extract task knowledge. Our evaluations demonstrate that adding complementary teachers enhances student performance. Finally, DiverseDistill consistently outperforms baseline distillation methods, regardless of the teacher choices, resulting in significantly improved student performance.  ( 2 min )
    VN Network: Embedding Newly Emerging Entities with Virtual Neighbors
    arXiv:2402.14033v1 Announce Type: new Abstract: Embedding entities and relations into continuous vector spaces has attracted a surge of interest in recent years. Most embedding methods assume that all test entities are available during training, which makes it time-consuming to retrain embeddings for newly emerging entities. To address this issue, recent works apply the graph neural network on the existing neighbors of the unseen entities. In this paper, we propose a novel framework, namely Virtual Neighbor (VN) network, to address three key challenges. Firstly, to reduce the neighbor sparsity problem, we introduce the concept of the virtual neighbors inferred by rules. And we assign soft labels to these neighbors by solving a rule-constrained problem, rather than simply regarding them as unquestionably true. Secondly, many existing methods only use one-hop or two-hop neighbors for aggregation and ignore the distant information that may be helpful. Instead, we identify both logic and symmetric path rules to capture complex patterns. Finally, instead of one-time injection of rules, we employ an iterative learning scheme between the embedding method and virtual neighbor prediction to capture the interactions within. Experimental results on two knowledge graph completion tasks demonstrate that our VN network significantly outperforms state-of-the-art baselines. Furthermore, results on Subject/Object-R show that our proposed VN network is highly robust to the neighbor sparsity problem.  ( 3 min )
    Partial Search in a Frozen Network is Enough to Find a Strong Lottery Ticket
    arXiv:2402.14029v1 Announce Type: new Abstract: Randomly initialized dense networks contain subnetworks that achieve high accuracy without weight learning -- strong lottery tickets (SLTs). Recently, Gadhikar et al. (2023) demonstrated theoretically and experimentally that SLTs can also be found within a randomly pruned source network, thus reducing the SLT search space. However, this limits the search to SLTs that are even sparser than the source, leading to worse accuracy due to unintentionally high sparsity. This paper proposes a method that reduces the SLT search space by an arbitrary ratio that is independent of the desired SLT sparsity. A random subset of the initial weights is excluded from the search space by freezing it -- i.e., by either permanently pruning them or locking them as a fixed part of the SLT. Indeed, the SLT existence in such a reduced search space is theoretically guaranteed by our subset-sum approximation with randomly frozen variables. In addition to reducing search space, the random freezing pattern can also be exploited to reduce model size in inference. Furthermore, experimental results show that the proposed method finds SLTs with better accuracy and model size trade-off than the SLTs obtained from dense or randomly pruned source networks. In particular, the SLT found in a frozen graph neural network achieves higher accuracy than its weight trained counterpart while reducing model size by $40.3\times$.  ( 3 min )
    Learning causation event conjunction sequences
    arXiv:2402.14027v1 Announce Type: new Abstract: This is an examination of some methods that learn causations in event sequences. A causation is defined as a conjunction of one or more cause events occurring in an arbitrary order, with possible intervening non-causal events, that lead to an effect. The methods include recurrent and non-recurrent artificial neural networks (ANNs), as well as a histogram-based algorithm. An attention recurrent ANN performed the best of the ANNs, while the histogram algorithm was significantly superior to all the ANNs.  ( 2 min )
  • Open

    Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression
    arXiv:2402.14103v1 Announce Type: cross Abstract: We study computational-statistical gaps for improper learning in sparse linear regression. More specifically, given $n$ samples from a $k$-sparse linear model in dimension $d$, we ask what is the minimum sample complexity to efficiently (in time polynomial in $d$, $k$, and $n$) find a potentially dense estimate for the regression vector that achieves non-trivial prediction error on the $n$ samples. Information-theoretically this can be achieved using $\Theta(k \log (d/k))$ samples. Yet, despite its prominence in the literature, there is no polynomial-time algorithm known to achieve the same guarantees using less than $\Theta(d)$ samples without additional restrictions on the model. Similarly, existing hardness results are either restricted to the proper setting, in which the estimate must be sparse as well, or only apply to specific algorithms. We give evidence that efficient algorithms for this task require at least (roughly) $\Omega(k^2)$ samples. In particular, we show that an improper learning algorithm for sparse linear regression can be used to solve sparse PCA problems (with a negative spike) in their Wishart form, in regimes in which efficient algorithms are widely believed to require at least $\Omega(k^2)$ samples. We complement our reduction with low-degree and statistical query lower bounds for the sparse PCA problems from which we reduce. Our hardness results apply to the (correlated) random design setting in which the covariates are drawn i.i.d. from a mean-zero Gaussian distribution with unknown covariance.  ( 3 min )
    Bootstrap aggregation and confidence measures to improve time series causal discovery
    arXiv:2306.08946v2 Announce Type: replace-cross Abstract: Learning causal graphs from multivariate time series is a ubiquitous challenge in all application domains dealing with time-dependent systems, such as in Earth sciences, biology, or engineering, to name a few. Recent developments for this causal discovery learning task have shown considerable skill, notably the specific time-series adaptations of the popular conditional independence-based learning framework. However, uncertainty estimation is challenging for conditional independence-based methods. Here, we introduce a novel bootstrap approach designed for time series causal discovery that preserves the temporal dependencies and lag structure. It can be combined with a range of time series causal discovery methods and provides a measure of confidence for the links of the time series graphs. Furthermore, next to confidence estimation, an aggregation, also called bagging, of the bootstrapped graphs by majority voting results in bagged causal discovery methods. In this work, we combine this approach with the state-of-the-art conditional-independence-based algorithm PCMCI+. With extensive numerical experiments we empirically demonstrate that, in addition to providing confidence measures for links, Bagged-PCMCI+ improves in precision and recall as compared to its base algorithm PCMCI+, at the cost of higher computational demands. These statistical performance improvements are especially pronounced in the more challenging settings (short time sample size, large number of variables, high autocorrelation). Our bootstrap approach can also be combined with other time series causal discovery algorithms and can be of considerable use in many real-world applications.  ( 3 min )
    Meaningful Causal Aggregation and Paradoxical Confounding
    arXiv:2304.11625v3 Announce Type: replace-cross Abstract: In aggregated variables the impact of interventions is typically ill-defined because different micro-realizations of the same macro-intervention can result in different changes of downstream macro-variables. We show that this ill-definedness of causality on aggregated variables can turn unconfounded causal relations into confounded ones and vice versa, depending on the respective micro-realization. We argue that it is practically infeasible to only use aggregated causal systems when we are free from this ill-definedness. Instead, we need to accept that macro causal relations are typically defined only with reference to the micro states. On the positive side, we show that cause-effect relations can be aggregated when the macro interventions are such that the distribution of micro states is the same as in the observational distribution; we term this natural macro interventions. We also discuss generalizations of this observation.  ( 2 min )
    From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection
    arXiv:2402.14332v1 Announce Type: cross Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.  ( 2 min )
    Improving Adaptive Online Learning Using Refined Discretization
    arXiv:2309.16044v2 Announce Type: replace-cross Abstract: We study unconstrained Online Linear Optimization with Lipschitz losses. Motivated by the pursuit of instance optimality, we propose a new algorithm that simultaneously achieves ($i$) the AdaGrad-style second order gradient adaptivity; and ($ii$) the comparator norm adaptivity also known as "parameter freeness" in the literature. In particular, - our algorithm does not employ the impractical doubling trick, and does not require an a priori estimate of the time-uniform Lipschitz constant; - the associated regret bound has the optimal $O(\sqrt{V_T})$ dependence on the gradient variance $V_T$, without the typical logarithmic multiplicative factor; - the leading constant in the regret bound is "almost" optimal. Central to these results is a continuous time approach to online learning. We first show that the aimed simultaneous adaptivity can be achieved fairly easily in a continuous time analogue of the problem, where the environment is modeled by an arbitrary continuous semimartingale. Then, our key innovation is a new discretization argument that preserves such adaptivity in the discrete time adversarial setting. This refines a non-gradient-adaptive discretization argument from (Harvey et al., 2023), both algorithmically and analytically, which could be of independent interest.  ( 2 min )
    Credal Bayesian Deep Learning
    arXiv:2302.09656v4 Announce Type: replace-cross Abstract: Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian Neural Networks (BNNs) allow for uncertainty in the predictions to be assessed, different sources of uncertainty are indistinguishable. We present Credal Bayesian Deep Learning (CBDL). Heuristically, CBDL allows to train an (uncountably) infinite ensemble of BNNs, using only finitely many elements. This is possible thanks to prior and likelihood finitely generated credal sets (FGCSs), a concept from the imprecise probability literature. Intuitively, convex combinations of a finite collection of prior-likelihood pairs are able to represent infinitely many such pairs. After training, CBDL outputs a set of posteriors on the parameters of the neural network. At inference time, such posterior set is used to derive a set of predictive distributions that is in turn utilized to distinguish between aleatoric and epistemic uncertainties, and to quantify them. The predictive set also produces either (i) a collection of outputs enjoying desirable probabilistic guarantees, or (ii) the single output that is deemed the best, that is, the one having the highest predictive lower probability -- another imprecise-probabilistic concept. CBDL is more robust than single BNNs to prior and likelihood misspecification, and to distribution shift. We show that CBDL is better at quantifying and disentangling different types of uncertainties than single BNNs, ensemble of BNNs, and Bayesian Model Averaging. In addition, we apply CBDL to two case studies to demonstrate its downstream tasks capabilities: one, for motion prediction in autonomous driving scenarios, and two, to model blood glucose and insulin dynamics for artificial pancreas control. We show that CBDL performs better when compared to an ensemble of BNNs baseline.  ( 3 min )
    Deep Learning for Survival Analysis: A Review
    arXiv:2305.14961v4 Announce Type: replace Abstract: The influx of deep learning (DL) techniques into the field of survival analysis in recent years has led to substantial methodological progress; for instance, learning from unstructured or high-dimensional data such as images, text or omics data. In this work, we conduct a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival- and DL-related attributes. In summary, the reviewed methods often address only a small subset of tasks relevant to time-to-event data - e.g., single-risk right-censored data - and neglect to incorporate more complex settings. Our findings are summarized in an editable, open-source, interactive table: https://survival-org.github.io/DL4Survival. As this research area is advancing rapidly, we encourage community contribution in order to keep this database up to date.  ( 2 min )
    Sparse Linear Regression and Lattice Problems
    arXiv:2402.14645v1 Announce Type: cross Abstract: Sparse linear regression (SLR) is a well-studied problem in statistics where one is given a design matrix $X\in\mathbb{R}^{m\times n}$ and a response vector $y=X\theta^*+w$ for a $k$-sparse vector $\theta^*$ (that is, $\|\theta^*\|_0\leq k$) and small, arbitrary noise $w$, and the goal is to find a $k$-sparse $\widehat{\theta} \in \mathbb{R}^n$ that minimizes the mean squared prediction error $\frac{1}{m}\|X\widehat{\theta}-X\theta^*\|^2_2$. While $\ell_1$-relaxation methods such as basis pursuit, Lasso, and the Dantzig selector solve SLR when the design matrix is well-conditioned, no general algorithm is known, nor is there any formal evidence of hardness in an average-case setting with respect to all efficient algorithms. We give evidence of average-case hardness of SLR w.r.t. all efficient algorithms assuming the worst-case hardness of lattice problems. Specifically, we give an instance-by-instance reduction from a variant of the bounded distance decoding (BDD) problem on lattices to SLR, where the condition number of the lattice basis that defines the BDD instance is directly related to the restricted eigenvalue condition of the design matrix, which characterizes some of the classical statistical-computational gaps for sparse linear regression. Also, by appealing to worst-case to average-case reductions from the world of lattices, this shows hardness for a distribution of SLR instances; while the design matrices are ill-conditioned, the resulting SLR instances are in the identifiable regime. Furthermore, for well-conditioned (essentially) isotropic Gaussian design matrices, where Lasso is known to behave well in the identifiable regime, we show hardness of outputting any good solution in the unidentifiable regime where there are many solutions, assuming the worst-case hardness of standard and well-studied lattice problems.  ( 3 min )
    latrend: A Framework for Clustering Longitudinal Data
    arXiv:2402.14621v1 Announce Type: cross Abstract: Clustering of longitudinal data is used to explore common trends among subjects over time for a numeric measurement of interest. Various R packages have been introduced throughout the years for identifying clusters of longitudinal patterns, summarizing the variability in trajectories between subject in terms of one or more trends. We introduce the R package "latrend" as a framework for the unified application of methods for longitudinal clustering, enabling comparisons between methods with minimal coding. The package also serves as an interface to commonly used packages for clustering longitudinal data, including "dtwclust", "flexmix", "kml", "lcmm", "mclust", "mixAK", and "mixtools". This enables researchers to easily compare different approaches, implementations, and method specifications. Furthermore, researchers can build upon the standard tools provided by the framework to quickly implement new cluster methods, enabling rapid prototyping. We demonstrate the functionality and application of the latrend package on a synthetic dataset based on the therapy adherence patterns of patients with sleep apnea.  ( 2 min )
    Bandits with Abstention under Expert Advice
    arXiv:2402.14585v1 Announce Type: cross Abstract: We study the classic problem of prediction with expert advice under bandit feedback. Our model assumes that one action, corresponding to the learner's abstention from play, has no reward or loss on every trial. We propose the CBA algorithm, which exploits this assumption to obtain reward bounds that can significantly improve those of the classical Exp4 algorithm. We can view our problem as the aggregation of confidence-rated predictors when the learner has the option of abstention from play. Importantly, we are the first to achieve bounds on the expected cumulative reward for general confidence-rated predictors. In the special case of specialists we achieve a novel reward bound, significantly improving previous bounds of SpecialistExp (treating abstention as another action). As an example application, we discuss learning unions of balls in a finite metric space. In this contextual setting, we devise an efficient implementation of CBA, reducing the runtime from quadratic to almost linear in the number of contexts. Preliminary experiments show that CBA improves over existing bandit algorithms.  ( 2 min )
    WindDragon: Enhancing wind power forecasting with Automated Deep Learning
    arXiv:2402.14385v1 Announce Type: cross Abstract: Achieving net zero carbon emissions by 2050 requires the integration of increasing amounts of wind power into power grids. This energy source poses a challenge to system operators due to its variability and uncertainty. Therefore, accurate forecasting of wind power is critical for grid operation and system balancing. This paper presents an innovative approach to short-term (1 to 6 hour horizon) windpower forecasting at a national level. The method leverages Automated Deep Learning combined with Numerical Weather Predictions wind speed maps to accurately forecast wind power.  ( 2 min )
    Global Safe Sequential Learning via Efficient Knowledge Transfer
    arXiv:2402.14402v1 Announce Type: cross Abstract: Sequential learning methods such as active learning and Bayesian optimization select the most informative data to learn about a task. In many medical or engineering applications, the data selection is constrained by a priori unknown safety conditions. A promissing line of safe learning methods utilize Gaussian processes (GPs) to model the safety probability and perform data selection in areas with high safety confidence. However, accurate safety modeling requires prior knowledge or consumes data. In addition, the safety confidence centers around the given observations which leads to local exploration. As transferable source knowledge is often available in safety critical experiments, we propose to consider transfer safe sequential learning to accelerate the learning of safety. We further consider a pre-computation of source components to reduce the additional computational load that is introduced by incorporating source data. In this paper, we theoretically analyze the maximum explorable safe regions of conventional safe learning methods. Furthermore, we empirically demonstrate that our approach 1) learns a task with lower data consumption, 2) globally explores multiple disjoint safe regions under guidance of the source knowledge, and 3) operates with computation comparable to conventional safe learning methods.  ( 2 min )
    HyperFast: Instant Classification for Tabular Data
    arXiv:2402.14335v1 Announce Type: cross Abstract: Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast.  ( 2 min )
    Estimating Unknown Population Sizes Using the Hypergeometric Distribution
    arXiv:2402.14220v1 Announce Type: cross Abstract: The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable, such as with collaborative filtering, using the variational autoencoder framework. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in terms of accuracy of population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.  ( 2 min )
    A hierarchical decomposition for explaining ML performance discrepancies
    arXiv:2402.14254v1 Announce Type: cross Abstract: Machine learning (ML) algorithms can often differ in performance across domains. Understanding $\textit{why}$ their performance differs is crucial for determining what types of interventions (e.g., algorithmic or operational) are most effective at closing the performance gaps. Existing methods focus on $\textit{aggregate decompositions}$ of the total performance gap into the impact of a shift in the distribution of features $p(X)$ versus the impact of a shift in the conditional distribution of the outcome $p(Y|X)$; however, such coarse explanations offer only a few options for how one can close the performance gap. $\textit{Detailed variable-level decompositions}$ that quantify the importance of each variable to each term in the aggregate decomposition can provide a much deeper understanding and suggest much more targeted interventions. However, existing methods assume knowledge of the full causal graph or make strong parametric assumptions. We introduce a nonparametric hierarchical framework that provides both aggregate and detailed decompositions for explaining why the performance of an ML algorithm differs across domains, without requiring causal knowledge. We derive debiased, computationally-efficient estimators, and statistical inference procedures for asymptotically valid confidence intervals.  ( 2 min )
    Revisiting Convergence of AdaGrad with Relaxed Assumptions
    arXiv:2402.13794v1 Announce Type: cross Abstract: In this study, we revisit the convergence of AdaGrad with momentum (covering AdaGrad as a special case) on non-convex smooth optimization problems. We consider a general noise model where the noise magnitude is controlled by the function value gap together with the gradient magnitude. This model encompasses a broad range of noises including bounded noise, sub-Gaussian noise, affine variance noise and the expected smoothness, and it has been shown to be more realistic in many practical applications. Our analysis yields a probabilistic convergence rate which, under the general noise, could reach at (\tilde{\mathcal{O}}(1/\sqrt{T})). This rate does not rely on prior knowledge of problem-parameters and could accelerate to (\tilde{\mathcal{O}}(1/T)) where (T) denotes the total number iterations, when the noise parameters related to the function value gap and noise level are sufficiently small. The convergence rate thus matches the lower rate for stochastic first-order methods over non-convex smooth landscape up to logarithm terms [Arjevani et al., 2023]. We further derive a convergence bound for AdaGrad with mometum, considering the generalized smoothness where the local smoothness is controlled by a first-order function of the gradient norm.  ( 2 min )
    Probability Tools for Sequential Random Projection
    arXiv:2402.14026v1 Announce Type: cross Abstract: We introduce the first probabilistic framework tailored for sequential random projection, an approach rooted in the challenges of sequential decision-making under uncertainty. The analysis is complicated by the sequential dependence and high-dimensional nature of random variables, a byproduct of the adaptive mechanisms inherent in sequential decision processes. Our work features a novel construction of a stopped process, facilitating the analysis of a sequence of concentration events that are interconnected in a sequential manner. By employing the method of mixtures within a self-normalized process, derived from the stopped process, we achieve a desired non-asymptotic probability bound. This bound represents a non-trivial martingale extension of the Johnson-Lindenstrauss (JL) lemma, marking a pioneering contribution to the literature on random projection and sequential analysis.  ( 2 min )
    Efficient Normalized Conformal Prediction and Uncertainty Quantification for Anti-Cancer Drug Sensitivity Prediction with Deep Regression Forests
    arXiv:2402.14080v1 Announce Type: cross Abstract: Deep learning models are being adopted and applied on various critical decision-making tasks, yet they are trained to provide point predictions without providing degrees of confidence. The trustworthiness of deep learning models can be increased if paired with uncertainty estimations. Conformal Prediction has emerged as a promising method to pair machine learning models with prediction intervals, allowing for a view of the model's uncertainty. However, popular uncertainty estimation methods for conformal prediction fail to provide heteroskedastic intervals that are equally accurate for all samples. In this paper, we propose a method to estimate the uncertainty of each sample by calculating the variance obtained from a Deep Regression Forest. We show that the deep regression forest variance improves the efficiency and coverage of normalized inductive conformal prediction on a drug response prediction task.  ( 2 min )
    Batch and match: black-box variational inference with a score-based divergence
    arXiv:2402.14758v1 Announce Type: new Abstract: Most leading implementations of black-box variational inference (BBVI) are based on optimizing a stochastic evidence lower bound (ELBO). But such approaches to BBVI often converge slowly due to the high variance of their gradient estimates. In this work, we propose batch and match (BaM), an alternative approach to BBVI based on a score-based divergence. Notably, this score-based divergence can be optimized by a closed-form proximal update for Gaussian variational families with full covariance matrices. We analyze the convergence of BaM when the target distribution is Gaussian, and we prove that in the limit of infinite batch size the variational parameter updates converge exponentially quickly to the target mean and covariance. We also evaluate the performance of BaM on Gaussian and non-Gaussian target distributions that arise from posterior inference in hierarchical and deep generative models. In these experiments, we find that BaM typically converges in fewer (and sometimes significantly fewer) gradient evaluations than leading implementations of BBVI based on ELBO maximization.  ( 2 min )
    Social Environment Design
    arXiv:2402.14090v1 Announce Type: cross Abstract: Artificial Intelligence (AI) holds promise as a technology that can be used to improve government and economic policy-making. This paper proposes a new research agenda towards this end by introducing Social Environment Design, a general framework for the use of AI for automated policy-making that connects with the Reinforcement Learning, EconCS, and Computational Social Choice communities. The framework seeks to capture general economic environments, includes voting on policy objectives, and gives a direction for the systematic analysis of government and economic policy through AI simulation. We highlight key open problems for future research in AI-based policy-making. By solving these challenges, we hope to achieve various social welfare objectives, thereby promoting more ethical and responsible decision making.  ( 2 min )
    Robust Learning of Noisy Time Series Collections Using Stochastic Process Models with Motion Codes
    arXiv:2402.14081v1 Announce Type: cross Abstract: While time series classification and forecasting problems have been extensively studied, the cases of noisy time series data with arbitrary time sequence lengths have remained challenging. Each time series instance can be thought of as a sample realization of a noisy dynamical model, which is characterized by a continuous stochastic process. For many applications, the data are mixed and consist of several types of noisy time series sequences modeled by multiple stochastic processes, making the forecasting and classification tasks even more challenging. Instead of regressing data naively and individually to each time series type, we take a latent variable model approach using a mixtured Gaussian processes with learned spectral kernels. More specifically, we auto-assign each type of noisy time series data a signature vector called its motion code. Then, conditioned on each assigned motion code, we infer a sparse approximation of the corresponding time series using the concept of the most informative timestamps. Our unmixing classification approach involves maximizing the likelihood across all the mixed noisy time series sequences of varying lengths. This stochastic approach allows us to learn not only within a single type of noisy time series data but also across many underlying stochastic processes, giving us a way to learn multiple dynamical models in an integrated and robust manner. The different learned latent stochastic models allow us to generate specific sub-type forecasting. We provide several quantitative comparisons demonstrating the performance of our approach.  ( 3 min )
    On Feynman--Kac training of partial Bayesian neural networks
    arXiv:2310.19608v2 Announce Type: replace-cross Abstract: Recently, partial Bayesian neural networks (pBNNs), which only consider a subset of the parameters to be stochastic, were shown to perform competitively with full Bayesian neural networks. However, pBNNs are often multi-modal in the latent variable space and thus challenging to approximate with parametric models. To address this problem, we propose an efficient sampling-based training strategy, wherein the training of a pBNN is formulated as simulating a Feynman--Kac model. We then describe variations of sequential Monte Carlo samplers that allow us to simultaneously estimate the parameters and the latent posterior distribution of this model at a tractable computational cost. Using various synthetic and real-world datasets we show that our proposed training scheme outperforms the state of the art in terms of predictive performance.  ( 2 min )
    EduGym: An Environment and Notebook Suite for Reinforcement Learning Education
    arXiv:2311.10590v2 Announce Type: replace-cross Abstract: Due to the empirical success of reinforcement learning, an increasing number of students study the subject. However, from our practical teaching experience, we see students entering the field (bachelor, master and early PhD) often struggle. On the one hand, textbooks and (online) lectures provide the fundamentals, but students find it hard to translate between equations and code. On the other hand, public codebases do provide practical examples, but the implemented algorithms tend to be complex, and the underlying test environments contain multiple reinforcement learning challenges at once. Although this is realistic from a research perspective, it often hinders educational conceptual understanding. To solve this issue we introduce EduGym, a set of educational reinforcement learning environments and associated interactive notebooks tailored for education. Each EduGym environment is specifically designed to illustrate a certain aspect/challenge of reinforcement learning (e.g., exploration, partial observability, stochasticity, etc.), while the associated interactive notebook explains the challenge and its possible solution approaches, connecting equations and code in a single document. An evaluation among RL students and researchers shows 86% of them think EduGym is a useful tool for reinforcement learning education. All notebooks are available from https://www.edugym.org/, while the full software package can be installed from https://github.com/RLG-Leiden/edugym.  ( 3 min )
    Externally Valid Policy Evaluation Combining Trial and Observational Data
    arXiv:2310.14763v2 Announce Type: replace-cross Abstract: Randomized trials are widely considered as the gold standard for evaluating the effects of decision policies. Trial data is, however, drawn from a population which may differ from the intended target population and this raises a problem of external validity (aka. generalizability). In this paper we seek to use trial data to draw valid inferences about the outcome of a policy on the target population. Additional covariate data from the target population is used to model the sampling of individuals in the trial study. We develop a method that yields certifiably valid trial-based policy evaluations under any specified range of model miscalibrations. The method is nonparametric and the validity is assured even with finite samples. The certified policy evaluations are illustrated using both simulated and real data.  ( 2 min )
    Flow-based Distributionally Robust Optimization
    arXiv:2310.19253v3 Announce Type: replace-cross Abstract: We present a computationally efficient framework, called $\texttt{FlowDRO}$, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD) and sample from it. The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution and develop a Wasserstein proximal gradient flow type algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on high-dimensional real data.  ( 2 min )
    Adaptive Batch Sizes for Active Learning A Probabilistic Numerics Approach
    arXiv:2306.05843v2 Announce Type: replace-cross Abstract: Active learning parallelization is widely used, but typically relies on fixing the batch size throughout experimentation. This fixed approach is inefficient because of a dynamic trade-off between cost and speed -- larger batches are more costly, smaller batches lead to slower wall-clock run-times -- and the trade-off may change over the run (larger batches are often preferable earlier). To address this trade-off, we propose a novel Probabilistic Numerics framework that adaptively changes batch sizes. By framing batch selection as a quadrature task, our integration-error-aware algorithm facilitates the automatic tuning of batch sizes to meet predefined quadrature precision objectives, akin to how typical optimizers terminate based on convergence thresholds. This approach obviates the necessity for exhaustive searches across all potential batch sizes. We also extend this to scenarios with constrained active learning and constrained optimization, interpreting constraint violations as reductions in the precision requirement, to subsequently adapt batch construction. Through extensive experiments, we demonstrate that our approach significantly enhances learning efficiency and flexibility in diverse Bayesian batch active learning and Bayesian optimization applications.  ( 2 min )
    Uncertainty Quantification of Spatiotemporal Travel Demand with Probabilistic Graph Neural Networks
    arXiv:2303.04040v2 Announce Type: replace-cross Abstract: Recent studies have significantly improved the prediction accuracy of travel demand using graph neural networks. However, these studies largely ignored uncertainty that inevitably exists in travel demand prediction. To fill this gap, this study proposes a framework of probabilistic graph neural networks (Prob-GNN) to quantify the spatiotemporal uncertainty of travel demand. This Prob-GNN framework is substantiated by deterministic and probabilistic assumptions, and empirically applied to the task of predicting the transit and ridesharing demand in Chicago. We found that the probabilistic assumptions (e.g. distribution tail, support) have a greater impact on uncertainty prediction than the deterministic ones (e.g. deep modules, depth). Among the family of Prob-GNNs, the GNNs with truncated Gaussian and Laplace distributions achieve the highest performance in transit and ridesharing data. Even under significant domain shifts, Prob-GNNs can predict the ridership uncertainty in a stable manner, when the models are trained on pre-COVID data and tested across multiple periods during and after the COVID-19 pandemic. Prob-GNNs also reveal the spatiotemporal pattern of uncertainty, which is concentrated on the afternoon peak hours and the areas with large travel volumes. Overall, our findings highlight the importance of incorporating randomness into deep learning for spatiotemporal ridership prediction. Future research should continue to investigate versatile probabilistic assumptions to capture behavioral randomness, and further develop methods to quantify uncertainty to build resilient cities.  ( 3 min )
    FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control
    arXiv:2201.10936v4 Announce Type: replace-cross Abstract: Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.  ( 2 min )
    Promises and Pitfalls of Threshold-based Auto-labeling
    arXiv:2211.12620v2 Announce Type: replace-cross Abstract: Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.  ( 2 min )
    Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?
    arXiv:2307.14642v3 Announce Type: replace Abstract: We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called "linear") rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompasses misspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projected stochastic gradient descent. For the projection operator, we consider a domain with triangular scale matrices, which the projection onto is computable in $\Theta(d)$ time, where $d$ is the dimensionality of the target posterior. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STL estimator, providing explicit non-asymptotic complexity guarantees for both.  ( 2 min )
    Adversarial Machine Learning: Bayesian Perspectives
    arXiv:2003.03546v2 Announce Type: replace-cross Abstract: Adversarial Machine Learning (AML) is emerging as a major field aimed at protecting machine learning (ML) systems against security threats: in certain scenarios there may be adversaries that actively manipulate input data to fool learning systems. This creates a new class of security vulnerabilities that ML systems may face, and a new desirable property called adversarial robustness essential to trust operations based on ML outputs. Most work in AML is built upon a game-theoretic modelling of the conflict between a learning system and an adversary, ready to manipulate input data. This assumes that each agent knows their opponent's interests and uncertainty judgments, facilitating inferences based on Nash equilibria. However, such common knowledge assumption is not realistic in the security scenarios typical of AML. After reviewing such game-theoretic approaches, we discuss the benefits that Bayesian perspectives provide when defending ML-based systems. We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent's beliefs and interests, relaxing unrealistic assumptions, and providing more robust inferences. We illustrate this approach in supervised learning settings, and identify relevant future research problems.  ( 2 min )
    Controlling Multiple Errors Simultaneously with a PAC-Bayes Bound
    arXiv:2202.05560v2 Announce Type: replace Abstract: Current PAC-Bayes generalisation bounds are restricted to scalar metrics of performance, such as the loss or error rate. However, one ideally wants more information-rich certificates that control the entire distribution of possible outcomes, such as the distribution of the test loss in regression, or the probabilities of different mis classifications. We provide the first PAC-Bayes bound capable of providing such rich information by bounding the Kullback-Leibler divergence between the empirical and true probabilities of a set of M error types, which can either be discretized loss values for regression, or the elements of the confusion matrix (or a partition thereof) for classification. We transform our bound into a differentiable training objective. Our bound is especially useful in cases where the severity of different mis-classifications may change over time; existing PAC-Bayes bounds can only bound a particular pre-decided weighting of the error types. In contrast our bound implicitly controls all uncountably many weightings simultaneously.  ( 2 min )
    How Transformers Learn Causal Structure with Gradient Descent
    arXiv:2402.14735v1 Announce Type: cross Abstract: The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.  ( 2 min )
    Rao-Blackwellising Bayesian Causal Inference
    arXiv:2402.14781v1 Announce Type: cross Abstract: Bayesian causal inference, i.e., inferring a posterior over causal models for the use in downstream causal reasoning tasks, poses a hard computational inference problem that is little explored in literature. In this work, we combine techniques from order-based MCMC structure learning with recent advances in gradient-based graph learning into an effective Bayesian causal inference framework. Specifically, we decompose the problem of inferring the causal structure into (i) inferring a topological order over variables and (ii) inferring the parent sets for each variable. When limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time. We further use Gaussian processes to model the unknown causal mechanisms, which also allows their exact marginalisation. This introduces a Rao-Blackwellization scheme, where all components are eliminated from the model, except for the causal order, for which we learn a distribution via gradient-based optimisation. The combination of Rao-Blackwellization with our sequential inference procedure for causal orders yields state-of-the-art on linear and non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures.  ( 2 min )
    Incorporating Expert Rules into Neural Networks in the Framework of Concept-Based Learning
    arXiv:2402.14726v1 Announce Type: cross Abstract: A problem of incorporating the expert rules into machine learning models for extending the concept-based learning is formulated in the paper. It is proposed how to combine logical rules and neural networks predicting the concept probabilities. The first idea behind the combination is to form constraints for a joint probability distribution over all combinations of concept values to satisfy the expert rules. The second idea is to represent a feasible set of probability distributions in the form of a convex polytope and to use its vertices or faces. We provide several approaches for solving the stated problem and for training neural networks which guarantee that the output probabilities of concepts would not violate the expert rules. The solution of the problem can be viewed as a way for combining the inductive and deductive learning. Expert rules are used in a broader sense when any logical function that connects concepts and class labels or just concepts with each other can be regarded as a rule. This feature significantly expands the class of the proposed results. Numerical examples illustrate the approaches. The code of proposed algorithms is publicly available.  ( 2 min )
    On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation
    arXiv:2402.14703v1 Announce Type: cross Abstract: We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantee avoids exponential dependence on the horizon. While such estimators exist for MDPs and POMDPs can be converted to history-based MDPs, their estimation errors depend on the state-density ratio for MDPs which becomes history ratios after conversion, an exponential object. Recently, Uehara et al. (2022) proposed future-dependent value functions as a promising framework to address this issue, where the guarantee for memoryless policies depends on the density ratio over the latent state space. However, it also depends on the boundedness of the future-dependent value function and other related quantities, which we show could be exponential-in-length and thus erasing the advantage of the method. In this paper, we discover novel coverage assumptions tailored to the structure of POMDPs, such as outcome coverage and belief coverage. These assumptions not only enable polynomial bounds on the aforementioned quantities, but also lead to the discovery of new algorithms with complementary properties.  ( 2 min )
    Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
    arXiv:2402.14664v1 Announce Type: cross Abstract: In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework to capture these correlations through structured and informative priors. In this framework, we propose sDM, a generic Bayesian approach designed for OPE and OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, deviating from the conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations. Empirical evidence showcases the strong performance of sDM.  ( 2 min )
    CoLoRA: Continuous low-rank adaptation for reduced implicit neural modeling of parameterized partial differential equations
    arXiv:2402.14646v1 Announce Type: cross Abstract: This work introduces reduced models based on Continuous Low Rank Adaptation (CoLoRA) that pre-train neural networks for a given partial differential equation and then continuously adapt low-rank weights in time to rapidly predict the evolution of solution fields at new physics parameters and new initial conditions. The adaptation can be either purely data-driven or via an equation-driven variational approach that provides Galerkin-optimal approximations. Because CoLoRA approximates solution fields locally in time, the rank of the weights can be kept small, which means that only few training trajectories are required offline so that CoLoRA is well suited for data-scarce regimes. Predictions with CoLoRA are orders of magnitude faster than with classical methods and their accuracy and parameter efficiency is higher compared to other neural network approaches.  ( 2 min )
    A Framework for Variational Inference of Lightweight Bayesian Neural Networks with Heteroscedastic Uncertainties
    arXiv:2402.14532v1 Announce Type: cross Abstract: Obtaining heteroscedastic predictive uncertainties from a Bayesian Neural Network (BNN) is vital to many applications. Often, heteroscedastic aleatoric uncertainties are learned as outputs of the BNN in addition to the predictive means, however doing so may necessitate adding more learnable parameters to the network. In this work, we demonstrate that both the heteroscedastic aleatoric and epistemic variance can be embedded into the variances of learned BNN parameters, improving predictive performance for lightweight networks. By complementing this approach with a moment propagation approach to inference, we introduce a relatively simple framework for sampling-free variational inference suitable for lightweight BNNs.  ( 2 min )
    Imbalanced Data Clustering using Equilibrium K-Means
    arXiv:2402.14490v1 Announce Type: cross Abstract: Imbalanced data, characterized by an unequal distribution of data points across different clusters, poses a challenge for traditional hard and fuzzy clustering algorithms, such as hard K-means (HKM, or Lloyd's algorithm) and fuzzy K-means (FKM, or Bezdek's algorithm). This paper introduces equilibrium K-means (EKM), a novel and simple K-means-type algorithm that alternates between just two steps, yielding significantly improved clustering results for imbalanced data by reducing the tendency of centroids to crowd together in the center of large clusters. We also present a unifying perspective for HKM, FKM, and EKM, showing they are essentially gradient descent algorithms with an explicit relationship to Newton's method. EKM has the same time and space complexity as FKM but offers a clearer physical meaning for its membership definition. We illustrate the performance of EKM on two synthetic and ten real datasets, comparing it to various clustering algorithms, including HKM, FKM, maximum-entropy fuzzy clustering, two FKM variations designed for imbalanced data, and the Gaussian mixture model. The results demonstrate that EKM performs competitively on balanced data while significantly outperforming other techniques on imbalanced data. For high-dimensional data clustering, we demonstrate that a more discriminative representation can be obtained by mapping high-dimensional data via deep neural networks into a low-dimensional, EKM-friendly space. Deep clustering with EKM improves clustering accuracy by 35% on an imbalanced dataset derived from MNIST compared to deep clustering based on HKM.  ( 2 min )
    Spectral invariance and maximality properties of the frequency spectrum of quantum neural networks
    arXiv:2402.14515v1 Announce Type: cross Abstract: Quantum Neural Networks (QNNs) are a popular approach in Quantum Machine Learning due to their close connection to Variational Quantum Circuits, making them a promising candidate for practical applications on Noisy Intermediate-Scale Quantum (NISQ) devices. A QNN can be expressed as a finite Fourier series, where the set of frequencies is called the frequency spectrum. We analyse this frequency spectrum and prove, for a large class of models, various maximality results. Furthermore, we prove that under some mild conditions there exists a bijection between classes of models with the same area $A = RL$ that preserves the frequency spectrum, where $R$ denotes the number of qubits and $L$ the number of layers, which we consequently call spectral invariance under area-preserving transformations. With this we explain the symmetry in $R$ and $L$ in the results often observed in the literature and show that the maximal frequency spectrum depends only on the area $A = RL$ and not on the individual values of $R$ and $L$. Moreover, we extend existing results and specify the maximum possible frequency spectrum of a QNN with arbitrarily many layers as a function of the spectrum of its generators. If the generators of the QNN can be further decomposed into 2-dimensional sub-generators, then this specification follows from elementary number-theoretical considerations. In the case of arbitrary dimensional generators, we extend existing results based on the so-called Golomb ruler and introduce a second novel approach based on a variation of the turnpike problem, which we call the relaxed turnpike problem.  ( 3 min )
    The Universe as a Learning System
    arXiv:2402.14423v1 Announce Type: cross Abstract: At its microscopic level, the universe follows the laws of quantum mechanics. Focusing on the quantum trajectories of particles as followed from the hydrodynamical formulation of quantum mechanics, we propose that under general requirements, quantum systems follow a disrupted version of the gradient descent model, a basic machine learning algorithm, where the learning is distorted due to the self-organizing process of the quantum system. Such a learning process is possible only when we assume dissipation, i.e., that the quantum system is open. The learning parameter is the time increment of the process over the mass of the quantum particle, and a friction parameter determines the nonlinearity of the quantum system. We then provide an empirical demonstration of the proposed model.  ( 2 min )
    Reimagining Anomalies: What If Anomalies Were Normal?
    arXiv:2402.14469v1 Announce Type: cross Abstract: Deep learning-based methods have achieved a breakthrough in image anomaly detection, but their complexity introduces a considerable challenge to understanding why an instance is predicted to be anomalous. We introduce a novel explanation method that generates multiple counterfactual examples for each anomaly, capturing diverse concepts of anomalousness. A counterfactual example is a modification of the anomaly that is perceived as normal by the anomaly detector. The method provides a high-level semantic explanation of the mechanism that triggered the anomaly detector, allowing users to explore "what-if scenarios." Qualitative and quantitative analyses across various image datasets show that the method applied to state-of-the-art anomaly detectors can achieve high-quality semantic explanations of detectors.  ( 2 min )
    Multiply Robust Estimation for Local Distribution Shifts with Multiple Domains
    arXiv:2402.14145v1 Announce Type: new Abstract: Distribution shifts are ubiquitous in real-world machine learning applications, posing a challenge to the generalization of models trained on one data distribution to another. We focus on scenarios where data distributions vary across multiple segments of the entire population and only make local assumptions about the differences between training and test (deployment) distributions within each segment. We propose a two-stage multiply robust estimation method to improve model performance on each individual segment for tabular data analysis. The method involves fitting a linear combination of the based models, learned using clusters of training data from multiple segments, followed by a refinement step for each segment. Our method is designed to be implemented with commonly used off-the-shelf machine learning models. We establish theoretical guarantees on the generalization bound of the method on the test risk. With extensive experiments on synthetic and real datasets, we demonstrate that the proposed method substantially improves over existing alternatives in prediction accuracy and robustness on both regression and classification tasks. We also assess its effectiveness on a user city prediction dataset from a large technology company.  ( 2 min )
    Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation
    arXiv:2402.14264v1 Announce Type: new Abstract: Average treatment effect estimation is the most central problem in causal inference with application to numerous disciplines. While many estimation strategies have been proposed in the literature, recently also incorporating generic machine learning estimators, the statistical optimality of these methods has still remained an open area of investigation. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that attain small errors; which is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as a black-box sub-process. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATTE), as well as weighted variants of the former, which arise in policy evaluation.  ( 2 min )
    Adaptive time series forecasting with markovian variance switching
    arXiv:2402.14684v1 Announce Type: new Abstract: Adaptive time series forecasting is essential for prediction under regime changes. Several classical methods assume linear Gaussian state space model (LGSSM) with variances constant in time. However, there are many real-world processes that cannot be captured by such models. We consider a state-space model with Markov switching variances. Such dynamical systems are usually intractable because of their computational complexity increasing exponentially with time; Variational Bayes (VB) techniques have been applied to this problem. In this paper, we propose a new way of estimating variances based on online learning theory; we adapt expert aggregation methods to learn the variances over time. We apply the proposed method to synthetic data and to the problem of electricity load forecasting. We show that this method is robust to misspecification and outperforms traditional expert aggregation.  ( 2 min )
    Multivariate Online Linear Regression for Hierarchical Forecasting
    arXiv:2402.14578v1 Announce Type: new Abstract: In this paper, we consider a deterministic online linear regression model where we allow the responses to be multivariate. To address this problem, we introduce MultiVAW, a method that extends the well-known Vovk-Azoury-Warmuth algorithm to the multivariate setting, and show that it also enjoys logarithmic regret in time. We apply our results to the online hierarchical forecasting problem and recover an algorithm from this literature as a special case, allowing us to relax the hypotheses usually made for its analysis.  ( 2 min )
    Causal Imputation for Counterfactual SCMs: Bridging Graphs and Latent Factor Models
    arXiv:2402.14777v1 Announce Type: new Abstract: We consider the task of causal imputation, where we aim to predict the outcomes of some set of actions across a wide range of possible contexts. As a running example, we consider predicting how different drugs affect cells from different cell types. We study the index-only setting, where the actions and contexts are categorical variables with a finite number of possible values. Even in this simple setting, a practical challenge arises, since often only a small subset of possible action-context pairs have been studied. Thus, models must extrapolate to novel action-context pairs, which can be framed as a form of matrix completion with rows indexed by actions, columns indexed by contexts, and matrix entries corresponding to outcomes. We introduce a novel SCM-based model class, where the outcome is expressed as a counterfactual, actions are expressed as interventions on an instrumental variable, and contexts are defined based on the initial state of the system. We show that, under a linearity assumption, this setup induces a latent factor model over the matrix of outcomes, with an additional fixed effect term. To perform causal prediction based on this model class, we introduce simple extension to the Synthetic Interventions estimator (Agarwal et al., 2020). We evaluate several matrix completion approaches on the PRISM drug repurposing dataset, showing that our method outperforms all other considered matrix completion approaches.  ( 2 min )

  • Open

    [D] How does Information Extraction happen in LLMs so quickly?
    I am an AI researcher and I know a lot about the inner workings of ChatGPT. But each time I am using it, I am totally surprised how it delivers very complex answers on niche topics so extremely quickly, in almost no time. It just seems to (i) know everything and (ii) have thought through everything upfront of any question on any topic. How does it work? Are there any recent theoretical explanations about this, such as filter banks, etc? submitted by /u/CodingButStillAlive [link] [comments]
    [P] Advice regarding MoE and Mamba implementations
    Hi everyone, I'm diving into my Master's Thesis and need some guidance. The core of my work is to linearize a complex function that's riddled with memory effects. While the Transformer architecture has been explored in literature, I'm considering taking a fresh angle with either a Mamba architecture or spicing up the Transformer with a MoE (Mixture of Experts) approach. Moe-Mamba is also on the table. The thing is: it's the first time I'm actually working with these architectures, so I don't really know where to start in order to implement them in real code. Where should I learn about these architectures more? Can you also suggest some code implementations (I don't think there are libraries yet) for these architectures? PS: I know I still have to study a lot about these topics so don't judge my stupid questions pls, that is why I'm asking for advice, I want to learn! :) submitted by /u/PaleAle34 [link] [comments]
    [P] Haven't tried the 120B models unquantized yet? Look no further...
    You can try 120B models at https://www.projectatlantis.ai Honest feedback appreciated. https://preview.redd.it/7iqgla0w7ekc1.png?width=958&format=png&auto=webp&s=c550334dd33a62e9f52ba0ae39875f259feadbbd submitted by /u/Growth4Good [link] [comments]
    [D] ai explainability tools
    I'm working on training a computer vision model for detecting custom objects. I've been looking for tools to help understand AI models, and haven't come across much. Google has a paid toolset: https://cloud.google.com/explainable-ai And this one is free: https://shap-lrjball.readthedocs.io/en/latest/generated/shap.DeepExplainer.html What other tools are people using? submitted by /u/_meatMuffin [link] [comments]
    [D] Modern Dimensionality Reduction
    Hey All, I’m familiar with the more classical techniques of dimensionality reduction like SVD, PCA, and factor analysis. But are there any modern techniques or maybe some tricks that people have learned over the years that they would like to share. For context, this would be for tabular data. Thanks! submitted by /u/MuscleML [link] [comments]
    Unable to specify GPU usage in VLLM code [D]
    I am facing difficulties in specifying GPU usage for different models for LLM inference pipeline using vLLM. Specifically, I have 4 RTX 4090 GPUs available, and I aim to run a LLM with a size of 42GB on 2 RTX 4090 GPUs (~48GB) and a separate model with a size of 22GB on 1 RTX 4090 GPU(`24GB).This is my code for running 42GB model on two GPUs. from vllm import LLM llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2) output = llm.generate(text) However, I haven't found a straightforward method within the VLLM library to specify which GPU should be used for each model. ​ submitted by /u/Humza0000 [link] [comments]
    [P] Gemma 7B with Tensor RT (>500k tok/s batch-8) tutorial
    Hey all - we just put out a guide for running Gemma 7B with Tensor RT. You can get some much better performance out of it with Tensor RT. ​ Check it out: https://docs.mystic.ai/docs/deploy-gemma7b-tensorrt-llm ​ Hope it's useful! submitted by /u/paulcjh [link] [comments]
    [D] [P] intent-pilot: A Desktop Operating Agent
    library: pip install intent-pilot Repo link: https://github.com/askui/intent-pilot Hey! We built a Desktop agent which can perform end-end automation. The core idea is based Set-of-Mark (SoM) + GPT-4v for localization. This library is along the same lines of self-operating-computer or open-interpreter but we felt our object detection is better for the UI domain. Also, we improved the UI experience by providing notifications across platforms and also fixed the keyboard layout issue - For example, Pyautogui messes up special characters in German keyboard. Let me know what you guys think. I built it in a week and my colleague helped at the end. So, your feedback will be appreciated. Our object detection model is behind an API but we have released a global key (available in the repo). Have a nice weekend! Note: This is like giving access to a baby. It surprises you mostly but can also shock you. I suggest you close important tabs before it clicks on the wrong thing ;) submitted by /u/Outlandish_MurMan [link] [comments]
    [D] System Design Interview - Design Chatbot or Search Engine like Perplexity.
    Hi Folks, I have a system design interview coming up for ML role at a FAANG. I have started playing with Gen AI only recently and read up on foundational concepts - LLM, model selection, RAGs, fine tuning, etc. But want to get a solid understanding of overall system specifically for gen AI powered apps. Curious if anyone can point to resources o r explain end-to-end system design for a typical Chatbot or a search engine (like Perplexity). submitted by /u/Grouchy-Ad6094 [link] [comments]
    [D] Approximating known distributions (up to a normalization factor) by a decoder-only part of a VAE.
    Hi all, Please feel free to delete if this is a beginner question. I'm reading a bunch of papers on how to learn and sample distributions using neural networks and have some questions. Everything described below is a summary of a couple of papers I read where people tried to do this thing, but I'd like to keep the post self-contained. ------------------------------------------------------------------------------------------------------------------------------ Introduction: I have the following question. Imagine you have a distribution P(x)=F(x)/N, where we know F(x) and can evaluate it at will, but we don't know the normalization factor N. The question is -- how can we learn generate samples for the distribution P(x), with x being elements of some high dimensional space? One option wou…
    [R] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping - Meta 2024 - Searchformer - Significantly outperforms baselines that predict the optimal plan directly with a 5-10× smaller model size and a 10× smaller training dataset!
    Paper: https://arxiv.org/abs/2402.14083 Abstract: While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Searchformer is an encoder-decoder Transformer model trained to predict the search dynamics of A∗. This model is then fine-tuned via expert iterations to perform fewer search steps than A∗ search while still generating an optimal plan. In our training method, A∗'s search dynamics are expressed as a token sequence outlining when task states are added and removed into the search tree during symbolic planning. In our ablation studies on maze navigation, we find that Searchformer significantly outperforms baselines that predict the optimal plan directly with a 5-10× smaller model size and a 10× smaller training dataset. We also demonstrate how Searchformer scales to larger and more complex decision making tasks like Sokoban with improved percentage of solved tasks and shortened search dynamics. https://preview.redd.it/fhn5bsklbdkc1.jpg?width=1028&format=pjpg&auto=webp&s=bbb8d726ba74d046023a6c6827249fa602c6eff1 https://preview.redd.it/n5a54uklbdkc1.jpg?width=521&format=pjpg&auto=webp&s=619d31cf68977f98213422566f7c075aa1a2007b https://preview.redd.it/ztmf8rklbdkc1.jpg?width=1144&format=pjpg&auto=webp&s=700c9cf543b09b85b07d296a314d0ef6b451c1d0 https://preview.redd.it/poragwklbdkc1.jpg?width=936&format=pjpg&auto=webp&s=18435580f179a63d72305aa1d9c4511f1ecf70c5 submitted by /u/Singularian2501 [link] [comments]
    [D] My thoughts on model merging
    Hey folks, I've been fascinated by model merging and have been playing with it recently. I wanted to understand how it works and these are my findings. My understanding of how model merging works: Task Vectors - The core idea in model merging is derived from the concept of task vectors. The main idea here is that once you have finetuned a model on a specific task, if you subtract the weights from the base model, it gives you a "vector" which captures the modifications needed for the task. Why Model Merging Works - The intuition here is that if you have different models that are good at different things, you can combine different task vectors (such as taking an average in different ways) to produce a new model that is good at both tasks. For example, if model A is good at math, and model B is good at programming, you can merge the models to product a model that is good at both. Many Approaches to Merge Models - All model merging approaches work by combining the task vectors in different ways. Some approaches include Linear Interpolation (LERP), Spherical Linear Interpolation (SLERP), TIES, and DARE. More Art then Science - Like everything in LLM world, these approaches are a bit like black box. While they have some intuitive reasoning, it seems like this is also more of an art then exact science. I am curious what you guys think of this? How have your experiences been working with merged models? Is merging mostly trial and error, and are there some recommended best practices? I have put together a step by step guide along with relevant code and tools I used to do the merge in a blog post here. submitted by /u/bhavya6187 [link] [comments]
    [R] OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement - 2024 - HumanEval of 92.7! GPT-4 CodeInterpreter has only 88.0!
    Paper: https://arxiv.org/abs/2402.14658 Github: https://opencodeinterpreter.github.io/ Abstract: The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and MBPP, closely rivaling GPT-4's 84.2 (76.2) and further elevates to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter brings the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter. https://preview.redd.it/56p1vhv26dkc1.jpg?width=752&format=pjpg&auto=webp&s=f1f47a4d25a05ff4a41e46eadce82ca51c1784cb submitted by /u/Singularian2501 [link] [comments]
    [D] ICLR Plot Twists
    Saw a few ICLR results that seem like a surprise to the community: Mamba ➡️ Reject V-JEPA ➡️ Reject MetaGPT ➡️ Accept (Oral) as discussed here What other accepts/rejects have raised a few eyebrows? submitted by /u/hzmehrdad [link] [comments]
    [R] LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens - Microsoft 2024
    Paper: https://arxiv.org/abs/2402.13753 Abstract: Large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE that, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with up to only 1k fine-tuning steps at within 256k training lengths, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformities in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations. https://preview.redd.it/siuxi9gf2dkc1.jpg?width=1109&format=pjpg&auto=webp&s=c21f2879f3bdafafb1e9f0a97ca303b90c96d18f https://preview.redd.it/vtpxmcgf2dkc1.jpg?width=1188&format=pjpg&auto=webp&s=9837d7ed191130bbf853c7455f06cb16cd704756 https://preview.redd.it/uapbcbgf2dkc1.jpg?width=1115&format=pjpg&auto=webp&s=aa2f5f41dc0a32968c45144dfea21e1541d089c2 submitted by /u/Singularian2501 [link] [comments]
    [D] Mamba: The Easy Way
    Mamba looks like an exciting new language model architecture, and it took me awhile to understand the paper in full! The model employs a lot of tough concepts (S4, GPU memory, parallel scans, etc.), so I've written a blogpost about my understanding of Mamba's big ideas and contributions, with an eye toward making it as beginner-friendly as possible. Link: https://jackcook.com/2024/02/23/mamba.html I hope this is helpful, and I'd love to discuss any additional questions or points of clarification. Let me know what you think! submitted by /u/jackcook [link] [comments]
    [R] LLMs seem to be by default value maximizers and have a value bias in their responses
    Title : Exploring Value Biases: How LLMs Deviate Towards the Ideal Abstract: Large-Language-Models (LLMs) are deployed in a wide range of applications, and their response has an increasing social impact. Understanding the non-deliberate(ive) mechanism of LLMs in giving responses is essential in explaining their performance and discerning their biases in real-world applications. This is analogous to human studies, where such inadvertent responses are referred to as sampling. We study this sampling of LLMs in light of value bias and show that the sampling of LLMs tends to favour high-value options. Value bias corresponds to this shift of response from the most likely towards an ideal value represented in the LLM. In fact, this effect can be reproduced even with new entities learnt via in-context prompting. We show that this bias manifests in unexpected places and has implications on relevant application scenarios, like choosing exemplars. The results show that value bias is strong in LLMs across different categories, similar to the results found in human studies. https://arxiv.org/abs/2402.11005 submitted by /u/Cool_Abbreviations_9 [link] [comments]
    [R] "Generative Models: What do they know? Do they know things? Let's find out!". Quote from paper: "Our findings reveal that all types of the generative models we study contain rich information about scene intrinsics [normals, depth, albedo, and shading] that can be easily extracted using LoRA."
    Paper. Project website. I am not affiliated with the authors. Abstract: Generative models have been shown to be capable of synthesizing highly detailed and realistic images. It is natural to suspect that they implicitly learn to model some image intrinsics such as surface normals, depth, or shadows. In this paper, we present compelling evidence that generative models indeed internally produce high-quality scene intrinsic maps. We introduce Intrinsic LoRA (I LoRA), a universal, plug-and-play approach that transforms any generative model into a scene intrinsic predictor, capable of extracting intrinsic scene maps directly from the original generator network without needing additional decoders or fully fine-tuning the original network. Our method employs a Low-Rank Adaptation (LoRA) of ke…
    [D] AMD Ryzen Threadripper PRO 5955WX vs. AMD Ryzen Threadripper 7960X for DL Workstation
    I want to build a ml workstation with 2x rtx4090 and 128GB RAM. I‘m struggling to decide on the right cpu. What is your opinion on the AMD Ryzen Threadripper PRO 5955WX vs. AMD Ryzen Threadripper 7960X? I feel like the 7960x is superior in every aspect, unless i would want to expand my setup in the future to quad gpu or more RAM. What do you think, am i missing something? What is your opinion on those options in general? submitted by /u/Striking_Way_3205 [link] [comments]
    [D] Lessons (tips) for writing a compelling conference paper
    Hi, With ECCV deadline around the corner, I wanted to get some insights from more experienced members here about the lessons, tips and tricks they follow/learnt over the course of their career while writing a conference paper. These can be anything of sorts like "pretty pictures", tone of the text, phrasing of objectives, certain rules they follow for the write-up, or high level questions they think the reviewers ask while reading their work etc. Hopefully, this exchange will also benefit others. submitted by /u/PaganPasta [link] [comments]
    [R] A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
    Paper: https://arxiv.org/abs/2401.01313 Abstract: As Large Language Models (LLMs) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to hallucinate generating content that appears factual but is ungrounded. This issue of hallucination is arguably the biggest hindrance to safely deploying these powerful LLMs into real-world production systems that impact people's lives. The journey toward widespread adoption of LLMs in practical settings heavily relies on addressing and mitigating hallucinations. Unlike traditional AI systems focused on limited tasks, LLMs have been exposed to vast amounts of online text data during training. While this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. This becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, financial analysis reports, etc. This paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in LLMs. Notable among these are Retrieval Augmented Generation (Lewis et al, 2021), Knowledge Retrieval (Varshney et al,2023), CoNLI (Lei et al, 2023), and CoVe (Dhuliawala et al, 2023). Furthermore, we introduce a detailed taxonomy categorizing these methods based on various parameters, such as dataset utilization, common tasks, feedback mechanisms, and retriever types. This classification helps distinguish the diverse approaches specifically designed to tackle hallucination issues in LLMs. Additionally, we analyze the challenges and limitations inherent in these techniques, providing a solid foundation for future research in addressing hallucinations and related phenomena within the realm of LLMs. submitted by /u/SunsetOneSix [link] [comments]
    [D] Are IPUs still a thing?
    I was initially thrilled by IPUs as they seemed to be a serious alternative to GPUs (and TPUs). But Graphcore, the company making the IPUs, seems to be in a very bad situation now. And I don't see so many improvements in terms of software compatibility on IPUs. For example, the HF Optimum Graphcore lib has not been updated for 3 months: https://github.com/huggingface/optimum-graphcore ... submitted by /u/handwerner142 [link] [comments]
    [D] Why is everybody surprised that Mamba got rejected from ICLR? Am I missing something?
    I'm not just trying to be contrarian either. I keep hearing this on Reddit, at work, on different online forums, etc. I also was surprised when I first heard the news but after reading the paper I wasn't particularly surprised. Their hardware tweaks were interesting but other than that it seems like it was a simple adaptation of a previous paper. The benchmark experiments were not as extensive as I initially believed due to everybody talking about how revolutionary it is. Reading the paper just left me with a ton of questions along the lines of "What about performance on X task or Y benchmark?" I'm not trying to shame the authors, but it didn't really feel like a "conventional" paper in the machine learning field either. There have been plenty of great papers released that weren't exactly fit for a conference publication, and I don't think that just because something is being talked about a lot on Twitter or LinkedIn it means it deserves to be published at a venue. I'm genuinely wondering if I'm underestimating it because I didn't understand it properly and am open to any opinions. submitted by /u/Seankala [link] [comments]
  • Open

    AI for study plan generation based off of previous exam scores
    Background: I'm a Physician Assistant student that is about to take the board exam (PANCE). In preparation for the PANCE we take End of Rotation (EOR's) exams after each required clinical rotation. We also take an End of Curriculum (EOC) exam that encompasses all the EOR's and is like a mini PANCE exam. The content for each EOR never changes, such as cardiology, pulmonology, Gastrointestinal (GI), ect. However, each EOR is different and is graded/based on what that rotation is heavier on. An example is that Surgery has a lot of GI questions or like OBGYN has a lot of OBGYN...lol. The exam questions all have a category such as diagnosis, diagnostic studies, history and physical, ect. All of which are weighted differently as well. Example, 20% of the exam will be about diagnosis and like 1…
    Any *free* AI that work like chatGPT except with images?
    Basically, just wondering if there is an ai chatbot thats actually free that works like GPT 3, but with image generation submitted by /u/Large_monke_69 [link] [comments]
    What models/technologies do these latest Humanoid robots use?
    A lot of companies are coming with humanoid robots. I wonder if there is a technological breakthrough that underlies this acceleration in development. (Like Transformers and diffusion models, were the breakthrough that led to the emergence of the all these AI companies) ​ like what are underlying technologies? I know deep learning based localization and mapping techniques are a major part. and what else? ​ Just asking because I want to know if I can build a small n'simple AI robot at home? submitted by /u/scholorboy [link] [comments]
    I built an LLM agent that crawls documentation websites, so you don't have to
    submitted by /u/TheMblabla [link] [comments]
    Personal AI with memory
    Hey folks. Anyone know of an AI tool that I can use as a second brain/personal assistant? I imagine something I can either connect to a cloud storage like Google Drive, Dropbox, etc. and/or upload documents to it. And it would use all of the info from that data (what it knows about me, my writing style, my thoughts, etc.) and help me brainstorm prompts or write in a way that mirrors my natural thoughts. I forget things often but I have so much content in my notes app and Gdrive that I wish I could feed to ChatGPT or Bard so I didn’t gave up extensively set the context during each conversation. TLDR: looking for a mix of a chat AI (ChatGPT, Perplexity, Claude, etc.) and lex.page with long term memory and document ingestion to use as my personal AI, similar to the cookies) from the Black Mirror White Christmas episode. submitted by /u/wanderlotus [link] [comments]
    For fun, I got Gemini, ChatGPT, and Copilot to write presidential stump speeches. Here is the result.
    submitted by /u/bubblesort [link] [comments]
    Daily AI News Summary (02/23/2024)
    Microsoft and Intel strike a custom chip deal that could be worth billions [1] Nvidia's AI-friendly hardware sends its market value soaring higher than Australia's GDP [2] Google DeepMind forms a new org focused on AI safety [3] ChatGPT went temporarily “insane” with unexpected outputs [4] Sources: [1] https://www.theverge.com/2024/2/21/24079336/microsoft-intel-chip-partnership-foundry-tsmc [2] https://www.abc.net.au/news/2024-02-23/nvidia-artificial-intelligence-computer-chip-revenue-profit/103504226 [3] https://techcrunch.com/2024/02/21/google-deepmind-forms-a-new-org-focused-on-ai-safety/ [4] https://arstechnica.com/information-technology/2024/02/chatgpt-alarms-users-by-spitting-out-shakespearean-nonsense-and-rambling submitted by /u/Used-Bat3441 [link] [comments]
    How might the growing presence of artificial intelligence influence the development of future generations' human intelligence?
    We know that AI is becoming increasingly integrated into our daily lives, so I'm curious about its potential effects on various aspects of human intelligence. With AI capable of generating essays and completing assignments, how might this affect the rigor of education and the development of essential skills such as reading, writing, critical thinking and problem solving? Will these skills become irrelevent? submitted by /u/erikvoza [link] [comments]
    ChatGPT, make Terminator 2 medieval.
    submitted by /u/Philipp [link] [comments]
    Is AI this easy?
    Please forgive me, I am not well versed in AI. I have been an innocent bystander to its advancement and this it's very interesting. I would like to venture into creating, or generating stuff, myself. Years ago some friends and I were joking around and somehow the conversation led to 'what if [friends name] played Dom Toretto in F&F instead of Vin Diesel?' We laughed about the concept for a good while, and I also laughed about it for a solid couple minutes when I remembered it today. It dawned on me that AI may be able to bring that joke to a reality. Is this something I can make happen with modern AI tools? Even if it's just a few scenes from the film. I imagine there would be more steps involved besides just a text prompt. The AI would need to have some sort of data on my friends voice and a 3D model. Feel free to shut me down if this is too far fetched, but it would make approximately 12 people laugh uncontrollably for years to come so I figured it's worth a shot submitted by /u/caddlaxx [link] [comments]
    Stable Diffusion 3.0 has announced: 3 highlights of this new version
    https://stability.ai/news/stable-diffusion-3 Stability AI emphasized several highlights of this version. The foremost is text rendering capability. On their official website, they consecutively showed three images containing text, not only is the text clear, but there are also no spelling errors. https://preview.redd.it/5xbpeufrsakc1.png?width=1080&format=png&auto=webp&s=cdc0e22f801fe92ac87ba7b617241ed2f2a7d5ef Stability AI's CEO Mostaque also showcased images with text on X(Twitter): https://preview.redd.it/u1kgiof3vakc1.png?width=672&format=png&auto=webp&s=72764f43cf1b8c0b3fa19d12eadff22d7ebf37e3 https://preview.redd.it/4rb1pm82vakc1.png?width=2048&format=png&auto=webp&s=5d7c38ed6b54a3a4b69eeab51663d10685058eac Another highlight is multi-topic generation. You can paint a picture based on a prompt containing multiple elements. https://preview.redd.it/m317mxcrtakc1.png?width=1080&format=png&auto=webp&s=55426550b8880ceb40655c5bf8aa5b3a71325277 The third highlight is high image quality. https://preview.redd.it/6r2nois5uakc1.png?width=1080&format=png&auto=webp&s=a7ba2fdb1f3a90c79d530ed73ee979255be4606f Also, the texture of generated comics and sketches has improved over previous versions: https://preview.redd.it/bx7gy1w6uakc1.png?width=1080&format=png&auto=webp&s=2e239aab2888a58c34dfe4186bf5c26d95b6e16c https://preview.redd.it/js1k3ye8uakc1.png?width=1080&format=png&auto=webp&s=299df4c1c294d855e0716cf88157d313501111cf Although Stable Diffusion 3.0 was initially showcased as a text-to-image AI generation technology, it will become the foundation for broader applications. Over the past few months, Stability AI has also been developing 3D image generation and video synthesis capabilities. Reference: https://twitter.com/EMostaque https://venturebeat.com/ai/stable-diffusion-3-0-debuts-new-diffusion-transformation-architecture-to-reinvent-text-to-image-gen-ai/ ​ submitted by /u/Stupid_hardcorer [link] [comments]
    One-Minute Daily AI News 2/22/2024
    Nvidia’s AI-friendly hardware sends its market value soaring higher than Australia’s GDP.[1] Jiang Lu, Head of Google’s VideoPoet Project, Joined TikTok.[2] Google admits Gemini AI is ‘missing the mark’, pauses image generation capabilities.[3] Tableau launches Pulse, a GenAI-fueled insight generator.[4] Sources: [1] https://www.abc.net.au/news/2024-02-23/nvidia-artificial-intelligence-computer-chip-revenue-profit/103504226 [2] https://pandaily.com/jiang-lu-head-of-googles-videopoet-project-joined-tiktok/ [3] https://www.livemint.com/technology/tech-news/google-admits-gemini-ai-is-missing-the-mark-pauses-image-generation-capabilities-11708651965110.html [4] https://www.techtarget.com/searchbusinessanalytics/news/366571065/Tableau-launches-Pulse-a-GenAI-fueled-insight-generator submitted by /u/Excellent-Target-847 [link] [comments]
    AI Craze: The Bloomberg Close, Americas Edition
    submitted by /u/manwhoholdtheworld [link] [comments]
    Intuitive Machines Have Successfully Landed On The Moon LUNR
    submitted by /u/Xtianus21 [link] [comments]
    Option Care Health Selects Palantir’s Artificial Intelligence Platform for Enterprise-Wide Digital Transformation
    submitted by /u/A-Dog22 [link] [comments]
  • Open

    Hyperparameters Tuning: Grid Search vs Random Search
    submitted by /u/Personal-Trainer-541 [link] [comments]
    Explain Diffusion Models like I'm 5
    Can someone help me understand diffusion models? I'm almost there, but not quite. I understand you take an image, add noise to it in steps, then remove some noise from it and repeat with slightly less noise until you know how much noise to remove to get the original image. I also get that in there you add the text encodings to create the correlation between the text caption and the resulting image. But how do you generate images? When I supply a prompt, is the reverse diffusion model given a brand new, completely random set of noise? And then takes it's prior knowledge of how to remove noise to get an image? If so, where do semantics come in with a prompt it's never seen before? I vaguely understand U-nets and how you can generate semantics and masks from one, but where in the diffusion process is this? TYIA!! submitted by /u/Alternative_Leg_3111 [link] [comments]
    Image segmentation neural network
    Hello everyone, first time I post and I have a question : Can I take a semgentation neural network and change it to it have an attention block now? Like is there a known procedure on how to change a NN that take one image to another that take an image+ its saliency map? This is for the sake of coparaison, to understand the effect of attention block on the overall result of the segmentation. submitted by /u/karimredditor [link] [comments]
  • Open

    What are the benefits of using Natural Language Processing (NLP) in Business?
    In 2024, companies all around the world are on a relentless quest for innovative solutions to leverage vast amounts of information and elevate their interactions. In this quest, Natural Language Processing (NLP) emerges as a groundbreaking area of artificial intelligence, seamlessly connecting human communication with machine interpretation. NLP is transforming business practices, data analysis, and… Read More »What are the benefits of using Natural Language Processing (NLP) in Business? The post What are the benefits of using Natural Language Processing (NLP) in Business? appeared first on Data Science Central.  ( 22 min )
    MarTech Trends: Capitalizing on the rising GenAI excitement for business ROI growth
    The momentum around Generative AI in MarTech is undeniable. A staggering 63% of marketing leaders are signaling their intention to invest in this innovative technology in the future. This isn’t a vision for the distant future—it’s already here. It is reshaping our present, unlocking new avenues for innovation and offering solutions that were once thought… Read More »MarTech Trends: Capitalizing on the rising GenAI excitement for business ROI growth  The post MarTech Trends: Capitalizing on the rising GenAI excitement for business ROI growth  appeared first on Data Science Central.  ( 23 min )
    SQL Server: Powering your data warehouse with insights and efficiency
    Businesses today heavily rely on data, but many struggle with an overwhelming amount of information that doesn’t provide the necessary insights. Important questions often go unanswered as data is scattered across various files and slow reports. This challenge makes decision-making difficult, opportunities are missed, and progress stalls. Have you ever felt this way? The problem… Read More »SQL Server: Powering your data warehouse with insights and efficiency The post SQL Server: Powering your data warehouse with insights and efficiency appeared first on Data Science Central.  ( 22 min )
  • Open

    Uncovering names masked with stars
    Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.” Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and there’s a 44% chance […] Uncovering names masked with stars first appeared on John D. Cook.  ( 5 min )
    Almost ASCII
    I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages. […] Almost ASCII first appeared on John D. Cook.  ( 6 min )
    A knight’s tour of an infinite chessboard
    Let ℤ² be the lattice of points in the plane with integer coordinates. You could think of these points as being the centers of the squares in a chessboard extending to infinity in every direction. Cantor tells us that the points in ℤ² are countable. What’s more surprising is that you could count the points […] A knight’s tour of an infinite chessboard first appeared on John D. Cook.  ( 5 min )
  • Open

    I was doing MaxEntRL all this time?
    I have a quick question. I recently came across Max Entropy RL. However, I don't understand the field. It seems just adding entropy regularization loss to your policy loss makes the method max entropy RL as long as the coefficient is 1? Am I missing something? I thought maximum entropy RL should be a more sophisticated algorithm. tldr; Is adding entropy regularization to your A2C/PPO, etc policy loss with coefficient 1 doing maximum entropy RL? submitted by /u/miladink [link] [comments]
    Grid World Application
    submitted by /u/bwe587 [link] [comments]
    Need of working code examples (repo)
    Hi folks. I have been working on RL for a while and now I lost my motivation. After purchasing several books and Udemy courses I still lack of working examples of code. Most of the algorithms shown in the books are settled on old library versions and due to that fact not working. Some for example: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-with-Python/tree/master/11.%20Actor%20Critic%20Methods%20-%20A2C%20and%20A3C can not solving the env. For example https://github.com/PacktPublishing/Tensorflow-2-Reinforcement-Learning-Cookbook/blob/master/Chapter03/5_a3c_continuous.py Working as slow as hell (several minutes for one episode). So it can not be verified if it is good or not. So my question is anybody have access to the working examples repository (Gym)? My interest is a continuous version of RL (continuous actions). submitted by /u/Sharp-Record1600 [link] [comments]
    Policy Iteration: different definitions of improvement step in different textbooks
    In the textbook Artificial Intelligence, A Modern Approach, the method of evaluating an action value in the policy improvement step: https://preview.redd.it/vo75je58qakc1.png?width=544&format=png&auto=webp&s=648cd1ad34bda60966f2b027555e5623f0415492 is different to that of the Reinforcement Learning textbook by Sutton and Barto: https://preview.redd.it/von4631aqakc1.png?width=515&format=png&auto=webp&s=a20c09926e057fb02bf44c7cc5a1a1c8f4ec3a65 where the reward is taken into account. Is there a reason why? Also, U(s') is not defined as R+ V(s') in the first image. U and V are just different notations of the same thing. Thanks for any answers ​ Edit: Policy evaluation in first book: ​ https://preview.redd.it/puujf9gvfckc1.png?width=566&format=png&auto=webp&s=4cead60f6db3029a497fab18ec3f3cd761323f1f Full algo for second image: ​ https://preview.redd.it/bfckvah3gckc1.png?width=490&format=png&auto=webp&s=b8e42bff391090bcb853708e03b655a800343d64 submitted by /u/tengboss [link] [comments]
    JAX-based SAC iteration not training, unsure where I'm missing the bug
    Hi all, I'm back with another JAX question. This time around I've been re-designing my previous version of JAX-based SAC to be similar to the PureJAXRL format where the entire algorithm is contained in one function that can then be jitted, vmapped, etc. Here is what I have so far: https://pastes.io/xlyddbpwb1 (main code), https://pastes.io/o2jagqxrhi (simple replay buffer). The issue I'm running into currently is that it is not learning. I'm using the exact same loss calculations as my previous version of JAX SAC and that trained properly so I know the math there is correct. This version of the algorithm is uploaded here for comparison: https://pastes.io/qq3mia60gi I also re-implemented that code using the JAX-based replay buffer I use in this code and that also appeared to train fine. I have a feeling that I'm potentially running into a bug with my usage of lax.scan or lax.cond for handling the conditionals. Would anyone else be able to take a look and see what I might be missing? I've been staring at this for a while now and I'm convinced I'm maybe missing something simple. Thanks in advance for any help! submitted by /u/1cedrake [link] [comments]
  • Open

    Bias correction of wind power forecasts with SCADA data and continuous learning
    arXiv:2402.13916v1 Announce Type: new Abstract: Wind energy plays a critical role in the transition towards renewable energy sources. However, the uncertainty and variability of wind can impede its full potential and the necessary growth of wind power capacity. To mitigate these challenges, wind power forecasting methods are employed for applications in power management, energy trading, or maintenance scheduling. In this work, we present, evaluate, and compare four machine learning-based wind power forecasting models. Our models correct and improve 48-hour forecasts extracted from a numerical weather prediction (NWP) model. The models are evaluated on datasets from a wind park comprising 65 wind turbines. The best improvement in forecasting error and mean bias was achieved by a convolutional neural network, reducing the average NRMSE down to 22%, coupled with a significant reduction in mean bias, compared to a NRMSE of 35% from the strongly biased baseline model using uncorrected NWP forecasts. Our findings further indicate that changes to neural network architectures play a minor role in affecting the forecasting performance, and that future research should rather investigate changes in the model pipeline. Moreover, we introduce a continuous learning strategy, which is shown to achieve the highest forecasting performance improvements when new data is made available.  ( 2 min )
    Reconstruction of Sound Field through Diffusion Models
    arXiv:2312.08821v2 Announce Type: replace-cross Abstract: Reconstructing the sound field in a room is an important task for several applications, such as sound control and augmented (AR) or virtual reality (VR). In this paper, we propose a data-driven generative model for reconstructing the magnitude of acoustic fields in rooms with a focus on the modal frequency range. We introduce, for the first time, the use of a conditional Denoising Diffusion Probabilistic Model (DDPM) trained in order to reconstruct the sound field (SF-Diff) over an extended domain. The architecture is devised in order to be conditioned on a set of limited available measurements at different frequencies and generate the sound field in target, unknown, locations. The results show that SF-Diff is able to provide accurate reconstructions, outperforming a state-of-the-art baseline based on kernel interpolation.  ( 2 min )
    Sparse and Structured Hopfield Networks
    arXiv:2402.13725v1 Announce Type: new Abstract: Modern Hopfield networks have enjoyed recent interest due to their connection to attention in transformers. Our paper provides a unified framework for sparse Hopfield networks by establishing a link with Fenchel-Young losses. The result is a new family of Hopfield-Fenchel-Young energies whose update rules are end-to-end differentiable sparse transformations. We reveal a connection between loss margins, sparsity, and exact memory retrieval. We further extend this framework to structured Hopfield networks via the SparseMAP transformation, which can retrieve pattern associations instead of a single pattern. Experiments on multiple instance learning and text rationalization demonstrate the usefulness of our approach.  ( 2 min )
    Towards Message Brokers for Generative AI: Survey, Challenges, and Opportunities
    arXiv:2312.14647v2 Announce Type: replace-cross Abstract: In today's digital world, Generative Artificial Intelligence (GenAI) such as Large Language Models (LLMs) is becoming increasingly prevalent, extending its reach across diverse applications. This surge in adoption has sparked a significant increase in demand for data-centric GenAI models, highlighting the necessity for robust data communication infrastructures. Central to this need are message brokers, which serve as essential channels for data transfer within various system components. This survey aims to delve into a comprehensive analysis of traditional and modern message brokers, offering a comparative study of prevalent platforms. Our study considers numerous criteria including, but not limited to, open-source availability, integrated monitoring tools, message prioritization mechanisms, capabilities for parallel processing, reliability, distribution and clustering functionalities, authentication processes, data persistence strategies, fault tolerance, and scalability. Furthermore, we explore the intrinsic constraints that the design and operation of each message broker might impose, recognizing that these limitations are crucial in understanding their real-world applicability. Finally, this study examines the enhancement of message broker mechanisms specifically for GenAI contexts, emphasizing the criticality of developing a versatile message broker framework. Such a framework would be poised for quick adaptation, catering to the dynamic and growing demands of GenAI in the foreseeable future. Through this dual-pronged approach, we intend to contribute a foundational compendium that can guide future innovations and infrastructural advancements in the realm of GenAI data communication.  ( 3 min )
    Hierarchical Neural Simulation-Based Inference Over Event Ensembles
    arXiv:2306.12584v2 Announce Type: replace-cross Abstract: When analyzing real-world data it is common to work with event ensembles, which comprise sets of observations that collectively constrain the parameters of an underlying model of interest. Such models often have a hierarchical structure, where "local" parameters impact individual events and "global" parameters influence the entire dataset. We introduce practical approaches for frequentist and Bayesian dataset-wide probabilistic inference in cases where the likelihood is intractable, but simulations can be realized via a hierarchical forward model. We construct neural estimators for the likelihood(-ratio) or posterior and show that explicitly accounting for the model's hierarchical structure can lead to significantly tighter parameter constraints. We ground our discussion using case studies from the physical sciences, focusing on examples from particle physics and cosmology.  ( 2 min )
    A Conservative Approach for Few-Shot Transfer in Off-Dynamics Reinforcement Learning
    arXiv:2312.15474v2 Announce Type: replace Abstract: Off-dynamics Reinforcement Learning (ODRL) seeks to transfer a policy from a source environment to a target environment characterized by distinct yet similar dynamics. In this context, traditional RL agents depend excessively on the dynamics of the source environment, resulting in the discovery of policies that excel in this environment but fail to provide reasonable performance in the target one. In the few-shot framework, a limited number of transitions from the target environment are introduced to facilitate a more effective transfer. Addressing this challenge, we propose an innovative approach inspired by recent advancements in Imitation Learning and conservative RL algorithms. The proposed method introduces a penalty to regulate the trajectories generated by the source-trained policy. We evaluate our method across various environments representing diverse off-dynamics conditions, where access to the target environment is extremely limited. These experiments include high-dimensional systems relevant to real-world applications. Across most tested scenarios, our proposed method demonstrates performance improvements compared to existing baselines.  ( 2 min )
    Combining unsupervised and supervised learning in microscopy enables defect analysis of a full 4H-SiC wafer
    arXiv:2402.13353v1 Announce Type: cross Abstract: Detecting and analyzing various defect types in semiconductor materials is an important prerequisite for understanding the underlying mechanisms as well as tailoring the production processes. Analysis of microscopy images that reveal defects typically requires image analysis tasks such as segmentation and object detection. With the permanently increasing amount of data that is produced by experiments, handling these tasks manually becomes more and more impossible. In this work, we combine various image analysis and data mining techniques for creating a robust and accurate, automated image analysis pipeline. This allows for extracting the type and position of all defects in a microscopy image of a KOH-etched 4H-SiC wafer that was stitched together from approximately 40,000 individual images.  ( 2 min )
    Scaling Laws for Associative Memories
    arXiv:2310.02984v2 Announce Type: replace-cross Abstract: Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models. We derive precise scaling laws with respect to sample size and parameter size, and discuss the statistical efficiency of different estimators, including optimization-based algorithms. We provide extensive numerical experiments to validate and interpret theoretical results, including fine-grained visualizations of the stored memory associations.  ( 2 min )
    Simple, unified analysis of Johnson-Lindenstrauss with applications
    arXiv:2402.10232v2 Announce Type: replace-cross Abstract: In this work, we present a simple and unified analysis of the Johnson-Lindenstrauss (JL) lemma, a cornerstone in the field of dimensionality reduction critical for managing high-dimensional data. Our approach not only simplifies the understanding but also unifies various constructions under the JL framework, including spherical, binary-coin, sparse JL, Gaussian and sub-Gaussian models. This simplification and unification make significant strides in preserving the intrinsic geometry of data, essential across diverse applications from streaming algorithms to reinforcement learning. Notably, we deliver the first rigorous proof of the spherical construction's effectiveness and provide a general class of sub-Gaussian constructions within this simplified framework. At the heart of our contribution is an innovative extension of the Hanson-Wright inequality to high dimensions, complete with explicit constants, marking a substantial leap in the literature. By employing simple yet powerful probabilistic tools and analytical techniques, such as an enhanced diagonalization process, our analysis not only solidifies the JL lemma's theoretical foundation but also extends its practical reach, showcasing its adaptability and importance in contemporary computational algorithms.  ( 2 min )
    Harnessing Large Language Models as Post-hoc Correctors
    arXiv:2402.13414v1 Announce Type: new Abstract: As Machine Learning (ML) models grow in size and demand higher-quality training data, the expenses associated with re-training and fine-tuning these models are escalating rapidly. Inspired by recent impressive achievements of Large Language Models (LLMs) in different fields, this paper delves into the question: can LLMs efficiently improve an ML's performance at a minimal cost? We show that, through our proposed training-free framework LlmCorr, an LLM can work as a post-hoc corrector to propose corrections for the predictions of an arbitrary ML model. In particular, we form a contextual knowledge database by incorporating the dataset's label information and the ML model's predictions on the validation dataset. Leveraging the in-context learning capability of LLMs, we ask the LLM to summarise the instances in which the ML model makes mistakes and the correlation between primary predictions and true labels. Following this, the LLM can transfer its acquired knowledge to suggest corrections for the ML model's predictions. Our experimental results on the challenging molecular predictions show that LlmCorr improves the performance of a number of models by up to 39%.  ( 2 min )
    Towards accelerating physical discovery via non-interactive and interactive multi-fidelity Bayesian Optimization: Current challenges and future opportunities
    arXiv:2402.13402v1 Announce Type: new Abstract: Both computational and experimental material discovery bring forth the challenge of exploring multidimensional and often non-differentiable parameter spaces, such as phase diagrams of Hamiltonians with multiple interactions, composition spaces of combinatorial libraries, processing spaces, and molecular embedding spaces. Often these systems are expensive or time-consuming to evaluate a single instance, and hence classical approaches based on exhaustive grid or random search are too data intensive. This resulted in strong interest towards active learning methods such as Bayesian optimization (BO) where the adaptive exploration occurs based on human learning (discovery) objective. However, classical BO is based on a predefined optimization target, and policies balancing exploration and exploitation are purely data driven. In practical settings, the domain expert can pose prior knowledge on the system in form of partially known physics laws and often varies exploration policies during the experiment. Here, we explore interactive workflows building on multi-fidelity BO (MFBO), starting with classical (data-driven) MFBO, then structured (physics-driven) sMFBO, and extending it to allow human in the loop interactive iMFBO workflows for adaptive and domain expert aligned exploration. These approaches are demonstrated over highly non-smooth multi-fidelity simulation data generated from an Ising model, considering spin-spin interaction as parameter space, lattice sizes as fidelity spaces, and the objective as maximizing heat capacity. Detailed analysis and comparison show the impact of physics knowledge injection and on-the-fly human decisions for improved exploration, current challenges, and potential opportunities for algorithm development with combining data, physics and real time human decisions.  ( 3 min )
    CausalLM is not optimal for in-context learning
    arXiv:2308.06912v3 Announce Type: replace Abstract: Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples to attend to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.  ( 2 min )
    Reinforcement learning-assisted quantum architecture search for variational quantum algorithms
    arXiv:2402.13754v1 Announce Type: cross Abstract: A significant hurdle in the noisy intermediate-scale quantum (NISQ) era is identifying functional quantum circuits. These circuits must also adhere to the constraints imposed by current quantum hardware limitations. Variational quantum algorithms (VQAs), a class of quantum-classical optimization algorithms, were developed to address these challenges in the currently available quantum devices. However, the overall performance of VQAs depends on the initialization strategy of the variational circuit, the structure of the circuit (also known as ansatz), and the configuration of the cost function. Focusing on the structure of the circuit, in this thesis, we improve the performance of VQAs by automating the search for an optimal structure for the variational circuits using reinforcement learning (RL). Within the thesis, the optimality of a circuit is determined by evaluating its depth, the overall count of gates and parameters, and its accuracy in solving the given problem. The task of automating the search for optimal quantum circuits is known as quantum architecture search (QAS). The majority of research in QAS is primarily focused on a noiseless scenario. Yet, the impact of noise on the QAS remains inadequately explored. In this thesis, we tackle the issue by introducing a tensor-based quantum circuit encoding, restrictions on environment dynamics to explore the search space of possible circuits efficiently, an episode halting scheme to steer the agent to find shorter circuits, a double deep Q-network (DDQN) with an $\epsilon$-greedy policy for better stability. The numerical experiments on noiseless and noisy quantum hardware show that in dealing with various VQAs, our RL-based QAS outperforms existing QAS. Meanwhile, the methods we propose in the thesis can be readily adapted to address a wide range of other VQAs.  ( 3 min )
    LSTSVR-PI: Least square twin support vector regression with privileged information
    arXiv:2312.02596v2 Announce Type: replace Abstract: In an educational setting, a teacher plays a crucial role in various classroom teaching patterns. Similarly, mirroring this aspect of human learning, the learning using privileged information (LUPI) paradigm introduces additional information to instruct learning models during the training stage. A different approach to train the twin variant of the regression model is provided by the new least square twin support vector regression using privileged information (LSTSVR-PI), which integrates the LUPI paradigm to utilize additional sources of information into the least square twin support vector regression. The proposed LSTSVR-PI solves system of linear equations which adds up to the efficiency of the model. Further, we also establish a generalization error bound based on the Rademacher complexity of the proposed model and incorporate the structural risk minimization principle. The proposed LSTSVR-PI fills the gap between the contemporary paradigm of LUPI and classical LSTSVR. Further, to assess the performance of the proposed model, we conduct numerical experiments along with the baseline models across various artificially generated and real-world datasets. The various experiments and statistical analysis infer the superiority of the proposed model. Moreover, as an application, we conduct experiments on time series datasets, which results in the superiority of the proposed LSTSVR-PI.  ( 2 min )
    The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images
    arXiv:2401.08865v3 Announce Type: replace-cross Abstract: This paper investigates discrepancies in how neural networks learn from different imaging domains, which are commonly overlooked when adopting computer vision techniques from the domain of natural images to other specialized domains such as medical images. Recent works have found that the generalization error of a trained network typically increases with the intrinsic dimension ($d_{data}$) of its training set. Yet, the steepness of this relationship varies significantly between medical (radiological) and natural imaging domains, with no existing theoretical explanation. We address this gap in knowledge by establishing and empirically validating a generalization scaling law with respect to $d_{data}$, and propose that the substantial scaling discrepancy between the two considered domains may be at least partially attributed to the higher intrinsic ``label sharpness'' ($K_\mathcal{F}$) of medical imaging datasets, a metric which we propose. Next, we demonstrate an additional benefit of measuring the label sharpness of a training set: it is negatively correlated with the trained model's adversarial robustness, which notably leads to models for medical images having a substantially higher vulnerability to adversarial attack. Finally, we extend our $d_{data}$ formalism to the related metric of learned representation intrinsic dimension ($d_{repr}$), derive a generalization scaling law with respect to $d_{repr}$, and show that $d_{data}$ serves as an upper bound for $d_{repr}$. Our theoretical results are supported by thorough experiments with six models and eleven natural and medical imaging datasets over a range of training set sizes. Our findings offer insights into the influence of intrinsic dataset properties on generalization, representation learning, and robustness in deep neural networks. Code link: https://github.com/mazurowski-lab/intrinsic-properties  ( 3 min )
    Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors
    arXiv:2401.02686v2 Announce Type: replace-cross Abstract: Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. However, the shroud of opacity surrounding the decision-making process of these detectors makes it difficult for security analysts to comprehend. To address this, various explanation approaches have been proposed to explain the predictions by highlighting important features, which have been demonstrated effective in other domains such as computer vision and natural language processing. Unfortunately, an in-depth evaluation of vulnerability-critical features, such as fine-grained vulnerability-related code lines, learned and understood by these explanation approaches remains lacking. In this study, we first evaluate the performance of ten explanation approaches for vulnerability detectors based on graph and sequence representations, measured by two quantitative metrics including fidelity and vulnerability line coverage rate. Our results show that fidelity alone is not sufficient for evaluating these approaches, as fidelity incurs significant fluctuations across different datasets and detectors. We subsequently check the precision of the vulnerability-related code lines reported by the explanation approaches, and find poor accuracy in this task among all of them. This can be attributed to the inefficiency of explainers in selecting important features and the presence of irrelevant artifacts learned by DL-based detectors.  ( 3 min )
    Efficient and Scalable Graph Generation through Iterative Local Expansion
    arXiv:2312.11529v2 Announce Type: replace-cross Abstract: In the realm of generative models for graphs, extensive research has been conducted. However, most existing methods struggle with large graphs due to the complexity of representing the entire joint distribution across all node pairs and capturing both global and local graph structures simultaneously. To overcome these issues, we introduce a method that generates a graph by progressively expanding a single node to a target graph. In each step, nodes and edges are added in a localized manner through denoising diffusion, building first the global structure, and then refining the local details. The local generation avoids modeling the entire joint distribution over all node pairs, achieving substantial computational savings with subquadratic runtime relative to node count while maintaining high expressivity through multiscale generation. Our experiments show that our model achieves state-of-the-art performance on well-established benchmark datasets while successfully scaling to graphs with at least 5000 nodes. Our method is also the first to successfully extrapolate to graphs outside of the training distribution, showcasing a much better generalization capability over existing methods.  ( 2 min )
    Hidden yet quantifiable: A lower bound for confounding strength using randomized trials
    arXiv:2312.03871v2 Announce Type: replace-cross Abstract: In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.  ( 2 min )
    Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI
    arXiv:2311.14024v2 Announce Type: replace-cross Abstract: Cloud formations often obscure optical satellite-based monitoring of the Earth's surface, thus limiting Earth observation (EO) activities such as land cover mapping, ocean color analysis, and cropland monitoring. The integration of machine learning (ML) methods within the remote sensing domain has significantly improved performance on a wide range of EO tasks, including cloud detection and filtering, but there is still much room for improvement. A key bottleneck is that ML methods typically depend on large amounts of annotated data for training, which is often difficult to come by in EO contexts. This is especially true when it comes to cloud optical thickness (COT) estimation. A reliable estimation of COT enables more fine-grained and application-dependent control compared to using pre-specified cloud categories, as is commonly done in practice. To alleviate the COT data scarcity problem, in this work we propose a novel synthetic dataset for COT estimation, that we subsequently leverage for obtaining reliable and versatile cloud masks on real data. In our dataset, top-of-atmosphere radiances have been simulated for 12 of the spectral bands of the Multispectral Imagery (MSI) sensor onboard Sentinel-2 platforms. These data points have been simulated under consideration of different cloud types, COTs, and ground surface and atmospheric profiles. Extensive experimentation of training several ML models to predict COT from the measured reflectivity of the spectral bands demonstrates the usefulness of our proposed dataset. In particular, by thresholding COT estimates from our ML models, we show on two satellite image datasets (one that is publicly available, and one which we have collected and annotated) that reliable cloud masks can be obtained. The synthetic data, the collected real dataset, code and models have been made publicly available at https://github.com/aleksispi/ml-cloud-opt-thick.  ( 3 min )
    Continual Learning Under Language Shift
    arXiv:2311.01200v2 Announce Type: replace-cross Abstract: The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data comes from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and characteristics of languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.  ( 2 min )
    EEG motor imagery decoding: A framework for comparative analysis with channel attention mechanisms
    arXiv:2310.11198v2 Announce Type: replace-cross Abstract: The objective of this study is to investigate the application of various channel attention mechanisms within the domain of brain-computer interface (BCI) for motor imagery decoding. Channel attention mechanisms can be seen as a powerful evolution of spatial filters traditionally used for motor imagery decoding. This study systematically compares such mechanisms by integrating them into a lightweight architecture framework to evaluate their impact. We carefully construct a straightforward and lightweight baseline architecture designed to seamlessly integrate different channel attention mechanisms. This approach is contrary to previous works which only investigate one attention mechanism and usually build a very complex, sometimes nested architecture. Our framework allows us to evaluate and compare the impact of different attention mechanisms under the same circumstances. The easy integration of different channel attention mechanisms as well as the low computational complexity enables us to conduct a wide range of experiments on four datasets to thoroughly assess the effectiveness of the baseline model and the attention mechanisms. Our experiments demonstrate the strength and generalizability of our architecture framework as well as how channel attention mechanisms can improve the performance while maintaining the small memory footprint and low computational complexity of our baseline architecture. Our architecture emphasizes simplicity, offering easy integration of channel attention mechanisms, while maintaining a high degree of generalizability across datasets, making it a versatile and efficient solution for EEG motor imagery decoding within brain-computer interfaces.  ( 3 min )
    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
    arXiv:2310.16834v2 Announce Type: replace-cross Abstract: Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by $25$-$75$\%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive mdoels, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around $6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with $32\times$ fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left to right prompting).  ( 2 min )
    QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
    arXiv:2310.08041v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.  ( 3 min )
    Posterior Sampling Based on Gradient Flows of the MMD with Negative Distance Kernel
    arXiv:2310.03054v2 Announce Type: replace-cross Abstract: We propose conditional flows of the maximum mean discrepancy (MMD) with the negative distance kernel for posterior sampling and conditional generative modeling. This MMD, which is also known as energy distance, has several advantageous properties like efficient computation via slicing and sorting. We approximate the joint distribution of the ground truth and the observations using discrete Wasserstein gradient flows and establish an error bound for the posterior distributions. Further, we prove that our particle flow is indeed a Wasserstein gradient flow of an appropriate functional. The power of our method is demonstrated by numerical examples including conditional image generation and inverse problems like superresolution, inpainting and computed tomography in low-dose and limited-angle settings.  ( 2 min )
    Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games
    arXiv:2310.01468v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning and planning over multiple conversational turns. However, directly measuring this can be challenging. In this paper, we offer a surrogate problem which assesses an LLMs's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This \textit{entity-deducing game} can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model is capable of imitating a stronger model and generalizing to data or domains, using only the demonstrations from a stronger model. We finally propose to use Reinforcement Learning to enhance reasoning and planning capacity of Vicuna models through episodes of game playing, which lead to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.  ( 3 min )
    PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training
    arXiv:2309.10400v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are trained with a pre-defined context length, restricting their use in scenarios requiring long inputs. Previous efforts for adapting LLMs to a longer length usually requires fine-tuning with this target length (Full-length fine-tuning), suffering intensive training cost. To decouple train length from target length for efficient context window extension, we propose Positional Skip-wisE (PoSE) training that smartly simulates long inputs using a fixed context window. This is achieved by first dividing the original context window into several chunks, then designing distinct skipping bias terms to manipulate the position indices of each chunk. These bias terms and the lengths of each chunk are altered for every training example, allowing the model to adapt to all positions within target length. Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens using a 2k training context window. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and position interpolation strategies. Notably, our method can potentially support infinite length, limited only by memory usage in inference. With ongoing progress for efficient inference, we believe PoSE can further scale the context window beyond 128k.  ( 3 min )
    What Matters to Enhance Traffic Rule Compliance of Imitation Learning for Automated Driving
    arXiv:2309.07808v2 Announce Type: replace-cross Abstract: More research attention has recently been given to end-to-end autonomous driving technologies where the entire driving pipeline is replaced with a single neural network because of its simpler structure and faster inference time. Despite this appealing approach largely reducing the components in the driving pipeline, its simplicity also leads to interpretability problems and safety issues. The trained policy is not always compliant with the traffic rules and it is also hard to discover the reason for the misbehavior because of the lack of intermediate outputs. Meanwhile, sensors are also critical to autonomous driving's security and feasibility to perceive the surrounding environment under complex driving scenarios. In this paper, we proposed P-CSG, a penalty-based imitation learning approach with cross semantics generation sensor fusion technologies to increase the overall performance of end-to-end autonomous driving. In this method, we introduce three penalties - red light, stop sign, and curvature speed penalty to make the agent more sensitive to traffic rules. The proposed cross semantics generation helps to align the shared information from different input modalities. We assessed our model's performance using the CARLA leaderboard - Town 05 Long benchmark and Longest6 Benchmark, achieving an impressive driving score improvement. Furthermore, we conducted robustness evaluations against adversarial attacks like FGSM and Dot attacks, revealing a substantial increase in robustness compared to baseline models. More detailed information, such as code base resources, and videos can be found at https://hk-zh.github.io/p-csg-plus.  ( 3 min )
    Cross-domain Sound Recognition for Efficient Underwater Data Analysis
    arXiv:2309.03451v2 Announce Type: replace-cross Abstract: This paper presents a novel deep learning approach for analyzing massive underwater acoustic data by leveraging a model trained on a broad spectrum of non-underwater (aerial) sounds. Recognizing the challenge in labeling vast amounts of underwater data, we propose a two-fold methodology to accelerate this labor-intensive procedure. The first part of our approach involves PCA and UMAP visualization of the underwater data using the feature vectors of an aerial sound recognition model. This enables us to cluster the data in a two dimensional space and listen to points within these clusters to understand their defining characteristics. This innovative method simplifies the process of selecting candidate labels for further training. In the second part, we train a neural network model using both the selected underwater data and the non-underwater dataset. We conducted a quantitative analysis to measure the precision, recall, and F1 score of our model for recognizing airgun sounds, a common type of underwater sound. The F1 score achieved by our model exceeded 84.3%, demonstrating the effectiveness of our approach in analyzing underwater acoustic data. The methodology presented in this paper holds significant potential to reduce the amount of labor required in underwater data analysis and opens up new possibilities for further research in the field of cross-domain data analysis.  ( 3 min )
    Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
    arXiv:2308.14456v2 Announce Type: replace-cross Abstract: Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization and multi-level feature exploitation.  ( 2 min )
    A Review of Driver Gaze Estimation and Application in Gaze Behavior Understanding
    arXiv:2307.01470v2 Announce Type: replace-cross Abstract: Driver gaze plays an important role in different gaze-based applications such as driver attentiveness detection, visual distraction detection, gaze behavior understanding, and building driver assistance system. The main objective of this study is to perform a comprehensive summary of driver gaze fundamentals, methods to estimate driver gaze, and it's applications in real world driving scenarios. We first discuss the fundamentals related to driver gaze, involving head-mounted and remote setup based gaze estimation and the terminologies used for each of these data collection methods. Next, we list out the existing benchmark driver gaze datasets, highlighting the collection methodology and the equipment used for such data collection. This is followed by a discussion of the algorithms used for driver gaze estimation, which primarily involves traditional machine learning and deep learning based techniques. The estimated driver gaze is then used for understanding gaze behavior while maneuvering through intersections, on-ramps, off-ramps, lane changing, and determining the effect of roadside advertising structures. Finally, we have discussed the limitations in the existing literature, challenges, and the future scope in driver gaze estimation and gaze-based applications.  ( 2 min )
    How Sparse Can We Prune A Deep Network: A Fundamental Limit Viewpoint
    arXiv:2306.05857v2 Announce Type: replace-cross Abstract: Network pruning is an effective measure to alleviate the storage and computational burden of deep neural networks arising from its high overparameterization. Thus raises a fundamental question: How sparse can we prune a deep network without sacrifice on the performance? To address this problem, in this work we'll take a first principles approach, i.e. we directly impose the sparsity constraint on the original loss function and then characterize the necessary and sufficient condition of the sparsity (\textit{which turns out to nearly coincide}) by leveraging the notion of \textit{statistical dimension} in convex geometry. Through this fundamental limit, we're able to identify two key factors that determine the pruning ratio limit, i.e., weight magnitude and network flatness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller pruning ratio. In addition, we provide efficient countermeasures to address the challenges in computing the pruning limit, which involves accurate spectrum estimation of a large-scale and non-positive Hessian matrix. Moreover, through the lens of the pruning ratio threshold, we can provide rigorous interpretations on several heuristics in existing pruning algorithms. Extensive experiments are performed that demonstrate that the our theoretical pruning ratio threshold coincides very well with the experiments. All codes are available at: https://github.com/QiaozheZhang/Global-One-shot-Pruning  ( 2 min )
    SENS: Part-Aware Sketch-based Implicit Neural Shape Modeling
    arXiv:2306.06088v2 Announce Type: replace-cross Abstract: We present SENS, a novel method for generating and editing 3D models from hand-drawn sketches, including those of abstract nature. Our method allows users to quickly and easily sketch a shape, and then maps the sketch into the latent space of a part-aware neural implicit shape architecture. SENS analyzes the sketch and encodes its parts into ViT patch encoding, subsequently feeding them into a transformer decoder that converts them to shape embeddings suitable for editing 3D neural implicit shapes. SENS provides intuitive sketch-based generation and editing, and also succeeds in capturing the intent of the user's sketch to generate a variety of novel and expressive 3D shapes, even from abstract and imprecise sketches. Additionally, SENS supports refinement via part reconstruction, allowing for nuanced adjustments and artifact removal. It also offers part-based modeling capabilities, enabling the combination of features from multiple sketches to create more complex and customized 3D shapes. We demonstrate the effectiveness of our model compared to the state-of-the-art using objective metric evaluation criteria and a user study, both indicating strong performance on sketches with a medium level of abstraction. Furthermore, we showcase our method's intuitive sketch-based shape editing capabilities, and validate it through a usability study.  ( 2 min )
    dotears: Scalable, consistent DAG estimation using observational and interventional data
    arXiv:2305.19215v2 Announce Type: replace-cross Abstract: New biological assays like Perturb-seq link highly parallel CRISPR interventions to a high-dimensional transcriptomic readout, providing insight into gene regulatory networks. Causal gene regulatory networks can be represented by directed acyclic graph (DAGs), but learning DAGs from observational data is complicated by lack of identifiability and a combinatorial solution space. Score-based structure learning improves practical scalability of inferring DAGs. Previous score-based methods are sensitive to error variance structure; on the other hand, estimation of error variance is difficult without prior knowledge of structure. Accordingly, we present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework which leverages observational and interventional data to infer a single causal structure, assuming a linear Structural Equation Model (SEM). $\texttt{dotears}$ exploits structural consequences of hard interventions to give a marginal estimate of exogenous error structure, bypassing the circular estimation problem. We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions. $\texttt{dotears}$ outperforms other methods in varied simulations, and in real data infers edges that validate with higher precision and recall than state-of-the-art methods through differential expression tests and high-confidence protein-protein interactions.  ( 2 min )
    InstructIE: A Bilingual Instruction-based Information Extraction Dataset
    arXiv:2305.11527v2 Announce Type: replace-cross Abstract: Traditional information extraction (IE) methodologies, constrained by pre-defined classes and static training paradigms, often falter in adaptability, especially in the dynamic world. To bridge this gap, we explore an instruction-based IE paradigm in this paper, leveraging the substantial cross-task generalization capabilities of Large Language Models (LLMs). We observe that most existing IE datasets tend to be overly redundant in their label sets, which leads to the inclusion of numerous labels not directly relevant to the extraction content when constructing instructions. To tackle this issue, we introduce a bilingual theme-centric IE instruction dataset (Chinese and English), InstructIE, and for the first time, incorporate a theme scheme design that effectively simplifies the label structure. Furthermore, we develop an innovative framework named KG2Instruction, which is specifically designed for the automatic generation of such datasets. Experimental evaluations based on InstructIE reveal that while current models show promise in Instruction-based IE tasks, opportunities for their potential optimization also emerge. The dataset is available at https://huggingface.co/datasets/zjunlp/InstructIE.  ( 2 min )
    TESS: Text-to-Text Self-Conditioned Simplex Diffusion
    arXiv:2305.08379v2 Announce Type: replace-cross Abstract: Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models to natural language remains challenging due to its discrete nature and the need for a large number of diffusion steps to generate text, making diffusion-based generation expensive. In this work, we propose Text-to-text Self-conditioned Simplex Diffusion (TESS), a text diffusion model that is fully non-autoregressive, employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the learned embedding space. Through extensive experiments on natural language understanding and generation tasks including summarization, text simplification, paraphrase generation, and question generation, we demonstrate that TESS outperforms state-of-the-art non-autoregressive models, requires fewer diffusion steps with minimal drop in performance, and is competitive with pretrained autoregressive sequence-to-sequence models. We publicly release our codebase at https://github.com/allenai/tess-diffusion.  ( 2 min )
    Automated Design of Metaheuristic Algorithms: A Survey
    arXiv:2303.06532v3 Announce Type: replace-cross Abstract: Metaheuristics have gained great success in academia and practice because their search logic can be applied to any problem with available solution representation, solution quality evaluation, and certain notions of locality. Manually designing metaheuristic algorithms for solving a target problem is criticized for being laborious, error-prone, and requiring intensive specialized knowledge. This gives rise to increasing interest in automated design of metaheuristic algorithms. With computing power to fully explore potential design choices, the automated design could reach and even surpass human-level design and could make high-performance algorithms accessible to a much wider range of researchers and practitioners. This paper presents a broad picture of automated design of metaheuristic algorithms, by conducting a survey on the common grounds and representative techniques in terms of design space, design strategies, performance evaluation strategies, and target problems in this field.  ( 2 min )
    PC-JeDi: Diffusion for Particle Cloud Generation in High Energy Physics
    arXiv:2303.05376v2 Announce Type: replace-cross Abstract: In this paper, we present a new method to efficiently generate jets in High Energy Physics called PC-JeDi. This method utilises score-based diffusion models in conjunction with transformers which are well suited to the task of generating jets as particle clouds due to their permutation equivariance. PC-JeDi achieves competitive performance with current state-of-the-art methods across several metrics that evaluate the quality of the generated jets. Although slower than other models, due to the large number of forward passes required by diffusion models, it is still substantially faster than traditional detailed simulation. Furthermore, PC-JeDi uses conditional generation to produce jets with a desired mass and transverse momentum for two different particles, top quarks and gluons.  ( 2 min )
    Learning-based Online Optimization for Autonomous Mobility-on-Demand Fleet Control
    arXiv:2302.03963v2 Announce Type: replace-cross Abstract: Autonomous mobility-on-demand systems are a viable alternative to mitigate many transportation-related externalities in cities, such as rising vehicle volumes in urban areas and transportation-related pollution. However, the success of these systems heavily depends on efficient and effective fleet control strategies. In this context, we study online control algorithms for autonomous mobility-on-demand systems and develop a novel hybrid combinatorial optimization enriched machine learning pipeline which learns online dispatching and rebalancing policies from optimal full-information solutions. We test our hybrid pipeline on large-scale real-world scenarios with different vehicle fleet sizes and various request densities. We show that our approach outperforms state-of-the-art greedy, and model-predictive control approaches with respect to various KPIs, e.g., by up to 17.1% and on average by 6.3% in terms of realized profit.  ( 2 min )
    Boosting Object Representation Learning via Motion and Object Continuity
    arXiv:2211.09771v3 Announce Type: replace-cross Abstract: Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, they may produce suboptimal object encodings for downstream tasks. To overcome this, we propose to exploit object motion and continuity, i.e., objects do not pop in and out of existence. This is accomplished through two mechanisms: (i) providing priors on the location of objects through integration of optical flow, and (ii) a contrastive object continuity loss across consecutive image frames. Rather than developing an explicit deep architecture, the resulting Motion and Object Continuity (MOC) scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performances of a SOTA model in terms of object discovery, convergence speed and overall latent object representations, particularly for playing Atari games. Overall, we show clear benefits of integrating motion and object continuity for downstream tasks, moving beyond object representation learning based only on reconstruction.  ( 2 min )
    Approximation of optimization problems with constraints through kernel Sum-Of-Squares
    arXiv:2301.06339v2 Announce Type: replace-cross Abstract: Handling an infinite number of inequality constraints in infinite-dimensional spaces occurs in many fields, from global optimization to optimal transport. These problems have been tackled individually in several previous articles through kernel Sum-Of-Squares (kSoS) approximations. We propose here a unified theorem to prove convergence guarantees for these schemes. Pointwise inequalities are turned into equalities within a class of nonnegative kSoS functions. Assuming further that the functions appearing in the problem are smooth, focusing on pointwise equality constraints enables the use of scattering inequalities to mitigate the curse of dimensionality in sampling the constraints. Our approach is illustrated in learning vector fields with side information, here the invariance of a set.  ( 2 min )
    The non-overlapping statistical approximation to overlapping group lasso
    arXiv:2211.09221v3 Announce Type: replace-cross Abstract: Group lasso is a commonly used regularization method in statistical learning in which parameters are eliminated from the model according to predefined groups. However, when the groups overlap, optimizing the group lasso penalized objective can be time-consuming on large-scale problems because of the non-separability induced by the overlapping groups. This bottleneck has seriously limited the application of overlapping group lasso regularization in many modern problems, such as gene pathway selection and graphical model estimation. In this paper, we propose a separable penalty as an approximation of the overlapping group lasso penalty. Thanks to the separability, the computation of regularization based on our penalty is substantially faster than that of the overlapping group lasso, especially for large-scale and high-dimensional problems. We show that the penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, we show that the estimator based on the proposed separable penalty is statistically equivalent to the one based on the overlapping group lasso penalty with respect to their error bounds and the rate-optimal performance under the squared loss. We demonstrate the faster computational time and statistical equivalence of our method compared with the overlapping group lasso in simulation examples and a classification problem of cancer tumors based on gene expression and multiple gene pathways.  ( 2 min )
    SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates
    arXiv:2211.08597v5 Announce Type: replace-cross Abstract: SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems with dense and sparse data, show that SketchySGD equipped with its default hyperparameters can achieve comparable or better results than popular stochastic gradient methods, even when they have been tuned to yield their best performance. In particular, SketchySGD is able to solve an ill-conditioned logistic regression problem with a data matrix that takes more than $840$GB RAM to store, while its competitors, even when tuned, are unable to make any progress. SketchySGD's ability to work out-of-the box with its default hyperparameters and excel on ill-conditioned problems is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance and degrade in the presence of ill-conditioning.  ( 3 min )
    CLEEGN: A Convolutional Neural Network for Plug-and-Play Automatic EEG Reconstruction
    arXiv:2210.05988v2 Announce Type: replace-cross Abstract: Human electroencephalography (EEG) is a brain monitoring modality that senses cortical neuroelectrophysiological activity in high-temporal resolution. One of the greatest challenges posed in applications of EEG is the unstable signal quality susceptible to inevitable artifacts during recordings. To date, most existing techniques for EEG artifact removal and reconstruction are applicable to offline analysis solely, or require individualized training data to facilitate online reconstruction. We have proposed CLEEGN, a novel convolutional neural network for plug-and-play automatic EEG reconstruction. CLEEGN is based on a subject-independent pre-trained model using existing data and can operate on a new user without any further calibration. The performance of CLEEGN was validated using multiple evaluations including waveform observation, reconstruction error assessment, and decoding accuracy on well-studied labeled datasets. The results of simulated online validation suggest that, even without any calibration, CLEEGN can largely preserve inherent brain activity and outperforms leading online/offline artifact removal methods in the decoding accuracy of reconstructed EEG data. In addition, visualization of model parameters and latent features exhibit the model behavior and reveal explainable insights related to existing knowledge of neuroscience. We foresee pervasive applications of CLEEGN in prospective works of online plug-and-play EEG decoding and analysis.  ( 2 min )
    Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition
    arXiv:2211.07245v2 Announce Type: replace-cross Abstract: The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics.  ( 2 min )
    The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning
    arXiv:2110.14427v4 Announce Type: replace-cross Abstract: The paper concerns the stochastic approximation recursion, \[ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) \,,\quad n\ge 0, \] where the {\em estimates} $\theta_n\in\Re^d$ and $ \{ \Phi_n \}$ is a Markov chain on a general state space. In addition to standard Lipschitz assumptions and conditions on the vanishing step-size sequence, it is assumed that the associated \textit{mean flow} $ \tfrac{d}{dt} \vartheta_t = \bar{f}(\vartheta_t)$, is globally asymptotically stable with stationary point denoted $\theta^*$, where $\bar{f}(\theta)=\text{ E}[f(\theta,\Phi)]$ with $\Phi$ having the stationary distribution of the chain. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3) for the chain: (i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$. (ii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\text{ E} [ z_n z_n^T ]$ to the asymptotic covariance $\Sigma^\Theta$ in the CLT, where $z_n= (\theta_n-\theta^*)/\sqrt{\alpha_n}$. (iii) The CLT holds for the normalized version $z^{\text{ PR}}_n$ of the averaged parameters $\theta^{\text{ PR}}_n$, subject to standard assumptions on the step-size. Moreover, the normalized covariance of both $\theta^{\text{ PR}}_n$ and $z^{\text{ PR}}_n$ converge to $\Sigma^{\text{ PR}}$, the minimal covariance of Polyak and Ruppert. (iv)} An example is given where $f$ and $\bar{f}$ are linear in $\theta$, and the Markov chain is geometrically ergodic but does not satisfy (DV3). While the algorithm is convergent, the second moment of $\theta_n$ is unbounded and in fact diverges.  ( 3 min )
    Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models
    arXiv:2401.06432v2 Announce Type: replace Abstract: Foundation models (FMs) adapt well to specific domains or tasks with fine-tuning, and federated learning (FL) enables the potential for privacy-preserving fine-tuning of the FMs with on-device local data. For federated fine-tuning of FMs, we consider the FMs with small to medium parameter sizes of single digit billion at maximum, referred to as on-device FMs (ODFMs) that can be deployed on devices for inference but can only be fine-tuned with parameter efficient methods. In our work, we tackle the data and system heterogeneity problem of federated fine-tuning of ODFMs by proposing a novel method using heterogeneous low-rank approximations (LoRAs), namely HetLoRA. First, we show that the naive approach of using homogeneous LoRA ranks across devices face a trade-off between overfitting and slow convergence, and thus propose HetLoRA, which allows heterogeneous ranks across client devices and efficiently aggregates and distributes these heterogeneous LoRA modules. By applying rank self-pruning locally and sparsity-weighted aggregation at the server, HetLoRA combines the advantages of high and low-rank LoRAs, which achieves improved convergence speed and final performance compared to homogeneous LoRA. Furthermore, HetLoRA offers enhanced computation efficiency compared to full fine-tuning, making it suitable for federated fine-tuning across heterogeneous devices.  ( 3 min )
    GNNShap: Fast and Accurate GNN Explanations using Shapley Values
    arXiv:2401.04829v2 Announce Type: replace Abstract: Graph neural networks (GNNs) are popular machine learning models for graphs with many applications across scientific domains. However, GNNs are considered black box models, and it is challenging to understand how the model makes predictions. Game theoric Shapley value approaches are popular explanation methods in other domains but are not well-studied for graphs. Some studies have proposed Shapley value based GNN explanations, yet they have several limitations: they consider limited samples to approximate Shapley values; some mainly focus on small and large coalition sizes, and they are an order of magnitude slower than other explanation methods, making them inapplicable to even moderate-size graphs. In this work, we propose GNNShap, which provides explanations for edges since they provide more natural explanations for graphs and more fine-grained explanations. We overcome the limitations by sampling from all coalition sizes, parallelizing the sampling on GPUs, and speeding up model predictions by batching. GNNShap gives better fidelity scores and faster explanations than baselines on real-world datasets. The code is available at https://github.com/HipGraph/GNNShap.  ( 2 min )
    Comparing Machine Learning Algorithms by Union-Free Generic Depth
    arXiv:2312.12839v3 Announce Type: replace Abstract: We propose a framework for descriptively analyzing sets of partial orders based on the concept of depth functions. Despite intensive studies in linear and metric spaces, there is very little discussion on depth functions for non-standard data types such as partial orders. We introduce an adaptation of the well-known simplicial depth to the set of all partial orders, the union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a comparison of machine learning algorithms based on multidimensional performance measures. Concretely, we provide two examples of classifier comparisons on samples of standard benchmark data sets. Our results demonstrate promisingly the wide variety of different analysis approaches based on ufg methods. Furthermore, the examples outline that our approach differs substantially from existing benchmarking approaches, and thus adds a new perspective to the vivid debate on classifier comparison.  ( 2 min )
    Towards Context-Aware Domain Generalization: Understanding the Benefits and Limits of Marginal Transfer Learning
    arXiv:2312.10107v2 Announce Type: replace Abstract: In this work, we analyze the conditions under which information about the context of an input $X$ can improve the predictions of deep learning models in new domains. Following work in marginal transfer learning in Domain Generalization (DG), we formalize the notion of context as a permutation-invariant representation of a set of data points that originate from the same domain as the input itself. We offer a theoretical analysis of the conditions under which this approach can, in principle, yield benefits, and formulate two necessary criteria that can be easily verified in practice. Additionally, we contribute insights into the kind of distribution shifts for which the marginal transfer learning approach promises robustness. Empirical analysis shows that our criteria are effective in discerning both favorable and unfavorable scenarios. Finally, we demonstrate that we can reliably detect scenarios where a model is tasked with unwarranted extrapolation in out-of-distribution (OOD) domains, identifying potential failure cases. Consequently, we showcase a method to select between the most predictive and the most robust model, circumventing the well-known trade-off between predictive performance and robustness.  ( 2 min )
    On the Impact of Multi-dimensional Local Differential Privacy on Fairness
    arXiv:2312.04404v3 Announce Type: replace Abstract: Automated decision systems are increasingly used to make consequential decisions in people's lives. Due to the sensitivity of the manipulated data as well as the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, in particular, fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., multi-dimensional data) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the multi-dimensional approach of LDP (independent vs. combined) matters only at low privacy guarantees, and (3) the outcome Y distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in ML applications.  ( 2 min )
    Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
    arXiv:2312.02119v2 Announce Type: replace Abstract: While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an LLM to iteratively refine candidate (attack) prompts using tree-of-thought reasoning until one of the generated prompts jailbreaks the target. Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks. Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries. Interestingly, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.  ( 2 min )
    A Generative Model for Accelerated Inverse Modelling Using a Novel Embedding for Continuous Variables
    arXiv:2311.11343v2 Announce Type: replace Abstract: In materials science, the challenge of rapid prototyping materials with desired properties often involves extensive experimentation to find suitable microstructures. Additionally, finding microstructures for given properties is typically an ill-posed problem where multiple solutions may exist. Using generative machine learning models can be a viable solution which also reduces the computational cost. This comes with new challenges because, e.g., a continuous property variable as conditioning input to the model is required. We investigate the shortcomings of an existing method and compare this to a novel embedding strategy for generative models that is based on the binary representation of floating point numbers. This eliminates the need for normalization, preserves information, and creates a versatile embedding space for conditioning the generative model. This technique can be applied to condition a network on any number, to provide fine control over generated microstructure images, thereby contributing to accelerated materials design.  ( 2 min )
    A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs
    arXiv:2310.12248v3 Announce Type: replace Abstract: Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.  ( 2 min )
    Adaptive Interventions with User-Defined Goals for Health Behavior Change
    arXiv:2311.09483v2 Announce Type: replace Abstract: Physical inactivity remains a major public health concern, having associations with adverse health outcomes such as cardiovascular disease and type-2 diabetes. Mobile health applications present a promising avenue for low-cost, scalable physical activity promotion, yet often suffer from small effect sizes and low adherence rates, particularly in comparison to human coaching. Goal-setting is a critical component of health coaching that has been underutilized in adaptive algorithms for mobile health interventions. This paper introduces a modification to the Thompson sampling algorithm that places emphasis on individualized goal-setting by optimizing personalized reward functions. As a step towards supporting goal-setting, this paper offers a balanced approach that can leverage shared structure while optimizing individual preferences and goals. We prove that our modification incurs only a constant penalty on the cumulative regret while preserving the sample complexity benefits of data sharing. In a physical activity simulator, we demonstrate that our algorithm achieves substantial improvements in cumulative regret compared to baselines that do not share data or do not optimize for individualized rewards.  ( 2 min )
    Comparing Comparators in Generalization Bounds
    arXiv:2310.10534v2 Announce Type: replace Abstract: We derive generic information-theoretic and PAC-Bayesian generalization bounds involving an arbitrary convex comparator function, which measures the discrepancy between the training and population loss. The bounds hold under the assumption that the cumulant-generating function (CGF) of the comparator is upper-bounded by the corresponding CGF within a family of bounding distributions. We show that the tightest possible bound is obtained with the comparator being the convex conjugate of the CGF of the bounding distribution, also known as the Cram\'er function. This conclusion applies more broadly to generalization bounds with a similar structure. This confirms the near-optimality of known bounds for bounded and sub-Gaussian losses and leads to novel bounds under other bounding distributions.  ( 2 min )
    Adaptive Neural Ranking Framework: Toward Maximized Business Goal for Cascade Ranking Systems
    arXiv:2310.10462v2 Announce Type: replace Abstract: Cascade ranking is widely used for large-scale top-k selection problems in online advertising and recommendation systems, and learning-to-rank is an important way to optimize the models in cascade ranking. Previous works on learning-to-rank usually focus on letting the model learn the complete order or top-k order, and adopt the corresponding rank metrics (e.g. OPA and NDCG@k) as optimization targets. However, these targets can not adapt to various cascade ranking scenarios with varying data complexities and model capabilities; and the existing metric-driven methods such as the Lambda framework can only optimize a rough upper bound of limited metrics, potentially resulting in sub-optimal and performance misalignment. To address these issues, we propose a novel perspective on optimizing cascade ranking systems by highlighting the adaptability of optimization targets to data complexities and model capabilities. Concretely, we employ multi-task learning to adaptively combine the optimization of relaxed and full targets, which refers to metrics Recall@m@k and OPA respectively. We also introduce permutation matrix to represent the rank metrics and employ differentiable sorting techniques to relax hard permutation matrix with controllable approximate error bound. This enables us to optimize both the relaxed and full targets directly and more appropriately. We named this method as Adaptive Neural Ranking Framework (abbreviated as ARF). Furthermore, we give a specific practice under ARF. We use the NeuralSort to obtain the relaxed permutation matrix and draw on the variant of the uncertainty weight method in multi-task learning to optimize the proposed losses jointly. Experiments on a total of 4 public and industrial benchmarks show the effectiveness and generalization of our method, and online experiment shows that our method has significant application value.  ( 3 min )
    Learning Hierarchical Relational Representations through Relational Convolutions
    arXiv:2310.03240v2 Announce Type: replace Abstract: A maturing area of research in deep learning is the study of architectures and inductive biases for learning representations of relational features. In this paper, we focus on the problem of learning representations of hierarchical relations, proposing an architectural framework we call "relational convolutional networks". Given a collection of objects, pairwise relations are modeled via inner products of feature maps. We formalize a relational convolution operation in which graphlet filters are matched against patches of the input (i.e, groupings of objects), capturing the relational pattern in each group of objects. We also propose mechanisms for explicitly learning groupings of objects which are relevant to the downstream task. Composing these operations yields representations of higher-order, hierarchical relations. We present the motivation and details of the architecture, together with a set of experiments to demonstrate how relational convolutional networks can provide an effective framework for modeling relational tasks that have hierarchical structure.  ( 2 min )
    Interpretable Diffusion via Information Decomposition
    arXiv:2310.07972v2 Announce Type: replace Abstract: Denoising diffusion models enable conditional generation and density modeling of complex relationships like images and text. However, the nature of the learned relationships is opaque making it difficult to understand precisely what relationships between words and parts of an image are captured, or to predict the effect of an intervention. We illuminate the fine-grained relationships learned by diffusion models by noticing a precise relationship between diffusion and information decomposition. Exact expressions for mutual information and conditional mutual information can be written in terms of the denoising model. Furthermore, pointwise estimates can be easily estimated as well, allowing us to ask questions about the relationships between specific images and captions. Decomposing information even further to understand which variables in a high-dimensional space carry information is a long-standing problem. For diffusion models, we show that a natural non-negative decomposition of mutual information emerges, allowing us to quantify informative relationships between words and pixels in an image. We exploit these new relations to measure the compositional understanding of diffusion models, to do unsupervised localization of objects in images, and to measure effects when selectively editing images through prompt interventions.  ( 2 min )
    TranDRL: A Transformer-Driven Deep Reinforcement Learning Enabled Prescriptive Maintenance Framework
    arXiv:2309.16935v2 Announce Type: replace Abstract: Industrial systems demand reliable predictive maintenance strategies to enhance operational efficiency and reduce downtime. This paper introduces an integrated framework that leverages the capabilities of the Transformer model-based neural networks and deep reinforcement learning (DRL) algorithms to optimize system maintenance actions. Our approach employs the Transformer model to effectively capture complex temporal patterns in sensor data, thereby accurately predicting the remaining useful life (RUL) of an equipment. Additionally, the DRL component of our framework provides cost-effective and timely maintenance recommendations. We validate the efficacy of our framework on the NASA C-MPASS dataset, where it demonstrates significant advancements in both RUL prediction accuracy and the optimization of maintenance actions, compared to the other prevalent machine learning-based methods. Our proposed approach provides an innovative data-driven framework for industry machine systems, accurately forecasting equipment lifespans and optimizing maintenance schedules, thereby reducing downtime and cutting costs.  ( 2 min )
    Are Human-generated Demonstrations Necessary for In-context Learning?
    arXiv:2309.14681v4 Announce Type: replace Abstract: Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers the disadvantages of susceptibility to selected demonstrations and the intricacy to generate these demonstrations. In this paper, we raise the fundamental question that whether human-generated demonstrations are necessary for ICL. To answer this question, we propose self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both the vanilla ICL and the chain-of-thought (CoT), but with greater ease: as the manual-generation process of both examples and rationale can be saved. Extensive experiments in arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks, show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy, and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to exclusively depend on their own capacity for decision making, removing the need for external training data. Code is available at https://github.com/ruili33/SEC.  ( 3 min )
    Fair Ranking under Disparate Uncertainty
    arXiv:2309.01610v2 Announce Type: replace Abstract: Ranking is a ubiquitous method for focusing the attention of human evaluators on a manageable subset of options. Its use as part of human decision-making processes ranges from surfacing potentially relevant products on an e-commerce site to prioritizing college applications for human review. While ranking can make human evaluation more effective by focusing attention on the most promising options, we argue that it can introduce unfairness if the uncertainty of the underlying relevance model differs between groups of options. Unfortunately, such disparity in uncertainty appears widespread, often to the detriment of minority groups for which relevance estimates can have higher uncertainty due to a lack of data or appropriate features. To address this fairness issue, we propose Equal-Opportunity Ranking (EOR) as a new fairness criterion for ranking and show that it corresponds to a group-wise fair lottery among the relevant options even in the presence of disparate uncertainty. Furthermore, EOR optimizes for an even cost burden on all groups, unlike the conventional Probability Ranking Principle. In contrast to affirmative action interventions like proportional Rooney rule constraints, EOR does not require the designation of a disadvantaged group. To make EOR ranking practical, we present an efficient algorithm for computing it in time $O(n \log(n))$ and prove its close approximation guarantee to the globally optimal solution. In a comprehensive empirical evaluation on synthetic data, a US Census dataset, and a real-world audit of Amazon search queries, we find that the algorithm reliably guarantees EOR fairness while providing effective rankings.  ( 3 min )
    A new method of modeling the multi-stage decision-making process of CRT using machine learning with uncertainty quantification
    arXiv:2309.08415v3 Announce Type: replace Abstract: Aims. The purpose of this study is to create a multi-stage machine learning model to predict cardiac resynchronization therapy (CRT) response for heart failure (HF) patients. This model exploits uncertainty quantification to recommend additional collection of single-photon emission computed tomography myocardial perfusion imaging (SPECT MPI) variables if baseline clinical variables and features from electrocardiogram (ECG) are not sufficient. Methods. 218 patients who underwent rest-gated SPECT MPI were enrolled in this study. CRT response was defined as an increase in left ventricular ejection fraction (LVEF) > 5% at a 6+-1 month follow-up. A multi-stage ML model was created by combining two ensemble models: Ensemble 1 was trained with clinical variables and ECG; Ensemble 2 included Ensemble 1 plus SPECT MPI features. Uncertainty quantification from Ensemble 1 allowed for multi-stage decision-making to determine if the acquisition of SPECT data for a patient is necessary. The performance of the multi-stage model was compared with that of Ensemble models 1 and 2. Results. The response rate for CRT was 55.5% (n = 121) with overall male gender 61.0% (n = 133), an average age of 62.0+-11.8, and LVEF of 27.7+-11.0. The multi-stage model performed similarly to Ensemble 2 (which utilized the additional SPECT data) with AUC of 0.75 vs. 0.77, accuracy of 0.71 vs. 0.69, sensitivity of 0.70 vs. 0.72, and specificity 0.72 vs. 0.65, respectively. However, the multi-stage model only required SPECT MPI data for 52.7% of the patients across all folds. Conclusions. By using rule-based logic stemming from uncertainty quantification, the multi-stage model was able to reduce the need for additional SPECT MPI data acquisition without sacrificing performance.  ( 3 min )
    Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks
    arXiv:2308.16800v2 Announce Type: replace Abstract: Our study reveals new theoretical insights into over-smoothing and feature over-correlation in deep graph neural networks. We show the prevalence of invariant subspaces, demonstrating a fixed relative behavior that is unaffected by feature transformations. Our work clarifies recent observations related to convergence to a constant state and a potential over-separation of node states, as the amplification of subspaces only depends on the spectrum of the aggregation function. In linear scenarios, this leads to node representations being dominated by a low-dimensional subspace with an asymptotic convergence rate independent of the feature transformations. This causes a rank collapse of the node representations, resulting in over-smoothing when smooth vectors span this subspace, and over-correlation even when over-smoothing is avoided. Guided by our theory, we propose a sum of Kronecker products as a beneficial property that can provably prevent over-smoothing, over-correlation, and rank collapse. We empirically extend our insights to the non-linear case, demonstrating the inability of existing models to capture linearly independent features.  ( 2 min )
    Collaborative Information Dissemination with Graph-based Multi-Agent Reinforcement Learning
    arXiv:2308.16198v3 Announce Type: replace Abstract: Efficient information dissemination is crucial for supporting critical operations across domains like disaster response, autonomous vehicles, and sensor networks. This paper introduces a Multi-Agent Reinforcement Learning (MARL) approach as a significant step forward in achieving more decentralized, efficient, and collaborative information dissemination. We propose a Partially Observable Stochastic Game (POSG) formulation for information dissemination empowering each agent to decide on message forwarding independently, based on the observation of their one-hop neighborhood. This constitutes a significant paradigm shift from heuristics currently employed in real-world broadcast protocols. Our novel approach harnesses Graph Convolutional Reinforcement Learning and Graph Attention Networks (GATs) with dynamic attention to capture essential network features. We propose two approaches, L-DyAN and HL-DyAN, which differ in terms of the information exchanged among agents. Our experimental results show that our trained policies outperform existing methods, including the state-of-the-art heuristic, in terms of network coverage as well as communication overhead on dynamic networks of varying density and behavior.  ( 2 min )
    An Adaptive Tangent Feature Perspective of Neural Networks
    arXiv:2308.15478v3 Announce Type: replace Abstract: In order to better understand feature learning in neural networks, we propose a framework for understanding linear models in tangent feature space where the features are allowed to be transformed during training. We consider linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear interpolation constraint. We show that this optimization problem has an equivalent linearly constrained optimization with structured regularization that encourages approximately low rank solutions. Specializing to neural network structure, we gain insights into how the features and thus the kernel function change, providing additional nuance to the phenomenon of kernel alignment when the target function is poorly represented using tangent features. We verify our theoretical observations in the kernel alignment of real neural networks.  ( 2 min )
    Fixing confirmation bias in feature attribution methods via semantic match
    arXiv:2307.00897v2 Announce Type: replace Abstract: Feature attribution methods have become a staple method to disentangle the complex behavior of black box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cin\`a et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis on the metrics to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.  ( 3 min )
    NeuralFuse: Learning to Recover the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes
    arXiv:2306.16869v2 Announce Type: replace Abstract: Deep neural networks (DNNs) have become ubiquitous in machine learning, but their energy consumption remains a notable issue. Lowering the supply voltage is an effective strategy for reducing energy consumption. However, aggressively scaling down the supply voltage can lead to accuracy degradation due to random bit flips in static random access memory (SRAM) where model parameters are stored. To address this challenge, we introduce NeuralFuse, a novel add-on module that addresses the accuracy-energy tradeoff in low-voltage regimes by learning input transformations to generate error-resistant data representations. NeuralFuse protects DNN accuracy in both nominal and low-voltage scenarios. Moreover, NeuralFuse is easy to implement and can be readily applied to DNNs with limited access, such as non-configurable hardware or remote access to cloud-based APIs. Experimental results demonstrate that, at a 1% bit error rate, NeuralFuse can reduce SRAM memory access energy by up to 24% while recovering accuracy by up to 57%. To the best of our knowledge, this is the first model-agnostic approach (i.e., no model retraining) to address low-voltage-induced bit errors. The source code is available at https://github.com/IBM/NeuralFuse.  ( 2 min )
    Latent SDEs on Homogeneous Spaces
    arXiv:2306.16248v3 Announce Type: replace Abstract: We consider the problem of variational Bayesian inference in a latent variable model where a (possibly complex) observed stochastic process is governed by the solution of a latent stochastic differential equation (SDE). Motivated by the challenges that arise when trying to learn an (almost arbitrary) latent neural SDE from data, such as efficient gradient computation, we take a step back and study a specific subclass instead. In our case, the SDE evolves on a homogeneous latent space and is induced by stochastic dynamics of the corresponding (matrix) Lie group. In learning problems, SDEs on the unit n-sphere are arguably the most relevant incarnation of this setup. Notably, for variational inference, the sphere not only facilitates using a truly uninformative prior, but we also obtain a particularly simple and intuitive expression for the Kullback-Leibler divergence between the approximate posterior and prior process in the evidence lower bound. Experiments demonstrate that a latent SDE of the proposed type can be learned efficiently by means of an existing one-step geometric Euler-Maruyama scheme. Despite restricting ourselves to a less rich class of SDEs, we achieve competitive or even state-of-the-art results on various time series interpolation/classification problems.  ( 2 min )
    Globally Interpretable Graph Learning via Distribution Matching
    arXiv:2306.10447v2 Announce Type: replace Abstract: Graph neural networks (GNNs) have emerged as a powerful model to capture critical graph patterns. Instead of treating them as black boxes in an end-to-end fashion, attempts are arising to explain the model behavior. Existing works mainly focus on local interpretation to reveal the discriminative pattern for each individual instance, which however cannot directly reflect the high-level model behavior across instances. To gain global insights, we aim to answer an important question that is not yet well studied: how to provide a global interpretation for the graph learning procedure? We formulate this problem as globally interpretable graph learning, which targets on distilling high-level and human-intelligible patterns that dominate the learning procedure, such that training on this pattern can recover a similar model. As a start, we propose a novel model fidelity metric, tailored for evaluating the fidelity of the resulting model trained on interpretations. Our preliminary analysis shows that interpretative patterns generated by existing global methods fail to recover the model training procedure. Thus, we further propose our solution, Graph Distribution Matching (GDM), which synthesizes interpretive graphs by matching the distribution of the original and interpretive graphs in the GNN's feature space as its training proceeds, thus capturing the most informative patterns the model learns during training. Extensive experiments on graph classification datasets demonstrate multiple advantages of the proposed method, including high model fidelity, predictive accuracy and time efficiency, as well as the ability to reveal class-relevant structure.  ( 3 min )
    A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning
    arXiv:2306.07541v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) provides a promising solution to learning an agent fully relying on a data-driven paradigm. However, constrained by the limited quality of the offline dataset, its performance is often sub-optimal. Therefore, it is desired to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL can be challenging due to two main challenges: constrained exploratory behavior and state-action distribution shift. To this end, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solution to both challenges with the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy to select informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method by applying conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples to smoothly bridge offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in D4RL benchmark.  ( 2 min )
    AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks
    arXiv:2306.08107v3 Announce Type: replace Abstract: The fields of both Natural Language Processing (NLP) and Automated Machine Learning (AutoML) have achieved remarkable results over the past years. In NLP, especially Large Language Models (LLMs) have experienced a rapid series of breakthroughs very recently. We envision that the two fields can radically push the boundaries of each other through tight integration. To showcase this vision, we explore the potential of a symbiotic relationship between AutoML and LLMs, shedding light on how they can benefit each other. In particular, we investigate both the opportunities to enhance AutoML approaches with LLMs from different perspectives and the challenges of leveraging AutoML to further improve LLMs. To this end, we survey existing work, and we critically assess risks. We strongly believe that the integration of the two fields has the potential to disrupt both fields, NLP and AutoML. By highlighting conceivable synergies, but also risks, we aim to foster further exploration at the intersection of AutoML and LLMs.  ( 2 min )
    Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
    arXiv:2305.17212v3 Announce Type: replace Abstract: This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation -- a proxy for the effective learning rate -- across different layers and neurons. Our work analyzes these dynamics across optimizers like Adam, Lion, and SGD with momentum, offering a new simple perspective on training that elucidates the efficacy of widely used but poorly understood methods in deep learning. We demonstrate how balanced rotation plays a key role in the effectiveness of normalization like Weight Standardization, as well as that of AdamW over Adam with L2-regularization. Finally, we show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.  ( 2 min )
    Data Redaction from Conditional Generative Models
    arXiv:2305.11351v2 Announce Type: replace Abstract: Deep generative models are known to produce undesirable samples such as harmful content. Traditional mitigation methods include re-training from scratch, filtering, or editing; however, these are either computationally expensive or can be circumvented by third parties. In this paper, we take a different approach and study how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that will, with high probability, lead to undesirable content. This is done by distilling the conditioning network in the models, giving a solution that is effective, efficient, controllable, and universal for a class of deep generative models. We conduct experiments on redacting prompts in text-to-image models and redacting voices in text-to-speech models. Our method is computationally light, leads to better redaction quality and robustness than baseline methods while still retaining high generation quality.  ( 2 min )
    Convergence of variational Monte Carlo simulation and scale-invariant pre-training
    arXiv:2303.11602v3 Announce Type: replace Abstract: We provide theoretical convergence bounds for the variational Monte Carlo (VMC) method as applied to optimize neural network wave functions for the electronic structure problem. We study both the energy minimization phase and the supervised pre-training phase that is commonly used prior to energy minimization. For the energy minimization phase, the standard algorithm is scale-invariant by design, and we provide a proof of convergence for this algorithm without modifications. The pre-training stage typically does not feature such scale-invariance. We propose using a scale-invariant loss for the pretraining phase and demonstrate empirically that it leads to faster pre-training.  ( 2 min )
    The Normalized Cross Density Functional: A Framework to Quantify Statistical Dependence for Random Processes
    arXiv:2212.04631v3 Announce Type: replace Abstract: This paper presents a novel approach to measuring statistical dependence between two random processes (r.p.) using a positive-definite function called the Normalized Cross Density (NCD). NCD is derived directly from the probability density functions of two r.p. and constructs a data-dependent Hilbert space, the Normalized Cross-Density Hilbert Space (NCD-HS). By Mercer's Theorem, the NCD norm can be decomposed into its eigenspectrum, which we name the Multivariate Statistical Dependence (MSD) measure, and their sum, the Total Dependence Measure (TSD). Hence, the NCD-HS eigenfunctions serve as a novel embedded feature space, suitable for quantifying r.p. statistical dependence. In order to apply NCD directly to r.p. realizations, we introduce an architecture with two multiple-output neural networks, a cost function, and an algorithm named the Functional Maximal Correlation Algorithm (FMCA). With FMCA, the two networks learn concurrently by approximating each other's outputs, extending the Alternating Conditional Expectation (ACE) for multivariate functions. We mathematically prove that FMCA learns the dominant eigenvalues and eigenfunctions of NCD directly from realizations. Preliminary results with synthetic data and medium-sized image datasets corroborate the theory. Different strategies for applying NCD are proposed and discussed, demonstrating the method's versatility and stability beyond supervised learning. Specifically, when the two r.p. are high-dimensional real-world images and a white uniform noise process, FMCA learns factorial codes, i.e., the occurrence of a code guarantees that a specific training set image was present, which is important for feature learning.  ( 3 min )
    Self-supervised Representation Learning on Electronic Health Records with Graph Kernel Infomax
    arXiv:2209.00655v2 Announce Type: replace Abstract: Learning Electronic Health Records (EHRs) representation is a preeminent yet under-discovered research topic. It benefits various clinical decision support applications, e.g., medication outcome prediction or patient similarity search. Current approaches focus on task-specific label supervision on vectorized sequential EHR, which is not applicable to large-scale unsupervised scenarios. Recently, contrastive learning shows great success on self-supervised representation learning problems. However, complex temporality often degrades the performance. We propose Graph Kernel Infomax, a self-supervised graph kernel learning approach on the graphical representation of EHR, to overcome the previous problems. Unlike the state-of-the-art, we do not change the graph structure to construct augmented views. Instead, we use Kernel Subspace Augmentation to embed nodes into two geometrically different manifold views. The entire framework is trained by contrasting nodes and graph representations on those two manifold views through the commonly used contrastive objectives. Empirically, using publicly available benchmark EHR datasets, our approach yields performance on clinical downstream tasks that exceeds the state-of-the-art. Theoretically, the variation on distance metrics naturally creates different views as data augmentation without changing graph structures.  ( 2 min )
    Mildly Conservative Q-Learning for Offline Reinforcement Learning
    arXiv:2206.04745v3 Announce Type: replace Abstract: Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.  ( 2 min )
    Trustworthy Graph Neural Networks: Aspects, Methods and Trends
    arXiv:2205.07424v2 Announce Type: replace Abstract: Graph neural networks (GNNs) have emerged as a series of competent graph learning methods for diverse real-world scenarios, ranging from daily applications like recommendation systems and question answering to cutting-edge technologies such as drug discovery in life sciences and n-body simulation in astrophysics. However, task performance is not the only requirement for GNNs. Performance-oriented GNNs have exhibited potential adverse effects like vulnerability to adversarial attacks, unexplainable discrimination against disadvantaged groups, or excessive resource consumption in edge computing environments. To avoid these unintentional harms, it is necessary to build competent GNNs characterised by trustworthiness. To this end, we propose a comprehensive roadmap to build trustworthy GNNs from the view of the various computing technologies involved. In this survey, we introduce basic concepts and comprehensively summarise existing efforts for trustworthy GNNs from six aspects, including robustness, explainability, privacy, fairness, accountability, and environmental well-being. Additionally, we highlight the intricate cross-aspect relations between the above six aspects of trustworthy GNNs. Finally, we present a thorough overview of trending directions for facilitating the research and industrialisation of trustworthy GNNs.  ( 2 min )
    Conflict-Averse Gradient Descent for Multi-task Learning
    arXiv:2110.14048v2 Announce Type: replace Abstract: The goal of multi-task learning is to enable more efficient learning than single task learning by sharing model structures for a diverse set of tasks. A standard multi-task learning objective is to minimize the average loss across all tasks. While straightforward, using this objective often results in much worse final performance for each task than learning them independently. A major challenge in optimizing a multi-task model is the conflicting gradients, where gradients of different task objectives are not well aligned so that following the average gradient direction can be detrimental to specific tasks' performance. Previous work has proposed several heuristics to manipulate the task gradients for mitigating this problem. But most of them lack convergence guarantee and/or could converge to any Pareto-stationary point. In this paper, we introduce Conflict-Averse Gradient descent (CAGrad) which minimizes the average loss function, while leveraging the worst local improvement of individual tasks to regularize the algorithm trajectory. CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss. It includes the regular gradient descent (GD) and the multiple gradient descent algorithm (MGDA) in the multi-objective optimization (MOO) literature as special cases. On a series of challenging multi-task supervised learning and reinforcement learning tasks, CAGrad achieves improved performance over prior state-of-the-art multi-objective gradient manipulation methods.  ( 3 min )
    Machine Learning with a Reject Option: A survey
    arXiv:2107.11277v3 Announce Type: replace Abstract: Machine learning models always make a prediction, even when it is likely to be inaccurate. This behavior should be avoided in many decision support applications, where mistakes can have severe consequences. Albeit already studied in 1970, machine learning with rejection recently gained interest. This machine learning subfield enables machine learning models to abstain from making a prediction when likely to make a mistake. This survey aims to provide an overview on machine learning with rejection. We introduce the conditions leading to two types of rejection, ambiguity and novelty rejection, which we carefully formalize. Moreover, we review and categorize strategies to evaluate a model's predictive and rejective quality. Additionally, we define the existing architectures for models with rejection and describe the standard techniques for learning such models. Finally, we provide examples of relevant application domains and show how machine learning with rejection relates to other machine learning research areas.  ( 2 min )
    Chasing Convex Functions with Long-term Constraints
    arXiv:2402.14012v1 Announce Type: cross Abstract: We introduce and study a family of online metric problems with long-term constraints. In these problems, an online player makes decisions $\mathbf{x}_t$ in a metric space $(X,d)$ to simultaneously minimize their hitting cost $f_t(\mathbf{x}_t)$ and switching cost as determined by the metric. Over the time horizon $T$, the player must satisfy a long-term demand constraint $\sum_{t} c(\mathbf{x}_t) \geq 1$, where $c(\mathbf{x}_t)$ denotes the fraction of demand satisfied at time $t$. Such problems can find a wide array of applications to online resource allocation in sustainable energy and computing systems. We devise optimal competitive and learning-augmented algorithms for specific instantiations of these problems, and further show that our proposed algorithms perform well in numerical experiments.  ( 2 min )
    Asymptotics of Learning with Deep Structured (Random) Features
    arXiv:2402.13999v1 Announce Type: cross Abstract: For a large class of feature maps we provide a tight asymptotic characterisation of the test error associated with learning the readout layer, in the high-dimensional limit where the input dimension, hidden layer widths, and number of training samples are proportionally large. This characterization is formulated in terms of the population covariance of the features. Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks, namely deep non-linear fully-connected networks with random but structured weights, whose row-wise covariances are further allowed to depend on the weights of previous layers. For such networks we also derive a closed-form formula for the feature covariance in terms of the weight matrices. We further find that in some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.  ( 2 min )
    Linear-Time Graph Neural Networks for Scalable Recommendations
    arXiv:2402.13973v1 Announce Type: cross Abstract: In an era of information explosion, recommender systems are vital tools to deliver personalized recommendations for users. The key of recommender systems is to forecast users' future behaviors based on previous user-item interactions. Due to their strong expressive power of capturing high-order connectivities in user-item interaction data, recent years have witnessed a rising interest in leveraging Graph Neural Networks (GNNs) to boost the prediction performance of recommender systems. Nonetheless, classic Matrix Factorization (MF) and Deep Neural Network (DNN) approaches still play an important role in real-world large-scale recommender systems due to their scalability advantages. Despite the existence of GNN-acceleration solutions, it remains an open question whether GNN-based recommender systems can scale as efficiently as classic MF and DNN methods. In this paper, we propose a Linear-Time Graph Neural Network (LTGNN) to scale up GNN-based recommender systems to achieve comparable scalability as classic MF approaches while maintaining GNNs' powerful expressiveness for superior prediction accuracy. Extensive experiments and ablation studies are presented to validate the effectiveness and scalability of the proposed algorithm. Our implementation based on PyTorch is available.  ( 2 min )
    Probabilistic Neural Networks (PNNs) for Modeling Aleatoric Uncertainty in Scientific Machine Learning
    arXiv:2402.13945v1 Announce Type: cross Abstract: This paper investigates the use of probabilistic neural networks (PNNs) to model aleatoric uncertainty, which refers to the inherent variability in the input-output relationships of a system, often characterized by unequal variance or heteroscedasticity. Unlike traditional neural networks that produce deterministic outputs, PNNs generate probability distributions for the target variable, allowing the determination of both predicted means and intervals in regression scenarios. Contributions of this paper include the development of a probabilistic distance metric to optimize PNN architecture, and the deployment of PNNs in controlled data sets as well as a practical material science case involving fiber-reinforced composites. The findings confirm that PNNs effectively model aleatoric uncertainty, proving to be more appropriate than the commonly employed Gaussian process regression for this purpose. Specifically, in a real-world scientific machine learning context, PNNs yield remarkably accurate output mean estimates with R-squared scores approaching 0.97, and their predicted intervals exhibit a high correlation coefficient of nearly 0.80, closely matching observed data intervals. Hence, this research contributes to the ongoing exploration of leveraging the sophisticated representational capacity of neural networks to delineate complex input-output relationships in scientific problems.  ( 2 min )
    Advancing Audio Fingerprinting Accuracy Addressing Background Noise and Distortion Challenges
    arXiv:2402.13957v1 Announce Type: cross Abstract: Audio fingerprinting, exemplified by pioneers like Shazam, has transformed digital audio recognition. However, existing systems struggle with accuracy in challenging conditions, limiting broad applicability. This research proposes an AI and ML integrated audio fingerprinting algorithm to enhance accuracy. Built on the Dejavu Project's foundations, the study emphasizes real-world scenario simulations with diverse background noises and distortions. Signal processing, central to Dejavu's model, includes the Fast Fourier Transform, spectrograms, and peak extraction. The "constellation" concept and fingerprint hashing enable unique song identification. Performance evaluation attests to 100% accuracy within a 5-second audio input, with a system showcasing predictable matching speed for efficiency. Storage analysis highlights the critical space-speed trade-off for practical implementation. This research advances audio fingerprinting's adaptability, addressing challenges in varied environments and applications.  ( 2 min )
    Verifying message-passing neural networks via topology-based bounds tightening
    arXiv:2402.13937v1 Announce Type: cross Abstract: Since graph neural networks (GNNs) are often vulnerable to attack, we need to know when we can trust them. We develop a computationally effective approach towards providing robust certificates for message-passing neural networks (MPNNs) using a Rectified Linear Unit (ReLU) activation function. Because our work builds on mixed-integer optimization, it encodes a wide variety of subproblems, for example it admits (i) both adding and removing edges, (ii) both global and local budgets, and (iii) both topological perturbations and feature modifications. Our key technology, topology-based bounds tightening, uses graph structure to tighten bounds. We also experiment with aggressive bounds tightening to dynamically change the optimization constraints by tightening variable bounds. To demonstrate the effectiveness of these strategies, we implement an extension to the open-source branch-and-cut solver SCIP. We test on both node and graph classification problems and consider topological attacks that both add and remove edges.  ( 2 min )
    SDXL-Lightning: Progressive Adversarial Diffusion Distillation
    arXiv:2402.13929v1 Announce Type: cross Abstract: We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.  ( 2 min )
    Explain to Question not to Justify
    arXiv:2402.13914v1 Announce Type: cross Abstract: Explainable Artificial Intelligence (XAI) is a young but very promising field of research. Unfortunately, the progress in this field is currently slowed down by divergent and incompatible goals. In this paper, we separate various threads tangled within the area of XAI into two complementary cultures of human/value-oriented explanations (BLUE XAI) and model/validation-oriented explanations (RED XAI). We also argue that the area of RED XAI is currently under-explored and hides great opportunities and potential for important research necessary to ensure the safety of AI systems. We conclude this paper by presenting promising challenges in this area.  ( 2 min )
    BenchCloudVision: A Benchmark Analysis of Deep Learning Approaches for Cloud Detection and Segmentation in Remote Sensing Imagery
    arXiv:2402.13918v1 Announce Type: cross Abstract: Satellites equipped with optical sensors capture high-resolution imagery, providing valuable insights into various environmental phenomena. In recent years, there has been a surge of research focused on addressing some challenges in remote sensing, ranging from water detection in diverse landscapes to the segmentation of mountainous and terrains. Ongoing investigations goals to enhance the precision and efficiency of satellite imagery analysis. Especially, there is a growing emphasis on developing methodologies for accurate water body detection, snow and clouds, important for environmental monitoring, resource management, and disaster response. Within this context, this paper focus on the cloud segmentation from remote sensing imagery. Accurate remote sensing data analysis can be challenging due to the presence of clouds in optical sensor-based applications. The quality of resulting products such as applications and research is directly impacted by cloud detection, which plays a key role in the remote sensing data processing pipeline. This paper examines seven cutting-edge semantic segmentation and detection algorithms applied to clouds identification, conducting a benchmark analysis to evaluate their architectural approaches and identify the most performing ones. To increase the model's adaptability, critical elements including the type of imagery and the amount of spectral bands used during training are analyzed. Additionally, this research tries to produce machine learning algorithms that can perform cloud segmentation using only a few spectral bands, including RGB and RGBN-IR combinations. The model's flexibility for a variety of applications and user scenarios is assessed by using imagery from Sentinel-2 and Landsat-8 as datasets. This benchmark can be reproduced using the material from this github link: \url{https://github.com/toelt-llc/cloud\_segmentation\_comparative}.  ( 3 min )
    RFI-DRUnet: Restoring dynamic spectra corrupted by radio frequency interference -- Application to pulsar observations
    arXiv:2402.13867v1 Announce Type: cross Abstract: Radio frequency interference (RFI) have been an enduring concern in radio astronomy, particularly for the observations of pulsars which require high timing precision and data sensitivity. In most works of the literature, RFI mitigation has been formulated as a detection task that consists of localizing possible RFI in dynamic spectra. This strategy inevitably leads to a potential loss of information since parts of the signal identified as possibly RFI-corrupted are generally not considered in the subsequent data processing pipeline. Conversely, this work proposes to tackle RFI mitigation as a joint detection and restoration that allows parts of the dynamic spectrum affected by RFI to be not only identified but also recovered. The proposed supervised method relies on a deep convolutional network whose architecture inherits the performance reached by a recent yet popular image-denoising network. To train this network, a whole simulation framework is built to generate large data sets according to physics-inspired and statistical models of the pulsar signals and of the RFI. The relevance of the proposed approach is quantitatively assessed by conducting extensive experiments. In particular, the results show that the restored dynamic spectra are sufficiently reliable to estimate pulsar times-of-arrivals with an accuracy close to the one that would be obtained from RFI-free signals.  ( 3 min )
    Science Checker Reloaded: A Bidirectional Paradigm for Transparency and Logical Reasoning
    arXiv:2402.13897v1 Announce Type: cross Abstract: Information retrieval is a rapidly evolving field. However it still faces significant limitations in the scientific and industrial vast amounts of information, such as semantic divergence and vocabulary gaps in sparse retrieval, low precision and lack of interpretability in semantic search, or hallucination and outdated information in generative models. In this paper, we introduce a two-block approach to tackle these hurdles for long documents. The first block enhances language understanding in sparse retrieval by query expansion to retrieve relevant documents. The second block deepens the result by providing comprehensive and informative answers to the complex question using only the information spread in the long document, enabling bidirectional engagement. At various stages of the pipeline, intermediate results are presented to users to facilitate understanding of the system's reasoning. We believe this bidirectional approach brings significant advancements in terms of transparency, logical thinking, and comprehensive understanding in the field of scientific information retrieval.  ( 2 min )
    Large Language Models are Advanced Anonymizers
    arXiv:2402.13846v1 Announce Type: cross Abstract: Recent work in privacy research on large language models has shown that they achieve near human-level performance at inferring personal data from real-world online texts. With consistently increasing model capabilities, existing text anonymization methods are currently lacking behind regulatory requirements and adversarial threats. This raises the question of how individuals can effectively protect their personal data in sharing online texts. In this work, we take two steps to answer this question: We first present a new setting for evaluating anonymizations in the face of adversarial LLMs inferences, allowing for a natural measurement of anonymization performance while remedying some of the shortcomings of previous metrics. We then present our LLM-based adversarial anonymization framework leveraging the strong inferential capabilities of LLMs to inform our anonymization procedure. In our experimental evaluation, we show on real-world and synthetic online texts how adversarial anonymization outperforms current industry-grade anonymizers both in terms of the resulting utility and privacy.  ( 2 min )
    Improving Efficiency of Iso-Surface Extraction on Implicit Neural Representations Using Uncertainty Propagation
    arXiv:2402.13861v1 Announce Type: cross Abstract: Implicit Neural representations (INRs) are widely used for scientific data reduction and visualization by modeling the function that maps a spatial location to a data value. Without any prior knowledge about the spatial distribution of values, we are forced to sample densely from INRs to perform visualization tasks like iso-surface extraction which can be very computationally expensive. Recently, range analysis has shown promising results in improving the efficiency of geometric queries, such as ray casting and hierarchical mesh extraction, on INRs for 3D geometries by using arithmetic rules to bound the output range of the network within a spatial region. However, the analysis bounds are often too conservative for complex scientific data. In this paper, we present an improved technique for range analysis by revisiting the arithmetic rules and analyzing the probability distribution of the network output within a spatial region. We model this distribution efficiently as a Gaussian distribution by applying the central limit theorem. Excluding low probability values, we are able to tighten the output bounds, resulting in a more accurate estimation of the value range, and hence more accurate identification of iso-surface cells and more efficient iso-surface extraction on INRs. Our approach demonstrates superior performance in terms of the iso-surface extraction time on four datasets compared to the original range analysis method and can also be generalized to other geometric query tasks.  ( 3 min )
    Cas-DiffCom: Cascaded diffusion model for infant longitudinal super-resolution 3D medical image completion
    arXiv:2402.13776v1 Announce Type: cross Abstract: Early infancy is a rapid and dynamic neurodevelopmental period for behavior and neurocognition. Longitudinal magnetic resonance imaging (MRI) is an effective tool to investigate such a crucial stage by capturing the developmental trajectories of the brain structures. However, longitudinal MRI acquisition always meets a serious data-missing problem due to participant dropout and failed scans, making longitudinal infant brain atlas construction and developmental trajectory delineation quite challenging. Thanks to the development of an AI-based generative model, neuroimage completion has become a powerful technique to retain as much available data as possible. However, current image completion methods usually suffer from inconsistency within each individual subject in the time dimension, compromising the overall quality. To solve this problem, our paper proposed a two-stage cascaded diffusion model, Cas-DiffCom, for dense and longitudinal 3D infant brain MRI completion and super-resolution. We applied our proposed method to the Baby Connectome Project (BCP) dataset. The experiment results validate that Cas-DiffCom achieves both individual consistency and high fidelity in longitudinal infant brain image completion. We further applied the generated infant brain images to two downstream tasks, brain tissue segmentation and developmental trajectory delineation, to declare its task-oriented potential in the neuroscience field.  ( 3 min )
    An Evaluation of Large Language Models in Bioinformatics Research
    arXiv:2402.13714v1 Announce Type: cross Abstract: Large language models (LLMs) such as ChatGPT have gained considerable interest across diverse research communities. Their notable ability for text completion and generation has inaugurated a novel paradigm for language-interfaced problem solving. However, the potential and efficacy of these models in bioinformatics remain incompletely explored. In this work, we study the performance LLMs on a wide spectrum of crucial bioinformatics tasks. These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems. Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks. In addition, we provide a thorough analysis of their limitations in the context of complicated bioinformatics tasks. In conclusion, we believe that this work can provide new perspectives and motivate future research in the field of LLMs applications, AI for Science and bioinformatics.  ( 2 min )
    The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning
    arXiv:2402.13723v1 Announce Type: cross Abstract: Foundation models in speech are often trained using many GPUs, which implicitly leads to large effective batch sizes. In this paper we study the effect of batch size on pre-training, both in terms of statistics that can be monitored during training, and in the effect on the performance of a downstream fine-tuning task. By using batch sizes varying from 87.5 seconds to 80 minutes of speech we show that, for a fixed amount of iterations, larger batch sizes result in better pre-trained models. However, there is lower limit for stability, and an upper limit for effectiveness. We then show that the quality of the pre-trained model depends mainly on the amount of speech data seen during training, i.e., on the product of batch size and number of iterations. All results are produced with an independent implementation of the wav2vec 2.0 architecture, which to a large extent reproduces the results of the original work (arXiv:2006.11477). Our extensions can help researchers choose effective operating conditions when studying self-supervised learning in speech, and hints towards benchmarking self-supervision with a fixed amount of seen data. Code and model checkpoints are available at https://github.com/nikvaessen/w2v2-batch-size.  ( 2 min )
    Explainable Classification Techniques for Quantum Dot Device Measurements
    arXiv:2402.13699v1 Announce Type: cross Abstract: In the physical sciences, there is an increased need for robust feature representations of image data: image acquisition, in the generalized sense of two-dimensional data, is now widespread across a large number of fields, including quantum information science, which we consider here. While traditional image features are widely utilized in such cases, their use is rapidly being supplanted by Neural Network-based techniques that often sacrifice explainability in exchange for high accuracy. To ameliorate this trade-off, we propose a synthetic data-based technique that results in explainable features. We show, using Explainable Boosting Machines (EBMs), that this method offers superior explainability without sacrificing accuracy. Specifically, we show that there is a meaningful benefit to this technique in the context of quantum dot tuning, where human intervention is necessary at the current stage of development.  ( 2 min )
    Computing Transiting Exoplanet Parameters with 1D Convolutional Neural Networks
    arXiv:2402.13673v1 Announce Type: cross Abstract: The transit method allows the detection and characterization of planetary systems by analyzing stellar light curves. Convolutional neural networks appear to offer a viable solution for automating these analyses. In this research, two 1D convolutional neural network models, which work with simulated light curves in which transit-like signals were injected, are presented. One model operates on complete light curves and estimates the orbital period, and the other one operates on phase-folded light curves and estimates the semimajor axis of the orbit and the square of the planet-to-star radius ratio. Both models were tested on real data from TESS light curves with confirmed planets to ensure that they are able to work with real data. The results obtained show that 1D CNNs are able to characterize transiting exoplanets from their host star's detrended light curve and, furthermore, reducing both the required time and computational costs compared with the current detection and characterization algorithms.  ( 2 min )
    Improving a Proportional Integral Controller with Reinforcement Learning on a Throttle Valve Benchmark
    arXiv:2402.13654v1 Announce Type: cross Abstract: This paper presents a learning-based control strategy for non-linear throttle valves with an asymmetric hysteresis, leading to a near-optimal controller without requiring any prior knowledge about the environment. We start with a carefully tuned Proportional Integrator (PI) controller and exploit the recent advances in Reinforcement Learning (RL) with Guides to improve the closed-loop behavior by learning from the additional interactions with the valve. We test the proposed control method in various scenarios on three different valves, all highlighting the benefits of combining both PI and RL frameworks to improve control performance in non-linear stochastic systems. In all the experimental test cases, the resulting agent has a better sample efficiency than traditional RL agents and outperforms the PI controller.  ( 2 min )
    Measurement Uncertainty: Relating the uncertainties of physical and virtual measurements
    arXiv:2402.13666v1 Announce Type: cross Abstract: In the context of industrially mass-manufactured products, quality management is based on physically inspecting a small sample from a large batch and reasoning about the batch's quality conformance. When complementing physical inspections with predictions from machine learning models, it is crucial that the uncertainty of the prediction is known. Otherwise, the application of established quality management concepts is not legitimate. Deterministic (machine learning) models lack quantification of their predictive uncertainty and are therefore unsuitable. Probabilistic (machine learning) models provide a predictive uncertainty along with the prediction. However, a concise relationship is missing between the measurement uncertainty of physical inspections and the predictive uncertainty of probabilistic models in their application in quality management. Here, we show how the predictive uncertainty of probabilistic (machine learning) models is related to the measurement uncertainty of physical inspections. This enables the use of probabilistic models for virtual inspections and integrates them into existing quality management concepts. Thus, we can provide a virtual measurement for any quality characteristic based on the process data and achieve a 100 percent inspection rate. In the field of Predictive Quality, the virtual measurement is of great interest. Based on our results, physical inspections with a low sampling rate can be accompanied by virtual measurements that allow an inspection rate of 100 percent. We add substantial value, especially to complex process chains, as faulty products/parts are identified promptly and upcoming process steps can be aborted.  ( 3 min )
    Robustness of Deep Neural Networks for Micro-Doppler Radar Classification
    arXiv:2402.13651v1 Announce Type: cross Abstract: With the great capabilities of deep classifiers for radar data processing come the risks of learning dataset-specific features that do not generalize well. In this work, the robustness of two deep convolutional architectures, trained and tested on the same data, is evaluated. When standard training practice is followed, both classifiers exhibit sensitivity to subtle temporal shifts of the input representation, an augmentation that carries minimal semantic content. Furthermore, the models are extremely susceptible to adversarial examples. Both small temporal shifts and adversarial examples are a result of a model overfitting on features that do not generalize well. As a remedy, it is shown that training on adversarial examples and temporally augmented samples can reduce this effect and lead to models that generalise better. Finally, models operating on cadence-velocity diagram representation rather than Doppler-time are demonstrated to be naturally more immune to adversarial examples.  ( 2 min )
    A Large Dimensional Analysis of Multi-task Semi-Supervised Learning
    arXiv:2402.13646v1 Announce Type: cross Abstract: This article conducts a large dimensional study of a simple yet quite versatile classification model, encompassing at once multi-task and semi-supervised learning, and taking into account uncertain labeling. Using tools from random matrix theory, we characterize the asymptotics of some key functionals, which allows us on the one hand to predict the performances of the algorithm, and on the other hand to reveal some counter-intuitive guidance on how to use it efficiently. The model, powerful enough to provide good performance guarantees, is also straightforward enough to provide strong insights into its behavior.  ( 2 min )
    Learning Dual-arm Object Rearrangement for Cartesian Robots
    arXiv:2402.13634v1 Announce Type: cross Abstract: This work focuses on the dual-arm object rearrangement problem abstracted from a realistic industrial scenario of Cartesian robots. The goal of this problem is to transfer all the objects from sources to targets with the minimum total completion time. To achieve the goal, the core idea is to develop an effective object-to-arm task assignment strategy for minimizing the cumulative task execution time and maximizing the dual-arm cooperation efficiency. One of the difficulties in the task assignment is the scalability problem. As the number of objects increases, the computation time of traditional offline-search-based methods grows strongly for computational complexity. Encouraged by the adaptability of reinforcement learning (RL) in long-sequence task decisions, we propose an online task assignment decision method based on RL, and the computation time of our method only increases linearly with the number of objects. Further, we design an attention-based network to model the dependencies between the input states during the whole task execution process to help find the most reasonable object-to-arm correspondence in each task assignment round. In the experimental part, we adapt some search-based methods to this specific setting and compare our method with them. Experimental result shows that our approach achieves outperformance over search-based methods in total execution time and computational efficiency, and also verifies the generalization of our method to different numbers of objects. In addition, we show the effectiveness of our method deployed on the real robot in the supplementary video.  ( 3 min )
    Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures
    arXiv:2402.13640v1 Announce Type: cross Abstract: Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware, managing memory, data transfers, and multi-accelerator execution, if applicable. Additionally, it is a common practice to deploy pre-trained models on environments distinct from their native development settings. This led to the introduction of interchange formats such as ONNX, which includes its runtime infrastructure, and ONNX Runtime, which work as standard formats that can be used across diverse DL frameworks and languages. Even though these runtime infrastructures have a great impact on inference performance, no previous paper has investigated their energy efficiency. In this study, we monitor the energy consumption and inference time in the runtime infrastructures of three well-known DL frameworks as well as ONNX, using three various DL models. To have nuance in our investigation, we also examine the impact of using different execution providers. We find out that the performance and energy efficiency of DL are difficult to predict. One framework, MXNet, outperforms both PyTorch and TensorFlow for the computer vision models using batch size 1, due to efficient GPU usage and thus low CPU usage. However, batch size 64 makes PyTorch and MXNet practically indistinguishable, while TensorFlow is outperformed consistently. For BERT, PyTorch exhibits the best performance. Converting the models to ONNX usually yields significant performance improvements but the ONNX converted ResNet model with batch size 64 consumes approximately 10% more energy and time than the original PyTorch model.  ( 3 min )
    Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression
    arXiv:2402.13622v1 Announce Type: cross Abstract: We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable fixed rate $\alpha\!=\! n/d$. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha\!<\!1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.  ( 2 min )
    Overview of the VLSP 2023 -- ComOM Shared Task: A Data Challenge for Comparative Opinion Mining from Vietnamese Product Reviews
    arXiv:2402.13613v1 Announce Type: cross Abstract: This paper presents a comprehensive overview of the Comparative Opinion Mining from Vietnamese Product Reviews shared task (ComOM), held as part of the 10$^{th}$ International Workshop on Vietnamese Language and Speech Processing (VLSP 2023). The primary objective of this shared task is to advance the field of natural language processing by developing techniques that proficiently extract comparative opinions from Vietnamese product reviews. Participants are challenged to propose models that adeptly extract a comparative "quintuple" from a comparative sentence, encompassing Subject, Object, Aspect, Predicate, and Comparison Type Label. We construct a human-annotated dataset comprising $120$ documents, encompassing $7427$ non-comparative sentences and $2468$ comparisons within $1798$ sentences. Participating models undergo evaluation and ranking based on the Exact match macro-averaged quintuple F1 score.  ( 2 min )
    Convergence Acceleration of Markov Chain Monte Carlo-based Gradient Descent by Deep Unfolding
    arXiv:2402.13608v1 Announce Type: cross Abstract: This study proposes a trainable sampling-based solver for combinatorial optimization problems (COPs) using a deep-learning technique called deep unfolding. The proposed solver is based on the Ohzeki method that combines Markov-chain Monte-Carlo (MCMC) and gradient descent, and its step sizes are trained by minimizing a loss function. In the training process, we propose a sampling-based gradient estimation that substitutes auto-differentiation with a variance estimation, thereby circumventing the failure of back propagation due to the non-differentiability of MCMC. The numerical results for a few COPs demonstrated that the proposed solver significantly accelerated the convergence speed compared with the original Ohzeki method.  ( 2 min )
    Data-driven Discovery with Large Generative Models
    arXiv:2402.13610v1 Announce Type: cross Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.  ( 2 min )
    User-LLM: Efficient LLM Contextualization with User Embeddings
    arXiv:2402.13598v1 Announce Type: cross Abstract: Large language models (LLMs) have revolutionized natural language processing. However, effectively incorporating complex and potentially noisy user interaction data remains a challenge. To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. These embeddings, distilled from diverse user interactions using self-supervised pretraining, capture latent user preferences and their evolution over time. We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. Our comprehensive experiments on MovieLens, Amazon Review, and Google Local Review datasets demonstrate significant performance gains across various tasks. Notably, our approach outperforms text-prompt-based contextualization on long sequence tasks and tasks that require deep user understanding while being computationally efficient. We further incorporate Perceiver layers to streamline the integration between user encoders and LLMs, reducing computational demands.  ( 2 min )
    A cutting plane algorithm for globally solving low dimensional k-means clustering problems
    arXiv:2402.13595v1 Announce Type: cross Abstract: Clustering is one of the most fundamental tools in data science and machine learning, and k-means clustering is one of the most common such methods. There is a variety of approximate algorithms for the k-means problem, but computing the globally optimal solution is in general NP-hard. In this paper we consider the k-means problem for instances with low dimensional data and formulate it as a structured concave assignment problem. This allows us to exploit the low dimensional structure and solve the problem to global optimality within reasonable time for large data sets with several clusters. The method builds on iteratively solving a small concave problem and a large linear programming problem. This gives a sequence of feasible solutions along with bounds which we show converges to zero optimality gap. The paper combines methods from global optimization theory to accelerate the procedure, and we provide numerical results on their performance.  ( 2 min )
    ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
    arXiv:2402.13573v1 Announce Type: cross Abstract: Attention mechanism has been crucial for image diffusion models, however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method ToDo that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.  ( 2 min )
    Mastering the Game of Guandan with Deep Reinforcement Learning and Behavior Regulating
    arXiv:2402.13582v1 Announce Type: cross Abstract: Games are a simplified model of reality and often serve as a favored platform for Artificial Intelligence (AI) research. Much of the research is concerned with game-playing agents and their decision making processes. The game of Guandan (literally, "throwing eggs") is a challenging game where even professional human players struggle to make the right decision at times. In this paper we propose a framework named GuanZero for AI agents to master this game using Monte-Carlo methods and deep neural networks. The main contribution of this paper is about regulating agents' behavior through a carefully designed neural network encoding scheme. We then demonstrate the effectiveness of the proposed framework by comparing it with state-of-the-art approaches.  ( 2 min )
    Graph Representation of Narrative Context: Coherence Dependency via Retrospective Questions
    arXiv:2402.13551v1 Announce Type: cross Abstract: This work introduces a novel and practical paradigm for narrative comprehension, stemming from the observation that individual passages within narratives are often cohesively related than being isolated. We therefore propose to formulate a graph upon narratives dubbed NARCO that depicts a task-agnostic coherence dependency of the entire context. Especially, edges in NARCO encompass retrospective free-form questions between two context snippets reflecting high-level coherent relations, inspired by the cognitive perception of humans who constantly reinstate relevant events from prior context. Importantly, our graph is instantiated through our designed two-stage LLM prompting, thereby without reliance on human annotations. We present three unique studies on its practical utility, examining the edge efficacy via recap identification, local context augmentation via plot retrieval, and broader applications exemplified by long document QA. Experiments suggest that our approaches leveraging NARCO yield performance boost across all three tasks.  ( 2 min )
    ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling
    arXiv:2402.13542v1 Announce Type: cross Abstract: Retrieval-augmented generation enhances large language models (LLMs) by incorporating relevant information from external knowledge sources. This enables LLMs to adapt to specific domains and mitigate hallucinations in knowledge-intensive tasks. However, existing retrievers are often misaligned with LLMs due to their separate training processes and the black-box nature of LLMs. To address this challenge, we propose ARL2, a retriever learning technique that harnesses LLMs as labelers. ARL2 leverages LLMs to annotate and score relevant evidence, enabling learning the retriever from robust LLM supervision. Furthermore, ARL2 uses an adaptive self-training strategy for curating high-quality and diverse relevance data, which can effectively reduce the annotation cost. Extensive experiments demonstrate the effectiveness of ARL2, achieving accuracy improvements of 5.4% on NQ and 4.6% on MMLU compared to the state-of-the-art methods. Additionally, ARL2 exhibits robust transfer learning capabilities and strong zero-shot generalization abilities. Our code will be published at \url{https://github.com/zhanglingxi-cs/ARL2}.  ( 2 min )
    Infrastructure Ombudsman: Mining Future Failure Concerns from Structural Disaster Response
    arXiv:2402.13528v1 Announce Type: cross Abstract: Current research concentrates on studying discussions on social media related to structural failures to improve disaster response strategies. However, detecting social web posts discussing concerns about anticipatory failures is under-explored. If such concerns are channeled to the appropriate authorities, it can aid in the prevention and mitigation of potential infrastructural failures. In this paper, we develop an infrastructure ombudsman -- that automatically detects specific infrastructure concerns. Our work considers several recent structural failures in the US. We present a first-of-its-kind dataset of 2,662 social web instances for this novel task mined from Reddit and YouTube.  ( 2 min )
    Best of Many in Both Worlds: Online Resource Allocation with Predictions under Unknown Arrival Model
    arXiv:2402.13530v1 Announce Type: cross Abstract: Online decision-makers today can often obtain predictions on future variables, such as arrivals, demands, inventories, and so on. These predictions can be generated from simple forecasting algorithms for univariate time-series, all the way to state-of-the-art machine learning models that leverage multiple time-series and additional feature information. However, the prediction quality is often unknown to decisions-makers a priori, hence blindly following the predictions can be harmful. In this paper, we address this problem by giving algorithms that take predictions as inputs and perform robustly against the unknown prediction quality. We consider the online resource allocation problem, one of the most generic models in revenue management and online decision-making. In this problem, a decision maker has a limited amount of resources, and requests arrive sequentially. For each request, the decision-maker needs to decide on an action, which generates a certain amount of rewards and consumes a certain amount of resources, without knowing the future requests. The decision-maker's objective is to maximize the total rewards subject to resource constraints. We take the shadow price of each resource as prediction, which can be obtained by predictions on future requests. Prediction quality is naturally defined to be the $\ell_1$ distance between the prediction and the actual shadow price. Our main contribution is an algorithm which takes the prediction of unknown quality as an input, and achieves asymptotically optimal performance under both requests arrival models (stochastic and adversarial) without knowing the prediction quality and the requests arrival model beforehand. We show our algorithm's performance matches the best achievable performance of any algorithm had the arrival models and the accuracy of the predictions been known. We empirically validate our algorithm with experiments.  ( 3 min )
    Balancing Spectral, Temporal and Spatial Information for EEG-based Alzheimer's Disease Classification
    arXiv:2402.13523v1 Announce Type: cross Abstract: The prospect of future treatment warrants the development of cost-effective screening for Alzheimer's disease (AD). A promising candidate in this regard is electroencephalography (EEG), as it is one of the most economic imaging modalities. Recent efforts in EEG analysis have shifted towards leveraging spatial information, employing novel frameworks such as graph signal processing or graph neural networks. Here, we systematically investigate the importance of spatial information relative to spectral or temporal information by varying the proportion of each dimension for AD classification. To do so, we test various dimension resolution configurations on two routine EEG datasets. We find that spatial information is consistently more relevant than temporal information and equally relevant as spectral information. These results emphasise the necessity to consider spatial information for EEG-based AD classification. On our second dataset, we further find that well-balanced feature resolutions boost classification accuracy by up to 1.6%. Our resolution-based feature extraction has the potential to improve AD classification specifically, and multivariate signal classification generally.  ( 2 min )
    Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks
    arXiv:2402.13482v1 Announce Type: cross Abstract: Despite large successes of recent language models on diverse tasks, they suffer from severe performance degeneration in low-resource settings with limited training data available. Many existing works tackle this problem by generating synthetic data from the training data and then training models on them, recently using Large Language Models (LLMs). However, in low-resource settings, the amount of seed data samples to use for data augmentation is very small, which makes generated samples suboptimal and less diverse. To tackle this challenge, we propose a novel method that augments training data by incorporating a wealth of examples from other datasets, along with the given training data. Specifically, we first retrieve the relevant instances from other datasets, such as their input-output pairs or contexts, based on their similarities with the given seed data, and then prompt LLMs to generate new samples with the contextual information within and across the original and retrieved samples. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone. We validate our proposed Retrieval-Augmented Data Augmentation (RADA) framework on multiple datasets under low-resource settings of training and test-time data augmentation scenarios, on which it outperforms existing LLM-powered data augmentation baselines.  ( 2 min )
    ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
    arXiv:2402.13448v1 Announce Type: cross Abstract: In the emergency department (ED), patients undergo triage and multiple laboratory tests before diagnosis. This process is time-consuming, and causes ED crowding which significantly impacts patient mortality, medical errors, staff burnout, etc. This work proposes (time) cost-effective diagnostic assistance that explores the potential of artificial intelligence (AI) systems in assisting ED clinicians to make time-efficient and accurate diagnoses. Using publicly available patient data, we collaborate with ED clinicians to curate MIMIC-ED-Assist, a benchmark that measures the ability of AI systems in suggesting laboratory tests that minimize ED wait times, while correctly predicting critical outcomes such as death. We develop ED-Copilot which sequentially suggests patient-specific laboratory tests and makes diagnostic predictions. ED-Copilot uses a pre-trained bio-medical language model to encode patient information and reinforcement learning to minimize ED wait time and maximize prediction accuracy of critical outcomes. On MIMIC-ED-Assist, ED-Copilot improves prediction accuracy over baselines while halving average wait time from four hours to two hours. Ablation studies demonstrate the importance of model scale and use of a bio-medical language model. Further analyses reveal the necessity of personalized laboratory test suggestions for diagnosing patients with severe cases, as well as the potential of ED-Copilot in providing ED clinicians with informative laboratory test recommendations. Our code is available at https://github.com/cxcscmu/ED-Copilot.  ( 2 min )
    LocalTweets to LocalHealth: A Mental Health Surveillance Framework Based on Twitter Data
    arXiv:2402.13452v1 Announce Type: cross Abstract: Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78\%, respectively, a 59\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.  ( 2 min )
    Learning to Retrieve for Job Matching
    arXiv:2402.13435v1 Announce Type: cross Abstract: Web-scale search systems typically tackle the scalability challenge with a two-step paradigm: retrieval and ranking. The retrieval step, also known as candidate selection, often involves extracting standardized entities, creating an inverted index, and performing term matching for retrieval. Such traditional methods require manual and time-consuming development of query models. In this paper, we discuss applying learning-to-retrieve technology to enhance LinkedIns job search and recommendation systems. In the realm of promoted jobs, the key objective is to improve the quality of applicants, thereby delivering value to recruiter customers. To achieve this, we leverage confirmed hire data to construct a graph that evaluates a seeker's qualification for a job, and utilize learned links for retrieval. Our learned model is easy to explain, debug, and adjust. On the other hand, the focus for organic jobs is to optimize seeker engagement. We accomplished this by training embeddings for personalized retrieval, fortified by a set of rules derived from the categorization of member feedback. In addition to a solution based on a conventional inverted index, we developed an on-GPU solution capable of supporting both KNN and term matching efficiently.  ( 2 min )
    DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain
    arXiv:2402.13432v1 Announce Type: cross Abstract: The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLMs qualities from various perspectives. Although still limited to few languages, this initiative has been undertaken in the biomedical field, notably English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.  ( 2 min )
    The Dimension of Self-Directed Learning
    arXiv:2402.13400v1 Announce Type: cross Abstract: Understanding the self-directed learning complexity has been an important problem that has captured the attention of the online learning theory community since the early 1990s. Within this framework, the learner is allowed to adaptively choose its next data point in making predictions unlike the setting in adversarial online learning. In this paper, we study the self-directed learning complexity in both the binary and multi-class settings, and we develop a dimension, namely $SDdim$, that exactly characterizes the self-directed learning mistake-bound for any concept class. The intuition behind $SDdim$ can be understood as a two-player game called the "labelling game". Armed with this two-player game, we calculate $SDdim$ on a whole host of examples with notable results on axis-aligned rectangles, VC dimension $1$ classes, and linear separators. We demonstrate several learnability gaps with a central focus on self-directed learning and offline sequence learning models that include either the best or worst ordering. Finally, we extend our analysis to the self-directed binary agnostic setting where we derive upper and lower bounds.  ( 2 min )
    Toward TransfORmers: Revolutionizing the Solution of Mixed Integer Programs with Transformers
    arXiv:2402.13380v1 Announce Type: cross Abstract: In this study, we introduce an innovative deep learning framework that employs a transformer model to address the challenges of mixed-integer programs, specifically focusing on the Capacitated Lot Sizing Problem (CLSP). Our approach, to our knowledge, is the first to utilize transformers to predict the binary variables of a mixed-integer programming (MIP) problem. Specifically, our approach harnesses the encoder decoder transformer's ability to process sequential data, making it well-suited for predicting binary variables indicating production setup decisions in each period of the CLSP. This problem is inherently dynamic, and we need to handle sequential decision making under constraints. We present an efficient algorithm in which CLSP solutions are learned through a transformer neural network. The proposed post-processed transformer algorithm surpasses the state-of-the-art solver, CPLEX and Long Short-Term Memory (LSTM) in solution time, optimal gap, and percent infeasibility over 240K benchmark CLSP instances tested. After the ML model is trained, conducting inference on the model, including post-processing, reduces the MIP into a linear program (LP). This transforms the ML-based algorithm, combined with an LP solver, into a polynomial-time approximation algorithm to solve a well-known NP-Hard problem, with almost perfect solution quality.  ( 2 min )
    KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers
    arXiv:2402.13352v1 Announce Type: cross Abstract: Quantum algorithms, represented as quantum circuits, can be used as benchmarks for assessing the performance of quantum systems. Existing datasets, widely utilized in the field, suffer from limitations in size and versatility, leading researchers to employ randomly generated circuits. Random circuits are, however, not representative benchmarks as they lack the inherent properties of real quantum algorithms for which the quantum systems are manufactured. This shortage of `useful' quantum benchmarks poses a challenge to advancing the development and comparison of quantum compilers and hardware. This research aims to enhance the existing quantum circuit datasets by generating what we refer to as `realistic-looking' circuits by employing the Transformer machine learning architecture. For this purpose, we introduce KetGPT, a tool that generates synthetic circuits in OpenQASM language, whose structure is based on quantum circuits derived from existing quantum algorithms and follows the typical patterns of human-written algorithm-based code (e.g., order of gates and qubits). Our three-fold verification process, involving manual inspection and Qiskit framework execution, transformer-based classification, and structural analysis, demonstrates the efficacy of KetGPT in producing large amounts of additional circuits that closely align with algorithm-based structures. Beyond benchmarking, we envision KetGPT contributing substantially to AI-driven quantum compilers and systems.  ( 2 min )
    DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models
    arXiv:2402.13291v1 Announce Type: cross Abstract: The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuning models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix.  ( 3 min )
    Rigor with Machine Learning from Field Theory to the Poincar\'e Conjecture
    arXiv:2402.13321v1 Announce Type: cross Abstract: Machine learning techniques are increasingly powerful, leading to many breakthroughs in the natural sciences, but they are often stochastic, error-prone, and blackbox. How, then, should they be utilized in fields such as theoretical physics and pure mathematics that place a premium on rigor and understanding? In this Perspective we discuss techniques for obtaining rigor in the natural sciences with machine learning. Non-rigorous methods may lead to rigorous results via conjecture generation or verification by reinforcement learning. We survey applications of these techniques-for-rigor ranging from string theory to the smooth $4$d Poincar\'e conjecture in low-dimensional topology. One can also imagine building direct bridges between machine learning theory and either mathematics or theoretical physics. As examples, we describe a new approach to field theory motivated by neural network theory, and a theory of Riemannian metric flows induced by neural network gradient descent, which encompasses Perelman's formulation of the Ricci flow that was utilized to resolve the $3$d Poincar\'e conjecture.  ( 2 min )
    Manipulating hidden-Markov-model inferences by corrupting batch data
    arXiv:2402.13287v1 Announce Type: cross Abstract: Time-series models typically assume untainted and legitimate streams of data. However, a self-interested adversary may have incentive to corrupt this data, thereby altering a decision maker's inference. Within the broader field of adversarial machine learning, this research provides a novel, probabilistic perspective toward the manipulation of hidden Markov model inferences via corrupted data. In particular, we provision a suite of corruption problems for filtering, smoothing, and decoding inferences leveraging an adversarial risk analysis approach. Multiple stochastic programming models are set forth that incorporate realistic uncertainties and varied attacker objectives. Three general solution methods are developed by alternatively viewing the problem from frequentist and Bayesian perspectives. The efficacy of each method is illustrated via extensive, empirical testing. The developed methods are characterized by their solution quality and computational effort, resulting in a stratification of techniques across varying problem-instance architectures. This research highlights the weaknesses of hidden Markov models under adversarial activity, thereby motivating the need for robustification techniques to ensure their security.  ( 2 min )
    Leveraging PAC-Bayes Theory and Gibbs Distributions for Generalization Bounds with Complexity Measures
    arXiv:2402.13285v1 Announce Type: cross Abstract: In statistical learning theory, a generalization bound usually involves a complexity measure imposed by the considered theoretical framework. This limits the scope of such bounds, as other forms of capacity measures or regularizations are used in algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayes bounds to derive a general generalization bound instantiable with arbitrary complexity measures. One trick to prove such a result involves considering a commonly used family of distributions: the Gibbs distributions. Our bound stands in probability jointly over the hypothesis and the learning sample, which allows the complexity to be adapted to the generalization gap as it can be customized to fit both the hypothesis class and the task.  ( 2 min )
    Global Tropical Cyclone Intensity Forecasting with Multi-modal Multi-scale Causal Autoregressive Model
    arXiv:2402.13270v1 Announce Type: cross Abstract: Accurate forecasting of Tropical cyclone (TC) intensity is crucial for formulating disaster risk reduction strategies. Current methods predominantly rely on limited spatiotemporal information from ERA5 data and neglect the causal relationships between these physical variables, failing to fully capture the spatial and temporal patterns required for intensity forecasting. To address this issue, we propose a Multi-modal multi-Scale Causal AutoRegressive model (MSCAR), which is the first model that combines causal relationships with large-scale multi-modal data for global TC intensity autoregressive forecasting. Furthermore, given the current absence of a TC dataset that offers a wide range of spatial variables, we present the Satellite and ERA5-based Tropical Cyclone Dataset (SETCD), which stands as the longest and most comprehensive global dataset related to TCs. Experiments on the dataset show that MSCAR outperforms the state-of-the-art methods, achieving maximum reductions in global and regional forecast errors of 9.52% and 6.74%, respectively. The code and dataset are publicly available at https://anonymous.4open.science/r/MSCAR.  ( 2 min )
    MLSTL-WSN: Machine Learning-based Intrusion Detection using SMOTETomek in WSNs
    arXiv:2402.13277v1 Announce Type: cross Abstract: Wireless Sensor Networks (WSNs) play a pivotal role as infrastructures, encompassing both stationary and mobile sensors. These sensors self-organize and establish multi-hop connections for communication, collectively sensing, gathering, processing, and transmitting data about their surroundings. Despite their significance, WSNs face rapid and detrimental attacks that can disrupt functionality. Existing intrusion detection methods for WSNs encounter challenges such as low detection rates, computational overhead, and false alarms. These issues stem from sensor node resource constraints, data redundancy, and high correlation within the network. To address these challenges, we propose an innovative intrusion detection approach that integrates Machine Learning (ML) techniques with the Synthetic Minority Oversampling Technique Tomek Link (SMOTE-TomekLink) algorithm. This blend synthesizes minority instances and eliminates Tomek links, resulting in a balanced dataset that significantly enhances detection accuracy in WSNs. Additionally, we incorporate feature scaling through standardization to render input features consistent and scalable, facilitating more precise training and detection. To counteract imbalanced WSN datasets, we employ the SMOTE-Tomek resampling technique, mitigating overfitting and underfitting issues. Our comprehensive evaluation, using the WSN Dataset (WSN-DS) containing 374,661 records, identifies the optimal model for intrusion detection in WSNs. The standout outcome of our research is the remarkable performance of our model. In binary, it achieves an accuracy rate of 99.78% and in multiclass, it attains an exceptional accuracy rate of 99.92%. These findings underscore the efficiency and superiority of our proposal in the context of WSN intrusion detection, showcasing its effectiveness in detecting and mitigating intrusions in WSNs.  ( 3 min )
    Coercing LLMs to do and reveal (almost) anything
    arXiv:2402.14020v1 Announce Type: new Abstract: It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.  ( 2 min )
    Thresholded Oja does Sparse PCA?
    arXiv:2402.07240v2 Announce Type: cross Abstract: We consider the problem of Sparse Principal Component Analysis (PCA) when the ratio $d/n \rightarrow c > 0$. There has been a lot of work on optimal rates on sparse PCA in the offline setting, where all the data is available for multiple passes. In contrast, when the population eigenvector is $s$-sparse, streaming algorithms that have $O(d)$ storage and $O(nd)$ time complexity either typically require strong initialization conditions or have a suboptimal error. We show that a simple algorithm that thresholds and renormalizes the output of Oja's algorithm (the Oja vector) obtains a near-optimal error rate. This is very surprising because, without thresholding, the Oja vector has a large error. Our analysis centers around bounding the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices on a random initial vector. This is nontrivial and novel since previous analyses of Oja's algorithm and matrix products have been done when the trace of the population covariance matrix is bounded while in our setting, this quantity can be as large as $n$.  ( 2 min )
    Corrective Machine Unlearning
    arXiv:2402.14015v1 Announce Type: new Abstract: Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and in general, reduced accuracy on certain input domains. Often, all manipulated training samples are not known, and only a small, representative subset of the affected data is flagged. We formalize "Corrective Machine Unlearning" as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of impacted samples. We demonstrate that the problem of corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find most existing unlearning methods, including the gold-standard retraining-from-scratch, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, SSD, achieves limited success in unlearning adverse effects with just a small portion of the manipulated samples, showing the tractability of this setting. We hope our work spurs research towards developing better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training.  ( 2 min )
    D-Flow: Differentiating through Flows for Controlled Generation
    arXiv:2402.14017v1 Announce Type: new Abstract: Taming the generation outcome of state of the art Diffusion and Flow-Matching (FM) models without having to re-train a task-specific model unlocks a powerful tool for solving inverse problems, conditional generation, and controlled generation in general. In this work we introduce D-Flow, a simple framework for controlling the generation process by differentiating through the flow, optimizing for the source (noise) point. We motivate this framework by our key observation stating that for Diffusion/FM models trained with Gaussian probability paths, differentiating through the generation process projects gradient on the data manifold, implicitly injecting the prior into the optimization process. We validate our framework on linear and non-linear controlled generation problems including: image and audio inverse problems and conditional molecule generation reaching state of the art performance across all.  ( 2 min )
    Misalignment, Learning, and Ranking: Harnessing Users Limited Attention
    arXiv:2402.14013v1 Announce Type: new Abstract: In digital health and EdTech, recommendation systems face a significant challenge: users often choose impulsively, in ways that conflict with the platform's long-term payoffs. This misalignment makes it difficult to effectively learn to rank items, as it may hinder exploration of items with greater long-term payoffs. Our paper tackles this issue by utilizing users' limited attention spans. We propose a model where a platform presents items with unknown payoffs to the platform in a ranked list to $T$ users over time. Each user selects an item by first considering a prefix window of these ranked items and then picking the highest preferred item in that window (and the platform observes its payoff for this item). We study the design of online bandit algorithms that obtain vanishing regret against hindsight optimal benchmarks. We first consider adversarial window sizes and stochastic iid payoffs. We design an active-elimination-based algorithm that achieves an optimal instance-dependent regret bound of $O(\log(T))$, by showing matching regret upper and lower bounds. The key idea is using the combinatorial structure of the problem to either obtain a large payoff from each item or to explore by getting a sample from that item. This method systematically narrows down the item choices to enhance learning efficiency and payoff. Second, we consider adversarial payoffs and stochastic iid window sizes. We start from the full-information problem of finding the permutation that maximizes the expected payoff. By a novel combinatorial argument, we characterize the polytope of admissible item selection probabilities by a permutation and show it has a polynomial-size representation. Using this representation, we show how standard algorithms for adversarial online linear optimization in the space of admissible probabilities can be used to obtain a polynomial-time algorithm with $O(\sqrt{T})$ regret.  ( 3 min )
    Geometry-Informed Neural Networks
    arXiv:2402.14009v1 Announce Type: new Abstract: We introduce the concept of geometry-informed neural networks (GINNs), which encompass (i) learning under geometric constraints, (ii) neural fields as a suitable representation, and (iii) generating diverse solutions to under-determined systems often encountered in geometric tasks. Notably, the GINN formulation does not require training data, and as such can be considered generative modeling driven purely by constraints. We add an explicit diversity loss to mitigate mode collapse. We consider several constraints, in particular, the connectedness of components which we convert to a differentiable loss through Morse theory. Experimentally, we demonstrate the efficacy of the GINN learning paradigm across a range of two and three-dimensional scenarios with increasing levels of complexity.  ( 2 min )
    A Simple and Yet Fairly Effective Defense for Graph Neural Networks
    arXiv:2402.13987v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have emerged as the dominant approach for machine learning on graph-structured data. However, concerns have arisen regarding the vulnerability of GNNs to small adversarial perturbations. Existing defense methods against such perturbations suffer from high time complexity and can negatively impact the model's performance on clean graphs. To address these challenges, this paper introduces NoisyGNNs, a novel defense method that incorporates noise into the underlying model's architecture. We establish a theoretical connection between noise injection and the enhancement of GNN robustness, highlighting the effectiveness of our approach. We further conduct extensive empirical evaluations on the node classification task to validate our theoretical findings, focusing on two popular GNNs: the GCN and GIN. The results demonstrate that NoisyGNN achieves superior or comparable defense performance to existing methods while minimizing added time complexity. The NoisyGNN approach is model-agnostic, allowing it to be integrated with different GNN architectures. Successful combinations of our NoisyGNN approach with existing defense techniques demonstrate even further improved adversarial defense results. Our code is publicly available at: https://github.com/Sennadir/NoisyGNN.  ( 2 min )
    FedADMM-InSa: An Inexact and Self-Adaptive ADMM for Federated Learning
    arXiv:2402.13989v1 Announce Type: new Abstract: Federated learning (FL) is a promising framework for learning from distributed data while maintaining privacy. The development of efficient FL algorithms encounters various challenges, including heterogeneous data and systems, limited communication capacities, and constrained local computational resources. Recently developed FedADMM methods show great resilience to both data and system heterogeneity. However, they still suffer from performance deterioration if the hyperparameters are not carefully tuned. To address this issue, we propose an inexact and self-adaptive FedADMM algorithm, termed FedADMM-InSa. First, we design an inexactness criterion for the clients' local updates to eliminate the need for empirically setting the local training accuracy. This inexactness criterion can be assessed by each client independently based on its unique condition, thereby reducing the local computational cost and mitigating the undesirable straggle effect. The convergence of the resulting inexact ADMM is proved under the assumption of strongly convex loss functions. Additionally, we present a self-adaptive scheme that dynamically adjusts each client's penalty parameter, enhancing algorithm robustness by mitigating the need for empirical penalty parameter choices for each client. Extensive numerical experiments on both synthetic and real-world datasets are conducted. As validated by some numerical tests, our proposed algorithm can reduce the clients' local computational load significantly and also accelerate the learning process compared to the vanilla FedADMM.  ( 2 min )
    Stability-Aware Training of Neural Network Interatomic Potentials with Differentiable Boltzmann Estimators
    arXiv:2402.13984v1 Announce Type: new Abstract: Neural network interatomic potentials (NNIPs) are an attractive alternative to ab-initio methods for molecular dynamics (MD) simulations. However, they can produce unstable simulations which sample unphysical states, limiting their usefulness for modeling phenomena occurring over longer timescales. To address these challenges, we present Stability-Aware Boltzmann Estimator (StABlE) Training, a multi-modal training procedure which combines conventional supervised training from quantum-mechanical energies and forces with reference system observables, to produce stable and accurate NNIPs. StABlE Training iteratively runs MD simulations to seek out unstable regions, and corrects the instabilities via supervision with a reference observable. The training procedure is enabled by the Boltzmann Estimator, which allows efficient computation of gradients required to train neural networks to system observables, and can detect both global and local instabilities. We demonstrate our methodology across organic molecules, tetrapeptides, and condensed phase systems, along with using three modern NNIP architectures. In all three cases, StABlE-trained models achieve significant improvements in simulation stability and recovery of structural and dynamic observables. In some cases, StABlE-trained models outperform conventional models trained on datasets 50 times larger. As a general framework applicable across NNIP architectures and systems, StABlE Training is a powerful tool for training stable and accurate NNIPs, particularly in the absence of large reference datasets.  ( 3 min )
    The Importance of Architecture Choice in Deep Learning for Climate Applications
    arXiv:2402.13979v1 Announce Type: new Abstract: Machine Learning has become a pervasive tool in climate science applications. However, current models fail to address nonstationarity induced by anthropogenic alterations in greenhouse emissions and do not routinely quantify the uncertainty of proposed projections. In this paper, we model the Atlantic Meridional Overturning Circulation (AMOC) which is of major importance to climate in Europe and the US East Coast by transporting warm water to these regions, and has the potential for abrupt collapse. We can generate arbitrarily extreme climate scenarios through arbitrary time scales which we then predict using neural networks. Our analysis shows that the AMOC is predictable using neural networks under a diverse set of climate scenarios. Further experiments reveal that MLPs and Deep Ensembles can learn the physics of the AMOC instead of imitating its progression through autocorrelation. With quantified uncertainty, an intriguing pattern of "spikes" before critical points of collapse in the AMOC casts doubt on previous analyses that predicted an AMOC collapse within this century. Our results show that Bayesian Neural Networks perform poorly compared to more dense architectures and care should be taken when applying neural networks to nonstationary scenarios such as climate projections. Further, our results highlight that big NN models might have difficulty in modeling global Earth System dynamics accurately and be successfully applied in nonstationary climate scenarios due to the physics being challenging for neural networks to capture.  ( 2 min )
    Do Efficient Transformers Really Save Computation?
    arXiv:2402.13934v1 Announce Type: new Abstract: As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.  ( 2 min )
    AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning
    arXiv:2402.13946v1 Announce Type: new Abstract: Machine learning has shown great promise in addressing several critical hardware security problems. In particular, researchers have developed novel graph neural network (GNN)-based techniques for detecting intellectual property (IP) piracy, detecting hardware Trojans (HTs), and reverse engineering circuits, to name a few. These techniques have demonstrated outstanding accuracy and have received much attention in the community. However, since these techniques are used for security applications, it is imperative to evaluate them thoroughly and ensure they are robust and do not compromise the security of integrated circuits. In this work, we propose AttackGNN, the first red-team attack on GNN-based techniques in hardware security. To this end, we devise a novel reinforcement learning (RL) agent that generates adversarial examples, i.e., circuits, against the GNN-based techniques. We overcome three challenges related to effectiveness, scalability, and generality to devise a potent RL agent. We target five GNN-based techniques for four crucial classes of problems in hardware security: IP piracy, detecting/localizing HTs, reverse engineering, and hardware obfuscation. Through our approach, we craft circuits that fool all GNNs considered in this work. For instance, to evade IP piracy detection, we generate adversarial pirated circuits that fool the GNN-based defense into classifying our crafted circuits as not pirated. For attacking HT localization GNN, our attack generates HT-infested circuits that fool the defense on all tested circuits. We obtain a similar 100% success rate against GNNs for all classes of problems.  ( 3 min )
    Enhancing Reinforcement Learning Agents with Local Guides
    arXiv:2402.13930v1 Announce Type: new Abstract: This paper addresses the problem of integrating local guide policies into a Reinforcement Learning agent. For this, we show how to adapt existing algorithms to this setting before introducing a novel algorithm based on a noisy policy-switching procedure. This approach builds on a proper Approximate Policy Evaluation (APE) scheme to provide a perturbation that carefully leads the local guides towards better actions. We evaluated our method on a set of classical Reinforcement Learning problems, including safety-critical systems where the agent cannot enter some areas at the risk of triggering catastrophic consequences. In all the proposed environments, our agent proved to be efficient at leveraging those policies to improve the performance of any APE-based Reinforcement Learning algorithm, especially in its first learning stages.  ( 2 min )
    Dealing with unbounded gradients in stochastic saddle-point optimization
    arXiv:2402.13903v1 Announce Type: new Abstract: We study the performance of stochastic first-order methods for finding saddle points of convex-concave functions. A notorious challenge faced by such methods is that the gradients can grow arbitrarily large during optimization, which may result in instability and divergence. In this paper, we propose a simple and effective regularization technique that stabilizes the iterates and yields meaningful performance guarantees even if the domain and the gradient noise scales linearly with the size of the iterates (and is thus potentially unbounded). Besides providing a set of general results, we also apply our algorithm to a specific problem in reinforcement learning, where it leads to performance guarantees for finding near-optimal policies in an average-reward MDP without prior knowledge of the bias span.  ( 2 min )
    Replication Study: Enhancing Hydrological Modeling with Physics-Guided Machine Learning
    arXiv:2402.13911v1 Announce Type: new Abstract: Current hydrological modeling methods combine data-driven Machine Learning (ML) algorithms and traditional physics-based models to address their respective limitations incorrect parameter estimates from rigid physics-based models and the neglect of physical process constraints by ML algorithms. Despite the accuracy of ML in outcome prediction, the integration of scientific knowledge is crucial for reliable predictions. This study introduces a Physics Informed Machine Learning (PIML) model, which merges the process understanding of conceptual hydrological models with the predictive efficiency of ML algorithms. Applied to the Anandapur sub-catchment, the PIML model demonstrates superior performance in forecasting monthly streamflow and actual evapotranspiration over both standalone conceptual models and ML algorithms, ensuring physical consistency of the outputs. This study replicates the methodologies of Bhasme, P., Vagadiya, J., & Bhatia, U. (2022) from their pivotal work on Physics Informed Machine Learning for hydrological processes, utilizing their shared code and datasets to further explore the predictive capabilities in hydrological modeling.  ( 2 min )
    Non-asymptotic Convergence of Discrete-time Diffusion Models: New Approach and Improved Rate
    arXiv:2402.13901v1 Announce Type: new Abstract: The denoising diffusion model emerges recently as a powerful generative technique that converts noise into data. Theoretical convergence guarantee has been mainly studied for continuous-time diffusion models, and has been obtained for discrete-time diffusion models only for distributions with bounded support in the literature. In this paper, we establish the convergence guarantee for substantially larger classes of distributions under discrete-time diffusion models and further improve the convergence rate for distributions with bounded support. In particular, we first establish the convergence rates for both smooth and general (possibly non-smooth) distributions having finite second moment. We then specialize our results to a number of interesting classes of distributions with explicit parameter dependencies, including distributions with Lipschitz scores, Gaussian mixture distributions, and distributions with bounded support. We further propose a novel accelerated sampler and show that it improves the convergence rates of the corresponding regular sampler by orders of magnitude with respect to all system parameters. For distributions with bounded support, our result improves the dimensional dependence of the previous convergence rate by orders of magnitude. Our study features a novel analysis technique that constructs tilting factor representation of the convergence error and exploits Tweedie's formula for handling Taylor expansion power terms.  ( 2 min )
    Overcoming Saturation in Density Ratio Estimation by Iterated Regularization
    arXiv:2402.13891v1 Announce Type: new Abstract: Estimating the ratio of two probability densities from finitely many samples, is a central task in machine learning and statistics. In this work, we show that a large class of kernel methods for density ratio estimation suffers from error saturation, which prevents algorithms from achieving fast error convergence rates on highly regular learning problems. To resolve saturation, we introduce iterated regularization in density ratio estimation to achieve fast error rates. Our methods outperform its non-iteratively regularized versions on benchmarks for density ratio estimation as well as on large-scale evaluations for importance-weighted ensembling of deep unsupervised domain adaptation models.  ( 2 min )
    Generative Probabilistic Time Series Forecasting and Applications in Grid Operations
    arXiv:2402.13870v1 Announce Type: new Abstract: Generative probabilistic forecasting produces future time series samples according to the conditional probability distribution given past time series observations. Such techniques are essential in risk-based decision-making and planning under uncertainty with broad applications in grid operations, including electricity price forecasting, risk-based economic dispatch, and stochastic optimizations. Inspired by Wiener and Kallianpur's innovation representation, we propose a weak innovation autoencoder architecture and a learning algorithm to extract independent and identically distributed innovation sequences from nonparametric stationary time series. We show that the weak innovation sequence is Bayesian sufficient, which makes the proposed weak innovation autoencoder a canonical architecture for generative probabilistic forecasting. The proposed technique is applied to forecasting highly volatile real-time electricity prices, demonstrating superior performance across multiple forecasting measures over leading probabilistic and point forecasting techniques.  ( 2 min )
    An Explainable Transformer-based Model for Phishing Email Detection: A Large Language Model Approach
    arXiv:2402.13871v1 Announce Type: new Abstract: Phishing email is a serious cyber threat that tries to deceive users by sending false emails with the intention of stealing confidential information or causing financial harm. Attackers, often posing as trustworthy entities, exploit technological advancements and sophistication to make detection and prevention of phishing more challenging. Despite extensive academic research, phishing detection remains an ongoing and formidable challenge in the cybersecurity landscape. Large Language Models (LLMs) and Masked Language Models (MLMs) possess immense potential to offer innovative solutions to address long-standing challenges. In this research paper, we present an optimized, fine-tuned transformer-based DistilBERT model designed for the detection of phishing emails. In the detection process, we work with a phishing email dataset and utilize the preprocessing techniques to clean and solve the imbalance class issues. Through our experiments, we found that our model effectively achieves high accuracy, demonstrating its capability to perform well. Finally, we demonstrate our fine-tuned model using Explainable-AI (XAI) techniques such as Local Interpretable Model-Agnostic Explanations (LIME) and Transformer Interpret to explain how our model makes predictions in the context of text classification for phishing emails.  ( 2 min )
    Replicable Learning of Large-Margin Halfspaces
    arXiv:2402.13857v1 Announce Type: new Abstract: We provide efficient replicable algorithms for the problem of learning large-margin halfspaces. Our results improve upon the algorithms provided by Impagliazzo, Lei, Pitassi, and Sorrell [STOC, 2022]. We design the first dimension-independent replicable algorithms for this task which runs in polynomial time, is proper, and has strictly improved sample complexity compared to the one achieved by Impagliazzo et al. [2022] with respect to all the relevant parameters. Moreover, our first algorithm has sample complexity that is optimal with respect to the accuracy parameter $\epsilon$. We also design an SGD-based replicable algorithm that, in some parameters' regimes, achieves better sample and time complexity than our first algorithm. Departing from the requirement of polynomial time algorithms, using the DP-to-Replicability reduction of Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sorrell, and Sivakumar [STOC, 2023], we show how to obtain a replicable algorithm for large-margin halfspaces with improved sample complexity with respect to the margin parameter $\tau$, but running time doubly exponential in $1/\tau^2$ and worse sample complexity dependence on $\epsilon$ than one of our previous algorithms. We then design an improved algorithm with better sample complexity than all three of our previous algorithms and running time exponential in $1/\tau^{2}$.  ( 2 min )
    Neural Control System for Continuous Glucose Monitoring and Maintenance
    arXiv:2402.13852v1 Announce Type: new Abstract: Precise glucose level management is pivotal for individuals with diabetes, averting severe complications. In this work, we introduce a novel neural control system for continuous glucose monitoring and maintenance, utilizing differential predictive control. Our system, guided by a sophisticated neural policy and differentiable modeling, dynamically adjusts insulin delivery in real-time, enhancing glucose optimization. This end-to-end approach maximizes efficiency, ensuring personalized care and improved health outcomes, as affirmed by empirical findings.  ( 2 min )
    MLXP: A framework for conducting replicable Machine Learning eXperiments in Python
    arXiv:2402.13831v1 Announce Type: new Abstract: Replicability in machine learning (ML) research is increasingly concerning due to the utilization of complex non-deterministic algorithms and the dependence on numerous hyper-parameter choices, such as model architecture and training datasets. Ensuring reproducible and replicable results is crucial for advancing the field, yet often requires significant technical effort to conduct systematic and well-organized experiments that yield robust conclusions. Several tools have been developed to facilitate experiment management and enhance reproducibility; however, they often introduce complexity that hinders adoption within the research community, despite being well-handled in industrial settings. To address the challenge of low adoption, we propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python, available at https://github.com/inria-thoth/mlxp . MLXP streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.  ( 2 min )
    Performance Improvement Bounds for Lipschitz Configurable Markov Decision Processes
    arXiv:2402.13821v1 Announce Type: new Abstract: Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of the traditional Markov Decision Processes (MDPs) to model the real-world scenarios in which there is the possibility to intervene in the environment in order to configure some of its parameters. In this paper, we focus on a particular subclass of Conf-MDP that satisfies regularity conditions, namely Lipschitz continuity. We start by providing a bound on the Wasserstein distance between $\gamma$-discounted stationary distributions induced by changing policy and configuration. This result generalizes the already existing bounds both for Conf-MDPs and traditional MDPs. Then, we derive a novel performance improvement lower bound.  ( 2 min )
    Voice-Driven Mortality Prediction in Hospitalized Heart Failure Patients: A Machine Learning Approach Enhanced with Diagnostic Biomarkers
    arXiv:2402.13812v1 Announce Type: new Abstract: Addressing heart failure (HF) as a prevalent global health concern poses difficulties in implementing innovative approaches for enhanced patient care. Predicting mortality rates in HF patients, in particular, is difficult yet critical, necessitating individualized care, proactive management, and enabling educated decision-making to enhance outcomes. Recently, the significance of voice biomarkers coupled with Machine Learning (ML) has surged, demonstrating remarkable efficacy, particularly in predicting heart failure. The synergy of voice analysis and ML algorithms provides a non-invasive and easily accessible means to evaluate patients' health. However, there is a lack of voice biomarkers for predicting mortality rates among heart failure patients with standardized speech protocols. Here, we demonstrate a powerful and effective ML model for predicting mortality rates in hospitalized HF patients through the utilization of voice biomarkers. By seamlessly integrating voice biomarkers into routine patient monitoring, this strategy has the potential to improve patient outcomes, optimize resource allocation, and advance patient-centered HF management. In this study, a Machine Learning system, specifically a logistic regression model, is trained to predict patients' 5-year mortality rates using their speech as input. The model performs admirably and consistently, as demonstrated by cross-validation and statistical approaches (p-value < 0.001). Furthermore, integrating NT-proBNP, a diagnostic biomarker in HF, improves the model's predictive accuracy substantially.  ( 3 min )
    FLD: Fourier Latent Dynamics for Structured Motion Representation and Learning
    arXiv:2402.13820v1 Announce Type: new Abstract: Motion trajectories offer reliable references for physics-based motion learning but suffer from sparsity, particularly in regions that lack sufficient data coverage. To address this challenge, we introduce a self-supervised, structured representation and generation method that extracts spatial-temporal relationships in periodic or quasi-periodic motions. The motion dynamics in a continuously parameterized latent space enable our method to enhance the interpolation and generalization capabilities of motion learning algorithms. The motion learning controller, informed by the motion parameterization, operates online tracking of a wide range of motions, including targets unseen during training. With a fallback mechanism, the controller dynamically adapts its tracking strategy and automatically resorts to safe action execution when a potentially risky target is proposed. By leveraging the identified spatial-temporal structure, our work opens new possibilities for future advancements in general motion representation and learning algorithms.  ( 2 min )
    Opening the Black-Box: A Systematic Review on Explainable AI in Remote Sensing
    arXiv:2402.13791v1 Announce Type: new Abstract: In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in Remote Sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the used explainable AI methods and their objectives, findings, and challenges in Remote Sensing applications is still missing. In this paper, we address this issue by performing a systematic review to identify the key trends of how explainable AI is used in Remote Sensing and shed light on novel explainable AI approaches and emerging directions that tackle specific Remote Sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights in Remote Sensing, and reflect on the approaches used for explainable AI methods evaluation. Our review provides a complete summary of the state-of-the-art in the field. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field of explainable AI in Remote Sensing.  ( 2 min )
    The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank
    arXiv:2402.13810v1 Announce Type: new Abstract: Langevin dynamics (LD) is widely used for sampling from distributions and for optimization. In this work, we derive a closed-form expression for the expected loss of preconditioned LD near stationary points of the objective function. We use the fact that at the vicinity of such points, LD reduces to an Ornstein-Uhlenbeck process, which is amenable to convenient mathematical treatment. Our analysis reveals that when the preconditioning matrix satisfies a particular relation with respect to the noise covariance, LD's expected loss becomes proportional to the rank of the objective's Hessian. We illustrate the applicability of this result in the context of neural networks, where the Hessian rank has been shown to capture the complexity of the predictor function but is usually computationally hard to probe. Finally, we use our analysis to compare SGD-like and Adam-like preconditioners and identify the regimes under which each of them leads to a lower expected loss.  ( 2 min )
    Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning
    arXiv:2402.13781v1 Announce Type: new Abstract: Communication overhead is a major obstacle to scaling distributed training systems. Gradient sparsification is a potential optimization approach to reduce the communication volume without significant loss of model fidelity. However, existing gradient sparsification methods have low scalability owing to inefficient design of their algorithms, which raises the communication overhead significantly. In particular, gradient build-up and inadequate sparsity control methods degrade the sparsification performance considerably. Moreover, communication traffic increases drastically owing to workload imbalance of gradient selection between workers. To address these challenges, we propose a novel gradient sparsification scheme called ExDyna. In ExDyna, the gradient tensor of the model comprises fined-grained blocks, and contiguous blocks are grouped into non-overlapping partitions. Each worker selects gradients in its exclusively allocated partition so that gradient build-up never occurs. To balance the workload of gradient selection between workers, ExDyna adjusts the topology of partitions by comparing the workloads of adjacent partitions. In addition, ExDyna supports online threshold scaling, which estimates the accurate threshold of gradient selection on-the-fly. Accordingly, ExDyna can satisfy the user-required sparsity level during a training period regardless of models and datasets. Therefore, ExDyna can enhance the scalability of distributed training systems by preserving near-optimal gradient sparsification cost. In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance while achieving high accuracy.  ( 3 min )
    Contextual Molecule Representation Learning from Chemical Reaction Knowledge
    arXiv:2402.13779v1 Announce Type: new Abstract: In recent years, self-supervised learning has emerged as a powerful tool to harness abundant unlabelled data for representation learning and has been broadly adopted in diverse areas. However, when applied to molecular representation learning (MRL), prevailing techniques such as masked sub-unit reconstruction often fall short, due to the high degree of freedom in the possible combinations of atoms within molecules, which brings insurmountable complexity to the masking-reconstruction paradigm. To tackle this challenge, we introduce REMO, a self-supervised learning framework that takes advantage of well-defined atom-combination rules in common chemistry. Specifically, REMO pre-trains graph/Transformer encoders on 1.7 million known chemical reactions in the literature. We propose two pre-training objectives: Masked Reaction Centre Reconstruction (MRCR) and Reaction Centre Identification (RCI). REMO offers a novel solution to MRL by exploiting the underlying shared patterns in chemical reactions as \textit{context} for pre-training, which effectively infers meaningful representations of common chemistry knowledge. Such contextual representations can then be utilized to support diverse downstream molecular tasks with minimum finetuning, such as affinity prediction and drug-drug interaction prediction. Extensive experimental results on MoleculeACE, ACNet, drug-drug interaction (DDI), and reaction type classification show that across all tested downstream tasks, REMO outperforms the standard baseline of single-molecule masked modeling used in current MRL. Remarkably, REMO is the pioneering deep learning model surpassing fingerprint-based methods in activity cliff benchmarks.  ( 2 min )
    Deep Generative Models for Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions
    arXiv:2402.13777v1 Announce Type: new Abstract: Deep generative models (DGMs) have demonstrated great success across various domains, particularly in generating texts, images, and videos using models trained from offline data. Similarly, data-driven decision-making and robotic control also necessitate learning a generator function from the offline data to serve as the strategy or policy. In this case, applying deep generative models in offline policy learning exhibits great potential, and numerous studies have explored in this direction. However, this field still lacks a comprehensive review and so developments of different branches are relatively independent. Thus, we provide the first systematic review on the applications of deep generative models for offline policy learning. In particular, we cover five mainstream deep generative models, including Variational Auto-Encoders, Generative Adversarial Networks, Normalizing Flows, Transformers, and Diffusion Models, and their applications in both offline reinforcement learning (offline RL) and imitation learning (IL). Offline RL and IL are two main branches of offline policy learning and are widely-adopted techniques for sequential decision-making. Specifically, for each type of DGM-based offline policy learning, we distill its fundamental scheme, categorize related works based on the usage of the DGM, and sort out the development process of algorithms in that field. Subsequent to the main content, we provide in-depth discussions on deep generative models and offline policy learning as a summary, based on which we present our perspectives on future research directions. This work offers a hands-on reference for the research progress in deep generative models for offline policy learning, and aims to inspire improved DGM-based offline RL or IL algorithms.  ( 3 min )
    Accuracy-Preserving Calibration via Statistical Modeling on Probability Simplex
    arXiv:2402.13765v1 Announce Type: new Abstract: Classification models based on deep neural networks (DNNs) must be calibrated to measure the reliability of predictions. Some recent calibration methods have employed a probabilistic model on the probability simplex. However, these calibration methods cannot preserve the accuracy of pre-trained models, even those with a high classification accuracy. We propose an accuracy-preserving calibration method using the Concrete distribution as the probabilistic model on the probability simplex. We theoretically prove that a DNN model trained on cross-entropy loss has optimality as the parameter of the Concrete distribution. We also propose an efficient method that synthetically generates samples for training probabilistic models on the probability simplex. We demonstrate that the proposed method can outperform previous methods in accuracy-preserving calibration tasks using benchmarks.  ( 2 min )
    Reasoning Algorithmically in Graph Neural Networks
    arXiv:2402.13744v1 Announce Type: new Abstract: The development of artificial intelligence systems with advanced reasoning capabilities represents a persistent and long-standing research question. Traditionally, the primary strategy to address this challenge involved the adoption of symbolic approaches, where knowledge was explicitly represented by means of symbols and explicitly programmed rules. However, with the advent of machine learning, there has been a paradigm shift towards systems that can autonomously learn from data, requiring minimal human guidance. In light of this shift, in latest years, there has been increasing interest and efforts at endowing neural networks with the ability to reason, bridging the gap between data-driven learning and logical reasoning. Within this context, Neural Algorithmic Reasoning (NAR) stands out as a promising research field, aiming to integrate the structured and rule-based reasoning of algorithms with the adaptive learning capabilities of neural networks, typically by tasking neural models to mimic classical algorithms. In this dissertation, we provide theoretical and practical contributions to this area of research. We explore the connections between neural networks and tropical algebra, deriving powerful architectures that are aligned with algorithm execution. Furthermore, we discuss and show the ability of such neural reasoners to learn and manipulate complex algorithmic and combinatorial optimization concepts, such as the principle of strong duality. Finally, in our empirical efforts, we validate the real-world utility of NAR networks across different practical scenarios. This includes tasks as diverse as planning problems, large-scale edge classification tasks and the learning of polynomial-time approximate algorithms for NP-hard combinatorial problems. Through this exploration, we aim to showcase the potential integrating algorithmic reasoning in machine learning models.  ( 3 min )
    AI-Powered Predictions for Electricity Load in Prosumer Communities
    arXiv:2402.13752v1 Announce Type: new Abstract: The flexibility in electricity consumption and production in communities of residential buildings, including those with renewable energy sources and energy storage (a.k.a., prosumers), can effectively be utilized through the advancement of short-term demand response mechanisms. It is known that flexibility can further be increased if demand response is performed at the level of communities of prosumers, since aggregated groups can better coordinate electricity consumption. However, the effectiveness of such short-term optimization is highly dependent on the accuracy of electricity load forecasts both for each building as well as for the whole community. Structural variations in the electricity load profile can be associated with different exogenous factors, such as weather conditions, calendar information and day of the week, as well as user behavior. In this paper, we review a wide range of electricity load forecasting techniques, that can provide significant assistance in optimizing load consumption in prosumer communities. We present and test artificial intelligence (AI) powered short-term load forecasting methodologies that operate with black-box time series models, such as Facebook's Prophet and Long Short-term Memory (LSTM) models; season-based SARIMA and smoothing Holt-Winters models; and empirical regression-based models that utilize domain knowledge. The integration of weather forecasts into data-driven time series forecasts is also tested. Results show that the combination of persistent and regression terms (adapted to the load forecasting task) achieves the best forecast accuracy.  ( 3 min )
    Average gradient outer product as a mechanism for deep neural collapse
    arXiv:2402.13728v1 Announce Type: new Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a wide variety of settings, its emergence is only partially understood. In this work, we provide substantial evidence that DNC formation occurs primarily through deep feature learning with the average gradient outer product (AGOP). This takes a step further compared to efforts that explain neural collapse via feature-agnostic approaches, such as the unconstrained features model. We proceed by providing evidence that the right singular vectors and values of the weights are responsible for the majority of within-class variability collapse in DNNs. As shown in recent work, this singular structure is highly correlated with that of the AGOP. We then establish experimentally and theoretically that AGOP induces neural collapse in a randomly initialized neural network. In particular, we demonstrate that Deep Recursive Feature Machines, a method originally introduced as an abstraction for AGOP feature learning in convolutional neural networks, exhibits DNC.  ( 2 min )
    On the Conflict of Robustness and Learning in Collaborative Machine Learning
    arXiv:2402.13700v1 Announce Type: new Abstract: Collaborative Machine Learning (CML) allows participants to jointly train a machine learning model while keeping their training data private. In scenarios where privacy is a strong requirement, such as health-related applications, safety is also a primary concern. This means that privacy-preserving CML processes must produce models that output correct and reliable decisions \emph{even in the presence of potentially untrusted participants}. In response to this issue, researchers propose to use \textit{robust aggregators} that rely on metrics which help filter out malicious contributions that could compromise the training process. In this work, we formalize the landscape of robust aggregators in the literature. Our formalization allows us to show that existing robust aggregators cannot fulfill their goal: either they use distance-based metrics that cannot accurately identify targeted malicious updates; or propose methods whose success is in direct conflict with the ability of CML participants to learn from others and therefore cannot eliminate the risk of manipulation without preventing learning.  ( 2 min )
    DSLR: Diversity Enhancement and Structure Learning for Rehearsal-based Graph Continual Learning
    arXiv:2402.13711v1 Announce Type: new Abstract: We investigate the replay buffer in rehearsal-based approaches for graph continual learning (GCL) methods. Existing rehearsal-based GCL methods select the most representative nodes for each class and store them in a replay buffer for later use in training subsequent tasks. However, we discovered that considering only the class representativeness of each replayed node makes the replayed nodes to be concentrated around the center of each class, incurring a potential risk of overfitting to nodes residing in those regions, which aggravates catastrophic forgetting. Moreover, as the rehearsal-based approach heavily relies on a few replayed nodes to retain knowledge obtained from previous tasks, involving the replayed nodes that have irrelevant neighbors in the model training may have a significant detrimental impact on model performance. In this paper, we propose a GCL model named DSLR, specifically, we devise a coverage-based diversity (CD) approach to consider both the class representativeness and the diversity within each class of the replayed nodes. Moreover, we adopt graph structure learning (GSL) to ensure that the replayed nodes are connected to truly informative neighbors. Extensive experimental results demonstrate the effectiveness and efficiency of DSLR.  ( 2 min )
    PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models
    arXiv:2402.13653v1 Announce Type: new Abstract: We introduce the novel task of zero-shot Protein Question Answering (PQA) for free-form scientific enquiry. Given a previously unseen protein sequence and a natural language question, the task is to deliver a scientifically accurate answer. This task not only supports future biological research, but could also provide a test bed for assessing the scientific precision of large language models (LLMs). We contribute the first specialized dataset for PQA model training, containing 257K protein sequences annotated with 1.97M scientific question-answer pairs. Additionally, we propose and study several novel biologically relevant benchmarks for scientific PQA. Employing two robust multi-modal architectures, we establish an initial state-of-the-art performance for PQA and reveal key performance factors through ablation studies. Our comprehensive PQA framework, named Pika, including dataset, code, model checkpoints, and a user-friendly demo, is openly accessible on github.com/EMCarrami/Pika, promoting wider research and application in the field.  ( 2 min )
    Stable Update of Regression Trees
    arXiv:2402.13655v1 Announce Type: new Abstract: Updating machine learning models with new information usually improves their predictive performance, yet, in many applications, it is also desirable to avoid changing the model predictions too much. This property is called stability. In most cases when stability matters, so does explainability. We therefore focus on the stability of an inherently explainable machine learning method, namely regression trees. We aim to use the notion of empirical stability and design algorithms for updating regression trees that provide a way to balance between predictability and empirical stability. To achieve this, we propose a regularization method, where data points are weighted based on the uncertainty in the initial model. The balance between predictability and empirical stability can be adjusted through hyperparameters. This regularization method is evaluated in terms of loss and stability and assessed on a broad range of data characteristics. The results show that the proposed update method improves stability while achieving similar or better predictive performance. This shows that it is possible to achieve both predictive and stable results when updating regression trees.  ( 2 min )
    The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review
    arXiv:2402.13635v1 Announce Type: new Abstract: The adoption of machine learning (ML) and, more specifically, deep learning (DL) applications into all major areas of our lives is underway. The development of trustworthy AI is especially important in medicine due to the large implications for patients' lives. While trustworthiness concerns various aspects including ethical, technical and privacy requirements, we focus on the importance of data quality (training/test) in DL. Since data quality dictates the behaviour of ML products, evaluating data quality will play a key part in the regulatory approval of medical AI products. We perform a systematic review following PRISMA guidelines using the databases PubMed and ACM Digital Library. We identify 2362 studies, out of which 62 records fulfil our eligibility criteria. From this literature, we synthesise the existing knowledge on data quality frameworks and combine it with the perspective of ML applications in medicine. As a result, we propose the METRIC-framework, a specialised data quality framework for medical training data comprising 15 awareness dimensions, along which developers of medical ML applications should investigate a dataset. This knowledge helps to reduce biases as a major source of unfairness, increase robustness, facilitate interpretability and thus lays the foundation for trustworthy AI in medicine. Incorporating such systematic assessment of medical datasets into regulatory approval processes has the potential to accelerate the approval of ML products and builds the basis for new standards.  ( 3 min )
    FlexHB: a More Efficient and Flexible Framework for Hyperparameter Optimization
    arXiv:2402.13641v1 Announce Type: new Abstract: Given a Hyperparameter Optimization(HPO) problem, how to design an algorithm to find optimal configurations efficiently? Bayesian Optimization(BO) and the multi-fidelity BO methods employ surrogate models to sample configurations based on history evaluations. More recent studies obtain better performance by integrating BO with HyperBand(HB), which accelerates evaluation by early stopping mechanism. However, these methods ignore the advantage of a suitable evaluation scheme over the default HyperBand, and the capability of BO is still constrained by skewed evaluation results. In this paper, we propose FlexHB, a new method pushing multi-fidelity BO to the limit as well as re-designing a framework for early stopping with Successive Halving(SH). Comprehensive study on FlexHB shows that (1) our fine-grained fidelity method considerably enhances the efficiency of searching optimal configurations, (2) our FlexBand framework (self-adaptive allocation of SH brackets, and global ranking of configurations in both current and past SH procedures) grants the algorithm with more flexibility and improves the anytime performance. Our method achieves superior efficiency and outperforms other methods on various HPO tasks. Empirical results demonstrate that FlexHB can achieve up to 6.9X and 11.1X speedups over the state-of-the-art MFES-HB and BOHB respectively.  ( 2 min )
    Improving Building Temperature Forecasting: A Data-driven Approach with System Scenario Clustering
    arXiv:2402.13628v1 Announce Type: new Abstract: Heat, Ventilation and Air Conditioning (HVAC) systems play a critical role in maintaining a comfortable thermal environment and cost approximately 40% of primary energy usage in the building sector. For smart energy management in buildings, usage patterns and their resulting profiles allow the improvement of control systems with prediction capabilities. However, for large-scale HVAC system management, it is difficult to construct a detailed model for each subsystem. In this paper, a new data-driven room temperature prediction model is proposed based on the k-means clustering method. The proposed data-driven temperature prediction approach extracts the system operation feature through historical data analysis and further simplifies the system-level model to improve generalization and computational efficiency. We evaluate the proposed approach in the real world. The results demonstrated that our approach can significantly reduce modeling time without reducing prediction accuracy.  ( 2 min )
    UniGraph: Learning a Cross-Domain Graph Foundation Model From Natural Language
    arXiv:2402.13630v1 Announce Type: new Abstract: Foundation models like ChatGPT and GPT-4 have revolutionized artificial intelligence, exhibiting remarkable abilities to generalize across a wide array of tasks and applications beyond their initial training objectives. However, when this concept is applied to graph learning, a stark contrast emerges. Graph learning has predominantly focused on single-graph models, tailored to specific tasks or datasets, lacking the ability to transfer learned knowledge to different domains. This limitation stems from the inherent complexity and diversity of graph structures, along with the different feature and label spaces specific to graph data. In this paper, we present our UniGraph framework, designed to train a graph foundation model capable of generalizing to unseen graphs and tasks across diverse domains. Unlike single-graph models that use pre-computed node features of varying dimensions as input, our approach leverages Text-Attributed Graphs (TAGs) for unifying node representations. We propose a cascaded architecture of Language Models (LMs) and Graph Neural Networks (GNNs) as backbone networks with a self-supervised training objective based on Masked Graph Modeling (MGM). We introduce graph instruction tuning using Large Language Models (LLMs) to enable zero-shot prediction ability. Our comprehensive experiments across various graph learning tasks and domains demonstrate the model's effectiveness in self-supervised representation learning on unseen graphs, few-shot in-context transfer, and zero-shot transfer, even surpassing or matching the performance of GNNs that have undergone supervised training on target datasets.  ( 2 min )
    On the Expressive Power of a Variant of the Looped Transformer
    arXiv:2402.13572v1 Announce Type: new Abstract: Besides natural language processing, transformers exhibit extraordinary performance in solving broader applications, including scientific computing and computer vision. Previous works try to explain this from the expressive power and capability perspectives that standard transformers are capable of performing some algorithms. To empower transformers with algorithmic capabilities and motivated by the recently proposed looped transformer (Yang et al., 2024; Giannou et al., 2023), we design a novel transformer block, dubbed Algorithm Transformer (abbreviated as AlgoFormer). Compared with the standard transformer and vanilla looped transformer, the proposed AlgoFormer can achieve significantly higher expressiveness in algorithm representation when using the same number of parameters. In particular, inspired by the structure of human-designed learning algorithms, our transformer block consists of a pre-transformer that is responsible for task pre-processing, a looped transformer for iterative optimization algorithms, and a post-transformer for producing the desired results after post-processing. We provide theoretical evidence of the expressive power of the AlgoFormer in solving some challenging problems, mirroring human-designed algorithms. Furthermore, some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms. Experimental results demonstrate the empirical superiority of the proposed transformer in that it outperforms the standard transformer and vanilla looped transformer in some challenging tasks.  ( 3 min )
    Spot Check Equivalence: an Interpretable Metric for Information Elicitation Mechanisms
    arXiv:2402.13567v1 Announce Type: new Abstract: Because high-quality data is like oxygen for AI systems, effectively eliciting information from crowdsourcing workers has become a first-order problem for developing high-performance machine learning algorithms. Two prevalent paradigms, spot-checking and peer prediction, enable the design of mechanisms to evaluate and incentivize high-quality data from human labelers. So far, at least three metrics have been proposed to compare the performances of these techniques [33, 8, 3]. However, different metrics lead to divergent and even contradictory results in various contexts. In this paper, we harmonize these divergent stories, showing that two of these metrics are actually the same within certain contexts and explain the divergence of the third. Moreover, we unify these different contexts by introducing \textit{Spot Check Equivalence}, which offers an interpretable metric for the effectiveness of a peer prediction mechanism. Finally, we present two approaches to compute spot check equivalence in various contexts, where simulation results verify the effectiveness of our proposed metric.  ( 2 min )
    Inductive Graph Alignment Prompt: Bridging the Gap between Graph Pre-training and Inductive Fine-tuning From Spectral Perspective
    arXiv:2402.13556v1 Announce Type: new Abstract: The "Graph pre-training and fine-tuning" paradigm has significantly improved Graph Neural Networks(GNNs) by capturing general knowledge without manual annotations for downstream tasks. However, due to the immense gap of data and tasks between the pre-training and fine-tuning stages, the model performance is still limited. Inspired by prompt fine-tuning in Natural Language Processing(NLP), many endeavors have been made to bridge the gap in graph domain. But existing methods simply reformulate the form of fine-tuning tasks to the pre-training ones. With the premise that the pre-training graphs are compatible with the fine-tuning ones, these methods typically operate in transductive setting. In order to generalize graph pre-training to inductive scenario where the fine-tuning graphs might significantly differ from pre-training ones, we propose a novel graph prompt based method called Inductive Graph Alignment Prompt(IGAP). Firstly, we unify the mainstream graph pre-training frameworks and analyze the essence of graph pre-training from graph spectral theory. Then we identify the two sources of the data gap in inductive setting: (i) graph signal gap and (ii) graph structure gap. Based on the insight of graph pre-training, we propose to bridge the graph signal gap and the graph structure gap with learnable prompts in the spectral space. A theoretical analysis ensures the effectiveness of our method. At last, we conduct extensive experiments among nodes classification and graph classification tasks under the transductive, semi-inductive and inductive settings. The results demonstrate that our proposed method can successfully bridge the data gap under different settings.  ( 3 min )
    DiffPLF: A Conditional Diffusion Model for Probabilistic Forecasting of EV Charging Load
    arXiv:2402.13548v1 Announce Type: new Abstract: Due to the vast electric vehicle (EV) penetration to distribution grid, charging load forecasting is essential to promote charging station operation and demand-side management.However, the stochastic charging behaviors and associated exogenous factors render future charging load patterns quite volatile and hard to predict. Accordingly, we devise a novel Diffusion model termed DiffPLF for Probabilistic Load Forecasting of EV charging, which can explicitly approximate the predictive load distribution conditioned on historical data and related covariates. Specifically, we leverage a denoising diffusion model, which can progressively convert the Gaussian prior to real time-series data by learning a reversal of the diffusion process. Besides, we couple such diffusion model with a cross-attention-based conditioning mechanism to execute conditional generation for possible charging demand profiles. We also propose a task-informed fine-tuning technique to better adapt DiffPLF to the probabilistic time-series forecasting task and acquire more accurate and reliable predicted intervals. Finally, we conduct multiple experiments to validate the superiority of DiffPLF to predict complex temporal patterns of erratic charging load and carry out controllable generation based on certain covariate. Results demonstrate that we can attain a notable rise of 39.58% and 49.87% on MAE and CRPS respectively compared to the conventional method.  ( 2 min )
    Private Gradient Descent for Linear Regression: Tighter Error Bounds and Instance-Specific Uncertainty Estimation
    arXiv:2402.13531v1 Announce Type: new Abstract: We provide an improved analysis of standard differentially private gradient descent for linear regression under the squared error loss. Under modest assumptions on the input, we characterize the distribution of the iterate at each time step. Our analysis leads to new results on the algorithm's accuracy: for a proper fixed choice of hyperparameters, the sample complexity depends only linearly on the dimension of the data. This matches the dimension-dependence of the (non-private) ordinary least squares estimator as well as that of recent private algorithms that rely on sophisticated adaptive gradient-clipping schemes (Varshney et al., 2022; Liu et al., 2023). Our analysis of the iterates' distribution also allows us to construct confidence intervals for the empirical optimizer which adapt automatically to the variance of the algorithm on a particular data set. We validate our theorems through experiments on synthetic data.  ( 2 min )
    FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
    arXiv:2402.13533v1 Announce Type: new Abstract: Large language models (LLMs) are computationally intensive. The computation workload and the memory footprint grow quadratically with the dimension (layer width). Most of LLMs' parameters come from the linear layers of the transformer structure and are highly redundant. These linear layers contribute more than 80% of the computation workload and 99% of the model size. To pretrain and finetune LLMs efficiently, there are three major challenges to address: 1) reducing redundancy of the linear layers; 2) reducing GPU memory footprint; 3) improving GPU utilization when using distributed training. Prior methods, such as LoRA and QLoRA, utilized low-rank matrices and quantization to reduce the number of trainable parameters and model size, respectively. However, the resulting model still consumes a large amount of GPU memory. In this paper, we present high-performance GPU-based methods that exploit low-rank structures to pretrain and finetune LLMs for financial applications. We replace one conventional linear layer of the transformer structure with two narrower linear layers, which allows us to reduce the number of parameters by several orders of magnitude. By quantizing the parameters into low precision (8-bit and 4-bit), the memory consumption of the resulting model is further reduced. Compared with existing LLMs, our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretaining without accuracy drop. For finetuning, our methods achieve an average accuracy increase of 6.3% and 24.0% in general tasks and financial tasks, respectively, and GPU memory consumption ratio of 6.3X. The sizes of our models are smaller than 0.59 GB, allowing inference on a smartphone.  ( 3 min )
    ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
    arXiv:2402.13516v1 Announce Type: new Abstract: Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, it has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces an effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity without decreasing model performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along sine curves in multiple stages. This can enhance activation sparsity and alleviate performance degradation by avoiding radical shifts in activation distribution. With ProSparse, we obtain high sparsity of 89.32% and 88.80% for LLaMA2-7B and LLaMA2-13B, respectively, achieving comparable performance to their original Swish-activated versions. Our inference acceleration experiments further demonstrate the practical acceleration brought by higher activation sparsity.  ( 2 min )
    MatchNAS: Optimizing Edge AI in Sparse-Label Data Contexts via Automating Deep Neural Network Porting for Mobile Deployment
    arXiv:2402.13525v1 Announce Type: new Abstract: Recent years have seen the explosion of edge intelligence with powerful Deep Neural Networks (DNNs). One popular scheme is training DNNs on powerful cloud servers and subsequently porting them to mobile devices after being lightweight. Conventional approaches manually specialized DNNs for various edge platforms and retrain them with real-world data. However, as the number of platforms increases, these approaches become labour-intensive and computationally prohibitive. Additionally, real-world data tends to be sparse-label, further increasing the difficulty of lightweight models. In this paper, we propose MatchNAS, a novel scheme for porting DNNs to mobile devices. Specifically, we simultaneously optimise a large network family using both labelled and unlabelled data and then automatically search for tailored networks for different hardware platforms. MatchNAS acts as an intermediary that bridges the gap between cloud-based DNNs and edge-based DNNs.  ( 2 min )
    From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
    arXiv:2402.13512v1 Announce Type: new Abstract: Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model samples the output token according to a context-conditioned Markov chain (CCMC) which weights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation to the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.  ( 2 min )
    SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning
    arXiv:2402.13505v1 Announce Type: new Abstract: Recent advancements in semi-supervised learning have focused on a more realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data remains both unknown and potentially mismatched. Current approaches in this sphere often presuppose rigid assumptions regarding the class distribution of unlabeled data, thereby limiting the adaptability of models to only certain distribution ranges. In this study, we propose a novel approach, introducing a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data. Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization (EM) algorithm by explicitly decoupling the modeling of conditional and marginal class distributions. This separation facilitates a closed-form solution for class distribution estimation during the maximization phase, leading to the formulation of a Bayes classifier. The Bayes classifier, in turn, enhances the quality of pseudo-labels in the expectation phase. Remarkably, the SimPro framework not only comes with theoretical guarantees but also is straightforward to implement. Moreover, we introduce two novel class distributions broadening the scope of the evaluation. Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios. Our code is available at https://github.com/LeapLabTHU/SimPro.  ( 2 min )
    Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits
    arXiv:2402.13487v1 Announce Type: new Abstract: Adversarial attacks against stochastic multi-armed bandit (MAB) algorithms have been extensively studied in the literature. In this work, we focus on reward poisoning attacks and find most existing attacks can be easily detected by our proposed detection method based on the test of homogeneity, due to their aggressive nature in reward manipulations. This motivates us to study the notion of stealthy attack against stochastic MABs and investigate the resulting attackability. Our analysis shows that against two popularly employed MAB algorithms, UCB1 and $\epsilon$-greedy, the success of a stealthy attack depends on the environmental conditions and the realized reward of the arm pulled in the first round. We also analyze the situation for general MAB algorithms equipped with our attack detection method and find that it is possible to have a stealthy attack that almost always succeeds. This brings new insights into the security risks of MAB algorithms.  ( 2 min )
    HetTree: Heterogeneous Tree Graph Neural Network
    arXiv:2402.13496v1 Announce Type: new Abstract: The recent past has seen an increasing interest in Heterogeneous Graph Neural Networks (HGNNs) since many real-world graphs are heterogeneous in nature, from citation graphs to email graphs. However, existing methods ignore a tree hierarchy among metapaths, which is naturally constituted by different node types and relation types. In this paper, we present HetTree, a novel heterogeneous tree graph neural network that models both the graph structure and heterogeneous aspects in a scalable and effective manner. Specifically, HetTree builds a semantic tree data structure to capture the hierarchy among metapaths. Existing tree encoding techniques aggregate children nodes by weighting the contribution of children nodes based on similarity to the parent node. However, we find that this tree encoding fails to capture the entire parent-children hierarchy by only considering the parent node. Hence, HetTree uses a novel subtree attention mechanism to emphasize metapaths that are more helpful in encoding parent-children relationships. Moreover, instead of separating feature learning from label learning or treating features and labels equally by projecting them to the same latent space, HetTree proposes to match them carefully based on corresponding metapaths, which provides more accurate and richer information between node features and labels. Our evaluation of HetTree on a variety of real-world datasets demonstrates that it outperforms all existing baselines on open benchmarks and efficiently scales to large real-world graphs with millions of nodes and edges.  ( 2 min )
    ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
    arXiv:2402.13485v1 Announce Type: new Abstract: Recent advancements in generative large language models (LLMs) have significantly boosted the performance in natural language processing tasks. However, their efficiency is hampered by the inherent limitations in autoregressive token generation. While parallel decoding with token tree verification, e.g., Medusa, has been proposed to improve decoding parallelism and efficiency, it often struggles with maintaining contextual relationships due to its independent token prediction approach and incurs significant verification overhead, especially with large tree sizes and batch processing. In this paper, we propose ProPD, an efficient LLM parallel decoding framework based on dynamic token tree pruning and generation. ProPD features an advanced early pruning mechanism to efficiently eliminate unpromising token sequences to improve verification efficiency. Additionally, it introduces a dynamic token tree generation algorithm to balance the computation and parallelism of the verification phase in real-time and maximize the overall efficiency across different batch sizes, sequence lengths, and tasks, etc. We verify ProPD across a diverse set of datasets, LLMs, and batch sizes and demonstrate ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.  ( 2 min )
    STENCIL: Submodular Mutual Information Based Weak Supervision for Cold-Start Active Learning
    arXiv:2402.13468v1 Announce Type: new Abstract: As supervised fine-tuning of pre-trained models within NLP applications increases in popularity, larger corpora of annotated data are required, especially with increasing parameter counts in large language models. Active learning, which attempts to mine and annotate unlabeled instances to improve model performance maximally fast, is a common choice for reducing the annotation cost; however, most methods typically ignore class imbalance and either assume access to initial annotated data or require multiple rounds of active learning selection before improving rare classes. We present STENCIL, which utilizes a set of text exemplars and the recently proposed submodular mutual information to select a set of weakly labeled rare-class instances that are then strongly labeled by an annotator. We show that STENCIL improves overall accuracy by $10\%-24\%$ and rare-class F-1 score by $17\%-40\%$ on multiple text classification datasets over common active learning methods within the class-imbalanced cold-start setting.  ( 2 min )
    Theoretical Analysis of Submodular Information Measures for Targeted Data Subset Selection
    arXiv:2402.13454v1 Announce Type: new Abstract: With increasing volume of data being used across machine learning tasks, the capability to target specific subsets of data becomes more important. To aid in this capability, the recently proposed Submodular Mutual Information (SMI) has been effectively applied across numerous tasks in literature to perform targeted subset selection with the aid of a exemplar query set. However, all such works are deficient in providing theoretical guarantees for SMI in terms of its sensitivity to a subset's relevance and coverage of the targeted data. For the first time, we provide such guarantees by deriving similarity-based bounds on quantities related to relevance and coverage of the targeted data. With these bounds, we show that the SMI functions, which have empirically shown success in multiple applications, are theoretically sound in achieving good query relevance and query coverage.  ( 2 min )
    Learning to Poison Large Language Models During Instruction Tuning
    arXiv:2402.13459v1 Announce Type: new Abstract: The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning approach to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various LLMs and tasks, our strategy demonstrates a high success rate in compromising model outputs; poisoning only 1\% of 4,000 instruction tuning samples leads to a Performance Drop Rate (PDR) of around 80\%. Our work highlights the need for stronger defenses against data poisoning attack, offering insights into safeguarding LLMs against these more sophisticated attacks. The source code can be found on this GitHub repository: https://github.com/RookieZxy/GBTL/blob/main/README.md.  ( 2 min )
    PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models
    arXiv:2402.13441v1 Announce Type: new Abstract: Deep neural networks (DNNs) have proven to be effective models for accurate Memory Access Prediction (MAP), a critical task in mitigating memory latency through data prefetching. However, existing DNN-based MAP models suffer from the challenges such as significant physical storage space and poor inference latency, primarily due to their large number of parameters. These limitations render them impractical for deployment in real-world scenarios. In this paper, we propose PaCKD, a Pattern-Clustered Knowledge Distillation approach to compress MAP models while maintaining the prediction performance. The PaCKD approach encompasses three steps: clustering memory access sequences into distinct partitions involving similar patterns, training large pattern-specific teacher models for memory access prediction for each partition, and training a single lightweight student model by distilling the knowledge from the trained pattern-specific teachers. We evaluate our approach on LSTM, MLP-Mixer, and ResNet models, as they exhibit diverse structures and are widely used for image classification tasks in order to test their effectiveness in four widely used graph applications. Compared to the teacher models with 5.406M parameters and an F1-score of 0.4626, our student models achieve a 552$\times$ model size compression while maintaining an F1-score of 0.4538 (with a 1.92% performance drop). Our approach yields an 8.70% higher result compared to student models trained with standard knowledge distillation and an 8.88% higher result compared to student models trained without any form of knowledge distillation.  ( 3 min )
    LinkSAGE: Optimizing Job Matching Using Graph Neural Networks
    arXiv:2402.13430v1 Announce Type: new Abstract: We present LinkSAGE, an innovative framework that integrates Graph Neural Networks (GNNs) into large-scale personalized job matching systems, designed to address the complex dynamics of LinkedIns extensive professional network. Our approach capitalizes on a novel job marketplace graph, the largest and most intricate of its kind in industry, with billions of nodes and edges. This graph is not merely extensive but also richly detailed, encompassing member and job nodes along with key attributes, thus creating an expansive and interwoven network. A key innovation in LinkSAGE is its training and serving methodology, which effectively combines inductive graph learning on a heterogeneous, evolving graph with an encoder-decoder GNN model. This methodology decouples the training of the GNN model from that of existing Deep Neural Nets (DNN) models, eliminating the need for frequent GNN retraining while maintaining up-to-date graph signals in near realtime, allowing for the effective integration of GNN insights through transfer learning. The subsequent nearline inference system serves the GNN encoder within a real-world setting, significantly reducing online latency and obviating the need for costly real-time GNN infrastructure. Validated across multiple online A/B tests in diverse product scenarios, LinkSAGE demonstrates marked improvements in member engagement, relevance matching, and member retention, confirming its generalizability and practical impact.  ( 2 min )
    Context-Aware Quantitative Risk Assessment Machine Learning Model for Drivers Distraction
    arXiv:2402.13421v1 Announce Type: new Abstract: Risk mitigation techniques are critical to avoiding accidents associated with driving behaviour. We provide a novel Multi-Class Driver Distraction Risk Assessment (MDDRA) model that considers the vehicle, driver, and environmental data during a journey. MDDRA categorises the driver on a risk matrix as safe, careless, or dangerous. It offers flexibility in adjusting the parameters and weights to consider each event on a specific severity level. We collect real-world data using the Field Operation Test (TeleFOT), covering drivers using the same routes in the East Midlands, United Kingdom (UK). The results show that reducing road accidents caused by driver distraction is possible. We also study the correlation between distraction (driver, vehicle, and environment) and the classification severity based on a continuous distraction severity score. Furthermore, we apply machine learning techniques to classify and predict driver distraction according to severity levels to aid the transition of control from the driver to the vehicle (vehicle takeover) when a situation is deemed risky. The Ensemble Bagged Trees algorithm performed best, with an accuracy of 96.2%.  ( 2 min )
    Investigating the Histogram Loss in Regression
    arXiv:2402.13425v1 Announce Type: new Abstract: It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with performance gain and the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than learning a better representation. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.  ( 2 min )
    EvolMPNN: Predicting Mutational Effect on Homologous Proteins by Evolution Encoding
    arXiv:2402.13418v1 Announce Type: new Abstract: Predicting protein properties is paramount for biological and medical advancements. Current protein engineering mutates on a typical protein, called the wild-type, to construct a family of homologous proteins and study their properties. Yet, existing methods easily neglect subtle mutations, failing to capture the effect on the protein properties. To this end, we propose EvolMPNN, Evolution-aware Message Passing Neural Network, to learn evolution-aware protein embeddings. EvolMPNN samples sets of anchor proteins, computes evolutionary information by means of residues and employs a differentiable evolution-aware aggregation scheme over these sampled anchors. This way EvolMPNNcan capture the mutation effect on proteins with respect to the anchor proteins. Afterwards, the aggregated evolution-aware embeddings are integrated with sequence embeddings to generate final comprehensive protein embeddings. Our model shows up to 6.4% better than state-of-the-art methods and attains 36x inference speedup in comparison with large pre-trained models.  ( 2 min )
    Bayesian Neural Networks with Domain Knowledge Priors
    arXiv:2402.13410v1 Announce Type: new Abstract: Bayesian neural networks (BNNs) have recently gained popularity due to their ability to quantify model uncertainty. However, specifying a prior for BNNs that captures relevant domain knowledge is often extremely challenging. In this work, we propose a framework for integrating general forms of domain knowledge (i.e., any knowledge that can be represented by a loss function) into a BNN prior through variational inference, while enabling computationally efficient posterior inference and sampling. Specifically, our approach results in a prior over neural network weights that assigns high probability mass to models that better align with our domain knowledge, leading to posterior samples that also exhibit this behavior. We show that BNNs using our proposed domain knowledge priors outperform those with standard priors (e.g., isotropic Gaussian, Gaussian process), successfully incorporating diverse types of prior information such as fairness, physics rules, and healthcare knowledge and achieving better predictive performance. We also present techniques for transferring the learned priors across different model architectures, demonstrating their broad utility across various settings.  ( 2 min )
    Scaling physics-informed hard constraints with mixture-of-experts
    arXiv:2402.13412v1 Announce Type: new Abstract: Imposing known physical constraints, such as conservation laws, during neural network training introduces an inductive bias that can improve accuracy, reliability, convergence, and data efficiency for modeling physical dynamics. While such constraints can be softly imposed via loss function penalties, recent advancements in differentiable physics and optimization improve performance by incorporating PDE-constrained optimization as individual layers in neural networks. This enables a stricter adherence to physical constraints. However, imposing hard constraints significantly increases computational and memory costs, especially for complex dynamical systems. This is because it requires solving an optimization problem over a large number of points in a mesh, representing spatial and temporal discretizations, which greatly increases the complexity of the constraint. To address this challenge, we develop a scalable approach to enforce hard physical constraints using Mixture-of-Experts (MoE), which can be used with any neural network architecture. Our approach imposes the constraint over smaller decomposed domains, each of which is solved by an "expert" through differentiable optimization. During training, each expert independently performs a localized backpropagation step by leveraging the implicit function theorem; the independence of each expert allows for parallelization across multiple GPUs. Compared to standard differentiable optimization, our scalable approach achieves greater accuracy in the neural PDE solver setting for predicting the dynamics of challenging non-linear systems. We also improve training stability and require significantly less computation time during both training and inference stages.  ( 3 min )
    Fairness Risks for Group-conditionally Missing Demographics
    arXiv:2402.13393v1 Announce Type: new Abstract: Fairness-aware classification models have gained increasing attention in recent years as concerns grow on discrimination against some demographic groups. Most existing models require full knowledge of the sensitive features, which can be impractical due to privacy, legal issues, and an individual's fear of discrimination. The key challenge we will address is the group dependency of the unavailability, e.g., people of some age range may be more reluctant to reveal their age. Our solution augments general fairness risks with probabilistic imputations of the sensitive features, while jointly learning the group-conditionally missing probabilities in a variational auto-encoder. Our model is demonstrated effective on both image and tabular datasets, achieving an improved balance between accuracy and fairness.  ( 2 min )
    Referee-Meta-Learning for Fast Adaptation of Locational Fairness
    arXiv:2402.13379v1 Announce Type: new Abstract: When dealing with data from distinct locations, machine learning algorithms tend to demonstrate an implicit preference of some locations over the others, which constitutes biases that sabotage the spatial fairness of the algorithm. This unfairness can easily introduce biases in subsequent decision-making given broad adoptions of learning-based solutions in practice. However, locational biases in AI are largely understudied. To mitigate biases over locations, we propose a locational meta-referee (Meta-Ref) to oversee the few-shot meta-training and meta-testing of a deep neural network. Meta-Ref dynamically adjusts the learning rates for training samples of given locations to advocate a fair performance across locations, through an explicit consideration of locational biases and the characteristics of input data. We present a three-phase training framework to learn both a meta-learning-based predictor and an integrated Meta-Ref that governs the fairness of the model. Once trained with a distribution of spatial tasks, Meta-Ref is applied to samples from new spatial tasks (i.e., regions outside the training area) to promote fairness during the fine-tune step. We carried out experiments with two case studies on crop monitoring and transportation safety, which show Meta-Ref can improve locational fairness while keeping the overall prediction quality at a similar level.  ( 2 min )
    Transformer tricks: Precomputing the first layer
    arXiv:2402.13388v1 Announce Type: new Abstract: This short paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, and PaLM). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) is limited to 25%, while a 32-layer model (such as Mistral-7B) is limited to 3% savings.  ( 2 min )
    FIDLAR: Forecast-Informed Deep Learning Architecture for Flood Mitigation
    arXiv:2402.13371v1 Announce Type: new Abstract: In coastal river systems, frequent floods, often occurring during major storms or king tides, pose a severe threat to lives and property. However, these floods can be mitigated or even prevented by strategically releasing water before extreme weather events with hydraulic structures such as dams, gates, pumps, and reservoirs. A standard approach used by local water management agencies is the "rule-based" method, which specifies predetermined pre-releases of water based on historical and time-tested human experience, but which tends to result in excess or inadequate water release. The model predictive control (MPC), a physics-based model for prediction, is an alternative approach, albeit involving computationally intensive calculations. In this paper, we propose a Forecast Informed Deep Learning Architecture, FIDLAR, to achieve rapid and optimal flood management with precise water pre-releases. FIDLAR seamlessly integrates two neural network modules: one called the Flood Manager, which is responsible for generating water pre-release schedules, and another called the Flood Evaluator, which assesses these generated schedules. The Evaluator module is pre-trained separately, and its gradient-based feedback is used to train the Manager model, ensuring optimal water pre-releases. We have conducted experiments using FIDLAR with data from a flood-prone coastal area in South Florida, particularly susceptible to frequent storms. Results show that FIDLAR is several orders of magnitude faster than currently used physics-based approaches while outperforming baseline methods with improved water pre-release schedules. Our code is at https://github.com/JimengShi/FIDLAR/.  ( 2 min )
    The Uncanny Valley: A Comprehensive Analysis of Diffusion Models
    arXiv:2402.13369v1 Announce Type: new Abstract: Through Diffusion Models (DMs), we have made significant advances in generating high-quality images. Our exploration of these models delves deeply into their core operational principles by systematically investigating key aspects across various DM architectures: i) noise schedules, ii) samplers, and iii) guidance. Our comprehensive examination of these models sheds light on their hidden fundamental mechanisms, revealing the concealed foundational elements that are essential for their effectiveness. Our analyses emphasize the hidden key factors that determine model performance, offering insights that contribute to the advancement of DMs. Past findings show that the configuration of noise schedules, samplers, and guidance is vital to the quality of generated images; however, models reach a stable level of quality across different configurations at a remarkably similar point, revealing that the decisive factors for optimal performance predominantly reside in the diffusion process dynamics and the structural design of the model's network, rather than the specifics of configuration details. Our comparative analysis reveals that Denoising Diffusion Probabilistic Model (DDPM)-based diffusion dynamics consistently outperform the Noise Conditioned Score Network (NCSN)-based ones, not only when evaluated in their original forms but also when continuous through Stochastic Differential Equation (SDE)-based implementations.  ( 2 min )
    Statistical curriculum learning: An elimination algorithm achieving an oracle risk
    arXiv:2402.13366v1 Announce Type: new Abstract: We consider a statistical version of curriculum learning (CL) in a parametric prediction setting. The learner is required to estimate a target parameter vector, and can adaptively collect samples from either the target model, or other source models that are similar to the target model, but less noisy. We consider three types of learners, depending on the level of side-information they receive. The first two, referred to as strong/weak-oracle learners, receive high/low degrees of information about the models, and use these to learn. The third, a fully adaptive learner, estimates the target parameter vector without any prior information. In the single source case, we propose an elimination learning method, whose risk matches that of a strong-oracle learner. In the multiple source case, we advocate that the risk of the weak-oracle learner is a realistic benchmark for the risk of adaptive learners. We develop an adaptive multiple elimination-rounds CL algorithm, and characterize instance-dependent conditions for its risk to match that of the weak-oracle learner. We consider instance-dependent minimax lower bounds, and discuss the challenges associated with defining the class of instances for the bound. We derive two minimax lower bounds, and determine the conditions under which the performance weak-oracle learner is minimax optimal.  ( 2 min )
    Unsupervised Concept Discovery Mitigates Spurious Correlations
    arXiv:2402.13368v1 Announce Type: new Abstract: Models prone to spurious correlations in training data often produce brittle predictions and introduce unintended biases. Addressing this challenge typically involves methods relying on prior knowledge and group annotation to remove spurious correlations, which may not be readily available in many applications. In this paper, we establish a novel connection between unsupervised object-centric learning and mitigation of spurious correlations. Instead of directly inferring sub-groups with varying correlations with labels, our approach focuses on discovering concepts: discrete ideas that are shared across input samples. Leveraging existing object-centric representation learning, we introduce CoBalT: a concept balancing technique that effectively mitigates spurious correlations without requiring human labeling of subgroups. Evaluation across the Waterbirds, CelebA and ImageNet-9 benchmark datasets for subpopulation shifts demonstrate superior or competitive performance compared state-of-the-art baselines, without the need for group annotation.  ( 2 min )
    Double machine learning for causal hybrid modeling -- applications in the Earth sciences
    arXiv:2402.13332v1 Announce Type: new Abstract: Hybrid modeling integrates machine learning with scientific knowledge with the goal of enhancing interpretability, generalization, and adherence to natural laws. Nevertheless, equifinality and regularization biases pose challenges in hybrid modeling to achieve these purposes. This paper introduces a novel approach to estimating hybrid models via a causal inference framework, specifically employing Double Machine Learning (DML) to estimate causal effects. We showcase its use for the Earth sciences on two problems related to carbon dioxide fluxes. In the $Q_{10}$ model, we demonstrate that DML-based hybrid modeling is superior in estimating causal parameters over end-to-end deep neural network (DNN) approaches, proving efficiency, robustness to bias from regularization methods, and circumventing equifinality. Our approach, applied to carbon flux partitioning, exhibits flexibility in accommodating heterogeneous causal effects. The study emphasizes the necessity of explicitly defining causal graphs and relationships, advocating for this as a general best practice. We encourage the continued exploration of causality in hybrid models for more interpretable and trustworthy results in knowledge-guided machine learning.  ( 2 min )
    Incentivized Exploration via Filtered Posterior Sampling
    arXiv:2402.13338v1 Announce Type: new Abstract: We study "incentivized exploration" (IE) in social learning problems where the principal (a recommendation algorithm) can leverage information asymmetry to incentivize sequentially-arriving agents to take exploratory actions. We identify posterior sampling, an algorithmic approach that is well known in the multi-armed bandits literature, as a general-purpose solution for IE. In particular, we expand the existing scope of IE in several practically-relevant dimensions, from private agent types to informative recommendations to correlated Bayesian priors. We obtain a general analysis of posterior sampling in IE which allows us to subsume these extended settings as corollaries, while also recovering existing results as special cases.  ( 2 min )
    Harmful algal bloom forecasting. A comparison between stream and batch learning
    arXiv:2402.13304v1 Announce Type: new Abstract: Diarrhetic Shellfish Poisoning (DSP) is a global health threat arising from shellfish contaminated with toxins produced by dinoflagellates. The condition, with its widespread incidence, high morbidity rate, and persistent shellfish toxicity, poses risks to public health and the shellfish industry. High biomass of toxin-producing algae such as DSP are known as Harmful Algal Blooms (HABs). Monitoring and forecasting systems are crucial for mitigating HABs impact. Predicting harmful algal blooms involves a time-series-based problem with a strong historical seasonal component, however, recent anomalies due to changes in meteorological and oceanographic events have been observed. Stream Learning stands out as one of the most promising approaches for addressing time-series-based problems with concept drifts. However, its efficacy in predicting HABs remains unproven and needs to be tested in comparison with Batch Learning. Historical data availability is a critical point in developing predictive systems. In oceanography, the available data collection can have some constrains and limitations, which has led to exploring new tools to obtain more exhaustive time series. In this study, a machine learning workflow for predicting the number of cells of a toxic dinoflagellate, Dinophysis acuminata, was developed with several key advancements. Seven machine learning algorithms were compared within two learning paradigms. Notably, the output data from CROCO, the ocean hydrodynamic model, was employed as the primary dataset, palliating the limitation of time-continuous historical data. This study highlights the value of models interpretability, fair models comparison methodology, and the incorporation of Stream Learning models. The model DoME, with an average R2 of 0.77 in the 3-day-ahead prediction, emerged as the most effective and interpretable predictor, outperforming the other algorithms.  ( 3 min )
  • Open

    Positive Semidefinite Supermartingales and Randomized Matrix Concentration Inequalities
    arXiv:2401.15567v2 Announce Type: replace-cross Abstract: We present new concentration inequalities for either martingale dependent or exchangeable random symmetric matrices under a variety of tail conditions, encompassing now-standard Chernoff bounds to self-normalized heavy-tailed settings. These inequalities are often randomized in a way that renders them strictly tighter than existing deterministic results in the literature, are typically expressed in the Loewner order, and are sometimes valid at arbitrary data-dependent stopping times. Along the way, we explore the theory of positive semidefinite supermartingales and maximal inequalities, a natural matrix analog of scalar nonnegative supermartingales that is potentially of independent interest.  ( 2 min )
    A Conservative Approach for Few-Shot Transfer in Off-Dynamics Reinforcement Learning
    arXiv:2312.15474v2 Announce Type: replace-cross Abstract: Off-dynamics Reinforcement Learning (ODRL) seeks to transfer a policy from a source environment to a target environment characterized by distinct yet similar dynamics. In this context, traditional RL agents depend excessively on the dynamics of the source environment, resulting in the discovery of policies that excel in this environment but fail to provide reasonable performance in the target one. In the few-shot framework, a limited number of transitions from the target environment are introduced to facilitate a more effective transfer. Addressing this challenge, we propose an innovative approach inspired by recent advancements in Imitation Learning and conservative RL algorithms. The proposed method introduces a penalty to regulate the trajectories generated by the source-trained policy. We evaluate our method across various environments representing diverse off-dynamics conditions, where access to the target environment is extremely limited. These experiments include high-dimensional systems relevant to real-world applications. Across most tested scenarios, our proposed method demonstrates performance improvements compared to existing baselines.  ( 2 min )
    The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images
    arXiv:2401.08865v3 Announce Type: replace-cross Abstract: This paper investigates discrepancies in how neural networks learn from different imaging domains, which are commonly overlooked when adopting computer vision techniques from the domain of natural images to other specialized domains such as medical images. Recent works have found that the generalization error of a trained network typically increases with the intrinsic dimension ($d_{data}$) of its training set. Yet, the steepness of this relationship varies significantly between medical (radiological) and natural imaging domains, with no existing theoretical explanation. We address this gap in knowledge by establishing and empirically validating a generalization scaling law with respect to $d_{data}$, and propose that the substantial scaling discrepancy between the two considered domains may be at least partially attributed to the higher intrinsic ``label sharpness'' ($K_\mathcal{F}$) of medical imaging datasets, a metric which we propose. Next, we demonstrate an additional benefit of measuring the label sharpness of a training set: it is negatively correlated with the trained model's adversarial robustness, which notably leads to models for medical images having a substantially higher vulnerability to adversarial attack. Finally, we extend our $d_{data}$ formalism to the related metric of learned representation intrinsic dimension ($d_{repr}$), derive a generalization scaling law with respect to $d_{repr}$, and show that $d_{data}$ serves as an upper bound for $d_{repr}$. Our theoretical results are supported by thorough experiments with six models and eleven natural and medical imaging datasets over a range of training set sizes. Our findings offer insights into the influence of intrinsic dataset properties on generalization, representation learning, and robustness in deep neural networks. Code link: https://github.com/mazurowski-lab/intrinsic-properties  ( 3 min )
    Comparing Machine Learning Algorithms by Union-Free Generic Depth
    arXiv:2312.12839v3 Announce Type: replace-cross Abstract: We propose a framework for descriptively analyzing sets of partial orders based on the concept of depth functions. Despite intensive studies in linear and metric spaces, there is very little discussion on depth functions for non-standard data types such as partial orders. We introduce an adaptation of the well-known simplicial depth to the set of all partial orders, the union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a comparison of machine learning algorithms based on multidimensional performance measures. Concretely, we provide two examples of classifier comparisons on samples of standard benchmark data sets. Our results demonstrate promisingly the wide variety of different analysis approaches based on ufg methods. Furthermore, the examples outline that our approach differs substantially from existing benchmarking approaches, and thus adds a new perspective to the vivid debate on classifier comparison.  ( 2 min )
    Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
    arXiv:2312.02119v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an LLM to iteratively refine candidate (attack) prompts using tree-of-thought reasoning until one of the generated prompts jailbreaks the target. Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks. Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries. Interestingly, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.  ( 2 min )
    Private Networked Federated Learning for Nonsmooth Objectives
    arXiv:2306.14012v2 Announce Type: replace-cross Abstract: This paper develops a networked federated learning algorithm to solve nonsmooth objective functions. To guarantee the confidentiality of the participants with respect to each other and potential eavesdroppers, we use the zero-concentrated differential privacy notion (zCDP). Privacy is achieved by perturbing the outcome of the computation at each client with a variance-decreasing Gaussian noise. ZCDP allows for better accuracy than the conventional $(\epsilon, \delta)$-DP and stronger guarantees than the more recent R\'enyi-DP by assuming adversaries aggregate all the exchanged messages. The proposed algorithm relies on the distributed Alternating Direction Method of Multipliers (ADMM) and uses the approximation of the augmented Lagrangian to handle nonsmooth objective functions. The developed private networked federated learning algorithm has a competitive privacy accuracy trade-off and handles nonsmooth and non-strongly convex problems. We provide complete theoretical proof for the privacy guarantees and the algorithm's convergence to the exact solution. We also prove under additional assumptions that the algorithm converges in $O(1/n)$ ADMM iterations. Finally, we observe the performance of the algorithm in a series of numerical simulations.  ( 2 min )
    Comparing Comparators in Generalization Bounds
    arXiv:2310.10534v2 Announce Type: replace-cross Abstract: We derive generic information-theoretic and PAC-Bayesian generalization bounds involving an arbitrary convex comparator function, which measures the discrepancy between the training and population loss. The bounds hold under the assumption that the cumulant-generating function (CGF) of the comparator is upper-bounded by the corresponding CGF within a family of bounding distributions. We show that the tightest possible bound is obtained with the comparator being the convex conjugate of the CGF of the bounding distribution, also known as the Cram\'er function. This conclusion applies more broadly to generalization bounds with a similar structure. This confirms the near-optimality of known bounds for bounded and sub-Gaussian losses and leads to novel bounds under other bounding distributions.  ( 2 min )
    Double Robust Bayesian Inference on Average Treatment Effects
    arXiv:2211.16298v4 Announce Type: replace-cross Abstract: We propose a double robust Bayesian inference procedure on the average treatment effect (ATE) under unconfoundedness. Our robust Bayesian approach involves two important modifications: first, we adjust the prior distributions of the conditional mean function; second, we correct the posterior distribution of the resulting ATE. Both adjustments make use of pilot estimators motivated by the semiparametric influence function for ATE estimation. We prove asymptotic equivalence of our Bayesian procedure and efficient frequentist ATE estimators by establishing a new semiparametric Bernstein-von Mises theorem under double robustness; i.e., the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score and vice versa. Consequently, the resulting Bayesian credible sets form confidence intervals with asymptotically exact coverage probability. In simulations, our double robust Bayesian procedure leads to significant bias reduction of point estimation over conventional Bayesian methods and more accurate coverage of confidence intervals compared to existing frequentist methods. We illustrate our method in an application to the National Supported Work Demonstration.  ( 2 min )
    Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition
    arXiv:2211.07245v2 Announce Type: replace-cross Abstract: The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics.  ( 2 min )
    Hidden yet quantifiable: A lower bound for confounding strength using randomized trials
    arXiv:2312.03871v2 Announce Type: replace Abstract: In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.  ( 2 min )
    Stereographic Markov Chain Monte Carlo
    arXiv:2205.12112v2 Announce Type: replace-cross Abstract: High-dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves results in empirically observed ``stickiness'' and poor theoretical mixing properties -- lack of geometric ergodicity. In this paper, we introduce a new class of MCMC samplers that map the original high-dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis type algorithms as well as versions of the Bouncy Particle Sampler that are uniformly ergodic for a large class of light and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers can enjoy the ``blessings of dimensionality'' that the convergence is faster in higher dimensions.  ( 2 min )
    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
    arXiv:2310.16834v2 Announce Type: replace Abstract: Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by $25$-$75$\%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive mdoels, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around $6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with $32\times$ fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left to right prompting).  ( 2 min )
    Posterior Sampling Based on Gradient Flows of the MMD with Negative Distance Kernel
    arXiv:2310.03054v2 Announce Type: replace Abstract: We propose conditional flows of the maximum mean discrepancy (MMD) with the negative distance kernel for posterior sampling and conditional generative modeling. This MMD, which is also known as energy distance, has several advantageous properties like efficient computation via slicing and sorting. We approximate the joint distribution of the ground truth and the observations using discrete Wasserstein gradient flows and establish an error bound for the posterior distributions. Further, we prove that our particle flow is indeed a Wasserstein gradient flow of an appropriate functional. The power of our method is demonstrated by numerical examples including conditional image generation and inverse problems like superresolution, inpainting and computed tomography in low-dose and limited-angle settings.  ( 2 min )
    Hierarchical Neural Simulation-Based Inference Over Event Ensembles
    arXiv:2306.12584v2 Announce Type: replace Abstract: When analyzing real-world data it is common to work with event ensembles, which comprise sets of observations that collectively constrain the parameters of an underlying model of interest. Such models often have a hierarchical structure, where "local" parameters impact individual events and "global" parameters influence the entire dataset. We introduce practical approaches for frequentist and Bayesian dataset-wide probabilistic inference in cases where the likelihood is intractable, but simulations can be realized via a hierarchical forward model. We construct neural estimators for the likelihood(-ratio) or posterior and show that explicitly accounting for the model's hierarchical structure can lead to significantly tighter parameter constraints. We ground our discussion using case studies from the physical sciences, focusing on examples from particle physics and cosmology.  ( 2 min )
    Scaling Laws for Associative Memories
    arXiv:2310.02984v2 Announce Type: replace Abstract: Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models. We derive precise scaling laws with respect to sample size and parameter size, and discuss the statistical efficiency of different estimators, including optimization-based algorithms. We provide extensive numerical experiments to validate and interpret theoretical results, including fine-grained visualizations of the stored memory associations.  ( 2 min )
    A unified Bayesian framework for interval hypothesis testing in clinical trials
    arXiv:2402.13890v1 Announce Type: cross Abstract: The American Statistical Association (ASA) statement on statistical significance and P-values \cite{wasserstein2016asa} cautioned statisticians against making scientific decisions solely on the basis of traditional P-values. The statement delineated key issues with P-values, including a lack of transparency, an inability to quantify evidence in support of the null hypothesis, and an inability to measure the size of an effect or the importance of a result. In this article, we demonstrate that the interval null hypothesis framework (instead of the point null hypothesis framework), when used in tandem with Bayes factor-based tests, is instrumental in circumnavigating the key issues of P-values. Further, we note that specifying prior densities for Bayes factors is challenging and has been a reason for criticism of Bayesian hypothesis testing in existing literature. We address this by adapting Bayes factors directly based on common test statistics. We demonstrate, through numerical experiments and real data examples, that the proposed Bayesian interval hypothesis testing procedures can be calibrated to ensure frequentist error control while retaining their inherent interpretability. Finally, we illustrate the improved flexibility and applicability of the proposed methods by providing coherent frameworks for competitive landscape analysis and end-to-end Bayesian hypothesis tests in the context of reporting clinical trial outcomes.  ( 2 min )
    Convergence Acceleration of Markov Chain Monte Carlo-based Gradient Descent by Deep Unfolding
    arXiv:2402.13608v1 Announce Type: cross Abstract: This study proposes a trainable sampling-based solver for combinatorial optimization problems (COPs) using a deep-learning technique called deep unfolding. The proposed solver is based on the Ohzeki method that combines Markov-chain Monte-Carlo (MCMC) and gradient descent, and its step sizes are trained by minimizing a loss function. In the training process, we propose a sampling-based gradient estimation that substitutes auto-differentiation with a variance estimation, thereby circumventing the failure of back propagation due to the non-differentiability of MCMC. The numerical results for a few COPs demonstrated that the proposed solver significantly accelerated the convergence speed compared with the original Ohzeki method.  ( 2 min )
    A cutting plane algorithm for globally solving low dimensional k-means clustering problems
    arXiv:2402.13595v1 Announce Type: cross Abstract: Clustering is one of the most fundamental tools in data science and machine learning, and k-means clustering is one of the most common such methods. There is a variety of approximate algorithms for the k-means problem, but computing the globally optimal solution is in general NP-hard. In this paper we consider the k-means problem for instances with low dimensional data and formulate it as a structured concave assignment problem. This allows us to exploit the low dimensional structure and solve the problem to global optimality within reasonable time for large data sets with several clusters. The method builds on iteratively solving a small concave problem and a large linear programming problem. This gives a sequence of feasible solutions along with bounds which we show converges to zero optimality gap. The paper combines methods from global optimization theory to accelerate the procedure, and we provide numerical results on their performance.  ( 2 min )
    Non-asymptotic Convergence of Discrete-time Diffusion Models: New Approach and Improved Rate
    arXiv:2402.13901v1 Announce Type: cross Abstract: The denoising diffusion model emerges recently as a powerful generative technique that converts noise into data. Theoretical convergence guarantee has been mainly studied for continuous-time diffusion models, and has been obtained for discrete-time diffusion models only for distributions with bounded support in the literature. In this paper, we establish the convergence guarantee for substantially larger classes of distributions under discrete-time diffusion models and further improve the convergence rate for distributions with bounded support. In particular, we first establish the convergence rates for both smooth and general (possibly non-smooth) distributions having finite second moment. We then specialize our results to a number of interesting classes of distributions with explicit parameter dependencies, including distributions with Lipschitz scores, Gaussian mixture distributions, and distributions with bounded support. We further propose a novel accelerated sampler and show that it improves the convergence rates of the corresponding regular sampler by orders of magnitude with respect to all system parameters. For distributions with bounded support, our result improves the dimensional dependence of the previous convergence rate by orders of magnitude. Our study features a novel analysis technique that constructs tilting factor representation of the convergence error and exploits Tweedie's formula for handling Taylor expansion power terms.  ( 2 min )
    dotears: Scalable, consistent DAG estimation using observational and interventional data
    arXiv:2305.19215v2 Announce Type: replace Abstract: New biological assays like Perturb-seq link highly parallel CRISPR interventions to a high-dimensional transcriptomic readout, providing insight into gene regulatory networks. Causal gene regulatory networks can be represented by directed acyclic graph (DAGs), but learning DAGs from observational data is complicated by lack of identifiability and a combinatorial solution space. Score-based structure learning improves practical scalability of inferring DAGs. Previous score-based methods are sensitive to error variance structure; on the other hand, estimation of error variance is difficult without prior knowledge of structure. Accordingly, we present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework which leverages observational and interventional data to infer a single causal structure, assuming a linear Structural Equation Model (SEM). $\texttt{dotears}$ exploits structural consequences of hard interventions to give a marginal estimate of exogenous error structure, bypassing the circular estimation problem. We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions. $\texttt{dotears}$ outperforms other methods in varied simulations, and in real data infers edges that validate with higher precision and recall than state-of-the-art methods through differential expression tests and high-confidence protein-protein interactions.  ( 2 min )
    How Sparse Can We Prune A Deep Network: A Fundamental Limit Viewpoint
    arXiv:2306.05857v2 Announce Type: replace Abstract: Network pruning is an effective measure to alleviate the storage and computational burden of deep neural networks arising from its high overparameterization. Thus raises a fundamental question: How sparse can we prune a deep network without sacrifice on the performance? To address this problem, in this work we'll take a first principles approach, i.e. we directly impose the sparsity constraint on the original loss function and then characterize the necessary and sufficient condition of the sparsity (\textit{which turns out to nearly coincide}) by leveraging the notion of \textit{statistical dimension} in convex geometry. Through this fundamental limit, we're able to identify two key factors that determine the pruning ratio limit, i.e., weight magnitude and network flatness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller pruning ratio. In addition, we provide efficient countermeasures to address the challenges in computing the pruning limit, which involves accurate spectrum estimation of a large-scale and non-positive Hessian matrix. Moreover, through the lens of the pruning ratio threshold, we can provide rigorous interpretations on several heuristics in existing pruning algorithms. Extensive experiments are performed that demonstrate that the our theoretical pruning ratio threshold coincides very well with the experiments. All codes are available at: https://github.com/QiaozheZhang/Global-One-shot-Pruning  ( 2 min )
    Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression
    arXiv:2402.13622v1 Announce Type: new Abstract: We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable fixed rate $\alpha\!=\! n/d$. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha\!<\!1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.  ( 2 min )
    Probabilistic Neural Networks (PNNs) for Modeling Aleatoric Uncertainty in Scientific Machine Learning
    arXiv:2402.13945v1 Announce Type: new Abstract: This paper investigates the use of probabilistic neural networks (PNNs) to model aleatoric uncertainty, which refers to the inherent variability in the input-output relationships of a system, often characterized by unequal variance or heteroscedasticity. Unlike traditional neural networks that produce deterministic outputs, PNNs generate probability distributions for the target variable, allowing the determination of both predicted means and intervals in regression scenarios. Contributions of this paper include the development of a probabilistic distance metric to optimize PNN architecture, and the deployment of PNNs in controlled data sets as well as a practical material science case involving fiber-reinforced composites. The findings confirm that PNNs effectively model aleatoric uncertainty, proving to be more appropriate than the commonly employed Gaussian process regression for this purpose. Specifically, in a real-world scientific machine learning context, PNNs yield remarkably accurate output mean estimates with R-squared scores approaching 0.97, and their predicted intervals exhibit a high correlation coefficient of nearly 0.80, closely matching observed data intervals. Hence, this research contributes to the ongoing exploration of leveraging the sophisticated representational capacity of neural networks to delineate complex input-output relationships in scientific problems.  ( 2 min )
    Thresholded Oja does Sparse PCA?
    arXiv:2402.07240v2 Announce Type: cross Abstract: We consider the problem of Sparse Principal Component Analysis (PCA) when the ratio $d/n \rightarrow c > 0$. There has been a lot of work on optimal rates on sparse PCA in the offline setting, where all the data is available for multiple passes. In contrast, when the population eigenvector is $s$-sparse, streaming algorithms that have $O(d)$ storage and $O(nd)$ time complexity either typically require strong initialization conditions or have a suboptimal error. We show that a simple algorithm that thresholds and renormalizes the output of Oja's algorithm (the Oja vector) obtains a near-optimal error rate. This is very surprising because, without thresholding, the Oja vector has a large error. Our analysis centers around bounding the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices on a random initial vector. This is nontrivial and novel since previous analyses of Oja's algorithm and matrix products have been done when the trace of the population covariance matrix is bounded while in our setting, this quantity can be as large as $n$.  ( 2 min )
    The Dimension of Self-Directed Learning
    arXiv:2402.13400v1 Announce Type: new Abstract: Understanding the self-directed learning complexity has been an important problem that has captured the attention of the online learning theory community since the early 1990s. Within this framework, the learner is allowed to adaptively choose its next data point in making predictions unlike the setting in adversarial online learning. In this paper, we study the self-directed learning complexity in both the binary and multi-class settings, and we develop a dimension, namely $SDdim$, that exactly characterizes the self-directed learning mistake-bound for any concept class. The intuition behind $SDdim$ can be understood as a two-player game called the "labelling game". Armed with this two-player game, we calculate $SDdim$ on a whole host of examples with notable results on axis-aligned rectangles, VC dimension $1$ classes, and linear separators. We demonstrate several learnability gaps with a central focus on self-directed learning and offline sequence learning models that include either the best or worst ordering. Finally, we extend our analysis to the self-directed binary agnostic setting where we derive upper and lower bounds.  ( 2 min )
    Toward TransfORmers: Revolutionizing the Solution of Mixed Integer Programs with Transformers
    arXiv:2402.13380v1 Announce Type: cross Abstract: In this study, we introduce an innovative deep learning framework that employs a transformer model to address the challenges of mixed-integer programs, specifically focusing on the Capacitated Lot Sizing Problem (CLSP). Our approach, to our knowledge, is the first to utilize transformers to predict the binary variables of a mixed-integer programming (MIP) problem. Specifically, our approach harnesses the encoder decoder transformer's ability to process sequential data, making it well-suited for predicting binary variables indicating production setup decisions in each period of the CLSP. This problem is inherently dynamic, and we need to handle sequential decision making under constraints. We present an efficient algorithm in which CLSP solutions are learned through a transformer neural network. The proposed post-processed transformer algorithm surpasses the state-of-the-art solver, CPLEX and Long Short-Term Memory (LSTM) in solution time, optimal gap, and percent infeasibility over 240K benchmark CLSP instances tested. After the ML model is trained, conducting inference on the model, including post-processing, reduces the MIP into a linear program (LP). This transforms the ML-based algorithm, combined with an LP solver, into a polynomial-time approximation algorithm to solve a well-known NP-Hard problem, with almost perfect solution quality.  ( 2 min )
    A Large Dimensional Analysis of Multi-task Semi-Supervised Learning
    arXiv:2402.13646v1 Announce Type: new Abstract: This article conducts a large dimensional study of a simple yet quite versatile classification model, encompassing at once multi-task and semi-supervised learning, and taking into account uncertain labeling. Using tools from random matrix theory, we characterize the asymptotics of some key functionals, which allows us on the one hand to predict the performances of the algorithm, and on the other hand to reveal some counter-intuitive guidance on how to use it efficiently. The model, powerful enough to provide good performance guarantees, is also straightforward enough to provide strong insights into its behavior.  ( 2 min )
    Leveraging PAC-Bayes Theory and Gibbs Distributions for Generalization Bounds with Complexity Measures
    arXiv:2402.13285v1 Announce Type: new Abstract: In statistical learning theory, a generalization bound usually involves a complexity measure imposed by the considered theoretical framework. This limits the scope of such bounds, as other forms of capacity measures or regularizations are used in algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayes bounds to derive a general generalization bound instantiable with arbitrary complexity measures. One trick to prove such a result involves considering a commonly used family of distributions: the Gibbs distributions. Our bound stands in probability jointly over the hypothesis and the learning sample, which allows the complexity to be adapted to the generalization gap as it can be customized to fit both the hypothesis class and the task.  ( 2 min )
    Statistical curriculum learning: An elimination algorithm achieving an oracle risk
    arXiv:2402.13366v1 Announce Type: cross Abstract: We consider a statistical version of curriculum learning (CL) in a parametric prediction setting. The learner is required to estimate a target parameter vector, and can adaptively collect samples from either the target model, or other source models that are similar to the target model, but less noisy. We consider three types of learners, depending on the level of side-information they receive. The first two, referred to as strong/weak-oracle learners, receive high/low degrees of information about the models, and use these to learn. The third, a fully adaptive learner, estimates the target parameter vector without any prior information. In the single source case, we propose an elimination learning method, whose risk matches that of a strong-oracle learner. In the multiple source case, we advocate that the risk of the weak-oracle learner is a realistic benchmark for the risk of adaptive learners. We develop an adaptive multiple elimination-rounds CL algorithm, and characterize instance-dependent conditions for its risk to match that of the weak-oracle learner. We consider instance-dependent minimax lower bounds, and discuss the challenges associated with defining the class of instances for the bound. We derive two minimax lower bounds, and determine the conditions under which the performance weak-oracle learner is minimax optimal.  ( 2 min )
    Asymptotics of Learning with Deep Structured (Random) Features
    arXiv:2402.13999v1 Announce Type: new Abstract: For a large class of feature maps we provide a tight asymptotic characterisation of the test error associated with learning the readout layer, in the high-dimensional limit where the input dimension, hidden layer widths, and number of training samples are proportionally large. This characterization is formulated in terms of the population covariance of the features. Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks, namely deep non-linear fully-connected networks with random but structured weights, whose row-wise covariances are further allowed to depend on the weights of previous layers. For such networks we also derive a closed-form formula for the feature covariance in terms of the weight matrices. We further find that in some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.  ( 2 min )
    Average gradient outer product as a mechanism for deep neural collapse
    arXiv:2402.13728v1 Announce Type: cross Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a wide variety of settings, its emergence is only partially understood. In this work, we provide substantial evidence that DNC formation occurs primarily through deep feature learning with the average gradient outer product (AGOP). This takes a step further compared to efforts that explain neural collapse via feature-agnostic approaches, such as the unconstrained features model. We proceed by providing evidence that the right singular vectors and values of the weights are responsible for the majority of within-class variability collapse in DNNs. As shown in recent work, this singular structure is highly correlated with that of the AGOP. We then establish experimentally and theoretically that AGOP induces neural collapse in a randomly initialized neural network. In particular, we demonstrate that Deep Recursive Feature Machines, a method originally introduced as an abstraction for AGOP feature learning in convolutional neural networks, exhibits DNC.  ( 2 min )
    Do Efficient Transformers Really Save Computation?
    arXiv:2402.13934v1 Announce Type: cross Abstract: As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.  ( 2 min )
    Dealing with unbounded gradients in stochastic saddle-point optimization
    arXiv:2402.13903v1 Announce Type: cross Abstract: We study the performance of stochastic first-order methods for finding saddle points of convex-concave functions. A notorious challenge faced by such methods is that the gradients can grow arbitrarily large during optimization, which may result in instability and divergence. In this paper, we propose a simple and effective regularization technique that stabilizes the iterates and yields meaningful performance guarantees even if the domain and the gradient noise scales linearly with the size of the iterates (and is thus potentially unbounded). Besides providing a set of general results, we also apply our algorithm to a specific problem in reinforcement learning, where it leads to performance guarantees for finding near-optimal policies in an average-reward MDP without prior knowledge of the bias span.  ( 2 min )
    Accuracy-Preserving Calibration via Statistical Modeling on Probability Simplex
    arXiv:2402.13765v1 Announce Type: cross Abstract: Classification models based on deep neural networks (DNNs) must be calibrated to measure the reliability of predictions. Some recent calibration methods have employed a probabilistic model on the probability simplex. However, these calibration methods cannot preserve the accuracy of pre-trained models, even those with a high classification accuracy. We propose an accuracy-preserving calibration method using the Concrete distribution as the probabilistic model on the probability simplex. We theoretically prove that a DNN model trained on cross-entropy loss has optimality as the parameter of the Concrete distribution. We also propose an efficient method that synthetically generates samples for training probabilistic models on the probability simplex. We demonstrate that the proposed method can outperform previous methods in accuracy-preserving calibration tasks using benchmarks.  ( 2 min )
    Overcoming Saturation in Density Ratio Estimation by Iterated Regularization
    arXiv:2402.13891v1 Announce Type: cross Abstract: Estimating the ratio of two probability densities from finitely many samples, is a central task in machine learning and statistics. In this work, we show that a large class of kernel methods for density ratio estimation suffers from error saturation, which prevents algorithms from achieving fast error convergence rates on highly regular learning problems. To resolve saturation, we introduce iterated regularization in density ratio estimation to achieve fast error rates. Our methods outperform its non-iteratively regularized versions on benchmarks for density ratio estimation as well as on large-scale evaluations for importance-weighted ensembling of deep unsupervised domain adaptation models.  ( 2 min )
    The non-overlapping statistical approximation to overlapping group lasso
    arXiv:2211.09221v3 Announce Type: replace Abstract: Group lasso is a commonly used regularization method in statistical learning in which parameters are eliminated from the model according to predefined groups. However, when the groups overlap, optimizing the group lasso penalized objective can be time-consuming on large-scale problems because of the non-separability induced by the overlapping groups. This bottleneck has seriously limited the application of overlapping group lasso regularization in many modern problems, such as gene pathway selection and graphical model estimation. In this paper, we propose a separable penalty as an approximation of the overlapping group lasso penalty. Thanks to the separability, the computation of regularization based on our penalty is substantially faster than that of the overlapping group lasso, especially for large-scale and high-dimensional problems. We show that the penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, we show that the estimator based on the proposed separable penalty is statistically equivalent to the one based on the overlapping group lasso penalty with respect to their error bounds and the rate-optimal performance under the squared loss. We demonstrate the faster computational time and statistical equivalence of our method compared with the overlapping group lasso in simulation examples and a classification problem of cancer tumors based on gene expression and multiple gene pathways.  ( 2 min )
    A Method For Bounding Tail Probabilities
    arXiv:2402.13662v1 Announce Type: cross Abstract: We present a method for upper and lower bounding the right and the left tail probabilities of continuous random variables (RVs). For the right tail probability of RV $X$ with probability density function $f_X(x)$, this method requires first setting a continuous, positive, and strictly decreasing function $g_X(x)$ such that $-f_X(x)/g'_X(x)$ is a decreasing and increasing function, $\forall x>x_0$, which results in upper and lower bounds, respectively, given in the form $-f_X(x) g_X(x)/g'_X(x)$, $\forall x>x_0$, where $x_0$ is some point. Similarly, for the upper and lower bounds on the left tail probability of $X$, this method requires first setting a continuous, positive, and strictly increasing function $g_X(x)$ such that $f_X(x)/g'_X(x)$ is an increasing and decreasing function, $\forall x<x_0$, which results in upper and lower bounds, respectively, given in the form $f_X(x) g_X(x)/g'_X(x)$, $\forall x<x_0$. We provide some examples of good candidates for the function $g_X(x)$. We also establish connections between the new bounds and Markov's inequality and Chernoff's bound. In addition, we provide an iterative method for obtaining ever tighter lower and upper bounds, under certain conditions. Finally, we provide numerical examples, where we show the tightness of these bounds, for some chosen $g_X(x)$.  ( 2 min )
    Bayesian Neural Networks with Domain Knowledge Priors
    arXiv:2402.13410v1 Announce Type: cross Abstract: Bayesian neural networks (BNNs) have recently gained popularity due to their ability to quantify model uncertainty. However, specifying a prior for BNNs that captures relevant domain knowledge is often extremely challenging. In this work, we propose a framework for integrating general forms of domain knowledge (i.e., any knowledge that can be represented by a loss function) into a BNN prior through variational inference, while enabling computationally efficient posterior inference and sampling. Specifically, our approach results in a prior over neural network weights that assigns high probability mass to models that better align with our domain knowledge, leading to posterior samples that also exhibit this behavior. We show that BNNs using our proposed domain knowledge priors outperform those with standard priors (e.g., isotropic Gaussian, Gaussian process), successfully incorporating diverse types of prior information such as fairness, physics rules, and healthcare knowledge and achieving better predictive performance. We also present techniques for transferring the learned priors across different model architectures, demonstrating their broad utility across various settings.  ( 2 min )
    Investigating the Histogram Loss in Regression
    arXiv:2402.13425v1 Announce Type: cross Abstract: It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with performance gain and the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than learning a better representation. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.  ( 2 min )
    Neural Control System for Continuous Glucose Monitoring and Maintenance
    arXiv:2402.13852v1 Announce Type: cross Abstract: Precise glucose level management is pivotal for individuals with diabetes, averting severe complications. In this work, we introduce a novel neural control system for continuous glucose monitoring and maintenance, utilizing differential predictive control. Our system, guided by a sophisticated neural policy and differentiable modeling, dynamically adjusts insulin delivery in real-time, enhancing glucose optimization. This end-to-end approach maximizes efficiency, ensuring personalized care and improved health outcomes, as affirmed by empirical findings.  ( 2 min )

  • Open

    [N] Mistral is a strong wind in France, and in french Groq means big...
    Is it all a fart joke? submitted by /u/matovitch [link] [comments]
    [Discussion] Methodology for repairing doubled audio with AI?
    I have this recording where I had two mics recordings simultaneously on accident. One mic is about half a second behind the other, so it creates a kind of echo that's impossible to remove with traditional filtering techniques in the audio programs I know. For example: https://drive.google.com/file/d/10Spnm1Q_S2dkAOg9IKk9DvWRD_eIsF8Y/view?usp=sharing Anything in the AI world that could conceivably fix this audio? (I'm an avid user of StableDiffusion but know zip about the audio side of things!) ​ submitted by /u/mccoypauley [link] [comments]
    Loading Video In Training Loops [Research]
    Does anyone have a fast way of decoding long videos in training loops. I am currently using NVidia Dali but even that is quite slow and leaves the GPU idle for some amount of time. I found all the CPU dataloaders quite slow. Hoping someone has experience here! submitted by /u/GullibleWerewolf6947 [link] [comments]
    [D] Ideal VPS requirements for SentenceTransformers
    Currently, I'm using "intfloat/multilingual-e5-small" on my local machine for embedding my documents, but it's incredibly slow! What are some cheap, not too over the top, VPS options I can use when going to production? A VPS with GPU is not an option due to the high costs, and my volume is not high enough to make this worthwhile. submitted by /u/Scooby072 [link] [comments]
    Master thesis mistake [D]
    Hey everyone, I’m writing my masters thesis in computer science, my core topic is automatic feature engineering specifically using a genetic algorithm. I just realized I had a data leakage and once resolved, the results are significantly lower than before. I’m nearing the end of the thesis and have a last meeting with my prof next week. My data is from the company I work with and it can’t be disclosed so my work can’t be reproduced really, but I’m really stressing whether I should tell the prof about this issue or move forward with it.. it’s eating me up and I don’t want to be dishonest but I’m not sure how to approach it, any one can help with some ideas? submitted by /u/Adventurous_Car_1809 [link] [comments]
    Database of Healthcare 837 and 835 files? [R]
    Does anyone know of a public database of healthcare 837 and 835 files? submitted by /u/OtherwiseGroup3162 [link] [comments]
    [D] Image Classification Internship Help
    So, I am doing an internship, and I was tasked with making an ML model that should recognise whether the image input is a passport or an ID. The model is pretty much done, but I am facing 2 problems that are interrelated: - Lack of images - Incorrect predictions Moreover, I was also tasked with making it a usable service, which my supervisor will be able to use when he just inputs the path of an image. I have 0 idea how to go about and do that. submitted by /u/T_Sayyed [link] [comments]
    [D] MetaGPT grossly misreported baseline numbers and got an ICLR Oral!
    OpenReview: https://openreview.net/forum?id=VtmBAGCN7o I was looking at ICLR reviews and was surprised to see MetaGPT being submitted to ICLR. The acceptance decision states that they were awarded an Oral (highest level at ICLR). Looking at the paper, they report these comparisons with HumanEval: ​ Method Pass@1 MetaGPT 85.9 GPT-4 67.0 GPT-3.5-Turbo (in the response) 48.1 However the real GPT-4 and GPT-3.5-Turbo numbers on this benchmark are much much higher (see EvalPlus leaderboard: https://evalplus.github.io/leaderboard.html). The results from the EvalPlus leaderboard have been reproduced numerous times, so there is no doubt about those. The numbers the MetaGPT authors used were pulled from the old technical report, and are not accurate anymore. They must know this, everyone does, there is no doubt about it. Here are the real comparisons using the numbers from EvalPlus: Method Pass@1 MetaGPT 85.9 GPT-4 88.4 GPT-3.5-Turbo 76.8 The GPT-3.5-Turbo performance is GROSSLY missreported. Never seen anything like this before. There is no way they legitimately got that number with GPT-3.5-Turbo. So, basically, their whole "agent company simulation" deal that makes you spend $10 in OpenAI credits is worse than just asking the LLM once... And they got an oral... We are screwed. submitted by /u/Signal-Aardvark-4179 [link] [comments]
    [D] Training Gaussian Processes on Multiple Time-Series Samples with Variability
    Hello r/machinelearning community, I'm embarking on a project that involves modeling the temperature profile of a chemical reactor using Gaussian Processes (GPs), but I've encountered a challenge that seems to be a bit of a hurdle with conventional GP approaches. The Challenge: My main issue arises from the necessity to train a GP on multiple samples of time-series data. Typically, GPs are tailored for a single instance of time-series data, interpreting variability among samples as either noise or a sign of low function smoothness. However, my project requires the GP to recognize and model variability across multiple time-series samples, rather than dismissing this variability as noise. The Goal: The objective is to develop a GP model that can accurately reflect the dynamics of temperat…
    [D] Animal Behavior Recognition Using Machine Learning
    I hope you will find this post well. The article from OpenCV.ai reviews key AI methods in animal behavior recognition and animal pose detection, showing their application in fields from neurobiology to veterinary medicine. Also, it highlights the significance of recent scientific advancements in understanding and managing animal behavior. In this article you will know: Understanding Behavior Recognition in Deep Learning Deep Learning Animal Pose Recognition Methods Animal Behavior Recognition Techniques Examples of Deep Learning-based Applications The full article is here. submitted by /u/No-Independence5880 [link] [comments]
    [D] Why are there so few research divisions of top companies in Spain?
    I have been reviewing research vacancies in several leading companies, and I have been surprised that despite the fact that many of them have divisions in Spain, hardly any research is done. Normally, the most research-focused positions seem to be in France, the UK, Holland, Germany or Switzerland, apart from the US. Is it that there has not been time to create divisions of this type in Spain or is there some underlying reason? I don't think it's because of a lack of talent, as I know many talented people who have had to emigrate to find work in these companies. submitted by /u/SrPinko [link] [comments]
    Does anyone know any video quantization techniques that produce time-consistent output? [R]
    I've been recently working on time-consistent latent feats from videos. And the task is to turn them into id sequences to put them into a transformer. However, I struggled to get time-consistent generation using VQ-VAE or VQ-GAN. Any suggestions on how to achieve time consistency when using VQ techniques for videos/video latents? submitted by /u/Weird_Register3689 [link] [comments]
    [D] Why are Byte Pair Encoding tokenizers preferred over character level ones in LLMs?
    I understand that byte pair will give you a larger vocabulary but shorter token sequences, while something more fine-grained like character level tokenizers will have a small vocabulary but much longer output token sequences. What I don’t understand is why this is preferred for most LLM models out there. For example, GPT and Llama both use Byte Pair Encoding. Does it have something to do with limitations on block size of these models? submitted by /u/putinwhat [link] [comments]
    [D] Are there any cloud GPUs billed on compute/utilization instead of uptime?
    I'm launching an photo processing app that uses GPUs for inference on a large-ish image processing model. Right now, there are only a few users, so my GPU instance is idle 99.99% of the time–but I'm paying $400+ per month for the GPU to be idle. I could spin up the GPU instance when a user triggers an inference job, but this would mean the user waits for several minutes for the instance to come up. Can't afford that latency. I've searched for cloud GPU solutions that bill based on utilization, but I haven't found anything. This is, of course, very common for cloud CPUs (multi-tenancy, virtualized instances) where you only pay for the compute you use; but I guess GPUs don't really have the same capability? Has anybody seen otherwise? What are my other options, aside from paying $400+/month for an incredibly under-utilized GPU? submitted by /u/uberdev [link] [comments]
    [News] Stable Diffusion 3
    Seems to be waitlisted. https://stability.ai/news/stable-diffusion-3 "Stable Diffusion 3 combines a diffusion transformer architecture and flow matching. " submitted by /u/SunnyJapan [link] [comments]
    RAG vs Long Context Models [Discussion]
    Hello everyone, I know that we all have seen the Gemini v1.5 model with 1 million context and also the hardware from company called groq showed that if the hardware is designed specifically for Language models in mind they can get much better. What do you think about RAG architectures now as we have seen very long context model. What if we have much more long context models with better quantization techniques and hardware?? Do you think architecture like RAGs and usage of vector DBs to store the knowledge base and retrieve on the fly would still be relevant?? Please correct me and add more relevant information accordingly. If relevant research and observations are posted, that is much appreciated!! submitted by /u/WritingBeginning3403 [link] [comments]
    [D] How often is results (or data) manipulated in research papers?
    I recently came across the Harvard Professor's results manipulation case and found it amusing that how can people fake results on such a higher level. Is this something what happen in ML Research as well? I've never been a first author of a ML paper so I don't know how things work at reputed conferences but I don't think someone cross checks the results, but only raise issue if they find it suspicious. Sometimes I have found results not matching what authors have published vs while recreating it with the code they provided, but I have always blamed it to my implementation and/or environment. submitted by /u/ade17_in [link] [comments]
    [R] What is a good resource to be recommended hot papers pertaining to LLMs, AI and ML?
    I'm looking for something like a weekly newsletter with 5 hot papers or something. Do any of you know where/to what I can subscribe to or have any recommendations? submitted by /u/PowerLock2 [link] [comments]
    [Discussion] Deep Learning Approaches to Document Classification
    Hi everyone, I plan to participate in a shared task about classifying news articles. This made me think about the existing approaches to classifying documents, and I quickly realized there are many! Next is a list of short descriptions of all the approaches I encountered during my studies / working with NLP projects. Which ones do you think / know are the most promising? Do you have other approaches in mind? Looking forward to your comments! For context, the documents I am considering are about 500 - 1000 tokens, so they should fit into most recent transformer-based architectures with minimal (if any) truncation. The number of classes is quite low: N=4. When I say encoder model, I refer to encoder-only architectures, e.g., BERT. For the task at hand, I only have access to ~1000 samples (…
    [D] Why do researchers so rarely release training code?
    I'm looking at 3 different papers right now for various MoE models. All 3 release the model weights and inference code, but none of them release training code. Why is this so common and accepted, when we expect most papers now to have code along with their implementations? submitted by /u/hazard02 [link] [comments]
    [D] Measuring complexity of datasets
    I'm starting a project in which we aim to demonstrate that many datasets in my field are much simpler than more colloquial ML datasets (UCI, image, etc.). Our lab has several results showing that some algorithms work extremely well in our field and not so great outside, while the converse is also true--some algorithms that work very well in more general ML datasets don't do too well or are computationally more expensive for the same results as methods in my field. As a first step, we've shown that the smoothness of the loss functions with many datasets are much lower than standard ML datasets. I want to also explore other ways to provide a more thorough view. As such, I wonder if the following mathematical tools would be jointly useful to interpret how complex a dataset is: Expected smoothness of the loss Expected Lipschitz constant of the loss Results from the Mapper algorithm (in TDA) or Betti curves Strong convexity of the loss (strong convexity can be shown to be equivalent to the sharpness of the loss, and flat minima have been shown to generalize better) Do these make sense? I understand that individually they only show part of the picture, but together they should provide a more rounded view of the complexity of the data. submitted by /u/FlyingQuokka [link] [comments]
  • Open

    Google employees are posting internal memes poking fun at how many AI models and names the company launched
    submitted by /u/thisisinsider [link] [comments]
    Continuation thread: Will animation studios like Pixar Animation Studios, Walt Disney Animation Studios, DreamWorks Animation, Illumination, Sony Pictures Animation, and so on completely cease to exist very soon due to OpenAI's Sora?
    So remember this thread of mine?: https://old.reddit.com/r/artificial/comments/1aws29z/will_hollywood_completely_cease_to_exist_very/ Well, when it comes to animated films, there were these articles about them and Sora: OpenAI’s Sora Creates Minute-Long Photorealistic Animation From Text Prompts Around this time a year ago, Stable Diffusion launched an early text-to-animation tool and the internet was obsessing over an AI-generated Seinfeld knockoff. This week, OpenAI has demonstrated how far animation-generating technology has come over the past 12 months with the unveiling of its new model Sora, which can create photorealistic and cg-style animation of up to one minute based on text prompts. What is Sora? Sora is a generative AI model that uses text and image prompts to create vid…
    Convert 2D photos to a Spatial Photo for the Apple Vision Pro
    I recently used the Depth Anything model to generate a "Spatial" version of a 2D photo. I also wrote a blog post about how I did this. It also has a link to a Github repo with all the code so you can run it on your own machine. https://blog.studiolanes.com/posts/2d-to-spatial-photos ✨ Fun fact: the original photo shown in the below demo was taken on a Contax T2, a film camera that was released on 1991. 33 years difference between its release date and the Vision Pro :O How a converted Spatial Photo looks like on the Vision Pro submitted by /u/mkchoi212 [link] [comments]
    Who are the best critics of AI, and of Deep Learning, in particular?
    Humans tend to suffer from the confirmation bias, which is why I think we need to make a conscious effort to listen to well-informed contrarians. On the AI/DL front: Gary Marcus is in the spotlight, giving TV interviews and testifying in the Senate. He's a psychologist, by background. And his long-standing critique of DL is largely related to its reliability. Judea Pearl, a Turing award recipient, thinks that causal/counterfactual reasoning is needed, and is largely missing from the current approaches. Stuart J. Russell co-authored a popular textbook on AI. He thinks DL is limited by its representational power. Erik J. Larson authored "The Myth of AI". He's a computer scientist, who worked on AI applications. He argues that the current approaches work inductively and deductively, while humans think mostly abductively. I'm wondering who else I might be overlooking? submitted by /u/we_are_mammals [link] [comments]
    Any free GPT-4 unliimted programs that isn't Chat? instead a Text generator?
    i write submitted by /u/Exciting_Boot_6157 [link] [comments]
    2030 and AI
    In the last month or so, haven't you felt a strange sense of excitement about the developments in the field of artificial intelligence? I am really curious and impatient about this. By 2030, do you believe that worldwide, medical, biological, biochemistry and nanotechnology studies on extending human life and reversing old age will be widespread? I really strongly support this. submitted by /u/BilgeYamtar [link] [comments]
    Animal Behavior Recognition Using Machine Learning
    I hope you will find this post well. The article from OpenCV.ai reviews key AI methods in animal behavior recognition and animal pose detection, showing their application in fields from neurobiology to veterinary medicine. Also, it highlights the significance of recent scientific advancements in understanding and managing animal behavior. In this article you will know: Understanding Behavior Recognition in Deep Learning Deep Learning Animal Pose Recognition Methods Animal Behavior Recognition Techniques Examples of Deep Learning-based Applications The full article is here. submitted by /u/No-Independence5880 [link] [comments]
    Reddit Inks $60 Million-a-Year Deal To Train Google AI Ahead of Expected IPO | Report
    submitted by /u/jaketocake [link] [comments]
    New study finds that AI significantly improves colonoscopy efficacy, with 40% more polyps detected
    submitted by /u/fotogneric [link] [comments]
    Best Artificial Intelligence courses for Healthcare You should learn 2024 -
    submitted by /u/Sreeravan [link] [comments]
    Question about testing an AI
    I have a research question related to chatGPT or other LLMs. My question is, can chatGPT answer this set of standardized exam questions? I know, not very interesting. My question to you though is if we know that such an AI answers each question differently every time, sometimes wrong and some times right, how many times should I ask the AI to answer EACH question to conclude whether it can answer a given question or not? (e.g., averaging correct rate for each question)? Are there standard guidelines in AI research for that? submitted by /u/Substantial-Ad2200 [link] [comments]
    Is narrow AI the end of public forums as we know them?
    Narrow AI needs data to scout for patterns to then regurgitate out for the next person that comes along and asks it a question. AI developers have gathered as many free, legal or otherwise, texts as they could, they've also siphoned off tons of data from public forums such as reddit and Twitter for God knows for how long until they started to close the doors. This formed a good base for historical knowledge until ~2022 give or take. Now let's fast forward a bit. Public forums and social networks that house information have closed their doors to prevent losing traffic. Take reddit as an example. It's a known meme for those seeking tech answers they should just append "reddit" to their search and they'll probably find better and more accurate data than elsewhere. Then you have places lik…
    Need guidance in the ocean of AI
    I want to build an AI language conversation tutor for a specific language. There already exists many AI language sites that do this but I'm looking to build something with features specific to one language. Any idea where I can start? Is there an open source app that can be built upon? A subreddit for these kinds of things? Any help is greatly appreciated! submitted by /u/evil_penguin_ouch [link] [comments]
    One-Minute Daily AI News 2/21/2024
    OpenAI has an official TikTok account now. New Sora videos attracted 128k followers.[1] Adobe launches AI assistant that can search and summarize PDFs.[2] Google AI Introduces ScreenAI: A Vision-Language Model for User interfaces (UI) and Infographics Understanding.[3] Will Smith parodies viral AI-generated video by actually eating spaghetti.[4] Sources: [1] https://www.tiktok.com/@openai [2] https://www.cnbc.com/2024/02/20/adobe-launches-ai-assistant-that-can-search-and-summarize-pdfs.html [3] https://www.marktechpost.com/2024/02/20/google-ai-introduces-screenai-a-vision-language-model-for-user-interfaces-ui-and-infographics-understanding/ [4] https://arstechnica.com/information-technology/2024/02/will-smith-parodies-viral-ai-generated-video-by-actually-eating-spaghetti/ submitted by /u/Excellent-Target-847 [link] [comments]
    picture enhancing AI?
    hello does anyone know of an AI that can crisp up a picture for me? my dog died today and i snapped a blurry picture of him a couple hours before his sudden death. i initially deleted it because it wasnt a great picture, but after he died i restored it because it became the last picture ever taken of him. i would like it to look a little nicer lol submitted by /u/neonifiednyan [link] [comments]
    Does anyone know a good community supported ai voice generator?
    Looking for something with lots of community generated fictional voices, sort of like uberduck.ai was before they took all the community content off. submitted by /u/ProBoyGaming521 [link] [comments]
    Will Hollywood completely cease to exist very soon due to OpenAI's Sora?
    Apparently, some are even describing Sora as an all-powerful AI that can create videos from text in few seconds, which will cause Hollywood to cease to exist entirely as regular people can create films on their own in less than a minute: Sora by OpenAI just destroyed Hollywood https://www.youtube.com/watch?v=fyn3m-qhpjE What Sora AI means for Hollywood In december, I said Hollywood is in trouble. We’d soon be watching Oscar-winning films produced by a machine. This future is now on our doorstep. Sam Altman’s company OpenAI just released Sora AI. Sora is a text-to-video AI tool that can create hyper-realistic videos from a few lines of text instructions. Check out this example from OpenAI’s website: A stylish woman walks down a Tokyo street filled with warm, glowing, neon a…
  • Open

    VideoPrism: A foundational visual encoder for video understanding
    Posted by Long Zhao, Senior Research Scientist, and Ting Liu, Senior Staff Software Engineer, Google Research An astounding number of videos are available on the Web, covering a variety of content from everyday moments people share to historical moments to scientific observations, each of which contains a unique record of the world. The right tools could help researchers analyze these videos, transforming how we understand the world around us. Videos offer dynamic visual content far more rich than static images, capturing movement, changes, and dynamic relationships between entities. Analyzing this complexity, along with the immense diversity of publicly available video data, demands models that go beyond traditional image understanding. Consequently, many of the approaches that be…  ( 93 min )
  • Open

    What's the difference between the value of an action and the reward ?
    Following Sutton and Barto, and they defined the value of an action 'a' as the mathematical expectation of the reward given that 'a' is selected, but intuitively are not they are the same ? If I am trying to estimate the value of an action then I am trying to estimate the reward of that action, is there something I am missing here ? submitted by /u/al3arabcoreleone [link] [comments]
    Best Books to Learn Reinforcement Learning in 2024 -
    submitted by /u/Sreeravan [link] [comments]
    Value network diverges when used asynchronously in A3C implementation
    I use the CartPole-v1 environment. I build a value network that I use for both VPG and A3C. In VPG it works as expected.In A3C I create a work method where I have 8 workers working asynchronously with the help of pytorch's/python's multiprocessing library. workers = [mp.Process(target=self.work, args=(rank,)) for rank in range(self.n_workers)] [w.start() for w in workers]; [w.join() for w in workers] I pass the value network in the class as a function self.value_model_fn = value_model_fn and on the work method I create local value networks to work with local_value_model = self.value_model_fn(nS) where nS is the state size. I do the exact same thing for a policy network but the policy network works as expected both in VPG and A3C. I also load a shared policy and value model that I initialize in the train function like this: local_policy_model = self.policy_model_fn(nS, nA) local_policy_model.load_state_dict(self.shared_policy_model.state_dict()) local_value_model = self.value_model_fn(nS) local_value_model.load_state_dict(self.shared_value_model.state_dict()) When I run local_value_model(state) the code gets in a never ending loop. I tried running the value network outside the work method and it works fine. I tried running it without loading the shared_value_network and I also tried running it before the local_policy_network but it just doesn't work when I run it inside the work method. What it should do is return a value in this form: tensor([[0.0214]], grad_fn= but instead it loops forever. I don't really know what other direction to take, I don't think there'a a problem in the network implementation since it works fine outside the work method AND the VPG implemetation so my guess is there's something about the multiprocessing that makes it diverge. submitted by /u/yuyututuyutu [link] [comments]
    RL and SL in autonomous driving
    I made this based on my observation and knowledge. this is a high level overview of application RL and SL algorithms in autonomous driving mostly for decision making. next I will try cover keys and terminalogies used in RL and SL. If there is a mistake pls let me know. submitted by /u/Mysterious-Care1358 [link] [comments]
  • Open

    Animal Behavior Recognition Using Machine Learning
    I hope you will find this post well. The article from OpenCV.ai reviews key AI methods in animal behavior recognition and animal pose detection, showing their application in fields from neurobiology to veterinary medicine. Also, it highlights the significance of recent scientific advancements in understanding and managing animal behavior. In this article you will know: Understanding Behavior Recognition in Deep Learning Deep Learning Animal Pose Recognition Methods Animal Behavior Recognition Techniques Examples of Deep Learning-based Applications The full article is here. submitted by /u/No-Independence5880 [link] [comments]
    Turning a doodle into multiple views
    Hi, is there any way I can turn a doodle into multiple views of the same object? (pictures) Much like the https://github.com/alexjc/neural-doodle but generating multiple views of some object I sketched. Thanks! submitted by /u/Specialist_Ice_5715 [link] [comments]
    Building neural network - Python or Java ?
    Hello, ​ I want to build a neural network from scratch, no tensorflow no nothing, to get a deep understanding of neural networks. ​ The question is : in which language am I gonna do it ? My project is gonna be made on Minecraft, which is in Java only, but I feel like it's easier to make a NN in python. But i don't know if i will be able to access all the data with python. ​ What do you think ? submitted by /u/SpellGlittering1901 [link] [comments]
    Neural networks made of light: Research team develops AI system in optical fibers
    submitted by /u/keghn [link] [comments]
  • Open

    Natural one-liners
    I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices? Some of these are widely repeated. For example, commands have terse names because you may have to transmit commands over a glacial […] Natural one-liners first appeared on John D. Cook.  ( 7 min )
    When is less data less private?
    If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right? This seems very simple, and it mostly is, but there are a couple subtleties. A common measure in data privacy is k-anonymity. The idea is […] When is less data less private? first appeared on John D. Cook.  ( 5 min )
  • Open

    Add It to the Toolkit: February Studio Driver and NVIDIA App Beta Now Available
    The February NVIDIA Studio Driver, designed specifically to optimize creative apps, is now available for download.  ( 7 min )
    Into the Omniverse: Rhino 3D Launches OpenUSD Features to Enhance 3D Modeling and Development
    Editor’s note: This post is part of Into the Omniverse, a series focused on how artists, developers and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse. The combination of powerful 3D tools and groundbreaking technologies can transform the way designers bring their visions to life — and Universal Scene Read Article  ( 7 min )
    Time to Play: GeForce NOW Now Offers 1,800 Games to Stream
    Top-tier games from publishing partners Bandai Namco Entertainment and Inflexion Games are joining GeForce NOW this week as the cloud streaming service’s fourth-anniversary celebrations continue. Eleven new titles join the over 1,800 supported games in the GeForce NOW library, including Nightingale from Inflexion Games and Bandai Namco Entertainment’s Tales of Arise, Katamari Damacy REROLL and Read Article  ( 7 min )
    FOMO Alert: Discover 7 Unmissable Reasons to Attend GTC 2024
    “I just got back from GTC and ….” In four weeks, those will be among the most powerful words in your industry. But you won’t be able to use them if you haven’t been here. NVIDIA’s GTC 2024 transforms the San Jose Convention Center into a crucible of innovation, learning and community from March 18-21, Read Article  ( 6 min )
  • Open

    Robustness of Algorithms for Causal Structure Learning to Hyperparameter Choice
    arXiv:2310.18212v2 Announce Type: replace Abstract: Hyperparameters play a critical role in machine learning. Hyperparameter tuning can make the difference between state-of-the-art and poor prediction performance for any algorithm, but it is particularly challenging for structure learning due to its unsupervised nature. As a result, hyperparameter tuning is often neglected in favour of using the default values provided by a particular implementation of an algorithm. While there have been numerous studies on performance evaluation of causal discovery algorithms, how hyperparameters affect individual algorithms, as well as the choice of the best algorithm for a specific problem, has not been studied in depth before. This work addresses this gap by investigating the influence of hyperparameters on causal structure learning tasks. Specifically, we perform an empirical evaluation of hyperparameter selection for some seminal learning algorithms on datasets of varying levels of complexity. We find that, while the choice of algorithm remains crucial to obtaining state-of-the-art performance, hyperparameter selection in ensemble settings strongly influences the choice of algorithm, in that a poor choice of hyperparameters can lead to analysts using algorithms which do not give state-of-the-art performance for their data.  ( 2 min )
    Integrating kNN with Foundation Models for Adaptable and Privacy-Aware Image Classification
    arXiv:2402.12500v1 Announce Type: cross Abstract: Traditional deep learning models implicity encode knowledge limiting their transparency and ability to adapt to data changes. Yet, this adaptability is vital for addressing user data privacy concerns. We address this limitation by storing embeddings of the underlying training data independently of the model weights, enabling dynamic data modifications without retraining. Specifically, our approach integrates the $k$-Nearest Neighbor ($k$-NN) classifier with a vision-based foundation model, pre-trained self-supervised on natural images, enhancing interpretability and adaptability. We share open-source implementations of a previously unpublished baseline method as well as our performance-improving contributions. Quantitative experiments confirm improved classification across established benchmark datasets and the method's applicability to distinct medical image classification tasks. Additionally, we assess the method's robustness in continual learning and data removal scenarios. The approach exhibits great promise for bridging the gap between foundation models' performance and challenges tied to data privacy. The source code is available at https://github.com/TobArc/privacy-aware-image-classification-with-kNN.  ( 2 min )
    Physics Informed Neural Network Code for 2D Transient Problems (PINN-2DT) Compatible with Google Colab
    arXiv:2310.03755v2 Announce Type: replace-cross Abstract: We present an open-source Physics Informed Neural Network environment for simulations of transient phenomena on two-dimensional rectangular domains, with the following features: (1) it is compatible with Google Colab which allows automatic execution on cloud environment; (2) it supports two dimensional time-dependent PDEs; (3) it provides simple interface for definition of the residual loss, boundary condition and initial loss, together with their weights; (4) it support Neumann and Dirichlet boundary conditions; (5) it allows for customizing the number of layers and neurons per layer, as well as for arbitrary activation function; (6) the learning rate and number of epochs are available as parameters; (7) it automatically differentiates PINN with respect to spatial and temporal variables; (8) it provides routines for plotting the convergence (with running average), initial conditions learnt, 2D and 3D snapshots from the simulation and movies (9) it includes a library of problems: (a) non-stationary heat transfer; (b) wave equation modeling a tsunami; (c) atmospheric simulations including thermal inversion; (d) tumor growth simulations.  ( 3 min )
    IT Intrusion Detection Using Statistical Learning and Testbed Measurements
    arXiv:2402.13081v1 Announce Type: new Abstract: We study automated intrusion detection in an IT infrastructure, specifically the problem of identifying the start of an attack, the type of attack, and the sequence of actions an attacker takes, based on continuous measurements from the infrastructure. We apply statistical learning methods, including Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), and Random Forest Classifier (RFC) to map sequences of observations to sequences of predicted attack actions. In contrast to most related research, we have abundant data to train the models and evaluate their predictive power. The data comes from traces we generate on an in-house testbed where we run attacks against an emulated IT infrastructure. Central to our work is a machine-learning pipeline that maps measurements from a high-dimensional observation space to a space of low dimensionality or to a small set of observation symbols. Investigating intrusions in offline as well as online scenarios, we find that both HMM and LSTM can be effective in predicting attack start time, attack type, and attack actions. If sufficient training data is available, LSTM achieves higher prediction accuracy than HMM. HMM, on the other hand, requires less computational resources and less training data for effective prediction. Also, we find that the methods we study benefit from data produced by traditional intrusion detection systems like SNORT.  ( 3 min )
    Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation
    arXiv:2402.12406v1 Announce Type: new Abstract: Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge to a student model with the help of a generator without using original data. In such data-free scenarios, achieving stable performance of DFKD is essential due to the unavailability of validation data. Unfortunately, this paper has discovered that existing DFKD methods are quite sensitive to different teacher models, occasionally showing catastrophic failures of distillation, even when using well-trained teacher models. Our observation is that the generator in DFKD is not always guaranteed to produce precise yet diverse samples using the existing representative strategy of minimizing both class-prior and adversarial losses. Through our empirical study, we focus on the fact that class-prior not only decreases the diversity of generated samples, but also cannot completely address the problem of generating unexpectedly low-quality samples depending on teacher models. In this paper, we propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method, with the goal of more robust and stable performance regardless of teacher models. Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator. Specifically, we design a sample selection approach that takes only clean samples verified by the teacher model without imposing restrictions on the power of generating diverse samples. Through extensive experiments, we show that our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods.  ( 3 min )
    Diffeomorphism Neural Operator for various domains and parameters of partial differential equations
    arXiv:2402.12475v1 Announce Type: cross Abstract: Many science and engineering applications demand partial differential equations (PDE) evaluations that are traditionally computed with resource-intensive numerical solvers. Neural operator models provide an efficient alternative by learning the governing physical laws directly from data in a class of PDEs with different parameters, but constrained in a fixed boundary (domain). Many applications, such as design and manufacturing, would benefit from neural operators with flexible domains when studied at scale. Here we present a diffeomorphism neural operator learning framework towards developing domain-flexible models for physical systems with various and complex domains. Specifically, a neural operator trained in a shared domain mapped from various domains of fields by diffeomorphism is proposed, which transformed the problem of learning function mappings in varying domains (spaces) into the problem of learning operators on a shared diffeomorphic domain. Meanwhile, an index is provided to evaluate the generalization of diffeomorphism neural operators in different domains by the domain diffeomorphism similarity. Experiments on statics scenarios (Darcy flow, mechanics) and dynamic scenarios (pipe flow, airfoil flow) demonstrate the advantages of our approach for neural operator learning under various domains, where harmonic and volume parameterization are used as the diffeomorphism for 2D and 3D domains. Our diffeomorphism neural operator approach enables strong learning capability and robust generalization across varying domains and parameters.  ( 2 min )
    Conditional Logical Message Passing Transformer for Complex Query Answering
    arXiv:2402.12954v1 Announce Type: new Abstract: Complex Query Answering (CQA) over Knowledge Graphs (KGs) is a challenging task. Given that KGs are usually incomplete, neural models are proposed to solve CQA by performing multi-hop logical reasoning. However, most of them cannot perform well on both one-hop and multi-hop queries simultaneously. Recent work proposes a logical message passing mechanism based on the pre-trained neural link predictors. While effective on both one-hop and multi-hop queries, it ignores the difference between the constant and variable nodes in a query graph. In addition, during the node embedding update stage, this mechanism cannot dynamically measure the importance of different messages, and whether it can capture the implicit logical dependencies related to a node and received messages remains unclear. In this paper, we propose Conditional Logical Message Passing Transformer (CLMPT), which considers the difference between constants and variables in the case of using pre-trained neural link predictors and performs message passing conditionally on the node type. We empirically verified that this approach can reduce computational costs without affecting performance. Furthermore, CLMPT uses the transformer to aggregate received messages and update the corresponding node embedding. Through the self-attention mechanism, CLMPT can assign adaptive weights to elements in an input set consisting of received messages and the corresponding node and explicitly model logical dependencies between various elements. Experimental results show that CLMPT is a new state-of-the-art neural CQA model.  ( 3 min )
    Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics
    arXiv:2402.12535v1 Announce Type: new Abstract: This study introduces a novel transformer model optimized for large-scale point cloud processing in scientific domains such as high-energy physics (HEP) and astrophysics. Addressing the limitations of graph neural networks and standard transformers, our model integrates local inductive bias and achieves near-linear complexity with hardware-friendly regular operations. One contribution of this work is the quantitative analysis of the error-complexity tradeoff of various sparsification techniques for building efficient transformers. Our findings highlight the superiority of using locality-sensitive hashing (LSH), especially OR \& AND-construction LSH, in kernel approximation for large-scale point cloud data with local inductive bias. Based on this finding, we propose LSH-based Efficient Point Transformer (\textbf{HEPT}), which combines E$^2$LSH with OR \& AND constructions and is built upon regular computations. HEPT demonstrates remarkable performance in two critical yet time-consuming HEP tasks, significantly outperforming existing GNNs and transformers in accuracy and computational speed, marking a significant advancement in geometric deep learning and large-scale scientific data processing. Our code is available at \url{https://github.com/Graph-COM/HEPT}.  ( 2 min )
    Mechanistic Neural Networks for Scientific Machine Learning
    arXiv:2402.13077v1 Announce Type: new Abstract: This paper presents Mechanistic Neural Networks, a neural network design for machine learning applications in the sciences. It incorporates a new Mechanistic Block in standard architectures to explicitly learn governing differential equations as representations, revealing the underlying dynamics of data and enhancing interpretability and efficiency in data modeling. Central to our approach is a novel Relaxed Linear Programming Solver (NeuRLP) inspired by a technique that reduces solving linear ODEs to solving linear programs. This integrates well with neural networks and surpasses the limitations of traditional ODE solvers enabling scalable GPU parallel processing. Overall, Mechanistic Neural Networks demonstrate their versatility for scientific machine learning applications, adeptly managing tasks from equation discovery to dynamic systems modeling. We prove their comprehensive capabilities in analyzing and interpreting complex scientific data across various applications, showing significant performance against specialized state-of-the-art methods.  ( 2 min )
    Improving Model's Interpretability and Reliability using Biomarkers
    arXiv:2402.12394v1 Announce Type: cross Abstract: Accurate and interpretable diagnostic models are crucial in the safety-critical field of medicine. We investigate the interpretability of our proposed biomarker-based lung ultrasound diagnostic pipeline to enhance clinicians' diagnostic capabilities. The objective of this study is to assess whether explanations from a decision tree classifier, utilizing biomarkers, can improve users' ability to identify inaccurate model predictions compared to conventional saliency maps. Our findings demonstrate that decision tree explanations, based on clinically established biomarkers, can assist clinicians in detecting false positives, thus improving the reliability of diagnostic models in medicine.  ( 2 min )
    Bayesian Reward Models for LLM Alignment
    arXiv:2402.13210v1 Announce Type: new Abstract: To ensure that large language model (LLM) responses are helpful and non-toxic, we usually fine-tune a reward model on human preference data. We then select policy responses with high rewards (best-of-n sampling) or further optimize the policy to produce responses with high rewards (reinforcement learning from human feedback). However, this process is vulnerable to reward overoptimization or hacking, in which the responses selected have high rewards due to errors in the reward model rather than a genuine preference. This is especially problematic as the prompt or response diverges from the training data. It should be possible to mitigate these issues by training a Bayesian reward model, which signals higher uncertainty further from the training data distribution. Therefore, we trained Bayesian reward models using Laplace-LoRA (Yang et al., 2024) and found that the resulting uncertainty estimates can successfully mitigate reward overoptimization in best-of-n sampling.  ( 2 min )
    Enhancing Real-World Complex Network Representations with Hyperedge Augmentation
    arXiv:2402.13033v1 Announce Type: new Abstract: Graph augmentation methods play a crucial role in improving the performance and enhancing generalisation capabilities in Graph Neural Networks (GNNs). Existing graph augmentation methods mainly perturb the graph structures and are usually limited to pairwise node relations. These methods cannot fully address the complexities of real-world large-scale networks that often involve higher-order node relations beyond only being pairwise. Meanwhile, real-world graph datasets are predominantly modelled as simple graphs, due to the scarcity of data that can be used to form higher-order edges. Therefore, reconfiguring the higher-order edges as an integration into graph augmentation strategies lights up a promising research path to address the aforementioned issues. In this paper, we present Hyperedge Augmentation (HyperAug), a novel graph augmentation method that constructs virtual hyperedges directly form the raw data, and produces auxiliary node features by extracting from the virtual hyperedge information, which are used for enhancing GNN performances on downstream tasks. We design three diverse virtual hyperedge construction strategies to accompany the augmentation scheme: (1) via graph statistics, (2) from multiple data perspectives, and (3) utilising multi-modality. Furthermore, to facilitate HyperAug evaluation, we provide 23 novel real-world graph datasets across various domains including social media, biology, and e-commerce. Our empirical study shows that HyperAug consistently and significantly outperforms GNN baselines and other graph augmentation methods, across a variety of application contexts, which clearly indicates that it can effectively incorporate higher-order node relations into graph augmentation methods for real-world complex networks.  ( 3 min )
    Structural Knowledge Informed Continual Multivariate Time Series Forecasting
    arXiv:2402.12722v1 Announce Type: new Abstract: Recent studies in multivariate time series (MTS) forecasting reveal that explicitly modeling the hidden dependencies among different time series can yield promising forecasting performance and reliable explanations. However, modeling variable dependencies remains underexplored when MTS is continuously accumulated under different regimes (stages). Due to the potential distribution and dependency disparities, the underlying model may encounter the catastrophic forgetting problem, i.e., it is challenging to memorize and infer different types of variable dependencies across different regimes while maintaining forecasting performance. To address this issue, we propose a novel Structural Knowledge Informed Continual Learning (SKI-CL) framework to perform MTS forecasting within a continual learning paradigm, which leverages structural knowledge to steer the forecasting model toward identifying and adapting to different regimes, and selects representative MTS samples from each regime for memory replay. Specifically, we develop a forecasting model based on graph structure learning, where a consistency regularization scheme is imposed between the learned variable dependencies and the structural knowledge while optimizing the forecasting objective over the MTS data. As such, MTS representations learned in each regime are associated with distinct structural knowledge, which helps the model memorize a variety of conceivable scenarios and results in accurate forecasts in the continual learning context. Meanwhile, we develop a representation-matching memory replay scheme that maximizes the temporal coverage of MTS data to efficiently preserve the underlying temporal dynamics and dependency structures of each regime. Thorough empirical studies on synthetic and real-world benchmarks validate SKI-CL's efficacy and advantages over the state-of-the-art for continual MTS forecasting tasks.  ( 3 min )
    DBNets: A publicly available deep learning tool to measure the masses of young planets in dusty protoplanetary discs
    arXiv:2402.12448v1 Announce Type: cross Abstract: Current methods to characterize embedded planets in protoplanetary disc observations are severely limited either in their ability to fully account for the observed complex physics or in their computational and time costs. To address this shortcoming, we developed DBNets: a deep learning tool, based on convolutional neural networks, that analyses substructures observed in the dust continuum emission of protoplanetary discs to quickly infer the mass of allegedly embedded planets. We focussed on developing a method to reliably quantify not only the planet mass, but also the associated uncertainty introduced by our modelling and adopted techniques. Our tests gave promising results achieving an 87% reduction of the log Mp mean squared error with respect to an analytical formula fitted on the same data (DBNets metrics: lmse 0.016, r2-score 97%). With the goal of providing the final user of DBNets with all the tools needed to interpret their measurements and decide on their significance, we extensively tested our tool on out-of-distribution data. We found that DBNets can identify inputs strongly outside its training scope returning an uncertainty above a specific threshold and we thus provided a rejection criterion that helps determine the significance of the results obtained. Additionally, we outlined some limitations of our tool: it can be reliably applied only on discs observed with inclinations below approximately 60{\deg}, in the optically thin regime, with a resolution 8 times better than the gap radial location and with a signal-to-noise ratio higher than approximately ten. Finally, we applied DBNets to 33 actual observations of protoplanetary discs measuring the mass of 48 proposed planets and comparing our results with the available literature. We confirmed that most of the observed gaps imply planets in the sub-Jupiter regime. DBNets is publicly available at dbnets.fisica.unimi.it.  ( 3 min )
    Investigating the Impact of Model Instability on Explanations and Uncertainty
    arXiv:2402.13006v1 Announce Type: new Abstract: Explainable AI methods facilitate the understanding of model behaviour, yet, small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn't necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.  ( 2 min )
    Attacks on Node Attributes in Graph Neural Networks
    arXiv:2402.12426v1 Announce Type: cross Abstract: Graphs are commonly used to model complex networks prevalent in modern social media and literacy applications. Our research investigates the vulnerability of these graphs through the application of feature based adversarial attacks, focusing on both decision-time attacks and poisoning attacks. In contrast to state-of-the-art models like Net Attack and Meta Attack, which target node attributes and graph structure, our study specifically targets node attributes. For our analysis, we utilized the text dataset Hellaswag and graph datasets Cora and CiteSeer, providing a diverse basis for evaluation. Our findings indicate that decision-time attacks using Projected Gradient Descent (PGD) are more potent compared to poisoning attacks that employ Mean Node Embeddings and Graph Contrastive Learning strategies. This provides insights for graph data security, pinpointing where graph-based models are most vulnerable and thereby informing the development of stronger defense mechanisms against such attacks.  ( 2 min )
    Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks
    arXiv:2402.12411v1 Announce Type: cross Abstract: Node importance estimation problem has been studied conventionally with homogeneous network topology analysis. To deal with network heterogeneity, a few recent methods employ graph neural models to automatically learn diverse sources of information. However, the major concern revolves around that their full adaptive learning process may lead to insufficient information exploration, thereby formulating the problem as the isolated node value prediction with underperformance and less interpretability. In this work, we propose a novel learning framework: SKES. Different from previous automatic learning designs, SKES exploits heterogeneous structural knowledge to enrich the informativeness of node representations. Based on a sufficiently uninformative reference, SKES estimates the importance value for any input node, by quantifying its disparity against the reference. This establishes an interpretable node importance computation paradigm. Furthermore, SKES dives deep into the understanding that "nodes with similar characteristics are prone to have similar importance values" whilst guaranteeing that such informativeness disparity between any different nodes is orderly reflected by the embedding distance of their associated latent features. Extensive experiments on three widely-evaluated benchmarks demonstrate the performance superiority of SKES over several recent competing methods.  ( 2 min )
    On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
    arXiv:2402.12423v1 Announce Type: cross Abstract: The incorporation of Denoising Diffusion Models (DDMs) in the Text-to-Speech (TTS) domain is rising, providing great value in synthesizing high quality speech. Although they exhibit impressive audio quality, the extent of their semantic capabilities is unknown, and controlling their synthesized speech's vocal properties remains a challenge. Inspired by recent advances in image synthesis, we explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised. We then demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements. We present evidence of the semantic and acoustic qualities of the edited audio, and provide supplemental samples: https://latent-analysis-grad-tts.github.io/speech-samples/.  ( 2 min )
    Emulating the interstellar medium chemistry with neural operators
    arXiv:2402.12435v1 Announce Type: cross Abstract: Galaxy formation and evolution critically depend on understanding the complex photo-chemical processes that govern the evolution and thermodynamics of the InterStellar Medium (ISM). Computationally, solving chemistry is among the most heavy tasks in cosmological and astrophysical simulations. The evolution of such non-equilibrium photo-chemical network relies on implicit, precise, computationally costly, ordinary differential equations (ODE) solvers. Here, we aim at substituting such procedural solvers with fast, pre-trained, emulators based on neural operators. We emulate a non-equilibrium chemical network up to H$_2$ formation (9 species, 52 reactions) by adopting the DeepONet formalism, i.e. by splitting the ODE solver operator that maps the initial conditions and time evolution into a tensor product of two neural networks. We use $\texttt{KROME}$ to generate a training set spanning $-2\leq \log(n/\mathrm{cm}^{-3}) \leq 3.5$, $\log(20) \leq\log(T/\mathrm{K}) \leq 5.5$, $-6 \leq \log(n_i/n) < 0$, and by adopting an incident radiation field $\textbf{F}$ sampled in 10 energy bins with a continuity prior. We separately train the solver for $T$ and each $n_i$ for $\simeq 4.34\,\rm GPUhrs$. Compared with the reference solutions obtained by $\texttt{KROME}$ for single zone models, the typical precision obtained is of order $10^{-2}$, i.e. the $10 \times$ better with a training that is $40 \times$ less costly with respect to previous emulators which however considered only a fixed $\mathbf{F}$. The present model achieves a speed-up of a factor of $128 \times$ with respect to stiff ODE solvers. Our neural emulator represents a significant leap forward in the modeling of ISM chemistry, offering a good balance of precision, versatility, and computational efficiency.  ( 3 min )
    Multi-class Temporal Logic Neural Networks
    arXiv:2402.12397v1 Announce Type: cross Abstract: Time-series data can represent the behaviors of autonomous systems, such as drones and self-driving cars. The problem of binary and multi-class classification has received a lot of attention in this field. Neural networks represent a popular approach to classifying data; However, they lack interpretability, which poses a significant challenge in extracting meaningful information from them. Signal Temporal Logic (STL) is a formalism to describe the properties of timed behaviors. We propose a method that combines all of the above: neural networks that represent STL specifications for multi-class classification of time-series data. We offer two key contributions: 1) We introduce a notion of margin for multi-class classification, and 2) we introduce the use of STL-based attributes for enhancing the interpretability of the results. We evaluate our method on two datasets and compare with state-of-the-art baselines.  ( 2 min )
    Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data
    arXiv:2402.12391v1 Announce Type: cross Abstract: Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.  ( 2 min )
    Estimating the age-conditioned average treatment effects curves: An application for assessing load-management strategies in the NBA
    arXiv:2402.12400v1 Announce Type: cross Abstract: In the realm of competitive sports, understanding the performance dynamics of athletes, represented by the age curve (showing progression, peak, and decline), is vital. Our research introduces a novel framework for quantifying age-specific treatment effects, enhancing the granularity of performance trajectory analysis. Firstly, we propose a methodology for estimating the age curve using game-level data, diverging from traditional season-level data approaches, and tackling its inherent complexities with a meta-learner framework that leverages advanced machine learning models. This approach uncovers intricate non-linear patterns missed by existing methods. Secondly, our framework enables the identification of causal effects, allowing for a detailed examination of age curves under various conditions. By defining the Age-Conditioned Treatment Effect (ACTE), we facilitate the exploration of causal relationships regarding treatment impacts at specific ages. Finally, applying this methodology to study the effects of rest days on performance metrics, particularly across different ages, offers valuable insights into load management strategies' effectiveness. Our findings underscore the importance of tailored rest periods, highlighting their positive impact on athlete performance and suggesting a reevaluation of current management practices for optimizing athlete performance.  ( 2 min )
    Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
    arXiv:2402.12343v1 Announce Type: cross Abstract: Large language models (LLMs) need to undergo safety alignment to ensure safe conversations with humans. However, in this work, we introduce an inference-time attack framework, demonstrating that safety alignment can also unintentionally facilitate harmful outcomes under adversarial manipulation. This framework, named Emulated Disalignment (ED), adversely combines a pair of open-source pre-trained and safety-aligned language models in the output space to produce a harmful language model without any training. Our experiments with ED across three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. Crucially, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.  ( 2 min )
    SMORE: Similarity-based Hyperdimensional Domain Adaptation for Multi-Sensor Time Series Classification
    arXiv:2402.13233v1 Announce Type: new Abstract: Many real-world applications of the Internet of Things (IoT) employ machine learning (ML) algorithms to analyze time series information collected by interconnected sensors. However, distribution shift, a fundamental challenge in data-driven ML, arises when a model is deployed on a data distribution different from the training data and can substantially degrade model performance. Additionally, increasingly sophisticated deep neural networks (DNNs) are required to capture intricate spatial and temporal dependencies in multi-sensor time series data, often exceeding the capabilities of today's edge devices. In this paper, we propose SMORE, a novel resource-efficient domain adaptation (DA) algorithm for multi-sensor time series classification, leveraging the efficient and parallel operations of hyperdimensional computing. SMORE dynamically customizes test-time models with explicit consideration of the domain context of each sample to mitigate the negative impacts of domain shifts. Our evaluation on a variety of multi-sensor time series classification tasks shows that SMORE achieves on average 1.98% higher accuracy than state-of-the-art (SOTA) DNN-based DA algorithms with 18.81x faster training and 4.63x faster inference.  ( 2 min )
    CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning
    arXiv:2402.13221v1 Announce Type: new Abstract: Advances in graph machine learning (ML) have been driven by applications in chemistry as graphs have remained the most expressive representations of molecules. While early graph ML methods focused primarily on small organic molecules, recently, the scope of graph ML has expanded to include inorganic materials. Modelling the periodicity and symmetry of inorganic crystalline materials poses unique challenges, which existing graph ML methods are unable to address. Moving to inorganic nanomaterials increases complexity as the scale of number of nodes within each graph can be broad ($10$ to $10^5$). The bulk of existing graph ML focuses on characterising molecules and materials by predicting target properties with graphs as input. However, the most exciting applications of graph ML will be in their generative capabilities, which is currently not at par with other domains such as images or text. We invite the graph ML community to address these open challenges by presenting two new chemically-informed large-scale inorganic (CHILI) nanomaterials datasets: A medium-scale dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from 12 selected crystal types (CHILI-3K) and a large-scale dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined crystal structures (CHILI-100K). We define 11 property prediction tasks and 6 structure prediction tasks, which are of special interest for nanomaterial research. We benchmark the performance of a wide array of baseline methods and use these benchmarking results to highlight areas which need future work. To the best of our knowledge, CHILI-3K and CHILI-100K are the first open-source nanomaterial datasets of this scale -- both on the individual graph level and of the dataset as a whole -- and the only nanomaterials datasets with high structural and elemental diversity.  ( 3 min )
    How Does Selection Leak Privacy: Revisiting Private Selection and Improved Results for Hyper-parameter Tuning
    arXiv:2402.13087v1 Announce Type: new Abstract: We study the problem of guaranteeing Differential Privacy (DP) in hyper-parameter tuning, a crucial process in machine learning involving the selection of the best run from several. Unlike many private algorithms, including the prevalent DP-SGD, the privacy implications of tuning remain insufficiently understood. Recent works propose a generic private solution for the tuning process, yet a fundamental question still persists: is the current privacy bound for this solution tight? This paper contributes both positive and negative answers to this question. Initially, we provide studies affirming the current privacy analysis is indeed tight in a general sense. However, when we specifically study the hyper-parameter tuning problem, such tightness no longer holds. This is first demonstrated by applying privacy audit on the tuning process. Our findings underscore a substantial gap between the current theoretical privacy bound and the empirical bound derived even under the strongest audit setup. The gap found is not a fluke. Our subsequent study provides an improved privacy result for private hyper-parameter tuning due to its distinct properties. Our privacy results are also more generalizable compared to prior analyses that are only easily applicable in specific setups.  ( 2 min )
    Federated Causal Discovery from Heterogeneous Data
    arXiv:2402.13241v1 Announce Type: new Abstract: Conventional causal discovery methods rely on centralized data, which is inconsistent with the decentralized nature of data in many real-world situations. This discrepancy has motivated the development of federated causal discovery (FCD) approaches. However, existing FCD methods may be limited by their potentially restrictive assumptions of identifiable functional causal models or homogeneous data distributions, narrowing their applicability in diverse scenarios. In this paper, we propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. We first utilize a surrogate variable corresponding to the client index to account for the data heterogeneity across different clients. We then develop a federated conditional independence test (FCIT) for causal skeleton discovery and establish a federated independent change principle (FICP) to determine causal directions. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. Owing to the nonparametric properties, FCIT and FICP make no assumption about particular functional forms, thereby facilitating the handling of arbitrary causal models. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method. The code is available at \url{https://github.com/lokali/FedCDH.git}.  ( 2 min )
    Order-Optimal Regret in Distributed Kernel Bandits using Uniform Sampling with Shared Randomness
    arXiv:2402.13182v1 Announce Type: new Abstract: We consider distributed kernel bandits where $N$ agents aim to collaboratively maximize an unknown reward function that lies in a reproducing kernel Hilbert space. Each agent sequentially queries the function to obtain noisy observations at the query points. Agents can share information through a central server, with the objective of minimizing regret that is accumulating over time $T$ and aggregating over agents. We develop the first algorithm that achieves the optimal regret order (as defined by centralized learning) with a communication cost that is sublinear in both $N$ and $T$. The key features of the proposed algorithm are the uniform exploration at the local agents and shared randomness with the central server. Working together with the sparse approximation of the GP model, these two key components make it possible to preserve the learning rate of the centralized setting at a diminishing rate of communication.  ( 2 min )
    BuffGraph: Enhancing Class-Imbalanced Node Classification via Buffer Nodes
    arXiv:2402.13114v1 Announce Type: new Abstract: Class imbalance in graph-structured data, where minor classes are significantly underrepresented, poses a critical challenge for Graph Neural Networks (GNNs). To address this challenge, existing studies generally generate new minority nodes and edges connecting new nodes to the original graph to make classes balanced. However, they do not solve the problem that majority classes still propagate information to minority nodes by edges in the original graph which introduces bias towards majority classes. To address this, we introduce BuffGraph, which inserts buffer nodes into the graph, modulating the impact of majority classes to improve minor class representation. Our extensive experiments across diverse real-world datasets empirically demonstrate that BuffGraph outperforms existing baseline methods in class-imbalanced node classification in both natural settings and imbalanced settings. Code is available at https://anonymous.4open.science/r/BuffGraph-730A.  ( 2 min )
    Text-Guided Molecule Generation with Diffusion Language Model
    arXiv:2402.13040v1 Announce Type: new Abstract: Text-guided molecule generation is a task where molecules are generated to match specific textual descriptions. Recently, most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose the Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that leverages diffusion models to address the limitations of autoregressive methods. TGM-DLM updates token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise, guided by the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources. Our findings underscore the remarkable effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.  ( 2 min )
    Practical Kernel Tests of Conditional Independence
    arXiv:2402.13196v1 Announce Type: new Abstract: We describe a data-efficient, kernel-based approach to statistical testing of conditional independence. A major challenge of conditional independence testing, absent in tests of unconditional independence, is to obtain the correct test level (the specified upper bound on the rate of false positives), while still attaining competitive test power. Excess false positives arise due to bias in the test statistic, which is obtained using nonparametric kernel ridge regression. We propose three methods for bias control to correct the test level, based on data splitting, auxiliary data, and (where possible) simpler function classes. We show these combined strategies are effective both for synthetic and real-world data.  ( 2 min )
    Testing Calibration in Subquadratic Time
    arXiv:2402.13187v1 Announce Type: new Abstract: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on (predictions, binary outcomes), our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We design an algorithm based on approximate linear programming, which solves calibration testing information-theoretically optimally (up to constant factors) in time $O(n^{1.5} \log(n))$. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem, and give sample complexity lower bounds for alternative calibration distances to the one considered in this work. Finally, we present preliminary experiments showing that the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale to accommodate moderate sample sizes.  ( 2 min )
    Neural Network Diffusion
    arXiv:2402.13144v1 Announce Type: new Abstract: Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also \textit{generate high-performing neural network parameters}. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion model is then trained to synthesize these latent parameter representations from random noise. It then generates new representations that are passed through the autoencoder's decoder, whose outputs are ready to use as new subsets of network parameters. Across various architectures and datasets, our diffusion process consistently generates models of comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models perform differently with the trained networks. Our results encourage more exploration on the versatile use of diffusion models.  ( 2 min )
    Towards an empirical understanding of MoE design choices
    arXiv:2402.13089v1 Announce Type: new Abstract: In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels. We also present empirical evidence showing comparable performance between a learned router and a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that Sequence-level routing can result in topic-specific weak expert specialization, in contrast to syntax specialization observed with Token-level routing.  ( 2 min )
    Diffusion Posterior Sampling is Computationally Intractable
    arXiv:2402.12727v1 Announce Type: new Abstract: Diffusion models are a remarkably effective way of learning and sampling from a distribution $p(x)$. In posterior sampling, one is also given a measurement model $p(y \mid x)$ and a measurement $y$, and would like to sample from $p(x \mid y)$. Posterior sampling is useful for tasks such as inpainting, super-resolution, and MRI reconstruction, so a number of recent works have given algorithms to heuristically approximate it; but none are known to converge to the correct distribution in polynomial time. In this paper we show that posterior sampling is \emph{computationally intractable}: under the most basic assumption in cryptography -- that one-way functions exist -- there are instances for which \emph{every} algorithm takes superpolynomial time, even though \emph{unconditional} sampling is provably fast. We also show that the exponential-time rejection sampling algorithm is essentially optimal under the stronger plausible assumption that there are one-way functions that take exponential time to invert.  ( 2 min )
    Right on Time: Revising Time Series Models by Constraining their Explanations
    arXiv:2402.12921v1 Announce Type: new Abstract: The reliability of deep time series models is often compromised by their tendency to rely on confounding factors, which may lead to misleading results. Our newly recorded, naturally confounded dataset named P2S from a real mechanical production line emphasizes this. To tackle the challenging problem of mitigating confounders in time series data, we introduce Right on Time (RioT). Our method enables interactions with model explanations across both the time and frequency domain. Feedback on explanations in both domains is then used to constrain the model, steering it away from the annotated confounding factors. The dual-domain interaction strategy is crucial for effectively addressing confounders in time series datasets. We empirically demonstrate that RioT can effectively guide models away from the wrong reasons in P2S as well as popular time series classification and forecasting datasets.  ( 2 min )
    Fast Rates in Online Convex Optimization by Exploiting the Curvature of Feasible Sets
    arXiv:2402.12868v1 Announce Type: new Abstract: In this paper, we explore online convex optimization (OCO) and introduce a new analysis that provides fast rates by exploiting the curvature of feasible sets. In online linear optimization, it is known that if the average gradient of loss functions is larger than a certain value, the curvature of feasible sets can be exploited by the follow-the-leader (FTL) algorithm to achieve a logarithmic regret. This paper reveals that algorithms adaptive to the curvature of loss functions can also leverage the curvature of feasible sets. We first prove that if an optimal decision is on the boundary of a feasible set and the gradient of an underlying loss function is non-zero, then the algorithm achieves a regret upper bound of $O(\rho \log T)$ in stochastic environments. Here, $\rho > 0$ is the radius of the smallest sphere that includes the optimal decision and encloses the feasible set. Our approach, unlike existing ones, can work directly with convex loss functions, exploiting the curvature of loss functions simultaneously, and can achieve the logarithmic regret only with a local property of feasible sets. Additionally, it achieves an $O(\sqrt{T})$ regret even in adversarial environments where FTL suffers an $\Omega(T)$ regret, and attains an $O(\rho \log T + \sqrt{C \rho \log T})$ regret bound in corrupted stochastic environments with corruption level $C$. Furthermore, by extending our analysis, we establish a regret upper bound of $O\Big(T^{\frac{q-2}{2(q-1)}} (\log T)^{\frac{q}{2(q-1)}}\Big)$ for $q$-uniformly convex feasible sets, where uniformly convex sets include strongly convex sets and $\ell_p$-balls for $p \in [1,\infty)$. This bound bridges the gap between the $O(\log T)$ regret bound for strongly convex sets ($q=2$) and the $O(\sqrt{T})$ regret bound for non-curved sets ($q\to\infty$).  ( 3 min )
    Spurious Correlations in Machine Learning: A Survey
    arXiv:2402.12715v1 Announce Type: new Abstract: Machine learning systems are known to be sensitive to spurious correlations between biased features of the inputs (e.g., background, texture, and secondary objects) and the corresponding labels. These features and their correlations with the labels are known as "spurious" because they tend to change with shifts in real-world data distributions, which can negatively impact the model's generalization and robustness. In this survey, we provide a comprehensive review of this issue, along with a taxonomy of current state-of-the-art methods for addressing spurious correlations in machine learning models. Additionally, we summarize existing datasets, benchmarks, and metrics to aid future research. The paper concludes with a discussion of the recent advancements and future research challenges in this field, aiming to provide valuable insights for researchers in the related domains.  ( 2 min )
    Revitalizing Multivariate Time Series Forecasting: Learnable Decomposition with Inter-Series Dependencies and Intra-Series Variations Modeling
    arXiv:2402.12694v1 Announce Type: new Abstract: Predicting multivariate time series is crucial, demanding precise modeling of intricate patterns, including inter-series dependencies and intra-series variations. Distinctive trend characteristics in each time series pose challenges, and existing methods, relying on basic moving average kernels, may struggle with the non-linear structure and complex trends in real-world data. Given that, we introduce a learnable decomposition strategy to capture dynamic trend information more reasonably. Additionally, we propose a dual attention module tailored to capture inter-series dependencies and intra-series variations simultaneously for better time series forecasting, which is implemented by channel-wise self-attention and autoregressive self-attention. To evaluate the effectiveness of our method, we conducted experiments across eight open-source datasets and compared it with the state-of-the-art methods. Through the comparison results, our Leddam (LEarnable Decomposition and Dual Attention Module) not only demonstrates significant advancements in predictive performance, but also the proposed decomposition strategy can be plugged into other methods with a large performance-boosting, from 11.87% to 48.56% MSE error degradation.  ( 2 min )
    Training Artificial Neural Networks by Coordinate Search Algorithm
    arXiv:2402.12646v1 Announce Type: new Abstract: Training Artificial Neural Networks poses a challenging and critical problem in machine learning. Despite the effectiveness of gradient-based learning methods, such as Stochastic Gradient Descent (SGD), in training neural networks, they do have several limitations. For instance, they require differentiable activation functions, and cannot optimize a model based on several independent non-differentiable loss functions simultaneously; for example, the F1-score, which is used during testing, can be used during training when a gradient-free optimization algorithm is utilized. Furthermore, the training in any DNN can be possible with a small size of the training dataset. To address these concerns, we propose an efficient version of the gradient-free Coordinate Search (CS) algorithm, an instance of General Pattern Search methods, for training neural networks. The proposed algorithm can be used with non-differentiable activation functions and tailored to multi-objective/multi-loss problems. Finding the optimal values for weights of ANNs is a large-scale optimization problem. Therefore instead of finding the optimal value for each variable, which is the common technique in classical CS, we accelerate optimization and convergence by bundling the weights. In fact, this strategy is a form of dimension reduction for optimization problems. Based on the experimental results, the proposed method, in some cases, outperforms the gradient-based approach, particularly, in situations with insufficient labeled training data. The performance plots demonstrate a high convergence rate, highlighting the capability of our suggested method to find a reasonable solution with fewer function calls. As of now, the only practical and efficient way of training ANNs with hundreds of thousands of weights is gradient-based algorithms such as SGD or Adam. In this paper we introduce an alternative method for training ANN.  ( 3 min )
    Gaussian Process Neural Additive Models
    arXiv:2402.12518v1 Announce Type: new Abstract: Deep neural networks have revolutionized many fields, but their black-box nature also occasionally prevents their wider adoption in fields such as healthcare and finance, where interpretable and explainable models are required. The recent development of Neural Additive Models (NAMs) is a significant step in the direction of interpretable deep learning for tabular datasets. In this paper, we propose a new subclass of NAMs that use a single-layer neural network construction of the Gaussian process via random Fourier features, which we call Gaussian Process Neural Additive Models (GP-NAM). GP-NAMs have the advantage of a convex objective function and number of trainable parameters that grows linearly with feature dimensionality. It suffers no loss in performance compared to deeper NAM approaches because GPs are well-suited for learning complex non-parametric univariate functions. We demonstrate the performance of GP-NAM on several tabular datasets, showing that it achieves comparable or better performance in both classification and regression tasks with a large reduction in the number of parameters.  ( 2 min )
    Evaluation of Country Dietary Habits Using Machine Learning Techniques in Relation to Deaths from COVID-19
    arXiv:2402.12558v1 Announce Type: new Abstract: COVID-19 disease has affected almost every country in the world. The large number of infected people and the different mortality rates between countries has given rise to many hypotheses about the key points that make the virus so lethal in some places. In this study, the eating habits of 170 countries were evaluated in order to find correlations between these habits and mortality rates caused by COVID-19 using machine learning techniques that group the countries together according to the different distribution of fat, energy, and protein across 23 different types of food, as well as the amount ingested in kilograms. Results shown how obesity and the high consumption of fats appear in countries with the highest death rates, whereas countries with a lower rate have a higher level of cereal consumption accompanied by a lower total average intake of kilocalories.  ( 3 min )
    SDEs for Minimax Optimization
    arXiv:2402.12508v1 Announce Type: new Abstract: Minimax optimization problems have attracted a lot of attention over the past few years, with applications ranging from economics to machine learning. While advanced optimization methods exist for such problems, characterizing their dynamics in stochastic scenarios remains notably challenging. In this paper, we pioneer the use of stochastic differential equations (SDEs) to analyze and compare Minimax optimizers. Our SDE models for Stochastic Gradient Descent-Ascent, Stochastic Extragradient, and Stochastic Hamiltonian Gradient Descent are provable approximations of their algorithmic counterparts, clearly showcasing the interplay between hyperparameters, implicit regularization, and implicit curvature-induced noise. This perspective also allows for a unified and simplified analysis strategy based on the principles of It\^o calculus. Finally, our approach facilitates the derivation of convergence conditions and closed-form solutions for the dynamics in simplified settings, unveiling further insights into the behavior of different optimizers.  ( 2 min )
    Multivariate Functional Linear Discriminant Analysis for the Classification of Short Time Series with Missing Data
    arXiv:2402.13103v1 Announce Type: new Abstract: Functional linear discriminant analysis (FLDA) is a powerful tool that extends LDA-mediated multiclass classification and dimension reduction to univariate time-series functions. However, in the age of large multivariate and incomplete data, statistical dependencies between features must be estimated in a computationally tractable way, while also dealing with missing data. There is a need for a computationally tractable approach that considers the statistical dependencies between features and can handle missing values. We here develop a multivariate version of FLDA (MUDRA) to tackle this issue and describe an efficient expectation/conditional-maximization (ECM) algorithm to infer its parameters. We assess its predictive power on the "Articulary Word Recognition" data set and show its improvement over the state-of-the-art, especially in the case of missing data. MUDRA allows interpretable classification of data sets with large proportions of missing data, which will be particularly useful for medical or psychological data sets.  ( 2 min )
    Discovering Behavioral Modes in Deep Reinforcement Learning Policies Using Trajectory Clustering in Latent Space
    arXiv:2402.12939v1 Announce Type: new Abstract: Understanding the behavior of deep reinforcement learning (DRL) agents is crucial for improving their performance and reliability. However, the complexity of their policies often makes them challenging to understand. In this paper, we introduce a new approach for investigating the behavior modes of DRL policies, which involves utilizing dimensionality reduction and trajectory clustering in the latent space of neural networks. Specifically, we use Pairwise Controlled Manifold Approximation Projection (PaCMAP) for dimensionality reduction and TRACLUS for trajectory clustering to analyze the latent space of a DRL policy trained on the Mountain Car control task. Our methodology helps identify diverse behavior patterns and suboptimal choices by the policy, thus allowing for targeted improvements. We demonstrate how our approach, combined with domain knowledge, can enhance a policy's performance in specific regions of the state space.  ( 2 min )
    On the Stability of Gradient Descent for Large Learning Rate
    arXiv:2402.13108v1 Announce Type: new Abstract: There currently is a significant interest in understanding the Edge of Stability (EoS) phenomenon, which has been observed in neural networks training, characterized by a non-monotonic decrease of the loss function over epochs, while the sharpness of the loss (spectral norm of the Hessian) progressively approaches and stabilizes around 2/(learning rate). Reasons for the existence of EoS when training using gradient descent have recently been proposed -- a lack of flat minima near the gradient descent trajectory together with the presence of compact forward-invariant sets. In this paper, we show that linear neural networks optimized under a quadratic loss function satisfy the first assumption and also a necessary condition for the second assumption. More precisely, we prove that the gradient descent map is non-singular, the set of global minimizers of the loss function forms a smooth manifold, and the stable minima form a bounded subset in parameter space. Additionally, we prove that if the step-size is too big, then the set of initializations from which gradient descent converges to a critical point has measure zero.  ( 2 min )
    Defending Jailbreak Prompts via In-Context Adversarial Game
    arXiv:2402.13148v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism.  ( 2 min )
    Align Your Intents: Offline Imitation Learning via Optimal Transport
    arXiv:2402.13037v1 Announce Type: new Abstract: Offline reinforcement learning (RL) addresses the problem of sequential decision-making by learning optimal policy through pre-collected data, without interacting with the environment. As yet, it has remained somewhat impractical, because one rarely knows the reward explicitly and it is hard to distill it retrospectively. Here, we show that an imitating agent can still learn the desired behavior merely from observing the expert, despite the absence of explicit rewards or action labels. In our method, AILOT (Aligned Imitation Learning via Optimal Transport), we involve special representation of states in a form of intents that incorporate pairwise spatial distances within the data. Given such representations, we define intrinsic reward function via optimal transport distance between the expert's and the agent's trajectories. We report that AILOT outperforms state-of-the art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms in the sparse-reward tasks.  ( 2 min )
    Stochastic Approximation Approach to Federated Machine Learning
    arXiv:2402.12945v1 Announce Type: new Abstract: This paper examines Federated learning (FL) in a Stochastic Approximation (SA) framework. FL is a collaborative way to train neural network models across various participants or clients without centralizing their data. Each client will train a model on their respective data and send the weights across to a the server periodically for aggregation. The server aggregates these weights which are then used by the clients to re-initialize their neural network and continue the training. SA is an iterative algorithm that uses approximate sample gradients and tapering step size to locate a minimizer of a cost function. In this paper the clients use a stochastic approximation iterate to update the weights of its neural network. It is shown that the aggregated weights track an autonomous ODE. Numerical simulations are performed and the results are compared with standard algorithms like FedAvg and FedProx. It is observed that the proposed algorithm is robust and gives more reliable estimates of the weights, in particular when the clients data are not identically distributed.  ( 2 min )
    SubIQ: Inverse Soft-Q Learning for Offline Imitation with Suboptimal Demonstrations
    arXiv:2402.13147v1 Announce Type: new Abstract: We consider offline imitation learning (IL), which aims to mimic the expert's behavior from its demonstration without further interaction with the environment. One of the main challenges in offline IL is dealing with the limited support of expert demonstrations that cover only a small fraction of the state-action spaces. In this work, we consider offline IL, where expert demonstrations are limited but complemented by a larger set of sub-optimal demonstrations of lower expertise levels. Most of the existing offline IL methods developed for this setting are based on behavior cloning or distribution matching, where the aim is to match the occupancy distribution of the imitation policy with that of the expert policy. Such an approach often suffers from over-fitting, as expert demonstrations are limited to accurately represent any occupancy distribution. On the other hand, since sub-optimal sets are much larger, there is a high chance that the imitation policy is trained towards sub-optimal policies. In this paper, to address these issues, we propose a new approach based on inverse soft-Q learning, where a regularization term is added to the training objective, with the aim of aligning the learned rewards with a pre-assigned reward function that allocates higher weights to state-action pairs from expert demonstrations, and lower weights to those from lower expertise levels. On standard benchmarks, our inverse soft-Q learning significantly outperforms other offline IL baselines by a large margin.  ( 3 min )
    Improve Cross-Architecture Generalization on Dataset Distillation
    arXiv:2402.13007v1 Announce Type: new Abstract: Dataset distillation, a pragmatic approach in machine learning, aims to create a smaller synthetic dataset from a larger existing dataset. However, existing distillation methods primarily adopt a model-based paradigm, where the synthetic dataset inherits model-specific biases, limiting its generalizability to alternative models. In response to this constraint, we propose a novel methodology termed "model pool". This approach involves selecting models from a diverse model pool based on a specific probability distribution during the data distillation process. Additionally, we integrate our model pool with the established knowledge distillation approach and apply knowledge distillation to the test process of the distilled dataset. Our experimental results validate the effectiveness of the model pool approach across a range of existing models while testing, demonstrating superior performance compared to existing methodologies.  ( 2 min )
    A Microstructure-based Graph Neural Network for Accelerating Multiscale Simulations
    arXiv:2402.13101v1 Announce Type: new Abstract: Simulating the mechanical response of advanced materials can be done more accurately using concurrent multiscale models than with single-scale simulations. However, the computational costs stand in the way of the practical application of this approach. The costs originate from microscale Finite Element (FE) models that must be solved at every macroscopic integration point. A plethora of surrogate modeling strategies attempt to alleviate this cost by learning to predict macroscopic stresses from macroscopic strains, completely replacing the microscale models. In this work, we introduce an alternative surrogate modeling strategy that allows for keeping the multiscale nature of the problem, allowing it to be used interchangeably with an FE solver for any time step. Our surrogate provides all microscopic quantities, which are then homogenized to obtain macroscopic quantities of interest. We achieve this for an elasto-plastic material by predicting full-field microscopic strains using a graph neural network (GNN) while retaining the microscopic constitutive material model to obtain the stresses. This hybrid data-physics graph-based approach avoids the high dimensionality originating from predicting full-field responses while allowing non-locality to arise. By training the GNN on a variety of meshes, it learns to generalize to unseen meshes, allowing a single model to be used for a range of microstructures. The embedded microscopic constitutive model in the GNN implicitly tracks history-dependent variables and leads to improved accuracy. We demonstrate for several challenging scenarios that the surrogate can predict complex macroscopic stress-strain paths. As the computation time of our method scales favorably with the number of elements in the microstructure compared to the FE method, our method can significantly accelerate FE2 simulations.  ( 3 min )
    TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
    arXiv:2402.12991v1 Announce Type: new Abstract: Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.  ( 2 min )
    Towards Robust Graph Incremental Learning on Evolving Graphs
    arXiv:2402.12987v1 Announce Type: new Abstract: Incremental learning is a machine learning approach that involves training a model on a sequence of tasks, rather than all tasks at once. This ability to learn incrementally from a stream of tasks is crucial for many real-world applications. However, incremental learning is a challenging problem on graph-structured data, as many graph-related problems involve prediction tasks for each individual node, known as Node-wise Graph Incremental Learning (NGIL). This introduces non-independent and non-identically distributed characteristics in the sample data generation process, making it difficult to maintain the performance of the model as new tasks are added. In this paper, we focus on the inductive NGIL problem, which accounts for the evolution of graph structure (structural shift) induced by emerging tasks. We provide a formal formulation and analysis of the problem, and propose a novel regularization-based technique called Structural-Shift-Risk-Mitigation (SSRM) to mitigate the impact of the structural shift on catastrophic forgetting of the inductive NGIL problem. We show that the structural shift can lead to a shift in the input distribution for the existing tasks, and further lead to an increased risk of catastrophic forgetting. Through comprehensive empirical studies with several benchmark datasets, we demonstrate that our proposed method, Structural-Shift-Risk-Mitigation (SSRM), is flexible and easy to adapt to improve the performance of state-of-the-art GNN incremental learning frameworks in the inductive setting.  ( 2 min )
    From Movements to Metrics: Evaluating Explainable AI Methods in Skeleton-Based Human Activity Recognition
    arXiv:2402.12790v1 Announce Type: new Abstract: The advancement of deep learning in human activity recognition (HAR) using 3D skeleton data is critical for applications in healthcare, security, sports, and human-computer interaction. This paper tackles a well-known gap in the field, which is the lack of testing in the applicability and reliability of XAI evaluation metrics in the skeleton-based HAR domain. We have tested established XAI metrics namely faithfulness and stability on Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) to address this problem. The study also introduces a perturbation method that respects human biomechanical constraints to ensure realistic variations in human movement. Our findings indicate that \textit{faithfulness} may not be a reliable metric in certain contexts, such as with the EfficientGCN model. Conversely, stability emerges as a more dependable metric when there is slight input data perturbations. CAM and Grad-CAM are also found to produce almost identical explanations, leading to very similar XAI metric performance. This calls for the need for more diversified metrics and new XAI methods applied in skeleton-based HAR.  ( 2 min )
    CCFC++: Enhancing Federated Clustering through Feature Decorrelation
    arXiv:2402.12852v1 Announce Type: new Abstract: In federated clustering, multiple data-holding clients collaboratively group data without exchanging raw data. This field has seen notable advancements through its marriage with contrastive learning, exemplified by Cluster-Contrastive Federated Clustering (CCFC). However, CCFC suffers from heterogeneous data across clients, leading to poor and unrobust performance. Our study conducts both empirical and theoretical analyses to understand the impact of heterogeneous data on CCFC. Findings indicate that increased data heterogeneity exacerbates dimensional collapse in CCFC, evidenced by increased correlations across multiple dimensions of the learned representations. To address this, we introduce a decorrelation regularizer to CCFC. Benefiting from the regularizer, the improved method effectively mitigates the detrimental effects of data heterogeneity, and achieves superior performance, as evidenced by a marked increase in NMI scores, with the gain reaching as high as 0.32 in the most pronounced case.  ( 2 min )
    GRAPHGINI: Fostering Individual and Group Fairness in Graph Neural Networks
    arXiv:2402.12937v1 Announce Type: new Abstract: We address the growing apprehension that GNNs, in the absence of fairness constraints, might produce biased decisions that disproportionately affect underprivileged groups or individuals. Departing from previous work, we introduce for the first time a method for incorporating the Gini coefficient as a measure of fairness to be used within the GNN framework. Our proposal, GRAPHGINI, works with the two different goals of individual and group fairness in a single system, while maintaining high prediction accuracy. GRAPHGINI enforces individual fairness through learnable attention scores that help in aggregating more information through similar nodes. A heuristic-based maximum Nash social welfare constraint ensures the maximum possible group fairness. Both the individual fairness constraint and the group fairness constraint are stated in terms of a differentiable approximation of the Gini coefficient. This approximation is a contribution that is likely to be of interest even beyond the scope of the problem studied in this paper. Unlike other state-of-the-art, GRAPHGINI automatically balances all three optimization objectives (utility, individual, and group fairness) of the GNN and is free from any manual tuning of weight parameters. Extensive experimentation on real-world datasets showcases the efficacy of GRAPHGINI in making significant improvements in individual fairness compared to all currently available state-of-the-art methods while maintaining utility and group equality.  ( 2 min )
    Bounding Reconstruction Attack Success of Adversaries Without Data Priors
    arXiv:2402.12861v1 Announce Type: new Abstract: Reconstruction attacks on machine learning (ML) models pose a strong risk of leakage of sensitive data. In specific contexts, an adversary can (almost) perfectly reconstruct training data samples from a trained model using the model's gradients. When training ML models with differential privacy (DP), formal upper bounds on the success of such reconstruction attacks can be provided. So far, these bounds have been formulated under worst-case assumptions that might not hold high realistic practicality. In this work, we provide formal upper bounds on reconstruction success under realistic adversarial settings against ML models trained with DP and support these bounds with empirical results. With this, we show that in realistic scenarios, (a) the expected reconstruction success can be bounded appropriately in different contexts and by different metrics, which (b) allows for a more educated choice of a privacy parameter.  ( 2 min )
    Vehicle-group-based Crash Risk Formation and Propagation Analysis for Expressways
    arXiv:2402.12415v1 Announce Type: new Abstract: Previous studies in predicting crash risk primarily associated the number or likelihood of crashes on a road segment with traffic parameters or geometric characteristics of the segment, usually neglecting the impact of vehicles' continuous movement and interactions with nearby vehicles. Advancements in communication technologies have empowered driving information collected from surrounding vehicles, enabling the study of group-based crash risks. Based on high-resolution vehicle trajectory data, this research focused on vehicle groups as the subject of analysis and explored risk formation and propagation mechanisms considering features of vehicle groups and road segments. Several key factors contributing to crash risks were identified, including past high-risk vehicle-group states, complex vehicle behaviors, high percentage of large vehicles, frequent lane changes within a vehicle group, and specific road geometries. A multinomial logistic regression model was developed to analyze the spatial risk propagation patterns, which were classified based on the trend of high-risk occurrences within vehicle groups. The results indicated that extended periods of high-risk states, increase in vehicle-group size, and frequent lane changes are associated with adverse risk propagation patterns. Conversely, smoother traffic flow and high initial crash risk values are linked to risk dissipation. Furthermore, the study conducted sensitivity analysis on different types of classifiers, prediction time intervalsss and adaptive TTC thresholds. The highest AUC value for vehicle-group risk prediction surpassed 0.93. The findings provide valuable insights to researchers and practitioners in understanding and prediction of vehicle-group safety, ultimately improving active traffic safety management and operations of Connected and Autonomous Vehicles.  ( 3 min )
    Bellman Optimal Stepsize Straightening of Flow-Matching Models
    arXiv:2312.16414v3 Announce Type: replace-cross Abstract: Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the finetuning process and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces Bellman Optimal Stepsize Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically for a few-step efficient image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the stepsizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints. Our code can be found at https://github.com/nguyenngocbaocmt02/BOSS.  ( 2 min )
    Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities
    arXiv:2312.15006v2 Announce Type: replace-cross Abstract: This study critically evaluates the efficacy of prompting methods in enhancing the mathematical reasoning capability of large language models (LLMs). The investigation uses three prescriptive prompting methods - simple, persona, and conversational prompting - known for their effectiveness in enhancing the linguistic tasks of LLMs. We conduct this analysis on OpenAI's LLM chatbot, ChatGPT-3.5, on extensive problem sets from the MATH, GSM8K, and MMLU datasets, encompassing a broad spectrum of mathematical challenges. A grading script adapted to each dataset is used to determine the effectiveness of these prompting interventions in enhancing the model's mathematical analysis power. Contrary to expectations, our empirical analysis reveals that none of the investigated methods consistently improves over ChatGPT-3.5's baseline performance, with some causing significant degradation. Our findings suggest that prompting strategies do not necessarily generalize to new domains, in this study failing to enhance mathematical performance.  ( 2 min )
    Dynamic Retrieval-Augmented Generation
    arXiv:2312.08976v2 Announce Type: replace-cross Abstract: Current state-of-the-art large language models are effective in generating high-quality text and encapsulating a broad spectrum of world knowledge. These models, however, often hallucinate and lack locally relevant factual data. Retrieval-augmented approaches were introduced to overcome these problems and provide more accurate responses. Typically, the retrieved information is simply appended to the main request, restricting the context window size of the model. We propose a novel approach for the Dynamic Retrieval-Augmented Generation (DRAG), based on the entity-augmented generation, which injects compressed embeddings of the retrieved entities into the generative model. The proposed pipeline was developed for code-generation tasks, yet can be transferred to some domains of natural language processing. To train the model, we collect and publish a new project-level code generation dataset. We use it for the evaluation along with publicly available datasets. Our approach achieves several targets: (1) lifting the length limitations of the context window, saving on the prompt size; (2) allowing huge expansion of the number of retrieval entities available for the context; (3) alleviating the problem of misspelling or failing to find relevant entity names. This allows the model to beat all baselines (except GPT-3.5) with a strong margin.  ( 2 min )
    Prompt Engineering a Prompt Engineer
    arXiv:2311.05661v2 Announce Type: replace-cross Abstract: Prompt engineering is a challenging yet crucial task for optimizing the performance of large language models on customized tasks. It requires complex reasoning to examine the model's errors, hypothesize what is missing or misleading in the current prompt, and communicate the task with clarity. While recent works indicate that large language models can be meta-prompted to perform automatic prompt engineering, we argue that their potential is limited due to insufficient guidance for complex reasoning in the meta-prompt. We fill this gap by infusing into the meta-prompt three key components: detailed descriptions, context specification, and a step-by-step reasoning template. The resulting method, named PE2, showcases remarkable versatility across diverse language tasks. It finds prompts that outperform "let's think step by step" by 6.3% on MultiArith and 3.1% on GSM8K, and outperforms competitive baselines on counterfactual tasks by 6.9%. Further, we show that PE2 can make targeted prompt edits, rectify erroneous prompts, and induce multi-step plans for complex tasks.  ( 2 min )
    Procedural Fairness Through Decoupling Objectionable Data Generating Components
    arXiv:2311.14688v2 Announce Type: replace-cross Abstract: We reveal and address the frequently overlooked yet important issue of disguised procedural unfairness, namely, the potentially inadvertent alterations on the behavior of neutral (i.e., not problematic) aspects of data generating process, and/or the lack of procedural assurance of the greatest benefit of the least advantaged individuals. Inspired by John Rawls's advocacy for pure procedural justice, we view automated decision-making as a microcosm of social institutions, and consider how the data generating process itself can satisfy the requirements of procedural fairness. We propose a framework that decouples the objectionable data generating components from the neutral ones by utilizing reference points and the associated value instantiation rule. Our findings highlight the necessity of preventing disguised procedural unfairness, drawing attention not only to the objectionable data generating components that we aim to mitigate, but also more importantly, to the neutral components that we intend to keep unaffected.  ( 2 min )
    Emergence of Grid-like Representations by Training Recurrent Networks with Conformal Normalization
    arXiv:2310.19192v2 Announce Type: replace-cross Abstract: Grid cells in the entorhinal cortex of mammalian brains exhibit striking hexagon grid firing patterns in their response maps as the animal (e.g., a rat) navigates in a 2D open environment. In this paper, we study the emergence of the hexagon grid patterns of grid cells based on a general recurrent neural network (RNN) model that captures the navigation process. The responses of grid cells collectively form a high dimensional vector, representing the 2D self-position of the agent. As the agent moves, the vector is transformed by an RNN that takes the velocity of the agent as input. We propose a simple yet general conformal normalization of the input velocity of the RNN, so that the local displacement of the position vector in the high-dimensional neural space is proportional to the local displacement of the agent in the 2D physical space, regardless of the direction of the input velocity. We apply this mechanism to both a linear RNN and nonlinear RNNs. Theoretically, we provide an understanding that explains the connection between conformal normalization and the emergence of hexagon grid patterns. Empirically, we conduct extensive experiments to verify that conformal normalization is crucial for the emergence of hexagon grid patterns, across various types of RNNs. The learned patterns share similar profiles to biological grid cells, and the topological properties of the patterns also align with our theoretical understanding.  ( 3 min )
    Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game
    arXiv:2310.18940v3 Announce Type: replace-cross Abstract: Agents built with large language models (LLMs) have shown great potential across a wide range of domains. However, in complex decision-making tasks, pure LLM-based agents tend to exhibit intrinsic bias in their choice of actions, which is inherited from the model's training data and results in suboptimal performance. To develop strategic language agents, i.e., agents that generate flexible language actions and possess strong decision-making abilities, we propose a novel framework that powers LLM-based agents with reinforcement learning (RL). We consider Werewolf, a popular social deduction game, as a challenging testbed that emphasizes versatile communication and strategic gameplay. To mitigate the intrinsic bias in language actions, our agents use an LLM to perform deductive reasoning and generate a diverse set of action candidates. Then an RL policy trained to optimize the decision-making ability chooses an action from the candidates to play in the game. Extensive experiments show that our agents overcome the intrinsic bias and outperform existing LLM-based agents in the Werewolf game. We also conduct human-agent experiments and find that our agents achieve human-level performance and demonstrate strong strategic play.  ( 2 min )
    Dealing with uncertainty: balancing exploration and exploitation in deep recurrent reinforcement learning
    arXiv:2310.08331v2 Announce Type: replace-cross Abstract: Incomplete knowledge of the environment leads an agent to make decisions under uncertainty. One of the major dilemmas in Reinforcement Learning (RL) where an autonomous agent has to balance two contrasting needs in making its decisions is: exploiting the current knowledge of the environment to maximize the cumulative reward as well as exploring actions that allow improving the knowledge of the environment, hopefully leading to higher reward values (exploration-exploitation trade-off). Concurrently, another relevant issue regards the full observability of the states, which may not be assumed in all applications. For instance, when 2D images are considered as input in an RL approach used for finding the best actions within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques to balance exploration and exploitation trade-off on partially observable systems for predicting steering wheels in autonomous driving scenarios. More precisely, the final aim is to investigate the effects of using both adaptive and deterministic exploration strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted and evaluated the impact of a modified quadratic loss function to improve the learning phase of the underlying Convolutional Recurrent Neural Network. We show that adaptive methods better approximate the trade-off between exploration and exploitation and, in general, Softmax and Max-Boltzmann strategies outperform epsilon-greedy techniques.  ( 3 min )
    Contextual Directed Acyclic Graphs
    arXiv:2310.15627v2 Announce Type: replace-cross Abstract: Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a significant challenge in machine learning. Most research in this area concentrates on learning a single DAG for the entire population. This paper considers an alternative setting where the graph structure varies across individuals based on available "contextual" features. We tackle this contextual DAG problem via a neural network that maps the contextual features to a DAG, represented as a weighted adjacency matrix. The neural network is equipped with a novel projection layer that ensures the output matrices are sparse and satisfy a recently developed characterization of acyclicity. We devise a scalable computational framework for learning contextual DAGs and provide a convergence guarantee and an analytical gradient for backpropagating through the projection layer. Our experiments suggest that the new approach can recover the true context-specific graph where existing approaches fail.  ( 2 min )
    FLAIM: AIM-based Synthetic Data Generation in the Federated Setting
    arXiv:2310.03447v2 Announce Type: replace-cross Abstract: Preserving individual privacy while enabling collaborative data sharing is crucial for organizations. Synthetic data generation is one solution, producing artificial data that mirrors the statistical properties of private data. While numerous techniques have been devised under differential privacy, they predominantly assume data is centralized. However, data is often distributed across multiple clients in a federated manner. In this work, we initiate the study of federated synthetic tabular data generation. Building upon a SOTA central method known as AIM, we present DistAIM and FLAIM. We first show that it is straightforward to distribute AIM, extending a recent approach based on secure multi-party computation which necessitates additional overhead, making it less suited to federated scenarios. We then demonstrate that naively federating AIM can lead to substantial degradation in utility under the presence of heterogeneity. To mitigate both issues, we propose an augmented FLAIM approach that maintains a private proxy of heterogeneity. We simulate our methods across a range of benchmark datasets under different degrees of heterogeneity and show we can improve utility while reducing overhead.  ( 2 min )
    PMET: Precise Model Editing in a Transformer
    arXiv:2308.08742v5 Announce Type: replace-cross Abstract: Model editing techniques modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost, which have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use it to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contains information not specifically required for FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of FFN to precisely update FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that the MHSA encodes certain general knowledge extraction patterns and indicating its storage of a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.  ( 3 min )
    Self-concordant Smoothing for Large-Scale Convex Composite Optimization
    arXiv:2309.01781v2 Announce Type: replace-cross Abstract: We introduce a notion of self-concordant smoothing for minimizing the sum of two convex functions, one of which is smooth and the other may be nonsmooth. The key highlight of our approach is in a natural property of the resulting problem's structure which provides us with a variable-metric selection method and a step-length selection rule particularly suitable for proximal Newton-type algorithms. In addition, we efficiently handle specific structures promoted by the nonsmooth function, such as $\ell_1$-regularization and group-lasso penalties. We prove the convergence of two resulting algorithms: Prox-N-SCORE, a proximal Newton algorithm and Prox-GGN-SCORE, a proximal generalized Gauss-Newton algorithm. The Prox-GGN-SCORE algorithm highlights an important approximation procedure which helps to significantly reduce most of the computational overhead associated with the inverse Hessian. This approximation is essentially useful for overparameterized machine learning models and in the mini-batch settings. Numerical examples on both synthetic and real datasets demonstrate the efficiency of our approach and its superiority over existing approaches. A Julia package implementing the proposed algorithms is available at https://github.com/adeyemiadeoye/SelfConcordantSmoothOptimization.jl.  ( 2 min )
    Kernel Single Proxy Control for Deterministic Confounding
    arXiv:2308.04585v3 Announce Type: replace-cross Abstract: We consider the problem of causal effect estimation with an unobserved confounder, where we observe a proxy variable that is associated with the confounder. Although Proxy causal learning (PCL) uses two proxy variables to recover the true causal effect, we show that a single proxy variable is sufficient for causal estimation if the outcome is generated deterministically, generalizing Control Outcome Calibration Approach (COCA). We propose two kernel-based methods for this setting: the first based on the two-stage regression approach, and the second based on a maximum moment restriction approach. We prove that both approaches can consistently estimate the causal effect, and we empirically demonstrate that we can successfully recover the causal effect on challenging synthetic benchmarks.  ( 2 min )
    DyPP: Dynamic Parameter Prediction to Accelerate Convergence of Variational Quantum Algorithms
    arXiv:2307.12449v3 Announce Type: replace-cross Abstract: The exponential run time of quantum simulators on classical machines and long queue times and high costs of real quantum devices present significant challenges in the efficient optimization of Variational Quantum Algorithms (VQAs) like Variational Quantum Eigensolver (VQE), Quantum Approximate Optimization Algorithm (QAOA) and Quantum Neural Networks (QNNs). To address these limitations, we propose a new approach, DyPP (Dynamic Parameter Prediction), which accelerates the convergence of VQAs by exploiting regular trends in the parameter weights to update parameters. We introduce two techniques for optimal prediction performance namely, Naive Prediction (NaP) and Adaptive Prediction (AdaP). Through extensive experimentation and training of multiple QNN models on various datasets, we demonstrate that DyPP offers a speedup of approximately $2.25\times$ compared to standard training methods, while also providing improved accuracy (up to $2.3\%$ higher) and loss (up to $6.1\%$ lower) with low storage and computational overheads. We also evaluate DyPP's effectiveness in VQE for molecular ground-state energy estimation and in QAOA for graph MaxCut. Our results show that on average, DyPP leads to speedup of up to $3.1\times$ for VQE and $2.91\times$ for QAOA, compared to traditional optimization techniques, while using up to $3.3\times$ lesser shots (i.e., repeated circuit executions). Even under hardware noise, DyPP outperforms existing optimization techniques, delivering upto $3.33\times$ speedup and $2.5\times$ fewer shots, thereby enhancing efficiency of VQAs.  ( 3 min )
    Underwater Acoustic Target Recognition based on Smoothness-inducing Regularization and Spectrogram-based Data Augmentation
    arXiv:2306.06945v2 Announce Type: replace-cross Abstract: Underwater acoustic target recognition is a challenging task owing to the intricate underwater environments and limited data availability. Insufficient data can hinder the ability of recognition systems to support complex modeling, thus impeding their advancement. To improve the generalization capacity of recognition models, techniques such as data augmentation have been employed to simulate underwater signals and diversify data distribution. However, the complexity of underwater environments can cause the simulated signals to deviate from real scenarios, resulting in biased models that are misguided by non-true data. In this study, we propose two strategies to enhance the generalization ability of models in the case of limited data while avoiding the risk of performance degradation. First, as an alternative to traditional data augmentation, we utilize smoothness-inducing regularization, which only incorporates simulated signals in the regularization term. Additionally, we propose a specialized spectrogram-based data augmentation strategy, namely local masking and replicating (LMR), to capture inter-class relationships. Our experiments and visualization analysis demonstrate the superiority of our proposed strategies.  ( 2 min )
    Choice Models and Permutation Invariance: Demand Estimation in Differentiated Products Markets
    arXiv:2307.07090v2 Announce Type: replace-cross Abstract: Choice modeling is at the core of understanding how changes to the competitive landscape affect consumer choices and reshape market equilibria. In this paper, we propose a fundamental characterization of choice functions that encompasses a wide variety of extant choice models. We demonstrate how non-parametric estimators like neural nets can easily approximate such functionals and overcome the curse of dimensionality that is inherent in the non-parametric estimation of choice functions. We demonstrate through extensive simulations that our proposed functionals can flexibly capture underlying consumer behavior in a completely data-driven fashion and outperform traditional parametric models. As demand settings often exhibit endogenous features, we extend our framework to incorporate estimation under endogenous features. Further, we also describe a formal inference procedure to construct valid confidence intervals on objects of interest like price elasticity. Finally, to assess the practical applicability of our estimator, we utilize a real-world dataset from S. Berry, Levinsohn, and Pakes (1995). Our empirical analysis confirms that the estimator generates realistic and comparable own- and cross-price elasticities that are consistent with the observations reported in the existing literature.  ( 2 min )
    A Review on Knowledge Graphs for Healthcare: Resources, Applications, and Promises
    arXiv:2306.04802v3 Announce Type: replace-cross Abstract: Healthcare knowledge graphs (HKGs) are valuable tools for organizing biomedical concepts and their relationships with interpretable structures. The recent advent of large language models (LLMs) has paved the way for building more comprehensive and accurate HKGs. This, in turn, can improve the reliability of generated content and enable better evaluation of LLMs. However, the challenges of HKGs such as regarding data heterogeneity and limited coverage are not fully understood, highlighting the need for detailed reviews. This work provides the first comprehensive review of HKGs. It summarizes the pipeline and key techniques for HKG construction, as well as the common utilization approaches, i.e., model-free and model-based. The existing HKG resources are also organized based on the data types they capture and application domains they cover, along with relevant statistical information (Resource available at https://github.com/lujiaying/Awesome-HealthCare-KnowledgeBase). At the application level, we delve into the successful integration of HKGs across various health domains, ranging from fine-grained basic science research to high-level clinical decision support and public health. Lastly, the paper highlights the opportunities for HKGs in the era of LLMs. This work aims to serve as a valuable resource for understanding the potential and opportunities of HKG in health research.  ( 3 min )
    Incentivizing Exploration with Linear Contexts and Combinatorial Actions
    arXiv:2306.01990v2 Announce Type: replace-cross Abstract: We advance the study of incentivized bandit exploration, in which arm choices are viewed as recommendations and are required to be Bayesian incentive compatible. Recent work has shown under certain independence assumptions that after collecting enough initial samples, the popular Thompson sampling algorithm becomes incentive compatible. We give an analog of this result for linear bandits, where the independence of the prior is replaced by a natural convexity condition. This opens up the possibility of efficient and regret-optimal incentivized exploration in high-dimensional action spaces. In the semibandit model, we also improve the sample complexity for the pre-Thompson sampling phase of initial data collection.  ( 2 min )
    Surface EMG-Based Inter-Session/Inter-Subject Gesture Recognition by Leveraging Lightweight All-ConvNet and Transfer Learning
    arXiv:2305.08014v3 Announce Type: replace-cross Abstract: Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, the data variability between inter-session and inter-subject scenarios presents a great challenge. The existing approaches employed very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate the distribution shift caused by these inter-session and inter-subject data variability. Hence, these methods also require learning over millions of training parameters and a large pre-trained and target domain dataset in both the pre-training and adaptation stages. As a result, it makes high-end resource-bounded and computationally very expensive for deployment in real-time applications. To overcome this problem, we propose a lightweight All-ConvNet+TL model that leverages lightweight All-ConvNet and transfer learning (TL) for the enhancement of inter-session and inter-subject gesture recognition performance. The All-ConvNet+TL model consists solely of convolutional layers, a simple yet efficient framework for learning invariant and discriminative representations to address the distribution shifts caused by inter-session and inter-subject data variability. Experiments on four datasets demonstrate that our proposed methods outperform the most complex existing approaches by a large margin and achieve state-of-the-art results on inter-session and inter-subject scenarios and perform on par or competitively on intra-session gesture recognition. These performance gaps increase even more when a tiny amount (e.g., a single trial) of data is available on the target domain for adaptation. These outstanding experimental results provide evidence that the current state-of-the-art models may be overparameterized for sEMG-based inter-session and inter-subject gesture recognition tasks.  ( 3 min )
    Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
    arXiv:2305.14456v3 Announce Type: replace-cross Abstract: As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations. Using CAMeL, we examine the cross-cultural performance in Arabic of 12 different LMs on tasks such as story generation, NER, and sentiment analysis, where we find concerning cases of stereotyping and cultural unfairness. We further test their text-infilling performance, revealing the incapability of appropriate adaptation to Arab cultural contexts. Finally, we analyze 6 Arabic pre-training corpora and find that commonly used sources such as Wikipedia may not be best suited to build culturally aware LMs, if used as they are without adjustment. We will make CAMeL publicly available at: https://github.com/tareknaous/camel  ( 2 min )
    A multimodal dynamical variational autoencoder for audiovisual speech representation learning
    arXiv:2305.03582v3 Announce Type: replace-cross Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.  ( 3 min )
    Stabilizing the Maximal Entropy Moment Method for Rarefied Gas Dynamics at Single-Precision
    arXiv:2303.02898v3 Announce Type: replace-cross Abstract: The maximal entropy moment method (MEM) is systematic solution of the challenging problem: generating extended hydrodynamic equations valid for both dense and rarefied gases. However, simulating MEM suffers from a computational expensive and ill-conditioned maximal entropy problem. It causes numerical overflow and breakdown when the numerical precision is insufficient, especially for flows like high-speed shock waves. It also prevents modern GPUs from accelerating MEM with their enormous single floating-point precision computation power. This paper aims to stabilize MEM, making it possible to simulating very strong normal shock waves on modern GPUs at single precision. We improve the condition number of the maximal entropy problem by proposing gauge transformations, which moves not only flow fields but also hydrodynamic equations into a more optimal coordinate system. We addressed numerical overflow and breakdown in the maximal entropy problem by employing the canonical form of distribution and a modified Newton optimization method. Moreover, we discovered a counter-intuitive phenomenon that over-refined spatial mesh beyond mean free path degrades the stability of MEM. With these techniques, we accomplished single-precision GPU simulations of high speed shock wave up to Mach 10 utilizing 35 moments MEM, while previous methods only achieved Mach 4 on double-precision.  ( 3 min )
    Model Stitching and Visualization How GAN Generators can Invert Networks in Real-Time
    arXiv:2302.02181v2 Announce Type: replace-cross Abstract: In this work, we propose a fast and accurate method to reconstruct activations of classification and semantic segmentation networks by stitching them with a GAN generator utilizing a 1x1 convolution. We test our approach on images of animals from the AFHQ wild dataset, ImageNet1K, and real-world digital pathology scans of stained tissue samples. Our results show comparable performance to established gradient descent methods but with a processing time that is two orders of magnitude faster, making this approach promising for practical applications.  ( 2 min )
    Learning not to Regret
    arXiv:2303.01074v2 Announce Type: replace-cross Abstract: The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical games, such as playing poker with different public cards or trading correlated assets on the stock market. As these similar games feature similar equilibra, we investigate a way to accelerate equilibrium finding on such a distribution. We present a novel "learning not to regret" framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. We validated our algorithms' faster convergence on a distribution of river poker games. Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements.  ( 2 min )
    Graph Filters for Signal Processing and Machine Learning on Graphs
    arXiv:2211.08854v2 Announce Type: replace-cross Abstract: Filters are fundamental in extracting information from data. For time series and image data that reside on Euclidean domains, filters are the crux of many signal processing and machine learning techniques, including convolutional neural networks. Increasingly, modern data also reside on networks and other irregular domains whose structure is better captured by a graph. To process and learn from such data, graph filters account for the structure of the underlying data domain. In this article, we provide a comprehensive overview of graph filters, including the different filtering categories, design strategies for each type, and trade-offs between different types of graph filters. We discuss how to extend graph filters into filter banks and graph neural networks to enhance the representational power; that is, to model a broader variety of signal classes, data patterns, and relationships. We also showcase the fundamental role of graph filters in signal processing and machine learning applications. Our aim is that this article provides a unifying framework for both beginner and experienced researchers, as well as a common understanding that promotes collaborations at the intersections of signal processing, machine learning, and application domains.  ( 2 min )
    Quantum Vision Transformers
    arXiv:2209.08167v2 Announce Type: replace-cross Abstract: In this work, quantum transformers are designed and analysed in detail by extending the state-of-the-art classical transformer neural network architectures known to be very performant in natural language processing and image analysis. Building upon the previous work, which uses parametrised quantum circuits for data loading and orthogonal neural layers, we introduce three types of quantum transformers for training and inference, including a quantum transformer based on compound matrices, which guarantees a theoretical advantage of the quantum attention mechanism compared to their classical counterpart both in terms of asymptotic run time and the number of model parameters. These quantum architectures can be built using shallow quantum circuits and produce qualitatively different classification models. The three proposed quantum attention layers vary on the spectrum between closely following the classical transformers and exhibiting more quantum characteristics. As building blocks of the quantum transformer, we propose a novel method for loading a matrix as quantum states as well as two new trainable quantum orthogonal layers adaptable to different levels of connectivity and quality of quantum computers. We performed extensive simulations of the quantum transformers on standard medical image datasets that showed competitively, and at times better performance compared to the classical benchmarks, including the best-in-class classical vision transformers. The quantum transformers we trained on these small-scale datasets require fewer parameters compared to standard classical benchmarks. Finally, we implemented our quantum transformers on superconducting quantum computers and obtained encouraging results for up to six qubit experiments.  ( 3 min )
    Preemptive Motion Planning for Human-to-Robot Indirect Placement Handovers
    arXiv:2203.00156v3 Announce Type: replace-cross Abstract: As technology advances, the need for safe, efficient, and collaborative human-robot-teams has become increasingly important. One of the most fundamental collaborative tasks in any setting is the object handover. Human-to-robot handovers can take either of two approaches: (1) direct hand-to-hand or (2) indirect hand-to-placement-to-pick-up. The latter approach ensures minimal contact between the human and robot but can also result in increased idle time due to having to wait for the object to first be placed down on a surface. To minimize such idle time, the robot must preemptively predict the human intent of where the object will be placed. Furthermore, for the robot to preemptively act in any sort of productive manner, predictions and motion planning must occur in real-time. We introduce a novel prediction-planning pipeline that allows the robot to preemptively move towards the human agent's intended placement location using gaze and gestures as model inputs. In this paper, we investigate the performance and drawbacks of our early intent predictor-planner as well as the practical benefits of using such a pipeline through a human-robot case study.  ( 2 min )
    Learning inducing points and uncertainty on molecular data by scalable variational Gaussian processes
    arXiv:2207.07654v3 Announce Type: replace-cross Abstract: Uncertainty control and scalability to large datasets are the two main issues for the deployment of Gaussian process (GP) models within the autonomous machine learning-based prediction pipelines in material science and chemistry. One way to address both of these issues is by introducing the latent inducing point variables and choosing the right approximation for the marginal log-likelihood objective. Here, we empirically show that variational learning of the inducing points in a molecular descriptor space improves the prediction of energies and atomic forces on two molecular dynamics datasets. First, we show that variational GPs can learn to represent the configurations of the molecules of different types that were not present within the initialization set of configurations. We provide a comparison of alternative log-likelihood training objectives and variational distributions. Among several evaluated approximate marginal log-likelihood objectives, we show that predictive log-likelihood provides excellent uncertainty estimates at the slight expense of predictive quality. Furthermore, we extend our study to a large molecular crystal system, showing that variational GP models perform well for predicting atomic forces by efficiently learning a sparse representation of the dataset.  ( 3 min )
    Simplicial Convolutional Filters
    arXiv:2201.11720v3 Announce Type: replace-cross Abstract: We study linear filters for processing signals supported on abstract topological spaces modeled as simplicial complexes, which may be interpreted as generalizations of graphs that account for nodes, edges, triangular faces etc. To process such signals, we develop simplicial convolutional filters defined as matrix polynomials of the lower and upper Hodge Laplacians. First, we study the properties of these filters and show that they are linear and shift-invariant, as well as permutation and orientation equivariant. These filters can also be implemented in a distributed fashion with a low computational complexity, as they involve only (multiple rounds of) simplicial shifting between upper and lower adjacent simplices. Second, focusing on edge-flows, we study the frequency responses of these filters and examine how we can use the Hodge-decomposition to delineate gradient, curl and harmonic frequencies. We discuss how these frequencies correspond to the lower- and the upper-adjacent couplings and the kernel of the Hodge Laplacian, respectively, and can be tuned independently by our filter designs. Third, we study different procedures for designing simplicial convolutional filters and discuss their relative advantages. Finally, we corroborate our simplicial filters in several applications: to extract different frequency components of a simplicial signal, to denoise edge flows, and to analyze financial markets and traffic networks.  ( 3 min )
    Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning
    arXiv:2109.03445v5 Announce Type: replace-cross Abstract: Ever since its introduction in the classic paper of Robbins and Monro in 1951, Stochastic Approximation (SA) has become a standard tool for finding a solution of an equation of the form $f(\theta) = 0$, when only noisy measurements of $f(\cdot)$ are available. In most situations, \textit{every component} of the putative solution $\theta_t$ is updated at each step $t$. In some applications such as $Q$-learning, a key technique in Reinforcement Learning (RL), \textit{only one component} of $\theta_t$ is updated at each $t$. This is known as \textbf{asynchronous} SA. The topic of study in the present paper is to study \textbf{Block Asynchronous SA (BASA)}, in which, at each step $t$, \textit{some but not necessarily all} components of $\theta_t$ are updated. The theory presented here embraces both conventional (synchronous) SA as well as asynchronous SA, and all in-between possibilities. We also prove bounds on the \textit{rate} of convergence of $\theta_t$ to the solutions. As a prelude to the new results, we also briefly survey some results on the convergence of the Stochastic Gradient method, proved in a companion paper by the present authors.  ( 3 min )
    Generalization in Kernel Regression Under Realistic Assumptions
    arXiv:2312.15995v2 Announce Type: replace Abstract: It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup. This work aims to provide a unified theory to upper bound the excess risk of kernel regression for nearly all common and realistic settings. Specifically, we provide rigorous bounds that hold for common kernels and for any amount of regularization, noise, any input dimension, and any number of samples. Furthermore, we provide relative perturbation bounds for the eigenvalues of kernel matrices, which may be of independent interest. These reveal a self-regularization phenomenon, whereby a heavy tail in the eigendecomposition of the kernel provides it with an implicit form of regularization, enabling good generalization. When applied to common kernels, our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression. As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.  ( 2 min )
    Practical and Parallelizable Algorithms for Non-Monotone Submodular Maximization with Size Constraint
    arXiv:2009.01947v5 Announce Type: replace-cross Abstract: We present combinatorial and parallelizable algorithms for maximization of a submodular function, not necessarily monotone, with respect to a size constraint. We improve the best approximation factor achieved by an algorithm that has optimal adaptivity and nearly optimal query complexity to $0.193 - \varepsilon$. The conference version of this work mistakenly employed a subroutine that does not work for non-monotone, submodular functions. In this version, we propose a fixed and improved subroutine to add a set with high average marginal gain, ThreshSeq, which returns a solution in $O( \log(n) )$ adaptive rounds with high probability. Moreover, we provide two approximation algorithms. The first has approximation ratio $1/6 - \varepsilon$, adaptivity $O( \log (n) )$, and query complexity $O( n \log (k) )$, while the second has approximation ratio $0.193 - \varepsilon$, adaptivity $O( \log^2 (n) )$, and query complexity $O(n \log (k))$. Our algorithms are empirically validated to use a low number of adaptive rounds and total queries while obtaining solutions with high objective value in comparison with state-of-the-art approximation algorithms, including continuous algorithms that use the multilinear extension.  ( 3 min )
    Best Arm Identification with Fixed Budget: A Large Deviation Perspective
    arXiv:2312.12137v2 Announce Type: replace Abstract: We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is however much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated \sr (Successive Rejects) algorithm \citep{audibert2010best}, and (ii) to devise and analyze new algorithms. In particular, we present \sred (Continuous Rejects), a truly adaptive algorithm that can reject arms in {\it any} round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that \sred enjoys better performance guarantees than existing algorithms, including \sr. Extensive numerical experiments confirm this observation.  ( 3 min )
    Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
    arXiv:2312.11456v3 Announce Type: replace Abstract: This paper studies the theoretical framework of the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiment of large language model demonstrate that these proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connections between solid theoretical foundations and their powerful practical implementations.  ( 2 min )
    Touring sampling with pushforward maps
    arXiv:2311.13845v2 Announce Type: replace Abstract: The number of sampling methods could be daunting for a practitioner looking to cast powerful machine learning methods to their specific problem. This paper takes a theoretical stance to review and organize many sampling approaches in the ``generative modeling'' setting, where one wants to generate new data that are similar to some training examples. By revealing links between existing methods, it might prove useful to overcome some of the current challenges in sampling with diffusion models, such as long inference time due to diffusion simulation, or the lack of diversity in generated samples.  ( 2 min )
    Coupled Confusion Correction: Learning from Crowds with Sparse Annotations
    arXiv:2312.07331v3 Announce Type: replace Abstract: As the size of the datasets getting larger, accurately annotating such datasets is becoming more impractical due to the expensiveness on both time and economy. Therefore, crowd-sourcing has been widely adopted to alleviate the cost of collecting labels, which also inevitably introduces label noise and eventually degrades the performance of the model. To learn from crowd-sourcing annotations, modeling the expertise of each annotator is a common but challenging paradigm, because the annotations collected by crowd-sourcing are usually highly-sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), where two models are simultaneously trained to correct the confusion matrices learned by each other. Via bi-level optimization, the confusion matrices learned by one model can be corrected by the distilled data from the other. Moreover, we cluster the ``annotator groups'' who share similar expertise so that their confusion matrices could be corrected together. In this way, the expertise of the annotators, especially of those who provide seldom labels, could be better captured. Remarkably, we point out that the annotation sparsity not only means the average number of labels is low, but also there are always some annotators who provide very few labels, which is neglected by previous works when constructing synthetic crowd-sourcing annotations. Based on that, we propose to use Beta distribution to control the generation of the crowd-sourcing labels so that the synthetic annotations could be more consistent with the real-world ones. Extensive experiments are conducted on two types of synthetic datasets and three real-world datasets, the results of which demonstrate that CCC significantly outperforms state-of-the-art approaches. Source codes are available at: https://github.com/Hansong-Zhang/CCC.  ( 3 min )
    Add and Thin: Diffusion for Temporal Point Processes
    arXiv:2311.01139v2 Announce Type: replace Abstract: Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.  ( 2 min )
    Self-Supervised Pre-Training for Precipitation Post-Processor
    arXiv:2310.20187v3 Announce Type: replace Abstract: Obtaining a sufficient forecast lead time for local precipitation is essential in preventing hazardous weather events. Global warming-induced climate change increases the challenge of accurately predicting severe precipitation events, such as heavy rainfall. In this paper, we propose a deep learning-based precipitation post-processor for numerical weather prediction (NWP) models. The precipitation post-processor consists of (i) employing self-supervised pre-training, where the parameters of the encoder are pre-trained on the reconstruction of the masked variables of the atmospheric physics domain; and (ii) conducting transfer learning on precipitation segmentation tasks (the target domain) from the pre-trained encoder. In addition, we introduced a heuristic labeling approach to effectively train class-imbalanced datasets. Our experiments on precipitation correction for regional NWP show that the proposed method outperforms other approaches.  ( 2 min )
    Interpretable Prototype-based Graph Information Bottleneck
    arXiv:2310.19906v2 Announce Type: replace Abstract: The success of Graph Neural Networks (GNNs) has led to a need for understanding their decision-making process and providing explanations for their predictions, which has given rise to explainable AI (XAI) that offers transparent explanations for black-box models. Recently, the use of prototypes has successfully improved the explainability of models by learning prototypes to imply training graphs that affect the prediction. However, these approaches tend to provide prototypes with excessive information from the entire graph, leading to the exclusion of key substructures or the inclusion of irrelevant substructures, which can limit both the interpretability and the performance of the model in downstream tasks. In this work, we propose a novel framework of explainable GNNs, called interpretable Prototype-based Graph Information Bottleneck (PGIB) that incorporates prototype learning within the information bottleneck framework to provide prototypes with the key subgraph from the input graph that is important for the model prediction. This is the first work that incorporates prototype learning into the process of identifying the key subgraphs that have a critical impact on the prediction performance. Extensive experiments, including qualitative analysis, demonstrate that PGIB outperforms state-of-the-art methods in terms of both prediction performance and explainability.  ( 2 min )
    Stable Nonconvex-Nonconcave Training via Linear Interpolation
    arXiv:2310.13459v3 Announce Type: replace Abstract: This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method without anchoring to achieve last iterate convergence rates for $\rho$-comonotone problems while only requiring $\rho > -\tfrac{1}{2L}$. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrates the benefits of the linear interpolation present in both RAPP and Lookahead.  ( 2 min )
    LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Anomaly Detection
    arXiv:2310.05668v3 Announce Type: replace Abstract: Most of current anomaly detection models assume that the normal pattern remains same all the time. However, the normal patterns of Web services change dramatically and frequently. The model trained on old-distribution data is outdated after such changes. Retraining the whole model every time is expensive. Besides, at the beginning of normal pattern changes, there is not enough observation data from the new distribution. Retraining a large neural network model with limited data is vulnerable to overfitting. Thus, we propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder based time series anomaly detection methods (VAEs). This work aims to make three novel contributions: 1) the retraining process is formulated as a convex problem and can converge at a fast rate as well as prevent overfitting; 2) designing a ruminate block, which leverages the historical data without the need to store them; 3) mathematically proving that when fine-tuning the latent vector and reconstructed data, the linear formations can achieve the least adjusting errors between the ground truths and the fine-tuned ones. Moreover, we have performed many experiments to verify that retraining LARA with even 43 time slots of data from new distribution can result in its competitive F1 Score in comparison with the state-of-the-art anomaly detection models trained with sufficient data. Besides, we verify its light overhead.  ( 3 min )
    SemiReward: A General Reward Model for Semi-supervised Learning
    arXiv:2310.03013v2 Announce Type: replace Abstract: Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling. The main challenge is how to distinguish high-quality pseudo labels against the confirmation bias. However, existing pseudo-label selection strategies are limited to pre-defined schemes or complex hand-crafted policies specially designed for classification, failing to achieve high-quality labels, fast convergence, and task versatility simultaneously. To these ends, we propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and filter out high-quality pseudo labels, which is pluggable to mainstream SSL methods in wide task types and scenarios. To mitigate confirmation bias, SemiReward is trained online in two stages with a generator model and subsampling strategy. With classification and regression tasks on 13 standard SSL benchmarks across three modalities, extensive experiments verify that SemiReward achieves significant performance gains and faster convergence speeds upon Pseudo Label, FlexMatch, and Free/SoftMatch. Code and models are available at https://github.com/Westlake-AI/SemiReward.  ( 2 min )
    How Graph Neural Networks Learn: Lessons from Training Dynamics
    arXiv:2310.05105v2 Announce Type: replace Abstract: A long-standing goal in deep learning has been to characterize the learning behavior of black-box models in a more interpretable manner. For graph neural networks (GNNs), considerable advances have been made in formalizing what functions they can represent, but whether GNNs will learn desired functions during the optimization process remains less clear. To fill this gap, we study their training dynamics in function space. In particular, we find that the optimization of GNNs through gradient descent implicitly leverages the graph structure to update the learned function. This phenomenon is dubbed as kernel-graph alignment, which has been empirically and theoretically corroborated. This new analytical framework from the optimization perspective enables interpretable explanations of when and why the learned GNN functions generalize, which are relevant to their limitations on heterophilic graphs. From a practical standpoint, it also provides high-level principles for designing new algorithms. We exemplify this by showing that a simple and efficient non-parametric algorithm, obtained by explicitly using graph structure to update the learned function, can consistently compete with nonlinear GNNs.  ( 2 min )
    SWoTTeD: An Extension of Tensor Decomposition to Temporal Phenotyping
    arXiv:2310.01201v2 Announce Type: replace Abstract: Tensor decomposition has recently been gaining attention in the machine learning community for the analysis of individual traces, such as Electronic Health Records (EHR). However, this task becomes significantly more difficult when the data follows complex temporal patterns. This paper introduces the notion of a temporal phenotype as an arrangement of features over time and it proposes SWoTTeD (Sliding Window for Temporal Tensor Decomposition), a novel method to discover hidden temporal patterns. SWoTTeD integrates several constraints and regularizations to enhance the interpretability of the extracted phenotypes. We validate our proposal using both synthetic and real-world datasets, and we present an original usecase using data from the Greater Paris University Hospital. The results show that SWoTTeD achieves at least as accurate reconstruction as recent state-of-the-art tensor decomposition models, and extracts temporal phenotypes that are meaningful for clinicians.  ( 2 min )
    MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training
    arXiv:2310.00967v3 Announce Type: replace Abstract: Gradient sparsification is a communication optimisation technique for scaling and accelerating distributed deep neural network (DNN) training. It reduces the increasing communication traffic for gradient aggregation. However, existing sparsifiers have poor scalability because of the high computational cost of gradient selection and/or increase in communication traffic. In particular, an increase in communication traffic is caused by gradient build-up and inappropriate threshold for gradient selection. To address these challenges, we propose a novel gradient sparsification method called MiCRO. In MiCRO, the gradient vector is partitioned, and each partition is assigned to the corresponding worker. Each worker then selects gradients from its partition, and the aggregated gradients are free from gradient build-up. Moreover, MiCRO estimates the accurate threshold to maintain the communication traffic as per user requirement by minimising the compression ratio error. MiCRO enables near-zero cost gradient sparsification by solving existing problems that hinder the scalability and acceleration of distributed DNN training. In our extensive experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.  ( 2 min )
    Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
    arXiv:2309.09968v3 Announce Type: replace Abstract: Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.  ( 2 min )
    Evidential Deep Learning: Enhancing Predictive Uncertainty Estimation for Earth System Science Applications
    arXiv:2309.13207v2 Announce Type: replace Abstract: Robust quantification of predictive uncertainty is critical for understanding factors that drive weather and climate outcomes. Ensembles provide predictive uncertainty estimates and can be decomposed physically, but both physics and machine learning ensembles are computationally expensive. Parametric deep learning can estimate uncertainty with one model by predicting the parameters of a probability distribution but do not account for epistemic uncertainty.. Evidential deep learning, a technique that extends parametric deep learning to higher-order distributions, can account for both aleatoric and epistemic uncertainty with one model. This study compares the uncertainty derived from evidential neural networks to those obtained from ensembles. Through applications of classification of winter precipitation type and regression of surface layer fluxes, we show evidential deep learning models attaining predictive accuracy rivaling standard methods, while robustly quantifying both sources of uncertainty. We evaluate the uncertainty in terms of how well the predictions are calibrated and how well the uncertainty correlates with prediction error. Analyses of uncertainty in the context of the inputs reveal sensitivities to underlying meteorological processes, facilitating interpretation of the models. The conceptual simplicity, interpretability, and computational efficiency of evidential neural networks make them highly extensible, offering a promising approach for reliable and practical uncertainty quantification in Earth system science modeling. In order to encourage broader adoption of evidential deep learning in Earth System Science, we have developed a new Python package, MILES-GUESS (https://github.com/ai2es/miles-guess), that enables users to train and evaluate both evidential and ensemble deep learning.  ( 3 min )
    A Comprehensive Survey on Deep Learning Techniques in Educational Data Mining
    arXiv:2309.04761v3 Announce Type: replace Abstract: Educational Data Mining (EDM) has emerged as a vital field of research, which harnesses the power of computational techniques to analyze educational data. With the increasing complexity and diversity of educational data, Deep Learning techniques have shown significant advantages in addressing the challenges associated with analyzing and modeling this data. This survey aims to systematically review the state-of-the-art in EDM with Deep Learning. We begin by providing a brief introduction to EDM and Deep Learning, highlighting their relevance in the context of modern education. Next, we present a detailed review of Deep Learning techniques applied in four typical educational scenarios, including knowledge tracing, student behavior detection, performance prediction, and personalized recommendation. Furthermore, a comprehensive overview of public datasets and processing tools for EDM is provided. Finally, we point out emerging trends and future directions in this research area.  ( 2 min )
    Will More Expressive Graph Neural Networks do Better on Generative Tasks?
    arXiv:2308.11978v4 Announce Type: replace Abstract: Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the per- formance of six GNNs in two different generative frameworks -- autoregressive generation models, such as GCPN and GraphAF, and one-shot generation models, such as GraphEBM -- on six different molecular generative objectives on the ZINC-250k dataset. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN, GraphAF, and GraphEBM on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are impor- tant metrics for de-novo molecular design.  ( 3 min )
    Regularization, early-stopping and dreaming: a Hopfield-like setup to address generalization and overfitting
    arXiv:2308.01421v2 Announce Type: replace Abstract: In this work we approach attractor neural networks from a machine learning perspective: we look for optimal network parameters by applying a gradient descent over a regularized loss function. Within this framework, the optimal neuron-interaction matrices turn out to be a class of matrices which correspond to Hebbian kernels revised by a reiterated unlearning protocol. Remarkably, the extent of such unlearning is proved to be related to the regularization hyperparameter of the loss function and to the training time. Thus, we can design strategies to avoid overfitting that are formulated in terms of regularization and early-stopping tuning. The generalization capabilities of these attractor networks are also investigated: analytical results are obtained for random synthetic datasets, next, the emerging picture is corroborated by numerical experiments that highlight the existence of several regimes (i.e., overfitting, failure and success) as the dataset parameters are varied.  ( 2 min )
    Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees
    arXiv:2308.09842v2 Announce Type: replace Abstract: Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called epsilon-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight (with provable probabilistic guarantees) lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs.  ( 2 min )
    The importance of feature preprocessing for differentially private linear optimization
    arXiv:2307.11106v2 Announce Type: replace Abstract: Training machine learning models with differential privacy (DP) has received increasing interest in recent years. One of the most popular algorithms for training differentially private models is differentially private stochastic gradient descent (DPSGD) and its variants, where at each step gradients are clipped and combined with some noise. Given the increasing usage of DPSGD, we ask the question: is DPSGD alone sufficient to find a good minimizer for every dataset under privacy constraints? Towards answering this question, we show that even for the simple case of linear classification, unlike non-private optimization, (private) feature preprocessing is vital for differentially private optimization. In detail, we first show theoretically that there exists an example where without feature preprocessing, DPSGD incurs an optimality gap proportional to the maximum Euclidean norm of features over all samples. We then propose an algorithm called DPSGD-F, which combines DPSGD with feature preprocessing and prove that for classification tasks, it incurs an optimality gap proportional to the diameter of the features $\max_{x, x' \in D} \|x - x'\|_2$. We finally demonstrate the practicality of our algorithm on image classification benchmarks.  ( 2 min )
    TGRL: An Algorithm for Teacher Guided Reinforcement Learning
    arXiv:2307.03186v2 Announce Type: replace Abstract: Learning from rewards (i.e., reinforcement learning or RL) and learning to imitate a teacher (i.e., teacher-student learning) are two established approaches for solving sequential decision-making problems. To combine the benefits of these different forms of learning, it is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives. However, without a principled method to balance these objectives, prior work used heuristics and problem-specific hyperparameter searches to balance the two objectives. We present a $\textit{principled}$ approach, along with an approximate implementation for $\textit{dynamically}$ and $\textit{automatically}$ balancing when to follow the teacher and when to use rewards. The main idea is to adjust the importance of teacher supervision by comparing the agent's performance to the counterfactual scenario of the agent learning without teacher supervision and only from rewards. If using teacher supervision improves performance, the importance of teacher supervision is increased and otherwise it is decreased. Our method, $\textit{Teacher Guided Reinforcement Learning}$ (TGRL), outperforms strong baselines across diverse domains without hyper-parameter tuning.  ( 2 min )
    Federated Learning in the Presence of Adversarial Client Unavailability
    arXiv:2305.19971v2 Announce Type: replace Abstract: Federated learning is a decentralized machine learning framework that enables collaborative model training without revealing raw data. Due to the diverse hardware and software limitations, a client may not always be available for the computation requests from the parameter server. An emerging line of research is devoted to tackling arbitrary client unavailability. However, existing work still imposes structural assumptions on the unavailability patterns, impeding their applicability in challenging scenarios wherein the unavailability patterns are beyond the control of the parameter server. Moreover, in harsh environments like battlefields, adversaries can selectively and adaptively silence specific clients. In this paper, we relax the structural assumptions and consider adversarial client unavailability. To quantify the degrees of client unavailability, we use the notion of $\epsilon$-adversary dropout fraction. We show that simple variants of FedAvg or FedProx, albeit completely agnostic to $\epsilon$, converge to an estimation error on the order of $\epsilon (G^2 + \sigma^2)$ for non-convex global objectives and $\epsilon(G^2 + \sigma^2)/\mu^2$ for $\mu$ strongly convex global objectives, where $G$ is a heterogeneity parameter and $\sigma^2$ is the noise level. Conversely, we prove that any algorithm has to suffer an estimation error of at least $\epsilon (G^2 + \sigma^2)/8$ and $\epsilon(G^2 + \sigma^2)/(8\mu^2)$ for non-convex global objectives and $\mu$-strongly convex global objectives. Furthermore, the convergence speeds of the FedAvg or FedProx variants are $O(1/\sqrt{T})$ for non-convex objectives and $O(1/T)$ for strongly-convex objectives, both of which are the best possible for any first-order method that only has access to noisy gradients.  ( 3 min )
    Hybrid Graph: A Unified Graph Representation with Datasets and Benchmarks for Complex Graphs
    arXiv:2306.05108v2 Announce Type: replace Abstract: Graphs are widely used to encapsulate a variety of data formats, but real-world networks often involve complex node relations beyond only being pairwise. While hypergraphs and hierarchical graphs have been developed and employed to account for the complex node relations, they cannot fully represent these complexities in practice. Additionally, though many Graph Neural Networks (GNNs) have been proposed for representation learning on higher-order graphs, they are usually only evaluated on simple graph datasets. Therefore, there is a need for a unified modelling of higher-order graphs, and a collection of comprehensive datasets with an accessible evaluation framework to fully understand the performance of these algorithms on complex graphs. In this paper, we introduce the concept of hybrid graphs, a unified definition for higher-order graphs, and present the Hybrid Graph Benchmark (HGB). HGB contains 23 real-world hybrid graph datasets across various domains such as biology, social media, and e-commerce. Furthermore, we provide an extensible evaluation framework and a supporting codebase to facilitate the training and evaluation of GNNs on HGB. Our empirical study of existing GNNs on HGB reveals various research opportunities and gaps, including (1) evaluating the actual performance improvement of hypergraph GNNs over simple graph GNNs; (2) comparing the impact of different sampling strategies on hybrid graph learning methods; and (3) exploring ways to integrate simple graph and hypergraph information. We make our source code and full datasets publicly available at https://zehui127.github.io/hybrid-graph-benchmark/.  ( 3 min )
    Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension. II
    arXiv:2305.17282v4 Announce Type: replace Abstract: We continue to investigate the $k$ nearest neighbour ($k$-NN) learning rule in complete separable metric spaces. Thanks to the results of C\'erou and Guyader (2006) and Preiss (1983), this rule is known to be universally consistent in every such metric space that is sigma-finite dimensional in the sense of Nagata. Here we show that the rule is strongly universally consistent in such spaces in the absence of ties. Under the tie-breaking strategy applied by Devroye, Gy\"{o}rfi, Krzy\.{z}ak, and Lugosi (1994) in the Euclidean setting, we manage to show the strong universal consistency in non-Archimedian metric spaces (that is, those of Nagata dimension zero). Combining the theorem of C\'erou and Guyader with results of Assouad and Quentin de Gromard (2006), one deduces that the $k$-NN rule is universally consistent in metric spaces having finite dimension in the sense of de Groot. In particular, the $k$-NN rule is universally consistent in the Heisenberg group which is not sigma-finite dimensional in the sense of Nagata as follows from an example independently constructed by Kor\'anyi and Reimann (1995) and Sawyer and Wheeden (1992).  ( 3 min )
    Martian time-series unraveled: A multi-scale nested approach with factorial variational autoencoders
    arXiv:2305.16189v3 Announce Type: replace Abstract: Unsupervised source separation involves unraveling an unknown set of source signals recorded through a mixing operator, with limited prior knowledge about the sources, and only access to a dataset of signal mixtures. This problem is inherently ill-posed and is further challenged by the variety of timescales exhibited by sources. Existing methods typically rely on a preselected window size that determines their operating timescale, limiting their capacity to handle multi-scale sources. To address this issue, we propose an unsupervised multi-scale clustering and source separation framework by leveraging wavelet scattering spectra that provide a low-dimensional representation of stochastic processes, capable of distinguishing between different non-Gaussian stochastic processes. Nested within this representation space, we develop a factorial Gaussian-mixture variational autoencoder that is trained to (1) probabilistically cluster sources at different timescales and (2) independently sample scattering spectra representations associated with each cluster. As the final stage, using samples from each cluster as prior information, we formulate source separation as an optimization problem in the wavelet scattering spectra representation space, aiming to separate sources in the time domain. When applied to the entire seismic dataset recorded during the NASA InSight mission on Mars, containing sources varying greatly in timescale, our multi-scale nested approach proves to be a powerful tool for disentangling such different sources, e.g., minute-long transient one-sided pulses (known as ``glitches'') and structured ambient noises resulting from atmospheric activities that typically last for tens of minutes. These results provide an opportunity to conduct further investigations into the isolated sources related to atmospheric-surface interactions, thermal relaxations, and other complex phenomena.  ( 3 min )
    Going Further: Flatness at the Rescue of Early Stopping for Adversarial Example Transferability
    arXiv:2304.02688v2 Announce Type: replace Abstract: Transferability is the property of adversarial examples to be misclassified by other models than the surrogate model for which they were crafted. Previous research has shown that early stopping the training of the surrogate model substantially increases transferability. A common hypothesis to explain this is that deep neural networks (DNNs) first learn robust features, which are more generic, thus a better surrogate. Then, at later epochs, DNNs learn non-robust features, which are more brittle, hence worst surrogate. First, we tend to refute this hypothesis, using transferability as a proxy for representation similarity. We then establish links between transferability and the exploration of the loss landscape in parameter space, focusing on sharpness, which is affected by early stopping. This leads us to evaluate surrogate models trained with seven minimizers that minimize both loss value and loss sharpness. Among them, SAM consistently outperforms early stopping by up to 28.8 percentage points. We discover that the strong SAM regularization from large flat neighborhoods tightly links to transferability. Finally, the best sharpness-aware minimizers prove competitive with other training methods and complement existing transferability techniques.  ( 3 min )
    Generative Sliced MMD Flows with Riesz Kernels
    arXiv:2305.11463v4 Announce Type: replace Abstract: Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations. In this paper, we show that MMD flows with Riesz kernels $K(x,y) = - \|x-y\|^r$, $r \in (0,2)$ have exceptional properties which allow their efficient computation. We prove that the MMD of Riesz kernels, which is also known as energy distance, coincides with the MMD of their sliced version. As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting. Here, for $r=1$, a simple sorting algorithm can be applied to reduce the complexity from $O(MN+N^2)$ to $O((M+N)\log(M+N))$ for two measures with $M$ and $N$ support points. As another interesting follow-up result, the MMD of compactly supported measures can be estimated from above and below by the Wasserstein-1 distance. For the implementations we approximate the gradient of the sliced MMD by using only a finite number $P$ of slices. We show that the resulting error has complexity $O(\sqrt{d/P})$, where $d$ is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks even for image applications. We demonstrate the efficiency of our model by image generation on MNIST, FashionMNIST and CIFAR10.  ( 3 min )
    Autoregressive Bandits
    arXiv:2212.06251v2 Announce Type: replace Abstract: Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for guaranteeing convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically validate our algorithm, illustrating its advantages w.r.t. bandit baselines and its robustness to misspecification of key parameters.  ( 2 min )
    Learning Large Causal Structures from Inverse Covariance Matrix via Sparse Matrix Decomposition
    arXiv:2211.14221v3 Announce Type: replace Abstract: Learning causal structures from observational data is a fundamental problem facing important computational challenges when the number of variables is large. In the context of linear structural equation models (SEMs), this paper focuses on learning causal structures from the inverse covariance matrix. The proposed method, called ICID for Independence-preserving Decomposition from Inverse Covariance matrix, is based on continuous optimization of a matrix decomposition model that preserves the nonzero patterns of the inverse covariance matrix. Through theoretical and empirical evidences, we show that ICID efficiently identifies the sought directed acyclic graph (DAG) assuming the knowledge of noise variances. Moreover, ICID is shown empirically to be robust under bounded misspecification of noise variances in the case where the noise variances are non-equal. The proposed method enjoys a low complexity, as reflected by its time efficiency in the experiments, and also enables a novel regularization scheme that yields highly accurate solutions on the Simulated fMRI data (Smith et al., 2011) in comparison with state-of-the-art algorithms.  ( 2 min )
    Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee
    arXiv:2206.10477v5 Announce Type: replace Abstract: Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive compared to various baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets  ( 3 min )
    Mathematical Framework for Online Social Media Auditing
    arXiv:2209.05550v2 Announce Type: replace Abstract: Social media platforms (SMPs) leverage algorithmic filtering (AF) as a means of selecting the content that constitutes a user's feed with the aim of maximizing their rewards. Selectively choosing the contents to be shown on the user's feed may yield a certain extent of influence, either minor or major, on the user's decision-making, compared to what it would have been under a natural/fair content selection. As we have witnessed over the past decade, algorithmic filtering can cause detrimental side effects, ranging from biasing individual decisions to shaping those of society as a whole, for example, diverting users' attention from whether to get the COVID-19 vaccine or inducing the public to choose a presidential candidate. The government's constant attempts to regulate the adverse effects of AF are often complicated, due to bureaucracy, legal affairs, and financial considerations. On the other hand SMPs seek to monitor their own algorithmic activities to avoid being fined for exceeding the allowable threshold. In this paper, we mathematically formalize this framework and utilize it to construct a data-driven statistical auditing procedure to regulate AF from deflecting users' beliefs over time, along with sample complexity guarantees. This state-of-the-art algorithm can be used either by authorities acting as external regulators or by SMPs for self-auditing.  ( 3 min )
    Inverse Boundary Value and Optimal Control Problems on Graphs: A Neural and Numerical Synthesis
    arXiv:2206.02911v2 Announce Type: replace Abstract: A general setup for deterministic system identification problems on graphs with Dirichlet and Neumann boundary conditions is introduced. When control nodes are available along the boundary, we apply a discretize-then-optimize method to estimate an optimal control. A key piece in the present architecture is our boundary injected message passing neural network. This will produce more accurate predictions that are considerably more stable in proximity of the boundary. Also, a regularization technique based on graphical distance is introduced that helps with stabilizing the predictions at nodes far from the boundary.  ( 2 min )
    Learning in Mean Field Games: A Survey
    arXiv:2205.12944v3 Announce Type: replace Abstract: Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malham\'e, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with a full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising to solve complex problems at scale. The combination of RL and MFGs is promising to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn equilibria and social optima in MFGs. We first identify the most common settings (static, stationary, and evolutive) of MFGs. We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.  ( 3 min )
    CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
    arXiv:2402.13254v1 Announce Type: cross Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two under-explored critical problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using a grounded image generation model, GLIGEN, to generate finetuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V.  ( 2 min )
    Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation
    arXiv:2106.03907v4 Announce Type: replace Abstract: Proxy causal learning (PCL) is a method for estimating the causal effect of treatments on outcomes in the presence of unobserved confounding, using proxies (structured side information) for the confounder. This is achieved via two-stage regression: in the first stage, we model relations among the treatment and proxies; in the second stage, we use this model to learn the effect of treatment on the outcome, given the context provided by the proxies. PCL guarantees recovery of the true causal effect, subject to identifiability conditions. We propose a novel method for PCL, the deep feature proxy variable method (DFPV), to address the case where the proxies, treatments, and outcomes are high-dimensional and have nonlinear complex relationships, as represented by deep neural network features. We show that DFPV outperforms recent state-of-the-art PCL methods on challenging synthetic benchmarks, including settings involving high dimensional image data. Furthermore, we show that PCL can be applied to off-policy evaluation for the confounded bandit problem, in which DFPV also exhibits competitive performance.  ( 3 min )
    FlashTex: Fast Relightable Mesh Texturing with LightControlNet
    arXiv:2402.13251v1 Announce Type: cross Abstract: Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. Our pipeline is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.  ( 2 min )
    Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
    arXiv:2402.13228v1 Announce Type: cross Abstract: Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the \textit{relative} probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a \textit{reduction} of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP significantly outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance. Notably, Smaug-72B is nearly 2\% better than any other open-source model on the HuggingFace Open LLM Leaderboard and becomes the first open-source LLM to surpass an average accuracy of 80\%.  ( 2 min )
    Analyzing Operator States and the Impact of AI-Enhanced Decision Support in Control Rooms: A Human-in-the-Loop Specialized Reinforcement Learning Framework for Intervention Strategies
    arXiv:2402.13219v1 Announce Type: cross Abstract: In complex industrial and chemical process control rooms, effective decision-making is crucial for safety and effi- ciency. The experiments in this paper evaluate the impact and applications of an AI-based decision support system integrated into an improved human-machine interface, using dynamic influ- ence diagrams, a hidden Markov model, and deep reinforcement learning. The enhanced support system aims to reduce operator workload, improve situational awareness, and provide different intervention strategies to the operator adapted to the current state of both the system and human performance. Such a system can be particularly useful in cases of information overload when many alarms and inputs are presented all within the same time window, or for junior operators during training. A comprehensive cross-data analysis was conducted, involving 47 participants and a diverse range of data sources such as smartwatch metrics, eye- tracking data, process logs, and responses from questionnaires. The results indicate interesting insights regarding the effec- tiveness of the approach in aiding decision-making, decreasing perceived workload, and increasing situational awareness for the scenarios considered. Additionally, the results provide valuable insights to compare differences between styles of information gathering when using the system by individual participants. These findings are particularly relevant when predicting the overall performance of the individual participant and their capacity to successfully handle a plant upset and the alarms connected to it using process and human-machine interaction logs in real-time. These predictions enable the development of more effective intervention strategies.  ( 3 min )
    Controlling Large Electric Vehicle Charging Stations via User Behavior Modeling and Stochastic Programming
    arXiv:2402.13224v1 Announce Type: cross Abstract: This paper introduces an Electric Vehicle Charging Station (EVCS) model that incorporates real-world constraints, such as slot power limitations, contract threshold overruns penalties, or early disconnections of electric vehicles (EVs). We propose a formulation of the problem of EVCS control under uncertainty, and implement two Multi-Stage Stochastic Programming approaches that leverage user-provided information, namely, Model Predictive Control and Two-Stage Stochastic Programming. The model addresses uncertainties in charging session start and end times, as well as in energy demand. A user's behavior model based on a sojourn-time-dependent stochastic process enhances cost reduction while maintaining customer satisfaction. The benefits of the two proposed methods are showcased against two baselines over a 22-day simulation using a real-world dataset. The two-stage approach proves robust against early disconnections, considering a more significant number of uncertainty scenarios for optimization. The algorithm prioritizing user satisfaction over electricity cost achieves a 20% and 36% improvement in two user satisfaction metrics compared to an industry-standard baseline. Additionally, the algorithm striking the best balance between cost and user satisfaction exhibits a mere 3% relative cost increase compared to the theoretically optimal baseline - for which the nonanticipativity constraint is relaxed - while attaining 94% and 84% of the user satisfaction performance in the two used satisfaction metrics.  ( 3 min )
    Softmax Probabilities (Mostly) Predict Large Language Model Correctness on Multiple-Choice Q&A
    arXiv:2402.13213v1 Announce Type: cross Abstract: Although large language models (LLMs) perform impressively on many tasks, overconfidence remains a problem. We hypothesized that on multiple-choice Q&A tasks, wrong answers would be associated with smaller maximum softmax probabilities (MSPs) compared to correct answers. We comprehensively evaluate this hypothesis on ten open-source LLMs and five datasets, and find strong evidence for our hypothesis among models which perform well on the original Q&A task. For the six LLMs with the best Q&A performance, the AUROC derived from the MSP was better than random chance with p < 10^{-4} in 59/60 instances. Among those six LLMs, the average AUROC ranged from 60% to 69%. Leveraging these findings, we propose a multiple-choice Q&A task with an option to abstain and show that performance can be improved by selectively abstaining based on the MSP of the initial model response. We also run the same experiments with pre-softmax logits instead of softmax probabilities and find similar (but not identical) results.  ( 2 min )
    Soft Self-Consistency Improves Language Model Agents
    arXiv:2402.13212v1 Announce Type: cross Abstract: Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (Soft-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. Soft-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, Soft-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that Soft-SC can be applied to both open-source and black-box models.  ( 2 min )
    Tiny Reinforcement Learning for Quadruped Locomotion using Decision Transformers
    arXiv:2402.13201v1 Announce Type: cross Abstract: Resource-constrained robotic platforms are particularly useful for tasks that require low-cost hardware alternatives due to the risk of losing the robot, like in search-and-rescue applications, or the need for a large number of devices, like in swarm robotics. For this reason, it is crucial to find mechanisms for adapting reinforcement learning techniques to the constraints imposed by lower computational power and smaller memory capacities of these ultra low-cost robotic platforms. We try to address this need by proposing a method for making imitation learning deployable onto resource-constrained robotic platforms. Here we cast the imitation learning problem as a conditional sequence modeling task and we train a decision transformer using expert demonstrations augmented with a custom reward. Then, we compress the resulting generative model using software optimization schemes, including quantization and pruning. We test our method in simulation using Isaac Gym, a realistic physics simulation environment designed for reinforcement learning. We empirically demonstrate that our method achieves natural looking gaits for Bittle, a resource-constrained quadruped robot. We also run multiple simulations to show the effects of pruning and quantization on the performance of the model. Our results show that quantization (down to 4 bits) and pruning reduce model size by around 30\% while maintaining a competitive reward, making the model deployable in a resource-constrained system.  ( 2 min )
    SONATA: Self-adaptive Evolutionary Framework for Hardware-aware Neural Architecture Search
    arXiv:2402.13204v1 Announce Type: cross Abstract: Recent advancements in Artificial Intelligence (AI), driven by Neural Networks (NN), demand innovative neural architecture designs, particularly within the constrained environments of Internet of Things (IoT) systems, to balance performance and efficiency. HW-aware Neural Architecture Search (HW-aware NAS) emerges as an attractive strategy to automate the design of NN using multi-objective optimization approaches, such as evolutionary algorithms. However, the intricate relationship between NN design parameters and HW-aware NAS optimization objectives remains an underexplored research area, overlooking opportunities to effectively leverage this knowledge to guide the search process accordingly. Furthermore, the large amount of evaluation data produced during the search holds untapped potential for refining the optimization strategy and improving the approximation of the Pareto front. Addressing these issues, we propose SONATA, a self-adaptive evolutionary algorithm for HW-aware NAS. Our method leverages adaptive evolutionary operators guided by the learned importance of NN design parameters. Specifically, through tree-based surrogate models and a Reinforcement Learning agent, we aspire to gather knowledge on 'How' and 'When' to evolve NN architectures. Comprehensive evaluations across various NAS search spaces and hardware devices on the ImageNet-1k dataset have shown the merit of SONATA with up to 0.25% improvement in accuracy and up to 2.42x gains in latency and energy. Our SONATA has seen up to sim$93.6% Pareto dominance over the native NSGA-II, further stipulating the importance of self-adaptive evolution operators in HW-aware NAS.  ( 2 min )
    DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
    arXiv:2402.13181v1 Announce Type: cross Abstract: We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.  ( 2 min )
    VGMShield: Mitigating Misuse of Video Generative Models
    arXiv:2402.13126v1 Announce Type: cross Abstract: With the rapid advancement in video generation, people can conveniently utilize video generation models to create videos tailored to their specific desires. Nevertheless, there are also growing concerns about their potential misuse in creating and disseminating false information. In this work, we introduce VGMShield: a set of three straightforward but pioneering mitigations through the lifecycle of fake video generation. We start from \textit{fake video detection} trying to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos; then, we investigate the \textit{tracing} problem, which maps a fake video back to a model that generates it. Towards these, we propose to leverage pre-trained models that focus on {\it spatial-temporal dynamics} as the backbone to identify inconsistencies in videos. Through experiments on seven state-of-the-art open-source models, we demonstrate that current models still cannot perfectly handle spatial-temporal relationships, and thus, we can accomplish detection and tracing with nearly perfect accuracy. Furthermore, anticipating future generative model improvements, we propose a {\it prevention} method that adds invisible perturbations to images to make the generated videos look unreal. Together with fake video detection and tracing, our multi-faceted set of solutions can effectively mitigate misuse of video generative models.  ( 2 min )
    Mode Estimation with Partial Feedback
    arXiv:2402.13079v1 Announce Type: cross Abstract: The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize core aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution using partial feedback. We show how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our problem.  ( 2 min )
    On Generalization Bounds for Deep Compound Gaussian Neural Networks
    arXiv:2402.13106v1 Announce Type: cross Abstract: Algorithm unfolding or unrolling is the technique of constructing a deep neural network (DNN) from an iterative algorithm. Unrolled DNNs often provide better interpretability and superior empirical performance over standard DNNs in signal estimation tasks. An important theoretical question, which has only recently received attention, is the development of generalization error bounds for unrolled DNNs. These bounds deliver theoretical and practical insights into the performance of a DNN on empirical datasets that are distinct from, but sampled from, the probability density generating the DNN training data. In this paper, we develop novel generalization error bounds for a class of unrolled DNNs that are informed by a compound Gaussian prior. These compound Gaussian networks have been shown to outperform comparative standard and unfolded deep neural networks in compressive sensing and tomographic imaging problems. The generalization error bound is formulated by bounding the Rademacher complexity of the class of compound Gaussian network estimates with Dudley's integral. Under realistic conditions, we show that, at worst, the generalization error scales $\mathcal{O}(n\sqrt{\ln(n)})$ in the signal dimension and $\mathcal{O}(($Network Size$)^{3/2})$ in network size.  ( 2 min )
    Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition
    arXiv:2402.13076v1 Announce Type: cross Abstract: Power consumption plays an important role in on-device streaming speech recognition, as it has a direct impact on the user experience. This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models. We discovered that the impact of weight parameters on power consumption varies, influenced by factors including how often they are invoked and their placement in memory. Armed with this insight, we developed design guidelines aimed at optimizing on-device speech recognition models. These guidelines focus on minimizing power use without substantially affecting accuracy. Our method, which employs targeted compression based on the varying sensitivities of weight parameters, demonstrates superior performance compared to state-of-the-art compression methods. It achieves a reduction in energy usage of up to 47% while maintaining similar model accuracy and improving the real-time factor.  ( 2 min )
    Improving Neural-based Classification with Logical Background Knowledge
    arXiv:2402.13019v1 Announce Type: cross Abstract: Neurosymbolic AI is a growing field of research aiming to combine neural networks learning capabilities with the reasoning abilities of symbolic systems. This hybridization can take many shapes. In this paper, we propose a new formalism for supervised multi-label classification with propositional background knowledge. We introduce a new neurosymbolic technique called semantic conditioning at inference, which only constrains the system during inference while leaving the training unaffected. We discuss its theoritical and practical advantages over two other popular neurosymbolic techniques: semantic conditioning and semantic regularization. We develop a new multi-scale methodology to evaluate how the benefits of a neurosymbolic technique evolve with the scale of the network. We then evaluate experimentally and compare the benefits of all three techniques across model scales on several datasets. Our results demonstrate that semantic conditioning at inference can be used to build more accurate neural-based systems with fewer resources while guaranteeing the semantic consistency of outputs.  ( 2 min )
    SzCORE: A Seizure Community Open-source Research Evaluation framework for the validation of EEG-based automated seizure detection algorithms
    arXiv:2402.13005v1 Announce Type: cross Abstract: The need for high-quality automated seizure detection algorithms based on electroencephalography (EEG) becomes ever more pressing with the increasing use of ambulatory and long-term EEG monitoring. Heterogeneity in validation methods of these algorithms influences the reported results and makes comprehensive evaluation and comparison challenging. This heterogeneity concerns in particular the choice of datasets, evaluation methodologies, and performance metrics. In this paper, we propose a unified framework designed to establish standardization in the validation of EEG-based seizure detection algorithms. Based on existing guidelines and recommendations, the framework introduces a set of recommendations and standards related to datasets, file formats, EEG data input content, seizure annotation input and output, cross-validation strategies, and performance metrics. We also propose the 10-20 seizure detection benchmark, a machine-learning benchmark based on public datasets converted to a standardized format. This benchmark defines the machine-learning task as well as reporting metrics. We illustrate the use of the benchmark by evaluating a set of existing seizure detection algorithms. The SzCORE (Seizure Community Open-source Research Evaluation) framework and benchmark are made publicly available along with an open-source software library to facilitate research use, while enabling rigorous evaluation of the clinical significance of the algorithms, fostering a collective effort to more optimally detect seizures to improve the lives of people with epilepsy.  ( 3 min )
    A unifying primary framework for quantum graph neural networks from quantum graph states
    arXiv:2402.13001v1 Announce Type: cross Abstract: Graph states are used to represent mathematical graphs as quantum states on quantum computers. They can be formulated through stabilizer codes or directly quantum gates and quantum states. In this paper we show that a quantum graph neural network model can be understood and realized based on graph states. We show that they can be used either as a parameterized quantum circuits to represent neural networks or as an underlying structure to construct graph neural networks on quantum computers.  ( 2 min )
    An Autonomous Large Language Model Agent for Chemical Literature Data Mining
    arXiv:2402.12993v1 Announce Type: cross Abstract: Chemical synthesis, which is crucial for advancing material synthesis and drug discovery, impacts various sectors including environmental science and healthcare. The rise of technology in chemistry has generated extensive chemical data, challenging researchers to discern patterns and refine synthesis processes. Artificial intelligence (AI) helps by analyzing data to optimize synthesis and increase yields. However, AI faces challenges in processing literature data due to the unstructured format and diverse writing style of chemical literature. To overcome these difficulties, we introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. This AI agent employs large language models (LLMs) for prompt generation and iterative optimization. It functions as a chemistry assistant, automating data collection and analysis, thereby saving manpower and enhancing performance. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data, and we compared our method with human experts in terms of content correctness and time efficiency. The proposed approach marks a significant advancement in automating chemical literature extraction and demonstrates the potential for AI to revolutionize data management and utilization in chemistry.  ( 2 min )
    How Temporal Unrolling Supports Neural Physics Simulators
    arXiv:2402.12971v1 Announce Type: cross Abstract: Unrolling training trajectories over time strongly influences the inference accuracy of neural network-augmented physics simulators. We analyze these effects by studying three variants of training neural networks on discrete ground truth trajectories. In addition to commonly used one-step setups and fully differentiable unrolling, we include a third, less widely used variant: unrolling without temporal gradients. Comparing networks trained with these three modalities makes it possible to disentangle the two dominant effects of unrolling, training distribution shift and long-term gradients. We present a detailed study across physical systems, network sizes, network architectures, training setups, and test scenarios. It provides an empirical basis for our main findings: A non-differentiable but unrolled training setup supported by a numerical solver can yield 4.5-fold improvements over a fully differentiable prediction setup that does not utilize this solver. We also quantify a difference in the accuracy of models trained in a fully differentiable setup compared to their non-differentiable counterparts. While differentiable setups perform best, the accuracy of unrolling without temporal gradients comes comparatively close. Furthermore, we empirically show that these behaviors are invariant to changes in the underlying physical system, the network architecture and size, and the numerical scheme. These results motivate integrating non-differentiable numerical simulators into training setups even if full differentiability is unavailable. We also observe that the convergence rate of common neural architectures is low compared to numerical algorithms. This encourages the use of hybrid approaches combining neural and numerical algorithms to utilize the benefits of both.  ( 3 min )
    A Bound on the Maximal Marginal Degrees of Freedom
    arXiv:2402.12885v1 Announce Type: cross Abstract: Common kernel ridge regression is expensive in memory allocation and computation time. This paper addresses low rank approximations and surrogates for kernel ridge regression, which bridge these difficulties. The fundamental contribution of the paper is a lower bound on the rank of the low dimensional approximation, which is required such that the prediction power remains reliable. The bound relates the effective dimension with the largest statistical leverage score. We characterize the effective dimension and its growth behavior with respect to the regularization parameter by involving the regularity of the kernel. This growth is demonstrated to be asymptotically logarithmic for suitably chosen kernels, justifying low-rank approximations as the Nystr\"om method.  ( 2 min )
    More Discriminative Sentence Embeddings via Semantic Graph Smoothing
    arXiv:2402.12890v1 Announce Type: cross Abstract: This paper explores an empirical approach to learn more discriminantive sentence representations in an unsupervised fashion. Leveraging semantic graph smoothing, we enhance sentence embeddings obtained from pretrained models to improve results for the text clustering and classification tasks. Our method, validated on eight benchmarks, demonstrates consistent improvements, showcasing the potential of semantic graph smoothing in improving sentence embeddings for the supervised and unsupervised document categorization tasks.  ( 2 min )
    Towards MLOps: A DevOps Tools Recommender System for Machine Learning System
    arXiv:2402.12867v1 Announce Type: cross Abstract: Applying DevOps practices to machine learning system is termed as MLOps and machine learning systems evolve on new data unlike traditional systems on requirements. The objective of MLOps is to establish a connection between different open-source tools to construct a pipeline that can automatically perform steps to construct a dataset, train the machine learning model and deploy the model to the production as well as store different versions of model and dataset. Benefits of MLOps is to make sure the fast delivery of the new trained models to the production to have accurate results. Furthermore, MLOps practice impacts the overall quality of the software products and is completely dependent on open-source tools and selection of relevant open-source tools is considered as challenged while a generalized method to select an appropriate open-source tools is desirable. In this paper, we present a framework for recommendation system that processes the contextual information (e.g., nature of data, type of the data) of the machine learning project and recommends a relevant toolchain (tech-stack) for the operationalization of machine learning systems. To check the applicability of the proposed framework, four different approaches i.e., rule-based, random forest, decision trees and k-nearest neighbors were investigated where precision, recall and f-score is measured, the random forest out classed other approaches with highest f-score value of 0.66.  ( 2 min )
    Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
    arXiv:2402.12865v1 Announce Type: cross Abstract: Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.  ( 2 min )
    Instruction-tuned Language Models are Better Knowledge Learners
    arXiv:2402.12847v1 Announce Type: cross Abstract: In order for large language model (LLM)-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe struggle to answer questions, even though the perplexity of documents is minimized. We found that QA pairs are generally straightforward, while documents are more complex, weaving many factual statements together in an intricate manner. Therefore, we hypothesize that it is beneficial to expose LLMs to QA pairs before continued pre-training on documents so that the process of encoding knowledge from complex documents takes into account how this knowledge is accessed through questions. Based on this, we propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents. This contrasts with standard instruction-tuning, which learns how to extract knowledge after training on documents. Extensive experiments and ablation studies demonstrate that PIT significantly enhances the ability of LLMs to absorb knowledge from new documents, outperforming standard instruction-tuning by 17.8%.  ( 2 min )
    PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning
    arXiv:2402.12842v1 Announce Type: cross Abstract: Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets using the GPT-2 model family show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.  ( 2 min )
    Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model
    arXiv:2402.12821v1 Announce Type: cross Abstract: Factual inconsistency poses a significant hurdle for the commercial deployment of abstractive summarizers. Under this Large Language Model (LLM) era, this work focuses around two important questions: what is the best way to leverage LLM for factual inconsistency detection, and how could we distill a smaller LLM with both high efficiency and efficacy? Three zero-shot paradigms are firstly proposed and evaluated across five diverse datasets: direct inference on the entire summary or each summary window; entity verification through question generation and answering. Experiments suggest that LLM itself is capable to resolve this task train-free under the proper paradigm design, surpassing strong trained baselines by 2.8% on average. To further promote practical utility, we then propose training strategies aimed at distilling smaller open-source LLM that learns to score the entire summary at once with high accuracy, which outperforms the zero-shot approaches by much larger LLM, serving as an effective and efficient ready-to-use scorer.  ( 2 min )
    SGD with Clipping is Secretly Estimating the Median Gradient
    arXiv:2402.12828v1 Announce Type: cross Abstract: There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise. We then derive iterative methods based on the stochastic proximal point method for computing the geometric median and generalizations thereof. Finally we propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.  ( 2 min )
    Fine-Tuning, Prompting, In-Context Learning and Instruction-Tuning: How Many Labelled Samples Do We Need?
    arXiv:2402.12819v1 Announce Type: cross Abstract: When solving a task with limited labelled data, researchers can either use a general large language model without further update, or use the few examples to tune a specialised smaller model. When enough labels are available, the specialised models outperform the general ones on many NLP tasks. In this work, we aim to investigate how many labelled samples are required for the specialised models to achieve this superior performance, while taking the results variance into consideration. Observing the behaviour of prompting, in-context learning, fine-tuning and instruction-tuning, identifying their break-even points when increasing number of labelled training samples across three tasks of varying complexity, we find that the specialised models often need only few samples ($100-1000$) to be on par or better than the general ones. At the same time, the amount of required labelled data strongly depends on the task complexity and results variance.  ( 2 min )
    On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices
    arXiv:2402.12817v1 Announce Type: cross Abstract: While learning with limited labelled data can improve performance when the labels are lacking, it is also sensitive to the effects of uncontrolled randomness introduced by so-called randomness factors (e.g., varying order of data). We propose a method to systematically investigate the effects of randomness factors while taking the interactions between them into consideration. To measure the true effects of an individual randomness factor, our method mitigates the effects of other factors and observes how the performance varies across multiple runs. Applying our method to multiple randomness factors across in-context learning and fine-tuning approaches on 7 representative text classification tasks and meta-learning on 3 tasks, we show that: 1) disregarding interactions between randomness factors in existing works caused inconsistent findings due to incorrect attribution of the effects of randomness factors, such as disproving the consistent sensitivity of in-context learning to sample order even with random sample selection; and 2) besides mutual interactions, the effects of randomness factors, especially sample order, are also dependent on more systematic choices unexplored in existing works, such as number of classes, samples per class or choice of prompt format.  ( 2 min )
    Learning under Singularity: An Information Criterion improving WBIC and sBIC
    arXiv:2402.12762v1 Announce Type: cross Abstract: We introduce a novel Information Criterion (IC), termed Learning under Singularity (LS), designed to enhance the functionality of the Widely Applicable Bayes Information Criterion (WBIC) and the Singular Bayesian Information Criterion (sBIC). LS is effective without regularity constraints and demonstrates stability. Watanabe defined a statistical model or a learning machine as regular if the mapping from a parameter to a probability distribution is one-to-one and its Fisher information matrix is positive definite. In contrast, models not meeting these conditions are termed singular. Over the past decade, several information criteria for singular cases have been proposed, including WBIC and sBIC. WBIC is applicable in non-regular scenarios but faces challenges with large sample sizes and redundant estimation of known learning coefficients. Conversely, sBIC is limited in its broader application due to its dependence on maximum likelihood estimates. LS addresses these limitations by enhancing the utility of both WBIC and sBIC. It incorporates the empirical loss from the Widely Applicable Information Criterion (WAIC) to represent the goodness of fit to the statistical model, along with a penalty term similar to that of sBIC. This approach offers a flexible and robust method for model selection, free from regularity constraints.  ( 2 min )
    Autonomous Reality Modelling for Cultural Heritage Sites employing cooperative quadrupedal robots and unmanned aerial vehicles
    arXiv:2402.12794v1 Announce Type: cross Abstract: Nowadays, the use of advanced sensors, such as terrestrial 3D laser scanners, mobile LiDARs and Unmanned Aerial Vehicles (UAV) photogrammetric imaging, has become the prevalent practice for 3D Reality Modeling and digitization of large-scale monuments of Cultural Heritage (CH). In practice, this process is heavily related to the expertise of the surveying team, handling the laborious planning and time-consuming execution of the 3D mapping process that is tailored to the specific requirements and constraints of each site. To minimize human intervention, this paper introduces a novel methodology for autonomous 3D Reality Modeling for CH monuments by employing au-tonomous biomimetic quadrupedal robotic agents and UAVs equipped with the appropriate sensors. These autonomous robotic agents carry out the 3D RM process in a systematic and repeatable ap-proach. The outcomes of this automated process may find applications in digital twin platforms, facilitating secure monitoring and management of cultural heritage sites and spaces, in both indoor and outdoor environments.  ( 2 min )
    UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation
    arXiv:2402.12730v1 Announce Type: cross Abstract: This paper describes the system we developed for SemEval-2024 Task 1, "Semantic Textual Relatedness for African and Asian Languages." The aim of the task is to build a model that can identify semantic textual relatedness (STR) between two sentences of a target language belonging to a collection of African and Asian languages. We participated in Subtasks A and C and explored supervised and cross-lingual training leveraging large language models (LLMs). Pre-trained large language models have been extensively used for machine translation and semantic similarity. Using a combination of machine translation and sentence embedding LLMs, we developed a unified STR model, TranSem, for subtask A and fine-tuned the T5 family of models on the STR data, FineSem, for use in subtask C. Our model results for 7 languages in subtask A were better than the official baseline for 3 languages and on par with the baseline for the remaining 4 languages. Our model results for the 12 languages in subtask C resulted in 1st place for Africaans, 2nd place for Indonesian, and 3rd place for English with low performance for the remaining 9 languages.  ( 2 min )
    APT-MMF: An advanced persistent threat actor attribution method based on multimodal and multilevel feature fusion
    arXiv:2402.12743v1 Announce Type: cross Abstract: Threat actor attribution is a crucial defense strategy for combating advanced persistent threats (APTs). Cyber threat intelligence (CTI), which involves analyzing multisource heterogeneous data from APTs, plays an important role in APT actor attribution. The current attribution methods extract features from different CTI perspectives and employ machine learning models to classify CTI reports according to their threat actors. However, these methods usually extract only one kind of feature and ignore heterogeneous information, especially the attributes and relations of indicators of compromise (IOCs), which form the core of CTI. To address these problems, we propose an APT actor attribution method based on multimodal and multilevel feature fusion (APT-MMF). First, we leverage a heterogeneous attributed graph to characterize APT reports and their IOC information. Then, we extract and fuse multimodal features, including attribute type features, natural language text features and topological relationship features, to construct comprehensive node representations. Furthermore, we design multilevel heterogeneous graph attention networks to learn the deep hidden features of APT report nodes; these networks integrate IOC type-level, metapath-based neighbor node-level, and metapath semantic-level attention. Utilizing multisource threat intelligence, we construct a heterogeneous attributed graph dataset for verification purposes. The experimental results show that our method not only outperforms the existing methods but also demonstrates its good interpretability for attribution analysis tasks.  ( 2 min )
    Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
    arXiv:2402.12728v1 Announce Type: cross Abstract: Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been proposed to leverage large language models (LLMs) as an implicit knowledge source, it remains challenging since LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs and LLMs, cannot be readily aligned for complex scenarios. To tackle these, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) We construct a coupled concept graph by linking the mentioned entities with external facts. (iii) A tailored pseudo-siamese graph medium fusion is designed for sufficient multimodal fusion. We utilize the shared mentioned entities in two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24x less resources.  ( 2 min )
    Integrating Active Learning in Causal Inference with Interference: A Novel Approach in Online Experiments
    arXiv:2402.12710v1 Announce Type: cross Abstract: In the domain of causal inference research, the prevalent potential outcomes framework, notably the Rubin Causal Model (RCM), often overlooks individual interference and assumes independent treatment effects. This assumption, however, is frequently misaligned with the intricate realities of real-world scenarios, where interference is not merely a possibility but a common occurrence. Our research endeavors to address this discrepancy by focusing on the estimation of direct and spillover treatment effects under two assumptions: (1) network-based interference, where treatments on neighbors within connected networks affect one's outcomes, and (2) non-random treatment assignments influenced by confounders. To improve the efficiency of estimating potentially complex effects functions, we introduce an novel active learning approach: Active Learning in Causal Inference with Interference (ACI). This approach uses Gaussian process to flexibly model the direct and spillover treatment effects as a function of a continuous measure of neighbors' treatment assignment. The ACI framework sequentially identifies the experimental settings that demand further data. It further optimizes the treatment assignments under the network interference structure using genetic algorithms to achieve efficient learning outcome. By applying our method to simulation data and a Tencent game dataset, we demonstrate its feasibility in achieving accurate effects estimations with reduced data requirements. This ACI approach marks a significant advancement in the realm of data efficiency for causal inference, offering a robust and efficient alternative to traditional methodologies, particularly in scenarios characterized by complex interference patterns.  ( 3 min )
    Quantum Embedding with Transformer for High-dimensional Data
    arXiv:2402.12704v1 Announce Type: cross Abstract: Quantum embedding with transformers is a novel and promising architecture for quantum machine learning to deliver exceptional capability on near-term devices or simulators. The research incorporated a vision transformer (ViT) to advance quantum significantly embedding ability and results for a single qubit classifier with around 3 percent in the median F1 score on the BirdCLEF-2021, a challenging high-dimensional dataset. The study showcases and analyzes empirical evidence that our transformer-based architecture is a highly versatile and practical approach to modern quantum machine learning problems.  ( 2 min )
    Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests
    arXiv:2402.12668v1 Announce Type: cross Abstract: We study the often overlooked phenomenon, first noted in \cite{breiman2001random}, that random forests appear to reduce bias compared to bagging. Motivated by an interesting paper by \cite{mentch2020randomization}, where the authors argue that random forests reduce effective degrees of freedom and only outperform bagging ensembles in low signal-to-noise ratio (SNR) settings, we explore how random forests can uncover patterns in the data missed by bagging. We empirically demonstrate that in the presence of such patterns, random forests reduce bias along with variance and increasingly outperform bagging ensembles when SNR is high. Our observations offer insights into the real-world success of random forests across a range of SNRs and enhance our understanding of the difference between random forests and bagging ensembles with respect to the randomization injected into each split. Our investigations also yield practical insights into the importance of tuning $mtry$ in random forests.  ( 2 min )
    SoftQE: Learned Representations of Queries Expanded by LLMs
    arXiv:2402.12663v1 Announce Type: cross Abstract: We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.  ( 2 min )
    FAST: An Optimization Framework for Fast Additive Segmentation in Transparent ML
    arXiv:2402.12630v1 Announce Type: cross Abstract: We present FAST, an optimization framework for fast additive segmentation. FAST segments piecewise constant shape functions for each feature in a dataset to produce transparent additive models. The framework leverages a novel optimization procedure to fit these models $\sim$2 orders of magnitude faster than existing state-of-the-art methods, such as explainable boosting machines \citep{nori2019interpretml}. We also develop new feature selection algorithms in the FAST framework to fit parsimonious models that perform well. Through experiments and case studies, we show that FAST improves the computational efficiency and interpretability of additive models.  ( 2 min )
    Truncated Polynomial Expansion-Based Detection in Massive MIMO: A Model-Driven Deep Learning Approach
    arXiv:2402.12595v1 Announce Type: cross Abstract: In this paper, we propose a deep learning (DL)-based approach for efficiently computing the inverse of Hermitian matrices using truncated polynomial expansion (TPE). Our model-driven approach involves optimizing the coefficients of the TPE during an offline training procedure for a given number of TPE terms. We apply this method to signal detection in uplink massive multiple-input multiple-output (MIMO) systems, where the matrix inverse operation required by linear detectors, such as zero-forcing (ZF) and minimum mean square error (MMSE), is approximated using TPE. Our simulation results demonstrate that the proposed learned TPE-based method outperforms the conventional TPE method with optimal coefficients in terms of asymptotic convergence speed and reduces the computational complexity of the online detection stage, albeit at the expense of the offline training stage. However, the limited number of trainable parameters leads to a swift offline training process.  ( 2 min )
    Generative AI Security: Challenges and Countermeasures
    arXiv:2402.12617v1 Announce Type: cross Abstract: Generative AI's expanding footprint across numerous industries has led to both excitement and increased scrutiny. This paper delves into the unique security challenges posed by Generative AI, and outlines potential research directions for managing these risks.  ( 2 min )
    GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
    arXiv:2402.12566v1 Announce Type: cross Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision. We will release our tool (GenAudit) and fact-checking model for public use.  ( 2 min )
    Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
    arXiv:2402.12550v1 Announce Type: cross Abstract: The Mixture of Experts (MoE) paradigm provides a powerful way to decompose inscrutable dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. A major problem however lies in the computational cost of scaling the number of experts to achieve sufficiently fine-grained specialization. In this paper, we propose the Multilinear Mixutre of Experts (MMoE) layer to address this, focusing on vision models. MMoE layers perform an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, MMoEs both (1) avoid the issues incurred through the discrete expert routing in the popular 'sparse' MoE models, yet (2) do not incur the restrictively high inference-time costs of 'soft' MoE alternatives. We present both qualitative and quantitative evidence (through visualization and counterfactual interventions respectively) that scaling MMoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level whilst remaining competitive with the performance of parameter-matched linear layer counterparts. Finally, we show that learned expert specialism further facilitates manual correction of demographic bias in CelebA attribute classification. Our MMoE model code is available at https://github.com/james-oldfield/MMoE.  ( 2 min )
    A Machine Learning Ensemble Model for the Detection of Cyberbullying
    arXiv:2402.12538v1 Announce Type: cross Abstract: The pervasive use of social media platforms, such as Facebook, Instagram, and X, has significantly amplified our electronic interconnectedness. Moreover, these platforms are now easily accessible from any location at any given time. However, the increased popularity of social media has also led to cyberbullying.It is imperative to address the need for finding, monitoring, and mitigating cyberbullying posts on social media platforms. Motivated by this necessity, we present this paper to contribute to developing an automated system for detecting binary labels of aggressive tweets.Our study has demonstrated remarkable performance compared to previous experiments on the same dataset. We employed the stacking ensemble machine learning method, utilizing four various feature extraction techniques to optimize performance within the stacking ensemble learning framework. Combining five machine learning algorithms,Decision Trees, Random Forest, Linear Support Vector Classification, Logistic Regression, and K-Nearest Neighbors into an ensemble method, we achieved superior results compared to traditional machine learning classifier models. The stacking classifier achieved a high accuracy rate of 94.00%, outperforming traditional machine learning models and surpassing the results of prior experiments that utilized the same dataset. The outcomes of our experiments showcased an accuracy rate of 0.94% in detection tweets as aggressive or non-aggressive.  ( 2 min )
    Impact of data usage for forecasting on performance of model predictive control in buildings with smart energy storage
    arXiv:2402.12539v1 Announce Type: cross Abstract: Data is required to develop forecasting models for use in Model Predictive Control (MPC) schemes in building energy systems. However, data usage incurs costs from both its collection and exploitation. Determining cost optimal data usage requires understanding of the forecast accuracy and resulting MPC operational performance it enables. This study investigates the performance of both simple and state-of-the-art machine learning prediction models for MPC in a multi-building energy system simulation using historic building energy data. The impact of data usage on forecast accuracy is quantified for the following data efficiency measures: reuse of prediction models, reduction of training data volumes, reduction of model data features, and online model training. A simple linear multi-layer perceptron model is shown to provide equivalent forecast accuracy to state-of-the-art models, with greater data efficiency and generalisability. The use of more than 2 years of training data for load prediction models provided no significant improvement in forecast accuracy. Forecast accuracy and data efficiency were improved simultaneously by using change-point analysis to screen training data. Reused models and those trained with 3 months of data had on average 10% higher error than baseline, indicating that deploying MPC systems without prior data collection may be economic.  ( 3 min )
    Improving Deep Generative Models on Many-To-One Image-to-Image Translation
    arXiv:2402.12531v1 Announce Type: cross Abstract: Deep generative models have been applied to multiple applications in image- to-image translation. Generative Adversarial Networks and Diffusion Models have presented impressive results, setting new state-of-the-art results on these tasks. Most methods have symmetric setups across the different domains in a dataset. These methods assume that all domains have either multiple modalities or only one modality. However, there are many datasets that have a many-to-one relationship between two domains. In this work, we first introduce a Colorized MNIST dataset and a Color-Recall score that can provide a simple benchmark for evaluating models on many-to-one translation. We then introduce a new asymmetric framework to improve existing deep generative models on many-to-one image-to- image translation. We apply this framework to StarGAN V2 and show that in both unsupervised and semi-supervised settings, the performance of this new model improves on many-to-one image-to-image translation.  ( 2 min )
    Parallel Structures in Pre-training Data Yield In-Context Learning
    arXiv:2402.12530v1 Announce Type: cross Abstract: Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task with only a few examples given in the prompt without any parameter update. However, it is unclear where this capability comes from as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study what patterns of the pre-training data contribute to ICL. We find that LMs' ICL ability depends on $\textit{parallel structures}$ in the pre-training data -- pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and conduct ablation experiments to study their effect on ICL. We show that removing parallel structures in the pre-training data reduces LMs' ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when excluding common patterns such as n-gram repetitions and long-range dependency, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.  ( 2 min )
    Automated Security Response through Online Learning with Adaptive Conjectures
    arXiv:2402.12499v1 Announce Type: cross Abstract: We study automated security response for an IT infrastructure and formulate the interaction between an attacker and a defender as a partially observed, non-stationary game. We relax the standard assumption that the game model is correctly specified and consider that each player has a probabilistic conjecture about the model, which may be misspecified in the sense that the true model has probability 0. This formulation allows us to capture uncertainty about the infrastructure and the intents of the players. To learn effective game strategies online, we design a novel method where a player iteratively adapts its conjecture using Bayesian learning and updates its strategy through rollout. We prove that the conjectures converge to best fits, and we provide a bound on the performance improvement that rollout enables with a conjectured model. To characterize the steady state of the game, we propose a variant of the Berk-Nash equilibrium. We present our method through an advanced persistent threat use case. Simulation studies based on testbed measurements show that our method produces effective security strategies that adapt to a changing environment. We also find that our method enables faster convergence than current reinforcement learning techniques.  ( 3 min )
    Feudal Networks for Visual Navigation
    arXiv:2402.12498v1 Announce Type: cross Abstract: Visual navigation follows the intuition that humans can navigate without detailed maps. A common approach is interactive exploration while building a topological graph with images at nodes that can be used for planning. Recent variations learn from passive videos and can navigate using complex social and semantic cues. However, a significant number of training videos are needed, large graphs are utilized, and scenes are not unseen since odometry is utilized. We introduce a new approach to visual navigation using feudal learning, which employs a hierarchical structure consisting of a worker agent, a mid-level manager, and a high-level manager. Key to the feudal learning paradigm, agents at each level see a different aspect of the task and operate at different spatial and temporal scales. Two unique modules are developed in this framework. For the high- level manager, we learn a memory proxy map in a self supervised manner to record prior observations in a learned latent space and avoid the use of graphs and odometry. For the mid-level manager, we develop a waypoint network that outputs intermediate subgoals imitating human waypoint selection during local navigation. This waypoint network is pre-trained using a new, small set of teleoperation videos that we make publicly available, with training environments different from testing environments. The resulting feudal navigation network achieves near SOTA performance, while providing a novel no-RL, no-graph, no-odometry, no-metric map approach to the image goal navigation task.  ( 2 min )
    SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech
    arXiv:2402.12482v1 Announce Type: cross Abstract: As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard said speech at scale is needed. However, this approach needs to minimize the dependency on human listening and annotation, only requiring a human-in-the-loop when needed. In this paper, we address this issue by outlining Speech Enhancement-based Curation Pipeline (SECP) which serves as a framework to onboard clean speech. This clean speech can then train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that enhanced output used as ground truth does not degrade model performance according to $\Delta_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that the highest and lowest bound of refined data is perceptually better than the original data.  ( 2 min )
    Learning Exceptional Subgroups by End-to-End Maximizing KL-divergence
    arXiv:2402.12930v1 Announce Type: new Abstract: Finding and describing sub-populations that are exceptional regarding a target property has important applications in many scientific disciplines, from identifying disadvantaged demographic groups in census data to finding conductive molecules within gold nanoparticles. Current approaches to finding such subgroups require pre-discretized predictive variables, do not permit non-trivial target distributions, do not scale to large datasets, and struggle to find diverse results. To address these limitations, we propose Syflow, an end-to-end optimizable approach in which we leverage normalizing flows to model arbitrary target distributions, and introduce a novel neural layer that results in easily interpretable subgroup descriptions. We demonstrate on synthetic and real-world data, including a case study, that Syflow reliably finds highly exceptional subgroups accompanied by insightful descriptions.  ( 2 min )
    Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of Machine Learning Models
    arXiv:2402.12916v1 Announce Type: new Abstract: Data Pipeline plays an indispensable role in tasks such as modeling machine learning and developing data products. With the increasing diversification and complexity of Data sources, as well as the rapid growth of data volumes, building an efficient Data Pipeline has become crucial for improving work efficiency and solving complex problems. This paper focuses on exploring how to optimize data flow through automated machine learning methods by integrating AutoML with Data Pipeline. We will discuss how to leverage AutoML technology to enhance the intelligence of Data Pipeline, thereby achieving better results in machine learning tasks. By delving into the automation and optimization of Data flows, we uncover key strategies for constructing efficient data pipelines that can adapt to the ever-changing data landscape. This not only accelerates the modeling process but also provides innovative solutions to complex problems, enabling more significant outcomes in increasingly intricate data domains. Keywords- Data Pipeline Training;AutoML; Data environment; Machine learning  ( 2 min )
    Federated Multi-Task Learning on Non-IID Data Silos: An Experimental Study
    arXiv:2402.12876v1 Announce Type: new Abstract: The innovative Federated Multi-Task Learning (FMTL) approach consolidates the benefits of Federated Learning (FL) and Multi-Task Learning (MTL), enabling collaborative model training on multi-task learning datasets. However, a comprehensive evaluation method, integrating the unique features of both FL and MTL, is currently absent in the field. This paper fills this void by introducing a novel framework, FMTL-Bench, for systematic evaluation of the FMTL paradigm. This benchmark covers various aspects at the data, model, and optimization algorithm levels, and comprises seven sets of comparative experiments, encapsulating a wide array of non-independent and identically distributed (Non-IID) data partitioning scenarios. We propose a systematic process for comparing baselines of diverse indicators and conduct a case study on communication expenditure, time, and energy consumption. Through our exhaustive experiments, we aim to provide valuable insights into the strengths and limitations of existing baseline methods, contributing to the ongoing discourse on optimal FMTL application in practical scenarios. The source code will be made available for results replication.  ( 2 min )
    Skill or Luck? Return Decomposition via Advantage Functions
    arXiv:2402.12874v1 Announce Type: new Abstract: Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We draw connections between Off-policy DAE and previous methods to demonstrate how it can speed up learning and when the proposed off-policy corrections are important. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.  ( 2 min )
    Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
    arXiv:2402.12875v1 Announce Type: new Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.  ( 2 min )
    Differentiable Mapper For Topological Optimization Of Data Representation
    arXiv:2402.12854v1 Announce Type: new Abstract: Unsupervised data representation and visualization using tools from topology is an active and growing field of Topological Data Analysis (TDA) and data science. Its most prominent line of work is based on the so-called Mapper graph, which is a combinatorial graph whose topological structures (connected components, branches, loops) are in correspondence with those of the data itself. While highly generic and applicable, its use has been hampered so far by the manual tuning of its many parameters-among these, a crucial one is the so-called filter: it is a continuous function whose variations on the data set are the main ingredient for both building the Mapper representation and assessing the presence and sizes of its topological structures. However, while a few parameter tuning methods have already been investigated for the other Mapper parameters (i.e., resolution, gain, clustering), there is currently no method for tuning the filter itself. In this work, we build on a recently proposed optimization framework incorporating topology to provide the first filter optimization scheme for Mapper graphs. In order to achieve this, we propose a relaxed and more general version of the Mapper graph, whose convergence properties are investigated. Finally, we demonstrate the usefulness of our approach by optimizing Mapper graph representations on several datasets, and showcasing the superiority of the optimized representation over arbitrary ones.  ( 2 min )
    Scalable Decentralized Algorithms for Online Personalized Mean Estimation
    arXiv:2402.12812v1 Announce Type: new Abstract: In numerous settings, agents lack sufficient data to directly learn a model. Collaborating with other agents may help, but it introduces a bias-variance trade-off, when local data distributions differ. A key challenge is for each agent to identify clients with similar distributions while learning the model, a problem that remains largely unresolved. This study focuses on a simplified version of the overarching problem, where each agent collects samples from a real-valued distribution over time to estimate its mean. Existing algorithms face impractical space and time complexities (quadratic in the number of agents A). To address scalability challenges, we propose a framework where agents self-organize into a graph, allowing each agent to communicate with only a selected number of peers r. We introduce two collaborative mean estimation algorithms: one draws inspiration from belief propagation, while the other employs a consensus-based approach, with complexity of O( r |A| log |A|) and O(r |A|), respectively. We establish conditions under which both algorithms yield asymptotically optimal estimates and offer a theoretical characterization of their performance.  ( 2 min )
    Learning Generalization and Regularization of Nonhomogeneous Temporal Poisson Processes
    arXiv:2402.12808v1 Announce Type: new Abstract: The Poisson process, especially the nonhomogeneous Poisson process (NHPP), is an essentially important counting process with numerous real-world applications. Up to date, almost all works in the literature have been on the estimation of NHPPs with infinite data using non-data driven binning methods. In this paper, we formulate the problem of estimation of NHPPs from finite and limited data as a learning generalization problem. We mathematically show that while binning methods are essential for the estimation of NHPPs, they pose a threat of overfitting when the amount of data is limited. We propose a framework for regularized learning of NHPPs with two new adaptive and data-driven binning methods that help to remove the ad-hoc tuning of binning parameters. Our methods are experimentally tested on synthetic and real-world datasets and the results show their effectiveness.  ( 2 min )
    Fair Classifiers Without Fair Training: An Influence-Guided Data Sampling Approach
    arXiv:2402.12789v1 Announce Type: new Abstract: A fair classifier should ensure the benefit of people from different groups, while the group information is often sensitive and unsuitable for model training. Therefore, learning a fair classifier but excluding sensitive attributes in the training dataset is important. In this paper, we study learning fair classifiers without implementing fair training algorithms to avoid possible leakage of sensitive information. Our theoretical analyses validate the possibility of this approach, that traditional training on a dataset with an appropriate distribution shift can reduce both the upper bound for fairness disparity and model generalization error, indicating that fairness and accuracy can be improved simultaneously with simply traditional training. We then propose a tractable solution to progressively shift the original training data during training by sampling influential data, where the sensitive attribute of new data is not accessed in sampling or used in training. Extensive experiments on real-world data demonstrate the effectiveness of our proposed algorithm.  ( 2 min )
    Tackling Byzantine Clients in Federated Learning
    arXiv:2402.12780v1 Announce Type: new Abstract: The possibility of adversarial (a.k.a., {\em Byzantine}) clients makes federated learning (FL) prone to arbitrary manipulation. The natural approach to robustify FL against adversarial clients is to replace the simple averaging operation at the server in the standard $\mathsf{FedAvg}$ algorithm by a \emph{robust averaging rule}. While a significant amount of work has been devoted to studying the convergence of federated {\em robust averaging} (which we denote by $\mathsf{FedRo}$), prior work has largely ignored the impact of {\em client subsampling} and {\em local steps}, two fundamental FL characteristics. While client subsampling increases the effective fraction of Byzantine clients, local steps increase the drift between the local updates computed by honest (i.e., non-Byzantine) clients. Consequently, a careless deployment of $\mathsf{FedRo}$ could yield poor performance. We validate this observation by presenting an in-depth analysis of $\mathsf{FedRo}$ tightly analyzing the impact of client subsampling and local steps. Specifically, we present a sufficient condition on client subsampling for nearly-optimal convergence of $\mathsf{FedRo}$ (for smooth non-convex loss). Also, we show that the rate of improvement in learning accuracy {\em diminishes} with respect to the number of clients subsampled, as soon as the sample size exceeds a threshold value. Interestingly, we also observe that under a careful choice of step-sizes, the learning error due to Byzantine clients decreases with the number of local steps. We validate our theory by experiments on the FEMNIST and CIFAR-$10$ image classification tasks.  ( 2 min )
    When and How: Learning Identifiable Latent States for Nonstationary Time Series Forecasting
    arXiv:2402.12767v1 Announce Type: new Abstract: Temporal distribution shifts are ubiquitous in time series data. One of the most popular methods assumes that the temporal distribution shift occurs uniformly to disentangle the stationary and nonstationary dependencies. But this assumption is difficult to meet, as we do not know when the distribution shifts occur. To solve this problem, we propose to learn IDentifiable latEnt stAtes (IDEA) to detect when the distribution shifts occur. Beyond that, we further disentangle the stationary and nonstationary latent states via sufficient observation assumption to learn how the latent states change. Specifically, we formalize the causal process with environment-irrelated station- ary and environment-related nonstationary variables. Under mild conditions, we show that latent environments and stationary/nonstationary variables are identifiable. Based on these theories, we devise the IDEA model, which incorporates an autoregressive hidden Markov model to estimate latent environments and modular prior networks to identify latent states. The IDEA model outperforms several latest nonstationary forecasting methods on various benchmark datasets, highlighting its advantages in real-world scenarios.  ( 2 min )
    Static vs. Dynamic Databases for Indoor Localization based on Wi-Fi Fingerprinting: A Discussion from a Data Perspective
    arXiv:2402.12756v1 Announce Type: new Abstract: Wi-Fi fingerprinting has emerged as the most popular approach to indoor localization. The use of ML algorithms has greatly improved the localization performance of Wi-Fi fingerprinting, but its success depends on the availability of fingerprint databases composed of a large number of RSSIs, the MAC addresses of access points, and the other measurement information. However, most fingerprint databases do not reflect well the time varying nature of electromagnetic interferences in complicated modern indoor environment. This could result in significant changes in statistical characteristics of training/validation and testing datasets, which are often constructed at different times, and even the characteristics of the testing datasets could be different from those of the data submitted by users during the operation of localization systems after their deployment. In this paper, we consider the implications of time-varying Wi-Fi fingerprints on indoor localization from a data-centric point of view and discuss the differences between static and dynamic databases. As a case study, we have constructed a dynamic database covering three floors of the IR building of XJTLU based on RSSI measurements, over 44 days, and investigated the differences between static and dynamic databases in terms of statistical characteristics and localization performance. The analyses based on variance calculations and Isolation Forest show the temporal shifts in RSSIs, which result in a noticeable trend of the increase in the localization error of a Gaussian process regression model with the maximum error of 6.65 m after 14 days of training without model adjustments. The results of the case study with the XJTLU dynamic database clearly demonstrate the limitations of static databases and the importance of the creation and adoption of dynamic databases for future indoor localization research and real-world deployment.  ( 3 min )
    FGAD: Self-boosted Knowledge Distillation for An Effective Federated Graph Anomaly Detection Framework
    arXiv:2402.12761v1 Announce Type: new Abstract: Graph anomaly detection (GAD) aims to identify anomalous graphs that significantly deviate from other ones, which has raised growing attention due to the broad existence and complexity of graph-structured data in many real-world scenarios. However, existing GAD methods usually execute with centralized training, which may lead to privacy leakage risk in some sensitive cases, thereby impeding collaboration among organizations seeking to collectively develop robust GAD models. Although federated learning offers a promising solution, the prevalent non-IID problems and high communication costs present significant challenges, particularly pronounced in collaborations with graph data distributed among different participants. To tackle these challenges, we propose an effective federated graph anomaly detection framework (FGAD). We first introduce an anomaly generator to perturb the normal graphs to be anomalous, and train a powerful anomaly detector by distinguishing generated anomalous graphs from normal ones. Then, we leverage a student model to distill knowledge from the trained anomaly detector (teacher model), which aims to maintain the personality of local models and alleviate the adverse impact of non-IID problems. Moreover, we design an effective collaborative learning mechanism that facilitates the personalization preservation of local models and significantly reduces communication costs among clients. Empirical results of the GAD tasks on non-IID graphs compared with state-of-the-art baselines demonstrate the superiority and efficiency of the proposed FGAD method.  ( 3 min )
    Guarantee Regions for Local Explanations
    arXiv:2402.12737v1 Announce Type: new Abstract: Interpretability methods that utilise local surrogate models (e.g. LIME) are very good at describing the behaviour of the predictive model at a point of interest, but they are not guaranteed to extrapolate to the local region surrounding the point. However, overfitting to the local curvature of the predictive model and malicious tampering can significantly limit extrapolation. We propose an anchor-based algorithm for identifying regions in which local explanations are guaranteed to be correct by explicitly describing those intervals along which the input features can be trusted. Our method produces an interpretable feature-aligned box where the prediction of the local surrogate model is guaranteed to match the predictive model. We demonstrate that our algorithm can be used to find explanations with larger guarantee regions that better cover the data manifold compared to existing baselines. We also show how our method can identify misleading local explanations with significantly poorer guarantee regions.  ( 2 min )
    Scalable and reliable deep transfer learning for intelligent fault detection via multi-scale neural processes embedded with knowledge
    arXiv:2402.12729v1 Announce Type: new Abstract: Deep transfer learning (DTL) is a fundamental method in the field of Intelligent Fault Detection (IFD). It aims to mitigate the degradation of method performance that arises from the discrepancies in data distribution between training set (source domain) and testing set (target domain). Considering the fact that fault data collection is challenging and certain faults are scarce, DTL-based methods face the limitation of available observable data, which reduces the detection performance of the methods in the target domain. Furthermore, DTL-based methods lack comprehensive uncertainty analysis that is essential for building reliable IFD systems. To address the aforementioned problems, this paper proposes a novel DTL-based method known as Neural Processes-based deep transfer learning with graph convolution network (GTNP). Feature-based transfer strategy of GTNP bridges the data distribution discrepancies of source domain and target domain in high-dimensional space. Both the joint modeling based on global and local latent variables and sparse sampling strategy reduce the demand of observable data in the target domain. The multi-scale uncertainty analysis is obtained by using the distribution characteristics of global and local latent variables. Global analysis of uncertainty enables GTNP to provide quantitative values that reflect the complexity of methods and the difficulty of tasks. Local analysis of uncertainty allows GTNP to model uncertainty (confidence of the fault detection result) at each sample affected by noise and bias. The validation of the proposed method is conducted across 3 IFD tasks, consistently showing the superior detection performance of GTNP compared to the other DTL-based methods.  ( 3 min )
    Equivariant Pretrained Transformer for Unified Geometric Learning on Multi-Domain 3D Molecules
    arXiv:2402.12714v1 Announce Type: new Abstract: Pretraining on a large number of unlabeled 3D molecules has showcased superiority in various scientific applications. However, prior efforts typically focus on pretraining models on a specific domain, either proteins or small molecules, missing the opportunity to leverage the cross-domain knowledge. To mitigate this gap, we introduce Equivariant Pretrained Transformer (EPT), a novel pretraining framework designed to harmonize the geometric learning of small molecules and proteins. To be specific, EPT unifies the geometric modeling of multi-domain molecules via the block-enhanced representation that can attend a broader context of each atom. Upon transformer framework, EPT is further enhanced with E(3) equivariance to facilitate the accurate representation of 3D structures. Another key innovation of EPT is its block-level pretraining task, which allows for joint pretraining on datasets comprising both small molecules and proteins. Experimental evaluations on a diverse group of benchmarks, including ligand binding affinity prediction, molecular property prediction, and protein property prediction, show that EPT significantly outperforms previous SOTA methods for affinity prediction, and achieves the best or comparable performance with existing domain-specific pretraining models for other tasks.  ( 2 min )
    Achieving Near-Optimal Regret for Bandit Algorithms with Uniform Last-Iterate Guarantee
    arXiv:2402.12711v1 Announce Type: new Abstract: Existing performance measures for bandit algorithms such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate the cumulative performance, while allowing the play of an arbitrarily bad arm at any finite time t. Such a behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger performance measure, the uniform last-iterate (ULI) guarantee, capturing both cumulative and instantaneous performance of bandit algorithms. Specifically, ULI characterizes the instantaneous performance since it ensures that the per-round regret of the played arm is bounded by a function, monotonically decreasing w.r.t. (large) round t, preventing revisits to bad arms when sufficient samples are available. We demonstrate that a near-optimal ULI guarantee directly implies near-optimal cumulative performance across aforementioned performance measures. To examine the achievability of ULI in the finite arm setting, we first provide two positive results that some elimination-based algorithms and high-probability adversarial algorithms with stronger analysis or additional designs, can attain near-optimal ULI guarantees. Then, we also provide a negative result, indicating that optimistic algorithms cannot achieve a near-optimal ULI guarantee. Finally, we propose an efficient algorithm for linear bandits with infinitely many arms, which achieves the ULI guarantee, given access to an optimization oracle.  ( 2 min )
    Learning on manifolds without manifold learning
    arXiv:2402.12687v1 Announce Type: new Abstract: Function approximation based on data drawn randomly from an unknown distribution is an important problem in machine learning. In contrast to the prevalent paradigm of solving this problem by minimizing a loss functional, we have given a direct one-shot construction together with optimal error bounds under the manifold assumption; i.e., one assumes that the data is sampled from an unknown sub-manifold of a high dimensional Euclidean space. A great deal of research deals with obtaining information about this manifold, such as the eigendecomposition of the Laplace-Beltrami operator or coordinate charts, and using this information for function approximation. This two step approach implies some extra errors in the approximation stemming from basic quantities of the data in addition to the errors inherent in function approximation. In Neural Networks, 132:253268, 2020, we have proposed a one-shot direct method to achieve function approximation without requiring the extraction of any information about the manifold other than its dimension. However, one cannot pin down the class of approximants used in that paper. In this paper, we view the unknown manifold as a sub-manifold of an ambient hypersphere and study the question of constructing a one-shot approximation using the spherical polynomials based on the hypersphere. Our approach does not require preprocessing of the data to obtain information about the manifold other than its dimension. We give optimal rates of approximation for relatively "rough" functions.  ( 2 min )
    TorchCP: A Library for Conformal Prediction based on PyTorch
    arXiv:2402.12683v1 Announce Type: new Abstract: TorchCP is a Python toolbox for conformal prediction research on deep learning models. It contains various implementations for posthoc and training methods for classification and regression tasks (including multi-dimension output). TorchCP is built on PyTorch (Paszke et al., 2019) and leverages the advantages of matrix computation to provide concise and efficient inference implementations. The code is licensed under the LGPL license and is open-sourced at $\href{https://github.com/ml-stat-Sustech/TorchCP}{\text{this https URL}}$.  ( 2 min )
    Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies
    arXiv:2402.12673v1 Announce Type: new Abstract: In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to \textit{refine} the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a finite policy class $\Tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite $\Tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal $\Tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on the Mujoco corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.  ( 3 min )
    Discriminant Distance-Aware Representation on Deterministic Uncertainty Quantification Methods
    arXiv:2402.12664v1 Announce Type: new Abstract: Uncertainty estimation is a crucial aspect of deploying dependable deep learning models in safety-critical systems. In this study, we introduce a novel and efficient method for deterministic uncertainty estimation called Discriminant Distance-Awareness Representation (DDAR). Our approach involves constructing a DNN model that incorporates a set of prototypes in its latent representations, enabling us to analyze valuable feature information from the input data. By leveraging a distinction maximization layer over optimal trainable prototypes, DDAR can learn a discriminant distance-awareness representation. We demonstrate that DDAR overcomes feature collapse by relaxing the Lipschitz constraint that hinders the practicality of deterministic uncertainty methods (DUMs) architectures. Our experiments show that DDAR is a flexible and architecture-agnostic method that can be easily integrated as a pluggable layer with distance-sensitive metrics, outperforming state-of-the-art uncertainty estimation methods on multiple benchmark problems.  ( 2 min )
    HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts
    arXiv:2402.12656v1 Announce Type: new Abstract: The Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing. Despite the success, most existing methods face a challenge for balance between sparsity and the availability of expert knowledge: enhancing performance through increased use of expert knowledge often results in diminishing sparsity during expert selection. To mitigate this contradiction, we propose HyperMoE, a novel MoE framework built upon Hypernetworks. This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning. Specific modules generated based on the information of unselected experts serve as supplementary information, which allows the knowledge of experts not selected to be used while maintaining selection sparsity. Our comprehensive empirical evaluations across multiple datasets and backbones establish that HyperMoE significantly outperforms existing MoE methods under identical conditions concerning the number of experts.  ( 2 min )
    A Comprehensive Review of Machine Learning Advances on Data Change: A Cross-Field Perspective
    arXiv:2402.12627v1 Announce Type: new Abstract: Recent artificial intelligence (AI) technologies show remarkable evolution in various academic fields and industries. However, in the real world, dynamic data lead to principal challenges for deploying AI models. An unexpected data change brings about severe performance degradation in AI models. We identify two major related research fields, domain shift and concept drift according to the setting of the data change. Although these two popular research fields aim to solve distribution shift and non-stationary data stream problems, the underlying properties remain similar which also encourages similar technical approaches. In this review, we regroup domain shift and concept drift into a single research problem, namely the data change problem, with a systematic overview of state-of-the-art methods in the two research fields. We propose a three-phase problem categorization scheme to link the key ideas in the two technical fields. We thus provide a novel scope for researchers to explore contemporary technical strategies, learn industrial applications, and identify future directions for addressing data change challenges.  ( 2 min )
    Indiscriminate Data Poisoning Attacks on Pre-trained Feature Extractors
    arXiv:2402.12626v1 Announce Type: new Abstract: Machine learning models have achieved great success in supervised learning tasks for end-to-end training, which requires a large amount of labeled data that is not always feasible. Recently, many practitioners have shifted to self-supervised learning methods that utilize cheap unlabeled data to learn a general feature extractor via pre-training, which can be further applied to personalized downstream tasks by simply training an additional linear layer with limited labeled data. However, such a process may also raise concerns regarding data poisoning attacks. For instance, indiscriminate data poisoning attacks, which aim to decrease model utility by injecting a small number of poisoned data into the training set, pose a security risk to machine learning models, but have only been studied for end-to-end supervised learning. In this paper, we extend the exploration of the threat of indiscriminate attacks on downstream tasks that apply pre-trained feature extractors. Specifically, we propose two types of attacks: (1) the input space attacks, where we modify existing attacks to directly craft poisoned data in the input space. However, due to the difficulty of optimization under constraints, we further propose (2) the feature targeted attacks, where we mitigate the challenge with three stages, firstly acquiring target parameters for the linear head; secondly finding poisoned features by treating the learned feature representations as a dataset; and thirdly inverting the poisoned features back to the input space. Our experiments examine such attacks in popular downstream tasks of fine-tuning on the same dataset and transfer learning that considers domain adaptation. Empirical results reveal that transfer learning is more vulnerable to our attacks. Additionally, input space attacks are a strong threat if no countermeasures are posed, but are otherwise weaker than feature targeted attacks.  ( 3 min )
    Reflect-RL: Two-Player Online RL Fine-Tuning for LMs
    arXiv:2402.12621v1 Announce Type: new Abstract: As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works attempted to directly train the LMs within interactive decision-making environments. We aim to create an effective mechanism to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using online RL, where a frozen reflection model assists the policy model. To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we designed single-prompt action enumeration and applied curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Testing results indicate GPT-2-xl after Reflect-RL also outperforms those of untuned pre-trained LMs, such as Mistral 7B.  ( 2 min )
    Compact NSGA-II for Multi-objective Feature Selection
    arXiv:2402.12625v1 Announce Type: new Abstract: Feature selection is an expensive challenging task in machine learning and data mining aimed at removing irrelevant and redundant features. This contributes to an improvement in classification accuracy, as well as the budget and memory requirements for classification, or any other post-processing task conducted after feature selection. In this regard, we define feature selection as a multi-objective binary optimization task with the objectives of maximizing classification accuracy and minimizing the number of selected features. In order to select optimal features, we have proposed a binary Compact NSGA-II (CNSGA-II) algorithm. Compactness represents the population as a probability distribution to enhance evolutionary algorithms not only to be more memory-efficient but also to reduce the number of fitness evaluations. Instead of holding two populations during the optimization process, our proposed method uses several Probability Vectors (PVs) to generate new individuals. Each PV efficiently explores a region of the search space to find non-dominated solutions instead of generating candidate solutions from a small population as is the common approach in most evolutionary algorithms. To the best of our knowledge, this is the first compact multi-objective algorithm proposed for feature selection. The reported results for expensive optimization cases with a limited budget on five datasets show that the CNSGA-II performs more efficiently than the well-known NSGA-II method in terms of the hypervolume (HV) performance metric requiring less memory. The proposed method and experimental results are explained and analyzed in detail.  ( 3 min )
    Analysis of Using Sigmoid Loss for Contrastive Learning
    arXiv:2402.12613v1 Announce Type: new Abstract: Contrastive learning has emerged as a prominent branch of self-supervised learning for several years. Especially, CLIP, which applies contrastive learning to large sets of captioned images, has garnered significant attention. Recently, SigLIP, a variant of CLIP, has been proposed, which uses the sigmoid loss instead of the standard InfoNCE loss. SigLIP achieves the performance comparable to CLIP in a more efficient manner by eliminating the need for a global view. However, theoretical understanding of using the sigmoid loss in contrastive learning is underexplored. In this paper, we provide a theoretical analysis of using the sigmoid loss in contrastive learning, in the perspective of the geometric structure of learned embeddings. First, we propose the double-Constant Embedding Model (CCEM), a framework for parameterizing various well-known embedding structures by a single variable. Interestingly, the proposed CCEM is proven to contain the optimal embedding with respect to the sigmoid loss. Second, we mathematically analyze the optimal embedding minimizing the sigmoid loss for contrastive learning. The optimal embedding ranges from simplex equiangular-tight-frame to antipodal structure, depending on the temperature parameter used in the sigmoid loss. Third, our experimental results on synthetic datasets coincide with the theoretical results on the optimal embedding structures.  ( 2 min )
    Multi-objective Binary Coordinate Search for Feature Selection
    arXiv:2402.12616v1 Announce Type: new Abstract: A supervised feature selection method selects an appropriate but concise set of features to differentiate classes, which is highly expensive for large-scale datasets. Therefore, feature selection should aim at both minimizing the number of selected features and maximizing the accuracy of classification, or any other task. However, this crucial task is computationally highly demanding on many real-world datasets and requires a very efficient algorithm to reach a set of optimal features with a limited number of fitness evaluations. For this purpose, we have proposed the binary multi-objective coordinate search (MOCS) algorithm to solve large-scale feature selection problems. To the best of our knowledge, the proposed algorithm in this paper is the first multi-objective coordinate search algorithm. In this method, we generate new individuals by flipping a variable of the candidate solutions on the Pareto front. This enables us to investigate the effectiveness of each feature in the corresponding subset. In fact, this strategy can play the role of crossover and mutation operators to generate distinct subsets of features. The reported results indicate the significant superiority of our method over NSGA-II, on five real-world large-scale datasets, particularly when the computing budget is limited. Moreover, this simple hyper-parameter-free algorithm can solve feature selection much faster and more efficiently than NSGA-II.  ( 3 min )
    FairProof : Confidential and Certifiable Fairness for Neural Networks
    arXiv:2402.12572v1 Announce Type: new Abstract: Machine learning models are increasingly used in societal applications, yet legal and privacy concerns demand that they very often be kept confidential. Consequently, there is a growing distrust about the fairness properties of these models in the minds of consumers, who are often at the receiving end of model predictions. To this end, we propose FairProof - a system that uses Zero-Knowledge Proofs (a cryptographic primitive) to publicly verify the fairness of a model, while maintaining confidentiality. We also propose a fairness certification algorithm for fully-connected neural networks which is befitting to ZKPs and is used in this system. We implement FairProof in Gnark and demonstrate empirically that our system is practically feasible.  ( 2 min )
    Graph-based Virtual Sensing from Sparse and Partial Multivariate Observations
    arXiv:2402.12598v1 Announce Type: new Abstract: Virtual sensing techniques allow for inferring signals at new unmonitored locations by exploiting spatio-temporal measurements coming from physical sensors at different locations. However, as the sensor coverage becomes sparse due to costs or other constraints, physical proximity cannot be used to support interpolation. In this paper, we overcome this challenge by leveraging dependencies between the target variable and a set of correlated variables (covariates) that can frequently be associated with each location of interest. From this viewpoint, covariates provide partial observability, and the problem consists of inferring values for unobserved channels by exploiting observations at other locations to learn how such variables can correlate. We introduce a novel graph-based methodology to exploit such relationships and design a graph deep learning architecture, named GgNet, implementing the framework. The proposed approach relies on propagating information over a nested graph structure that is used to learn dependencies between variables as well as locations. GgNet is extensively evaluated under different virtual sensing scenarios, demonstrating higher reconstruction accuracy compared to the state-of-the-art.  ( 2 min )
    Dynamic Pricing and Learning with Long-term Reference Effects
    arXiv:2402.12562v1 Announce Type: new Abstract: We consider a dynamic pricing problem where customer response to the current price is impacted by the customer price expectation, aka reference price. We study a simple and novel reference price mechanism where reference price is the average of the past prices offered by the seller. As opposed to the more commonly studied exponential smoothing mechanism, in our reference price mechanism the prices offered by seller have a longer term effect on the future customer expectations. We show that under this mechanism, a markdown policy is near-optimal irrespective of the parameters of the model. This matches the common intuition that a seller may be better off by starting with a higher price and then decreasing it, as the customers feel like they are getting bargains on items that are ordinarily more expensive. For linear demand models, we also provide a detailed characterization of the near-optimal markdown policy along with an efficient way of computing it. We then consider a more challenging dynamic pricing and learning problem, where the demand model parameters are apriori unknown, and the seller needs to learn them online from the customers' responses to the offered prices while simultaneously optimizing revenue. The objective is to minimize regret, i.e., the $T$-round revenue loss compared to a clairvoyant optimal policy. This task essentially amounts to learning a non-stationary optimal policy in a time-variant Markov Decision Process (MDP). For linear demand models, we provide an efficient learning algorithm with an optimal $\tilde{O}(\sqrt{T})$ regret upper bound.  ( 3 min )
    Offline Multi-task Transfer RL with Representational Penalization
    arXiv:2402.12570v1 Announce Type: new Abstract: We study the problem of representation transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in finding a good policy for a target task. Unlike in online RL where the agent interacts with the environment while learning a policy, in the offline setting there cannot be such interactions in either the source tasks or the target task; thus multi-task offline RL can suffer from incomplete coverage. We propose an algorithm to compute pointwise uncertainty measures for the learnt representation, and establish a data-dependent upper bound for the suboptimality of the learnt policy for the target task. Our algorithm leverages the collective exploration done by source tasks to mitigate poor coverage at some points by a few tasks, thus overcoming the limitation of needing uniformly good coverage for a meaningful transfer by existing offline algorithms. We complement our theoretical results with empirical evaluation on a rich-observation MDP which requires many samples for complete coverage. Our findings illustrate the benefits of penalizing and quantifying the uncertainty in the learnt representation.  ( 2 min )
    Hierarchical Bayes Approach to Personalized Federated Unsupervised Learning
    arXiv:2402.12537v1 Announce Type: new Abstract: Statistical heterogeneity of clients' local data is an important characteristic in federated learning, motivating personalized algorithms tailored to the local data statistics. Though there has been a plethora of algorithms proposed for personalized supervised learning, discovering the structure of local data through personalized unsupervised learning is less explored. We initiate a systematic study of such personalized unsupervised learning by developing algorithms based on optimization criteria inspired by a hierarchical Bayesian statistical framework. We develop adaptive algorithms that discover the balance between using limited local data and collaborative information. We do this in the context of two unsupervised learning tasks: personalized dimensionality reduction and personalized diffusion models. We develop convergence analyses for our adaptive algorithms which illustrate the dependence on problem parameters (e.g., heterogeneity, local sample size). We also develop a theoretical framework for personalized diffusion models, which shows the benefits of collaboration even under heterogeneity. We finally evaluate our proposed algorithms using synthetic and real data, demonstrating the effective sample amplification for personalized tasks, induced through collaboration, despite data heterogeneity.  ( 2 min )
    The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning
    arXiv:2402.12527v1 Announce Type: new Abstract: Offline reinforcement learning aims to enable agents to be trained from pre-collected datasets, however, this comes with the added challenge of estimating the value of behavior not covered in the dataset. Model-based methods offer a solution by allowing agents to collect additional synthetic data via rollouts in a learned dynamics model. The prevailing theoretical understanding is that this can then be viewed as online reinforcement learning in an approximate dynamics model, and any remaining gap is therefore assumed to be due to the imperfect dynamics model. Surprisingly, however, we find that if the learned dynamics model is replaced by the true error-free dynamics, existing model-based methods completely fail. This reveals a major misconception. Our subsequent investigation finds that the general procedure used in model-based algorithms results in the existence of a set of edge-of-reach states which trigger pathological value overestimation and collapse in Bellman-based algorithms. We term this the edge-of-reach problem. Based on this, we fill some gaps in existing theory and also explain how prior model-based methods are inadvertently addressing the true underlying edge-of-reach problem. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and achieves strong performance across both proprioceptive and pixel-based benchmarks. Code open-sourced at: https://github.com/anyasims/edge-of-reach.  ( 2 min )
    Induced Model Matching: How Restricted Models Can Help Larger Ones
    arXiv:2402.12513v1 Announce Type: new Abstract: We consider scenarios where a very accurate predictive model using restricted features is available at the time of training of a larger, full-featured, model. This restricted model may be thought of as "side-information", derived either from an auxiliary exhaustive dataset or on the same dataset, by forcing the restriction. How can the restricted model be useful to the full model? We propose an approach for transferring the knowledge of the restricted model to the full model, by aligning the full model's context-restricted performance with that of the restricted model's. We call this methodology Induced Model Matching (IMM) and first illustrate its general applicability by using logistic regression as a toy example. We then explore IMM's use in language modeling, the application that initially inspired it, and where it offers an explicit foundation in contrast to the implicit use of restricted models in techniques such as noising. We demonstrate the methodology on both LSTM and transformer full models, using $N$-grams as restricted models. To further illustrate the potential of the principle whenever it is much cheaper to collect restricted rather than full information, we conclude with a simple RL example where POMDP policies can improve learned MDP policies via IMM.  ( 2 min )
    PARCv2: Physics-aware Recurrent Convolutional Neural Networks for Spatiotemporal Dynamics Modeling
    arXiv:2402.12503v1 Announce Type: new Abstract: Modeling unsteady, fast transient, and advection-dominated physics problems is a pressing challenge for physics-aware deep learning (PADL). The physics of complex systems is governed by large systems of partial differential equations (PDEs) and ancillary constitutive models with nonlinear structures, as well as evolving state fields exhibiting sharp gradients and rapidly deforming material interfaces. Here, we investigate an inductive bias approach that is versatile and generalizable to model generic nonlinear field evolution problems. Our study focuses on the recent physics-aware recurrent convolutions (PARC), which incorporates a differentiator-integrator architecture that inductively models the spatiotemporal dynamics of generic physical systems. We extend the capabilities of PARC to simulate unsteady, transient, and advection-dominant systems. The extended model, referred to as PARCv2, is equipped with differential operators to model advection-reaction-diffusion equations, as well as a hybrid integral solver for stable, long-time predictions. PARCv2 is tested on both standard benchmark problems in fluid dynamics, namely Burgers and Navier-Stokes equations, and then applied to more complex shock-induced reaction problems in energetic materials. We evaluate the behavior of PARCv2 in comparison to other physics-informed and learning bias models and demonstrate its potential to model unsteady and advection-dominant dynamics regimes.  ( 2 min )
    Towards Cross-Domain Continual Learning
    arXiv:2402.12490v1 Announce Type: new Abstract: Continual learning is a process that involves training learning agents to sequentially master a stream of tasks or classes without revisiting past data. The challenge lies in leveraging previously acquired knowledge to learn new tasks efficiently, while avoiding catastrophic forgetting. Existing methods primarily focus on single domains, restricting their applicability to specific problems. In this work, we introduce a novel approach called Cross-Domain Continual Learning (CDCL) that addresses the limitations of being limited to single supervised domains. Our method combines inter- and intra-task cross-attention mechanisms within a compact convolutional network. This integration enables the model to maintain alignment with features from previous tasks, thereby delaying the data drift that may occur between tasks, while performing unsupervised cross-domain (UDA) between related domains. By leveraging an intra-task-specific pseudo-labeling method, we ensure accurate input pairs for both labeled and unlabeled samples, enhancing the learning process. To validate our approach, we conduct extensive experiments on public UDA datasets, showcasing its positive performance on cross-domain continual learning challenges. Additionally, our work introduces incremental ideas that contribute to the advancement of this field. We make our code and models available to encourage further exploration and reproduction of our results: \url{https://github.com/Ivsucram/CDCL}  ( 2 min )
    In deep reinforcement learning, a pruned network is a good network
    arXiv:2402.12479v1 Announce Type: new Abstract: Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.  ( 2 min )
    Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data
    arXiv:2402.12424v1 Announce Type: new Abstract: In this paper, we investigate the effectiveness of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analysis extends across six benchmarks for table-related tasks such as question-answering and fact-checking. We introduce for the first time the assessment of LLMs' performance on image-based table representations. Specifically, we compare five text-based and three image-based table representations, demonstrating the influence of representation and prompting on LLM performance. Our study provides insights into the effective use of LLMs on table-related tasks.  ( 2 min )
    Neuro-mimetic Task-free Unsupervised Online Learning with Continual Self-Organizing Maps
    arXiv:2402.12465v1 Announce Type: new Abstract: An intelligent system capable of continual learning is one that can process and extract knowledge from potentially infinitely long streams of pattern vectors. The major challenge that makes crafting such a system difficult is known as catastrophic forgetting - an agent, such as one based on artificial neural networks (ANNs), struggles to retain previously acquired knowledge when learning from new samples. Furthermore, ensuring that knowledge is preserved for previous tasks becomes more challenging when input is not supplemented with task boundary information. Although forgetting in the context of ANNs has been studied extensively, there still exists far less work investigating it in terms of unsupervised architectures such as the venerable self-organizing map (SOM), a neural model often used in clustering and dimensionality reduction. While the internal mechanisms of SOMs could, in principle, yield sparse representations that improve memory retention, we observe that, when a fixed-size SOM processes continuous data streams, it experiences concept drift. In light of this, we propose a generalization of the SOM, the continual SOM (CSOM), which is capable of online unsupervised learning under a low memory budget. Our results, on benchmarks including MNIST, Kuzushiji-MNIST, and Fashion-MNIST, show almost a two times increase in accuracy, and CIFAR-10 demonstrates a state-of-the-art result when tested on (online) unsupervised class incremental learning setting.  ( 2 min )
    EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs
    arXiv:2402.12419v1 Announce Type: new Abstract: Existing methods for fine-tuning sparse LLMs often suffer from resource-intensive requirements and high retraining costs. Additionally, many fine-tuning methods often rely on approximations or heuristic optimization strategies, which may lead to suboptimal solutions. To address these issues, we propose an efficient and fast framework for fine-tuning sparse LLMs based on minimizing reconstruction error. Our approach involves sampling a small dataset for calibration and utilizing backpropagation to iteratively optimize block-wise reconstruction error, on a block-by-block basis, aiming for optimal solutions. Extensive experiments on various benchmarks consistently demonstrate the superiority of our method over other baselines. For instance, on the Wikitext2 dataset with LlamaV1-7B at 70% sparsity, our proposed EBFT achieves a perplexity of 16.88, surpassing the state-of-the-art DSnoT with a perplexity of 75.14. Moreover, with a structured sparsity ratio of 26\%, EBFT achieves a perplexity of 16.27, outperforming LoRA (perplexity 16.44). Furthermore, the fine-tuning process of EBFT for LlamaV1-7B only takes approximately 30 minutes, and the entire framework can be executed on a single 16GB GPU. The source code is available at https://github.com/sunggo/EBFT.  ( 2 min )
    Beyond Uniform Scaling: Exploring Depth Heterogeneity in Neural Architectures
    arXiv:2402.12418v1 Announce Type: new Abstract: Conventional scaling of neural networks typically involves designing a base network and growing different dimensions like width, depth, etc. of the same by some predefined scaling factors. We introduce an automated scaling approach leveraging second-order loss landscape information. Our method is flexible towards skip connections a mainstay in modern vision transformers. Our training-aware method jointly scales and trains transformers without additional training iterations. Motivated by the hypothesis that not all neurons need uniform depth complexity, our approach embraces depth heterogeneity. Extensive evaluations on DeiT-S with ImageNet100 show a 2.5% accuracy gain and 10% parameter efficiency improvement over conventional scaling. Scaled networks demonstrate superior performance upon training small scale datasets from scratch. We introduce the first intact scaling mechanism for vision transformers, a step towards efficient model scaling.  ( 2 min )
    Predicting trucking accidents with truck drivers 'safety climate perception across companies: A transfer learning approach
    arXiv:2402.12417v1 Announce Type: new Abstract: There is a rising interest in using artificial intelligence (AI)-powered safety analytics to predict accidents in the trucking industry. Companies may face the practical challenge, however, of not having enough data to develop good safety analytics models. Although pretrained models may offer a solution for such companies, existing safety research using transfer learning has mostly focused on computer vision and natural language processing, rather than accident analytics. To fill the above gap, we propose a pretrain-then-fine-tune transfer learning approach to help any company leverage other companies' data to develop AI models for a more accurate prediction of accident risk. We also develop SafeNet, a deep neural network algorithm for classification tasks suitable for accident prediction. Using the safety climate survey data from seven trucking companies with different data sizes, we show that our proposed approach results in better model performance compared to training the model from scratch using only the target company's data. We also show that for the transfer learning model to be effective, the pretrained model should be developed with larger datasets from diverse sources. The trucking industry may, thus, consider pooling safety analytics data from a wide range of companies to develop pretrained models and share them within the industry for better knowledge and resource transfer. The above contributions point to the promise of advanced safety analytics to make the industry safer and more sustainable.  ( 3 min )
    ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation
    arXiv:2402.12408v1 Announce Type: new Abstract: The rapid advancement of Large Language Models (LLMs) has revolutionized various sectors by automating routine tasks, marking a step toward the realization of Artificial General Intelligence (AGI). However, they still struggle to accommodate the diverse and specific needs of users and simplify the utilization of AI models for the average user. In response, we propose ModelGPT, a novel framework designed to determine and generate AI models specifically tailored to the data or task descriptions provided by the user, leveraging the capabilities of LLMs. Given user requirements, ModelGPT is able to provide tailored models at most 270x faster than the previous paradigms (e.g. all-parameter or LoRA finetuning). Comprehensive experiments on NLP, CV, and Tabular datasets attest to the effectiveness of our framework in making AI models more accessible and user-friendly. Our code is available at https://github.com/IshiKura-a/ModelGPT.  ( 2 min )
    Primary and Secondary Factor Consistency as Domain Knowledge to Guide Happiness Computing in Online Assessment
    arXiv:2402.12398v1 Announce Type: new Abstract: Happiness computing based on large-scale online web data and machine learning methods is an emerging research topic that underpins a range of issues, from personal growth to social stability. Many advanced Machine Learning (ML) models with explanations are used to compute the happiness online assessment while maintaining high accuracy of results. However, domain knowledge constraints, such as the primary and secondary relations of happiness factors, are absent from these models, which limits the association between computing results and the right reasons for why they occurred. This article attempts to provide new insights into the explanation consistency from an empirical study perspective. Then we study how to represent and introduce domain knowledge constraints to make ML models more trustworthy. We achieve this through: (1) proving that multiple prediction models with additive factor attributions will have the desirable property of primary and secondary relations consistency, and (2) showing that factor relations with quantity can be represented as an importance distribution for encoding domain knowledge. Factor explanation difference is penalized by the Kullback-Leibler divergence-based loss among computing models. Experimental results using two online web datasets show that domain knowledge of stable factor relations exists. Using this knowledge not only improves happiness computing accuracy but also reveals more significative happiness factors for assisting decisions well.  ( 3 min )
    Turn Waste into Worth: Rectifying Top-$k$ Router of MoE
    arXiv:2402.12399v1 Announce Type: new Abstract: Sparse Mixture of Experts (MoE) models are popular for training large language models due to their computational efficiency. However, the commonly used top-$k$ routing mechanism suffers from redundancy computation and memory costs due to the unbalanced routing. Some experts are overflow, where the exceeding tokens are dropped. While some experts are vacant, which are padded with zeros, negatively impacting model performance. To address the dropped tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU Rectification and the Fill-in Rectification. The Intra-GPU Rectification handles dropped tokens, efficiently routing them to experts within the GPU where they are located to avoid inter-GPU communication. The Fill-in Rectification addresses padding by replacing padding tokens with the tokens that have high routing scores. Our experimental results demonstrate that the Intra-GPU Rectification and the Fill-in Rectification effectively handle dropped tokens and padding, respectively. Furthermore, the combination of them achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.  ( 2 min )
  • Open

    Efficient adjustment for complex covariates: Gaining efficiency with DOPE
    arXiv:2402.12980v1 Announce Type: cross Abstract: Covariate adjustment is a ubiquitous method used to estimate the average treatment effect (ATE) from observational data. Assuming a known graphical structure of the data generating model, recent results give graphical criteria for optimal adjustment, which enables efficient estimation of the ATE. However, graphical approaches are challenging for high-dimensional and complex data, and it is not straightforward to specify a meaningful graphical model of non-Euclidean data such as texts. We propose an general framework that accommodates adjustment for any subset of information expressed by the covariates. We generalize prior works and leverage these results to identify the optimal covariate information for efficient adjustment. This information is minimally sufficient for prediction of the outcome conditionally on treatment. Based on our theoretical results, we propose the Debiased Outcome-adapted Propensity Estimator (DOPE) for efficient estimation of the ATE, and we provide asymptotic results for the DOPE under general conditions. Compared to the augmented inverse propensity weighted (AIPW) estimator, the DOPE can retain its efficiency even when the covariates are highly predictive of treatment. We illustrate this with a single-index model, and with an implementation of the DOPE based on neural networks, we demonstrate its performance on simulated and real data. Our results show that the DOPE provides an efficient and robust methodology for ATE estimation in various observational settings.  ( 2 min )
    Learning under Singularity: An Information Criterion improving WBIC and sBIC
    arXiv:2402.12762v1 Announce Type: new Abstract: We introduce a novel Information Criterion (IC), termed Learning under Singularity (LS), designed to enhance the functionality of the Widely Applicable Bayes Information Criterion (WBIC) and the Singular Bayesian Information Criterion (sBIC). LS is effective without regularity constraints and demonstrates stability. Watanabe defined a statistical model or a learning machine as regular if the mapping from a parameter to a probability distribution is one-to-one and its Fisher information matrix is positive definite. In contrast, models not meeting these conditions are termed singular. Over the past decade, several information criteria for singular cases have been proposed, including WBIC and sBIC. WBIC is applicable in non-regular scenarios but faces challenges with large sample sizes and redundant estimation of known learning coefficients. Conversely, sBIC is limited in its broader application due to its dependence on maximum likelihood estimates. LS addresses these limitations by enhancing the utility of both WBIC and sBIC. It incorporates the empirical loss from the Widely Applicable Information Criterion (WAIC) to represent the goodness of fit to the statistical model, along with a penalty term similar to that of sBIC. This approach offers a flexible and robust method for model selection, free from regularity constraints.  ( 2 min )
    Order-Optimal Regret in Distributed Kernel Bandits using Uniform Sampling with Shared Randomness
    arXiv:2402.13182v1 Announce Type: cross Abstract: We consider distributed kernel bandits where $N$ agents aim to collaboratively maximize an unknown reward function that lies in a reproducing kernel Hilbert space. Each agent sequentially queries the function to obtain noisy observations at the query points. Agents can share information through a central server, with the objective of minimizing regret that is accumulating over time $T$ and aggregating over agents. We develop the first algorithm that achieves the optimal regret order (as defined by centralized learning) with a communication cost that is sublinear in both $N$ and $T$. The key features of the proposed algorithm are the uniform exploration at the local agents and shared randomness with the central server. Working together with the sparse approximation of the GP model, these two key components make it possible to preserve the learning rate of the centralized setting at a diminishing rate of communication.  ( 2 min )
    Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
    arXiv:2402.12875v1 Announce Type: cross Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.  ( 2 min )
    SGD with Clipping is Secretly Estimating the Median Gradient
    arXiv:2402.12828v1 Announce Type: new Abstract: There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise. We then derive iterative methods based on the stochastic proximal point method for computing the geometric median and generalizations thereof. Finally we propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.  ( 2 min )
    Contextual Directed Acyclic Graphs
    arXiv:2310.15627v2 Announce Type: replace Abstract: Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a significant challenge in machine learning. Most research in this area concentrates on learning a single DAG for the entire population. This paper considers an alternative setting where the graph structure varies across individuals based on available "contextual" features. We tackle this contextual DAG problem via a neural network that maps the contextual features to a DAG, represented as a weighted adjacency matrix. The neural network is equipped with a novel projection layer that ensures the output matrices are sparse and satisfy a recently developed characterization of acyclicity. We devise a scalable computational framework for learning contextual DAGs and provide a convergence guarantee and an analytical gradient for backpropagating through the projection layer. Our experiments suggest that the new approach can recover the true context-specific graph where existing approaches fail.  ( 2 min )
    Testing Calibration in Subquadratic Time
    arXiv:2402.13187v1 Announce Type: cross Abstract: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on (predictions, binary outcomes), our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We design an algorithm based on approximate linear programming, which solves calibration testing information-theoretically optimally (up to constant factors) in time $O(n^{1.5} \log(n))$. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem, and give sample complexity lower bounds for alternative calibration distances to the one considered in this work. Finally, we present preliminary experiments showing that the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale to accommodate moderate sample sizes.  ( 2 min )
    Kernel Single Proxy Control for Deterministic Confounding
    arXiv:2308.04585v3 Announce Type: replace Abstract: We consider the problem of causal effect estimation with an unobserved confounder, where we observe a proxy variable that is associated with the confounder. Although Proxy causal learning (PCL) uses two proxy variables to recover the true causal effect, we show that a single proxy variable is sufficient for causal estimation if the outcome is generated deterministically, generalizing Control Outcome Calibration Approach (COCA). We propose two kernel-based methods for this setting: the first based on the two-stage regression approach, and the second based on a maximum moment restriction approach. We prove that both approaches can consistently estimate the causal effect, and we empirically demonstrate that we can successfully recover the causal effect on challenging synthetic benchmarks.  ( 2 min )
    Dealing with uncertainty: balancing exploration and exploitation in deep recurrent reinforcement learning
    arXiv:2310.08331v2 Announce Type: replace Abstract: Incomplete knowledge of the environment leads an agent to make decisions under uncertainty. One of the major dilemmas in Reinforcement Learning (RL) where an autonomous agent has to balance two contrasting needs in making its decisions is: exploiting the current knowledge of the environment to maximize the cumulative reward as well as exploring actions that allow improving the knowledge of the environment, hopefully leading to higher reward values (exploration-exploitation trade-off). Concurrently, another relevant issue regards the full observability of the states, which may not be assumed in all applications. For instance, when 2D images are considered as input in an RL approach used for finding the best actions within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques to balance exploration and exploitation trade-off on partially observable systems for predicting steering wheels in autonomous driving scenarios. More precisely, the final aim is to investigate the effects of using both adaptive and deterministic exploration strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted and evaluated the impact of a modified quadratic loss function to improve the learning phase of the underlying Convolutional Recurrent Neural Network. We show that adaptive methods better approximate the trade-off between exploration and exploitation and, in general, Softmax and Max-Boltzmann strategies outperform epsilon-greedy techniques.  ( 3 min )
    Fast Rates in Online Convex Optimization by Exploiting the Curvature of Feasible Sets
    arXiv:2402.12868v1 Announce Type: cross Abstract: In this paper, we explore online convex optimization (OCO) and introduce a new analysis that provides fast rates by exploiting the curvature of feasible sets. In online linear optimization, it is known that if the average gradient of loss functions is larger than a certain value, the curvature of feasible sets can be exploited by the follow-the-leader (FTL) algorithm to achieve a logarithmic regret. This paper reveals that algorithms adaptive to the curvature of loss functions can also leverage the curvature of feasible sets. We first prove that if an optimal decision is on the boundary of a feasible set and the gradient of an underlying loss function is non-zero, then the algorithm achieves a regret upper bound of $O(\rho \log T)$ in stochastic environments. Here, $\rho > 0$ is the radius of the smallest sphere that includes the optimal decision and encloses the feasible set. Our approach, unlike existing ones, can work directly with convex loss functions, exploiting the curvature of loss functions simultaneously, and can achieve the logarithmic regret only with a local property of feasible sets. Additionally, it achieves an $O(\sqrt{T})$ regret even in adversarial environments where FTL suffers an $\Omega(T)$ regret, and attains an $O(\rho \log T + \sqrt{C \rho \log T})$ regret bound in corrupted stochastic environments with corruption level $C$. Furthermore, by extending our analysis, we establish a regret upper bound of $O\Big(T^{\frac{q-2}{2(q-1)}} (\log T)^{\frac{q}{2(q-1)}}\Big)$ for $q$-uniformly convex feasible sets, where uniformly convex sets include strongly convex sets and $\ell_p$-balls for $p \in [1,\infty)$. This bound bridges the gap between the $O(\log T)$ regret bound for strongly convex sets ($q=2$) and the $O(\sqrt{T})$ regret bound for non-curved sets ($q\to\infty$).  ( 3 min )
    Integrating Active Learning in Causal Inference with Interference: A Novel Approach in Online Experiments
    arXiv:2402.12710v1 Announce Type: cross Abstract: In the domain of causal inference research, the prevalent potential outcomes framework, notably the Rubin Causal Model (RCM), often overlooks individual interference and assumes independent treatment effects. This assumption, however, is frequently misaligned with the intricate realities of real-world scenarios, where interference is not merely a possibility but a common occurrence. Our research endeavors to address this discrepancy by focusing on the estimation of direct and spillover treatment effects under two assumptions: (1) network-based interference, where treatments on neighbors within connected networks affect one's outcomes, and (2) non-random treatment assignments influenced by confounders. To improve the efficiency of estimating potentially complex effects functions, we introduce an novel active learning approach: Active Learning in Causal Inference with Interference (ACI). This approach uses Gaussian process to flexibly model the direct and spillover treatment effects as a function of a continuous measure of neighbors' treatment assignment. The ACI framework sequentially identifies the experimental settings that demand further data. It further optimizes the treatment assignments under the network interference structure using genetic algorithms to achieve efficient learning outcome. By applying our method to simulation data and a Tencent game dataset, we demonstrate its feasibility in achieving accurate effects estimations with reduced data requirements. This ACI approach marks a significant advancement in the realm of data efficiency for causal inference, offering a robust and efficient alternative to traditional methodologies, particularly in scenarios characterized by complex interference patterns.  ( 3 min )
    Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning
    arXiv:2109.03445v5 Announce Type: replace Abstract: Ever since its introduction in the classic paper of Robbins and Monro in 1951, Stochastic Approximation (SA) has become a standard tool for finding a solution of an equation of the form $f(\theta) = 0$, when only noisy measurements of $f(\cdot)$ are available. In most situations, \textit{every component} of the putative solution $\theta_t$ is updated at each step $t$. In some applications such as $Q$-learning, a key technique in Reinforcement Learning (RL), \textit{only one component} of $\theta_t$ is updated at each $t$. This is known as \textbf{asynchronous} SA. The topic of study in the present paper is to study \textbf{Block Asynchronous SA (BASA)}, in which, at each step $t$, \textit{some but not necessarily all} components of $\theta_t$ are updated. The theory presented here embraces both conventional (synchronous) SA as well as asynchronous SA, and all in-between possibilities. We also prove bounds on the \textit{rate} of convergence of $\theta_t$ to the solutions. As a prelude to the new results, we also briefly survey some results on the convergence of the Stochastic Gradient method, proved in a companion paper by the present authors.  ( 3 min )
    CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning
    arXiv:2402.13221v1 Announce Type: cross Abstract: Advances in graph machine learning (ML) have been driven by applications in chemistry as graphs have remained the most expressive representations of molecules. While early graph ML methods focused primarily on small organic molecules, recently, the scope of graph ML has expanded to include inorganic materials. Modelling the periodicity and symmetry of inorganic crystalline materials poses unique challenges, which existing graph ML methods are unable to address. Moving to inorganic nanomaterials increases complexity as the scale of number of nodes within each graph can be broad ($10$ to $10^5$). The bulk of existing graph ML focuses on characterising molecules and materials by predicting target properties with graphs as input. However, the most exciting applications of graph ML will be in their generative capabilities, which is currently not at par with other domains such as images or text. We invite the graph ML community to address these open challenges by presenting two new chemically-informed large-scale inorganic (CHILI) nanomaterials datasets: A medium-scale dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from 12 selected crystal types (CHILI-3K) and a large-scale dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined crystal structures (CHILI-100K). We define 11 property prediction tasks and 6 structure prediction tasks, which are of special interest for nanomaterial research. We benchmark the performance of a wide array of baseline methods and use these benchmarking results to highlight areas which need future work. To the best of our knowledge, CHILI-3K and CHILI-100K are the first open-source nanomaterial datasets of this scale -- both on the individual graph level and of the dataset as a whole -- and the only nanomaterials datasets with high structural and elemental diversity.  ( 3 min )
    Achieving Near-Optimal Regret for Bandit Algorithms with Uniform Last-Iterate Guarantee
    arXiv:2402.12711v1 Announce Type: cross Abstract: Existing performance measures for bandit algorithms such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate the cumulative performance, while allowing the play of an arbitrarily bad arm at any finite time t. Such a behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger performance measure, the uniform last-iterate (ULI) guarantee, capturing both cumulative and instantaneous performance of bandit algorithms. Specifically, ULI characterizes the instantaneous performance since it ensures that the per-round regret of the played arm is bounded by a function, monotonically decreasing w.r.t. (large) round t, preventing revisits to bad arms when sufficient samples are available. We demonstrate that a near-optimal ULI guarantee directly implies near-optimal cumulative performance across aforementioned performance measures. To examine the achievability of ULI in the finite arm setting, we first provide two positive results that some elimination-based algorithms and high-probability adversarial algorithms with stronger analysis or additional designs, can attain near-optimal ULI guarantees. Then, we also provide a negative result, indicating that optimistic algorithms cannot achieve a near-optimal ULI guarantee. Finally, we propose an efficient algorithm for linear bandits with infinitely many arms, which achieves the ULI guarantee, given access to an optimization oracle.  ( 2 min )
    Learning Generalization and Regularization of Nonhomogeneous Temporal Poisson Processes
    arXiv:2402.12808v1 Announce Type: cross Abstract: The Poisson process, especially the nonhomogeneous Poisson process (NHPP), is an essentially important counting process with numerous real-world applications. Up to date, almost all works in the literature have been on the estimation of NHPPs with infinite data using non-data driven binning methods. In this paper, we formulate the problem of estimation of NHPPs from finite and limited data as a learning generalization problem. We mathematically show that while binning methods are essential for the estimation of NHPPs, they pose a threat of overfitting when the amount of data is limited. We propose a framework for regularized learning of NHPPs with two new adaptive and data-driven binning methods that help to remove the ad-hoc tuning of binning parameters. Our methods are experimentally tested on synthetic and real-world datasets and the results show their effectiveness.  ( 2 min )
    PIP-Net: Pedestrian Intention Prediction in the Wild
    arXiv:2402.12810v1 Announce Type: cross Abstract: Accurate pedestrian intention prediction (PIP) by Autonomous Vehicles (AVs) is one of the current research challenges in this field. In this article, we introduce PIP-Net, a novel framework designed to predict pedestrian crossing intentions by AVs in real-world urban scenarios. We offer two variants of PIP-Net designed for different camera mounts and setups. Leveraging both kinematic data and spatial features from the driving scene, the proposed model employs a recurrent and temporal attention-based solution, outperforming state-of-the-art performance. To enhance the visual representation of road users and their proximity to the ego vehicle, we introduce a categorical depth feature map, combined with a local motion flow feature, providing rich insights into the scene dynamics. Additionally, we explore the impact of expanding the camera's field of view, from one to three cameras surrounding the ego vehicle, leading to enhancement in the model's contextual perception. Depending on the traffic scenario and road environment, the model excels in predicting pedestrian crossing intentions up to 4 seconds in advance which is a breakthrough in current research studies in pedestrian intention prediction. Finally, for the first time, we present the Urban-PIP dataset, a customised pedestrian intention prediction dataset, with multi-camera annotations in real-world automated driving scenarios.  ( 2 min )
    On Generalization Bounds for Deep Compound Gaussian Neural Networks
    arXiv:2402.13106v1 Announce Type: new Abstract: Algorithm unfolding or unrolling is the technique of constructing a deep neural network (DNN) from an iterative algorithm. Unrolled DNNs often provide better interpretability and superior empirical performance over standard DNNs in signal estimation tasks. An important theoretical question, which has only recently received attention, is the development of generalization error bounds for unrolled DNNs. These bounds deliver theoretical and practical insights into the performance of a DNN on empirical datasets that are distinct from, but sampled from, the probability density generating the DNN training data. In this paper, we develop novel generalization error bounds for a class of unrolled DNNs that are informed by a compound Gaussian prior. These compound Gaussian networks have been shown to outperform comparative standard and unfolded deep neural networks in compressive sensing and tomographic imaging problems. The generalization error bound is formulated by bounding the Rademacher complexity of the class of compound Gaussian network estimates with Dudley's integral. Under realistic conditions, we show that, at worst, the generalization error scales $\mathcal{O}(n\sqrt{\ln(n)})$ in the signal dimension and $\mathcal{O}(($Network Size$)^{3/2})$ in network size.  ( 2 min )
    A Bound on the Maximal Marginal Degrees of Freedom
    arXiv:2402.12885v1 Announce Type: new Abstract: Common kernel ridge regression is expensive in memory allocation and computation time. This paper addresses low rank approximations and surrogates for kernel ridge regression, which bridge these difficulties. The fundamental contribution of the paper is a lower bound on the rank of the low dimensional approximation, which is required such that the prediction power remains reliable. The bound relates the effective dimension with the largest statistical leverage score. We characterize the effective dimension and its growth behavior with respect to the regularization parameter by involving the regularity of the kernel. This growth is demonstrated to be asymptotically logarithmic for suitably chosen kernels, justifying low-rank approximations as the Nystr\"om method.  ( 2 min )
    Diffusion Posterior Sampling is Computationally Intractable
    arXiv:2402.12727v1 Announce Type: cross Abstract: Diffusion models are a remarkably effective way of learning and sampling from a distribution $p(x)$. In posterior sampling, one is also given a measurement model $p(y \mid x)$ and a measurement $y$, and would like to sample from $p(x \mid y)$. Posterior sampling is useful for tasks such as inpainting, super-resolution, and MRI reconstruction, so a number of recent works have given algorithms to heuristically approximate it; but none are known to converge to the correct distribution in polynomial time. In this paper we show that posterior sampling is \emph{computationally intractable}: under the most basic assumption in cryptography -- that one-way functions exist -- there are instances for which \emph{every} algorithm takes superpolynomial time, even though \emph{unconditional} sampling is provably fast. We also show that the exponential-time rejection sampling algorithm is essentially optimal under the stronger plausible assumption that there are one-way functions that take exponential time to invert.  ( 2 min )
    Learning on manifolds without manifold learning
    arXiv:2402.12687v1 Announce Type: cross Abstract: Function approximation based on data drawn randomly from an unknown distribution is an important problem in machine learning. In contrast to the prevalent paradigm of solving this problem by minimizing a loss functional, we have given a direct one-shot construction together with optimal error bounds under the manifold assumption; i.e., one assumes that the data is sampled from an unknown sub-manifold of a high dimensional Euclidean space. A great deal of research deals with obtaining information about this manifold, such as the eigendecomposition of the Laplace-Beltrami operator or coordinate charts, and using this information for function approximation. This two step approach implies some extra errors in the approximation stemming from basic quantities of the data in addition to the errors inherent in function approximation. In Neural Networks, 132:253268, 2020, we have proposed a one-shot direct method to achieve function approximation without requiring the extraction of any information about the manifold other than its dimension. However, one cannot pin down the class of approximants used in that paper. In this paper, we view the unknown manifold as a sub-manifold of an ambient hypersphere and study the question of constructing a one-shot approximation using the spherical polynomials based on the hypersphere. Our approach does not require preprocessing of the data to obtain information about the manifold other than its dimension. We give optimal rates of approximation for relatively "rough" functions.  ( 2 min )
    Mode Estimation with Partial Feedback
    arXiv:2402.13079v1 Announce Type: new Abstract: The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize core aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution using partial feedback. We show how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our problem.  ( 2 min )
    Generalization in Kernel Regression Under Realistic Assumptions
    arXiv:2312.15995v2 Announce Type: replace-cross Abstract: It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup. This work aims to provide a unified theory to upper bound the excess risk of kernel regression for nearly all common and realistic settings. Specifically, we provide rigorous bounds that hold for common kernels and for any amount of regularization, noise, any input dimension, and any number of samples. Furthermore, we provide relative perturbation bounds for the eigenvalues of kernel matrices, which may be of independent interest. These reveal a self-regularization phenomenon, whereby a heavy tail in the eigendecomposition of the kernel provides it with an implicit form of regularization, enabling good generalization. When applied to common kernels, our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression. As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.  ( 2 min )
    Best Arm Identification with Fixed Budget: A Large Deviation Perspective
    arXiv:2312.12137v2 Announce Type: replace-cross Abstract: We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is however much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated \sr (Successive Rejects) algorithm \citep{audibert2010best}, and (ii) to devise and analyze new algorithms. In particular, we present \sred (Continuous Rejects), a truly adaptive algorithm that can reject arms in {\it any} round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that \sred enjoys better performance guarantees than existing algorithms, including \sr. Extensive numerical experiments confirm this observation.  ( 3 min )
    Mean estimation in the add-remove model of differential privacy
    arXiv:2312.06658v2 Announce Type: replace-cross Abstract: Differential privacy is often studied under two different models of neighboring datasets: the add-remove model and the swap model. While the swap model is frequently used in the academic literature to simplify analysis, many practical applications rely on the more conservative add-remove model, where obtaining tight results can be difficult. Here, we study the problem of one-dimensional mean estimation under the add-remove model. We propose a new algorithm and show that it is min-max optimal, achieving the best possible constant in the leading term of the mean squared error for all $\epsilon$, and that this constant is the same as the optimal algorithm under the swap model. These results show that the add-remove and swap models give nearly identical errors for mean estimation, even though the add-remove model cannot treat the size of the dataset as public information. We also demonstrate empirically that our proposed algorithm yields at least a factor of two improvement in mean squared error over algorithms frequently used in practice. One of our main technical contributions is a new hour-glass mechanism, which might be of independent interest in other scenarios.  ( 2 min )
    Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
    arXiv:2312.11456v3 Announce Type: replace-cross Abstract: This paper studies the theoretical framework of the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiment of large language model demonstrate that these proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connections between solid theoretical foundations and their powerful practical implementations.  ( 2 min )
    Touring sampling with pushforward maps
    arXiv:2311.13845v2 Announce Type: replace-cross Abstract: The number of sampling methods could be daunting for a practitioner looking to cast powerful machine learning methods to their specific problem. This paper takes a theoretical stance to review and organize many sampling approaches in the ``generative modeling'' setting, where one wants to generate new data that are similar to some training examples. By revealing links between existing methods, it might prove useful to overcome some of the current challenges in sampling with diffusion models, such as long inference time due to diffusion simulation, or the lack of diversity in generated samples.  ( 2 min )
    Add and Thin: Diffusion for Temporal Point Processes
    arXiv:2311.01139v2 Announce Type: replace-cross Abstract: Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.  ( 2 min )
    SWoTTeD: An Extension of Tensor Decomposition to Temporal Phenotyping
    arXiv:2310.01201v2 Announce Type: replace-cross Abstract: Tensor decomposition has recently been gaining attention in the machine learning community for the analysis of individual traces, such as Electronic Health Records (EHR). However, this task becomes significantly more difficult when the data follows complex temporal patterns. This paper introduces the notion of a temporal phenotype as an arrangement of features over time and it proposes SWoTTeD (Sliding Window for Temporal Tensor Decomposition), a novel method to discover hidden temporal patterns. SWoTTeD integrates several constraints and regularizations to enhance the interpretability of the extracted phenotypes. We validate our proposal using both synthetic and real-world datasets, and we present an original usecase using data from the Greater Paris University Hospital. The results show that SWoTTeD achieves at least as accurate reconstruction as recent state-of-the-art tensor decomposition models, and extracts temporal phenotypes that are meaningful for clinicians.  ( 2 min )
    Emergence of Grid-like Representations by Training Recurrent Networks with Conformal Normalization
    arXiv:2310.19192v2 Announce Type: replace-cross Abstract: Grid cells in the entorhinal cortex of mammalian brains exhibit striking hexagon grid firing patterns in their response maps as the animal (e.g., a rat) navigates in a 2D open environment. In this paper, we study the emergence of the hexagon grid patterns of grid cells based on a general recurrent neural network (RNN) model that captures the navigation process. The responses of grid cells collectively form a high dimensional vector, representing the 2D self-position of the agent. As the agent moves, the vector is transformed by an RNN that takes the velocity of the agent as input. We propose a simple yet general conformal normalization of the input velocity of the RNN, so that the local displacement of the position vector in the high-dimensional neural space is proportional to the local displacement of the agent in the 2D physical space, regardless of the direction of the input velocity. We apply this mechanism to both a linear RNN and nonlinear RNNs. Theoretically, we provide an understanding that explains the connection between conformal normalization and the emergence of hexagon grid patterns. Empirically, we conduct extensive experiments to verify that conformal normalization is crucial for the emergence of hexagon grid patterns, across various types of RNNs. The learned patterns share similar profiles to biological grid cells, and the topological properties of the patterns also align with our theoretical understanding.  ( 3 min )
    Will More Expressive Graph Neural Networks do Better on Generative Tasks?
    arXiv:2308.11978v4 Announce Type: replace-cross Abstract: Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the per- formance of six GNNs in two different generative frameworks -- autoregressive generation models, such as GCPN and GraphAF, and one-shot generation models, such as GraphEBM -- on six different molecular generative objectives on the ZINC-250k dataset. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN, GraphAF, and GraphEBM on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are impor- tant metrics for de-novo molecular design.  ( 3 min )
    Martian time-series unraveled: A multi-scale nested approach with factorial variational autoencoders
    arXiv:2305.16189v3 Announce Type: replace-cross Abstract: Unsupervised source separation involves unraveling an unknown set of source signals recorded through a mixing operator, with limited prior knowledge about the sources, and only access to a dataset of signal mixtures. This problem is inherently ill-posed and is further challenged by the variety of timescales exhibited by sources. Existing methods typically rely on a preselected window size that determines their operating timescale, limiting their capacity to handle multi-scale sources. To address this issue, we propose an unsupervised multi-scale clustering and source separation framework by leveraging wavelet scattering spectra that provide a low-dimensional representation of stochastic processes, capable of distinguishing between different non-Gaussian stochastic processes. Nested within this representation space, we develop a factorial Gaussian-mixture variational autoencoder that is trained to (1) probabilistically cluster sources at different timescales and (2) independently sample scattering spectra representations associated with each cluster. As the final stage, using samples from each cluster as prior information, we formulate source separation as an optimization problem in the wavelet scattering spectra representation space, aiming to separate sources in the time domain. When applied to the entire seismic dataset recorded during the NASA InSight mission on Mars, containing sources varying greatly in timescale, our multi-scale nested approach proves to be a powerful tool for disentangling such different sources, e.g., minute-long transient one-sided pulses (known as ``glitches'') and structured ambient noises resulting from atmospheric activities that typically last for tens of minutes. These results provide an opportunity to conduct further investigations into the isolated sources related to atmospheric-surface interactions, thermal relaxations, and other complex phenomena.  ( 3 min )
    Generative Sliced MMD Flows with Riesz Kernels
    arXiv:2305.11463v4 Announce Type: replace-cross Abstract: Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations. In this paper, we show that MMD flows with Riesz kernels $K(x,y) = - \|x-y\|^r$, $r \in (0,2)$ have exceptional properties which allow their efficient computation. We prove that the MMD of Riesz kernels, which is also known as energy distance, coincides with the MMD of their sliced version. As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting. Here, for $r=1$, a simple sorting algorithm can be applied to reduce the complexity from $O(MN+N^2)$ to $O((M+N)\log(M+N))$ for two measures with $M$ and $N$ support points. As another interesting follow-up result, the MMD of compactly supported measures can be estimated from above and below by the Wasserstein-1 distance. For the implementations we approximate the gradient of the sliced MMD by using only a finite number $P$ of slices. We show that the resulting error has complexity $O(\sqrt{d/P})$, where $d$ is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks even for image applications. We demonstrate the efficiency of our model by image generation on MNIST, FashionMNIST and CIFAR10.  ( 3 min )
    Going Further: Flatness at the Rescue of Early Stopping for Adversarial Example Transferability
    arXiv:2304.02688v2 Announce Type: replace-cross Abstract: Transferability is the property of adversarial examples to be misclassified by other models than the surrogate model for which they were crafted. Previous research has shown that early stopping the training of the surrogate model substantially increases transferability. A common hypothesis to explain this is that deep neural networks (DNNs) first learn robust features, which are more generic, thus a better surrogate. Then, at later epochs, DNNs learn non-robust features, which are more brittle, hence worst surrogate. First, we tend to refute this hypothesis, using transferability as a proxy for representation similarity. We then establish links between transferability and the exploration of the loss landscape in parameter space, focusing on sharpness, which is affected by early stopping. This leads us to evaluate surrogate models trained with seven minimizers that minimize both loss value and loss sharpness. Among them, SAM consistently outperforms early stopping by up to 28.8 percentage points. We discover that the strong SAM regularization from large flat neighborhoods tightly links to transferability. Finally, the best sharpness-aware minimizers prove competitive with other training methods and complement existing transferability techniques.  ( 3 min )
    Mathematical Framework for Online Social Media Auditing
    arXiv:2209.05550v2 Announce Type: replace-cross Abstract: Social media platforms (SMPs) leverage algorithmic filtering (AF) as a means of selecting the content that constitutes a user's feed with the aim of maximizing their rewards. Selectively choosing the contents to be shown on the user's feed may yield a certain extent of influence, either minor or major, on the user's decision-making, compared to what it would have been under a natural/fair content selection. As we have witnessed over the past decade, algorithmic filtering can cause detrimental side effects, ranging from biasing individual decisions to shaping those of society as a whole, for example, diverting users' attention from whether to get the COVID-19 vaccine or inducing the public to choose a presidential candidate. The government's constant attempts to regulate the adverse effects of AF are often complicated, due to bureaucracy, legal affairs, and financial considerations. On the other hand SMPs seek to monitor their own algorithmic activities to avoid being fined for exceeding the allowable threshold. In this paper, we mathematically formalize this framework and utilize it to construct a data-driven statistical auditing procedure to regulate AF from deflecting users' beliefs over time, along with sample complexity guarantees. This state-of-the-art algorithm can be used either by authorities acting as external regulators or by SMPs for self-auditing.  ( 3 min )
    Autoregressive Bandits
    arXiv:2212.06251v2 Announce Type: replace-cross Abstract: Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for guaranteeing convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically validate our algorithm, illustrating its advantages w.r.t. bandit baselines and its robustness to misspecification of key parameters.  ( 2 min )
    Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee
    arXiv:2206.10477v5 Announce Type: replace-cross Abstract: Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive compared to various baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets  ( 3 min )
    Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation
    arXiv:2106.03907v4 Announce Type: replace-cross Abstract: Proxy causal learning (PCL) is a method for estimating the causal effect of treatments on outcomes in the presence of unobserved confounding, using proxies (structured side information) for the confounder. This is achieved via two-stage regression: in the first stage, we model relations among the treatment and proxies; in the second stage, we use this model to learn the effect of treatment on the outcome, given the context provided by the proxies. PCL guarantees recovery of the true causal effect, subject to identifiability conditions. We propose a novel method for PCL, the deep feature proxy variable method (DFPV), to address the case where the proxies, treatments, and outcomes are high-dimensional and have nonlinear complex relationships, as represented by deep neural network features. We show that DFPV outperforms recent state-of-the-art PCL methods on challenging synthetic benchmarks, including settings involving high dimensional image data. Furthermore, we show that PCL can be applied to off-policy evaluation for the confounded bandit problem, in which DFPV also exhibits competitive performance.  ( 3 min )
    Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests
    arXiv:2402.12668v1 Announce Type: new Abstract: We study the often overlooked phenomenon, first noted in \cite{breiman2001random}, that random forests appear to reduce bias compared to bagging. Motivated by an interesting paper by \cite{mentch2020randomization}, where the authors argue that random forests reduce effective degrees of freedom and only outperform bagging ensembles in low signal-to-noise ratio (SNR) settings, we explore how random forests can uncover patterns in the data missed by bagging. We empirically demonstrate that in the presence of such patterns, random forests reduce bias along with variance and increasingly outperform bagging ensembles when SNR is high. Our observations offer insights into the real-world success of random forests across a range of SNRs and enhance our understanding of the difference between random forests and bagging ensembles with respect to the randomization injected into each split. Our investigations also yield practical insights into the importance of tuning $mtry$ in random forests.  ( 2 min )
    FAST: An Optimization Framework for Fast Additive Segmentation in Transparent ML
    arXiv:2402.12630v1 Announce Type: new Abstract: We present FAST, an optimization framework for fast additive segmentation. FAST segments piecewise constant shape functions for each feature in a dataset to produce transparent additive models. The framework leverages a novel optimization procedure to fit these models $\sim$2 orders of magnitude faster than existing state-of-the-art methods, such as explainable boosting machines \citep{nori2019interpretml}. We also develop new feature selection algorithms in the FAST framework to fit parsimonious models that perform well. Through experiments and case studies, we show that FAST improves the computational efficiency and interpretability of additive models.  ( 2 min )
    Multi-class Temporal Logic Neural Networks
    arXiv:2402.12397v1 Announce Type: new Abstract: Time-series data can represent the behaviors of autonomous systems, such as drones and self-driving cars. The problem of binary and multi-class classification has received a lot of attention in this field. Neural networks represent a popular approach to classifying data; However, they lack interpretability, which poses a significant challenge in extracting meaningful information from them. Signal Temporal Logic (STL) is a formalism to describe the properties of timed behaviors. We propose a method that combines all of the above: neural networks that represent STL specifications for multi-class classification of time-series data. We offer two key contributions: 1) We introduce a notion of margin for multi-class classification, and 2) we introduce the use of STL-based attributes for enhancing the interpretability of the results. We evaluate our method on two datasets and compare with state-of-the-art baselines.  ( 2 min )

  • Open

    Beneficial AGI Summit: Mass Movement Toward AGI
    submitted by /u/munwarzielf [link] [comments]
    Why does it seem like AI progress was stagnant until ChatGPT?
    As an outsider, it seems like ChatGPT started the AI craze and now suddenly Google has their own, Meta has their own, Apple is working on their own. What scientific breakthrough or landmark discovery was made that has now allowed this race to start? submitted by /u/50kPplUsed2LivHere [link] [comments]
    Best AI method to analyze and query 1000's of local .pdfs?
    I have a very large collection of pdfs which are impossible to really catalog. I would like to be able to just set an AI loose on the entire collection and then be able to query the AI about specific key words, topics, numbers, and ideas and have it find all the relevant documents for me to read myself. Wouldn't mind training an AI specifically on that narrow dataset too, if that's possible. They generally range from 1 page to 40 pages long. So much has changed recently it's difficult for me to understand what exactly the best tool for this would be. Or if I'm still stuck with uploading this massive collection online somewhere for GPT4 or similar to analyze? *Have a 16gb RTX 3060, so some local capability, I would pay for a better card if there is a good solution that requires it though. Chat With RTX supposedly misses quite a lot of data in the middle of documents so that is a last resort. submitted by /u/agoldprospector [link] [comments]
    Artificial Intelligence in the NEW Animal Crossing of the Switch 2
    The Switch game has already made the graphics and decoration perfect. The only thing missing to make the definitive game is that you can really talk to the characters, and there are already small projects that have made this possible with artificial intelligence. If Nintendo makes an Animal Crossing for the Switch 2, this is the only thing they can improve on, and it would be a revolution, everyone would buy it. submitted by /u/gutierrezz36 [link] [comments]
    I tried the most popular free AI's to summarize YouTube videos
    Here's my rundown of the AI tools I tested out (and think everyone should give a go): https://www.gosummarize.com/ (Top pick for me): This one secured the top position by delivering both precision and practicality. https://getliner.com/en (Secured my second spot): Offers concise summaries that give you the gist without getting too verbose. Accuracy was on point. https://clipnote.ai/: Despite an obvious error in the summary, it still earns third place for its overall comprehensiveness. https://www.summarize.tech/: Summarizes content into a single text, but the presentation isn't particularly attractive or comprehensive. https://www.wordtune.com/: Essentially transcribes subtitles with timestamps, although this wasn't always the case. Most of these tools offer reasonable free plans, with options for paid upgrades. For me, Go Summarize stood out as the most accurate and useful. While it might be too detailed for some, it has timestamped highlights, short summary, and a detailed long summary covering most information in the video. Feel free to share your experiences with other websites, and correct any misconceptions I might have about the ones mentioned above. submitted by /u/KentSjc [link] [comments]
    Daily AI Quick Links (02/21/2024)
    Groq’s $20,000 LPU chip breaks AI performance records to rival GPU-led industry [1] ElevenLabs is launching AI for the creation of unique sound effects [2] ​​Gartner predicts 25% dip in search engine volume by 2026 due to AI [3] UofL researchers develop AI-powered tool to diagnose autism earlier [4] Sources: [1] https://cryptoslate.com/groq-20000-lpu-card-breaks-ai-performance-records-to-rival-gpu-led-industry/ [2] https://readwrite.com/elevenlabs-to-launch-ai-for-the-creation-of-unique-sound-effects/ [3] https://backendnews.net/gartner-predicts-25-dip-in-search-engine-volume-by-2026-due-to-ai/ [4] https://www.uoflnews.com/post/uofltoday/uofl-researchers-develop-ai-powered-tool-to-diagnosis-autism-earlier/ submitted by /u/Used-Bat3441 [link] [comments]
    image to video (workflow in comments): Midjourney + Photoshop + Stable Video Diffusion + MPC + Ultrasharp + Premiere + Flowframes.
    submitted by /u/PeePeePeePooPooPooo [link] [comments]
    AI video generation SOTA?
    Hi, I have a project in mind that harnesses video generation, and I wonder if Sora is the most powerful or if there are equivalents by Google / Anthropic / Mistral / Meta / ... ? I don't know which ecosystem to turn to! Take care! submitted by /u/Lena_Cl [link] [comments]
    How much longer will RAG be useful in light of Gemini's 1M and 10M token context window and 99% accuracy?
    The Gemini release was really interesting in that they sort of buried the lede by not mentioning the 99% accuracy of the context window. The 128k context window of OpenAI will fall down pretty quickly and really is only 32k-64k if you care about your context actually being used. Ideally you would just fit all your data into the 10M token context window but that's going to be about $5 as per my understanding. That's going to get expensive quickly for a lot of applications. The questions is how long will this be the case. If RAG is only about cost savings I can see it starting to fade away in use over the next 1-2 years and most people just wanting to push everything into the context window. submitted by /u/brainhack3r [link] [comments]
    Giant AI generated billboard in Madrid…
    Walked outside this theater and that image seemed off. They did do some Photoshop adjustments but if you zoom in on the woman’s hair, you’ll clearly see it’s AI generated. submitted by /u/Zeta-Splash [link] [comments]
    Google Gemini AI-image generator refuses to generate images of white people and purposefully alters history to fake diversity
    submitted by /u/iced327 [link] [comments]
    AI and Prisons
    This is a subject I haven't seen discussed yet. How do you think AI and AGI will be used and implemented in incarceration facilities? How do you think it will be different in other countries? What possibilities could AI and machine vision offer to improve the outcomes of incarceration? Conversely, what potential abuses do you see AI causing within prisons? Do you foresee a future a future with reduced or no human labor in managing the prison populations? submitted by /u/MethGerbil [link] [comments]
    How are chatbots or other AI agents given scope of authority at companies?
    I’ve been curious given the outcome of the AirCanada ruling. I searched through the topic on HN (https://news.ycombinator.com/item?id=39378235) and some talk about the legal test of apparent authority. That was partly helpful in how that scope is likely to be tested in the court of law. What are the actual mechanics within the company? For example, the companies I’ve worked for typically give delegation of authority to human roles that dictated what that role can or cannot do on behalf of the company. Some authority is even written directly into company bylaws, like the ability to enter into a contract with financial implications, or issue stock, etc. Side note, on the AirCanada ruling. I find it hard to believe that company would go to court over a few hundred dollars. Most customer service departments have the authority to just solve that issue without even tapping the legal team. My best guess is that they chose a case with minimal damages to test how the courts would react when an AI agent conflicts with the official policy of the company. submitted by /u/billylb42 [link] [comments]
    Sora’s growth in the next year
    With Open Ai revealing their new text to video generation tool named Sora it has caused a massive sesmic wave in the tech industry. When you compare what text-to-video generation looked like a year ago from today and you compare it to what we have today, it’s clear the progress has been exponential. What does text-to-video generation look like a year from today? What will be the consequences of such a tool? I can’t help but imagine that the implications will change pretty much every aspect of our life. submitted by /u/ClassroomOk8582 [link] [comments]
    Looking for terrible AI to write a hilariously bad story as a reward to IT support for assisting me
    I'm an IT guy, and I (for now) am using ask.ai to write me stories as a reward to the support vendor when they assist me solving some issue I have in our environment. It works, but even when I tell it to write a story about an IT guy who solves a customer problem who also happens to shoot lasers out of their mouth, it's only kind of funny. I'm looking for something hilariously bad. Anyone know of some AI out there than can do this job for me? Thanks in advance (bored like hell IT support people all over India will appreciate it) submitted by /u/reddit_haven_of_evil [link] [comments]
    Pi- my thoughts
    Hi new on this community- wanted to share my thoughts and experience with Pi. Stumbled upon it, did some research, the company behind it seems to have some pretty heavy hitters and I like the specific part of only using de-identified information (although you never know). I tried using Pi as a touchboard. Something I could use to reflect on things, perhaps give me new perspectives, reframe my thinking. It follows a form of therapy called acceptance and commitment therapy although I have found it to also bring in other approaches. I was pleasantly surprised by it. It was helpful in helping me look at things from a different perspective, reframing things, giving me fresh ideas that I had not considered. I asked it to tone down the saccharine tone and get a little more real and in your face and it did (to a point). I also opened an account with a firefox relay email (didn't want to get my main account banned). I then tried to break Pi- pushing its limits. Trying to get it to go off the rails with me. Every time it either steered the conversation back to sanity or set some firm limits. Suicidal stuff was responded to quite appropriately. I don't think this is a replacement for a therapist. I do think it can be a helpful tool though, as long as one can realize the limitations at the moment. I say at the moment because I think things are going to change faster than you can possibly imagine (for better or worse)- it's an exciting time (in a good and a bad way). Anyway, just my musings. submitted by /u/docben1383 [link] [comments]
    Kits Ai/Voicemy.ai alternative
    I'm looking for a voice cloning site/software that allows me to voice clone easily for free. I tried ElevenLabs, Kits AI and Voicemy.ai, however, they all eventually begin to suck for me as they all locked down voice cloning (especially voice model creation) for free users. There's Revocalize.ai, however, this is much worse than them, with bugs and not allowing me to use it for absolutely no reason. Also, while I can accept straight RVC (if possible), I can only accept stuff that either are free or easy to use. Also, I don't think Google Collab or HuggingFace count, however, if they can fulfill everything what I need just fine (free high quality voice model creation and voice cloning) I will give them a pass. Oh, and must be working with AMD GPU on Windows if it's a software. Meaning no CUDA required because this is only for Nvidia GPUs from what I know of. submitted by /u/MinecrafterPictures [link] [comments]
    Google introduces Gemma, a new family of state-of-the-art open LLMs
    submitted by /u/Civil_Collection7267 [link] [comments]
    Americans increasingly believe Artificial General Intelligence (AGI) is possible to build. They are less likely to agree an AGI should have the same rights as a human being.
    Peer-reviewed, open-access research article: https://doi.org/10.53975/8b8e-9e08 Abstract: A compact, inexpensive repeated survey on American adults’ attitudes toward Artificial General Intelligence (AGI) revealed a stable ordering but changing magnitudes of agreement toward three statements. Contrasting 2023 to 2021 results, American adults increasingly agreed AGI was possible to build. Respondents agreed more weakly that AGI should be built. Finally, American adults mostly disagree that an AGI should have the same rights as a human being; disagreeing more strongly in 2023 than in 2021. submitted by /u/jasonjonesresearch [link] [comments]
    AI enables a machine to work intelligently?
    submitted by /u/malaysianzombie [link] [comments]
    Games in the future will be using AI generated graphics ?
    So now we are seeing AI Generated videos, do you think the graphics engine of games will be using AI to fully generate the games graphics with some sorts of prompts ? Of course it would need a lot of power and calculations but computers would be very powerful compared to nowadays and AI generation could be very precise if prompted accordingly or fed with related content. submitted by /u/KrySoar [link] [comments]
    Is 2024 year of the AI Agents?
    2023 was the Year of ChatGPT and getting us answers that we must be the middle man to translate what it said and go back to our computer and interface with it to get what we need. So the next step is for something to skip the human part and interface with the other machine for us right? That seems to be the point of the rabbit right? I am mostly interested in the coding side of things so that means the agent is trying to take the answers on how to code and directly putting them into the machine and attempting to test and verify. submitted by /u/punkouter23 [link] [comments]
  • Open

    [D] Just wanted some recommendations on codig bootcamps!
    I am an Actuarial Science undergrad at a Canadian University. I am looking to make the shift into AI and ML. I have some amount of exposure to it from trying to learn it by myself but hae decided to try out bootcamps as I have heard that makes it easier. I was wondering which bootcamps that can be done in Canada would be best to go into that space. Any recommendations are welcome and thank you again to everyone who comments!! submitted by /u/Junior_Reporter_9388 [link] [comments]
    [D] How do you keep yourself updated?
    I would like to read newest research in areas that I’m into, but I never know where to look. So, how do you do that? Any advice is welcome, thanks! submitted by /u/zero_redditer [link] [comments]
    [D] Best Practices for a RAG/LLM Monorepo – What to Keep in Mind?
    If you've worked with RAG or LLM models in a monorepo environment, what are the lessons learned for effective design, organization, and workflow? Are there any specialized tools or approaches you've found helpful for these use cases? I am also curious as to how containerization was managed. submitted by /u/AmazingGrace_D [link] [comments]
    [D] Integrating Models into Existing Production Systems
    I'm a software engineer that's working on an MLOPs team for the first time. We have a Data Scientist that builds models and deploys them to Databricks at an http endpoint. All our engineering team does is build services that integrate with his models. The data flow is basically: An async event occurs that triggers one of our lambdas. A lambda ingests the event and grabs some related information from another system (S3). The lambda sends the aggregated features to the Databricks endpoint for inference. The lambda stores the prediction in postgres and sends an event containing the prediction to another team's system to display to the user. I've observed a couple of things while working on this team: Our Data Scientist typically has a ML model ready to deploy 6 months before we are able to use it as an org (engineering in general is time consuming). We are doing the same work over and over again, async processing with a log database of predictions. I'm wondering: How common is the architecture that we're building? What are your experiences with integrating ML models into production systems? What are your specific integration use cases? submitted by /u/Turbulent_Cat_5827 [link] [comments]
    [D] Tabular data and ML
    Hi, I’ve been in the industry for about 10 years (PHD in mathematical modeling- linear algebra area) and past couple of years ended up working only in tabular data and a little heavy on infra side. There’s little scope to use modern (forget SOTA) ML, mainly DL (yes any of). I see this could be a problem as I might not be in demand at the market. I’m an IC and would love to become a manager some day. What steps could one take? My ideal role of interest would be applied scientist with applications not limited to just tabular but multimodal. submitted by /u/Evening-Moment-6589 [link] [comments]
    [P] Question on existing tools to expand my .csv dataset using ML
    I am fine tuning an LLM to react and interact to a specific type of situation (gonna omit), so I am creating a csv file that represents transcripts from those situations including annotations, reactions, etc... I have only been able to generate ~30 annotated situations as they are quite time consuming and there are NO EXISTING DATASETS and no repositories online from which to pull from (Ive checked) Im looking for a tool (open-source or not, free or paid) where I can communicate/use ai to build out that dataset from my 30 transcripts (im going to want 500-1000), could be some kind of open source or wrapper where it can work well with tables and expanding datasets using ai getting them generated in chatGPT and gemini are proving useless, especially at scale Thank you and any info would be appreciated submitted by /u/TheMeals [link] [comments]
    [D] Any suggestions for models to train for the purpose of image animation?
    I want to train a model to be able to animate an image using keyframe interpolation using two keyframes, the input would be text + image (the two keyframes); I just want to build a prototype, it does not have to be perfect. submitted by /u/Meba_ [link] [comments]
    [D] Anyone else notice a surprisingly big difference between FP32 and FP16 models?
    At the company I work at we run an image classification service. The model's fairly simple: we just have a pre-trained image encoder backbone model that we further pre-trained with our own data, and we have a fully-connected layer for each "task." A "task" could be something like classifying images of dogs or classifying the color that an image has. I recently changed my code to use HuggingFace's Accelerate module rather than PyTorch's native DDP and also was training my model with mixed precision training which stores the weights in FP16 rather than FP32. I kept my backbone and fully-connected layers frozen for the original tasks and only wanted to train new fully-connected layers. However, the problem is that I'm noticing a significant difference in the predictions made by frozen branches. After investigating everything that could have possibly gone wrong and even manually checking the weight values in an IPython terminal, I noticed that the main difference is that the previous model's weights were stored in FP32 whereas my newly trained ones were FP16. I have heard of stories where FP16 is essentially a tradeoff for speed and performance but I never knew that it would be that different. Curious if anyone else has experienced similar. submitted by /u/Seankala [link] [comments]
    [D][R] What does your ML tech stack look like?
    There are many libraries out there for training and inference of DL models. What does your training tech-stack look like? For example I make heavy use of huggingface ecosystem libraries and rarely have to import something outside of those or plain old torch. submitted by /u/iordanis_ [link] [comments]
    [D] Gemma vs Mistral-7B-v0.1 evaluation: Gemma really Struggles to Reach Mistral's Accuracy
    The new Model Gemma from Google DeepMind does not demonstrate strong performance on medical/healthcare domain benchmarks. A side-by-side comparison of Gemma by Gemma vs Mistral by Mistral AI without fine-tuning. Mistral clearly wins : https://preview.redd.it/3h883kbi00kc1.png?width=721&format=png&auto=webp&s=7ca853f9cd9b7ee0d1c230541e759867c4c6d67d I'll be fine-tuning and evaluating Gemma & different LLMs over the next few days on different Medical and Legal benchmarks. Follow the updates here: https://twitter.com/aadityaura submitted by /u/aadityaura [link] [comments]
    [D] Twitter/X thread about OpenAI's Sora from one of the 2 authors of work "Scalable Diffusion Models with Transformers": "Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. [...]." The other author of that work is involved with Sora at OpenAI.
    Unrolled Twitter/X thread. First tweet in thread, which I found via this tweet by Yann LeCun. Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community. What we have learned so far: - Architecture: Sora is built on our diffusion transformer (DiT) model (published in ICCV 2023) — it's a diffusion model with a transformer backbone, in short: DiT = [VAE encoder + ViT + DDPM + VAE decoder]. According to the report, it seems there are not much additional bells and whistles. [...] Scalable Diffusion Models with Transformers. Sora technical report. A tweet from the other author of the work: Sora is here! It's a diffusion transformer that can generate up to a minute of 1080p video with great coherence and quality. @ /_tim_brooks and I have been working on this at @ /openai for a year, and we're pumped about pursuing AGI by simulating everything! http://openai.com/sora Related post: [D] OpenAI Sora Video Gen -- How?? submitted by /u/Wiskkey [link] [comments]
    [D] Gemma vs Mistral (and other open models)
    A lot of comparisons out there in terms of model size and quantitative performance against benchmarks. Wanted to get people's thoughts on qualitative, ad-hoc performance. > Playground to Compare: https://huggingface.co/spaces/lastmileai/gemma-playground In this example, both Gemma 2B and 7B doesn't seem to do well against Mistral for CoT tasks. Curious how it's doing for instruct, question-answering, and creativity tasks. submitted by /u/InevitableSky2801 [link] [comments]
    [N] Language Processing Unit (LPU) makes inference of LLMs 10x faster
    This week, a largely unknown company, Groq, demonstrated unprecedented speed running open-source LLMs such as Llama-2 (70 billion parameters) at more than 100 tokens per second, and Mixtral at nearly 500 tokens per second per user on a Groq’s Language Processing Unit (LPU). For the comparison: “According to Groq, in similar tests, ChatGPT loads at 40-50 tokens per second, and Bard at 70 tokens per second on typical GPU-based computing systems. Context for 100 tokens per second per user, demonstrated unprecedented speed running open-source LLMs such as Llama-2 (70 billion parameters) at more than 100 tokens per second, and Mixtral at nearly 500 tokens per second per user on Groq’s Language Processing Unit (LPU). So: What is LPU, how does it work, and where is Groq (such an unfortunate name, given Musk's Grok is all over the media) coming from? https://www.turingpost.com/p/fod41 submitted by /u/vvkuka [link] [comments]
    [D] Modern Samplers for Diffusions.
    What are the default samplers used for diffusion models nowadays? The paper by song: https://arxiv.org/pdf/2011.13456.pdf proposes a couple. Are PC samplers the default nowadays, or do people do something else? ​ Thank you! submitted by /u/randomkolmogorov [link] [comments]
    [P] marimo-wasm: a reactive Python notebook in the browser
    We (2 developers) made marimo[1] compatible with WebAssembly (WASM), so you can run it entirely in the browser thanks to Pyodide. You can try out the playground: https://marimo.app/. This has been a great tool for learning Python or educating others, as you can share snippets of code via the URL. For example, here is a notebook on Bayes' Theorem. [1] marimo is open-source on GitHub: https://github.com/marimo-team/marimo submitted by /u/mmmmmmyles [link] [comments]
    What is the meaning of cross-validation in a small dataset [D]
    To the best of my understanding, in the context of deep learning, k-fold cross validation aims to allocate multiple subsets of training/validation pairs for hyperparameter tuning. Let alone the independent test dataset, each training/validation fold should correspond to one 'final' model trained with different hyperparameters so that we can compare the effect of the hyperparameters. However in a small dataset where the data in one subset may be not generalized, how do I know whether the model performance difference in different subsets come from the bias of the data or the difference in the hyperparameters used? (e.g. in CNN, often we have limited data and the images usually contain dramatically distinct features) submitted by /u/alan6690 [link] [comments]
    [D] Graphs + vectordbs? Need your input: Cognee.ai . AI Data Pipelines for Real-World Production (Part 4)
    Hey there, Redditors! I'm back with the latest installment on creating dependable AI data pipelines for real-world production. If you've been following along, you know I'm on a mission to move beyond the "thin OpenAI wrapper" trend and tackle the challenges of building robust data pipelines. After a few months of work, we integrated cognitive architecture with keepi.ai We aim to explore with our demo: 1. Context sanitization The world of AI is fast-moving, and we've realized that the context is becoming a building block we refer to as a crucial part of future cognitive architecture. 2. Best Practices for AI Memory In this rapidly evolving landscape, there are no established best practices. You'll need to make educated bets on tools and processes, knowing that things will change. We assume that having traditional data engineering practices + frameworks + classifiers and other AI solutions can solve a lot of standard hurdles 3. AI Frameworks They are trying to do too much, too fast, too broad. We want to find a pattern and a correct layer of abstraction for the AI memory to fit new industry. ​ How does it work? The Github repo is l: How cognee works Github repo is here Next steps: I have questions for you: Is context sanitization relevant for you? How do you manage metadata? How do you prepare data for LLMs? Are there any data enrichment steps you perform? Check out the blog post: Link to part 4 Remember to give this post an upvote if you found it insightful! And also star our Github repo submitted by /u/Snoo-bedooo [link] [comments]
    [D][R] How do researchers (Masters, PhD) implement complex models? Are they gods?
    I'm doing my theisis right now. I have good grasp of the high-level details on most ML models (RNN, CNN, LSTM, Transformers, GPT, CNN, GANs, LDMs, VAEs, Autoencoder and much more). Of course by no means i'm an expert, but I'm able to learn what I need. But when it comes to actually use them, and implement them in code, and train them, this becomes hell. For the simpler models, its fine, but for the more complex once, there are no tutorials online, they just say 'to use existing model'. How do researchers across the world implement complex models? For instance, diffusion models, LDMs, or modified LLMs, like transformer, or GPT? Or how do they change existing model, and use different techniques, like adding encoder for conditioning? Like, researching and understanding the basics is fine, but actually implementing it is extremly hard. How do they do it with such elegance? Some survey research papers include the usage of multiple models and comparing them. How do they do it? submitted by /u/ShlomiRex [link] [comments]
    [R] Simple Javascript code that could help soldiers and civilians evade drone strikes (updated for immediate combat deployment)
    https://www.academia.edu/115181929/Aerial_Object_Detection_updated_for_combat_deployment_ The app will beep when an aerial object is located. The longer an aerial object is hovering near you, the longer the beeping noise. For soldiers, this could mean that a drone is targeting them. Ideally, soldiers would use the app on their cell phones and attach the device to the top area of their vehicles or to their body while sleeping in the trenches. Keep mind that sim cards must be removed and cell phone wireless connectivity must remain "off" in combat environments. Before deployment, soldiers should connect to wifi and start the app. Once the app is started, a soldier can then disable wifi and leave the app running as he/she is deployed into a combat zone. For detecting aerial objects in combat, android phone should be mounted to the top of the backpack or top of the helmet. In civilian environments, the cell phone, with wireless turned on, could be placed on rooftops. With internet access, a user could view the aerial scene remotely with facebook live submitted by /u/AnthonyofBoston [link] [comments]
    [News] Google release new and open llm model: gemma model
    apparently better than llama7 and 13 (but does not benchmark against mistral7b): https://blog.google/technology/developers/gemma-open-models/ submitted by /u/edienemis [link] [comments]
    [P] OpenVoice (Text-to-Speech) GPU Benchmark: 6.6 Million words per $ on RTX2070 - 230 words per second on RTX 3080 Ti
    In this project, we benchmarked the Text-to-Speech component of OpenVoice across 23 consumer GPUs on SaladCloud. Utilizing the default voice parameters with a speed setting of 1, our base text was a book “Robots and Empire” by Isaac Asimov, available via Archive.org totaling approximately 150,000 words. Methodology To manage memory efficiency and ensure seamless processing, we broke down and processed the text into chunks of roughly 30 sentences, or 200-300 words. Our evaluation spanned all consumer GPU classes available on SaladCloud, with each node provisioned with 1 vCPU and 8GB of RAM. To simulate a single-task environment typical of many production systems, we did not employ threading, thus each GPU was tasked with processing one chunk of text at a time. This setup provides insig…
    [D] Data Engineering for Scaling Language Models to 128K Context - MIT 2024 - New open LLaMA-2 7B and 13B with 128k context!
    Paper: https://arxiv.org/abs/2402.10171 Github: https://github.com/FranxYao/Long-Context-Data-Engineering New models with 128k context inside! Abstract: We study the continual pretraining recipe for scaling language models’ context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that nively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source longcontext models and closes the gap to frontier models like GPT-4 128K. https://preview.redd.it/bedg1gsgixjc1.jpg?width=1447&format=pjpg&auto=webp&s=cdf15e90c375988b169fd24ffd5d4505da002593 https://preview.redd.it/2qy3dhsgixjc1.jpg?width=1837&format=pjpg&auto=webp&s=2ced604b9e1360ee8d170773a1a0600523288516 https://preview.redd.it/pebawhsgixjc1.jpg?width=1446&format=pjpg&auto=webp&s=4a57b8bb6685d6122d51a67e4fa9645555c51d5a https://preview.redd.it/o8v3kisgixjc1.jpg?width=577&format=pjpg&auto=webp&s=6d39b7736dc9221ed69e1c61ca36f303e8ef131e submitted by /u/Singularian2501 [link] [comments]
    [D] If a paper has no open source code available, are you allowed to implement the code for fun/practice and publish it in your own Github with appropriate citation and mention that all credit goes to the authors?
    Hi all, I found a description of an ML implementation of a numerical calculation in physics in a paper that is published on arXiv with the CC4 license (besides arXiv it was also published in a journal but I'm only looking at the arXiv version): You are free to: Share — copy and redistribute the material in any medium or format for any purpose, even commercially. Adapt — remix, transform, and build upon the material for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms. Under the following terms: Attribution — You must give appropriate credit , provide a link to the license, and indicate if changes were made . You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits. The paper has no open source code as far as I can find. Is there anything wrong from an academic conduct pov if I try to write my own implementation of the paper's ML stuff and publish it on my own Github? Of course I will reference the paper and say that everything in this project is based on that. I don't really want to contact the authors to ask to be honest. I want to know if it's fine to do this regardless of contacting the authors. Imagine you had to do that every time you want to do something like that (practice implementations of Attention is All You Need for example). Thanks a lot! submitted by /u/Invariant_apple [link] [comments]
    [D] When doing regression training with large values, better to train the model with the log of the values or the values themselves?
    I'm training a model to predict viewcount from thumbnails, and since the viewcounts are fairly large(in the millions), I was wondering if I should use the log of the viewcount instead of the viewcount directly to prevent blowing up of the loss function. Anyone have any insight into this? lr = 1e-5 batch_size=8 optimizer = torch.optim.AdamW(model.parameters(),lr=lr) train_dataloader = torch.utils.data.DataLoader(mapped_dataset.with_format('torch'), batch_size=batch_size) loss_func = nn.MSELoss() Log based code: for batch in tqdm(train_dataloader, total=1296//batch_size): img_resized = batch['normalized_image'].reshape(-1,3,720, 1296 ) views = torch.log(torch.tensor(batch['views']).to(device).to(torch.float32)).reshape(-1,1) output_view = model(img_resized.cuda()) loss = loss_func(output_view, views) optimizer.zero_grad() loss.backward() # torch.nn.utils.clip_grad.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() if ind % (100//batch_size) == 0: print(f'lastloss', loss.item(),math.log(loss.item())) print(output_view, views) losses.append(loss.item()) ind = ind + 1 Non log: for batch in tqdm(train_dataloader, total=1296//batch_size): img_resized = batch['normalized_image'].reshape(-1,3,720, 1296 ) views = (torch.tensor(batch['views']).to(device).to(torch.float32)).reshape(-1,1) output_view = model(img_resized.cuda()) loss = loss_func(output_view, views) optimizer.zero_grad() loss.backward() # torch.nn.utils.clip_grad.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() if ind % (100//batch_size) == 0: print(f'lastloss', loss.item(),math.log(loss.item())) print(output_view, views) losses.append(loss.item()) ind = ind + 1 ​ submitted by /u/ExaminationNo8522 [link] [comments]
  • Open

    Advances in private training for production on-device language models
    Posted by Zheng Xu, Research Scientist, and Yanxiang Zhang, Software Engineer, Google Language models (LMs) trained to predict the next word given input text are the key technology for many applications [1, 2]. In Gboard, LMs are used to improve users’ typing experience by supporting features like next word prediction (NWP), Smart Compose, smart completion and suggestion, slide to type, and proofread. Deploying models on users’ devices rather than enterprise servers has advantages like lower latency and better privacy for model usage. While training on-device models directly from user data effectively improves the utility performance for applications such as NWP and smart text selection, protecting the privacy of user data for model training is important. Gboard features powe…  ( 93 min )
  • Open

    Simple javascript code that could help soldiers and civilians detect and evade both drone strikes and enemy invaders (prepared for immediate deployment)
    ​ Two javascript projects that make use of object detection. One uses a simple code that could help detect enemy drones. The other uses a simple code that could help detect enemy soldiers. Both apps could be deployed immediately https://www.academia.edu/115181929/Aerial_Object_Detection_updated_for_combat_deployment_ Wait for the model to load before clicking the button to enable the webcam - at which point it will become visible to use. The app will beep when an aerial object is located. The longer an aerial object is hovering near you, the longer the beeping noise. For soldiers, this could mean that a drone is targeting them. Ideally, soldiers would use the app on their cell phones and attach the device to the top area of their vehicles or to their body while sleeping in the trenches.…
    Backpropagation for activation layer in CNN
    Hello I’m trying to learn and implement Backpropagation on a CNN. I understand most of the equations, but I haven’t found any for the activation layers. If I understand correctly it goes Conv layer - activation - pooling. I get the equations for pooling and the conv layer, but I’m not sure about the activation. If you could show me any resources or the actual equations for it, it would help a lot, thanks. submitted by /u/MobileOk8440 [link] [comments]
  • Open

    Next research topic for robot learning?
    After a long period study for RL I amconfused about what to do next in my robot learning resarch. I've been conducted a serie research in RL and then MARL. Now i think its wise for me to transfer RL research to imitation learning cause which is more reliable in realistic. I couldn't find the RL for robots in reality without a world model built. Or Maybe imitation learning with llm could be the next direction for robot learning? submitted by /u/Tight-Ad789 [link] [comments]
    Reinforcement Learning Playlist
    Hey everyone, Check out my reinforcement learning playlist comprising of 28 tutorials covering basics and advance topics with example including MAB, Contextual MAB, Monte Carlo, SARSA, Q Learning, A2C, DDPG, REINFORCE, PPO, RLHF , Multi-Agent algorithms & few OpenAI gym examples https://youtube.com/playlist?list=PLnH2pfPCPZsIpNtTIwKQZeDerJEz2ccEV&si=KLZ3FYlBWkAHIkxx submitted by /u/mehul_gupta1997 [link] [comments]
    Library choice for practical learning
    Hi, I intend on getting started with deep reinforcement learning and I would like to know which libraries combination as of today. I found some tutorials but they seem quite outdated, using keras 2 as V3 is already available. As I am getting started for professional reasons (and I'm already experienced with programming), I would like to build a string fondation knowledge. Hence I'm not afraid of using less "user friendly" APIs if they tend to be better maintained in the long run. Are kerasV3 + tf and gym and /or gymnasium a good starting choice? For instance I found plenty of tutorials using keras-rl2 which is if I understand correctly maintained independently from keras what is exactly the point of using this library instead of the built in keras's reinforcement learning algorithms? Also, i did not find an equivalent for keras V3. Is it because keras V3 made the keras-rl libraries obsolete or is there another reason? submitted by /u/dougdoug110 [link] [comments]
    Changing the Bipedal Walker Terrain in Gymnasium
    Hello everyone, I am experimenting with curriculum learning within the Bipedal walker gymnasium environment in python. I am struggling to edit the terrain that the walker walks on i.e the pits, stumps, stairs etc. For example, I want to train the walker on just the pits etc Does anyone know any good resources for this? Thanks in advance. submitted by /u/Hunter0708 [link] [comments]
    Research Topics or Papers related to RL ?
    Hey, I'm about to finish my master's in mathematics and need to pick a topic for my thesis. I focused my studies on applied statistics and machine learning. Lately, I've been starting to engage with RL and and I've been considering writing my master's thesis on it. I am not too afraid of coding but I definitely would like to do theoretical research. As i am not too deeply engaged with RL I could really use some ideas or paper suggestions for my thesis. I would be really grateful about any suggestions! (preferably RL) Thanks! submitted by /u/d-eighties [link] [comments]
    Is there any trick to help peg-in-hole tasks converge?
    Hi! I'm starting with a simple peg-in-hole task but it's hard to converge whether using dense or sparse reward. For the sparse reward, the trick of random goal position is used in this paper to help converge. Is there any smart trick that has been used to help converge for peg-in-hole tasks? BTW, are there any recommended open-source repos regarding peg-in-hole tasks? Any help would be appreciated! Thanks! submitted by /u/UpperSearch4172 [link] [comments]
    Losing Motivation
    Maybe I am just overreacting, or I am too weak. But please hear me out. For the past few months I have been trying to self study RL. I am on second course in coursera RL specialization, 8th chapter in sutton & barto. Initially some of my friends also wanted to study with me but none of them is doing it anymore. I am also doing a full time swe job which takes out almost all of my energy. I looked for mentors, but couldn't find any. Everyone is into computer vision or NLP these days. Also, lots of people are saying that RL has no future and all. All of these together is just so tiring. I don't really know what I am looking for here. If you can share your journey, that will be a help. Also, if you can mentor me (even if a little bit of your time), I will be forever grateful. submitted by /u/Casio991es [link] [comments]
    Need help with understanding a research paper
    I need help with understanding a paper that was published by deepmind in 2020, the name of the paper is, " Avoiding Side Effects By Considering Future Tasks ". I'm struggling to understand few concepts that were discussed in this paper. Please reach out if any of you are willing to help me with this, Thanks in advance. submitted by /u/Donald-the-dramaduck [link] [comments]
    Training MuZero
    Did someone try to use this code - https://pypi.org/project/muzero-baseline/ ? I installed it on my desktop and tried to train it to play the tic tac toe muzero = MuZero(tictactoe.Game, tictactoe.MuZeroConfig()) muzero.train() muzero.test(render=True) After 2 days of running it did not produce any results. I noticed that the use of CPU was low and the use of GPU was low. ​ ​ submitted by /u/Competitive-Elk4319 [link] [comments]
  • Open

    Research Focus: Week of February 19, 2024
    In this issue: CaaSPER: vertical autoscaling algorithm dynamically maintains optimal CPU utilization; Improved scene landmark detection for camera localization runs faster, uses less storage; ESUS simplifies usability questionnaires for technical products and services. The post Research Focus: Week of February 19, 2024 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Shining Brighter Together: Google’s Gemma Optimized to Run on NVIDIA GPUs
    NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma — Google’s state-of-the-art new lightweight 2 billion– and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases. Teams from the companies worked closely together to accelerate the performance of Read Article  ( 5 min )
  • Open

    Forward-Backward Reasoning in Large Language Models for Mathematical Verification
    arXiv:2308.07758v5 Announce Type: replace-cross Abstract: Self-Consistency samples diverse reasoning chains with answers and chooses the final answer by majority voting. It is based on forward reasoning and cannot further improve performance by sampling more reasoning chains when saturated. To further boost performance, we introduce backward reasoning to verify candidate answers. Specifically, for mathematical tasks, we mask a number in the question and ask the LLM to answer a backward question created by a simple template, i.e., to predict the masked number when a candidate answer is provided. Instead of using forward or backward reasoning alone, we propose FOBAR to combine FOrward and BAckward Reasoning for verification. Extensive experiments on six standard mathematical data sets and three LLMs show that FOBAR achieves state-of-the-art performance. In particular, FOBAR outperforms Self-Consistency, which uses forward reasoning alone, demonstrating that combining forward and forward reasoning is better. In addition, FOBAR performs better than existing verification methods, showing the effectiveness of the simple template used in backward reasoning and the proposed combination. Extensions to non-mathematical problems are also discussed and validated empirically.  ( 3 min )
    Exploring ChatGPT for Next-generation Information Retrieval: Opportunities and Challenges
    arXiv:2402.11203v1 Announce Type: cross Abstract: The rapid advancement of artificial intelligence (AI) has highlighted ChatGPT as a pivotal technology in the field of information retrieval (IR). Distinguished from its predecessors, ChatGPT offers significant benefits that have attracted the attention of both the industry and academic communities. While some view ChatGPT as a groundbreaking innovation, others attribute its success to the effective integration of product development and market strategies. The emergence of ChatGPT, alongside GPT-4, marks a new phase in Generative AI, generating content that is distinct from training examples and exceeding the capabilities of the prior GPT-3 model by OpenAI. Unlike the traditional supervised learning approach in IR tasks, ChatGPT challenges existing paradigms, bringing forth new challenges and opportunities regarding text quality assurance, model bias, and efficiency. This paper seeks to examine the impact of ChatGPT on IR tasks and offer insights into its potential future developments.  ( 2 min )
    SIG: Speaker Identification in Literature via Prompt-Based Generation
    arXiv:2312.14590v2 Announce Type: replace-cross Abstract: Identifying speakers of quotations in narratives is an important task in literary analysis, with challenging scenarios including the out-of-domain inference for unseen speakers, and non-explicit cases where there are no speaker mentions in surrounding context. In this work, we propose a simple and effective approach SIG, a generation-based method that verbalizes the task and quotation input based on designed prompt templates, which also enables easy integration of other auxiliary tasks that further bolster the speaker identification performance. The prediction can either come from direct generation by the model, or be determined by the highest generation probability of each speaker candidate. Based on our approach design, SIG supports out-of-domain evaluation, and achieves open-world classification paradigm that is able to accept any forms of candidate input. We perform both cross-domain evaluation and in-domain evaluation on PDNC, the largest dataset of this task, where empirical results suggest that SIG outperforms previous baselines of complicated designs, as well as the zero-shot ChatGPT, especially excelling at those hard non-explicit scenarios by up to 17% improvement. Additional experiments on another dataset WP further corroborate the efficacy of SIG.  ( 2 min )
    A General Framework for User-Guided Bayesian Optimization
    arXiv:2311.14645v2 Announce Type: replace Abstract: The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.  ( 2 min )
    On Double Descent in Reinforcement Learning with LSTD and Random Features
    arXiv:2310.05518v4 Announce Type: replace Abstract: Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.  ( 3 min )
    CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
    arXiv:2310.01407v2 Announce Type: replace-cross Abstract: Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally, a conditional consistency loss enforces consistent predictions across diffusion steps, effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods, achieving a new state-of-the-art in producing high-quality images with very few steps (e.g., 1-4) across multiple tasks, including super-resolution, text-guided image editing, and depth-to-image generation.  ( 2 min )
    Foundational theories of hesitant fuzzy sets and hesitant fuzzy information systems and their applications for multi-strength intelligent classifiers
    arXiv:2311.04256v3 Announce Type: replace-cross Abstract: Hesitant fuzzy sets are widely used in certain instances of uncertainty and hesitation. In sets, the inclusion relationship is an important and foundational definition. Thus, as a kind of set, hesitant fuzzy sets require an explicit definition of inclusion relationship. Based on the hesitant fuzzy membership degree of discrete form, several kinds of inclusion relationships for hesitant fuzzy sets are proposed in this work. Then, some foundational propositions of hesitant fuzzy sets are presented, along with propositions of families of hesitant fuzzy sets. Some foundational propositions of hesitant fuzzy information systems are proposed with respect to parameter reductions and an example and an algorithm are given to illustrate the processes of parameter reduction. Finally, a multi-strength intelligent classifier is proposed to make health state diagnoses for complex systems.  ( 2 min )
    Discrete Neural Algorithmic Reasoning
    arXiv:2402.11628v1 Announce Type: new Abstract: Neural algorithmic reasoning aims to capture computations with neural networks via learning the models to imitate the execution of classical algorithms. While common architectures are expressive enough to contain the correct model in the weights space, current neural reasoners are struggling to generalize well on out-of-distribution data. On the other hand, classical computations are not affected by distribution shifts as they can be described as transitions between discrete computational states. In this work, we propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states. Trained with supervision on the algorithm's state transitions, such models are able to perfectly align with the original algorithm. To show this, we evaluate our approach on the SALSA-CLRS benchmark, where we get perfect test scores for all tasks. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data.  ( 2 min )
    Bridging Evolutionary Algorithms and Reinforcement Learning: A Comprehensive Survey
    arXiv:2401.11963v2 Announce Type: replace-cross Abstract: Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) for optimization, has demonstrated remarkable performance advancements. By fusing the strengths of both approaches, ERL has emerged as a promising research direction. This survey offers a comprehensive overview of the diverse research branches in ERL. Specifically, we systematically summarize recent advancements in relevant algorithms and identify three primary research directions: EA-assisted optimization of RL, RL-assisted optimization of EA, and synergistic optimization of EA and RL. Following that, we conduct an in-depth analysis of each research direction, organizing multiple research branches. We elucidate the problems that each branch aims to tackle and how the integration of EA and RL addresses these challenges. In conclusion, we discuss potential challenges and prospective future research directions across various research directions. To facilitate researchers in delving into ERL, we organize the algorithms and codes involved on https://github.com/yeshenpy/Awesome-Evolutionary-Reinforcement-Learning.  ( 2 min )
    Graph Language Models
    arXiv:2401.07105v2 Announce Type: replace-cross Abstract: While Language Models (LMs) are the workhorses of NLP, their interplay with structured knowledge graphs (KGs) is still actively researched. Current methods for encoding such graphs typically either (i) linearize them for embedding with LMs -- which underutilize structural information, or (ii) use Graph Neural Networks (GNNs) to preserve the graph structure -- but GNNs cannot represent text features as well as pretrained LMs. In our work we introduce a novel LM type, the Graph Language Model (GLM), that integrates the strengths of both approaches and mitigates their weaknesses. The GLM parameters are initialized from a pretrained LM to enhance understanding of individual graph concepts and triplets. Simultaneously, we design the GLM's architecture to incorporate graph biases, thereby promoting effective knowledge distribution within the graph. This enables GLMs to process graphs, texts, and interleaved inputs of both. Empirical evaluations on relation classification tasks show that GLM embeddings surpass both LM- and GNN-based baselines in supervised and zero-shot setting, demonstrating their versatility.  ( 2 min )
    PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping
    arXiv:2312.12065v2 Announce Type: replace Abstract: Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings. Our findings highlight the $O(1/\sqrt{T})$ min-iterate convergence rate specifically in the context of neural function approximation. We tackle the inherent challenges in analyzing PPO-Clip through three central concepts: (i) We introduce a generalized version of the PPO-Clip objective, illuminated by its connection with the hinge loss. (ii) Employing entropic mirror descent, we establish asymptotic convergence for tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the tabular analysis, we streamline convergence analysis by introducing a two-step policy improvement approach. This decouples policy search from complex neural policy parameterization using a regression-based update scheme. Furthermore, we gain deeper insights into the efficacy of PPO-Clip by interpreting these generalized objectives. Our theoretical findings also mark the first characterization of the influence of the clipping mechanism on PPO-Clip convergence. Importantly, the clipping range affects only the pre-constant of the convergence rate.  ( 2 min )
    ResMGCN: Residual Message Graph Convolution Network for Fast Biomedical Interactions Discovering
    arXiv:2311.07632v2 Announce Type: replace Abstract: Biomedical information graphs are crucial for interaction discovering of biomedical information in modern age, such as identification of multifarious molecular interactions and drug discovery, which attracts increasing interests in biomedicine, bioinformatics, and human healthcare communities. Nowadays, more and more graph neural networks have been proposed to learn the entities of biomedical information and precisely reveal biomedical molecule interactions with state-of-the-art results. These methods remedy the fading of features from a far distance but suffer from remedying such problem at the expensive cost of redundant memory and time. In our paper, we propose a novel Residual Message Graph Convolution Network (ResMGCN) for fast and precise biomedical interaction prediction in a different idea. Specifically, instead of enhancing the message from far nodes, ResMGCN aggregates lower-order information with the next round higher information to guide the node update to obtain a more meaningful node representation. ResMGCN is able to perceive and preserve various messages from the previous layer and high-order information in the current layer with least memory and time cost to obtain informative representations of biomedical entities. We conduct experiments on four biomedical interaction network datasets, including protein-protein, drug-drug, drug-target, and gene-disease interactions, which demonstrates that ResMGCN outperforms previous state-of-the-art models while achieving superb effectiveness on both storage and time.  ( 2 min )
    Adaptive ship-radiated noise recognition with learnable fine-grained wavelet transform
    arXiv:2306.01002v2 Announce Type: replace-cross Abstract: Analyzing the ocean acoustic environment is a tricky task. Background noise and variable channel transmission environment make it complicated to implement accurate ship-radiated noise recognition. Existing recognition systems are weak in addressing the variable underwater environment, thus leading to disappointing performance in practical application. In order to keep the recognition system robust in various underwater environments, this work proposes an adaptive generalized recognition system - AGNet (Adaptive Generalized Network). By converting fixed wavelet parameters into fine-grained learnable parameters, AGNet learns the characteristics of underwater sound at different frequencies. Its flexible and fine-grained design is conducive to capturing more background acoustic information (e.g., background noise, underwater transmission channel). To utilize the implicit information in wavelet spectrograms, AGNet adopts the convolutional neural network with parallel convolution attention modules as the classifier. Experiments reveal that our AGNet outperforms all baseline methods on several underwater acoustic datasets, and AGNet could benefit more from transfer learning. Moreover, AGNet shows robust performance against various interference factors.  ( 2 min )
    Simplifying Hyperparameter Tuning in Online Machine Learning -- The spotRiverGUI
    arXiv:2402.11594v1 Announce Type: new Abstract: Batch Machine Learning (BML) reaches its limits when dealing with very large amounts of streaming data. This is especially true for available memory, handling drift in data streams, and processing new, unknown data. Online Machine Learning (OML) is an alternative to BML that overcomes the limitations of BML. OML is able to process data in a sequential manner, which is especially useful for data streams. The `river` package is a Python OML-library, which provides a variety of online learning algorithms for classification, regression, clustering, anomaly detection, and more. The `spotRiver` package provides a framework for hyperparameter tuning of OML models. The `spotRiverGUI` is a graphical user interface for the `spotRiver` package. The `spotRiverGUI` releases the user from the burden of manually searching for the optimal hyperparameter setting. After the data is provided, users can compare different OML algorithms from the powerful `river` package in a convenient way and tune the selected algorithms very efficiently.  ( 2 min )
    Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection
    arXiv:2402.11242v1 Announce Type: new Abstract: Learning with noisy labels has gained increasing attention because the inevitable imperfect labels in real-world scenarios can substantially hurt the deep model performance. Recent studies tend to regard low-loss samples as clean ones and discard high-loss ones to alleviate the negative impact of noisy labels. However, real-world datasets contain not only noisy labels but also class imbalance. The imbalance issue is prone to causing failure in the loss-based sample selection since the under-learning of tail classes also leans to produce high losses. To this end, we propose a simple yet effective method to address noisy labels in imbalanced datasets. Specifically, we propose Class-Balance-based sample Selection (CBS) to prevent the tail class samples from being neglected during training. We propose Confidence-based Sample Augmentation (CSA) for the chosen clean samples to enhance their reliability in the training process. To exploit selected noisy samples, we resort to prediction history to rectify labels of noisy samples. Moreover, we introduce the Average Confidence Margin (ACM) metric to measure the quality of corrected labels by leveraging the model's evolving training dynamics, thereby ensuring that low-quality corrected noisy samples are appropriately masked out. Lastly, consistency regularization is imposed on filtered label-corrected noisy samples to boost model performance. Comprehensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method, especially in imbalanced scenarios. Comprehensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method, especially in imbalanced scenarios.  ( 3 min )
    Evaluating the Stability of Deep Learning Latent Feature Spaces
    arXiv:2402.11404v1 Announce Type: new Abstract: High-dimensional datasets present substantial challenges in statistical modeling across various disciplines, necessitating effective dimensionality reduction methods. Deep learning approaches, notable for their capacity to distill essential features from complex data, facilitate modeling, visualization, and compression through reduced dimensionality latent feature spaces, have wide applications from bioinformatics to earth sciences. This study introduces a novel workflow to evaluate the stability of these latent spaces, ensuring consistency and reliability in subsequent analyses. Stability, defined as the invariance of latent spaces to minor data, training realizations, and parameter perturbations, is crucial yet often overlooked. Our proposed methodology delineates three stability types, sample, structural, and inferential, within latent spaces, and introduces a suite of metrics for comprehensive evaluation. We implement this workflow across 500 autoencoder realizations and three datasets, encompassing both synthetic and real-world scenarios to explain latent space dynamics. Employing k-means clustering and the modified Jonker-Volgenant algorithm for class alignment, alongside anisotropy metrics and convex hull analysis, we introduce adjusted stress and Jaccard dissimilarity as novel stability indicators. Our findings highlight inherent instabilities in latent feature spaces and demonstrate the workflow's efficacy in quantifying and interpreting these instabilities. This work advances the understanding of latent feature spaces, promoting improved model interpretability and quality control for more informed decision-making for diverse analytical workflows that leverage deep learning.  ( 2 min )
    Towards Versatile Graph Learning Approach: from the Perspective of Large Language Models
    arXiv:2402.11641v1 Announce Type: new Abstract: Graph-structured data are the commonly used and have wide application scenarios in the real world. For these diverse applications, the vast variety of learning tasks, graph domains, and complex graph learning procedures present challenges for human experts when designing versatile graph learning approaches. Facing these challenges, large language models (LLMs) offer a potential solution due to the extensive knowledge and the human-like intelligence. This paper proposes a novel conceptual prototype for designing versatile graph learning methods with LLMs, with a particular focus on the ``where'' and ``how'' perspectives. From the ``where'' perspective, we summarize four key graph learning procedures, including task definition, graph data feature engineering, model selection and optimization, deployment and serving. We then explore the application scenarios of LLMs in these procedures across a wider spectrum. In the ``how'' perspective, we align the abilities of LLMs with the requirements of each procedure. Finally, we point out the promising directions that could better leverage the strength of LLMs towards versatile graph learning methods.  ( 2 min )
    DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing
    arXiv:2311.01450v2 Announce Type: replace Abstract: Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we found that surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks both in sample efficiency and final performance without losing performance on common benchmarks, such as Deepmind Control Suite and Atari benchmarks.  ( 2 min )
    Quantum-Inspired Analysis of Neural Network Vulnerabilities: The Role of Conjugate Variables in System Attacks
    arXiv:2402.10983v1 Announce Type: new Abstract: Neural networks demonstrate inherent vulnerability to small, non-random perturbations, emerging as adversarial attacks. Such attacks, born from the gradient of the loss function relative to the input, are discerned as input conjugates, revealing a systemic fragility within the network structure. Intriguingly, a mathematical congruence manifests between this mechanism and the quantum physics' uncertainty principle, casting light on a hitherto unanticipated interdisciplinarity. This inherent susceptibility within neural network systems is generally intrinsic, highlighting not only the innate vulnerability of these networks but also suggesting potential advancements in the interdisciplinary area for understanding these black-box networks.  ( 2 min )
    In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness
    arXiv:2402.11639v1 Announce Type: new Abstract: A striking property of transformers is their ability to perform in-context learning (ICL), a machine learning framework in which the learner is presented with a novel context during inference implicitly through some data, and tasked with making a prediction in that context. As such that learner must adapt to the context without additional training. We explore the role of softmax attention in an ICL setting where each context encodes a regression task. We show that an attention unit learns a window that it uses to implement a nearest-neighbors predictor adapted to the landscape of the pretraining tasks. Specifically, we show that this window widens with decreasing Lipschitzness and increasing label noise in the pretraining tasks. We also show that on low-rank, linear problems, the attention unit learns to project onto the appropriate subspace before inference. Further, we show that this adaptivity relies crucially on the softmax activation and thus cannot be replicated by the linear activation often studied in prior theoretical analyses.  ( 2 min )
    Discovering stochastic dynamical equations from biological time series data
    arXiv:2205.02645v5 Announce Type: replace-cross Abstract: Stochastic differential equations (SDEs) are an important framework to model dynamics with randomness, as is common in most biological systems. The inverse problem of integrating these models with empirical data remains a major challenge. Here, we present an equation discovery methodology that takes time series data as an input, analyses fine scale fluctuations and outputs an interpretable SDE that can correctly capture long-time dynamics of data. We achieve this by combining traditional approaches from stochastic calculus literature with state-of-the-art equation discovery techniques. We validate our approach on synthetic datasets, and demonstrate the generality and applicability of the method on two real-world datasets of vastly different spatiotemporal scales: (i) collective movement of fish school where stochasticity plays a crucial role, and (ii) confined migration of a single cell, primarily following a relaxed oscillation. We make the method available as an easy-to-use, open-source Python package, PyDaddy (Python Library for Data Driven Dynamics).  ( 3 min )
    Theoretical foundations for programmatic reinforcement learning
    arXiv:2402.11650v1 Announce Type: new Abstract: The field of Reinforcement Learning (RL) is concerned with algorithms for learning optimal policies in unknown stochastic environments. Programmatic RL studies representations of policies as programs, meaning involving higher order constructs such as control loops. Despite attracting a lot of attention at the intersection of the machine learning and formal methods communities, very little is known on the theoretical front about programmatic RL: what are good classes of programmatic policies? How large are optimal programmatic policies? How can we learn them? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL.  ( 2 min )
    Space and Time Continuous Physics Simulation From Partial Observations
    arXiv:2401.09198v2 Announce Type: replace Abstract: Modern techniques for physical simulations rely on numerical schemes and mesh-refinement methods to address trade-offs between precision and complexity, but these handcrafted solutions are tedious and require high computational power. Data-driven methods based on large-scale machine learning promise high adaptivity by integrating long-range dependencies more directly and efficiently. In this work, we focus on fluid dynamics and address the shortcomings of a large part of the literature, which are based on fixed support for computations and predictions in the form of regular or irregular grids. We propose a novel setup to perform predictions in a continuous spatial and temporal domain while being trained on sparse observations. We formulate the task as a double observation problem and propose a solution with two interlinked dynamical systems defined on, respectively, the sparse positions and the continuous domain, which allows to forecast and interpolate a solution from the initial condition. Our practical implementation involves recurrent GNNs and a spatio-temporal attention observer capable of interpolating the solution at arbitrary locations. Our model not only generalizes to new initial conditions (as standard auto-regressive models do) but also performs evaluation at arbitrary space and time locations. We evaluate on three standard datasets in fluid dynamics and compare to strong baselines, which are outperformed both in classical settings and in the extended new task requiring continuous predictions.  ( 3 min )
    3D-U-SAM Network For Few-shot Tooth Segmentation in CBCT Images
    arXiv:2309.11015v2 Announce Type: replace-cross Abstract: Accurate representation of tooth position is extremely important in treatment. 3D dental image segmentation is a widely used method, however labelled 3D dental datasets are a scarce resource, leading to the problem of small samples that this task faces in many cases. To this end, we address this problem with a pretrained SAM and propose a novel 3D-U-SAM network for 3D dental image segmentation. Specifically, in order to solve the problem of using 2D pre-trained weights on 3D datasets, we adopted a convolution approximation method; in order to retain more details, we designed skip connections to fuse features at all levels with reference to U-Net. The effectiveness of the proposed method is demonstrated in ablation experiments, comparison experiments, and sample size experiments.  ( 2 min )
    Towards Financially Inclusive Credit Products Through Financial Time Series Clustering
    arXiv:2402.11066v1 Announce Type: new Abstract: Financial inclusion ensures that individuals have access to financial products and services that meet their needs. As a key contributing factor to economic growth and investment opportunity, financial inclusion increases consumer spending and consequently business development. It has been shown that institutions are more profitable when they provide marginalised social groups access to financial services. Customer segmentation based on consumer transaction data is a well-known strategy used to promote financial inclusion. While the required data is available to modern institutions, the challenge remains that segment annotations are usually difficult and/or expensive to obtain. This prevents the usage of time series classification models for customer segmentation based on domain expert knowledge. As a result, clustering is an attractive alternative to partition customers into homogeneous groups based on the spending behaviour encoded within their transaction data. In this paper, we present a solution to one of the key challenges preventing modern financial institutions from providing financially inclusive credit, savings and insurance products: the inability to understand consumer financial behaviour, and hence risk, without the introduction of restrictive conventional credit scoring techniques. We present a novel time series clustering algorithm that allows institutions to understand the financial behaviour of their customers. This enables unique product offerings to be provided based on the needs of the customer, without reliance on restrictive credit practices.  ( 3 min )
    Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
    arXiv:2311.11509v3 Announce Type: replace-cross Abstract: In recent years, Large Language Models (LLM) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that mislead LLMs into generating incorrect or undesired outputs. Previous work has revealed that with relatively simple yet effective attacks based on discrete optimization, it is possible to generate adversarial prompts that bypass moderation and alignment of the models. This vulnerability to adversarial prompts underscores a significant concern regarding the robustness and reliability of LLMs. Our work aims to address this concern by introducing a novel approach to detecting adversarial prompts at a token level, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity, where tokens predicted with high probability are considered normal, and those exhibiting high perplexity are flagged as adversarial. Additionaly, our method also integrates context understanding by incorporating neighboring token information to encourage the detection of contiguous adversarial prompt sequences. To this end, we design two algorithms for adversarial prompt detection: one based on optimization techniques and another on Probabilistic Graphical Models (PGM). Both methods are equipped with efficient solving methods, ensuring efficient adversarial prompt detection. Our token-level detection result can be visualized as heatmap overlays on the text sequence, allowing for a clearer and more intuitive representation of which part of the text may contain adversarial prompts.  ( 3 min )
    PRewrite: Prompt Rewriting with Reinforcement Learning
    arXiv:2401.08189v2 Announce Type: replace-cross Abstract: Prompt engineering is critical for the development of LLM-based applications. However, it is usually done manually in a "trial and error" fashion. This manual procedure can be time consuming, ineffective, and the generated prompts are, in a lot of cases, sub-optimal. Even for the prompts which seemingly work well, there is always a lingering question: can the prompts be made better with further modifications? To address these questions, in this paper, we investigate prompt engineering automation. We consider a specific use case scenario in which developers/users have drafted initial prompts, but lack the time/expertise to optimize them. We propose PRewrite, an automated tool to rewrite these drafts and to generate highly effective new prompts. PRewrite is based on the Reinforcement Learning (RL) framework which allows for end-to-end optimization and our design allows the RL search to happen in a large action space. The automated tool leverages manually crafted prompts as starting points which makes the rewriting procedure more guided and efficient. The generated prompts are human readable, and self-explanatory, unlike some of those in previous works. We conducted extensive experiments on diverse datasets and found that the prompts generated with this new method not only outperform professionally crafted prompts, but also prompts generated with other previously proposed methods.  ( 3 min )
    SymTC: A Symbiotic Transformer-CNN Net for Instance Segmentation of Lumbar Spine MRI
    arXiv:2401.09627v3 Announce Type: replace-cross Abstract: Intervertebral disc disease, a prevalent ailment, frequently leads to intermittent or persistent low back pain, and diagnosing and assessing of this disease rely on accurate measurement of vertebral bone and intervertebral disc geometries from lumbar MR images. Deep neural network (DNN) models may assist clinicians with more efficient image segmentation of individual instances (disks and vertebrae) of the lumbar spine in an automated way, which is termed as instance image segmentation. In this work, we proposed SymTC, an innovative lumbar spine MR image segmentation model that combines the strengths of Transformer and Convolutional Neural Network (CNN). Specifically, we designed a parallel dual-path architecture to merge CNN layers and Transformer layers, and we integrated a novel position embedding into the self-attention module of Transformer, enhancing the utilization of positional information for more accurate segmentation. To further improves model performance, we introduced a new data augmentation technique to create synthetic yet realistic MR image dataset, named SSMSpine, which is made publicly available. We evaluated our SymTC and the other 15 existing image segmentation models on our private in-house dataset and the public SSMSpine dataset, using two metrics, Dice Similarity Coefficient and 95% Hausdorff Distance. The results show that our SymTC has the best performance for segmenting vertebral bones and intervertebral discs in lumbar spine MR images. The SymTC code and SSMSpine dataset are available at https://github.com/jiasongchen/SymTC.  ( 3 min )
    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
    arXiv:2312.08935v3 Announce Type: replace-cross Abstract: In this paper, we present an innovative process-oriented math process reward model called \textbf{Math-Shepherd}, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \textit{Verification}: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) \textit{Reinforcement Learning}: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9\%$\to$84.1\% on GSM8K and 28.6\%$\to$33.0\% on MATH). The accuracy can be further enhanced to 89.1\% and 43.5\% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.  ( 2 min )
    Can LLMs Patch Security Issues?
    arXiv:2312.00024v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown impressive proficiency in code generation. Nonetheless, similar to human developers, these models might generate code that contains security vulnerabilities and flaws. Writing secure code remains a substantial challenge, as vulnerabilities often arise during interactions between programs and external systems or services, such as databases and operating systems. In this paper, we propose a novel approach, Feedback-Driven Solution Synthesis (FDSS), designed to explore the use of LLMs in receiving feedback from Bandit, which is a static code analysis tool, and then the LLMs generate potential solutions to resolve security vulnerabilities. Each solution, along with the vulnerable code, is then sent back to the LLM for code refinement. Our approach shows a significant improvement over the baseline and outperforms existing approaches. Furthermore, we introduce a new dataset, PythonSecurityEval, collected from real-world scenarios on Stack Overflow to evaluate the LLMs' ability to generate secure code. Code and data are available at \url{https://github.com/Kamel773/LLM-code-refine}  ( 2 min )
    Combining Machine Learning and Ontology: A Systematic Literature Review
    arXiv:2401.07744v2 Announce Type: replace-cross Abstract: Motivated by the desire to explore the process of combining inductive and deductive reasoning, we conducted a systematic literature review of articles that investigate the integration of machine learning and ontologies. The objective was to identify diverse techniques that incorporate both inductive reasoning (performed by machine learning) and deductive reasoning (performed by ontologies) into artificial intelligence systems. Our review, which included the analysis of 128 studies, allowed us to identify three main categories of hybridization between machine learning and ontologies: learning-enhanced ontologies, semantic data mining, and learning and reasoning systems. We provide a comprehensive examination of all these categories, emphasizing the various machine learning algorithms utilized in the studies. Furthermore, we compared our classification with similar recent work in the field of hybrid AI and neuro-symbolic approaches.  ( 2 min )
    Adversarial Illusions in Multi-Modal Embeddings
    arXiv:2308.11804v3 Announce Type: replace-cross Abstract: Multi-modal embeddings encode texts, images, sounds, videos, etc., into a single embedding space, aligning representations across different modalities (e.g., associate an image of a dog with a barking sound). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary is free to align any image and any sound with any target of his choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities, enabling a wholesale compromise of current and future downstream tasks and modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate transferability of illusions across different embeddings and develop a black-box version of our method that we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.  ( 2 min )
    Reinforcement Learning with Maskable Stock Representation for Portfolio Management in Customizable Stock Pools
    arXiv:2311.10801v3 Announce Type: replace-cross Abstract: Portfolio management (PM) is a fundamental financial trading task, which explores the optimal periodical reallocation of capitals into different stocks to pursue long-term profits. Reinforcement learning (RL) has recently shown its potential to train profitable agents for PM through interacting with financial markets. However, existing work mostly focuses on fixed stock pools, which is inconsistent with investors' practical demand. Specifically, the target stock pool of different investors varies dramatically due to their discrepancy on market states and individual investors may temporally adjust stocks they desire to trade (e.g., adding one popular stocks), which lead to customizable stock pools (CSPs). Existing RL methods require to retrain RL agents even with a tiny change of the stock pool, which leads to high computational cost and unstable performance. To tackle this challenge, we propose EarnMore, a rEinforcement leARNing framework with Maskable stOck REpresentation to handle PM with CSPs through one-shot training in a global stock pool (GSP). Specifically, we first introduce a mechanism to mask out the representation of the stocks outside the target pool. Second, we learn meaningful stock representations through a self-supervised masking and reconstruction process. Third, a re-weighting mechanism is designed to make the portfolio concentrate on favorable stocks and neglect the stocks outside the target pool. Through extensive experiments on 8 subset stock pools of the US stock market, we demonstrate that EarnMore significantly outperforms 14 state-of-the-art baselines in terms of 6 popular financial metrics with over 40% improvement on profit.  ( 3 min )
    Robust Errant Beam Prognostics with Conditional Modeling for Particle Accelerators
    arXiv:2312.10040v2 Announce Type: replace-cross Abstract: Particle accelerators are complex and comprise thousands of components, with many pieces of equipment running at their peak power. Consequently, particle accelerators can fault and abort operations for numerous reasons. These faults impact the availability of particle accelerators during scheduled run-time and hamper the efficiency and the overall science output. To avoid these faults, we apply anomaly detection techniques to predict any unusual behavior and perform preemptive actions to improve the total availability of particle accelerators. Semi-supervised Machine Learning (ML) based anomaly detection approaches such as autoencoders and variational autoencoders are often used for such tasks. However, supervised ML techniques such as Siamese Neural Network (SNN) models can outperform unsupervised or semi-supervised approaches for anomaly detection by leveraging the label information. One of the challenges specific to anomaly detection for particle accelerators is the data's variability due to system configuration changes. To address this challenge, we employ Conditional Siamese Neural Network (CSNN) models and Conditional Variational Auto Encoder (CVAE) models to predict errant beam pulses at the Spallation Neutron Source (SNS) under different system configuration conditions and compare their performance. We demonstrate that CSNN outperforms CVAE in our application.  ( 2 min )
    SINCERE: Supervised Information Noise-Contrastive Estimation REvisited
    arXiv:2309.14277v2 Announce Type: replace-cross Abstract: The information noise-contrastive estimation (InfoNCE) loss function provides the basis of many self-supervised deep learning methods due to its strong empirical results and theoretic motivation. Previous work suggests a supervised contrastive (SupCon) loss to extend InfoNCE to learn from available class labels. However, in this work we find that the prior SupCon loss formulation has questionable justification because it can encourage some images from the same class to repel one another in the learned embedding space. We propose the Supervised InfoNCE REvisited (SINCERE) loss as a theoretically-justified supervised extension of InfoNCE that never causes images from the same class to repel one another. Experiments show that SINCERE leads to better separation of embeddings from different classes while delivering competitive classification accuracy for supervised and transfer learning. We further show an information-theoretic bound that relates SINCERE loss to the symmeterized KL divergence between data-generating distributions for a target class and all other classes.  ( 2 min )
    Open-Set Graph Anomaly Detection via Normal Structure Regularisation
    arXiv:2311.06835v2 Announce Type: replace Abstract: This paper considers an important Graph Anomaly Detection (GAD) task, namely open-set GAD, which aims to detect anomalous nodes using a small number of labelled training normal and anomaly nodes (known as seen anomalies) that cannot illustrate all possible inference-time abnormalities. The availability of that labelled data provides crucial prior knowledge about abnormalities for GAD models, enabling substantially reduced detection errors. However, current methods tend to over-emphasise fitting the seen anomalies, leading to a weak generalisation ability to detect unseen anomalies, i.e., those that are not illustrated by the labelled anomaly nodes. Further, they were introduced to handle Euclidean data, failing to effectively capture important non-Euclidean features for GAD. In this work, we propose a novel open-set GAD approach, namely Normal Structure Regularisation (NSReg), to achieve generalised detection ability to unseen anomalies, while maintaining its effectiveness on detecting seen anomalies. The key idea in NSReg is to introduce a regularisation term that enforces the learning of compact, semantically-rich representations of normal nodes based on their structural relations to other nodes. When being optimised with supervised anomaly detection losses, the regularisation term helps incorporate strong normality into the modelling, empowering the joint learning of both seen abnormality and normality of the nodes, and thus, it effectively avoids the over emphasis on solely fitting the seen anomalies during training. Extensive empirical results on six real-world datasets demonstrate the superiority of our proposed NSReg for open-set GAD.  ( 3 min )
    Adversarial Preference Optimization
    arXiv:2311.08045v2 Announce Type: replace-cross Abstract: Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing aligning methods depend on manually annotated preference data to guide the LLM optimization directions. However, in practice, continuously updating LLMs raises a distribution gap between model-generated samples and human-preferred responses, which hinders model fine-tuning efficiency. To mitigate this issue, previous methods require additional preference annotation on generated samples to adapt the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an adversarial preference optimization (APO) framework, where the LLM agent and the preference model update alternatively via a min-max game. Without additional annotation, our APO method can make a self-adaption to the generation distribution gap through the adversarial learning process. Based on comprehensive experiments, we find APO further enhances the alignment performance of baseline methods in terms of helpfulness and harmlessness.  ( 2 min )
    LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions
    arXiv:2311.11904v2 Announce Type: replace-cross Abstract: Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advancements in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and inaccuracy. We attribute this to two primary factors: 1) the reliance on single-turn textual interactions with LLMs, leading to a mismatch between generated text and visual concepts for VLMs; 2) the oversight of the inter-class relationships, resulting in descriptors that fail to differentiate similar classes effectively. In this paper, we propose a novel framework that integrates LLMs and VLMs to find the optimal class descriptors. Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors. We demonstrate our optimized descriptors are of high quality which effectively improves classification accuracy on a wide range of benchmarks. Additionally, these descriptors offer explainable and robust features, boosting performance across various backbone models and complementing fine-tuning-based methods.  ( 2 min )
    Granularity at Scale: Estimating Neighborhood Socioeconomic Indicators from High-Resolution Orthographic Imagery and Hybrid Learning
    arXiv:2309.16808v3 Announce Type: replace-cross Abstract: Many areas of the world are without basic information on the socioeconomic well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help "fill in the gaps" where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches, a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words, estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density (R$^2$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semi-supervised approach provides a foundation for future work seeking to estimate fine-scale information from aerial imagery without the need for label data.  ( 3 min )
    Gait Event Detection and Travel Distance Using Waist-Worn Accelerometers across a Range of Speeds: Automated Approach
    arXiv:2307.04866v2 Announce Type: replace-cross Abstract: Estimation of temporospatial clinical features of gait (CFs), such as step count and length, step duration, step frequency, gait speed, and distance traveled, is an important component of community-based mobility evaluation using wearable accelerometers. However, accurate unsupervised computerized measurement of CFs of individuals with Duchenne muscular dystrophy (DMD) who have progressive loss of ambulatory mobility is difficult due to differences in patterns and magnitudes of acceleration across their range of attainable gait velocities. This paper proposes a novel calibration method. It aims to detect steps, estimate stride lengths, and determine travel distance. The approach involves a combination of clinical observation, machine-learning-based step detection, and regression-based stride length prediction. The method demonstrates high accuracy in children with DMD and typically developing controls (TDs) regardless of the participant's level of ability. Fifteen children with DMD and fifteen TDs underwent supervised clinical testing across a range of gait speeds using 10 m or 25 m run/walk (10 MRW, 25 MRW), 100 m run/walk (100 MRW), 6-min walk (6 MWT), and free-walk (FW) evaluations while wearing a mobile-phone-based accelerometer at the waist near the body's center of mass. Following calibration by a trained clinical evaluator, CFs were extracted from the accelerometer data using a multi-step machine-learning-based process and the results were compared to ground-truth observation data. Model predictions vs. observed values for step counts, distance traveled, and step length showed a strong correlation. Our study findings indicate that a single waist-worn accelerometer calibrated to an individual's stride characteristics using our methods accurately measures CFs and estimates travel distances across a common range of gait speeds in both DMD and TD peers.  ( 3 min )
    Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion
    arXiv:2311.06318v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) excel at tackling various natural language tasks. However, due to the significant costs involved in re-training or fine-tuning them, they remain largely static and difficult to personalize. Nevertheless, a variety of applications could benefit from generations that are tailored to users' preferences, goals, and knowledge. Among them is web search, where knowing what a user is trying to accomplish, what they care about, and what they know can lead to improved search experiences. In this work, we propose a novel and general approach that augments an LLM with relevant context from users' interaction histories with a search engine in order to personalize its outputs. Specifically, we construct an entity-centric knowledge store for each user based on their search and browsing activities on the web, which is then leveraged to provide contextually relevant LLM prompt augmentations. This knowledge store is light-weight, since it only produces user-specific aggregate projections of interests and knowledge onto public knowledge graphs, and leverages existing search log infrastructure, thereby mitigating the privacy, compliance, and scalability concerns associated with building deep user profiles for personalization. We validate our approach on the task of contextual query suggestion, which requires understanding not only the user's current search context but also what they historically know and care about. Through a number of experiments based on human evaluation, we show that our approach is significantly better than several other LLM-powered baselines, generating query suggestions that are contextually more relevant, personalized, and useful.  ( 3 min )
    Large Language Model Unlearning
    arXiv:2310.10683v2 Announce Type: replace-cross Abstract: We study how to perform unlearning, i.e. forgetting undesirable misbehaviors, on large language models (LLMs). We show at least three scenarios of aligning LLMs with human preferences can benefit from unlearning: (1) removing harmful responses, (2) erasing copyright-protected content as requested, and (3) reducing hallucinations. Unlearning, as an alignment technique, has three advantages. (1) It only requires negative (e.g. harmful) examples, which are much easier and cheaper to collect (e.g. via red teaming or user reporting) than positive (e.g. helpful and often human-written) examples required in RLHF (RL from human feedback). (2) It is computationally efficient. (3) It is especially effective when we know which training samples cause the misbehavior. To the best of our knowledge, our work is among the first to explore LLM unlearning. We are also among the first to formulate the settings, goals, and evaluations in LLM unlearning. We show that if practitioners only have limited resources, and therefore the priority is to stop generating undesirable outputs rather than to try to generate desirable outputs, unlearning is particularly appealing. Despite only having negative samples, our ablation study shows that unlearning can still achieve better alignment performance than RLHF with just 2% of its computational time.  ( 2 min )
    Optimizing Performance of Feedforward and Convolutional Neural Networks through Dynamic Activation Functions
    arXiv:2308.05724v2 Announce Type: replace Abstract: Deep learning training training algorithms are a huge success in recent years in many fields including speech, text,image video etc. Deeper and deeper layers are proposed with huge success with resnet structures having around 152 layers. Shallow convolution neural networks(CNN's) are still an active research, where some phenomena are still unexplained. Activation functions used in the network are of utmost importance, as they provide non linearity to the networks. Relu's are the most commonly used activation function.We show a complex piece-wise linear(PWL) activation in the hidden layer. We show that these PWL activations work much better than relu activations in our networks for convolution neural networks and multilayer perceptrons. Result comparison in PyTorch for shallow and deep CNNs are given to further strengthen our case.  ( 2 min )
    DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
    arXiv:2309.05173v5 Announce Type: replace-cross Abstract: Prompt tuning (PT), where a small amount of trainable soft (continuous) prompt vectors is affixed to the input of language models (LM), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. Particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving substantial memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline, in some scenarios. Additionally, we empirically show that DEPT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.  ( 3 min )
    From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
    arXiv:2310.00492v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts from user prompts, and promotes the response generation constantly conditioned on user instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at interpreting and optimizing LLMs for various applications.  ( 3 min )
    Contextual Pre-Planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning
    arXiv:2307.05209v3 Announce Type: replace-cross Abstract: Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task's rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains.  ( 2 min )
    Continual Learning on Graphs: Challenges, Solutions, and Opportunities
    arXiv:2402.11565v1 Announce Type: new Abstract: Continual learning on graph data has recently attracted paramount attention for its aim to resolve the catastrophic forgetting problem on existing tasks while adapting the sequentially updated model to newly emerged graph tasks. While there have been efforts to summarize progress on continual learning research over Euclidean data, e.g., images and texts, a systematic review of progress in continual learning on graphs, a.k.a, continual graph learning (CGL) or lifelong graph learning, is still demanding. Graph data are far more complex in terms of data structures and application scenarios, making CGL task settings, model designs, and applications extremely challenging. To bridge the gap, we provide a comprehensive review of existing continual graph learning (CGL) algorithms by elucidating the different task settings and categorizing the existing methods based on their characteristics. We compare the CGL methods with traditional continual learning techniques and analyze the applicability of the traditional continual learning techniques to CGL tasks. Additionally, we review the benchmark works that are crucial to CGL research. Finally, we discuss the remaining challenges and propose several future directions. We will maintain an up-to-date GitHub repository featuring a comprehensive list of CGL algorithms, accessible at https://github.com/UConn-DSIS/Survey-of-Continual-Learning-on-Graphs.  ( 2 min )
    RaTrack: Moving Object Detection and Tracking with 4D Radar Point Cloud
    arXiv:2309.09737v4 Announce Type: replace-cross Abstract: Mobile autonomy relies on the precise perception of dynamic environments. Robustly tracking moving objects in 3D world thus plays a pivotal role for applications like trajectory prediction, obstacle avoidance, and path planning. While most current methods utilize LiDARs or cameras for Multiple Object Tracking (MOT), the capabilities of 4D imaging radars remain largely unexplored. Recognizing the challenges posed by radar noise and point sparsity in 4D radar data, we introduce RaTrack, an innovative solution tailored for radar-based tracking. Bypassing the typical reliance on specific object types and 3D bounding boxes, our method focuses on motion segmentation and clustering, enriched by a motion estimation module. Evaluated on the View-of-Delft dataset, RaTrack showcases superior tracking precision of moving objects, largely surpassing the performance of the state of the art. We release our code and model at https://github.com/LJacksonPan/RaTrack.  ( 2 min )
    Dynamic nowcast of the New Zealand greenhouse gas inventory
    arXiv:2402.11107v1 Announce Type: new Abstract: As efforts to mitigate the effects of climate change grow, reliable and thorough reporting of greenhouse gas emissions are essential for measuring progress towards international and domestic emissions reductions targets. New Zealand's national emissions inventories are currently reported between 15 to 27 months out-of-date. We present a machine learning approach to nowcast (dynamically estimate) national greenhouse gas emissions in New Zealand in advance of the national emissions inventory's release, with just a two month latency due to current data availability. Key findings include an estimated 0.2% decrease in national gross emissions since 2020 (as at July 2022). Our study highlights the predictive power of a dynamic view of emissions intensive activities. This methodology is a proof of concept that a machine learning approach can make sub-annual estimates of national greenhouse gas emissions by sector with a relatively low error that could be of value for policy makers.  ( 2 min )
    Temporal Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation
    arXiv:2402.11558v1 Announce Type: new Abstract: Spatiotemporal data analysis is pivotal across various domains, including transportation, meteorology, and healthcare. However, the data collected in real-world scenarios often suffers incompleteness due to sensor malfunctions and network transmission errors. Spatiotemporal imputation endeavours to predict missing values by exploiting the inherent spatial and temporal dependencies present in the observed data. Traditional approaches, which rely on classical statistical and machine learning techniques, are often inadequate, particularly when the data fails to meet strict distributional assumptions. In contrast, recent deep learning-based methods, leveraging graph and recurrent neural networks, have demonstrated enhanced efficacy. Nonetheless, these approaches are prone to error accumulation. Generative models have been increasingly adopted to circumvent the reliance on potentially inaccurate historical imputed values for future predictions. These models grapple with the challenge of producing unstable results, a particular issue in diffusion-based models. We aim to address these challenges by designing conditional features to guide the generative process and expedite training. Specifically, we introduce C$^2$TSD, a novel approach incorporating trend and seasonal information as conditional features and employing contrastive learning to improve model generalizability. The extensive experiments on three real-world datasets demonstrate the superior performance of C$^2$TSD over various state-of-the-art baselines.  ( 2 min )
    Multi Task Inverse Reinforcement Learning for Common Sense Reward
    arXiv:2402.11367v1 Announce Type: new Abstract: One of the challenges in applying reinforcement learning in a complex real-world environment lies in providing the agent with a sufficiently detailed reward function. Any misalignment between the reward and the desired behavior can result in unwanted outcomes. This may lead to issues like "reward hacking" where the agent maximizes rewards by unintended behavior. In this work, we propose to disentangle the reward into two distinct parts. A simple task-specific reward, outlining the particulars of the task at hand, and an unknown common-sense reward, indicating the expected behavior of the agent within the environment. We then explore how this common-sense reward can be learned from expert demonstrations. We first show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function. That is, training a new agent with the learned reward does not impair the desired behaviors. We then demonstrate that this problem can be solved by training simultaneously on multiple tasks. That is, multi-task inverse reinforcement learning can be applied to learn a useful reward function.  ( 2 min )
    Reinforcement learning to maximise wind turbine energy generation
    arXiv:2402.11384v1 Announce Type: new Abstract: We propose a reinforcement learning strategy to control wind turbine energy generation by actively changing the rotor speed, the rotor yaw angle and the blade pitch angle. A double deep Q-learning with a prioritized experience replay agent is coupled with a blade element momentum model and is trained to allow control for changing winds. The agent is trained to decide the best control (speed, yaw, pitch) for simple steady winds and is subsequently challenged with real dynamic turbulent winds, showing good performance. The double deep Q- learning is compared with a classic value iteration reinforcement learning control and both strategies outperform a classic PID control in all environments. Furthermore, the reinforcement learning approach is well suited to changing environments including turbulent/gusty winds, showing great adaptability. Finally, we compare all control strategies with real winds and compute the annual energy production. In this case, the double deep Q-learning algorithm also outperforms classic methodologies.  ( 2 min )
    Ransomware detection using stacked autoencoder for feature selection
    arXiv:2402.11342v1 Announce Type: new Abstract: The aim of this study is to propose and evaluate an advanced ransomware detection and classification method that combines a Stacked Autoencoder (SAE) for precise feature selection with a Long Short Term Memory (LSTM) classifier to enhance ransomware stratification accuracy. The proposed approach involves thorough pre processing of the UGRansome dataset and training an unsupervised SAE for optimal feature selection or fine tuning via supervised learning to elevate the LSTM model's classification capabilities. The study meticulously analyzes the autoencoder's learned weights and activations to identify essential features for distinguishing ransomware families from other malware and creates a streamlined feature set for precise classification. Extensive experiments, including up to 400 epochs and varying learning rates, are conducted to optimize the model's performance. The results demonstrate the outstanding performance of the SAE-LSTM model across all ransomware families, boasting high precision, recall, and F1 score values that underscore its robust classification capabilities. Furthermore, balanced average scores affirm the proposed model's ability to generalize effectively across various malware types. The proposed model achieves an exceptional 99% accuracy in ransomware classification, surpassing the Extreme Gradient Boosting (XGBoost) algorithm primarily due to its effective SAE feature selection mechanism. The model also demonstrates outstanding performance in identifying signature attacks, achieving a 98% accuracy rate.  ( 2 min )
    Accelerating Approximate Thompson Sampling with Underdamped Langevin Monte Carlo
    arXiv:2401.11665v2 Announce Type: replace-cross Abstract: Approximate Thompson sampling with Langevin Monte Carlo broadens its reach from Gaussian posterior sampling to encompass more general smooth posteriors. However, it still encounters scalability issues in high-dimensional problems when demanding high accuracy. To address this, we propose an approximate Thompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where the latter is the go-to workhorse for simulations of high-dimensional posteriors. Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling using a specific potential function. This design improves the sample complexity for realizing logarithmic regrets from $\mathcal{\tilde O}(d)$ to $\mathcal{\tilde O}(\sqrt{d})$. The scalability and robustness of our algorithm are also empirically validated through synthetic experiments in high-dimensional bandit problems.  ( 2 min )
    Learning from higher-order statistics, efficiently: hypothesis tests, random features, and neural networks
    arXiv:2312.14922v2 Announce Type: replace-cross Abstract: Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p\ge 4$ cumulants of $d$-dimensional inputs. We first characterise the fundamental statistical and computational limits of recovering the spike by analysing the number of samples $n$ required to strongly distinguish between inputs from the spiked cumulant model and isotropic Gaussian inputs. We find that statistical distinguishability requires $n\gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, i.e. those covered by the low-degree conjecture. These results suggest the existence of a wide statistical-to-computational gap in this problem. Numerical experiments show that neural networks learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features are not better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap in the amount of data required by neural networks and random features to learn from higher-order cumulants.  ( 3 min )
    Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery
    arXiv:2401.05394v3 Announce Type: replace-cross Abstract: Sparse recovery is ubiquitous in machine learning and signal processing. Due to the NP-hard nature of sparse recovery, existing methods are known to suffer either from restrictive (or even unknown) applicability conditions, or high computational cost. Recently, iterative regularization methods have emerged as a promising fast approach because they can achieve sparse recovery in one pass through early stopping, rather than the tedious grid-search used in the traditional methods. However, most of those iterative methods are based on the $\ell_1$ norm which requires restrictive applicability conditions and could fail in many cases. Therefore, achieving sparse recovery with iterative regularization methods under a wider range of conditions has yet to be further explored. To address this issue, we propose a novel iterative regularization algorithm, IRKSN, based on the $k$-support norm regularizer rather than the $\ell_1$ norm. We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with $\ell_1$ norm regularizers. Additionally, we give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery. Finally, we illustrate the applicability of our algorithm on several experiments, including a support recovery experiment with a correlated design matrix.  ( 3 min )
    Sensitivity-Aware Amortized Bayesian Inference
    arXiv:2310.11122v4 Announce Type: replace-cross Abstract: Sensitivity analyses reveal the influence of various modeling choices on the outcomes of statistical analyses. While theoretically appealing, they are overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions.  ( 2 min )
    On the Posterior Distribution in Denoising: Application to Uncertainty Quantification
    arXiv:2309.13598v2 Announce Type: replace-cross Abstract: Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (\ie the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory-efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Code and examples are available on the project webpage in https://hilamanor.github.io/GaussianDenoisingPosterior/ .  ( 2 min )
    Deep Video Codec Control for Vision Models
    arXiv:2308.16215v5 Announce Type: replace-cross Abstract: Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t human quality assessment. We demonstrate empirically that standard-coded videos vastly deteriorate the performance of deep vision models. To overcome the deterioration of vision performance, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional approaches.  ( 2 min )
    Constrained optimization of sensor placement for nuclear digital twins
    arXiv:2306.13637v2 Announce Type: replace-cross Abstract: The deployment of extensive sensor arrays in nuclear reactors is infeasible due to challenging operating conditions and inherent spatial limitations. Strategically placing sensors within defined spatial constraints is essential for the reconstruction of reactor flow fields and the creation of nuclear digital twins. We develop a data-driven technique that incorporates constraints into an optimization framework for sensor placement, with the primary objective of minimizing reconstruction errors under noisy sensor measurements. The proposed greedy algorithm optimizes sensor locations over high-dimensional grids, adhering to user-specified constraints. We demonstrate the efficacy of optimized sensors by exhaustively computing all feasible configurations for a low-dimensional dynamical system. To validate our methodology, we apply the algorithm to the Out-of-Pile Testing and Instrumentation Transient Water Irradiation System (OPTI-TWIST) prototype capsule. This capsule is electrically heated to emulate the neutronics effect of the nuclear fuel. The TWIST prototype that will eventually be inserted in the Transient Reactor Test facility (TREAT) at the Idaho National Laboratory (INL), serves as a practical demonstration. The resulting sensor-based temperature reconstruction within OPTI-TWIST demonstrates minimized error, provides probabilistic bounds for noise-induced uncertainty, and establishes a foundation for communication between the digital twin and the experimental facility.  ( 2 min )
    Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data
    arXiv:2306.11157v2 Announce Type: replace-cross Abstract: The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models, in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists via a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.  ( 3 min )
    Predicting Temporal Aspects of Movement for Predictive Replication in Fog Environments
    arXiv:2306.00575v4 Announce Type: replace-cross Abstract: To fully exploit the benefits of the fog environment, efficient management of data locality is crucial. Blind or reactive data replication falls short in harnessing the potential of fog computing, necessitating more advanced techniques for predicting where and when clients will connect. While spatial prediction has received considerable attention, temporal prediction remains understudied. Our paper addresses this gap by examining the advantages of incorporating temporal prediction into existing spatial prediction models. We also provide a comprehensive analysis of spatio-temporal prediction models, such as Deep Neural Networks and Markov models, in the context of predictive replication. We propose a novel model using Holt-Winter's Exponential Smoothing for temporal prediction, leveraging sequential and periodical user movement patterns. In a fog network simulation with real user trajectories our model achieves a 15% reduction in excess data with a marginal 1% decrease in data availability.  ( 2 min )
    Underwater-Art: Expanding Information Perspectives With Text Templates For Underwater Acoustic Target Recognition
    arXiv:2305.19612v2 Announce Type: replace-cross Abstract: Underwater acoustic target recognition is an intractable task due to the complex acoustic source characteristics and sound propagation patterns. Limited by insufficient data and narrow information perspective, recognition models based on deep learning seem far from satisfactory in practical underwater scenarios. Although underwater acoustic signals are severely influenced by distance, channel depth, or other factors, annotations of relevant information are often non-uniform, incomplete, and hard to use. In our work, we propose to implement Underwater Acoustic Recognition based on Templates made up of rich relevant information (hereinafter called "UART"). We design templates to integrate relevant information from different perspectives into descriptive natural language. UART adopts an audio-spectrogram-text tri-modal contrastive learning framework, which endows UART with the ability to guide the learning of acoustic representations by descriptive natural language. Our experiments reveal that UART has better recognition capability and generalization performance than traditional paradigms. Furthermore, the pre-trained UART model could provide superior prior knowledge for the recognition model in the scenario without any auxiliary annotation.  ( 2 min )
    SciMON: Scientific Inspiration Machines Optimized for Novelty
    arXiv:2305.14259v5 Announce Type: replace-cross Abstract: We explore and enhance the ability of neural language models to generate novel scientific directions grounded in literature. Work on literature-based hypothesis generation has traditionally focused on binary link prediction -- severely limiting the expressivity of hypotheses. This line of work also does not focus on optimizing novelty. We take a dramatic departure with a novel setting in which models use as input background contexts (e.g., problems, experimental settings, goals), and output natural language ideas grounded in literature. We present SciMON, a modeling framework that uses retrieval of "inspirations" from past scientific papers, and explicitly optimizes for novelty by iteratively comparing to prior papers and updating idea suggestions until sufficient novelty is achieved. Comprehensive evaluations reveal that GPT-4 tends to generate ideas with overall low technical depth and novelty, while our methods partially mitigate this issue. Our work represents a first step toward evaluating and developing language models that generate new ideas derived from the scientific literature.  ( 2 min )
    A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks
    arXiv:2305.19306v2 Announce Type: replace-cross Abstract: While contrastive self-supervised learning has become the de-facto learning paradigm for graph neural networks, the pursuit of higher task accuracy requires a larger hidden dimensionality to learn informative and discriminative full-precision representations, raising concerns about computation, memory footprint, and energy consumption burden (largely overlooked) for real-world applications. This work explores a promising direction for graph contrastive learning (GCL) with spiking neural networks (SNNs), which leverage sparse and binary characteristics to learn more biologically plausible and compact representations. We propose SpikeGCL, a novel GCL framework to learn binarized 1-bit representations for graphs, making balanced trade-offs between efficiency and performance. We provide theoretical guarantees to demonstrate that SpikeGCL has comparable expressiveness with its full-precision counterparts. Experimental results demonstrate that, with nearly 32x representation storage compression, SpikeGCL is either comparable to or outperforms many fancy state-of-the-art supervised and self-supervised methods across several graph benchmarks.  ( 2 min )
    DeformerNet: Learning Bimanual Manipulation of 3D Deformable Objects
    arXiv:2305.04449v3 Announce Type: replace-cross Abstract: Applications in fields ranging from home care to warehouse fulfillment to surgical assistance require robots to reliably manipulate the shape of 3D deformable objects. Analytic models of elastic, 3D deformable objects require numerous parameters to describe the potentially infinite degrees of freedom present in determining the object's shape. Previous attempts at performing 3D shape control rely on hand-crafted features to represent the object shape and require training of object-specific control models. We overcome these issues through the use of our novel DeformerNet neural network architecture, which operates on a partial-view point cloud of the manipulated object and a point cloud of the goal shape to learn a low-dimensional representation of the object shape. This shape embedding enables the robot to learn a visual servo controller that computes the desired robot end-effector action to iteratively deform the object toward the target shape. We demonstrate both in simulation and on a physical robot that DeformerNet reliably generalizes to object shapes and material stiffness not seen during training, including ex vivo chicken muscle tissue. Crucially, using DeformerNet, the robot successfully accomplishes three surgical sub-tasks: retraction (moving tissue aside to access a site underneath it), tissue wrapping (a sub-task in procedures like aortic stent placements), and connecting two tubular pieces of tissue (a sub-task in anastomosis).  ( 3 min )
    Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees
    arXiv:2305.03938v2 Announce Type: replace-cross Abstract: In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel two-timescale framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noises are only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods.  ( 2 min )
    Towards black-box parameter estimation
    arXiv:2303.15041v2 Announce Type: replace-cross Abstract: Deep learning algorithms have recently shown to be a successful tool in estimating parameters of statistical models for which simulation is easy, but likelihood computation is challenging. But the success of these approaches depends on simulating parameters that sufficiently reproduce the observed data, and, at present, there is a lack of efficient methods to produce these simulations. We develop new black-box procedures to estimate parameters of statistical models based only on weak parameter structure assumptions. For well-structured likelihoods with frequent occurrences, such as in time series, this is achieved by pre-training a deep neural network on an extensive simulated database that covers a wide range of data sizes. For other types of complex dependencies, an iterative algorithm guides simulations to the correct parameter region in multiple rounds. These approaches can successfully estimate and quantify the uncertainty of parameters from non-Gaussian models with complex spatial and temporal dependencies. The success of our methods is a first step towards a fully flexible automatic black-box estimation framework.  ( 2 min )
    Sliced Wasserstein Estimation with Control Variates
    arXiv:2305.00402v2 Announce Type: replace-cross Abstract: The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite having various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of projected one-dimensional measures, then we utilize the closed-form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can help to reduce the variance considerably when comparing measures over images and point-clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point-clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.  ( 3 min )
    Recent Developments in Machine Learning Methods for Stochastic Control and Games
    arXiv:2303.10257v2 Announce Type: replace-cross Abstract: Stochastic optimal control and games have a wide range of applications, from finance and economics to social sciences, robotics, and energy management. Many real-world applications involve complex models that have driven the development of sophisticated numerical methods. Recently, computational methods based on machine learning have been developed for solving stochastic control problems and games. In this review, we focus on deep learning methods that have unlocked the possibility of solving such problems, even in high dimensions or when the structure is very complex, beyond what traditional numerical methods can achieve. We consider mostly the continuous time and continuous space setting. Many of the new approaches build on recent neural-network-based methods for solving high-dimensional partial differential equations or backward stochastic differential equations, or on model-free reinforcement learning for Markov decision processes that have led to breakthrough results. This paper provides an introduction to these methods and summarizes the state-of-the-art works at the crossroad of machine learning and stochastic control and games.  ( 2 min )
    Can AI-Generated Text be Reliably Detected?
    arXiv:2303.11156v3 Announce Type: replace-cross Abstract: The unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show that these detectors are not reliable in practical scenarios. In particular, we develop a recursive paraphrasing attack to apply on AI text, which can break a whole range of detectors, including the ones using the watermarking schemes as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments include passages around 300 tokens in length, showing the sensitivity of the detectors even in the case of relatively long passages. We also observe that our recursive paraphrasing only degrades text quality slightly, measured via human studies, and metrics such as perplexity scores and accuracy on text benchmarks. Additionally, we show that even LLMs protected by watermarking schemes can be vulnerable against spoofing attacks aimed to mislead detectors to classify human-written text as AI-generated, potentially causing reputational damages to the developers. In particular, we show that an adversary can infer hidden AI text signatures of the LLM outputs without having white-box access to the detection method. Finally, we provide a theoretical connection between the AUROC of the best possible detector and the Total Variation distance between human and AI text distributions that can be used to study the fundamental hardness of the reliable detection problem for advanced language models. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.  ( 3 min )
    A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization
    arXiv:2302.08766v3 Announce Type: replace-cross Abstract: Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $\mathcal{O}((n+m)^{\frac12}\varepsilon^{-1})$ gradient computations to achieve $\varepsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, which is therefore optimal in terms of sample complexity.  ( 2 min )
    On Sampling with Approximate Transport Maps
    arXiv:2302.04763v3 Announce Type: replace-cross Abstract: Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can be reliably handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.  ( 2 min )
    From Denoising Diffusions to Denoising Markov Models
    arXiv:2211.03595v3 Announce Type: replace-cross Abstract: Denoising diffusions are state-of-the-art generative models exhibiting remarkable empirical performance. They work by diffusing the data distribution into a Gaussian distribution and then learning to reverse this noising process to obtain synthetic datapoints. The denoising diffusion relies on approximations of the logarithmic derivatives of the noised data densities using score matching. Such models can also be used to perform approximate posterior simulation when one can only sample from the prior and likelihood. We propose a unifying framework generalising this approach to a wide class of spaces and leading to an original extension of score matching. We illustrate the resulting models on various applications.  ( 2 min )
    Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression
    arXiv:2211.10968v3 Announce Type: replace-cross Abstract: Previous analysis of regularized functional linear regression in a reproducing kernel Hilbert space (RKHS) typically requires the target function to be contained in this kernel space. This paper studies the convergence performance of divide-and-conquer estimators in the scenario that the target function does not necessarily reside in the underlying RKHS. As a decomposition-based scalable approach, the divide-and-conquer estimators of functional linear regression can substantially reduce the algorithmic complexities in time and memory. We develop an integral operator approach to establish sharp finite sample upper bounds for prediction with divide-and-conquer estimators under various regularity conditions of explanatory variables and target function. We also prove the asymptotic optimality of the derived rates by building the mini-max lower bounds. Finally, we consider the convergence of noiseless estimators and show that the rates can be arbitrarily fast under mild conditions.  ( 2 min )
    Variance estimation in graphs with the fused lasso
    arXiv:2207.12638v3 Announce Type: replace-cross Abstract: We study the problem of variance estimation in general graph-structured problems. First, we develop a linear time estimator for the homoscedastic case that can consistently estimate the variance in general graphs. We show that our estimator attains minimax rates for the chain and 2D grid graphs when the mean signal has total variation with canonical scaling. Furthermore, we provide general upper bounds on the mean squared error performance of the fused lasso estimator in general graphs under a moment condition and a bound on the tail behavior of the errors. These upper bounds allow us to generalize for broader classes of distributions, such as sub-exponential, many existing results on the fused lasso that are only known to hold with the assumption that errors are sub-Gaussian random variables. Exploiting our upper bounds, we then study a simple total variation regularization estimator for estimating the signal of variances in the heteroscedastic case. We also provide lower bounds showing that our heteroscedastic variance estimator attains minimax rates for estimating signals of bounded variation in grid graphs, and $K$-nearest neighbor graphs, and the estimator is consistent for estimating the variances in any connected graph.  ( 2 min )
    On the Evaluation of User Privacy in Deep Neural Networks using Timing Side Channel
    arXiv:2208.01113v3 Announce Type: replace-cross Abstract: Recent Deep Learning (DL) advancements in solving complex real-world tasks have led to its widespread adoption in practical applications. However, this opportunity comes with significant underlying risks, as many of these models rely on privacy-sensitive data for training in a variety of applications, making them an overly-exposed threat surface for privacy violations. Furthermore, the widespread use of cloud-based Machine-Learning-as-a-Service (MLaaS) for its robust infrastructure support has broadened the threat surface to include a variety of remote side-channel attacks. In this paper, we first identify and report a novel data-dependent timing side-channel leakage (termed Class Leakage) in DL implementations originating from non-constant time branching operation in a widely used DL framework PyTorch. We further demonstrate a practical inference-time attack where an adversary with user privilege and hard-label black-box access to an MLaaS can exploit Class Leakage to compromise the privacy of MLaaS users. DL models are vulnerable to Membership Inference Attack (MIA), where an adversary's objective is to deduce whether any particular data has been used while training the model. In this paper, as a separate case study, we demonstrate that a DL model secured with differential privacy (a popular countermeasure against MIA) is still vulnerable to MIA against an adversary exploiting Class Leakage. We develop an easy-to-implement countermeasure by making a constant-time branching operation that alleviates the Class Leakage and also aids in mitigating MIA. We have chosen two standard benchmarking image classification datasets, CIFAR-10 and CIFAR-100 to train five state-of-the-art pre-trained DL models, over two different computing environments having Intel Xeon and Intel i7 processors to validate our approach.  ( 3 min )
    A Theoretical Analysis of the Learning Dynamics under Class Imbalance
    arXiv:2207.00391v4 Announce Type: replace-cross Abstract: Data imbalance is a common problem in machine learning that can have a critical effect on the performance of a model. Various solutions exist but their impact on the convergence of the learning dynamics is not understood. Here, we elucidate the significant negative impact of data imbalance on learning, showing that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. This slowdown is related to the imbalance ratio and can be traced back to a competition between the optimization of different classes. Our main contribution is the analysis of the convergence of full-batch (GD) and stochastic gradient descent (SGD), and of variants that renormalize the contribution of each per-class gradient. We find that GD is not guaranteed to decrease the loss for each class but that this problem can be addressed by performing a per-class normalization of the gradient. With SGD, class imbalance has an additional effect on the direction of the gradients: the minority class suffers from a higher directional noise, which reduces the effectiveness of the per-class gradient normalization. Our findings not only allow us to understand the potential and limitations of strategies involving the per-class gradients, but also the reason for the effectiveness of previously used solutions for class imbalance such as oversampling.  ( 3 min )
    Learning Progress Driven Multi-Agent Curriculum
    arXiv:2205.10016v2 Announce Type: replace-cross Abstract: Curriculum reinforcement learning (CRL) aims to speed up learning by gradually increasing the difficulty of a task, usually quantified by the achievable expected return. Inspired by the success of CRL in single-agent settings, a few works have attempted to apply CRL to multi-agent reinforcement learning (MARL) using the number of agents to control task difficulty. However, existing works typically use manually defined curricula such as a linear scheme. In this paper, we first apply state-of-the-art single-agent self-paced CRL to sparse reward MARL. Although with satisfying performance, we identify two potential flaws of the curriculum generated by existing reward-based CRL methods: (1) tasks with high returns may not provide informative learning signals and (2) the exacerbated credit assignment difficulty in tasks where more agents yield higher returns. Thereby, we further propose self-paced MARL (SPMARL) to prioritize tasks based on \textit{learning progress} instead of the episode return. Our method not only outperforms baselines in three challenging sparse-reward benchmarks but also converges faster than self-paced CRL.  ( 2 min )
    Algebraic Machine Learning with an Application to Chemistry
    arXiv:2205.05795v3 Announce Type: replace-cross Abstract: As datasets used in scientific applications become more complex, studying the geometry and topology of data has become an increasingly prevalent part of the data analysis process. This can be seen for example with the growing interest in topological tools such as persistent homology. However, on the one hand, topological tools are inherently limited to providing only coarse information about the underlying space of the data. On the other hand, more geometric approaches rely predominately on the manifold hypothesis, which asserts that the underlying space is a smooth manifold. This assumption fails for many physical models where the underlying space contains singularities. In this paper we develop a machine learning pipeline that captures fine-grain geometric information without having to rely on any smoothness assumptions. Our approach involves working within the scope of algebraic geometry and algebraic varieties instead of differential geometry and smooth manifolds. In the setting of the variety hypothesis, the learning problem becomes to find the underlying variety using sample data. We cast this learning problem into a Maximum A Posteriori optimization problem which we solve in terms of an eigenvalue computation. Having found the underlying variety, we explore the use of Gr\"obner bases and numerical methods to reveal information about its geometry. In particular, we propose a heuristic for numerically detecting points lying near the singular locus of the underlying variety.  ( 3 min )
    Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings
    arXiv:2104.08928v3 Announce Type: replace-cross Abstract: Unstructured text provides decision-makers with a rich data source in many domains, ranging from product reviews in retail to nursing notes in healthcare. To leverage this information, words are typically translated into word embeddings -- vectors that encode the semantic relationships between words -- through unsupervised learning algorithms such as matrix factorization. However, learning word embeddings from new domains with limited training data can be challenging, because the meaning/usage may be different in the new domain, e.g., the word ``positive'' typically has positive sentiment, but often has negative sentiment in medical notes since it may imply that a patient tested positive for a disease. In practice, we expect that only a small number of domain-specific words may have new meanings. We propose an intuitive two-stage estimator that exploits this structure via a group-sparse penalty to efficiently transfer learn domain-specific word embeddings by combining large-scale text corpora (such as Wikipedia) with limited domain-specific text data. We bound the generalization error of our transfer learning estimator, proving that it can achieve high accuracy with substantially less domain-specific data when only a small number of embeddings are altered between domains. Furthermore, we prove that all local minima identified by our nonconvex objective function are statistically indistinguishable from the global minimum under standard regularization conditions, implying that our estimator can be computed efficiently. Our results provide the first bounds on group-sparse matrix factorization, which may be of independent interest. We empirically evaluate our approach compared to state-of-the-art fine-tuning heuristics from natural language processing.  ( 3 min )
    HCR-Net: A deep learning based script independent handwritten character recognition network
    arXiv:2108.06663v4 Announce Type: replace-cross Abstract: Handwritten character recognition (HCR) remains a challenging pattern recognition problem despite decades of research, and lacks research on script independent recognition techniques. {\color{black}This is mainly because of similar character structures, different handwriting styles, diverse scripts, handcrafted feature extraction techniques, unavailability of data and code, and the development of script-specific deep learning techniques. To address these limitations, we have proposed a script independent deep learning network for HCR research, called HCR-Net, that sets a new research direction for the field. HCR-Net is based on a novel transfer learning approach for HCR, which \textit{partly utilizes} feature extraction layers of a pre-trained network.} Due to transfer learning and image augmentation, HCR-Net provides faster and computationally efficient training, better performance and generalizations, and can work with small datasets. HCR-Net is extensively evaluated on 40 publicly available datasets of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi, Nepali and Arabic languages, and established 26 new benchmark results while performed close to the best results in the rest cases. HCR-Net showed performance improvements up to 11\% against the existing results and achieved a fast convergence rate showing up to 99\% of final performance in the very first epoch. HCR-Net significantly outperformed the state-of-the-art transfer learning techniques and also reduced the number of trainable parameters by 34\% as compared with the corresponding pre-trained network. To facilitate reproducibility and further advancements of HCR research, the complete code is publicly released at \url{https://github.com/jmdvinodjmd/HCR-Net}.  ( 3 min )
    Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic
    arXiv:1910.08597v5 Announce Type: replace-cross Abstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, the iterates are likely to bounce at around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy-to-implement and essentially does not incur additional computational cost than standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance compared favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.  ( 2 min )
    Instruction Fine-Tuning: Does Prompt Loss Matter?
    arXiv:2401.13586v2 Announce Type: replace Abstract: We present a study analyzing the effects of prompt loss weighting (PLW) on supervised instruction fine-tuning. We recreated Stanford's Alpaca experiment with both LLaMA 1 and LLaMA 2 and multiple instruction datasets. We found that performance of models fine-tuned on our short-completion dataset had a statistically significant negative quadratic relationship with PLW, but performance of models fine-tuned on medium- and long-completion data did not show any relationship with PLW. I.e., prompt loss can be safely ignored for many datasets. For short-completion data, small values (0.01-0.1) of PLW were optimal for multiple-choice and short-generation tasks while large values (~ 1.0) of PLW were optimal for long-generation tasks. We concluded that low non-zero PLW encourages models to not diverge from pre-trained model weights during training and high PLW reduces overfitting. Finally, we present a rough guide for selecting PLW values based on the completion-prompt length ratio of fine-tuning data.  ( 2 min )
    LightDiC: A Simple yet Effective Approach for Large-scale Digraph Representation Learning
    arXiv:2401.11772v2 Announce Type: replace Abstract: Most existing graph neural networks (GNNs) are limited to undirected graphs, whose restricted scope of the captured relational information hinders their expressive capabilities and deployments in real-world scenarios. Compared with undirected graphs, directed graphs (digraphs) fit the demand for modeling more complex topological systems by capturing more intricate relationships between nodes, such as formulating transportation and financial networks. While some directed GNNs have been introduced, their inspiration mainly comes from deep learning architectures, which lead to redundant complexity and computation, making them inapplicable to large-scale databases. To address these issues, we propose LightDiC, a scalable variant of the digraph convolution based on the magnetic Laplacian. Since topology-related computations are conducted solely during offline pre-processing, LightDiC achieves exceptional scalability, enabling downstream predictions to be trained separately without incurring recursive computational costs. Theoretical analysis shows that LightDiC utilizes directed information to achieve message passing based on the complex field, which corresponds to the proximal gradient descent process of the Dirichlet energy optimization function from the perspective of digraph signal denoising, ensuring its expressiveness. Experimental results demonstrate that LightDiC performs comparably well or even outperforms other SOTA methods in various downstream tasks, with fewer learnable parameters and higher training efficiency. Notably, LightDiC is the first DiGNN to provide satisfactory results in the most representative large-scale database (ogbn-papers100M).  ( 3 min )
    Technical Report: On the Convergence of Gossip Learning in the Presence of Node Inaccessibility
    arXiv:2401.09498v2 Announce Type: replace Abstract: Gossip learning (GL), as a decentralized alternative to federated learning (FL), is more suitable for resource-constrained wireless networks, such as Flying Ad-Hoc Networks (FANETs) that are formed by unmanned aerial vehicles (UAVs). GL can significantly enhance the efficiency and extend the battery life of UAV networks. Despite the advantages, the performance of GL is strongly affected by data distribution, communication speed, and network connectivity. However, how these factors influence the GL convergence is still unclear. Existing work studied the convergence of GL based on a virtual quantity for the sake of convenience, which failed to reflect the real state of the network when some nodes are inaccessible. In this paper, we formulate and investigate the impact of inaccessible nodes to GL under a dynamic network topology. We first decompose the weight divergence by whether the node is accessible or not. Then, we investigate the GL convergence under the dynamic of node accessibility and theoretically provide how the number of inaccessible nodes, data non-i.i.d.-ness, and duration of inaccessibility affect the convergence. Extensive experiments are carried out in practical settings to comprehensively verify the correctness of our theoretical findings.  ( 2 min )
    An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation
    arXiv:2401.06356v2 Announce Type: replace Abstract: We present a large-scale empirical study of how choices of configuration parameters affect performance in knowledge distillation (KD). An example of such a KD parameter is the measure of distance between the predictions of the teacher and the student, common choices for which include the mean squared error (MSE) and the KL-divergence. Although scattered efforts have been made to understand the differences between such options, the KD literature still lacks a systematic study on their general effect on student performance. We take an empirical approach to this question in this paper, seeking to find out the extent to which such choices influence student performance across 13 datasets from 4 NLP tasks and 3 student sizes. We quantify the cost of making sub-optimal choices and identify a single configuration that performs well across the board.  ( 2 min )
    Human as AI Mentor: Enhanced Human-in-the-loop Reinforcement Learning for Safe and Efficient Autonomous Driving
    arXiv:2401.03160v3 Announce Type: replace Abstract: Despite significant progress in autonomous vehicles (AVs), the development of driving policies that ensure both the safety of AVs and traffic flow efficiency has not yet been fully explored. In this paper, we propose an enhanced human-in-the-loop reinforcement learning method, termed the Human as AI mentor-based deep reinforcement learning (HAIM-DRL) framework, which facilitates safe and efficient autonomous driving in mixed traffic platoon. Drawing inspiration from the human learning process, we first introduce an innovative learning paradigm that effectively injects human intelligence into AI, termed Human as AI mentor (HAIM). In this paradigm, the human expert serves as a mentor to the AI agent. While allowing the agent to sufficiently explore uncertain environments, the human expert can take control in dangerous situations and demonstrate correct actions to avoid potential accidents. On the other hand, the agent could be guided to minimize traffic flow disturbance, thereby optimizing traffic flow efficiency. In detail, HAIM-DRL leverages data collected from free exploration and partial human demonstrations as its two training sources. Remarkably, we circumvent the intricate process of manually designing reward functions; instead, we directly derive proxy state-action values from partial human demonstrations to guide the agents' policy learning. Additionally, we employ a minimal intervention technique to reduce the human mentor's cognitive load. Comparative results show that HAIM-DRL outperforms traditional methods in driving safety, sampling efficiency, mitigation of traffic flow disturbance, and generalizability to unseen traffic scenarios. The code and demo videos for this paper can be accessed at: https://zilin-huang.github.io/HAIM-DRL-website/  ( 3 min )
    Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors
    arXiv:2401.02739v2 Announce Type: replace Abstract: We propose denoising diffusion variational inference (DDVI), an approximate inference algorithm for latent variable models which relies on diffusion models as flexible variational posteriors. Specifically, our method introduces an expressive class of approximate posteriors with auxiliary latent variables that perform diffusion in latent space by reversing a user-specified noising process. We fit these models by optimizing a lower bound on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. It increases the expressivity of flow-based methods via non-invertible deep recurrent architectures and avoids the instability of adversarial methods. We use DDVI on a motivating task in biology -- inferring latent ancestry from human genomes -- and we find that it outperforms strong baselines on the Thousand Genomes dataset.  ( 2 min )
    Best-of-Both-Worlds Algorithms for Linear Contextual Bandits
    arXiv:2312.15433v2 Announce Type: replace Abstract: We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{O}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{O}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.  ( 2 min )
    Hidden Minima in Two-Layer ReLU Networks
    arXiv:2312.16819v2 Announce Type: replace Abstract: The optimization problem associated to fitting two-layer ReLU networks having $d$~inputs, $k$~neurons, and labels generated by a target network, is considered. Two types of infinite families of spurious minima, giving one minimum per $d$, were recently found. The loss at minima belonging to the first type converges to zero as $d$ increases. In the second type, the loss remains bounded away from zero. That being so, how may one avoid minima belonging to the latter type? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of hidden minima. By existing analyses, the Hessian spectrum of both types agree modulo $O(d^{-1/2})$-terms -- not promising. Thus, rather, our investigation proceeds by studying curves along which the loss is minimized or maximized, generally referred to as tangency arcs. We prove that apparently far removed group representation-theoretic considerations concerning the arrangement of subspaces invariant to the action of subgroups of $S_d$, the symmetry group over $d$ symbols, relative to ones fixed by the action yield a precise description of all finitely many admissible types of tangency arcs. The general results used for the loss function reveal that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent in previous work, indicating in particular the subtlety of the analysis. The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically sufficiently tame to enable a numerical construction of tangency arcs and so compare how minima, both types, are positioned relative to adjacent critical points.  ( 3 min )
    Doubly Perturbed Task Free Continual Learning
    arXiv:2312.13027v2 Announce Type: replace Abstract: Task Free online continual learning (TF-CL) is a challenging problem where the model incrementally learns tasks without explicit task information. Although training with entire data from the past, present as well as future is considered as the gold standard, naive approaches in TF-CL with the current samples may be conflicted with learning with samples in the future, leading to catastrophic forgetting and poor plasticity. Thus, a proactive consideration of an unseen future sample in TF-CL becomes imperative. Motivated by this intuition, we propose a novel TF-CL framework considering future samples and show that injecting adversarial perturbations on both input data and decision-making is effective. Then, we propose a novel method named Doubly Perturbed Continual Learning (DPCL) to efficiently implement these input and decision-making perturbations. Specifically, for input perturbation, we propose an approximate perturbation method that injects noise into the input data as well as the feature vector and then interpolates the two perturbed samples. For decision-making process perturbation, we devise multiple stochastic classifiers. We also investigate a memory management scheme and learning rate scheduling reflecting our proposed double perturbations. We demonstrate that our proposed method outperforms the state-of-the-art baseline methods by large margins on various TF-CL benchmarks.  ( 2 min )
    Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation
    arXiv:2312.15112v3 Announce Type: replace Abstract: Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample. Our method naturally leads to the intra-sample trilateral geometric relations among the student prediction ($S$), teacher prediction ($T$), and ground truth ($G$). To counterbalance the impact of outliers, we further extend to the inter-sample relations, incorporating the teacher's global average prediction $\bar{T}$ for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.  ( 3 min )
    An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention
    arXiv:2312.10325v2 Announce Type: replace Abstract: Sequential recommendation (SR) models based on Transformers have achieved remarkable successes. The self-attention mechanism of Transformers for computer vision and natural language processing suffers from the oversmoothing problem, i.e., hidden representations becoming similar to tokens. In the SR domain, we, for the first time, show that the same problem occurs. We present pioneering investigations that reveal the low-pass filtering nature of self-attention in the SR, which causes oversmoothing. To this end, we propose a novel method called $\textbf{B}$eyond $\textbf{S}$elf-$\textbf{A}$ttention for Sequential $\textbf{Rec}$ommendation (BSARec), which leverages the Fourier transform to i) inject an inductive bias by considering fine-grained sequential patterns and ii) integrate low and high-frequency information to mitigate oversmoothing. Our discovery shows significant advancements in the SR domain and is expected to bridge the gap for existing Transformer-based SR models. We test our proposed approach through extensive experiments on 6 benchmark datasets. The experimental results demonstrate that our model outperforms 7 baseline methods in terms of recommendation performance. Our code is available at https://github.com/yehjin-shin/BSARec.  ( 2 min )
    Better Neural PDE Solvers Through Data-Free Mesh Movers
    arXiv:2312.05583v2 Announce Type: replace Abstract: Recently, neural networks have been extensively employed to solve partial differential equations (PDEs) in physical system modeling. While major studies focus on learning system evolution on predefined static mesh discretizations, some methods utilize reinforcement learning or supervised learning techniques to create adaptive and dynamic meshes, due to the dynamic nature of these systems. However, these approaches face two primary challenges: (1) the need for expensive optimal mesh data, and (2) the change of the solution space's degree of freedom and topology during mesh refinement. To address these challenges, this paper proposes a neural PDE solver with a neural mesh adapter. To begin with, we introduce a novel data-free neural mesh adaptor, called Data-free Mesh Mover (DMM), with two main innovations. Firstly, it is an operator that maps the solution to adaptive meshes and is trained using the Monge-Amp\`ere equation without optimal mesh data. Secondly, it dynamically changes the mesh by moving existing nodes rather than adding or deleting nodes and edges. Theoretical analysis shows that meshes generated by DMM have the lowest interpolation error bound. Based on DMM, to efficiently and accurately model dynamic systems, we develop a moving mesh based neural PDE solver (MM-PDE) that embeds the moving mesh with a two-branch architecture and a learnable interpolation framework to preserve information within the data. Empirical experiments demonstrate that our method generates suitable meshes and considerably enhances accuracy when modeling widely considered PDE systems. The code can be found at: https://github.com/Peiyannn/MM-PDE.git.  ( 3 min )
    A framework for conditional diffusion modelling with applications in motif scaffolding for protein design
    arXiv:2312.09236v2 Announce Type: replace Abstract: Many protein design applications, such as binder or enzyme design, require scaffolding a structural motif with high precision. Generative modelling paradigms based on denoising diffusion processes emerged as a leading candidate to address this motif scaffolding problem and have shown early experimental success in some cases. In the diffusion paradigm, motif scaffolding is treated as a conditional generation task, and several conditional generation protocols were proposed or imported from the Computer Vision literature. However, most of these protocols are motivated heuristically, e.g. via analogies to Langevin dynamics, and lack a unifying framework, obscuring connections between the different approaches. In this work, we unify conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform. This new perspective allows us to draw connections between existing methods and propose a new variation on existing conditional training protocols. We illustrate the effectiveness of this new protocol in both, image outpainting and motif scaffolding and find that it outperforms standard methods.  ( 2 min )
    How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking
    arXiv:2312.01619v3 Announce Type: replace Abstract: This paper presents LEMR (Label-Efficient Model Ranking) and introduces the MoraBench Benchmark. LEMR is a novel framework that minimizes the need for costly annotations in model selection by strategically annotating instances from an unlabeled validation set. To evaluate LEMR, we leverage the MoraBench Benchmark, a comprehensive collection of model outputs across diverse scenarios. Our extensive evaluation across 23 different NLP tasks in semi-supervised learning, weak supervision, and prompt selection tasks demonstrates LEMR's effectiveness in significantly reducing labeling costs. Key findings highlight the impact of suitable ensemble methods, uncertainty sampling strategies, and model committee selection in enhancing model ranking accuracy. LEMR, supported by the insights from MoraBench, provides a cost-effective and accurate solution for model selection, especially valuable in resource-constrained environments.  ( 2 min )
    Do AI models produce better weather forecasts than physics-based models? A quantitative evaluation case study of Storm Ciar\'an
    arXiv:2312.02658v2 Announce Type: replace Abstract: There has been huge recent interest in the potential of making operational weather forecasts using machine learning techniques. As they become a part of the weather forecasting toolbox, there is a pressing need to understand how well current machine learning models can simulate high-impact weather events. We compare forecasts of Storm Ciar\'an, a European windstorm that caused sixteen deaths and extensive damage in Northern Europe, made by machine learning and numerical weather prediction models. The four machine learning models considered (FourCastNet, Pangu-Weather, GraphCast and FourCastNet-v2) produce forecasts that accurately capture the synoptic-scale structure of the cyclone including the position of the cloud head, shape of the warm sector and location of warm conveyor belt jet, and the large-scale dynamical drivers important for the rapid storm development such as the position of the storm relative to the upper-level jet exit. However, their ability to resolve the more detailed structures important for issuing weather warnings is more mixed. All of the machine learning models underestimate the peak amplitude of winds associated with the storm, only some machine learning models resolve the warm core seclusion and none of the machine learning models capture the sharp bent-back warm frontal gradient. Our study shows there is a great deal about the performance and properties of machine learning weather forecasts that can be derived from case studies of high-impact weather events such as Storm Ciar\'an.  ( 3 min )
    Directions of Curvature as an Explanation for Loss of Plasticity
    arXiv:2312.00246v2 Announce Type: replace Abstract: Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for loss of plasticity: Neural networks lose directions of curvature during training and that loss of plasticity can be attributed to this reduction in curvature. To support such a claim, we provide a systematic investigation of loss of plasticity across continual learning tasks using MNIST, CIFAR-10 and ImageNet. Our findings illustrate that loss of curvature directions coincides with loss of plasticity, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings we considered.  ( 2 min )
    Leveraging Function Space Aggregation for Federated Learning at Scale
    arXiv:2311.10291v2 Announce Type: replace Abstract: The federated learning paradigm has motivated the development of methods for aggregating multiple client updates into a global server model, without sharing client data. Many federated learning algorithms, including the canonical Federated Averaging (FedAvg), take a direct (possibly weighted) average of the client parameter updates, motivated by results in distributed optimization. In this work, we adopt a function space perspective and propose a new algorithm, FedFish, that aggregates local approximations to the functions learned by clients, using an estimate based on their Fisher information. We evaluate FedFish on realistic, large-scale cross-device benchmarks. While the performance of FedAvg can suffer as client models drift further apart, we demonstrate that FedFish is more robust to longer local training. Our evaluation across several settings in image and language benchmarks shows that FedFish outperforms FedAvg as local training epochs increase. Further, FedFish results in global networks that are more amenable to efficient personalization via local fine-tuning on the same or shifted data distributions. For instance, federated pretraining on the C4 dataset, followed by few-shot personalization on Stack Overflow, results in a 7% improvement in next-token prediction by FedFish over FedAvg.  ( 2 min )
    On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates
    arXiv:2311.13584v2 Announce Type: replace Abstract: We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly log-concave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.  ( 3 min )
    Self-Supervised Curriculum Generation for Autonomous Reinforcement Learning without Task-Specific Knowledge
    arXiv:2311.09195v2 Announce Type: replace Abstract: A significant bottleneck in applying current reinforcement learning algorithms to real-world scenarios is the need to reset the environment between every episode. This reset process demands substantial human intervention, making it difficult for the agent to learn continuously and autonomously. Several recent works have introduced autonomous reinforcement learning (ARL) algorithms that generate curricula for jointly training reset and forward policies. While their curricula can reduce the number of required manual resets by taking into account the agent's learning progress, they rely on task-specific knowledge, such as predefined initial states or reset reward functions. In this paper, we propose a novel ARL algorithm that can generate a curriculum adaptive to the agent's learning progress without task-specific knowledge. Our curriculum empowers the agent to autonomously reset to diverse and informative initial states. To achieve this, we introduce a success discriminator that estimates the success probability from each initial state when the agent follows the forward policy. The success discriminator is trained with relabeled transitions in a self-supervised manner. Our experimental results demonstrate that our ARL algorithm can generate an adaptive curriculum and enable the agent to efficiently bootstrap to solve sparse-reward maze navigation and manipulation tasks, outperforming baselines with significantly fewer manual resets.  ( 2 min )
    Feature emergence via margin maximization: case studies in algebraic tasks
    arXiv:2311.07568v2 Announce Type: replace Abstract: Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question -- why do networks arrive at particular computational strategies? Our inquiry focuses on the algebraic learning tasks of modular addition, sparse parities, and finite group operations. Our primary theoretical findings analytically characterize the features learned by stylized neural networks for these algebraic tasks. Notably, our main technique demonstrates how the principle of margin maximization alone can be used to fully specify the features learned by the network. Specifically, we prove that the trained networks utilize Fourier features to perform modular addition and employ features corresponding to irreducible group-theoretic representations to perform compositions in general groups, aligning closely with the empirical observations of Nanda et al. and Chughtai et al. More generally, we hope our techniques can help to foster a deeper understanding of why neural networks adopt specific computational strategies.  ( 2 min )
    Re-evaluating Retrosynthesis Algorithms with Syntheseus
    arXiv:2310.19796v2 Announce Type: replace Abstract: The planning of how to synthesize molecules, also known as retrosynthesis, has been a growing focus of the machine learning and chemistry communities in recent years. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques. To remedy this, we present a benchmarking library called syntheseus which promotes best practice by default, enabling consistent meaningful evaluation of single-step and multi-step retrosynthesis algorithms. We use syntheseus to re-evaluate a number of previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes when evaluated carefully. We end with guidance for future works in this area.  ( 2 min )
    SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
    arXiv:2310.12508v3 Announce Type: replace Abstract: With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today's AI models. However, existing MU methods focusing on data and/or weight perspectives often suffer limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of 'weight saliency' for MU, drawing parallels with input saliency in model explanation. This innovation directs MU's attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method that we call saliency unlearning (SalUn) narrows the performance gap with 'exact' unlearning (model retraining from scratch after removing the forgetting data points). To the best of our knowledge, SalUn is the first principled MU approach that can effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation tasks. As highlighted below, For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not. Codes are available at https://github.com/OPTML-Group/Unlearn-Saliency. (WARNING: This paper contains model outputs that may be offensive in nature.)  ( 3 min )
    Corruption-Robust Offline Reinforcement Learning with General Function Approximation
    arXiv:2310.14550v3 Announce Type: replace Abstract: We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta\geq0$ quantifies the cumulative corruption amount over $n$ episodes and $H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs). Drawing inspiration from the uncertainty-weighting technique from the robust online RL setting \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight iteration procedure to efficiently compute on batched samples and propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal{O}(\zeta (C(\widehat{\mathcal{F}},\mu)n)^{-1})$ due to the corruption. Here $\widehat{\mathcal{F}}$ is the confidence set, and the dataset $\mathcal{Z}_n^H$, and $C(\widehat{\mathcal{F}},\mu)$ is a coefficient that depends on $\widehat{\mathcal{F}}$ and the underlying data distribution $\mu$. When specialized to linear MDPs, the corruption-dependent error term reduces to $\mathcal{O}(\zeta d n^{-1})$ with $d$ being the dimension of the feature map, which matches the existing lower bound for corrupted linear MDPs. This suggests that our analysis is tight in terms of the corruption-dependent term.  ( 3 min )
    On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning
    arXiv:2310.11840v2 Announce Type: replace Abstract: Most algorithms in reinforcement learning (RL) require that the objective is formalised with a Markovian reward function. However, it is well-known that certain tasks cannot be expressed by means of an objective in the Markov rewards formalism, motivating the study of alternative objective-specification formalisms in RL such as Linear Temporal Logic and Multi-Objective Reinforcement Learning. To date, there has not yet been any thorough analysis of how these formalisms relate to each other in terms of their expressivity. We fill this gap in the existing literature by providing a comprehensive comparison of 17 salient objective-specification formalisms. We place these formalisms in a preorder based on their expressive power, and present this preorder as a Hasse diagram. We find a variety of limitations for the different formalisms, and argue that no formalism is both dominantly expressive and straightforward to optimise with current techniques. For example, we prove that each of Regularised RL, (Outer) Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and Limit Average Rewards can express a task that the others cannot. The significance of our results is twofold. First, we identify important expressivity limitations to consider when specifying objectives for policy optimization. Second, our results highlight the need for future research which adapts reward learning to work with a greater variety of formalisms, since many existing reward learning methods assume that the desired objective takes a Markovian form. Our work contributes towards a more cohesive understanding of the costs and benefits of different RL objective-specification formalisms.  ( 3 min )
    Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning
    arXiv:2310.11897v2 Announce Type: replace Abstract: Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the parameter initialization, APG ends up entering a locally nearly-concave regime, where APG can significantly benefit from the momentum, within finite iterations. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG.  ( 2 min )
    Federated Heterogeneous Graph Neural Network for Privacy-preserving Recommendation
    arXiv:2310.11730v3 Announce Type: replace Abstract: The heterogeneous information network (HIN), which contains rich semantics depicted by meta-paths, has emerged as a potent tool for mitigating data sparsity in recommender systems. Existing HIN-based recommender systems operate under the assumption of centralized storage and model training. However, real-world data is often distributed due to privacy concerns, leading to the semantic broken issue within HINs and consequent failures in centralized HIN-based recommendations. In this paper, we suggest the HIN is partitioned into private HINs stored on the client side and shared HINs on the server. Following this setting, we propose a federated heterogeneous graph neural network (FedHGNN) based framework, which facilitates collaborative training of a recommendation model using distributed HINs while protecting user privacy. Specifically, we first formalize the privacy definition for HIN-based federated recommendation (FedRec) in the light of differential privacy, with the goal of protecting user-item interactions within private HIN as well as users' high-order patterns from shared HINs. To recover the broken meta-path based semantics and ensure proposed privacy measures, we elaborately design a semantic-preserving user interactions publishing method, which locally perturbs user's high-order patterns and related user-item interactions for publishing. Subsequently, we introduce an HGNN model for recommendation, which conducts node- and semantic-level aggregations to capture recovered semantics. Extensive experiments on four datasets demonstrate that our model outperforms existing methods by a substantial margin (up to 34% in HR@10 and 42% in NDCG@10) under a reasonable privacy budget.  ( 3 min )
    Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task
    arXiv:2310.09336v4 Announce Type: replace Abstract: Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they exhibit the capability to compose a novel set of concepts to generate outputs not seen in the training data set. Prior work demonstrates that recent diffusion models do exhibit intriguing compositional generalization abilities, but also fail unpredictably. Motivated by this, we perform a controlled study for understanding compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model's ability to generate samples out-of-distribution. Our results show: (i) the order in which the ability to generate samples from a concept and compose them emerges is governed by the structure of the underlying data-generating process; (ii) performance on compositional tasks exhibits a sudden "emergence" due to multiplicative reliance on the performance of constituent tasks, partially explaining emergent phenomena seen in generative models; and (iii) composing concepts with lower frequency in the training data to generate out-of-distribution samples requires considerably more optimization steps compared to generating in-distribution samples. Overall, our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective.  ( 3 min )
    Understanding the Effects of RLHF on LLM Generalisation and Diversity
    arXiv:2310.06452v3 Announce Type: replace Abstract: Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.  ( 3 min )
    Transformers for Green Semantic Communication: Less Energy, More Semantics
    arXiv:2310.07592v2 Announce Type: replace Abstract: Semantic communication aims to transmit meaningful and effective information, rather than focusing on individual symbols or bits. This results in benefits like reduced latency, bandwidth usage, and higher throughput compared with traditional communication. However, semantic communication poses significant challenges due to the need for universal metrics to benchmark the joint effects of semantic information loss and practical energy consumption. This research presents a novel multi-objective loss function named "Energy-Optimized Semantic Loss" (EOSL), addressing the challenge of balancing semantic information loss and energy consumption. Through comprehensive experiments on transformer models, including CPU and GPU energy usage, it is demonstrated that EOSL-based encoder model selection can save up to 90% of energy while achieving a 44% improvement in semantic similarity performance during inference in this experiment. This work paves the way for energy-efficient neural network selection and the development of greener semantic communication architectures.  ( 2 min )
    Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
    arXiv:2310.05175v2 Announce Type: replace Abstract: Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at https://github.com/luuyin/OWL.  ( 3 min )
    Why should autoencoders work?
    arXiv:2310.02250v3 Announce Type: replace Abstract: Deep neural network autoencoders are routinely used computationally for model reduction. They allow recognizing the intrinsic dimension of data that lie in a $k$-dimensional subset $K$ of an input Euclidean space $\mathbb{R}^n$. The underlying idea is to obtain both an encoding layer that maps $\mathbb{R}^n$ into $\mathbb{R}^k$ (called the bottleneck layer or the space of latent variables) and a decoding layer that maps $\mathbb{R}^k$ back into $\mathbb{R}^n$, in such a way that the input data from the set $K$ is recovered when composing the two maps. This is achieved by adjusting parameters (weights) in the network to minimize the discrepancy between the input and the reconstructed output. Since neural networks (with continuous activation functions) compute continuous maps, the existence of a network that achieves perfect reconstruction would imply that $K$ is homeomorphic to a $k$-dimensional subset of $\mathbb{R}^k$, so clearly there are topological obstructions to finding such a network. On the other hand, in practice the technique is found to "work" well, which leads one to ask if there is a way to explain this effectiveness. We show that, up to small errors, indeed the method is guaranteed to work. This is done by appealing to certain facts from differential topology. A computational example is also included to illustrate the ideas.  ( 2 min )
    Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
    arXiv:2310.02277v2 Announce Type: replace Abstract: We present Junk DNA Hypothesis by adopting a novel task-centric angle for the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained model weights encode vital knowledge essential for tackling difficult downstream tasks - manifested as the monotonic relationship between the performance drop of downstream tasks across the difficulty spectrum, as we prune more pre-trained weights by magnitude. Moreover, we reveal that these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression, namely quantization, fails to exhibit similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study formally, we introduce several quantifiable metrics to gauge the downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Codes are available at: https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.  ( 3 min )
    Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models
    arXiv:2310.00566v3 Announce Type: replace Abstract: In the financial industry, credit scoring is a fundamental element, shaping access to credit and determining the terms of loans for individuals and businesses alike. Traditional credit scoring methods, however, often grapple with challenges such as narrow knowledge scope and isolated evaluation of credit tasks. Our work posits that Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. To systematically explore LLMs for credit scoring, we propose the first open-source comprehensive framework. We curate a novel benchmark covering 9 datasets with 14K samples, tailored for credit assessment and a critical examination of potential biases within LLMs, and the novel instruction tuning data with over 45k samples. We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks. We evaluate CALM, existing state-of-art (SOTA) methods, open source and closed source LLMs on the build benchmark. Our empirical results illuminate the capability of LLMs to not only match but surpass conventional models, pointing towards a future where credit scoring can be more inclusive, comprehensive, and unbiased. We contribute to the industry's transformation by sharing our pioneering instruction-tuning datasets, credit and risk assessment LLM, and benchmarks with the research community and the financial industry.  ( 3 min )
    Going Beyond Familiar Features for Deep Anomaly Detection
    arXiv:2310.00797v3 Announce Type: replace Abstract: Anomaly Detection (AD) is a critical task that involves identifying observations that do not conform to a learned model of normality. Prior work in deep AD is predominantly based on a familiarity hypothesis, where familiar features serve as the reference in a pre-trained embedding space. While this strategy has proven highly successful, it turns out that it causes consistent false negatives when anomalies consist of truly novel features that are not well captured by the pre-trained encoding. We propose a novel approach to AD using explainability to capture such novel features as unexplained observations in the input space. We achieve strong performance across a wide range of anomaly benchmarks by combining familiarity and novelty in a hybrid approach. Our approach establishes a new state-of-the-art across multiple benchmarks, handling diverse anomaly types while eliminating the need for expensive background models and dense matching. In particular, we show that by taking account of novel features, we reduce false negative anomalies by up to 40% on challenging benchmarks compared to the state-of-the-art. Our method gives visually inspectable explanations for pixel-level anomalies.  ( 2 min )
    Early-Exit with Class Exclusion for Efficient Inference of Neural Networks
    arXiv:2309.13443v2 Announce Type: replace Abstract: Deep neural networks (DNNs) have been successfully applied in various fields. In DNNs, a large number of multiply-accumulate (MAC) operations are required to be performed, posing critical challenges in applying them in resource-constrained platforms, e.g., edge devices. To address this challenge, in this paper, we propose a class-based early-exit for dynamic inference. Instead of pushing DNNs to make a dynamic decision at intermediate layers, we take advantage of the learned features in these layers to exclude as many irrelevant classes as possible, so that later layers only have to determine the target class among the remaining classes. When only one class remains at a layer, this class is the corresponding classification result. Experimental results demonstrate the computational cost of DNNs in inference can be reduced significantly with the proposed early-exit technique. The codes can be found at https://github.com/HWAI-TUDa/EarlyClassExclusion.  ( 2 min )
    Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning
    arXiv:2309.10302v2 Announce Type: replace Abstract: Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains. To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training (D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi-heads, and finally fine-tunes the heads by fixing the backbone, enabling decouple training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.  ( 2 min )
    Order-based Structure Learning with Normalizing Flows
    arXiv:2308.07480v2 Announce Type: replace Abstract: Estimating the causal structure of observational data is a challenging combinatorial search problem that scales super-exponentially with graph size. Existing methods use continuous relaxations to make this problem computationally tractable but often restrict the data-generating process to additive noise models (ANMs) through explicit or implicit assumptions. We present Order-based Structure Learning with Normalizing Flows (OSLow), a framework that relaxes these assumptions using autoregressive normalizing flows. We leverage the insight that searching over topological orderings is a natural way to enforce acyclicity in structure discovery and propose a novel, differentiable permutation learning method to find such orderings. Through extensive experiments on synthetic and real-world data, we demonstrate that OSLow outperforms prior baselines and improves performance on the observational Sachs and SynTReN datasets as measured by structural hamming distance and structural intervention distance, highlighting the importance of relaxing the ANM assumption made by existing methods.  ( 2 min )
    VERSE: Virtual-Gradient Aware Streaming Lifelong Learning with Anytime Inference
    arXiv:2309.08227v2 Announce Type: replace Abstract: Lifelong learning or continual learning is the problem of training an AI agent continuously while also preventing it from forgetting its previously acquired knowledge. Streaming lifelong learning is a challenging setting of lifelong learning with the goal of continuous learning in a dynamic non-stationary environment without forgetting. We introduce a novel approach to lifelong learning, which is streaming (observes each training example only once), requires a single pass over the data, can learn in a class-incremental manner, and can be evaluated on-the-fly (anytime inference). To accomplish these, we propose a novel \emph{virtual gradients} based approach for continual representation learning which adapts to each new example while also generalizing well on past data to prevent catastrophic forgetting. Our approach also leverages an exponential-moving-average-based semantic memory to further enhance performance. Experiments on diverse datasets with temporally correlated observations demonstrate our method's efficacy and superior performance over existing methods.  ( 2 min )
    AFN: Adaptive Fusion Normalization via an Encoder-Decoder Framework
    arXiv:2308.03321v4 Announce Type: replace Abstract: The success of deep learning is inseparable from normalization layers. Researchers have proposed various normalization functions, and each of them has both advantages and disadvantages. In response, efforts have been made to design a unified normalization function that combines all normalization procedures and mitigates their weaknesses. We also proposed a new normalization function called Adaptive Fusion Normalization. Through experiments, we demonstrate AFN outperforms the previous normalization techniques in domain generalization and image classification tasks.  ( 2 min )
    CoRe Optimizer: An All-in-One Solution for Machine Learning
    arXiv:2307.15663v2 Announce Type: replace Abstract: The optimization algorithm and its hyperparameters can significantly affect the training speed and resulting model accuracy in machine learning applications. The wish list for an ideal optimizer includes fast and smooth convergence to low error, low computational demand, and general applicability. Our recently introduced continual resilient (CoRe) optimizer has shown superior performance compared to other state-of-the-art first-order gradient-based optimizers for training lifelong machine learning potentials. In this work we provide an extensive performance comparison of the CoRe optimizer and nine other optimization algorithms including the Adam optimizer and resilient backpropagation (RPROP) for diverse machine learning tasks. We analyze the influence of different hyperparameters and provide generally applicable values. The CoRe optimizer yields best or competitive performance in every investigated application, while only one hyperparameter needs to be changed depending on mini-batch or batch learning.  ( 2 min )
    A Survey of What to Share in Federated Learning: Perspectives on Model Utility, Privacy Leakage, and Communication Efficiency
    arXiv:2307.10655v2 Announce Type: replace Abstract: Federated learning (FL) has emerged as a secure paradigm for collaborative training among clients. Without data centralization, FL allows clients to share local information in a privacy-preserving manner. This approach has gained considerable attention, promoting numerous surveys to summarize the related works. However, the majority of these surveys concentrate on FL methods that share model parameters during the training process, while overlooking the possibility of sharing local information in other forms. In this paper, we present a systematic survey from a new perspective of what to share in FL, with an emphasis on the model utility, privacy leakage, and communication efficiency. First, we present a new taxonomy of FL methods in terms of three sharing methods, which respectively share model, synthetic data, and knowledge. Second, we analyze the vulnerability of different sharing methods to privacy attacks and review the defense mechanisms. Third, we conduct extensive experiments to compare the learning performance and communication overhead of various sharing methods in FL. Besides, we assess the potential privacy leakage through model inversion and membership inference attacks, while comparing the effectiveness of various defense approaches. Finally, we identify future research directions and conclude the survey.  ( 3 min )
    On the Cause of Unfairness: A Training Sample Perspective
    arXiv:2306.17828v2 Announce Type: replace Abstract: Identifying the causes of a model's unfairness is an important yet relatively unexplored task. We look into this problem through the lens of training data - the major source of unfairness. We ask the following questions: How would the unfairness of a model change if its training samples (1) were collected from a different (e.g. demographic) group, (2) were labeled differently, or (3) whose features were modified? In other words, we quantify the influence of training samples on unfairness by counterfactually changing samples based on predefined concepts, i.e. data attributes such as features, labels, and sensitive attributes. Our framework not only can help practitioners understand the observed unfairness and mitigate it by repairing their training data, but also leads to many other applications, e.g. detecting mislabeling, fixing imbalanced representations, and detecting fairness-targeted poisoning attacks.  ( 2 min )
    Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks
    arXiv:2307.01649v2 Announce Type: replace Abstract: Convolutional residual neural networks (ConvResNets), though overparameterized, can achieve remarkable prediction performance in practice, which cannot be well explained by conventional wisdom. To bridge this gap, we study the performance of ConvResNeXts, which cover ConvResNets as a special case, trained with weight decay from the perspective of nonparametric classification. Our analysis allows for infinitely many building blocks in ConvResNeXts, and shows that weight decay implicitly enforces sparsity on these blocks. Specifically, we consider a smooth target function supported on a low-dimensional manifold, then prove that ConvResNeXts can adapt to the function smoothness and low-dimensional structures and efficiently learn the function without suffering from the curse of dimensionality. Our findings partially justify the advantage of overparameterized ConvResNeXts over conventional machine learning models.  ( 2 min )
    Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic
    arXiv:2306.02865v4 Announce Type: replace Abstract: Learning high-quality Q-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms. Previous works focus on addressing the value overestimation issue, an outcome of adopting function approximators and off-policy learning. Deviating from the common viewpoint, we observe that Q-values are indeed underestimated in the latter stage of the RL training process, primarily related to the use of inferior actions from the current policy in Bellman updates as compared to the more optimal action samples in the replay buffer. We hypothesize that this long-neglected phenomenon potentially hinders policy learning and reduces sample efficiency. Our insight to address this issue is to incorporate sufficient exploitation of past successes while maintaining exploration optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates Q-value using both historical best-performing actions and the current policy. The instantiations of our method in both model-free and model-based settings outperform state-of-the-art methods in various continuous control tasks and achieve strong performance in failure-prone scenarios and real-world robot tasks.  ( 2 min )
    Datasheets for Machine Learning Sensors: Towards Transparency, Auditability, and Responsibility for Intelligent Sensing
    arXiv:2306.08848v3 Announce Type: replace Abstract: Machine learning (ML) sensors are enabling intelligence at the edge by empowering end-users with greater control over their data. ML sensors offer a new paradigm for sensing that moves the processing and analysis to the device itself rather than relying on the cloud, bringing benefits like lower latency and greater data privacy. The rise of these intelligent edge devices, while revolutionizing areas like the internet of things (IoT) and healthcare, also throws open critical questions about privacy, security, and the opacity of AI decision-making. As ML sensors become more pervasive, it requires judicious governance regarding transparency, accountability, and fairness. To this end, we introduce a standard datasheet template for these ML sensors and discuss and evaluate the design and motivation for each section of the datasheet in detail including: standard dasheet components like the system's hardware specifications, IoT and AI components like the ML model and dataset attributes, as well as novel components like end-to-end performance metrics, and expanded environmental impact metrics. To provide a case study of the application of our datasheet template, we also designed and developed two examples for ML sensors performing computer vision-based person detection: one an open-source ML sensor designed and developed in-house, and a second commercial ML sensor developed by our industry collaborators. Together, ML sensors and their datasheets provide greater privacy, security, transparency, explainability, auditability, and user-friendliness for ML-enabled embedded systems. We conclude by emphasizing the need for standardization of datasheets across the broader ML community to ensure the responsible use of sensor data.  ( 3 min )
    Autoencoding Conditional Neural Processes for Representation Learning
    arXiv:2305.18485v2 Announce Type: replace Abstract: Conditional neural processes (CNPs) are a flexible and efficient family of models that learn to learn a stochastic process from data. They have seen particular application in contextual image completion - observing pixel values at some locations to predict a distribution over values at other unobserved locations. However, the choice of pixels in learning CNPs is typically either random or derived from a simple statistical measure (e.g. pixel variance). Here, we turn the problem on its head and ask: which pixels would a CNP like to observe - do they facilitate fitting better CNPs, and do such pixels tell us something meaningful about the underlying image? To this end we develop the Partial Pixel Space Variational Autoencoder (PPS-VAE), an amortised variational framework that casts CNP context as latent variables learnt simultaneously with the CNP. We evaluate PPS-VAE over a number of tasks across different visual data, and find that not only can it facilitate better-fit CNPs, but also that the spatial arrangement and values meaningfully characterise image information - evaluated through the lens of classification on both within and out-of-data distributions. Our model additionally allows for dynamic adaption of context-set size and the ability to scale-up to larger images, providing a promising avenue to explore learning meaningful and effective visual representations.  ( 2 min )
    Balanced Training of Energy-Based Models with Adaptive Flow Sampling
    arXiv:2306.00684v4 Announce Type: replace Abstract: Energy-based models (EBMs) are versatile density estimation models that directly parameterize an unnormalized log density. Although very flexible, EBMs lack a specified normalization constant of the model, making the likelihood of the model computationally intractable. Several approximate samplers and variational inference techniques have been proposed to estimate the likelihood gradients for training. These techniques have shown promising results in generating samples, but little attention has been paid to the statistical accuracy of the estimated density, such as determining the relative importance of different classes in a dataset. In this work, we propose a new maximum likelihood training algorithm for EBMs that uses a different type of generative model, normalizing flows (NF), which have recently been proposed to facilitate sampling. Our method fits an NF to an EBM during training so that an NF-assisted sampling scheme provides an accurate gradient for the EBMs at all times, ultimately leading to a fast sampler for generating new data.  ( 2 min )
    Leaving the Nest: Going Beyond Local Loss Functions for Predict-Then-Optimize
    arXiv:2305.16830v2 Announce Type: replace Abstract: Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can the structure of a decision-making task be used to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. These assumptions both lead to approaches with high computational cost, and when they are violated in practice, poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.  ( 2 min )
    Approximation Rate of the Transformer Architecture for Sequence Modeling
    arXiv:2305.18475v2 Announce Type: replace Abstract: The Transformer architecture is widely applied in sequence modeling applications, yet the theoretical understanding of its working principles remains limited. In this work, we investigate the approximation rate for single-layer Transformers with one head. We consider a class of non-linear relationships and identify a novel notion of complexity measures to establish an explicit Jackson-type approximation rate estimate for the Transformer. This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited for approximating. In particular, the results on approximation rates enable us to concretely analyze the differences between the Transformer and classical sequence modeling methods, such as recurrent neural networks.  ( 2 min )
    MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks
    arXiv:2304.14979v2 Announce Type: replace Abstract: The field of machine learning (ML) has gained widespread adoption, leading to significant demand for adapting ML to specific scenarios, which is yet expensive and non-trivial. The predominant approaches towards the automation of solving ML tasks (e.g., AutoML) are often time-consuming and hard to understand for human developers. In contrast, though human engineers have the incredible ability to understand tasks and reason about solutions, their experience and knowledge are often sparse and difficult to utilize by quantitative approaches. In this paper, we aim to bridge the gap between machine intelligence and human knowledge by introducing a novel framework, which leverages the state-of-the-art large language models to develop ML solutions for novel tasks. We showcase the possibility of extending the capability of LLMs to comprehend structured inputs and perform thorough reasoning for solving novel ML tasks. And we find that, after some dedicated design, the LLM can (i) observe from the existing experiences of ML tasks and (ii) reason effectively to deliver promising results for new tasks. The solution generated can be used directly to achieve high levels of competitiveness. Examples and code available at https://github.com/microsoft/CoML.  ( 2 min )
    Multimodal Web Navigation with Instruction-Finetuned Foundation Models
    arXiv:2305.11854v3 Announce Type: replace Abstract: The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.  ( 3 min )
    Spintronic Physical Reservoir for Autonomous Prediction and Long-Term Household Energy Load Forecasting
    arXiv:2304.03343v2 Announce Type: replace Abstract: In this study, we have shown autonomous long-term prediction with a spintronic physical reservoir. Due to the short-term memory property of the magnetization dynamics, non-linearity arises in the reservoir states which could be used for long-term prediction tasks using simple linear regression for online training. During the prediction stage, the output is directly fed to the input of the reservoir for autonomous prediction. We employ our proposed reservoir for the modeling of the chaotic time series such as Mackey-Glass and dynamic time-series data, such as household building energy loads. Since only the last layer of a RC needs to be trained with linear regression, it is well suited for learning in real time on edge devices. Here we show that a skyrmion based magnetic tunnel junction can potentially be used as a prototypical RC but any nanomagnetic magnetic tunnel junction with nonlinear magnetization behavior can implement such a RC. By comparing our spintronic physical RC approach with energy load forecasting algorithms, such as LSTMs and RNNs, we conclude that the proposed framework presents good performance in achieving high predictions accuracy, while also requiring low memory and energy both of which are at a premium in hardware resource and power constrained edge applications. Further, the proposed approach is shown to require very small training datasets and at the same time being at least 16X energy efficient compared to the sequence to sequence LSTM for accurate household load predictions.  ( 3 min )
    Applications of No-Collision Transportation Maps in Manifold Learning
    arXiv:2304.00199v4 Announce Type: replace Abstract: In this work, we investigate applications of no-collision transportation maps introduced in [Nurbekyan et. al., 2020] in manifold learning for image data. Recently, there has been a surge in applying transportation-based distances and features for data representing motion-like or deformation-like phenomena. Indeed, comparing intensities at fixed locations often does not reveal the data structure. No-collision maps and distances developed in [Nurbekyan et. al., 2020] are sensitive to geometric features similar to optimal transportation (OT) maps but much cheaper to compute due to the absence of optimization. In this work, we prove that no-collision distances provide an isometry between translations (respectively dilations) of a single probability measure and the translation (respectively dilation) vectors equipped with a Euclidean distance. Furthermore, we prove that no-collision transportation maps, as well as OT and linearized OT maps, do not in general provide an isometry for rotations. The numerical experiments confirm our theoretical findings and show that no-collision distances achieve similar or better performance on several manifold learning tasks compared to other OT and Euclidean-based methods at a fraction of a computational cost.  ( 2 min )
    Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems
    arXiv:2303.05754v3 Announce Type: replace Abstract: Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizing the residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combines the diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie's formula forms a Krylov subspace, then the CG initialized with the denoised data ensures the data consistency update to remain in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method. Code is available at https://github.com/HJ-harry/DDS  ( 3 min )
    Multi-task Meta Label Correction for Time Series Prediction
    arXiv:2303.08103v3 Announce Type: replace Abstract: Time series classification faces two unavoidable problems. One is partial feature information and the other is poor label quality, which may affect model performance. To address the above issues, we create a label correction method to time series data with meta-learning under a multi-task framework. There are three main contributions. First, we train the label correction model with a two-branch neural network in the outer loop. While in the model-agnostic inner loop, we use pre-existing classification models in a multi-task way and jointly update the meta-knowledge so as to help us achieve adaptive labeling on complex time series. Second, we devise new data visualization methods for both image patterns of the historical data and data in the prediction horizon. Finally, we test our method with various financial datasets, including XOM, S\&P500, and SZ50. Results show that our method is more effective and accurate than some existing label correction techniques.  ( 2 min )
    Distances for Markov Chains, and Their Differentiation
    arXiv:2302.08621v2 Announce Type: replace Abstract: (Directed) graphs with node attributes are a common type of data in various applications and there is a vast literature on developing metrics and efficient algorithms for comparing them. Recently, in the graph learning and optimization communities, a range of new approaches have been developed for comparing graphs with node attributes, leveraging ideas such as the Optimal Transport (OT) and the Weisfeiler-Lehman (WL) graph isomorphism test. Two state-of-the-art representatives are the OTC distance proposed in (O'Connor et al., 2022) and the WL distance in (Chen et al., 2022). Interestingly, while these two distances are developed based on different ideas, we observe that they both view graphs as Markov chains, and are deeply connected. Indeed, in this paper, we propose a unified framework to generate distances for Markov chains (thus including (directed) graphs with node attributes), which we call the Optimal Transport Markov (OTM) distances, that encompass both the OTC and the WL distances. We further introduce a special one-parameter family of distances within our OTM framework, called the discounted WL distance. We show that the discounted WL distance has nice theoretical properties and can address several limitations of the existing OTC and WL distances. Furthermore, contrary to the OTC and the WL distances, our new discounted WL distance can be differentiated after a entropy-regularization similar to the Sinkhorn distance, making it suitable to use in learning frameworks, e.g., as the reconstruction loss in a graph generative model.  ( 3 min )
    Lumos: Heterogeneity-aware Federated Graph Learning over Decentralized Devices
    arXiv:2303.00492v3 Announce Type: replace Abstract: Graph neural networks (GNN) have been widely deployed in real-world networked applications and systems due to their capability to handle graph-structured data. However, the growing awareness of data privacy severely challenges the traditional centralized model training paradigm, where a server holds all the graph information. Federated learning is an emerging collaborative computing paradigm that allows model training without data centralization. Existing federated GNN studies mainly focus on systems where clients hold distinctive graphs or sub-graphs. The practical node-level federated situation, where each client is only aware of its direct neighbors, has yet to be studied. In this paper, we propose the first federated GNN framework called Lumos that supports supervised and unsupervised learning with feature and degree protection on node-level federated graphs. We first design a tree constructor to improve the representation capability given the limited structural information. We further present a Monte Carlo Markov Chain-based algorithm to mitigate the workload imbalance caused by degree heterogeneity with theoretically-guaranteed performance. Based on the constructed tree for each client, a decentralized tree-based GNN trainer is proposed to support versatile training. Extensive experiments demonstrate that Lumos outperforms the baseline with significantly higher accuracy and greatly reduced communication cost and training time.  ( 3 min )
    Deep Learning Predicts Prevalent and Incident Parkinson's Disease From UK Biobank Fundus Imaging
    arXiv:2302.06727v3 Announce Type: replace Abstract: Parkinson's disease is the world's fastest-growing neurological disorder. Research to elucidate the mechanisms of Parkinson's disease and automate diagnostics would greatly improve the treatment of patients with Parkinson's disease. Current diagnostic methods are expensive and have limited availability. Considering the insidious and preclinical onset and progression of the disease, a desirable screening should be diagnostically accurate even before the onset of symptoms to allow medical interventions. We highlight retinal fundus imaging, often termed a window to the brain, as a diagnostic screening modality for Parkinson's disease. We conducted a systematic evaluation of conventional machine learning and deep learning techniques to classify Parkinson's disease from UK Biobank fundus imaging. Our results show that Parkinson's disease individuals can be differentiated from age and gender-matched healthy subjects with an Area Under the Curve (AUC) of 0.77. This accuracy is maintained when predicting either prevalent or incident Parkinson's disease. Explainability and trustworthiness are enhanced by visual attribution maps of localized biomarkers and quantified metrics of model robustness to data perturbations.  ( 3 min )
    Graph-based Time-Series Anomaly Detection: A Survey and Outlook
    arXiv:2302.00058v3 Announce Type: replace Abstract: With the recent advances in technology, a wide range of systems continue to collect a large amount of data over time and thus generate time series. Time-Series Anomaly Detection (TSAD) is an important task in various time-series applications such as e-commerce, cybersecurity, vehicle maintenance, and healthcare monitoring. However, this task is very challenging as it requires considering both the intra-variable dependency and the inter-variable dependency, where a variable can be defined as an observation in time series data. Recent graph-based approaches have made impressive progress in tackling the challenges of this field. In this survey, we conduct a comprehensive and up-to-date review of Graph-based TSAD (G-TSAD). First, we explore the significant potential of graph representation learning for time-series data. Then, we review state-of-the-art graph anomaly detection techniques in the context of time series and discuss their strengths and drawbacks. Finally, we discuss the technical challenges and potential future directions for possible improvements in this research field.  ( 2 min )
    Semi-supervised Batch Learning From Logged Data
    arXiv:2209.07148v3 Announce Type: replace Abstract: Off-policy learning methods are intended to learn a policy from logged data, which includes context, action, and feedback (cost or reward) for each sample point. In this work, we build on the counterfactual risk minimization framework, which also assumes access to propensity scores. We propose learning methods for problems where feedback is missing for some samples, so there are samples with feedback and samples missing-feedback in the logged data. We refer to this type of learning as semi-supervised batch learning from logged data, which arises in a wide range of application domains. We derive a novel upper bound for the true risk under the inverse propensity score estimator to address this kind of learning problem. Using this bound, we propose a regularized semi-supervised batch learning method with logged data where the regularization term is feedback-independent and, as a result, can be evaluated using the logged missing-feedback data. Consequently, even though feedback is only present for some samples, a learning policy can be learned by leveraging the missing-feedback samples. The results of experiments derived from benchmark datasets indicate that these algorithms achieve policies with better performance in comparison with logging policies.  ( 2 min )
    STLGRU: Spatio-Temporal Lightweight Graph GRU for Traffic Flow Prediction
    arXiv:2212.04548v3 Announce Type: replace Abstract: Reliable forecasting of traffic flow requires efficient modeling of traffic data. Indeed, different correlations and influences arise in a dynamic traffic network, making modeling a complicated task. Existing literature has proposed many different methods to capture traffic networks' complex underlying spatial-temporal relations. However, given the heterogeneity of traffic data, consistently capturing both spatial and temporal dependencies presents a significant challenge. Also, as more and more sophisticated methods are being proposed, models are increasingly becoming memory-heavy and, thus, unsuitable for low-powered devices. To this end, we propose Spatio-Temporal Lightweight Graph GRU, namely STLGRU, a novel traffic forecasting model for predicting traffic flow accurately. Specifically, our proposed STLGRU can effectively capture dynamic local and global spatial-temporal relations of traffic networks using memory-augmented attention and gating mechanisms in a continuously synchronized manner. Moreover, instead of employing separate temporal and spatial components, we show that our memory module and gated unit can successfully learn the spatial-temporal dependencies with reduced memory usage and fewer parameters. Extensive experimental results on three real-world public traffic datasets demonstrate that our method can not only achieve state-of-the-art performance but also exhibit competitive computational efficiency. Our code is available at https://github.com/Kishor-Bhaumik/STLGRU  ( 2 min )
    Variance Reduction Based Experience Replay for Policy Optimization
    arXiv:2110.08902v3 Announce Type: replace Abstract: For reinforcement learning on complex stochastic systems, it is desirable to effectively leverage the information from historical samples collected in previous iterations to accelerate policy optimization. Classical experience replay, while effective, treats all observations uniformly, neglecting their relative importance. To address this limitation, we introduce a novel Variance Reduction Experience Replay (VRER) framework, enabling the selective reuse of relevant samples to improve policy gradient estimation. VRER, as an adaptable method that can seamlessly integrate with different policy optimization algorithms, forms the foundation of our sample-efficient off-policy algorithm known as Policy Optimization with VRER (PG-VRER). Furthermore, the lack of a rigorous theoretical understanding of the experience replay method in the literature motivates us to introduce a novel theoretical framework that accounts for sample dependencies induced by Markovian noise and behavior policy interdependencies. This framework is then employed to analyze the finite-time convergence of our VRER-based policy optimization algorithm, revealing a crucial bias-variance trade-off in policy gradient estimates: the reuse of old experience introduces increased bias while simultaneously reducing gradient variance. Extensive experiments have shown that VRER offers a notable acceleration in learning optimal policies and enhances the performance of state-of-the-art (SOTA) policy optimization approaches.  ( 3 min )
    Representations learnt by SGD and Adaptive learning rules: Conditions that vary sparsity and selectivity in neural network
    arXiv:2201.11653v2 Announce Type: replace Abstract: From the point of view of the human brain, continual learning can perform various tasks without mutual interference. An effective way to reduce mutual interference can be found in sparsity and selectivity of neurons. According to Aljundi et al. and Hadsell et al., imposing sparsity at the representational level is advantageous for continual learning because sparse neuronal activations encourage less overlap between parameters, resulting in less interference. Similarly, highly selective neural networks are likely to induce less interference since particular response in neurons will reduce the chance of overlap with other parameters. Considering that the human brain performs continual learning over the lifespan, finding conditions where sparsity and selectivity naturally arises may provide insight for understanding how the brain functions. This paper investigates various conditions that naturally increase sparsity and selectivity in a neural network. This paper tested different optimizers with Hoyer's sparsity metric and CCMAS selectivity metric in MNIST classification task. It is essential to note that investigations on the natural occurrence of sparsity and selectivity concerning various conditions have not been acknowledged in any sector of neuroscience nor machine learning until this day. This paper found that particular conditions increase sparsity and selectivity such as applying a large learning rate and lowering a batch size. In addition to the relationship between the condition, sparsity, and selectivity, the following will be discussed based on empirical analysis: 1. The relationship between sparsity and selectivity and 2. The relationship between test accuracy, sparsity, and selectivity.  ( 3 min )
    RadarScenes: A Real-World Radar Point Cloud Data Set for Automotive Applications
    arXiv:2104.02493v2 Announce Type: replace Abstract: A new automotive radar data set with measurements and point-wise annotations from more than four hours of driving is presented. Data provided by four series radar sensors mounted on one test vehicle were recorded and the individual detections of dynamic objects were manually grouped to clusters and labeled afterwards. The purpose of this data set is to enable the development of novel (machine learning-based) radar perception algorithms with the focus on moving road users. Images of the recorded sequences were captured using a documentary camera. For the evaluation of future object detection and classification algorithms, proposals for score calculation are made so that researchers can evaluate their algorithms on a common basis. Additional information as well as download instructions can be found on the website of the data set: www.radar-scenes.com.  ( 2 min )
    Deep-Lock: Secure Authorization for Deep Neural Networks
    arXiv:2008.05966v2 Announce Type: replace Abstract: Trained Deep Neural Network (DNN) models are considered valuable Intellectual Properties (IP) in several business models. Prevention of IP theft and unauthorized usage of such DNN models has been raised as of significant concern by industry. In this paper, we address the problem of preventing unauthorized usage of DNN models by proposing a generic and lightweight key-based model-locking scheme, which ensures that a locked model functions correctly only upon applying the correct secret key. The proposed scheme, known as Deep-Lock, utilizes S-Boxes with good security properties to encrypt each parameter of a trained DNN model with secret keys generated from a master key via a key scheduling algorithm. The resulting dense network of encrypted weights is found robust against model fine-tuning attacks. Finally, Deep-Lock does not require any intervention in the structure and training of the DNN models, making it applicable for all existing software and hardware implementations of DNN.  ( 2 min )
    GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
    arXiv:2402.12348v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment composing 10 widely-recognized tasks, across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we investigate two key problems: (1) Characterizing game-theoretic reasoning of LLMs; (2) LLM-vs-LLM competitions as reasoning evaluation. We observe that (1) LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games yet they are competitive in probabilistic gaming scenarios; (2) Open-source LLMs, e.g., CodeLlama-34b-Instruct, are less competitive than commercial LLMs, e.g., GPT-4, in complex games. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. Detailed error profiles are also provided for a better understanding of LLMs' behavior.  ( 2 min )
    Short-Period Variables in TESS Full-Frame Image Light Curves Identified via Convolutional Neural Networks
    arXiv:2402.12369v1 Announce Type: cross Abstract: The Transiting Exoplanet Survey Satellite (TESS) mission measured light from stars in ~85% of the sky throughout its two-year primary mission, resulting in millions of TESS 30-minute cadence light curves to analyze in the search for transiting exoplanets. To search this vast dataset, we aim to provide an approach that is both computationally efficient, produces highly performant predictions, and minimizes the required human search effort. We present a convolutional neural network that we train to identify short period variables. To make a prediction for a given light curve, our network requires no prior target parameters identified using other methods. Our network performs inference on a TESS 30-minute cadence light curve in ~5ms on a single GPU, enabling large scale archival searches. We present a collection of 14156 short-period variables identified by our network. The majority of our identified variables fall into two prominent populations, one of short-period main sequence binaries and another of Delta Scuti stars. Our neural network model and related code is additionally provided as open-source code for public use and extension.  ( 3 min )
    Query-Based Adversarial Prompt Generation
    arXiv:2402.12329v1 Announce Type: cross Abstract: Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.  ( 2 min )
    An Adversarial Approach to Evaluating the Robustness of Event Identification Models
    arXiv:2402.12338v1 Announce Type: cross Abstract: Intelligent machine learning approaches are finding active use for event detection and identification that allow real-time situational awareness. Yet, such machine learning algorithms have been shown to be susceptible to adversarial attacks on the incoming telemetry data. This paper considers a physics-based modal decomposition method to extract features for event classification and focuses on interpretable classifiers including logistic regression and gradient boosting to distinguish two types of events: load loss and generation loss. The resulting classifiers are then tested against an adversarial algorithm to evaluate their robustness. The adversarial attack is tested in two settings: the white box setting, wherein the attacker knows exactly the classification model; and the gray box setting, wherein the attacker has access to historical data from the same network as was used to train the classifier, but does not know the classification model. Thorough experiments on the synthetic South Carolina 500-bus system highlight that a relatively simpler model such as logistic regression is more susceptible to adversarial attacks than gradient boosting.  ( 2 min )
    LLM Agents for Psychology: A Study on Gamified Assessments
    arXiv:2402.12326v1 Announce Type: cross Abstract: Psychological measurement is essential for mental health, self-understanding, and personal development. Traditional methods, such as self-report scales and psychologist interviews, often face challenges with engagement and accessibility. While game-based and LLM-based tools have been explored to improve user interest and automate assessment, they struggle to balance engagement with generalizability. In this work, we propose PsychoGAT (Psychological Game AgenTs) to achieve a generic gamification of psychological assessment. The main insight is that powerful LLMs can function both as adept psychologists and innovative game designers. By incorporating LLM agents into designated roles and carefully managing their interactions, PsychoGAT can transform any standardized scales into personalized and engaging interactive fiction games. To validate the proposed method, we conduct psychometric evaluations to assess its effectiveness and employ human evaluators to examine the generated content across various psychological constructs, including depression, cognitive distortions, and personality traits. Results demonstrate that PsychoGAT serves as an effective assessment tool, achieving statistically significant excellence in psychometric metrics such as reliability, convergent validity, and discriminant validity. Moreover, human evaluations confirm PsychoGAT's enhancements in content coherence, interactivity, interest, immersion, and satisfaction.  ( 2 min )
    Landmark Stereo Dataset for Landmark Recognition and Moving Node Localization in a Non-GPS Battlefield Environment
    arXiv:2402.12320v1 Announce Type: cross Abstract: In this paper, we have proposed a new strategy of using the landmark anchor node instead of a radio-based anchor node to obtain the virtual coordinates (landmarkID, DISTANCE) of moving troops or defense forces that will help in tracking and maneuvering the troops along a safe path within a GPS-denied battlefield environment. The proposed strategy implements landmark recognition using the Yolov5 model and landmark distance estimation using an efficient Stereo Matching Algorithm. We consider that a moving node carrying a low-power mobile device facilitated with a calibrated stereo vision camera that captures stereo images of a scene containing landmarks within the battlefield region whose locations are stored in an offline server residing within the device itself. We created a custom landmark image dataset called MSTLandmarkv1 with 34 landmark classes and another landmark stereo dataset of those 34 landmark instances called MSTLandmarkStereov1. We trained the YOLOv5 model with MSTLandmarkv1 dataset and achieved 0.95 mAP @ 0.5 IoU and 0.767 mAP @ [0.5: 0.95] IoU. We calculated the distance from a node to the landmark utilizing the bounding box coordinates and the depth map generated by the improved SGM algorithm using MSTLandmarkStereov1. The tuple of landmark IDs obtained from the detection result and the distances calculated by the SGM algorithm are stored as the virtual coordinates of a node. In future work, we will use these virtual coordinates to obtain the location of a node using an efficient trilateration algorithm and optimize the node position using the appropriate optimization method.  ( 3 min )
    Regularization by denoising: Bayesian model and Langevin-within-split Gibbs sampling
    arXiv:2402.12292v1 Announce Type: cross Abstract: This paper introduces a Bayesian framework for image inversion by deriving a probabilistic counterpart to the regularization-by-denoising (RED) paradigm. It additionally implements a Monte Carlo algorithm specifically tailored for sampling from the resulting posterior distribution, based on an asymptotically exact data augmentation (AXDA). The proposed algorithm is an approximate instance of split Gibbs sampling (SGS) which embeds one Langevin Monte Carlo step. The proposed method is applied to common imaging tasks such as deblurring, inpainting and super-resolution, demonstrating its efficacy through extensive numerical experiments. These contributions advance Bayesian inference in imaging by leveraging data-driven regularization strategies within a probabilistic framework.  ( 2 min )
    Asymptotic Gaussian Fluctuations of Eigenvectors in Spectral Clustering
    arXiv:2402.12302v1 Announce Type: cross Abstract: The performance of spectral clustering relies on the fluctuations of the entries of the eigenvectors of a similarity matrix, which has been left uncharacterized until now. In this letter, it is shown that the signal $+$ noise structure of a general spike random matrix model is transferred to the eigenvectors of the corresponding Gram kernel matrix and the fluctuations of their entries are Gaussian in the large-dimensional regime. This CLT-like result was the last missing piece to precisely predict the classification performance of spectral clustering. The proposed proof is very general and relies solely on the rotational invariance of the noise. Numerical experiments on synthetic and real data illustrate the universality of this phenomenon.  ( 2 min )
    Kernel KMeans clustering splits for end-to-end unsupervised decision trees
    arXiv:2402.12232v1 Announce Type: cross Abstract: Trees are convenient models for obtaining explainable predictions on relatively small datasets. Although there are many proposals for the end-to-end construction of such trees in supervised learning, learning a tree end-to-end for clustering without labels remains an open challenge. As most works focus on interpreting with trees the result of another clustering algorithm, we present here a novel end-to-end trained unsupervised binary tree for clustering: Kauri. This method performs a greedy maximisation of the kernel KMeans objective without requiring the definition of centroids. We compare this model on multiple datasets with recent unsupervised trees and show that Kauri performs identically when using a linear kernel. For other kernels, Kauri often outperforms the concatenation of kernel KMeans and a CART decision tree.  ( 2 min )
    Secure Federated Learning Across Heterogeneous Cloud and High-Performance Computing Resources -- A Case Study on Federated Fine-tuning of LLaMA 2
    arXiv:2402.12271v1 Announce Type: cross Abstract: Federated learning enables multiple data owners to collaboratively train robust machine learning models without transferring large or sensitive local datasets by only sharing the parameters of the locally trained models. In this paper, we elaborate on the design of our Advanced Privacy-Preserving Federated Learning (APPFL) framework, which streamlines end-to-end secure and reliable federated learning experiments across cloud computing facilities and high-performance computing resources by leveraging Globus Compute, a distributed function as a service platform, and Amazon Web Services. We further demonstrate the use case of APPFL in fine-tuning a LLaMA 2 7B model using several cloud resources and supercomputers.  ( 2 min )
    AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
    arXiv:2402.12226v1 Announce Type: cross Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/  ( 2 min )
    CovRL: Fuzzing JavaScript Engines with Coverage-Guided Reinforcement Learning for LLM-based Mutation
    arXiv:2402.12222v1 Announce Type: cross Abstract: Fuzzing is an effective bug-finding technique but it struggles with complex systems like JavaScript engines that demand precise grammatical input. Recently, researchers have adopted language models for context-aware mutation in fuzzing to address this problem. However, existing techniques are limited in utilizing coverage guidance for fuzzing, which is rather performed in a black-box manner. This paper presents a novel technique called CovRL (Coverage-guided Reinforcement Learning) that combines Large Language Models (LLMs) with reinforcement learning from coverage feedback. Our fuzzer, CovRL-Fuzz, integrates coverage feedback directly into the LLM by leveraging the Term Frequency-Inverse Document Frequency (TF-IDF) method to construct a weighted coverage map. This map is key in calculating the fuzzing reward, which is then applied to the LLM-based mutator through reinforcement learning. CovRL-Fuzz, through this approach, enables the generation of test cases that are more likely to discover new coverage areas, thus improving vulnerability detection while minimizing syntax and semantic errors, all without needing extra post-processing. Our evaluation results indicate that CovRL-Fuzz outperforms the state-of-the-art fuzzers in terms of code coverage and bug-finding capabilities: CovRL-Fuzz identified 48 real-world security-related bugs in the latest JavaScript engines, including 39 previously unknown vulnerabilities and 11 CVEs.  ( 2 min )
    Reformatted Alignment
    arXiv:2402.12219v1 Announce Type: cross Abstract: The quality of finetuning data is crucial for aligning large language models (LLMs) with human values. Current methods to improve data quality are either labor-intensive or prone to factual errors caused by LLM hallucinations. This paper explores elevating the quality of existing instruction data to better align with human values, introducing a simple and effective approach named ReAlign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence. This approach minimizes human annotation, hallucination, and the difficulty in scaling, remaining orthogonal to existing alignment techniques. Experimentally, ReAlign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the LLMs. Encouragingly, without introducing any additional data or advanced training techniques, and merely by reformatting the response, LLaMA-2-13B's mathematical reasoning ability on GSM8K can be improved from 46.77% to 56.63% in accuracy. Additionally, a mere 5% of ReAlign data yields a 67% boost in general alignment ability measured by the Alpaca dataset. This work highlights the need for further research into the science and mechanistic interpretability of LLMs. We have made the associated code and data publicly accessible to support future studies at https://github.com/GAIR-NLP/ReAlign.  ( 2 min )
    Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting
    arXiv:2402.12220v1 Announce Type: cross Abstract: Although motivated by the adaptation of text-to-speech synthesis models, we argue that more generic parameter-efficient fine-tuning (PEFT) is an appropriate framework to do such adaptation. However, catastrophic forgetting remains an issue with PEFT, damaging the pre-trained model's inherent capabilities. We demonstrate that existing Bayesian learning techniques can be applied to PEFT to prevent catastrophic forgetting as long as the parameter shift of the fine-tuned layers can be calculated differentiably. In a principled series of experiments on language modeling and speech synthesis tasks, we utilize established Laplace approximations, including diagonal and Kronecker factored approaches, to regularize PEFT with the low-rank adaptation (LoRA) and compare their performance in pre-training knowledge preservation. Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and using the Kronecker factored approximations produces a better preservation of the pre-training knowledge than the diagonal ones.  ( 2 min )
    Towards AI-Based Precision Oncology: A Machine Learning Framework for Personalized Counterfactual Treatment Suggestions based on Multi-Omics Data
    arXiv:2402.12190v1 Announce Type: cross Abstract: AI-driven precision oncology has the transformative potential to reshape cancer treatment by leveraging the power of AI models to analyze the interaction between complex patient characteristics and their corresponding treatment outcomes. New technological platforms have facilitated the timely acquisition of multimodal data on tumor biology at an unprecedented resolution, such as single-cell multi-omics data, making this quality and quantity of data available for data-driven improved clinical decision-making. In this work, we propose a modular machine learning framework designed for personalized counterfactual cancer treatment suggestions based on an ensemble of machine learning experts trained on diverse multi-omics technologies. These specialized counterfactual experts per technology are consistently aggregated into a more powerful expert with superior performance and can provide both confidence and an explanation of its decision. The framework is tailored to address critical challenges inherent in data-driven cancer research, including the high-dimensional nature of the data, and the presence of treatment assignment bias in the retrospective observational data. The framework is showcased through comprehensive demonstrations using data from in-vitro and in-vivo treatment responses from a cohort of patients with ovarian cancer. Our method aims to empower clinicians with a reality-centric decision-support tool including probabilistic treatment suggestions with calibrated confidence and personalized explanations for tailoring treatment strategies to multi-omics characteristics of individual cancer patients.  ( 3 min )
    Zero shot VLMs for hate meme detection: Are we there yet?
    arXiv:2402.12198v1 Announce Type: cross Abstract: Multimedia content on social media is rapidly evolving, with memes gaining prominence as a distinctive form. Unfortunately, some malicious users exploit memes to target individuals or vulnerable communities, making it imperative to identify and address such instances of hateful memes. Extensive research has been conducted to address this issue by developing hate meme detection models. However, a notable limitation of traditional machine/deep learning models is the requirement for labeled datasets for accurate classification. Recently, the research community has witnessed the emergence of several visual language models that have exhibited outstanding performance across various tasks. In this study, we aim to investigate the efficacy of these visual language models in handling intricate tasks such as hate meme detection. We use various prompt settings to focus on zero-shot classification of hateful/harmful memes. Through our analysis, we observe that large VLMs are still vulnerable for zero-shot hate meme detection.  ( 2 min )
    Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships
    arXiv:2402.12189v1 Announce Type: cross Abstract: Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.  ( 2 min )
    Adversarial Feature Alignment: Balancing Robustness and Accuracy in Deep Learning via Adversarial Training
    arXiv:2402.12187v1 Announce Type: cross Abstract: Deep learning models continue to advance in accuracy, yet they remain vulnerable to adversarial attacks, which often lead to the misclassification of adversarial examples. Adversarial training is used to mitigate this problem by increasing robustness against these attacks. However, this approach typically reduces a model's standard accuracy on clean, non-adversarial samples. The necessity for deep learning models to balance both robustness and accuracy for security is obvious, but achieving this balance remains challenging, and the underlying reasons are yet to be clarified. This paper proposes a novel adversarial training method called Adversarial Feature Alignment (AFA), to address these problems. Our research unveils an intriguing insight: misalignment within the feature space often leads to misclassification, regardless of whether the samples are benign or adversarial. AFA mitigates this risk by employing a novel optimization algorithm based on contrastive learning to alleviate potential feature misalignment. Through our evaluations, we demonstrate the superior performance of AFA. The baseline AFA delivers higher robust accuracy than previous adversarial contrastive learning methods while minimizing the drop in clean accuracy to 1.86% and 8.91% on CIFAR10 and CIFAR100, respectively, in comparison to cross-entropy. We also show that joint optimization of AFA and TRADES, accompanied by data augmentation using a recent diffusion model, achieves state-of-the-art accuracy and robustness.  ( 3 min )
    Molecule Generation and Optimization for Efficient Fragrance Creation
    arXiv:2402.12134v1 Announce Type: cross Abstract: This research introduces a Machine Learning-centric approach to replicate olfactory experiences, validated through experimental quantification of perfume perception. Key contributions encompass a hybrid model connecting perfume molecular structure to human olfactory perception. This model includes an AI-driven molecule generator (utilizing Graph and Generative Neural Networks), quantification and prediction of odor intensity, and refinery of optimal solvent and molecule combinations for desired fragrances. Additionally, a thermodynamic-based model establishes a link between olfactory perception and liquid-phase concentrations. The methodology employs Transfer Learning and selects the most suitable molecules based on vapor pressure and fragrance notes. Ultimately, a mathematical optimization problem is formulated to minimize discrepancies between new and target olfactory experiences. The methodology is validated by reproducing two distinct olfactory experiences using available experimental data.  ( 2 min )
    Meta Ranking: Less Capable Language Models are Capable for Single Response Judgement
    arXiv:2402.12146v1 Announce Type: cross Abstract: Although Large Language Models (LLMs) have demonstrated strong performance on a wide range of tasks, they still face reliability challenges such as hallucination. Previous studies reveal that highly capable LLMs like GPT-4 are effective in judging the reliability of individual responses, while less capable ones are often tuned to evaluate the relative reliability of responses to the same query. To enable less capable LLMs to effectively judge the reliability of individual responses, we propose a novel method named $\textit{Meta}$ $\textit{Ranking}$ (MR). Unlike previous methods, which assess the response directly, we achieve the judgement by comparing the target query-response pair with reference query-response pairs. We found its remarkable effectiveness in error detection for LLM responses on reasoning tasks, where less capable LLMs could outperform strong baselines, even without fine-tuning. We further demonstrate that MR can be used to enhance the performance of LLMs in two practical applications: query routing and iterative training data filtering. The former achieves GPT-4-turbo comparable performance with less than half the token consumption, while the latter makes the instruction-tuned LLaMA-7B and Phi-2, a 2.7B model, significantly surpass Alpaca-13B over fewer training samples, underscoring the high potential of our proposed method.  ( 2 min )
    Causal Equal Protection as Algorithmic Fairness
    arXiv:2402.12062v1 Announce Type: cross Abstract: Over the last ten years the literature in computer science and philosophy has formulated different criteria of algorithmic fairness. One of the most discussed, classification parity, requires that the erroneous classifications of a predictive algorithm occur with equal frequency for groups picked out by protected characteristics. Despite its intuitive appeal, classification parity has come under attack. Multiple scenarios can be imagined in which - intuitively - a predictive algorithm does not treat any individual unfairly, and yet classification parity is violated. To make progress, we turn to a related principle, equal protection, originally developed in the context of criminal justice. Key to equal protection is equalizing the risks of erroneous classifications (in a sense to be specified) as opposed to equalizing the rates of erroneous classifications. We show that equal protection avoids many of the counterexamples to classification parity, but also fails to model our moral intuitions in a number of common scenarios, for example, when the predictor is causally downstream relative to the protected characteristic. To address these difficulties, we defend a novel principle, causal equal protection, that models the fair allocation of the risks of erroneous classification through the lenses of causality.  ( 2 min )
    Robustness and Exploration of Variational and Machine Learning Approaches to Inverse Problems: An Overview
    arXiv:2402.12072v1 Announce Type: cross Abstract: This paper attempts to provide an overview of current approaches for solving inverse problems in imaging using variational methods and machine learning. A special focus lies on point estimators and their robustness against adversarial perturbations. In this context results of numerical experiments for a one-dimensional toy problem are provided, showing the robustness of different approaches and empirically verifying theoretical guarantees. Another focus of this review is the exploration of the subspace of data consistent solutions through explicit guidance to satisfy specific semantic or textural properties.  ( 2 min )
    When Do Off-Policy and On-Policy Policy Gradient Methods Align?
    arXiv:2402.12034v1 Announce Type: cross Abstract: Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods succeeded in many application domains, however, because of their notorious sample inefficiency their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify their objective function to be computable from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis showing conditions to reduce the on-off gap while establishing empirical evidence of shortfalls arising when these conditions are not met.  ( 2 min )
    Distilling Large Language Models for Text-Attributed Graph Learning
    arXiv:2402.12022v1 Announce Type: cross Abstract: Text-Attributed Graphs (TAGs) are graphs of connected textual documents. Graph models can efficiently learn TAGs, but their training heavily relies on human-annotated labels, which are scarce or even unavailable in many applications. Large language models (LLMs) have recently demonstrated remarkable capabilities in few-shot and zero-shot TAG learning, but they suffer from scalability, cost, and privacy issues. Therefore, in this work, we focus on synergizing LLMs and graph models with their complementary strengths by distilling the power of LLMs to a local graph model on TAG learning. To address the inherent gaps between LLMs (generative models for texts) and graph models (discriminative models for graphs), we propose first to let LLMs teach an interpreter with rich textual rationale and then let a student model mimic the interpreter's reasoning without LLMs' textual rationale. Extensive experiments validate the efficacy of our proposed framework.  ( 2 min )
    Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models
    arXiv:2402.11997v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly becoming ubiquitous, yet their ability to reason about and retain temporal information remains limited. This hinders their application in real-world scenarios where understanding the sequential nature of events is crucial. This paper experiments with state-of-the-art models on a novel, large-scale temporal dataset, \textbf{TempUN}, to reveal significant limitations in temporal retention and reasoning abilities. Interestingly, closed-source models indicate knowledge gaps more frequently, potentially suggesting a trade-off between uncertainty awareness and incorrect responses. Further, exploring various fine-tuning approaches yielded no major performance improvements. The associated dataset and code are available at the following URL (https://github.com/lingoiitgn/TempUN).  ( 2 min )
    An Index Policy Based on Sarsa and Q-learning for Heterogeneous Smart Target Tracking
    arXiv:2402.12015v1 Announce Type: cross Abstract: In solving the non-myopic radar scheduling for multiple smart target tracking within an active and passive radar network, we need to consider both short-term enhanced tracking performance and a higher probability of target maneuvering in the future with active tracking. Acquiring the long-term tracking performance while scheduling the beam resources of active and passive radars poses a challenge. To address this challenge, we model this problem as a Markov decision process consisting of parallel restless bandit processes. Each bandit process is associated with a smart target, of which the estimation state evolves according to different discrete dynamic models for different actions - whether or not the target is being tracked. The discrete state is defined by the dynamic mode. The problem exhibits the curse of dimensionality, where optimal solutions are in general intractable. We resort to heuristics through the famous restless multi-armed bandit techniques. It follows with efficient scheduling policies based on the indices that are real numbers representing the marginal rewards of taking different actions. For the inevitable practical case with unknown transition matrices, we propose a new method that utilizes the forward Sarsa and backward Q-learning to approximate the indices through adapting the state-action value functions, or equivalently the Q-functions, and propose a new policy, namely ISQ, aiming to maximize the long-term tracking rewards. Numerical results demonstrate that the proposed ISQ policy outperforms conventional Q-learning-based methods and rapidly converges to the well-known Whittle index policy with revealed state transition models, which is considered the benchmark.  ( 3 min )
    Weakly Supervised Object Detection in Chest X-Rays with Differentiable ROI Proposal Networks and Soft ROI Pooling
    arXiv:2402.11985v1 Announce Type: cross Abstract: Weakly supervised object detection (WSup-OD) increases the usefulness and interpretability of image classification algorithms without requiring additional supervision. The successes of multiple instance learning in this task for natural images, however, do not translate well to medical images due to the very different characteristics of their objects (i.e. pathologies). In this work, we propose Weakly Supervised ROI Proposal Networks (WSRPN), a new method for generating bounding box proposals on the fly using a specialized region of interest-attention (ROI-attention) module. WSRPN integrates well with classic backbone-head classification algorithms and is end-to-end trainable with only image-label supervision. We experimentally demonstrate that our new method outperforms existing methods in the challenging task of disease localization in chest X-ray images. Code: https://github.com/philip-mueller/wsrpn  ( 2 min )
    ISCUTE: Instance Segmentation of Cables Using Text Embedding
    arXiv:2402.11996v1 Announce Type: cross Abstract: In the field of robotics and automation, conventional object recognition and instance segmentation methods face a formidable challenge when it comes to perceiving Deformable Linear Objects (DLOs) like wires, cables, and flexible tubes. This challenge arises primarily from the lack of distinct attributes such as shape, color, and texture, which calls for tailored solutions to achieve precise identification. In this work, we propose a foundation model-based DLO instance segmentation technique that is text-promptable and user-friendly. Specifically, our approach combines the text-conditioned semantic segmentation capabilities of CLIPSeg model with the zero-shot generalization capabilities of Segment Anything Model (SAM). We show that our method exceeds SOTA performance on DLO instance segmentation, achieving a mIoU of $91.21\%$. We also introduce a rich and diverse DLO-specific dataset for instance segmentation.  ( 2 min )
    Hebbian Learning based Orthogonal Projection for Continual Learning of Spiking Neural Networks
    arXiv:2402.11984v1 Announce Type: cross Abstract: Neuromorphic computing with spiking neural networks is promising for energy-efficient artificial intelligence (AI) applications. However, different from humans who continually learn different tasks in a lifetime, neural network models suffer from catastrophic forgetting. How could neuronal operations solve this problem is an important question for AI and neuroscience. Many previous studies draw inspiration from observed neuroscience phenomena and propose episodic replay or synaptic metaplasticity, but they are not guaranteed to explicitly preserve knowledge for neuron populations. Other works focus on machine learning methods with more mathematical grounding, e.g., orthogonal projection on high dimensional spaces, but there is no neural correspondence for neuromorphic computing. In this work, we develop a new method with neuronal operations based on lateral connections and Hebbian learning, which can protect knowledge by projecting activity traces of neurons into an orthogonal subspace so that synaptic weight update will not interfere with old tasks. We show that Hebbian and anti-Hebbian learning on recurrent lateral connections can effectively extract the principal subspace of neural activities and enable orthogonal projection. This provides new insights into how neural circuits and Hebbian learning can help continual learning, and also how the concept of orthogonal projection can be realized in neuronal systems. Our method is also flexible to utilize arbitrary training methods based on presynaptic activities/traces. Experiments show that our method consistently solves forgetting for spiking neural networks with nearly zero forgetting under various supervised training methods with different error propagation approaches, and outperforms previous approaches under various settings. Our method can pave a solid path for building continual neuromorphic computing systems.  ( 3 min )
    Stealing the Invisible: Unveiling Pre-Trained CNN Models through Adversarial Examples and Timing Side-Channels
    arXiv:2402.11953v1 Announce Type: cross Abstract: Machine learning, with its myriad applications, has become an integral component of numerous technological systems. A common practice in this domain is the use of transfer learning, where a pre-trained model's architecture, readily available to the public, is fine-tuned to suit specific tasks. As Machine Learning as a Service (MLaaS) platforms increasingly use pre-trained models in their backends, it's crucial to safeguard these architectures and understand their vulnerabilities. In this work, we present an approach based on the observation that the classification patterns of adversarial images can be used as a means to steal the models. Furthermore, the adversarial image classifications in conjunction with timing side channels can lead to a model stealing method. Our approach, designed for typical user-level access in remote MLaaS environments exploits varying misclassifications of adversarial images across different models to fingerprint several renowned Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures. We utilize the profiling of remote model inference times to reduce the necessary adversarial images, subsequently decreasing the number of queries required. We have presented our results over 27 pre-trained models of different CNN and ViT architectures using CIFAR-10 dataset and demonstrate a high accuracy of 88.8% while keeping the query budget under 20.  ( 2 min )
    AICAttack: Adversarial Image Captioning Attack with Attention-Based Optimization
    arXiv:2402.11940v1 Announce Type: cross Abstract: Recent advances in deep learning research have shown remarkable achievements across many tasks in computer vision (CV) and natural language processing (NLP). At the intersection of CV and NLP is the problem of image captioning, where the related models' robustness against adversarial attacks has not been well studied. In this paper, we present a novel adversarial attack strategy, which we call AICAttack (Attention-based Image Captioning Attack), designed to attack image captioning models through subtle perturbations on images. Operating within a black-box attack scenario, our algorithm requires no access to the target model's architecture, parameters, or gradient information. We introduce an attention-based candidate selection mechanism that identifies the optimal pixels to attack, followed by Differential Evolution (DE) for perturbing pixels' RGB values. We demonstrate AICAttack's effectiveness through extensive experiments on benchmark datasets with multiple victim models. The experimental results demonstrate that our method surpasses current leading-edge techniques by effectively distributing the alignment and semantics of words in the output.  ( 2 min )
    A novel molecule generative model of VAE combined with Transformer
    arXiv:2402.11950v1 Announce Type: cross Abstract: Recently, molecule generation using deep learning has been actively investigated in drug discovery. In this field, Transformer and VAE are widely used as powerful models, but they are rarely used in combination due to structural and performance mismatch of them. This study proposes a model that combines these two models through structural and parameter optimization in handling diverse molecules. The proposed model shows comparable performance to existing models in generating molecules, and showed by far superior performance in generating molecules with unseen structures. In addition, the proposed model successfully predicted molecular properties using the latent representation of VAE. Ablation studies suggested the advantage of VAE over other generative models like language model in generating novel molecules, and that the molecules can be described by ~32 dimensional variables, much smaller than existing descriptors and models. This study is expected to provide a virtual chemical library containing a wide variety of compounds for virtual screening and to enable efficient screening.  ( 2 min )
    Scalable Virtual Valuations Combinatorial Auction Design by Combining Zeroth-Order and First-Order Optimization Method
    arXiv:2402.11904v1 Announce Type: cross Abstract: Automated auction design seeks to discover empirically high-revenue and incentive-compatible mechanisms using machine learning. Ensuring dominant strategy incentive compatibility (DSIC) is crucial, and the most effective approach is to confine the mechanism to Affine Maximizer Auctions (AMAs). Nevertheless, existing AMA-based approaches encounter challenges such as scalability issues (arising from combinatorial candidate allocations) and the non-differentiability of revenue. In this paper, to achieve a scalable AMA-based method, we further restrict the auction mechanism to Virtual Valuations Combinatorial Auctions (VVCAs), a subset of AMAs with significantly fewer parameters. Initially, we employ a parallelizable dynamic programming algorithm to compute the winning allocation of a VVCA. Subsequently, we propose a novel optimization method that combines both zeroth-order and first-order techniques to optimize the VVCA parameters. Extensive experiments demonstrate the efficacy and scalability of our proposed approach, termed Zeroth-order and First-order Optimization of VVCAs (ZFO-VVCA), particularly when applied to large-scale auctions.  ( 2 min )
    Stochastic Hessian Fitting on Lie Group
    arXiv:2402.11858v1 Announce Type: cross Abstract: This paper studies the fitting of Hessian or its inverse with stochastic Hessian-vector products. A Hessian fitting criterion, which can be used to derive most of the commonly used methods, e.g., BFGS, Gaussian-Newton, AdaGrad, etc., is used for the analysis. Our studies reveal different convergence rates for different Hessian fitting methods, e.g., sublinear rates for gradient descent in the Euclidean space and a commonly used closed-form solution, linear rates for gradient descent on the manifold of symmetric positive definite (SPL) matrices and certain Lie groups. The Hessian fitting problem is further shown to be strongly convex under mild conditions on a specific yet general enough Lie group. To confirm our analysis, these methods are tested under different settings like noisy Hessian-vector products, time varying Hessians, and low precision arithmetic. These findings are useful for stochastic second order optimizations that rely on fast, robust and accurate Hessian estimations.  ( 2 min )
    An enhanced Teaching-Learning-Based Optimization (TLBO) with Grey Wolf Optimizer (GWO) for text feature selection and clustering
    arXiv:2402.11839v1 Announce Type: cross Abstract: Text document clustering can play a vital role in organizing and handling the everincreasing number of text documents. Uninformative and redundant features included in large text documents reduce the effectiveness of the clustering algorithm. Feature selection (FS) is a well-known technique for removing these features. Since FS can be formulated as an optimization problem, various meta-heuristic algorithms have been employed to solve it. Teaching-Learning-Based Optimization (TLBO) is a novel meta-heuristic algorithm that benefits from the low number of parameters and fast convergence. A hybrid method can simultaneously benefit from the advantages of TLBO and tackle the possible entrapment in the local optimum. By proposing a hybrid of TLBO, Grey Wolf Optimizer (GWO), and Genetic Algorithm (GA) operators, this paper suggests a filter-based FS algorithm (TLBO-GWO). Six benchmark datasets are selected, and TLBO-GWO is compared with three recently proposed FS algorithms with similar approaches, the main TLBO and GWO. The comparison is conducted based on clustering evaluation measures, convergence behavior, and dimension reduction, and is validated using statistical tests. The results reveal that TLBO-GWO can significantly enhance the effectiveness of the text clustering technique (K-means).  ( 2 min )
    Avoiding Feature Suppression in Contrastive Learning: Learning What Has Not Been Learned Before
    arXiv:2402.11816v1 Announce Type: cross Abstract: Self-Supervised contrastive learning has emerged as a powerful method for obtaining high-quality representations from unlabeled data. However, feature suppression has recently been identified in standard contrastive learning ($e.g.$, SimCLR, CLIP): in a single end-to-end training stage, the contrastive model captures only parts of the shared information across contrasting views, while ignore the other potentially useful information. With feature suppression, contrastive models often fail to learn sufficient representations capable for various downstream tasks. To mitigate the feature suppression problem and ensure the contrastive model to learn comprehensive representations, we develop a novel Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning that often result in feature suppression, MCL progressively learn new features that have not been explored in the previous stage, while maintaining the well-learned features. Extensive experiments conducted on various publicly available benchmarks validate the effectiveness of our proposed framework. In addition, we demonstrate that the proposed MCL can be adapted to a variety of popular contrastive learning backbones and boost their performance by learning features that could not be gained from standard contrastive learning procedures.  ( 2 min )
    Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
    arXiv:2402.11809v1 Announce Type: cross Abstract: This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose \textbf{S}mart \textbf{P}arallel \textbf{A}uto-\textbf{C}orrect d\textbf{E}coding (SPACE), an innovative approach designed for achieving lossless acceleration of LLMs. By integrating semi-autoregressive inference and speculative decoding capabilities, SPACE uniquely enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to simultaneously predict multiple tokens. Additionally, an auto-correct decoding algorithm facilitates the simultaneous generation and verification of token sequences within a single model invocation. Through extensive experiments on a range of LLMs, SPACE has demonstrated inference speedup ranging from 2.7x-4.0x on HumanEval-X while maintaining output quality.  ( 2 min )
    HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?
    arXiv:2402.11815v1 Announce Type: cross Abstract: This paper describes our system developed for SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection." Machine-generated texts have been one of the main concerns due to the use of large language models (LLM) in fake text generation, phishing, cheating in exams, or even plagiarizing copyright materials. A lot of systems have been developed to detect machine-generated text. Nonetheless, the majority of these systems rely on the text-generating model, a limitation that is impractical in real-world scenarios, as it's often impossible to know which specific model the user has used for text generation. In this work, we propose a single model based on contrastive learning, which uses ~40% of the baseline's parameters (149M vs. 355M) but shows a comparable performance on the test dataset (21st out of 137 participants). Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning.  ( 2 min )
    What Evidence Do Language Models Find Convincing?
    arXiv:2402.11782v1 Announce Type: cross Abstract: Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.  ( 2 min )
    Statistical Test for Generated Hypotheses by Diffusion Models
    arXiv:2402.11789v1 Announce Type: cross Abstract: The enhanced performance of AI has accelerated its integration into scientific research. In particular, the use of generative AI to create scientific hypotheses is promising and is increasingly being applied across various fields. However, when employing AI-generated hypotheses for critical decisions, such as medical diagnoses, verifying their reliability is crucial. In this study, we consider a medical diagnostic task using generated images by diffusion models, and propose a statistical test to quantify its reliability. The basic idea behind the proposed statistical test is to employ a selective inference framework, where we consider a statistical test conditional on the fact that the generated images are produced by a trained diffusion model. Using the proposed method, the statistical reliability of medical image diagnostic results can be quantified in the form of a p-value, allowing for decision-making with a controlled error rate. We show the theoretical validity of the proposed statistical test and its effectiveness through numerical experiments on synthetic and brain image datasets.  ( 2 min )
    FOD-Swin-Net: angular super resolution of fiber orientation distribution using a transformer-based deep model
    arXiv:2402.11775v1 Announce Type: cross Abstract: Identifying and characterizing brain fiber bundles can help to understand many diseases and conditions. An important step in this process is the estimation of fiber orientations using Diffusion-Weighted Magnetic Resonance Imaging (DW-MRI). However, obtaining robust orientation estimates demands high-resolution data, leading to lengthy acquisitions that are not always clinically available. In this work, we explore the use of automated angular super resolution from faster acquisitions to overcome this challenge. Using the publicly available Human Connectome Project (HCP) DW-MRI data, we trained a transformer-based deep learning architecture to achieve angular super resolution in fiber orientation distribution (FOD). Our patch-based methodology, FOD-Swin-Net, is able to bring a single-shell reconstruction driven from 32 directions to be comparable to a multi-shell 288 direction FOD reconstruction, greatly reducing the number of required directions on initial acquisition. Evaluations of the reconstructed FOD with Angular Correlation Coefficient and qualitative visualizations reveal superior performance than the state-of-the-art in HCP testing data. Open source code for reproducibility is available at https://github.com/MICLab-Unicamp/FOD-Swin-Net.  ( 2 min )
    Uncovering Latent Human Wellbeing in Language Model Embeddings
    arXiv:2402.11777v1 Announce Type: cross Abstract: Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding about human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.  ( 2 min )
    Parameter Efficient Finetuning for Speech Emotion Recognition and Domain Adaptation
    arXiv:2402.11747v1 Announce Type: cross Abstract: Foundation models have shown superior performance for speech emotion recognition (SER). However, given the limited data in emotion corpora, finetuning all parameters of large pre-trained models for SER can be both resource-intensive and susceptible to overfitting. This paper investigates parameter-efficient finetuning (PEFT) for SER. Various PEFT adaptors are systematically studied for both classification of discrete emotion categories and prediction of dimensional emotional attributes. The results demonstrate that the combination of PEFT methods surpasses full finetuning with a significant reduction in the number of trainable parameters. Furthermore, a two-stage adaptation strategy is proposed to adapt models trained on acted emotion data, which is more readily available, to make the model more adept at capturing natural emotional expressions. Both intra- and cross-corpus experiments validate the efficacy of the proposed approach in enhancing the performance on both the source and target domains.  ( 2 min )
    MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
    arXiv:2402.11756v1 Announce Type: cross Abstract: Generative Large Language Models (LLMs) are widely utilized for their excellence in various tasks. However, their tendency to produce inaccurate or misleading outputs poses a potential risk, particularly in high-stakes environments. Therefore, estimating the correctness of generative LLM outputs is an important task for enhanced reliability. Uncertainty Estimation (UE) in generative LLMs is an evolving domain, where SOTA probability-based methods commonly employ length-normalized scoring. In this work, we propose Meaning-Aware Response Scoring (MARS) as an alternative to length-normalized scoring for UE methods. MARS is a novel scoring function that considers the semantic contribution of each token in the generated sequence in the context of the question. We demonstrate that integrating MARS into UE methods results in a universal and significant improvement in UE performance. We conduct experiments using three distinct closed-book question-answering datasets across five popular pre-trained LLMs. Lastly, we validate the efficacy of MARS on a Medical QA dataset. Code can be found https://anonymous.4open.science/r/LLM_Uncertainity-309B.  ( 2 min )
    Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis
    arXiv:2402.11728v1 Announce Type: cross Abstract: In this paper, we investigate the influence of claims in analyst reports and earnings calls on financial market returns, considering them as significant quarterly events for publicly traded companies. To facilitate a comprehensive analysis, we construct a new financial dataset for the claim detection task in the financial domain. We benchmark various language models on this dataset and propose a novel weak-supervision model that incorporates the knowledge of subject matter experts (SMEs) in the aggregation function, outperforming existing approaches. Furthermore, we demonstrate the practical utility of our proposed model by constructing a novel measure ``optimism". Furthermore, we observed the dependence of earnings surprise and return on our optimism measure. Our dataset, models, and code will be made publicly (under CC BY 4.0 license) available on GitHub and Hugging Face.  ( 2 min )
    A Transition System Abstraction Framework for Neural Network Dynamical System Models
    arXiv:2402.11739v1 Announce Type: cross Abstract: This paper proposes a transition system abstraction framework for neural network dynamical system models to enhance the model interpretability, with applications to complex dynamical systems such as human behavior learning and verification. To begin with, the localized working zone will be segmented into multiple localized partitions under the data-driven Maximum Entropy (ME) partitioning method. Then, the transition matrix will be obtained based on the set-valued reachability analysis of neural networks. Finally, applications to human handwriting dynamics learning and verification are given to validate our proposed abstraction framework, which demonstrates the advantages of enhancing the interpretability of the black-box model, i.e., our proposed framework is able to abstract a data-driven neural network model into a transition system, making the neural network model interpretable through verifying specifications described in Computational Tree Logic (CTL) languages.  ( 2 min )
    Learning Memory Kernels in Generalized Langevin Equations
    arXiv:2402.11705v1 Announce Type: cross Abstract: We introduce a novel approach for learning memory kernels in Generalized Langevin Equations. This approach initially utilizes a regularized Prony method to estimate correlation functions from trajectory data, followed by regression over a Sobolev norm-based loss function with RKHS regularization. Our approach guarantees improved performance within an exponentially weighted $L^2$ space, with the kernel estimation error controlled by the error in estimated correlation functions. We demonstrate the superiority of our estimator compared to other regression estimators that rely on $L^2$ loss functions and also an estimator derived from the inverse Laplace transform, using numerical examples that highlight its consistent advantage across various weight parameter selections. Additionally, we provide examples that include the application of force and drift terms in the equation.  ( 2 min )
    Can ChatGPT Support Developers? An Empirical Evaluation of Large Language Models for Code Generation
    arXiv:2402.11702v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated notable proficiency in code generation, with numerous prior studies showing their promising capabilities in various development scenarios. However, these studies mainly provide evaluations in research settings, which leaves a significant gap in understanding how effectively LLMs can support developers in real-world. To address this, we conducted an empirical analysis of conversations in DevGPT, a dataset collected from developers' conversations with ChatGPT (captured with the Share Link feature on platforms such as GitHub). Our empirical findings indicate that the current practice of using LLM-generated code is typically limited to either demonstrating high-level concepts or providing examples in documentation, rather than to be used as production-ready code. These findings indicate that there is much future work needed to improve LLMs in code generation before they can be integral parts of modern software development.  ( 2 min )
    Evaluating Efficacy of Model Stealing Attacks and Defenses on Quantum Neural Networks
    arXiv:2402.11687v1 Announce Type: cross Abstract: Cloud hosting of quantum machine learning (QML) models exposes them to a range of vulnerabilities, the most significant of which is the model stealing attack. In this study, we assess the efficacy of such attacks in the realm of quantum computing. We conducted comprehensive experiments on various datasets with multiple QML model architectures. Our findings revealed that model stealing attacks can produce clone models achieving up to $0.9\times$ and $0.99\times$ clone test accuracy when trained using Top-$1$ and Top-$k$ labels, respectively ($k:$ num\_classes). To defend against these attacks, we leverage the unique properties of current noisy hardware and perturb the victim model outputs and hinder the attacker's training process. In particular, we propose: 1) hardware variation-induced perturbation (HVIP) and 2) hardware and architecture variation-induced perturbation (HAVIP). Although noise and architectural variability can provide up to $\sim16\%$ output obfuscation, our comprehensive analysis revealed that models cloned under noisy conditions tend to be resilient, suffering little to no performance degradation due to such obfuscations. Despite limited success with our defense techniques, this outcome has led to an important discovery: QML models trained on noisy hardwares are naturally resistant to perturbation or obfuscation-based defenses or attacks.  ( 2 min )
    Explaining the Machine Learning Solution of the Ising Model
    arXiv:2402.11701v1 Announce Type: cross Abstract: As powerful as machine learning (ML) techniques are in solving problems involving data with large dimensionality, explaining the results from the fitted parameters remains a challenging task of utmost importance, especially in physics applications. Here it is shown how this can be accomplished for the ferromagnetic Ising model, the target of many ML studies in the last years. By using a neural network (NN) without any hidden layers and the symmetry of the Hamiltonian to find the critical temperature for the continuous phase transition of the model, an explanation of its strategy is found. This allows the prediction of the minimal extension of the NN to solve the problem when the symmetry is not known, which is also explainable.  ( 2 min )
    Challenging the Black Box: A Comprehensive Evaluation of Attribution Maps of CNN Applications in Agriculture and Forestry
    arXiv:2402.11670v1 Announce Type: cross Abstract: In this study, we explore the explainability of neural networks in agriculture and forestry, specifically in fertilizer treatment classification and wood identification. The opaque nature of these models, often considered 'black boxes', is addressed through an extensive evaluation of state-of-the-art Attribution Maps (AMs), also known as class activation maps (CAMs) or saliency maps. Our comprehensive qualitative and quantitative analysis of these AMs uncovers critical practical limitations. Findings reveal that AMs frequently fail to consistently highlight crucial features and often misalign with the features considered important by domain experts. These discrepancies raise substantial questions about the utility of AMs in understanding the decision-making process of neural networks. Our study provides critical insights into the trustworthiness and practicality of AMs within the agriculture and forestry sectors, thus facilitating a better understanding of neural networks in these application areas.  ( 2 min )
    A Fast Algorithm to Simulate Nonlinear Resistive Networks
    arXiv:2402.11674v1 Announce Type: cross Abstract: In the quest for energy-efficient artificial intelligence systems, resistor networks are attracting interest as an alternative to conventional GPU-based neural networks. These networks leverage the physics of electrical circuits for inference and can be optimized with local training techniques such as equilibrium propagation. Despite their potential advantage in terms of power consumption, the challenge of efficiently simulating these resistor networks has been a significant bottleneck to assess their scalability, with current methods either being limited to linear networks or relying on realistic, yet slow circuit simulators like SPICE. Assuming ideal circuit elements, we introduce a novel approach for the simulation of nonlinear resistive networks, which we frame as a quadratic programming problem with linear inequality constraints, and which we solve using a fast, exact coordinate descent algorithm. Our simulation methodology significantly outperforms existing SPICE-based simulations, enabling the training of networks up to 325 times larger at speeds 150 times faster, resulting in a 50,000-fold improvement in the ratio of network size to epoch duration. Our approach, adaptable to other electrical components, can foster more rapid progress in the simulations of nonlinear electrical networks.  ( 2 min )
    Integrating Pre-Trained Language Model with Physical Layer Communications
    arXiv:2402.11656v1 Announce Type: cross Abstract: The burgeoning field of on-device AI communication, where devices exchange information directly through embedded foundation models, such as language models (LMs), requires robust, efficient, and generalizable communication frameworks. However, integrating these frameworks with existing wireless systems and effectively managing noise and bit errors pose significant challenges. In this work, we introduce a practical on-device AI communication framework, integrated with physical layer (PHY) communication functions, demonstrated through its performance on a link-level simulator. Our framework incorporates end-to-end training with channel noise to enhance resilience, incorporates vector quantized variational autoencoders (VQ-VAE) for efficient and robust communication, and utilizes pre-trained encoder-decoder transformers for improved generalization capabilities. Simulations, across various communication scenarios, reveal that our framework achieves a 50% reduction in transmission size while demonstrating substantial generalization ability and noise robustness under standardized 3GPP channel models.  ( 2 min )
    Dynamic planning in hierarchical active inference
    arXiv:2402.11658v1 Announce Type: cross Abstract: By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behavior could be explained in terms of an active inferential process -- either as discrete decision-making or continuous motor control -- inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on how to effectively plan actions in changing environments. Setting ourselves the goal of modeling tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological goal-directed behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples for each section. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.  ( 2 min )
    Doubly Robust Inference in Causal Latent Factor Models
    arXiv:2402.11652v1 Announce Type: cross Abstract: This article introduces a new framework for estimating average treatment effects under unobserved confounding in modern data-rich environments featuring large numbers of units and outcomes. The proposed estimator is doubly robust, combining outcome imputation, inverse probability weighting, and a novel cross-fitting procedure for matrix completion. We derive finite-sample and asymptotic guarantees, and show that the error of the new estimator converges to a mean-zero Gaussian distribution at a parametric rate. Simulation results demonstrate the practical relevance of the formal properties of the estimators analyzed in this article.  ( 2 min )
    Model-Free $\mu$-Synthesis: A Nonsmooth Optimization Perspective
    arXiv:2402.11654v1 Announce Type: cross Abstract: In this paper, we revisit model-free policy search on an important robust control benchmark, namely $\mu$-synthesis. In the general output-feedback setting, there do not exist convex formulations for this problem, and hence global optimality guarantees are not expected. Apkarian (2011) presented a nonconvex nonsmooth policy optimization approach for this problem, and achieved state-of-the-art design results via using subgradient-based policy search algorithms which generate update directions in a model-based manner. Despite the lack of convexity and global optimality guarantees, these subgradient-based policy search methods have led to impressive numerical results in practice. Built upon such a policy optimization persepctive, our paper extends these subgradient-based search methods to a model-free setting. Specifically, we examine the effectiveness of two model-free policy optimization strategies: the model-free non-derivative sampling method and the zeroth-order policy search with uniform smoothing. We performed an extensive numerical study to demonstrate that both methods consistently replicate the design outcomes achieved by their model-based counterparts. Additionally, we provide some theoretical justifications showing that convergence guarantees to stationary points can be established for our model-free $\mu$-synthesis under some assumptions related to the coerciveness of the cost function. Overall, our results demonstrate that derivative-free policy optimization offers a competitive and viable approach for solving general output-feedback $\mu$-synthesis problems in the model-free setting.  ( 2 min )
    Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models
    arXiv:2402.11622v1 Announce Type: cross Abstract: Object hallucination has been an Achilles' heel which hinders the broader applications of large vision-language models (LVLMs). Object hallucination refers to the phenomenon that the LVLMs claim non-existent objects in the image. To mitigate the object hallucinations, instruction tuning and external model-based detection methods have been proposed, which either require large-scare computational resources or depend on the detection result of external models. However, there remains an under-explored field to utilize the LVLM itself to alleviate object hallucinations. In this work, we adopt the intuition that the LVLM tends to respond logically consistently for existent objects but inconsistently for hallucinated objects. Therefore, we propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. In specific, we devise logical consistency probing to raise questions with logical correlations, inquiring about attributes from objects and vice versa. Whether their responses can form a logical closed loop serves as an indicator of object hallucination. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs. Comprehensive experiments conducted on three benchmarks across four LVLMs have demonstrated significant improvements brought by our method, indicating its effectiveness and generality.  ( 2 min )
    Poisoning Federated Recommender Systems with Fake Users
    arXiv:2402.11637v1 Announce Type: cross Abstract: Federated recommendation is a prominent use case within federated learning, yet it remains susceptible to various attacks, from user to server-side vulnerabilities. Poisoning attacks are particularly notable among user-side attacks, as participants upload malicious model updates to deceive the global model, often intending to promote or demote specific targeted items. This study investigates strategies for executing promotion attacks in federated recommender systems. Current poisoning attacks on federated recommender systems often rely on additional information, such as the local training data of genuine users or item popularity. However, such information is challenging for the potential attacker to obtain. Thus, there is a need to develop an attack that requires no extra information apart from item embeddings obtained from the server. In this paper, we introduce a novel fake user based poisoning attack named PoisonFRS to promote the attacker-chosen targeted item in federated recommender systems without requiring knowledge about user-item rating data, user attributes, or the aggregation rule used by the server. Extensive experiments on multiple real-world datasets demonstrate that PoisonFRS can effectively promote the attacker-chosen targeted item to a large portion of genuine users and outperform current benchmarks that rely on additional information about the system. We further observe that the model updates from both genuine and fake users are indistinguishable within the latent space.  ( 2 min )
    PolypNextLSTM: A lightweight and fast polyp video segmentation network using ConvNext and ConvLSTM
    arXiv:2402.11585v1 Announce Type: cross Abstract: Commonly employed in polyp segmentation, single image UNet architectures lack the temporal insight clinicians gain from video data in diagnosing polyps. To mirror clinical practices more faithfully, our proposed solution, PolypNextLSTM, leverages video-based deep learning, harnessing temporal information for superior segmentation performance with the least parameter overhead, making it possibly suitable for edge devices. PolypNextLSTM employs a UNet-like structure with ConvNext-Tiny as its backbone, strategically omitting the last two layers to reduce parameter overhead. Our temporal fusion module, a Convolutional Long Short Term Memory (ConvLSTM), effectively exploits temporal features. Our primary novelty lies in PolypNextLSTM, which stands out as the leanest in parameters and the fastest model, surpassing the performance of five state-of-the-art image and video-based deep learning models. The evaluation of the SUN-SEG dataset spans easy-to-detect and hard-to-detect polyp scenarios, along with videos containing challenging artefacts like fast motion and occlusion.  ( 2 min )
    Empirical Density Estimation based on Spline Quasi-Interpolation with applications to Copulas clustering modeling
    arXiv:2402.11552v1 Announce Type: cross Abstract: Density estimation is a fundamental technique employed in various fields to model and to understand the underlying distribution of data. The primary objective of density estimation is to estimate the probability density function of a random variable. This process is particularly valuable when dealing with univariate or multivariate data and is essential for tasks such as clustering, anomaly detection, and generative modeling. In this paper we propose the mono-variate approximation of the density using spline quasi interpolation and we applied it in the context of clustering modeling. The clustering technique used is based on the construction of suitable multivariate distributions which rely on the estimation of the monovariate empirical densities (marginals). Such an approximation is achieved by using the proposed spline quasi-interpolation, while the joint distributions to model the sought clustering partition is constructed with the use of copulas functions. In particular, since copulas can capture the dependence between the features of the data independently from the marginal distributions, a finite mixture copula model is proposed. The presented algorithm is validated on artificial and real datasets.  ( 2 min )
    Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution
    arXiv:2402.11525v1 Announce Type: cross Abstract: Faithfulness, expressiveness, and elegance is the constant pursuit in machine translation. However, traditional metrics like \textit{BLEU} do not strictly align with human preference of translation quality. In this paper, we explore leveraging reinforcement learning with human feedback (\textit{RLHF}) to improve translation quality. It is non-trivial to collect a large high-quality dataset of human comparisons between translations, especially for low-resource languages. To address this issue, we propose a cost-effective preference learning strategy, optimizing reward models by distinguishing between human and machine translations. In this manner, the reward model learns the deficiencies of machine translation compared to human and guides subsequent improvements in machine translation. Experimental results demonstrate that \textit{RLHF} can effectively enhance translation quality and this improvement benefits other translation directions not trained with \textit{RLHF}. Further analysis indicates that the model's language capabilities play a crucial role in preference learning. A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality and align better with real human translation preferences.  ( 2 min )
    PASCL: Supervised Contrastive Learning with Perturbative Augmentation for Particle Decay Reconstruction
    arXiv:2402.11538v1 Announce Type: cross Abstract: In high-energy physics, particles produced in collision events decay in a format of a hierarchical tree structure, where only the final decay products can be observed using detectors. However, the large combinatorial space of possible tree structures makes it challenging to recover the actual decay process given a set of final particles. To better analyse the hierarchical tree structure, we propose a graph-based deep learning model to infer the tree structure to reconstruct collision events. In particular, we use a compact matrix representation termed as lowest common ancestor generations (LCAG) matrix, to encode the particle decay tree structure. Then, we introduce a perturbative augmentation technique applied to node features, aiming to mimic experimental uncertainties and increase data diversity. We further propose a supervised graph contrastive learning algorithm to utilize the information of inter-particle relations from multiple decay processes. Extensive experiments show that our proposed supervised graph contrastive learning with perturbative augmentation (PASCL) method outperforms state-of-the-art baseline models on an existing physics-based dataset, significantly improving the reconstruction accuracy. This method provides a more effective training strategy for models with the same parameters and makes way for more accurate and efficient high-energy particle physics data analysis.  ( 2 min )
    LEIA: Facilitating Cross-Lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation
    arXiv:2402.11485v1 Announce Type: cross Abstract: Adapting English-based large language models (LLMs) to other languages has become increasingly popular due to the efficiency and potential of cross-lingual transfer. However, existing language adaptation methods often overlook the benefits of cross-lingual supervision. In this study, we introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages. This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling. We assess LEIA on diverse question answering datasets using 7B-parameter LLMs, demonstrating significant performance gains across various non-English languages. The source code is available at https://github.com/studio-ousia/leia.  ( 2 min )
    URLBERT:A Contrastive and Adversarial Pre-trained Model for URL Classification
    arXiv:2402.11495v1 Announce Type: cross Abstract: URLs play a crucial role in understanding and categorizing web content, particularly in tasks related to security control and online recommendations. While pre-trained models are currently dominating various fields, the domain of URL analysis still lacks specialized pre-trained models. To address this gap, this paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification or detection tasks. We first train a URL tokenizer on a corpus of billions of URLs to address URL data tokenization. Additionally, we propose two novel pre-training tasks: (1) self-supervised contrastive learning tasks, which strengthen the model's understanding of URL structure and the capture of category differences by distinguishing different variants of the same URL; (2) virtual adversarial training, aimed at improving the model's robustness in extracting semantic features from URLs. Finally, our proposed methods are evaluated on tasks including phishing URL detection, web page classification, and ad filtering, achieving state-of-the-art performance. Importantly, we also explore multi-task learning with URLBERT, and experimental results demonstrate that multi-task learning model based on URLBERT exhibit equivalent effectiveness compared to independently fine-tuned models, showing the simplicity of URLBERT in handling complex task requirements. The code for our work is available at https://github.com/Davidup1/URLBERT.  ( 2 min )
    DDIPrompt: Drug-Drug Interaction Event Prediction based on Graph Prompt Learning
    arXiv:2402.11472v1 Announce Type: cross Abstract: Recently, Graph Neural Networks have become increasingly prevalent in predicting adverse drug-drug interactions (DDI) due to their proficiency in modeling the intricate associations between atoms and functional groups within and across drug molecules. However, they are still hindered by two significant challenges: (1) the issue of highly imbalanced event distribution, which is a common but critical problem in medical datasets where certain interactions are vastly underrepresented. This imbalance poses a substantial barrier to achieving accurate and reliable DDI predictions. (2) the scarcity of labeled data for rare events, which is a pervasive issue in the medical field where rare yet potentially critical interactions are often overlooked or under-studied due to limited available data. In response, we offer DDIPrompt, an innovative panacea inspired by the recent advancements in graph prompting. Our framework aims to address these issues by leveraging the intrinsic knowledge from pre-trained models, which can be efficiently deployed with minimal downstream data. Specifically, to solve the first challenge, DDIPrompt employs augmented links between drugs, considering both structural and interactive proximity. It features a hierarchical pre-training strategy that comprehends intra-molecular structures and inter-molecular interactions, fostering a comprehensive and unbiased understanding of drug properties. For the second challenge, we implement a prototype-enhanced prompting mechanism during inference. This mechanism, refined by few-shot examples from each category, effectively harnesses the rich pre-training knowledge to enhance prediction accuracy, particularly for these rare but crucial interactions. Comprehensive evaluations on two benchmark datasets demonstrate the superiority of DDIPrompt, particularly in predicting rare DDI events.  ( 3 min )
    Re-Dock: Towards Flexible and Realistic Molecular Docking with Diffusion Bridge
    arXiv:2402.11459v1 Announce Type: cross Abstract: Accurate prediction of protein-ligand binding structures, a task known as molecular docking is crucial for drug design but remains challenging. While deep learning has shown promise, existing methods often depend on holo-protein structures (docked, and not accessible in realistic tasks) or neglect pocket sidechain conformations, leading to limited practical utility and unrealistic conformation predictions. To fill these gaps, we introduce an under-explored task, named flexible docking to predict poses of ligand and pocket sidechains simultaneously and introduce Re-Dock, a novel diffusion bridge generative model extended to geometric manifolds. Specifically, we propose energy-to-geometry mapping inspired by the Newton-Euler equation to co-model the binding energy and conformations for reflecting the energy-constrained docking generative process. Comprehensive experiments on designed benchmark datasets including apo-dock and cross-dock demonstrate our model's superior effectiveness and efficiency over current methods.  ( 2 min )
    Online Local False Discovery Rate Control: A Resource Allocation Approach
    arXiv:2402.11425v1 Announce Type: cross Abstract: We consider the problem of online local false discovery rate (FDR) control where multiple tests are conducted sequentially, with the goal of maximizing the total expected number of discoveries. We formulate the problem as an online resource allocation problem with accept/reject decisions, which from a high level can be viewed as an online knapsack problem, with the additional uncertainty of random budget replenishment. We start with general arrival distributions and propose a simple policy that achieves a $O(\sqrt{T})$ regret. We complement the result by showing that such regret rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, albeit achieve bounded loss in canonical settings, may incur a $\Omega(\sqrt{T})$ or even a $\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over accept arrivals, we propose a novel policy that incorporates budget buffers. We show that small additional logarithmic buffers suffice to reduce the regret from $\Omega(\sqrt{T})$ or even $\Omega(T)$ to $O(\ln^2 T)$. Numerical experiments are conducted to validate our theoretical findings. Our formulation may have wider applications beyond the problem considered in this paper, and our results emphasize how effective policies should be designed to reach a balance between circumventing wrong accept and reducing wrong reject in online resource allocation problems with uncertain budgets.  ( 3 min )
    InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration
    arXiv:2402.11441v1 Announce Type: cross Abstract: Though Large Language Models (LLMs) have shown remarkable open-generation capabilities across diverse domains, they struggle with knowledge-intensive tasks. To alleviate this issue, knowledge integration methods have been proposed to enhance LLMs with domain-specific knowledge graphs using external modules. However, they suffer from data inefficiency as they require both known and unknown knowledge for fine-tuning. Thus, we study a novel problem of integrating unknown knowledge into LLMs efficiently without unnecessary overlap of known knowledge. Injecting new knowledge poses the risk of forgetting previously acquired knowledge. To tackle this, we propose a novel Infuser-Guided Knowledge Integration (InfuserKI) framework that utilizes transformer internal states to determine whether to enhance the original LLM output with additional information, thereby effectively mitigating knowledge forgetting. Evaluations on the UMLS-2.5k and MetaQA domain knowledge graphs demonstrate that InfuserKI can effectively acquire new knowledge and outperform state-of-the-art baselines by 9% and 6%, respectively, in reducing knowledge forgetting.  ( 2 min )
    A Multispectral Automated Transfer Technique (MATT) for machine-driven image labeling utilizing the Segment Anything Model (SAM)
    arXiv:2402.11413v1 Announce Type: cross Abstract: Segment Anything Model (SAM) is drastically accelerating the speed and accuracy of automatically segmenting and labeling large Red-Green-Blue (RGB) imagery datasets. However, SAM is unable to segment and label images outside of the visible light spectrum, for example, for multispectral or hyperspectral imagery. Therefore, this paper outlines a method we call the Multispectral Automated Transfer Technique (MATT). By transposing SAM segmentation masks from RGB images we can automatically segment and label multispectral imagery with high precision and efficiency. For example, the results demonstrate that segmenting and labeling a 2,400-image dataset utilizing MATT achieves a time reduction of 87.8% in developing a trained model, reducing roughly 20 hours of manual labeling, to only 2.4 hours. This efficiency gain is associated with only a 6.7% decrease in overall mean average precision (mAP) when training multispectral models via MATT, compared to a manually labeled dataset. We consider this an acceptable level of precision loss when considering the time saved during training, especially for rapidly prototyping experimental modeling methods. This research greatly contributes to the study of multispectral object detection by providing a novel and open-source method to rapidly segment, label, and train multispectral object detection models with minimal human interaction. Future research needs to focus on applying these methods to (i) space-based multispectral, and (ii) drone-based hyperspectral imagery.  ( 3 min )
    LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models
    arXiv:2402.11417v1 Announce Type: cross Abstract: Various parameter-efficient fine-tuning (PEFT) techniques have been proposed to enable computationally efficient fine-tuning while maintaining model performance. However, existing PEFT methods are still limited by the growing number of trainable parameters with the rapid deployment of Large Language Models (LLMs). To address this challenge, we present LoRETTA, an ultra-parameter-efficient framework that significantly reduces trainable parameters through tensor-train decomposition. Specifically, we propose two methods, named {LoRETTA}$_{adp}$ and {LoRETTA}$_{rep}$. The former employs tensorized adapters, offering a high-performance yet lightweight approach for the fine-tuning of LLMs. The latter emphasizes fine-tuning via weight parameterization with a set of small tensor factors. LoRETTA achieves comparable or better performance than most widely used PEFT methods with up to $100\times$ fewer parameters on the LLaMA-2-7B models. Furthermore, empirical results demonstrate that the proposed method effectively improves training efficiency, enjoys better multi-task learning performance, and enhances the anti-overfitting capability. Plug-and-play codes built upon the Huggingface framework and PEFT library will be released.  ( 2 min )
    GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation
    arXiv:2402.11401v1 Announce Type: cross Abstract: Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. Here, we design a structured graph with nodes containing proposal-level features and edges representing the relationship between the different proposal regions. Also, to reduce text bias an adaptive node sampling strategy is designed to prune the weight distribution and put more weightage on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss by effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: https://github.com/ayanban011/GraphKD.  ( 2 min )
    k-SemStamp: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text
    arXiv:2402.11399v1 Announce Type: cross Abstract: Recent watermarked generation algorithms inject detectable signatures during language generation to facilitate post-hoc detection. While token-level watermarks are vulnerable to paraphrase attacks, SemStamp (Hou et al., 2023) applies watermark on the semantic representation of sentences and demonstrates promising robustness. SemStamp employs locality-sensitive hashing (LSH) to partition the semantic space with arbitrary hyperplanes, which results in a suboptimal tradeoff between robustness and speed. We propose k-SemStamp, a simple yet effective enhancement of SemStamp, utilizing k-means clustering as an alternative of LSH to partition the embedding space with awareness of inherent semantic structure. Experimental results indicate that k-SemStamp saliently improves its robustness and sampling efficiency while preserving the generation quality, advancing a more effective tool for machine-generated text detection.  ( 2 min )
    Variational Entropy Search for Adjusting Expected Improvement
    arXiv:2402.11345v1 Announce Type: cross Abstract: Bayesian optimization is a widely used technique for optimizing black-box functions, with Expected Improvement (EI) being the most commonly utilized acquisition function in this domain. While EI is often viewed as distinct from other information-theoretic acquisition functions, such as entropy search (ES) and max-value entropy search (MES), our work reveals that EI can be considered a special case of MES when approached through variational inference (VI). In this context, we have developed the Variational Entropy Search (VES) methodology and the VES-Gamma algorithm, which adapts EI by incorporating principles from information-theoretic concepts. The efficacy of VES-Gamma is demonstrated across a variety of test functions and read datasets, highlighting its theoretical and practical utilities in Bayesian optimization scenarios.  ( 2 min )
    What Changed? Converting Representational Interventions to Natural Language
    arXiv:2402.11355v1 Announce Type: cross Abstract: Interventions targeting the representation space of language models (LMs) have emerged as effective means to influence model behavior. These methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations, creating a counterfactual representation. However, since the intervention operates within the representation space, understanding precisely which features it modifies poses a challenge. We show that representation-space counterfactuals can be converted into natural language counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation-space intervention and to interpret the features utilized for encoding a specific concept. Moreover, the resulting counterfactuals can be used to mitigate bias in classification.  ( 2 min )
    Fair Resource Allocation in Virtualized O-RAN Platforms
    arXiv:2402.11285v1 Announce Type: cross Abstract: O-RAN systems and their deployment in virtualized general-purpose computing platforms (O-Cloud) constitute a paradigm shift expected to bring unprecedented performance gains. However, these architectures raise new implementation challenges and threaten to worsen the already-high energy consumption of mobile networks. This paper presents first a series of experiments which assess the O-Cloud's energy costs and their dependency on the servers' hardware, capacity and data traffic properties which, typically, change over time. Next, it proposes a compute policy for assigning the base station data loads to O-Cloud servers in an energy-efficient fashion; and a radio policy that determines at near-real-time the minimum transmission block size for each user so as to avoid unnecessary energy costs. The policies balance energy savings with performance, and ensure that both of them are dispersed fairly across the servers and users, respectively. To cater for the unknown and time-varying parameters affecting the policies, we develop a novel online learning framework with fairness guarantees that apply to the entire operation horizon of the system (long-term fairness). The policies are evaluated using trace-driven simulations and are fully implemented in an O-RAN compatible system where we measure the energy costs and throughput in realistic scenarios.  ( 2 min )
    SpikeNAS: A Fast Memory-Aware Neural Architecture Search Framework for Spiking Neural Network Systems
    arXiv:2402.11322v1 Announce Type: cross Abstract: Spiking Neural Networks (SNNs) offer a promising solution to achieve ultra low-power/energy computation for solving machine learning tasks. Currently, most of the SNN architectures are derived from Artificial Neural Networks whose neurons' architectures and operations are different from SNNs, or developed without considering memory budgets from the underlying processing hardware. These limitations hinder the SNNs from reaching their full potential in accuracy and efficiency. Towards this, we propose SpikeNAS, a novel memory-aware neural architecture search (NAS) framework for SNNs that can quickly find an appropriate SNN architecture with high accuracy under the given memory budgets. To do this, our SpikeNAS employs several key steps: analyzing the impacts of network operations on the accuracy, enhancing the network architecture to improve the learning quality, and developing a fast memory-aware search algorithm. The experimental results show that our SpikeNAS improves the searching time and maintains high accuracy as compared to state-of-the-art while meeting the given memory budgets (e.g., 4.4x faster search with 1.3% accuracy improvement for CIFAR100, using an Nvidia RTX 6000 Ada GPU machine), thereby quickly providing the appropriate SNN architecture for memory-constrained SNN-based systems.  ( 2 min )
    TC-DiffRecon: Texture coordination MRI reconstruction method based on diffusion model and modified MF-UNet method
    arXiv:2402.11274v1 Announce Type: cross Abstract: Recently, diffusion models have gained significant attention as a novel set of deep learning-based generative methods. These models attempt to sample data from a Gaussian distribution that adheres to a target distribution, and have been successfully adapted to the reconstruction of MRI data. However, as an unconditional generative model, the diffusion model typically disrupts image coordination because of the consistent projection of data introduced by conditional bootstrap. This often results in image fragmentation and incoherence. Furthermore, the inherent limitations of the diffusion model often lead to excessive smoothing of the generated images. In the same vein, some deep learning-based models often suffer from poor generalization performance, meaning their effectiveness is greatly affected by different acceleration factors. To address these challenges, we propose a novel diffusion model-based MRI reconstruction method, named TC-DiffRecon, which does not rely on a specific acceleration factor for training. We also suggest the incorporation of the MF-UNet module, designed to enhance the quality of MRI images generated by the model while mitigating the over-smoothing issue to a certain extent. During the image generation sampling process, we employ a novel TCKG module and a Coarse-to-Fine sampling scheme. These additions aim to harmonize image texture, expedite the sampling process, while achieving data consistency. Our source code is available at https://github.com/JustlfC03/TC-DiffRecon.  ( 3 min )
    Mirror Gradient: Towards Robust Multimodal Recommender Systems via Exploring Flat Local Minima
    arXiv:2402.11262v1 Announce Type: cross Abstract: Multimodal recommender systems utilize various types of information to model user preferences and item features, helping users discover items aligned with their interests. The integration of multimodal information mitigates the inherent challenges in recommender systems, e.g., the data sparsity problem and cold-start issues. However, it simultaneously magnifies certain risks from multimodal information inputs, such as information adjustment risk and inherent noise risk. These risks pose crucial challenges to the robustness of recommendation models. In this paper, we analyze multimodal recommender systems from the novel perspective of flat local minima and propose a concise yet effective gradient strategy called Mirror Gradient (MG). This strategy can implicitly enhance the model's robustness during the optimization process, mitigating instability risks arising from multimodal information inputs. We also provide strong theoretical evidence and conduct extensive empirical experiments to show the superiority of MG across various multimodal recommendation models and benchmarks. Furthermore, we find that the proposed MG can complement existing robust training methods and be easily extended to diverse advanced recommendation models, making it a promising new and fundamental paradigm for training multimodal recommender systems. The code is released at https://github.com/Qrange-group/Mirror-Gradient.  ( 2 min )
    On the Role of Similarity in Detecting Masquerading Files
    arXiv:2402.11227v1 Announce Type: cross Abstract: Similarity has been applied to a wide range of security applications, typically used in machine learning models. We examine the problem posed by masquerading samples; that is samples crafted by bad actors to be similar or near identical to legitimate samples. We find that these samples potentially create significant problems for machine learning solutions. The primary problem being that bad actors can circumvent machine learning solutions by using masquerading samples. We then examine the interplay between digital signatures and machine learning solutions. In particular, we focus on executable files and code signing. We offer a taxonomy for masquerading files. We use a combination of similarity and clustering to find masquerading files. We use the insights gathered in this process to offer improvements to similarity based and machine learning security solutions.  ( 2 min )
    Adaptive Split Balancing for Optimal Random Forest
    arXiv:2402.11228v1 Announce Type: cross Abstract: While random forests are commonly used for regression problems, existing methods often lack adaptability in complex situations or lose optimality under simple, smooth scenarios. In this study, we introduce the adaptive split balancing forest (ASBF), capable of learning tree representations from data while simultaneously achieving minimax optimality under the Lipschitz class. To exploit higher-order smoothness levels, we further propose a localized version that attains the minimax rate under the H\"older class $\mathcal{H}^{q,\beta}$ for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. Rather than relying on the widely-used random feature selection, we consider a balanced modification to existing approaches. Our results indicate that an over-reliance on auxiliary randomness may compromise the approximation power of tree models, leading to suboptimal results. Conversely, a less random, more balanced approach demonstrates optimality. Additionally, we establish uniform upper bounds and explore the application of random forests in average treatment effect estimation problems. Through simulation studies and real-data applications, we demonstrate the superior empirical performance of the proposed methods over existing random forests.  ( 2 min )
    Efficient Low-Rank Matrix Estimation, Experimental Design, and Arm-Set-Dependent Low-Rank Bandits
    arXiv:2402.11156v1 Announce Type: cross Abstract: We study low-rank matrix trace regression and the related problem of low-rank matrix bandits. Assuming access to the distribution of the covariates, we propose a novel low-rank matrix estimation method called LowPopArt and provide its recovery guarantee that depends on a novel quantity denoted by B(Q) that characterizes the hardness of the problem, where Q is the covariance matrix of the measurement distribution. We show that our method can provide tighter recovery guarantees than classical nuclear norm penalized least squares (Koltchinskii et al., 2011) in several problems. To perform efficient estimation with a limited number of measurements from an arbitrarily given measurement set A, we also propose a novel experimental design criterion that minimizes B(Q) with computational efficiency. We leverage our novel estimator and design of experiments to derive two low-rank linear bandit algorithms for general arm sets that enjoy improved regret upper bounds. This improves over previous works on low-rank bandits, which make somewhat restrictive assumptions that the arm set is the unit ball or that an efficient exploration distribution is given. To our knowledge, our experimental design criterion is the first one tailored to low-rank matrix estimation beyond the naive reduction to linear regression, which can be of independent interest.  ( 2 min )
    Supporting Experts with a Multimodal Machine-Learning-Based Tool for Human Behavior Analysis of Conversational Videos
    arXiv:2402.11145v1 Announce Type: cross Abstract: Multimodal scene search of conversations is essential for unlocking valuable insights into social dynamics and enhancing our communication. While experts in conversational analysis have their own knowledge and skills to find key scenes, a lack of comprehensive, user-friendly tools that streamline the processing of diverse multimodal queries impedes efficiency and objectivity. To solve it, we developed Providence, a visual-programming-based tool based on design considerations derived from a formative study with experts. It enables experts to combine various machine learning algorithms to capture human behavioral cues without writing code. Our study showed its preferable usability and satisfactory output with less cognitive load imposed in accomplishing scene search tasks of conversations, verifying the importance of its customizability and transparency. Furthermore, through the in-the-wild trial, we confirmed the objectivity and reusability of the tool transform experts' workflow, suggesting the advantage of expert-AI teaming in a highly human-contextual domain.  ( 2 min )
    Contrastive Instruction Tuning
    arXiv:2402.11138v1 Announce Type: cross Abstract: Instruction tuning has been used as a promising approach to improve the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. This behavior indicates LLMs' lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning, which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.  ( 2 min )
    Physics-based material parameters extraction from perovskite experiments via Bayesian optimization
    arXiv:2402.11101v1 Announce Type: cross Abstract: The ability to extract material parameters from quantitative experimental analysis is essential for rational design and theory advancement. However, the difficulty of this analysis increases significantly with the complexity of the theoretical model and the number of material parameters. Here we use Bayesian optimization to develop an analysis platform that can extract up to 8 fundamental material parameters of an organometallic perovskite semiconductor from a transient photoluminescence experiment, based on a complex full physics model that includes drift-diffusion of carriers and dynamic defect occupation. An example study of thermal degradation reveals that changes in doping concentration and carrier mobility dominate, while the defect energy level remains nearly unchanged. This platform can be conveniently applied to other experiments or to combinations of experiments, accelerating materials discovery and optimization of semiconductor materials for photovoltaics and other applications.  ( 2 min )
    Speculative Streaming: Fast LLM Inference without Auxiliary Models
    arXiv:2402.11131v1 Announce Type: cross Abstract: Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.  ( 2 min )
    Occlusion Resilient 3D Human Pose Estimation
    arXiv:2402.11036v1 Announce Type: cross Abstract: Occlusions remain one of the key challenges in 3D body pose estimation from single-camera video sequences. Temporal consistency has been extensively used to mitigate their impact but the existing algorithms in the literature do not explicitly model them. Here, we apply this by representing the deforming body as a spatio-temporal graph. We then introduce a refinement network that performs graph convolutions over this graph to output 3D poses. To ensure robustness to occlusions, we train this network with a set of binary masks that we use to disable some of the edges as in drop-out techniques. In effect, we simulate the fact that some joints can be hidden for periods of time and train the network to be immune to that. We demonstrate the effectiveness of this approach compared to state-of-the-art techniques that infer poses from single-camera sequences.  ( 2 min )
    Modular Graph Extraction for Handwritten Circuit Diagram Images
    arXiv:2402.11093v1 Announce Type: cross Abstract: As digitization in engineering progressed, circuit diagrams (also referred to as schematics) are typically developed and maintained in computer-aided engineering (CAE) systems, thus allowing for automated verification, simulation and further processing in downstream engineering steps. However, apart from printed legacy schematics, hand-drawn circuit diagrams are still used today in the educational domain, where they serve as an easily accessible mean for trainees and students to learn drawing this type of diagrams. Furthermore, hand-drawn schematics are typically used in examinations due to legal constraints. In order to harness the capabilities of digital circuit representations, automated means for extracting the electrical graph from raster graphics are required. While respective approaches have been proposed in literature, they are typically conducted on small or non-disclosed datasets. This paper describes a modular end-to-end solution on a larger, public dataset, in which approaches for the individual sub-tasks are evaluated to form a new baseline. These sub-tasks include object detection (for electrical symbols and texts), binary segmentation (drafter's stroke vs. background), handwritten character recognition and orientation regression for electrical symbols and texts. Furthermore, computer-vision graph assembly and rectification algorithms are presented. All methods are integrated in a publicly available prototype.  ( 2 min )
    Automated Detection and Analysis of Data Practices Using A Real-World Corpus
    arXiv:2402.11006v1 Announce Type: cross Abstract: Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.  ( 2 min )
    Provably Safe Neural Network Controllers via Differential Dynamic Logic
    arXiv:2402.10998v1 Announce Type: cross Abstract: While neural networks (NNs) have a large potential as goal-oriented controllers for Cyber-Physical Systems, verifying the safety of neural network based control systems (NNCSs) poses significant challenges for the practical use of NNs -- especially when safety is needed for unbounded time horizons. One reason for this is the intractability of NN and hybrid system analysis. We introduce VerSAILLE (Verifiably Safe AI via Logically Linked Envelopes): The first approach for the combination of differential dynamic logic (dL) and NN verification. By joining forces, we can exploit the efficiency of NN verification tools while retaining the rigor of dL. We reflect a safety proof for a controller envelope in an NN to prove the safety of concrete NNCS on an infinite-time horizon. The NN verification properties resulting from VerSAILLE typically require nonlinear arithmetic while efficient NN verification tools merely support linear arithmetic. To overcome this divide, we present Mosaic: The first sound and complete verification approach for polynomial real arithmetic properties on piece-wise linear NNs. Mosaic lifts off-the-shelf tools for linear properties to the nonlinear setting. An evaluation on case studies, including adaptive cruise control and airborne collision avoidance, demonstrates the versatility of VerSAILLE and Mosaic: It supports the certification of infinite-time horizon safety and the exhaustive enumeration of counterexample regions while significantly outperforming State-of-the-Art tools in closed-loop NNV.  ( 2 min )
    CHEMREASONER: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback
    arXiv:2402.10980v1 Announce Type: cross Abstract: The discovery of new catalysts is essential for the design of new and more efficient chemical processes in order to transition to a sustainable future. We introduce an AI-guided computational screening framework unifying linguistic reasoning with quantum-chemistry based feedback from 3D atomistic representations. Our approach formulates catalyst discovery as an uncertain environment where an agent actively searches for highly effective catalysts via the iterative combination of large language model (LLM)-derived hypotheses and atomistic graph neural network (GNN)-derived feedback. Identified catalysts in intermediate search steps undergo structural evaluation based on spatial orientation, reaction pathways, and stability. Scoring functions based on adsorption energies and barriers steer the exploration in the LLM's knowledge space toward energetically favorable, high-efficiency catalysts. We introduce planning methods that automatically guide the exploration without human input, providing competitive performance against expert-enumerated chemical descriptor-based implementations. By integrating language-guided reasoning with computational chemistry feedback, our work pioneers AI-accelerated, trustworthy catalyst discovery.  ( 2 min )
    Stuck-at Faults in ReRAM Neuromorphic Circuit Array and their Correction through Machine Learning
    arXiv:2402.10981v1 Announce Type: cross Abstract: In this paper, we study the inference accuracy of the Resistive Random Access Memory (ReRAM) neuromorphic circuit due to stuck-at faults (stuck-on, stuck-off, and stuck at a certain resistive value). A simulation framework using Python is used to perform supervised machine learning (neural network with 3 hidden layers, 1 input layer, and 1 output layer) of handwritten digits and construct a corresponding fully analog neuromorphic circuit (4 synaptic arrays) simulated by Spectre. A generic 45nm Process Development Kit (PDK) was used. We study the difference in the inference accuracy degradation due to stuck-on and stuck-off defects. Various defect patterns are studied including circular, ring, row, column, and circular-complement defects. It is found that stuck-on and stuck-off defects have a similar effect on inference accuracy. However, it is also found that if there is a spatial defect variation across the columns, the inference accuracy may be degraded significantly. We also propose a machine learning (ML) strategy to recover the inference accuracy degradation due to stuck-at faults. The inference accuracy is improved from 48% to 85% in a defective neuromorphic circuit.  ( 2 min )
    Modeling methodology for the accurate and prompt prediction of symptomatic events in chronic diseases
    arXiv:2402.10972v1 Announce Type: cross Abstract: Prediction of symptomatic crises in chronic diseases allows to take decisions before the symptoms occur, such as the intake of drugs to avoid the symptoms or the activation of medical alarms. The prediction horizon is in this case an important parameter in order to fulfill the pharmacokinetics of medications, or the time response of medical services. This paper presents a study about the prediction limits of a chronic disease with symptomatic crises: the migraine. For that purpose, this work develops a methodology to build predictive migraine models and to improve these predictions beyond the limits of the initial models. The maximum prediction horizon is analyzed, and its dependency on the selected features is studied. A strategy for model selection is proposed to tackle the trade off between conservative but robust predictive models, with respect to less accurate predictions with higher horizons. The obtained results show a prediction horizon close to 40 minutes, which is in the time range of the drug pharmacokinetics. Experiments have been performed in a realistic scenario where input data have been acquired in an ambulatory clinical study by the deployment of a non-intrusive Wireless Body Sensor Network. Our results provide an effective methodology for the selection of the future horizon in the development of prediction algorithms for diseases experiencing symptomatic crises.  ( 3 min )
    On the Cross-Dataset Generalization of Machine Learning for Network Intrusion Detection
    arXiv:2402.10974v1 Announce Type: cross Abstract: Network Intrusion Detection Systems (NIDS) are a fundamental tool in cybersecurity. Their ability to generalize across diverse networks is a critical factor in their effectiveness and a prerequisite for real-world applications. In this study, we conduct a comprehensive analysis on the generalization of machine-learning-based NIDS through an extensive experimentation in a cross-dataset framework. We employ four machine learning classifiers and utilize four datasets acquired from different networks: CIC-IDS-2017, CSE-CIC-IDS2018, LycoS-IDS2017, and LycoS-Unicas-IDS2018. Notably, the last dataset is a novel contribution, where we apply corrections based on LycoS-IDS2017 to the well-known CSE-CIC-IDS2018 dataset. The results show nearly perfect classification performance when the models are trained and tested on the same dataset. However, when training and testing the models in a cross-dataset fashion, the classification accuracy is largely commensurate with random chance except for a few combinations of attacks and datasets. We employ data visualization techniques in order to provide valuable insights on the patterns in the data. Our analysis unveils the presence of anomalies in the data that directly hinder the classifiers capability to generalize the learned knowledge to new scenarios. This study enhances our comprehension of the generalization capabilities of machine-learning-based NIDS, highlighting the significance of acknowledging data heterogeneity.  ( 2 min )
    Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model
    arXiv:2402.10965v1 Announce Type: cross Abstract: Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.  ( 3 min )
    GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
    arXiv:2402.10963v1 Announce Type: cross Abstract: State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify \textit{when and where to refine} without access to external feedback. Outcome-based Reward Models (\textbf{ORMs}), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (\textbf{PRMs}), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (\textbf{SORMs}) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train \textit{global} refinement models, which take only the question and a draft solution as input and predict a corrected solution, and \textit{local} refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53\% to 65\% when greedily sampled.  ( 3 min )
    Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts
    arXiv:2402.10958v1 Announce Type: cross Abstract: In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Through empirical tests, including dialogue and summarization tasks, and evaluations using the AlpacaEval2.0 leaderboard, RPO has demonstrated a superior ability to align LLMs with user preferences and to improve their adaptability during the training process. The PyTorch code necessary to reproduce the results presented in the paper will be made available on GitHub for public access.  ( 3 min )
    Measuring and Controlling Persona Drift in Language Model Dialogs
    arXiv:2402.10962v1 Announce Type: cross Abstract: Prompting is a standard tool for customizing language-model chatbots, enabling them to take on a specific "persona". An implicit assumption in the use of prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated persona for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating persona stability via self-chats between two personalized chatbots. Testing popular models like LLaMA2-chat-70B, we reveal a significant persona drift within eight rounds of conversations. An empirical and theoretical analysis of this phenomenon suggests the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and persona drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.  ( 2 min )
    DAEDRA: A language model for predicting outcomes in passive pharmacovigilance reporting
    arXiv:2402.10951v1 Announce Type: cross Abstract: Over the recent years, the emergence of large language models (LLMs) has given rise to a proliferation of domain-specific models that are intended to reflect the particularities of linguistic context and content as a correlate of the originating domain. This paper details the conception, design, training and evaluation of DAEDRA, a LLM designed to detect regulatory-relevant outcomes (mortality, ER attendance and hospitalisation) in adverse event reports elicited through passive reporting (PR). While PR is a highly cost-efficient way of eliciting information from a wide and diverse audience -- typically including not only physicians and healthcare providers but also patients, family members and other lay stakeholders --, this diversity makes PR corpora difficult to analyse. Generic language models may not capture the complex clinical dimensions while specific clinical or biomedical models may not perform well on lay reports. To evaluate the utility of a subdomain-specific language model, an adaptive training approach was adapted, wherein base language model candidates were evaluated on a subset of the corpus, and the best performer was trained on the entire corpus. This yielded a small but significant improvement in $F_1$ (+1%), precision (+2.5%) and recall (+3.8%), at a relatively low training cost and a single-day training time. Subdomain-specific LLMs continue to be viable options for better results when analysing highly specialised corpora.  ( 2 min )
    Sleep-Like Unsupervised Replay Improves Performance when Data are Limited or Unbalanced
    arXiv:2402.10956v1 Announce Type: cross Abstract: The performance of artificial neural networks (ANNs) degrades when training data are limited or imbalanced. In contrast, the human brain can learn quickly from just a few examples. Here, we investigated the role of sleep in improving the performance of ANNs trained with limited data on the MNIST and Fashion MNIST datasets. Sleep was implemented as an unsupervised phase with local Hebbian type learning rules. We found a significant boost in accuracy after the sleep phase for models trained with limited data in the range of 0.5-10% of total MNIST or Fashion MNIST datasets. When more than 10% of the total data was used, sleep alone had a slight negative impact on performance, but this was remedied by fine-tuning on the original data. This study sheds light on a potential synaptic weight dynamics strategy employed by the brain during sleep to enhance memory performance when training data are limited or imbalanced.  ( 2 min )
    The Unreasonable Effectiveness of Eccentric Automatic Prompts
    arXiv:2402.10949v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable problem-solving and basic mathematics abilities. However, their efficacy is highly contingent on the formulation of the prompt. This study endeavors to quantify the influence of incorporating "positive thinking" into the system message of the prompt, then compare that to systematic prompt optimization. We assess the performance of 60 combinations of system message snippets, tested with and without Chain of Thought prompting, across three models with parameters ranging from 7 to 70 billion on the GSM8K dataset. Our findings reveal that results do not universally generalize across models. In most instances, the inclusion of "positive thinking" prompts positively affected model performance. Notably, however, Llama2-70B exhibited an exception when not utilizing Chain of Thought, as the optimal system message was found to be none at all. Given the combinatorial complexity, and thus computation time, of experimenting with hand-tuning prompts for large black-box models, we then compared the performance of the best "positive thinking" prompt against the output of systematic prompt optimization. We show that employing an automated prompt optimizer emerges as the most effective method for enhancing performance, even when working with smaller open-source models. Additionally, our findings reveal that the highest-scoring, automatically-optimized prompt exhibits a degree of peculiarity far beyond expectations.  ( 2 min )
    CultureLLM: Incorporating Cultural Differences into Large Language Models
    arXiv:2402.10946v1 Announce Type: cross Abstract: Large language models (LLMs) are reported to be partial to certain cultures owing to the training data dominance from the English corpora. Since multilingual cultural data are often expensive to collect, existing efforts handle this by prompt engineering or culture-specific pre-training. However, they might overlook the knowledge deficiency of low-resource culture and require extensive computing resources. In this paper, we propose CultureLLM, a cost-effective solution to incorporate cultural differences into LLMs. CultureLLM adopts World Value Survey (WVS) as seed data and generates semantically equivalent training data via the proposed semantic data augmentation. Using only 50 seed samples from WVS with augmented data, we fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering rich and low-resource languages. Extensive experiments on 60 culture-related datasets demonstrate that CultureLLM significantly outperforms various counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with comparable performance to GPT-4 or even better. Our human study shows that the generated samples are semantically equivalent to the original samples, providing an effective solution for LLMs augmentation.  ( 2 min )
    Neural machine translation of clinical procedure codes for medical diagnosis and uncertainty quantification
    arXiv:2402.10940v1 Announce Type: cross Abstract: A Clinical Decision Support System (CDSS) is designed to enhance clinician decision-making by combining system-generated recommendations with medical expertise. Given the high costs, intensive labor, and time-sensitive nature of medical treatments, there is a pressing need for efficient decision support, especially in complex emergency scenarios. In these scenarios, where information can be limited, an advanced CDSS framework that leverages AI (artificial intelligence) models to effectively reduce diagnostic uncertainty has utility. Such an AI-enabled CDSS framework with quantified uncertainty promises to be practical and beneficial in the demanding context of real-world medical care. In this study, we introduce the concept of Medical Entropy, quantifying uncertainties in patient outcomes predicted by neural machine translation based on the ICD-9 code of procedures. Our experimental results not only show strong correlations between procedure and diagnosis sequences based on the simple ICD-9 code but also demonstrate the promising capacity to model trends of uncertainties during hospitalizations through a data-driven approach.  ( 2 min )
    Text2Data: Low-Resource Data Generation with Textual Control
    arXiv:2402.10941v1 Announce Type: cross Abstract: Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models for text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model. Subsequently, it undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data is able to achieve enhanced performance regarding controllability across various modalities, including molecules, motions and time series, when compared to existing baselines.  ( 2 min )
    ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
    arXiv:2402.10930v1 Announce Type: cross Abstract: The self-attention mechanism sets transformer-based large language model (LLM) apart from the convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon is challenging due to the extensively used Softmax in self-attention. Apart from the non-linearity, the low arithmetic intensity greatly reduces the processing parallelism, which becomes the bottleneck especially when dealing with a longer context. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design as an efficient Softmax alternative. ConSmax employs differentiable normalization parameters to remove the maximum searching and denominator summation in Softmax. It allows for massive parallelization while performing the critical tasks of Softmax. In addition, a scalable ConSmax hardware utilizing a bitwidth-split look-up table (LUT) can produce lossless non-linear operation and support mix-precision computing. It further facilitates efficient LLM inference. Experimental results show that ConSmax achieves a minuscule power consumption of 0.43 mW and area of 0.001 mm2 at 1-GHz working frequency and 22-nm CMOS technology. Compared to state-of-the-art Softmax hardware, ConSmax results in 14.5x energy and 14.0x area savings with a comparable accuracy on a GPT-2 model and the WikiText103 dataset.  ( 2 min )
    A Lightweight Inception Boosted U-Net Neural Network for Routability Prediction
    arXiv:2402.10937v1 Announce Type: cross Abstract: As the modern CPU, GPU, and NPU chip design complexity and transistor counts keep increasing, and with the relentless shrinking of semiconductor technology nodes to nearly 1 nanometer, the placement and routing have gradually become the two most pivotal processes in modern very-large-scale-integrated (VLSI) circuit back-end design. How to evaluate routability efficiently and accurately in advance (at the placement and global routing stages) has grown into a crucial research area in the field of artificial intelligence (AI) assisted electronic design automation (EDA). In this paper, we propose a novel U-Net variant model boosted by an Inception embedded module to predict Routing Congestion (RC) and Design Rule Checking (DRC) hotspots. Experimental results on the recently published CircuitNet dataset benchmark show that our proposed method achieves up to 5% (RC) and 20% (DRC) rate reduction in terms of Avg-NRMSE (Average Normalized Root Mean Square Error) compared to the classic architecture. Furthermore, our approach consistently outperforms the prior model on the SSIM (Structural Similarity Index Measure) metric.  ( 2 min )
    Numerical analysis of physics-informed neural networks and related models in physics-informed machine learning
    arXiv:2402.10926v1 Announce Type: cross Abstract: Physics-informed neural networks (PINNs) and their variants have been very popular in recent years as algorithms for the numerical simulation of both forward and inverse problems for partial differential equations. This article aims to provide a comprehensive review of currently available results on the numerical analysis of PINNs and related models that constitute the backbone of physics-informed machine learning. We provide a unified framework in which analysis of the various components of the error incurred by PINNs in approximating PDEs can be effectively carried out. A detailed review of available results on approximation, generalization and training errors and their behavior with respect to the type of the PDE and the dimension of the underlying domain is presented. In particular, the role of the regularity of the solutions and their stability to perturbations in the error analysis is elucidated. Numerical results are also presented to illustrate the theory. We identify training errors as a key bottleneck which can adversely affect the overall performance of various models in physics-informed machine learning.  ( 2 min )
    AM^2-EmoJE: Adaptive Missing-Modality Emotion Recognition in Conversation via Joint Embedding Learning
    arXiv:2402.10921v1 Announce Type: cross Abstract: Human emotion can be presented in different modes i.e., audio, video, and text. However, the contribution of each mode in exhibiting each emotion is not uniform. Furthermore, the availability of complete mode-specific details may not always be guaranteed in the test time. In this work, we propose AM^2-EmoJE, a model for Adaptive Missing-Modality Emotion Recognition in Conversation via Joint Embedding Learning model that is grounded on two-fold contributions: First, a query adaptive fusion that can automatically learn the relative importance of its mode-specific representations in a query-specific manner. By this the model aims to prioritize the mode-invariant spatial query details of the emotion patterns, while also retaining its mode-exclusive aspects within the learned multimodal query descriptor. Second the multimodal joint embedding learning module that explicitly addresses various missing modality scenarios in test-time. By this, the model learns to emphasize on the correlated patterns across modalities, which may help align the cross-attended mode-specific descriptors pairwise within a joint-embedding space and thereby compensate for missing modalities during inference. By leveraging the spatio-temporal details at the dialogue level, the proposed AM^2-EmoJE not only demonstrates superior performance compared to the best-performing state-of-the-art multimodal methods, by effectively leveraging body language in place of face expression, it also exhibits an enhanced privacy feature. By reporting around 2-5% improvement in the weighted-F1 score, the proposed multimodal joint embedding module facilitates an impressive performance gain in a variety of missing-modality query scenarios during test time.  ( 3 min )
    Hermite Neural Network Simulation for Solving the 2D Schrodinger Equation
    arXiv:2402.10649v1 Announce Type: cross Abstract: The Schrodinger equation is a mathematical equation describing the wave function's behavior in a quantum-mechanical system. It is a partial differential equation that provides valuable insights into the fundamental principles of quantum mechanics. In this paper, the aim was to solve the Schrodinger equation with sufficient accuracy by using a mixture of neural networks with the collocation method base Hermite functions. Initially, the Hermite functions roots were employed as collocation points, enhancing the efficiency of the solution. The Schrodinger equation is defined in an infinite domain, the use of Hermite functions as activation functions resulted in excellent precision. Finally, the proposed method was simulated using MATLAB's Simulink tool. The results were then compared with those obtained using Physics-informed neural networks and the presented method.  ( 2 min )
    LLM-Assisted Crisis Management: Building Advanced LLM Platforms for Effective Emergency Response and Public Collaboration
    arXiv:2402.10908v1 Announce Type: cross Abstract: Emergencies and critical incidents often unfold rapidly, necessitating a swift and effective response. In this research, we introduce a novel approach to identify and classify emergency situations from social media posts and direct emergency messages using an open source Large Language Model, LLAMA2. The goal is to harness the power of natural language processing and machine learning to assist public safety telecommunicators and huge crowds during countrywide emergencies. Our research focuses on developing a language model that can understand users describe their situation in the 911 call, enabling LLAMA2 to analyze the content and offer relevant instructions to the telecommunicator, while also creating workflows to notify government agencies with the caller's information when necessary. Another benefit this language model provides is its ability to assist people during a significant emergency incident when the 911 system is overwhelmed, by assisting the users with simple instructions and informing authorities with their location and emergency information.  ( 2 min )
    A White-Box False Positive Adversarial Attack Method on Contrastive Loss Based Offline Handwritten Signature Verification Models
    arXiv:2308.08925v3 Announce Type: cross Abstract: In this paper, we tackle the challenge of white-box false positive adversarial attacks on contrastive loss based offline handwritten signature verification models. We propose a novel attack method that treats the attack as a style transfer between closely related but distinct writing styles. To guide the generation of deceptive images, we introduce two new loss functions that enhance the attack success rate by perturbing the Euclidean distance between the embedding vectors of the original and synthesized samples, while ensuring minimal perturbations by reducing the difference between the generated image and the original image. Our method demonstrates state-of-the-art performance in white-box attacks on contrastive loss based offline handwritten signature verification models, as evidenced by our experiments. The key contributions of this paper include a novel false positive attack method, two new loss functions, effective style transfer in handwriting styles, and superior performance in white-box false positive attacks compared to other white-box attack methods.  ( 3 min )
    LoraRetriever: Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild
    arXiv:2402.09997v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) provides an effective yet efficient solution for fine-tuning large language models (LLM). The modular and plug-and-play nature of LoRA enables the integration of diverse domain-specific LoRAs to enhance the capabilities of LLMs. Previous research on exploiting multiple LoRAs either focuses on specific isolated downstream tasks or fixes the selection of LoRAs during training. However, in real-world scenarios, LLMs receive diverse prompts covering different tasks, and the pool of candidate LoRAs is often dynamically updated. To bridge this gap, we propose LoraRetriever, a retrieve-then-compose framework that adaptively retrieves and composes multiple LoRAs according to the input prompts. LoraRetriever contains three main components: firstly, identifying and retrieving LoRAs relevant to the given input; secondly, formulating strategies for effectively integrating the retrieved LoRAs; and thirdly, developing efficient batch inference to accommodate heterogeneous requests. Experimental results indicate that LoraRetriever consistently outperforms the baselines, highlighting its practical effectiveness and versatility.  ( 2 min )
    A Critical Evaluation of AI Feedback for Aligning Large Language Models
    arXiv:2402.12366v1 Announce Type: new Abstract: Reinforcement learning with AI feedback (RLAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g. GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic explanation for when SFT may outperform the full two-step RLAIF pipeline as well as suggestions for making RLAIF maximally useful in practice.  ( 2 min )
    Universal Physics Transformers
    arXiv:2402.12365v1 Announce Type: new Abstract: Deep neural network based surrogates for partial differential equations have recently gained increased interest. However, akin to their numerical counterparts, different techniques are used across applications, even if the underlying dynamics of the systems are similar. A prominent example is the Lagrangian and Eulerian specification in computational fluid dynamics, posing a challenge for neural networks to effectively model particle- as opposed to grid-based dynamics. We introduce Universal Physics Transformers (UPTs), a novel learning paradigm which models a wide range of spatio-temporal problems - both for Lagrangian and Eulerian discretization schemes. UPTs operate without grid- or particle-based latent structures, enabling flexibility across meshes and particles. UPTs efficiently propagate dynamics in the latent space, emphasized by inverse encoding and decoding techniques. Finally, UPTs allow for queries of the latent space representation at any point in space-time. We demonstrate the efficacy of UPTs in mesh-based fluid simulations, steady-state Reynolds averaged Navier-Stokes simulations, and Lagrangian-based dynamics. Project page: https://ml-jku.github.io/UPT  ( 2 min )
    LoRA+: Efficient Low Rank Adaptation of Large Models
    arXiv:2402.12354v1 Announce Type: new Abstract: In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$ 2X SpeedUp), at the same computational cost as LoRA.  ( 2 min )
    Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
    arXiv:2402.12336v1 Announce Type: new Abstract: Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (VLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of VLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the VLM is required. The code and robust models are available at https://github.com/chs20/RobustVLM  ( 2 min )
    Dynamic Environment Responsive Online Meta-Learning with Fairness Awareness
    arXiv:2402.12319v1 Announce Type: new Abstract: The fairness-aware online learning framework has emerged as a potent tool within the context of continuous lifelong learning. In this scenario, the learner's objective is to progressively acquire new tasks as they arrive over time, while also guaranteeing statistical parity among various protected sub-populations, such as race and gender, when it comes to the newly introduced tasks. A significant limitation of current approaches lies in their heavy reliance on the i.i.d (independent and identically distributed) assumption concerning data, leading to a static regret analysis of the framework. Nevertheless, it's crucial to note that achieving low static regret does not necessarily translate to strong performance in dynamic environments characterized by tasks sampled from diverse distributions. In this paper, to tackle the fairness-aware online learning challenge in evolving settings, we introduce a unique regret measure, FairSAR, by incorporating long-term fairness constraints into a strongly adapted loss regret framework. Moreover, to determine an optimal model parameter at each time step, we introduce an innovative adaptive fairness-aware online meta-learning algorithm, referred to as FairSAOML. This algorithm possesses the ability to adjust to dynamic environments by effectively managing bias control and model accuracy. The problem is framed as a bi-level convex-concave optimization, considering both the model's primal and dual parameters, which pertain to its accuracy and fairness attributes, respectively. Theoretical analysis yields sub-linear upper bounds for both loss regret and the cumulative violation of fairness constraints. Our experimental evaluation on various real-world datasets in dynamic environments demonstrates that our proposed FairSAOML algorithm consistently outperforms alternative approaches rooted in the most advanced prior online learning methods.  ( 3 min )
    Generating Survival Interpretable Trajectories and Data
    arXiv:2402.12331v1 Announce Type: new Abstract: A new model for generating survival trajectories and data based on applying an autoencoder of a specific structure is proposed. It solves three tasks. First, it provides predictions in the form of the expected event time and the survival function for a new generated feature vector on the basis of the Beran estimator. Second, the model generates additional data based on a given training set that would supplement the original dataset. Third, the most important, it generates a prototype time-dependent trajectory for an object, which characterizes how features of the object could be changed to achieve a different time to an event. The trajectory can be viewed as a type of the counterfactual explanation. The proposed model is robust during training and inference due to a specific weighting scheme incorporating into the variational autoencoder. The model also determines the censored indicators of new generated data by solving a classification task. The paper demonstrates the efficiency and properties of the proposed model using numerical experiments on synthetic and real datasets. The code of the algorithm implementing the proposed model is publicly available.  ( 2 min )
    Multi-View Conformal Learning for Heterogeneous Sensor Fusion
    arXiv:2402.12307v1 Announce Type: new Abstract: Being able to assess the confidence of individual predictions in machine learning models is crucial for decision making scenarios. Specially, in critical applications such as medical diagnosis, security, and unmanned vehicles, to name a few. In the last years, complex predictive models have had great success in solving hard tasks and new methods are being proposed every day. While the majority of new developments in machine learning models focus on improving the overall performance, less effort is put on assessing the trustworthiness of individual predictions, and even to a lesser extent, in the context of sensor fusion. To this end, we build and test multi-view and single-view conformal models for heterogeneous sensor fusion. Our models provide theoretical marginal confidence guarantees since they are based on the conformal prediction framework. We also propose a multi-view semi-conformal model based on sets intersection. Through comprehensive experimentation, we show that multi-view models perform better than single-view models not only in terms of accuracy-based performance metrics (as it has already been shown in several previous works) but also in conformal measures that provide uncertainty estimation. Our results also showed that multi-view models generate prediction sets with less uncertainty compared to single-view models.  ( 2 min )
    Refining Minimax Regret for Unsupervised Environment Design
    arXiv:2402.12284v1 Announce Type: new Abstract: In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.  ( 2 min )
    On the Byzantine-Resilience of Distillation-Based Federated Learning
    arXiv:2402.12265v1 Announce Type: new Abstract: Federated Learning (FL) algorithms using Knowledge Distillation (KD) have received increasing attention due to their favorable properties with respect to privacy, non-i.i.d. data and communication cost. These methods depart from transmitting model parameters and, instead, communicate information about a learning task by sharing predictions on a public dataset. In this work, we study the performance of such approaches in the byzantine setting, where a subset of the clients act in an adversarial manner aiming to disrupt the learning process. We show that KD-based FL algorithms are remarkably resilient and analyze how byzantine clients can influence the learning process compared to Federated Averaging. Based on these insights, we introduce two new byzantine attacks and demonstrate that they are effective against prior byzantine-resilient methods. Additionally, we propose FilterExp, a novel method designed to enhance the byzantine resilience of KD-based FL algorithms and demonstrate its efficacy. Finally, we provide a general method to make attacks harder to detect, improving their effectiveness.  ( 2 min )
    End-to-end Supervised Prediction of Arbitrary-size Graphs with Partially-Masked Fused Gromov-Wasserstein Matching
    arXiv:2402.12269v1 Announce Type: new Abstract: We present a novel end-to-end deep learning-based approach for Supervised Graph Prediction (SGP). We introduce an original Optimal Transport (OT)-based loss, the Partially-Masked Fused Gromov-Wasserstein loss (PM-FGW), that allows to directly leverage graph representations such as adjacency and feature matrices. PM-FGW exhibits all the desirable properties for SGP: it is node permutation invariant, sub-differentiable and handles graphs of different sizes by comparing their padded representations as well as their masking vectors. Moreover, we present a flexible transformer-based architecture that easily adapts to different types of input data. In the experimental section, three different tasks, a novel and challenging synthetic dataset (image2graph) and two real-world tasks, image2map and fingerprint2molecule - showcase the efficiency and versatility of the approach compared to competitors.  ( 2 min )
    Towards a tailored mixed-precision sub-8bit quantization scheme for Gated Recurrent Units using Genetic Algorithms
    arXiv:2402.12263v1 Announce Type: new Abstract: Despite the recent advances in model compression techniques for deep neural networks, deploying such models on ultra-low-power embedded devices still proves challenging. In particular, quantization schemes for Gated Recurrent Units (GRU) are difficult to tune due to their dependence on an internal state, preventing them from fully benefiting from sub-8bit quantization. In this work, we propose a modular integer quantization scheme for GRUs where the bit width of each operator can be selected independently. We then employ Genetic Algorithms (GA) to explore the vast search space of possible bit widths, simultaneously optimising for model size and accuracy. We evaluate our methods on four different sequential tasks and demonstrate that mixed-precision solutions exceed homogeneous-precision ones in terms of Pareto efficiency. In our results, we achieve a model size reduction between 25% and 55% while maintaining an accuracy comparable with the 8-bit homogeneous equivalent.  ( 2 min )
    Uncertainty quantification in fine-tuned LLMs using LoRA ensembles
    arXiv:2402.12264v1 Announce Type: new Abstract: Fine-tuning large language models can improve task specific performance, although a general understanding of what the fine-tuned model has learned, forgotten and how to trust its predictions is still missing. We derive principled uncertainty quantification for fine-tuned LLMs with posterior approximations using computationally efficient low-rank adaptation ensembles. We analyze three common multiple-choice datasets using low-rank adaptation ensembles based on Mistral-7b, and draw quantitative and qualitative conclusions on their perceived complexity and model efficacy on the different target domains during and after fine-tuning. In particular, backed by the numerical experiments, we hypothesise about signals from entropic uncertainty measures for data domains that are inherently difficult for a given architecture to learn.  ( 2 min )
    Non-orthogonal Age-Optimal Information Dissemination in Vehicular Networks: A Meta Multi-Objective Reinforcement Learning Approach
    arXiv:2402.12260v1 Announce Type: new Abstract: This paper considers minimizing the age-of-information (AoI) and transmit power consumption in a vehicular network, where a roadside unit (RSU) provides timely updates about a set of physical processes to vehicles. We consider non-orthogonal multi-modal information dissemination, which is based on superposed message transmission from RSU and successive interference cancellation (SIC) at vehicles. The formulated problem is a multi-objective mixed-integer nonlinear programming problem; thus, a Pareto-optimal front is very challenging to obtain. First, we leverage the weighted-sum approach to decompose the multi-objective problem into a set of multiple single-objective sub-problems corresponding to each predefined objective preference weight. Then, we develop a hybrid deep Q-network (DQN)-deep deterministic policy gradient (DDPG) model to solve each optimization sub-problem respective to predefined objective-preference weight. The DQN optimizes the decoding order, while the DDPG solves the continuous power allocation. The model needs to be retrained for each sub-problem. We then present a two-stage meta-multi-objective reinforcement learning solution to estimate the Pareto front with a few fine-tuning update steps without retraining the model for each sub-problem. Simulation results illustrate the efficacy of the proposed solutions compared to the existing benchmarks and that the meta-multi-objective reinforcement learning model estimates a high-quality Pareto frontier with reduced training time.  ( 2 min )
    Synthetic location trajectory generation using categorical diffusion models
    arXiv:2402.12242v1 Announce Type: new Abstract: Diffusion probabilistic models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data, for instance, for computer vision, audio, natural language processing, or biomolecule generation. Here, we propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals. ILTs are of major importance in mobility research to understand the mobility behavior of populations and to ultimately inform political decision-making. We represent ILTs as multi-dimensional categorical random variables and propose to model their joint distribution using a continuous DPM by first applying the diffusion process in a continuous unconstrained space and then mapping the continuous variables into a discrete space. We demonstrate that our model can synthesize realistic ILPs by comparing conditionally and unconditionally generated sequences to real-world ILPs from a GNSS tracking data set which suggests the potential use of our model for synthetic data generation, for example, for benchmarking models used in mobility research.  ( 2 min )
    BEARS Make Neuro-Symbolic Models Aware of their Reasoning Shortcuts
    arXiv:2402.12240v1 Announce Type: new Abstract: Neuro-Symbolic (NeSy) predictors that conform to symbolic knowledge - encoding, e.g., safety constraints - can be affected by Reasoning Shortcuts (RSs): They learn concepts consistent with the symbolic knowledge by exploiting unintended semantics. RSs compromise reliability and generalization and, as we show in this paper, they are linked to NeSy models being overconfident about the predicted concepts. Unfortunately, the only trustworthy mitigation strategy requires collecting costly dense supervision over the concepts. Rather than attempting to avoid RSs altogether, we propose to ensure NeSy models are aware of the semantic ambiguity of the concepts they learn, thus enabling their users to identify and distrust low-quality concepts. Starting from three simple desiderata, we derive bears (BE Aware of Reasoning Shortcuts), an ensembling technique that calibrates the model's concept-level confidence without compromising prediction accuracy, thus encouraging NeSy architectures to be uncertain about concepts affected by RSs. We show empirically that bears improves RS-awareness of several state-of-the-art NeSy models, and also facilitates acquiring informative dense annotations for mitigation purposes.  ( 2 min )
    Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis
    arXiv:2402.12241v1 Announce Type: new Abstract: We analyze recurrent neural networks trained with gradient descent in the supervised learning setting for dynamical systems, and prove that gradient descent can achieve optimality \emph{without} massive overparameterization. Our in-depth nonasymptotic analysis (i) provides sharp bounds on the network size $m$ and iteration complexity $\tau$ in terms of the sequence length $T$, sample size $n$ and ambient dimension $d$, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the convergence and network width bounds characterized by a cutoff point that depends on the Lipschitz continuity of the activation function. Remarkably, this analysis reveals that an appropriately-initialized recurrent neural network trained with $n$ samples can achieve optimality with a network size $m$ that scales only logarithmically with $n$. This sharply contrasts with the prior works that require high-order polynomial dependency of $m$ on $n$ to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and establishing local smoothness properties of the hidden state with respect to the learnable parameters.  ( 2 min )
    The Fundamental Limits of Least-Privilege Learning
    arXiv:2402.12235v1 Announce Type: new Abstract: The promise of least-privilege learning -- to find feature representations that are useful for a learning task but prevent inference of any sensitive information unrelated to this task -- is highly appealing. However, so far this concept has only been stated informally. It thus remains an open question whether and how we can achieve this goal. In this work, we provide the first formalisation of the least-privilege principle for machine learning and characterise its feasibility. We prove that there is a fundamental trade-off between a representation's utility for a given task and its leakage beyond the intended task: it is not possible to learn representations that have high utility for the intended task but, at the same time prevent inference of any attribute other than the task label itself. This trade-off holds regardless of the technique used to learn the feature mappings that produce these representations. We empirically validate this result for a wide range of learning techniques, model architectures, and datasets.  ( 2 min )
    Learning to Defer in Content Moderation: The Human-AI Interplay
    arXiv:2402.12237v1 Announce Type: new Abstract: Successful content moderation in online platforms relies on a human-AI collaboration approach. A typical heuristic estimates the expected harmfulness of a post and uses fixed thresholds to decide whether to remove it and whether to send it for human review. This disregards the prediction uncertainty, the time-varying element of human review capacity and post arrivals, and the selective sampling in the dataset (humans only review posts filtered by the admission algorithm). In this paper, we introduce a model to capture the human-AI interplay in content moderation. The algorithm observes contextual information for incoming posts, makes classification and admission decisions, and schedules posts for human review. Only admitted posts receive human reviews on their harmfulness. These reviews help educate the machine-learning algorithms but are delayed due to congestion in the human review system. The classical learning-theoretic way to capture this human-AI interplay is via the framework of learning to defer, where the algorithm has the option to defer a classification task to humans for a fixed cost and immediately receive feedback. Our model contributes to this literature by introducing congestion in the human review system. Moreover, unlike work on online learning with delayed feedback where the delay in the feedback is exogenous to the algorithm's decisions, the delay in our model is endogenous to both the admission and the scheduling decisions. We propose a near-optimal learning algorithm that carefully balances the classification loss from a selectively sampled dataset, the idiosyncratic loss of non-reviewed posts, and the delay loss of having congestion in the human review system. To the best of our knowledge, this is the first result for online learning in contextual queueing systems and hence our analytical framework may be of independent interest.  ( 3 min )
    Diffusion Tempering Improves Parameter Estimation with Probabilistic Integrators for Ordinary Differential Equations
    arXiv:2402.12231v1 Announce Type: new Abstract: Ordinary differential equations (ODEs) are widely used to describe dynamical systems in science, but identifying parameters that explain experimental measurements is challenging. In particular, although ODEs are differentiable and would allow for gradient-based parameter optimization, the nonlinear dynamics of ODEs often lead to many local minima and extreme sensitivity to initial conditions. We therefore propose diffusion tempering, a novel regularization technique for probabilistic numerical methods which improves convergence of gradient-based parameter optimization in ODEs. By iteratively reducing a noise parameter of the probabilistic integrator, the proposed method converges more reliably to the true parameters. We demonstrate that our method is effective for dynamical systems of different complexity and show that it obtains reliable parameter estimates for a Hodgkin-Huxley model with a practically relevant number of parameters.  ( 2 min )
    Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
    arXiv:2402.12201v1 Announce Type: new Abstract: Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted more monosemantic features: How do we recognize circuits connecting the enormous amount of dictionary features? We propose a circuit discovery framework alternative to activation patching. Our framework suffers less from out-of-distribution and proves to be more efficient in terms of asymptotic complexity. The basic unit in our framework is dictionary features decomposed from all modules writing to the residual stream, including embedding, attention output and MLP output. Starting from any logit, dictionary feature or attention score, we manage to trace down to lower-level dictionary features of all tokens and compute their contribution to these more interpretable and local model behaviors. We dig in a small transformer trained on a synthetic task named Othello and find a number of human-understandable fine-grained circuits inside of it.  ( 2 min )
    Revisiting Data Augmentation in Deep Reinforcement Learning
    arXiv:2402.12181v1 Announce Type: new Abstract: Various data augmentation techniques have been recently proposed in image-based deep reinforcement learning (DRL). Although they empirically demonstrate the effectiveness of data augmentation for improving sample efficiency or generalization, which technique should be preferred is not always clear. To tackle this question, we analyze existing methods to better understand them and to uncover how they are connected. Notably, by expressing the variance of the Q-targets and that of the empirical actor/critic losses of these methods, we can analyze the effects of their different components and compare them. We furthermore formulate an explanation about how these methods may be affected by choosing different data augmentation transformations in calculating the target Q-values. This analysis suggests recommendations on how to exploit data augmentation in a more principled way. In addition, we include a regularization term called tangent prop, previously proposed in computer vision, but whose adaptation to DRL is novel to the best of our knowledge. We evaluate our proposition and validate our analysis in several domains. Compared to different relevant baselines, we demonstrate that it achieves state-of-the-art performance in most environments and shows higher sample efficiency and better generalization ability in some complex environments.  ( 2 min )
    Mafin: Enhancing Black-Box Embeddings with Model Augmented Fine-tuning
    arXiv:2402.12177v1 Announce Type: new Abstract: Retrieval Augmented Generation (RAG) has emerged as an effective solution for mitigating hallucinations in Large Language Models (LLMs). The retrieval stage in RAG typically involves a pre-trained embedding model, which converts queries and passages into vectors to capture their semantics. However, a standard pre-trained embedding model may exhibit sub-optimal performance when applied to specific domain knowledge, necessitating fine-tuning. This paper addresses scenarios where the embeddings are only available from a black-box model. We introduce Model augmented fine-tuning (Mafin) -- a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model. Our results demonstrate that Mafin significantly enhances the performance of the black-box embeddings by only requiring the training of a small augmented model. We validate the effectiveness of our method on both labeled and unlabeled datasets, illustrating its broad applicability and efficiency.  ( 2 min )
    Endowing Pre-trained Graph Models with Provable Fairness
    arXiv:2402.12161v1 Announce Type: new Abstract: Pre-trained graph models (PGMs) aim to capture transferable inherent structural properties and apply them to different downstream tasks. Similar to pre-trained language models, PGMs also inherit biases from human society, resulting in discriminatory behavior in downstream applications. The debiasing process of existing fair methods is generally coupled with parameter optimization of GNNs. However, different downstream tasks may be associated with different sensitive attributes in reality, directly employing existing methods to improve the fairness of PGMs is inflexible and inefficient. Moreover, most of them lack a theoretical guarantee, i.e., provable lower bounds on the fairness of model predictions, which directly provides assurance in a practical scenario. To overcome these limitations, we propose a novel adapter-tuning framework that endows pre-trained \textbf{Graph} models with \textbf{P}rovable f\textbf{A}i\textbf{R}ness (called GraphPAR). GraphPAR freezes the parameters of PGMs and trains a parameter-efficient adapter to flexibly improve the fairness of PGMs in downstream tasks. Specifically, we design a sensitive semantic augmenter on node representations, to extend the node representations with different sensitive attribute semantics for each node. The extended representations will be used to further train an adapter, to prevent the propagation of sensitive attribute semantics from PGMs to task predictions. Furthermore, with GraphPAR, we quantify whether the fairness of each node is provable, i.e., predictions are always fair within a certain range of sensitive attribute semantics. Experimental evaluations on real-world datasets demonstrate that GraphPAR achieves state-of-the-art prediction performance and fairness on node classification task. Furthermore, based on our GraphPAR, around 90\% nodes have provable fairness.  ( 3 min )
    Learning Discretized Bayesian Networks with GOMEA
    arXiv:2402.12175v1 Announce Type: new Abstract: Bayesian networks model relationships between random variables under uncertainty and can be used to predict the likelihood of events and outcomes while incorporating observed evidence. From an eXplainable AI (XAI) perspective, such models are interesting as they tend to be compact. Moreover, captured relations can be directly inspected by domain experts. In practice, data is often real-valued. Unless assumptions of normality can be made, discretization is often required. The optimal discretization, however, depends on the relations modelled between the variables. This complicates learning Bayesian networks from data. For this reason, most literature focuses on learning conditional dependencies between sets of variables, called structure learning. In this work, we extend an existing state-of-the-art structure learning approach based on the Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) to jointly learn variable discretizations. The proposed Discretized Bayesian Network GOMEA (DBN-GOMEA) obtains similar or better results than the current state-of-the-art when tasked to retrieve randomly generated ground-truth networks. Moreover, leveraging a key strength of evolutionary algorithms, we can straightforwardly perform DBN learning multi-objectively. We show how this enables incorporating expert knowledge in a uniquely insightful fashion, finding multiple DBNs that trade-off complexity, accuracy, and the difference with a pre-determined expert network.  ( 2 min )
    MLFEF: Machine Learning Fusion Model with Empirical Formula to Explore the Momentum in Competitive Sports
    arXiv:2402.12149v1 Announce Type: new Abstract: Tennis is so popular that coaches and players are curious about factors other than skill, such as momentum. This article will try to define and quantify momentum, providing a basis for real-time analysis of tennis matches. Based on the tennis Grand Slam men's singles match data in recent years, we built two models, one is to build a model based on data-driven, and the other is to build a model based on empirical formulas. For the data-driven model, we first found a large amount of public data including public data on tennis matches in the past five years and personal information data of players. Then the data is preprocessed, and feature engineered, and a fusion model of SVM, Random Forrest algorithm and XGBoost was established. For the mechanism analysis model, important features were selected based on the suggestions of many tennis players and enthusiasts, the sliding window algorithm was used to calculate the weight, and different methods were used to visualize the momentum. For further analysis of the momentum fluctuation, it is based on the popular CUMSUM algorithm in the industry as well as the RUN Test, and the result shows the momentum is not random and the trend might be random. At last, the robustness of the fusion model is analyzed by Monte Carlo simulation.  ( 2 min )
    Federated Bayesian Network Ensembles
    arXiv:2402.12142v1 Announce Type: new Abstract: Federated learning allows us to run machine learning algorithms on decentralized data when data sharing is not permitted due to privacy concerns. Ensemble-based learning works by training multiple (weak) classifiers whose output is aggregated. Federated ensembles are ensembles applied to a federated setting, where each classifier in the ensemble is trained on one data location. In this article, we explore the use of federated ensembles of Bayesian networks (FBNE) in a range of experiments and compare their performance with locally trained models and models trained with VertiBayes, a federated learning algorithm to train Bayesian networks from decentralized data. Our results show that FBNE outperforms local models and provides a significant increase in training speed compared with VertiBayes while maintaining a similar performance in most settings, among other advantages. We show that FBNE is a potentially useful tool within the federated learning toolbox, especially when local populations are heavily biased, or there is a strong imbalance in population size across parties. We discuss the advantages and disadvantages of this approach in terms of time complexity, model accuracy, privacy protection, and model interpretability.  ( 2 min )
    Interpretable Brain-Inspired Representations Improve RL Performance on Visual Navigation Tasks
    arXiv:2402.12067v1 Announce Type: new Abstract: Visual navigation requires a whole range of capabilities. A crucial one of these is the ability of an agent to determine its own location and heading in an environment. Prior works commonly assume this information as given, or use methods which lack a suitable inductive bias and accumulate error over time. In this work, we show how the method of slow feature analysis (SFA), inspired by neuroscience research, overcomes both limitations by generating interpretable representations of visual data that encode location and heading of an agent. We employ SFA in a modern reinforcement learning context, analyse and compare representations and illustrate where hierarchical SFA can outperform other feature extractors on navigation tasks.  ( 2 min )
    DualView: Data Attribution from the Dual Perspective
    arXiv:2402.12118v1 Announce Type: new Abstract: Local data attribution (or influence estimation) techniques aim at estimating the impact that individual data points seen during training have on particular predictions of an already trained Machine Learning model during test time. Previous methods either do not perform well consistently across different evaluation criteria from literature, are characterized by a high computational demand, or suffer from both. In this work we present DualView, a novel method for post-hoc data attribution based on surrogate modelling, demonstrating both high computational efficiency, as well as good evaluation results. With a focus on neural networks, we evaluate our proposed technique using suitable quantitative evaluation strategies from the literature against related principal local data attribution methods. We find that DualView requires considerably lower computational resources than other methods, while demonstrating comparable performance to competing approaches across evaluation metrics. Futhermore, our proposed method produces sparse explanations, where sparseness can be tuned via a hyperparameter. Finally, we showcase that with DualView, we can now render explanations from local data attributions compatible with established local feature attribution methods: For each prediction on (test) data points explained in terms of impactful samples from the training set, we are able to compute and visualize how the prediction on (test) sample relates to each influential training sample in terms of features recognized and by the model. We provide an Open Source implementation of DualView online, together with implementations for all other local data attribution methods we compare against, as well as the metrics reported here, for full reproducibility.  ( 3 min )
    All Language Models Large and Small
    arXiv:2402.12061v1 Announce Type: new Abstract: Many leading language models (LMs) use high-intensity computational resources both during training and execution. This poses the challenge of lowering resource costs for deployment and faster execution of decision-making tasks among others. We introduce a novel plug-and-play LM framework named Language Optimising Network Distribution (LONDI) framework. LONDI learns to selectively employ large LMs only where complex decision-making and reasoning are required while using low-resource LMs everywhere else. LONDI consists of a system of two (off-)policy networks, an LM, a large LM (LLM), and a reinforcement learning module that uses switching controls to quickly learn which system states to call the LLM. We then introduce a variant of LONDI that maintains budget constraints on LLM calls and hence its resource usage. Theoretically, we prove LONDI learns the subset of system states to activate the LLM required to solve the task. We then prove that LONDI converges to optimal solutions while also preserving budgetary constraints on LLM calls almost surely enabling it to solve various tasks while significantly lowering computational costs. We test LONDI's performance in a range of tasks in ScienceWorld and BabyAI-Text and demonstrate that LONDI can solve tasks only solvable by resource-intensive LLMs while reducing GPU usage by up to 30%.  ( 2 min )
    WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More
    arXiv:2402.12065v1 Announce Type: new Abstract: Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze the existing quantization approaches, identifying their limitations in balancing the accuracy and efficiency of the quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a PTQ framework especially designed for quantizing weights and the key/value (KV) cache of LLMs. Specifically, we incorporates past-only quantization to improve the computation of attention. Additionally, we introduce two-dimensional quantization strategy to handle the distribution of KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves almost comparable memory savings to weight-activation quantization, while also approaching the performance of weight-only quantization.  ( 2 min )
    Linear bandits with polylogarithmic minimax regret
    arXiv:2402.12042v1 Announce Type: new Abstract: We study a noise model for linear stochastic bandits for which the subgaussian noise parameter vanishes linearly as we select actions on the unit sphere closer and closer to the unknown vector. We introduce an algorithm for this problem that exhibits a minimax regret scaling as $\log^3(T)$ in the time horizon $T$, in stark contrast the square root scaling of this regret for typical bandit algorithms. Our strategy, based on weighted least-squares estimation, achieves the eigenvalue relation $\lambda_{\min} ( V_t ) = \Omega (\sqrt{\lambda_{\max}(V_t ) })$ for the design matrix $V_t$ at each time step $t$ through geometrical arguments that are independent of the noise model and might be of independent interest. This allows us to tightly control the expected regret in each time step to be of the order $O(\frac1{t})$, leading to the logarithmic scaling of the cumulative regret.  ( 2 min )
    Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations
    arXiv:2402.12038v1 Announce Type: new Abstract: Incorporating natural language rationales in the prompt and In-Context Learning (ICL) has led to a significant improvement of Large Language Models (LLMs) performance. However, rationales currently require human-annotation or the use of auxiliary proxy models to target promising samples or generate high-quality rationales. In this work, we propose Self-AMPLIFY to generate automatically rationales from post hoc explanation methods applied to Small Language Models (SLMs) to improve their own performance. Self-AMPLIFY is a 3-step method that targets samples, generates rationales and builds a final prompt to leverage ICL. Self-AMPLIFY performance is evaluated on two SLMs and two datasets requiring reasoning abilities: these experiments show that Self-AMPLIFY achieves good results against competitors. Self-AMPLIFY is the first method to apply post hoc explanation methods to SLM to generate rationales to improve their own performance in a fully automated manner.  ( 2 min )
    Training Green AI Models Using Elite Samples
    arXiv:2402.12010v1 Announce Type: new Abstract: The substantial increase in AI model training has considerable environmental implications, mandating more energy-efficient and sustainable AI practices. On the one hand, data-centric approaches show great potential towards training energy-efficient AI models. On the other hand, instance selection methods demonstrate the capability of training AI models with minimised training sets and negligible performance degradation. Despite the growing interest in both topics, the impact of data-centric training set selection on energy efficiency remains to date unexplored. This paper presents an evolutionary-based sampling framework aimed at (i) identifying elite training samples tailored for datasets and model pairs, (ii) comparing model performance and energy efficiency gains against typical model training practice, and (iii) investigating the feasibility of this framework for fostering sustainable model training practices. To evaluate the proposed framework, we conducted an empirical experiment including 8 commonly used AI classification models and 25 publicly available datasets. The results showcase that by considering 10% elite training samples, the models' performance can show a 50% improvement and remarkable energy savings of 98% compared to the common training practice.  ( 2 min )
    Class-incremental Learning for Time Series: Benchmark and Evaluation
    arXiv:2402.12035v1 Announce Type: new Abstract: Real-world environments are inherently non-stationary, frequently introducing new classes over time. This is especially common in time series classification, such as the emergence of new disease classification in healthcare or the addition of new activities in human activity recognition. In such cases, a learning system is required to assimilate novel classes effectively while avoiding catastrophic forgetting of the old ones, which gives rise to the Class-incremental Learning (CIL) problem. However, despite the encouraging progress in the image and language domains, CIL for time series data remains relatively understudied. Existing studies suffer from inconsistent experimental designs, necessitating a comprehensive evaluation and benchmarking of methods across a wide range of datasets. To this end, we first present an overview of the Time Series Class-incremental Learning (TSCIL) problem, highlight its unique challenges, and cover the advanced methodologies. Further, based on standardized settings, we develop a unified experimental framework that supports the rapid development of new algorithms, easy integration of new datasets, and standardization of the evaluation process. Using this framework, we conduct a comprehensive evaluation of various generic and time-series-specific CIL methods in both standard and privacy-sensitive scenarios. Our extensive experiments not only provide a standard baseline to support future research but also shed light on the impact of various design factors such as normalization layers or memory budget thresholds. Codes are available at https://github.com/zqiao11/TSCIL.  ( 3 min )
    Cluster Metric Sensitivity to Irrelevant Features
    arXiv:2402.12008v1 Announce Type: new Abstract: Clustering algorithms are used extensively in data analysis for data exploration and discovery. Technological advancements lead to continually growth of data in terms of volume, dimensionality and complexity. This provides great opportunities in data analytics as the data can be interrogated for many different purposes. This however leads challenges, such as identification of relevant features for a given task. In supervised tasks, one can utilise a number of methods to optimise the input features for the task objective (e.g. classification accuracy). In unsupervised problems, such tools are not readily available, in part due to an inability to quantify feature relevance in unlabeled tasks. In this paper, we investigate the sensitivity of clustering performance noisy uncorrelated variables iteratively added to baseline datasets with well defined clusters. We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways. We observe a resilience to very high proportions of irrelevant features for adjusted rand index (ARI) and normalised mutual information (NMI) when the irrelevant features are Gaussian distributed. For Uniformly distributed irrelevant features, we notice the resilience of ARI and NMI is dependent on the dimensionality of the data and exhibits tipping points between high scores and near zero. Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to irrelevant added features exhibiting large changes in score for comparably low proportions of irrelevant features regardless of underlying distribution or data scaling. As such the Silhouette Coefficient and the Davies-Bouldin score are good candidates for optimising feature selection in unsupervised clustering tasks.  ( 3 min )
    Network Inversion of Binarised Neural Nets
    arXiv:2402.11995v1 Announce Type: new Abstract: While the deployment of neural networks, yielding impressive results, becomes more prevalent in various applications, their interpretability and understanding remain a critical challenge. Network inversion, a technique that aims to reconstruct the input space from the model's learned internal representations, plays a pivotal role in unraveling the black-box nature of input to output mappings in neural networks. In safety-critical scenarios, where model outputs may influence pivotal decisions, the integrity of the corresponding input space is paramount, necessitating the elimination of any extraneous "garbage" to ensure the trustworthiness of the network. Binarised Neural Networks (BNNs), characterized by binary weights and activations, offer computational efficiency and reduced memory requirements, making them suitable for resource-constrained environments. This paper introduces a novel approach to invert a trained BNN by encoding it into a CNF formula that captures the network's structure, allowing for both inference and inversion.  ( 2 min )
    Privacy-Preserving Low-Rank Adaptation for Latent Diffusion Models
    arXiv:2402.11989v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) is an efficient strategy for adapting latent diffusion models (LDMs) on a training dataset to generate specific objects by minimizing the adaptation loss. However, adapted LDMs via LoRA are vulnerable to membership inference (MI) attacks that can judge whether a particular data point belongs to private training datasets, thus facing severe risks of privacy leakage. To defend against MI attacks, we make the first effort to propose a straightforward solution: privacy-preserving LoRA (PrivateLoRA). PrivateLoRA is formulated as a min-max optimization problem where a proxy attack model is trained by maximizing its MI gain while the LDM is adapted by minimizing the sum of the adaptation loss and the proxy attack model's MI gain. However, we empirically disclose that PrivateLoRA has the issue of unstable optimization due to the large fluctuation of the gradient scale which impedes adaptation. To mitigate this issue, we propose Stable PrivateLoRA that adapts the LDM by minimizing the ratio of the adaptation loss to the MI gain, which implicitly rescales the gradient and thus stabilizes the optimization. Our comprehensive empirical results corroborate that adapted LDMs via Stable PrivateLoRA can effectively defend against MI attacks while generating high-quality images. Our code is available at https://github.com/WilliamLUO0/StablePrivateLoRA.  ( 2 min )
    Bayesian Active Learning for Censored Regression
    arXiv:2402.11973v1 Announce Type: new Abstract: Bayesian active learning is based on information theoretical approaches that focus on maximising the information that new observations provide to the model parameters. This is commonly done by maximising the Bayesian Active Learning by Disagreement (BALD) acquisitions function. However, we highlight that it is challenging to estimate BALD when the new data points are subject to censorship, where only clipped values of the targets are observed. To address this, we derive the entropy and the mutual information for censored distributions and derive the BALD objective for active learning in censored regression ($\mathcal{C}$-BALD). We propose a novel modelling approach to estimate the $\mathcal{C}$-BALD objective and use it for active learning in the censored setting. Across a wide range of datasets and models, we demonstrate that $\mathcal{C}$-BALD outperforms other Bayesian active learning methods in censored regression.  ( 2 min )
    DB-LLM: Accurate Dual-Binarization for Efficient LLMs
    arXiv:2402.11960v1 Announce Type: new Abstract: Large language models (LLMs) have significantly advanced the field of natural language processing, while the expensive memory and computation consumption impede their practical deployment. Quantization emerges as one of the most effective methods for improving the computational efficiency of LLMs. However, existing ultra-low-bit quantization always causes severe accuracy drops. In this paper, we empirically relieve the micro and macro characteristics of ultra-low bit quantization and present a novel Dual-Binarization method for LLMs, namely DB-LLM. For the micro-level, we take both the accuracy advantage of 2-bit-width and the efficiency advantage of binarization into account, introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized weights into two independent sets of binaries, FDB ensures the accuracy of representations and introduces flexibility, utilizing the efficient bitwise operations of binarization while retaining the inherent high sparsity of ultra-low bit quantization. For the macro-level, we find the distortion that exists in the prediction of LLM after quantization, which is specified as the deviations related to the ambiguity of samples. We propose the Deviation-Aware Distillation (DAD) method, enabling the model to focus differently on various samples. Comprehensive experiments show that our DB-LLM not only significantly surpasses the current State-of-The-Art (SoTA) in ultra-low bit quantization (eg, perplexity decreased from 9.64 to 7.23), but also achieves an additional 20\% reduction in computational consumption compared to the SOTA method under the same bit-width. Our code will be released soon.  ( 3 min )
    Imbalance in Regression Datasets
    arXiv:2402.11963v1 Announce Type: new Abstract: For classification, the problem of class imbalance is well known and has been extensively studied. In this paper, we argue that imbalance in regression is an equally important problem which has so far been overlooked: Due to under- and over-representations in a data set's target distribution, regressors are prone to degenerate to naive models, systematically neglecting uncommon training data and over-representing targets seen often during training. We analyse this problem theoretically and use resulting insights to develop a first definition of imbalance in regression, which we show to be a generalisation of the commonly employed imbalance measure in classification. With this, we hope to turn the spotlight on the overlooked problem of imbalance in regression and to provide common ground for future research.  ( 2 min )
    The effect of Leaky ReLUs on the training and generalization of overparameterized networks
    arXiv:2402.11942v1 Announce Type: new Abstract: We investigate the training and generalization errors of overparameterized neural networks (NNs) with a wide class of leaky rectified linear unit (ReLU) functions. More specifically, we carefully upper bound both the convergence rate of the training error and the generalization error of such NNs and investigate the dependence of these bounds on the Leaky ReLU parameter, $\alpha$. We show that $\alpha =-1$, which corresponds to the absolute value activation function, is optimal for the training error bound. Furthermore, in special settings, it is also optimal for the generalization error bound. Numerical experiments empirically support the practical choices guided by the theory.  ( 2 min )
    Mini-Hes: A Parallelizable Second-order Latent Factor Analysis Model
    arXiv:2402.11948v1 Announce Type: new Abstract: Interactions among large number of entities is naturally high-dimensional and incomplete (HDI) in many big data related tasks. Behavioral characteristics of users are hidden in these interactions, hence, effective representation of the HDI data is a fundamental task for understanding user behaviors. Latent factor analysis (LFA) model has proven to be effective in representing HDI data. The performance of an LFA model relies heavily on its training process, which is a non-convex optimization. It has been proven that incorporating local curvature and preprocessing gradients during its training process can lead to superior performance compared to LFA models built with first-order family methods. However, with the escalation of data volume, the feasibility of second-order algorithms encounters challenges. To address this pivotal issue, this paper proposes a mini-block diagonal hessian-free (Mini-Hes) optimization for building an LFA model. It leverages the dominant diagonal blocks in the generalized Gauss-Newton matrix based on the analysis of the Hessian matrix of LFA model and serves as an intermediary strategy bridging the gap between first-order and second-order optimization methods. Experiment results indicate that, with Mini-Hes, the LFA model outperforms several state-of-the-art models in addressing missing data estimation task on multiple real HDI datasets from recommender system. (The source code of Mini-Hes is available at https://github.com/Goallow/Mini-Hes)  ( 2 min )
    SLADE: Detecting Dynamic Anomalies in Edge Streams without Labels via Self-Supervised Learning
    arXiv:2402.11933v1 Announce Type: new Abstract: To detect anomalies in real-world graphs, such as social, email, and financial networks, various approaches have been developed. While they typically assume static input graphs, most real-world graphs grow over time, naturally represented as edge streams. In this context, we aim to achieve three goals: (a) instantly detecting anomalies as they occur, (b) adapting to dynamically changing states, and (c) handling the scarcity of dynamic anomaly labels. In this paper, we propose SLADE (Self-supervised Learning for Anomaly Detection in Edge Streams) for rapid detection of dynamic anomalies in edge streams, without relying on labels. SLADE detects the shifts of nodes into abnormal states by observing deviations in their interaction patterns over time. To this end, it trains a deep neural network to perform two self-supervised tasks: (a) minimizing drift in node representations and (b) generating long-term interaction patterns from short-term ones. Failure in these tasks for a node signals its deviation from the norm. Notably, the neural network and tasks are carefully designed so that all required operations can be performed in constant time (w.r.t. the graph size) in response to each new edge in the input stream. In dynamic anomaly detection across four real-world datasets, SLADE outperforms nine competing methods, even those leveraging label supervision.  ( 2 min )
    Energy-Efficient Edge Learning via Joint Data Deepening-and-Prefetching
    arXiv:2402.11925v1 Announce Type: new Abstract: The vision of pervasive artificial intelligence (AI) services can be realized by training an AI model on time using real-time data collected by internet of things (IoT) devices. To this end, IoT devices require offloading their data to an edge server in proximity. However, transmitting high-dimensional and voluminous data from energy-constrained IoT devices poses a significant challenge. To address this limitation, we propose a novel offloading architecture, called joint data deepening-and-prefetching (JD2P), which is feature-by-feature offloading comprising two key techniques. The first one is data deepening, where each data sample's features are sequentially offloaded in the order of importance determined by the data embedding technique such as principle component analysis (PCA). Offloading is terminated once the already transmitted features are sufficient for accurate data classification, resulting in a reduction in the amount of transmitted data. The criteria to offload data are derived for binary and multi-class classifiers, which are designed based on support vector machine (SVM) and deep neural network (DNN), respectively. The second one is data prefetching, where some features potentially required in the future are offloaded in advance, thus achieving high efficiency via precise prediction and parameter optimization. We evaluate the effectiveness of JD2P through experiments using the MNIST dataset, and the results demonstrate its significant reduction in expected energy consumption compared to several benchmarks without degrading learning accuracy.  ( 3 min )
    A Generative Pre-Training Framework for Spatio-Temporal Graph Transfer Learning
    arXiv:2402.11922v1 Announce Type: new Abstract: Spatio-temporal graph (STG) learning is foundational for smart city applications, yet it is often hindered by data scarcity in many cities and regions. To bridge this gap, we propose a novel generative pre-training framework, GPDiff, for STG transfer learning. Unlike conventional approaches that heavily rely on common feature extraction or intricate transfer learning designs, our solution takes a novel approach by performing generative pre-training on a collection of model parameters optimized with data from source cities. We recast STG transfer learning as pre-training a generative hypernetwork, which generates tailored model parameters guided by prompts, allowing for adaptability to diverse data distributions and city-specific characteristics. GPDiff employs a diffusion model with a transformer-based denoising network, which is model-agnostic to integrate with powerful STG models. By addressing challenges arising from data gaps and the complexity of generalizing knowledge across cities, our framework consistently outperforms state-of-the-art baselines on multiple real-world datasets for tasks such as traffic speed prediction and crowd flow prediction. The implementation of our approach is available: https://github.com/PLUTO-SCY/GPDiff.  ( 2 min )
    A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
    arXiv:2402.11917v1 Announce Type: new Abstract: Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of the internal mechanisms of transformers, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that it implements a depth-bounded recurrent mechanisms that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models.  ( 2 min )
    Finite-Time Error Analysis of Online Model-Based Q-Learning with a Relaxed Sampling Model
    arXiv:2402.11877v1 Announce Type: new Abstract: Reinforcement learning has witnessed significant advancements, particularly with the emergence of model-based approaches. Among these, $Q$-learning has proven to be a powerful algorithm in model-free settings. However, the extension of $Q$-learning to a model-based framework remains relatively unexplored. In this paper, we delve into the sample complexity of $Q$-learning when integrated with a model-based approach. Through theoretical analyses and empirical evaluations, we seek to elucidate the conditions under which model-based $Q$-learning excels in terms of sample efficiency compared to its model-free counterpart.  ( 2 min )
    Generative Semi-supervised Graph Anomaly Detection
    arXiv:2402.11887v1 Announce Type: new Abstract: This work considers a practical semi-supervised graph anomaly detection (GAD) scenario, where part of the nodes in a graph are known to be normal, contrasting to the unsupervised setting in most GAD studies with a fully unlabeled graph. As expected, we find that having access to these normal nodes helps enhance the detection performance of existing unsupervised GAD methods when they are adapted to the semi-supervised setting. However, their utilization of these normal nodes is limited. In this paper, we propose a novel Generative GAD approach (GGAD) for the semi-supervised scenario to better exploit the normal nodes. The key idea is to generate outlier nodes that assimilate anomaly nodes in both local structure and node representations for providing effective negative node samples in training a discriminative one-class classifier. There have been many generative anomaly detection approaches, but they are designed for non-graph data, and as a result, they fail to take account of the graph structure information. Our approach tackles this problem by generating graph structure-aware outlier nodes that have asymmetric affinity separability from normal nodes while being enforced to achieve egocentric closeness to normal nodes in the node representation space. Comprehensive experiments on four real-world datasets are performed to establish a benchmark for semi-supervised GAD and show that GGAD substantially outperforms state-of-the-art unsupervised and semi-supervised GAD methods with varying numbers of training normal nodes. Code will be made available at https://github.com/mala-lab/GGAD.  ( 2 min )
    LoRA Training in the NTK Regime has No Spurious Local Minima
    arXiv:2402.11867v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLM), but our theoretical understanding of LoRA has been limited. In this work, we theoretically analyze LoRA fine-tuning in the neural tangent kernel (NTK) regime with $N$ data points, showing: (i) full fine-tuning (without LoRA) admits a low-rank solution of rank $r\lesssim \sqrt{N}$; (ii) using LoRA with rank $r\gtrsim \sqrt{N}$ eliminates spurious local minima, allowing gradient descent to find the low-rank solutions; (iii) the low-rank solution found using LoRA generalizes well.  ( 2 min )
    Communication-Efficient Distributed Learning with Local Immediate Error Compensation
    arXiv:2402.11857v1 Announce Type: new Abstract: Gradient compression with error compensation has attracted significant attention with the target of reducing the heavy communication overhead in distributed learning. However, existing compression methods either perform only unidirectional compression in one iteration with higher communication cost, or bidirectional compression with slower convergence rate. In this work, we propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm to break the above bottlenecks based on bidirectional compression and carefully designed compensation approaches. Specifically, the bidirectional compression technique is to reduce the communication cost, and the compensation technique compensates the local compression error to the model update immediately while only maintaining the global error variable on the server throughout the iterations to boost its efficacy. Theoretically, we prove that LIEC-SGD is superior to previous works in either the convergence rate or the communication cost, which indicates that LIEC-SGD could inherit the dual advantages from unidirectional compression and bidirectional compression. Finally, experiments of training deep neural networks validate the effectiveness of the proposed LIEC-SGD algorithm.  ( 2 min )
    Self-Guided Robust Graph Structure Refinement
    arXiv:2402.11837v1 Announce Type: new Abstract: Recent studies have revealed that GNNs are vulnerable to adversarial attacks. To defend against such attacks, robust graph structure refinement (GSR) methods aim at minimizing the effect of adversarial edges based on node features, graph structure, or external information. However, we have discovered that existing GSR methods are limited by narrowassumptions, such as assuming clean node features, moderate structural attacks, and the availability of external clean graphs, resulting in the restricted applicability in real-world scenarios. In this paper, we propose a self-guided GSR framework (SG-GSR), which utilizes a clean sub-graph found within the given attacked graph itself. Furthermore, we propose a novel graph augmentation and a group-training strategy to handle the two technical challenges in the clean sub-graph extraction: 1) loss of structural information, and 2) imbalanced node degree distribution. Extensive experiments demonstrate the effectiveness of SG-GSR under various scenarios including non-targeted attacks, targeted attacks, feature attacks, e-commerce fraud, and noisy node labels. Our code is available at https://github.com/yeonjun-in/torch-SG-GSR.  ( 2 min )
    UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction
    arXiv:2402.11838v1 Announce Type: new Abstract: Urban spatio-temporal prediction is crucial for informed decision-making, such as transportation management, resource optimization, and urban planning. Although pretrained foundation models for natural languages have experienced remarkable breakthroughs, wherein one general-purpose model can tackle multiple tasks across various domains, urban spatio-temporal modeling lags behind. Existing approaches for urban prediction are usually tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive in-domain training data. In this work, we propose a universal model, UniST, for urban spatio-temporal prediction. Drawing inspiration from large language models, UniST achieves success through: (i) flexibility towards diverse spatio-temporal data characteristics, (ii) effective generative pre-training with elaborated masking strategies to capture complex spatio-temporal relationships, (iii) spatio-temporal knowledge-guided prompts that align and leverage intrinsic and shared knowledge across scenarios. These designs together unlock the potential of a one-for-all model for spatio-temporal prediction with powerful generalization capability. Extensive experiments on 15 cities and 6 domains demonstrate the universality of UniST in advancing state-of-the-art prediction performance, especially in few-shot and zero-shot scenarios.  ( 2 min )
    Microstructures and Accuracy of Graph Recall by Large Language Models
    arXiv:2402.11821v1 Announce Type: new Abstract: Graphs data is crucial for many applications, and much of it exists in the relations described in textual format. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall by has been studied by cognitive scientists for decades, and has been found to often exhibit certain structural patterns of bias that align with human handling of social relationships. To date, however, we know little about how LLMs behave in analogous graph recall tasks: do their recalled graphs also exhibit certain biased patterns, and if so, how do they compare with humans and affect other graph reasoning tasks? In this work, we perform the first systematical study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall. We find that LLMs not only underperform often in graph recall, but also tend to favor more triangles and alternating 2-paths. Moreover, we find that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from -- by yielding the best recall accuracy when the graph is narrated in a language style consistent with its original domain.  ( 3 min )
    Easy as ABCs: Unifying Boltzmann Q-Learning and Counterfactual Regret Minimization
    arXiv:2402.11835v1 Announce Type: new Abstract: We propose ABCs (Adaptive Branching through Child stationarity), a best-of-both-worlds algorithm combining Boltzmann Q-learning (BQL), a classic reinforcement learning algorithm for single-agent domains, and counterfactual regret minimization (CFR), a central algorithm for learning in multi-agent domains. ABCs adaptively chooses what fraction of the environment to explore each iteration by measuring the stationarity of the environment's reward and transition dynamics. In Markov decision processes, ABCs converges to the optimal policy with at most an O(A) factor slowdown compared to BQL, where A is the number of actions in the environment. In two-player zero-sum games, ABCs is guaranteed to converge to a Nash equilibrium (assuming access to a perfect oracle for detecting stationarity), while BQL has no such guarantees. Empirically, ABCs demonstrates strong performance when benchmarked across environments drawn from the OpenSpiel game library and OpenAI Gym and exceeds all prior methods in environments which are neither fully stationary nor fully nonstationary.  ( 2 min )
    Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling
    arXiv:2402.11800v1 Announce Type: new Abstract: Motivated by applications in large-scale and multi-agent reinforcement learning, we study the non-asymptotic performance of stochastic approximation (SA) schemes with delayed updates under Markovian sampling. While the effect of delays has been extensively studied for optimization, the manner in which they interact with the underlying Markov process to shape the finite-time performance of SA remains poorly understood. In this context, our first main contribution is to show that under time-varying bounded delays, the delayed SA update rule guarantees exponentially fast convergence of the \emph{last iterate} to a ball around the SA operator's fixed point. Notably, our bound is \emph{tight} in its dependence on both the maximum delay $\tau_{max}$, and the mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel inductive proof technique that, unlike various existing delayed-optimization analyses, relies on establishing uniform boundedness of the iterates. As such, our proof may be of independent interest. Next, to mitigate the impact of the maximum delay on the convergence rate, we provide the first finite-time analysis of a delay-adaptive SA scheme under Markovian sampling. In particular, we show that the exponent of convergence of this scheme gets scaled down by $\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here, $\tau_{avg}$ denotes the average delay across all iterations. Moreover, the adaptive scheme requires no prior knowledge of the delay sequence for step-size tuning. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms, including TD learning, Q-learning, and stochastic gradient descent under Markovian sampling.  ( 3 min )
    Generative Kaleidoscopic Networks
    arXiv:2402.11793v1 Announce Type: new Abstract: We discovered that the Deep ReLU networks (or Multilayer Perceptron architecture) demonstrate an 'over-generalization' phenomenon. That is, the output values for the inputs that were not seen during training are mapped close to the output range that were observed during the learning process. In other words, the MLP learns a many-to-one mapping and this effect is more prominent as we increase the number of layers or depth of the MLP. We utilize this property of Deep ReLU networks to design a dataset kaleidoscope, termed as 'Generative Kaleidoscopic Networks'. Briefly, if we learn a MLP to map from input $x\in\mathbb{R}^D$ to itself $f_\mathcal{N}(x)\rightarrow x$, the 'Kaleidoscopic sampling' procedure starts with a random input noise $z\in\mathbb{R}^D$ and recursively applies $f_\mathcal{N}(\cdots f_\mathcal{N}(z)\cdots )$. After a burn-in period duration, we start observing samples from the input distribution and we found that deeper the MLP, higher is the quality of samples recovered. Scope: We observed this phenomenon to various degrees for the other deep learning architectures like CNNs, Transformers & U-Nets and we are currently investigating them further.  ( 2 min )
    Dynamic Multi-Network Mining of Tensor Time Series
    arXiv:2402.11773v1 Announce Type: new Abstract: Subsequence clustering of time series is an essential task in data mining, and interpreting the resulting clusters is also crucial since we generally do not have prior knowledge of the data. Thus, given a large collection of tensor time series consisting of multiple modes, including timestamps, how can we achieve subsequence clustering for tensor time series and provide interpretable insights? In this paper, we propose a new method, Dynamic Multi-network Mining (DMM), that converts a tensor time series into a set of segment groups of various lengths (i.e., clusters) characterized by a dependency network constrained with l1-norm. Our method has the following properties. (a) Interpretable: it characterizes the cluster with multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode, and thus provides visible and interpretable insights into the key relationships. (b) Accurate: it discovers the clusters with distinct networks from tensor time series according to the minimum description length (MDL). (c) Scalable: it scales linearly in terms of the input data size when solving a non-convex problem to optimize the number of segments and clusters, and thus it is applicable to long-range and high-dimensional tensors. Extensive experiments with synthetic datasets confirm that our method outperforms the state-of-the-art methods in terms of clustering accuracy. We then use real datasets to demonstrate that DMM is useful for providing interpretable insights from tensor time series.  ( 3 min )
    Towards Theoretical Understandings of Self-Consuming Generative Models
    arXiv:2402.11778v1 Announce Type: new Abstract: This paper tackles the emerging challenge of training generative models within a self-consuming loop, wherein successive generations of models are recursively trained on mixtures of real and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training regimen impacts the data distributions learned by future models. Specifically, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios. Our analysis demonstrates that this distance can be effectively controlled under the condition that mixed training dataset sizes or proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by expanding synthetic data amounts, proving theoretically that while the TV distance exhibits an initial ascent, it declines beyond a threshold point. Finally, we specialize our general results to diffusion models, delivering nuanced insights such as the efficacy of optimal early stopping within the self-consuming loop.  ( 2 min )
    Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation
    arXiv:2402.11760v1 Announce Type: new Abstract: Deep learning architectures have achieved state-of-the-art (SOTA) performance on computer vision tasks such as object detection and image segmentation. This may be attributed to the use of over-parameterized, monolithic deep learning architectures executed on large datasets. Although such architectures lead to increased accuracy, this is usually accompanied by a large increase in computation and memory requirements during inference. While this is a non-issue in traditional machine learning pipelines, the recent confluence of machine learning and fields like the Internet of Things has rendered such large architectures infeasible for execution in low-resource settings. In such settings, previous efforts have proposed decision cascades where inputs are passed through models of increasing complexity until desired performance is achieved. However, we argue that cascaded prediction leads to increased computational cost due to wasteful intermediate computations. To address this, we propose PaSeR (Parsimonious Segmentation with Reinforcement Learning) a non-cascading, cost-aware learning pipeline as an alternative to cascaded architectures. Through experimental evaluation on real-world and standard datasets, we demonstrate that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models. Further, we introduce a new metric IoU/GigaFlop to evaluate the balance between cost and performance. On the real-world task of battery material phase segmentation, PaSeR yields a minimum performance improvement of 174% on the IoU/GigaFlop metric with respect to baselines. We also demonstrate PaSeR's adaptability to complementary models trained on a noisy MNIST dataset, where it achieved a minimum performance improvement on IoU/GigaFlop of 13.4% over SOTA models. Code and data are available at https://github.com/scailab/paser .  ( 3 min )
    Evaluating the Effectiveness of Index-Based Treatment Allocation
    arXiv:2402.11771v1 Announce Type: new Abstract: When resources are scarce, an allocation policy is needed to decide who receives a resource. This problem occurs, for instance, when allocating scarce medical resources and is often solved using modern ML methods. This paper introduces methods to evaluate index-based allocation policies -- that allocate a fixed number of resources to those who need them the most -- by using data from a randomized control trial. Such policies create dependencies between agents, which render the assumptions behind standard statistical tests invalid and limit the effectiveness of estimators. Addressing these challenges, we translate and extend recent ideas from the statistics literature to present an efficient estimator and methods for computing asymptotically correct confidence intervals. This enables us to effectively draw valid statistical conclusions, a critical gap in previous work. Our extensive experiments validate our methodology in practical settings, while also showcasing its statistical power. We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial to evaluate different ML allocation policies in the context of a mHealth program, drawing previously invisible conclusions.  ( 2 min )
    SPML: A DSL for Defending Language Models Against Prompt Attacks
    arXiv:2402.11755v1 Announce Type: new Abstract: Large language models (LLMs) have profoundly transformed natural language applications, with a growing reliance on instruction-based definitions for designing chatbots. However, post-deployment the chatbot definitions are fixed and are vulnerable to attacks by malicious users, emphasizing the need to prevent unethical applications and financial losses. Existing studies explore user prompts' impact on LLM-based chatbots, yet practical methods to contain attacks on application-specific chatbots remain unexplored. This paper presents System Prompt Meta Language (SPML), a domain-specific language for refining prompts and monitoring the inputs to the LLM-based chatbots. SPML actively checks attack prompts, ensuring user inputs align with chatbot definitions to prevent malicious execution on the LLM backbone, optimizing costs. It also streamlines chatbot definition crafting with programming language capabilities, overcoming natural language design challenges. Additionally, we introduce a groundbreaking benchmark with 1.8k system prompts and 20k user inputs, offering the inaugural language and benchmark for chatbot definition evaluation. Experiments across datasets demonstrate SPML's proficiency in understanding attacker prompts, surpassing models like GPT-4, GPT-3.5, and LLAMA. Our data and codes are publicly available at: https://prompt-compiler.github.io/SPML/.  ( 2 min )
    Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models via Reparameterisation and Smoothing
    arXiv:2402.11752v1 Announce Type: new Abstract: It is well-known that the reparameterisation gradient estimator, which exhibits low variance in practice, is biased for non-differentiable models. This may compromise correctness of gradient-based optimisation methods such as stochastic gradient descent (SGD). We introduce a simple syntactic framework to define non-differentiable functions piecewisely and present a systematic approach to obtain smoothings for which the reparameterisation gradient estimator is unbiased. Our main contribution is a novel variant of SGD, Diagonalisation Stochastic Gradient Descent, which progressively enhances the accuracy of the smoothed approximation during optimisation, and we prove convergence to stationary points of the unsmoothed (original) objective. Our empirical evaluation reveals benefits over the state of the art: our approach is simple, fast, stable and attains orders of magnitude reduction in work-normalised variance.  ( 2 min )
    Extraction of nonlinearity in neural networks and model compression with Koopman operator
    arXiv:2402.11740v1 Announce Type: new Abstract: Nonlinearity plays a crucial role in deep neural networks. In this paper, we first investigate the degree to which the nonlinearity of the neural network is essential. For this purpose, we employ the Koopman operator, extended dynamic mode decomposition, and the tensor-train format. The results imply that restricted nonlinearity is enough for the classification of handwritten numbers. Then, we propose a model compression method for deep neural networks, which could be beneficial to handling large networks in resource-constrained environments. Leveraging the Koopman operator, the proposed method enables us to use linear algebra in the internal processing of neural networks. We numerically show that the proposed method performs comparably or better than conventional methods in highly compressed model settings for the handwritten number recognition task.  ( 2 min )
    Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance
    arXiv:2402.11742v1 Announce Type: new Abstract: Classification models are expected to perform equally well for different classes, yet in practice, there are often large gaps in their performance. This issue of class bias is widely studied in cases of datasets with sample imbalance, but is relatively overlooked in balanced datasets. In this work, we introduce the concept of spectral imbalance in features as a potential source for class disparities and study the connections between spectral imbalance and class bias in both theory and practice. To build the connection between spectral imbalance and class gap, we develop a theoretical framework for studying class disparities and derive exact expressions for the per-class error in a high-dimensional mixture model setting. We then study this phenomenon in 11 different state-of-the-art pretrained encoders and show how our proposed framework can be used to compare the quality of encoders, as well as evaluate and combine data augmentation strategies to mitigate the issue. Our work sheds light on the class-dependent effects of learning, and provides new insights into how state-of-the-art pretrained features may have unknown biases that can be diagnosed through their spectra.  ( 2 min )
    Monte Carlo with kernel-based Gibbs measures: Guarantees for probabilistic herding
    arXiv:2402.11736v1 Announce Type: new Abstract: Kernel herding belongs to a family of deterministic quadratures that seek to minimize the worst-case integration error over a reproducing kernel Hilbert space (RKHS). In spite of strong experimental support, it has revealed difficult to prove that this worst-case error decreases at a faster rate than the standard square root of the number of quadrature nodes, at least in the usual case where the RKHS is infinite-dimensional. In this theoretical paper, we study a joint probability distribution over quadrature nodes, whose support tends to minimize the same worst-case error as kernel herding. We prove that it does outperform i.i.d. Monte Carlo, in the sense of coming with a tighter concentration inequality on the worst-case integration error. While not improving the rate yet, this demonstrates that the mathematical tools of the study of Gibbs measures can help understand to what extent kernel herding and its variants improve on computationally cheaper methods. Moreover, we provide early experimental evidence that a faster rate of convergence, though not worst-case, is likely.  ( 2 min )
    Compression Repair for Feedforward Neural Networks Based on Model Equivalence Evaluation
    arXiv:2402.11737v1 Announce Type: new Abstract: In this paper, we propose a method of repairing compressed Feedforward Neural Networks (FNNs) based on equivalence evaluation of two neural networks. In the repairing framework, a novel neural network equivalence evaluation method is developed to compute the output discrepancy between two neural networks. The output discrepancy can quantitatively characterize the output difference produced by compression procedures. Based on the computed output discrepancy, the repairing method first initializes a new training set for the compressed networks to narrow down the discrepancy between the two neural networks and improve the performance of the compressed network. Then, we repair the compressed FNN by re-training based on the training set. We apply our developed method to the MNIST dataset to demonstrate the effectiveness and advantages of our proposed repair method.  ( 2 min )
    The Effectiveness of Random Forgetting for Robust Generalization
    arXiv:2402.11733v1 Announce Type: new Abstract: Deep neural networks are susceptible to adversarial attacks, which can compromise their performance and accuracy. Adversarial Training (AT) has emerged as a popular approach for protecting neural networks against such attacks. However, a key challenge of AT is robust overfitting, where the network's robust performance on test data deteriorates with further training, thus hindering generalization. Motivated by the concept of active forgetting in the brain, we introduce a novel learning paradigm called "Forget to Mitigate Overfitting (FOMO)". FOMO alternates between the forgetting phase, which randomly forgets a subset of weights and regulates the model's information through weight reinitialization, and the relearning phase, which emphasizes learning generalizable features. Our experiments on benchmark datasets and adversarial attacks show that FOMO alleviates robust overfitting by significantly reducing the gap between the best and last robust test accuracy while improving the state-of-the-art robustness. Furthermore, FOMO provides a better trade-off between standard and robust accuracy, outperforming baseline adversarial methods. Finally, our framework is robust to AutoAttacks and increases generalization in many real-world scenarios.  ( 2 min )
    Prospector Heads: Generalized Feature Attribution for Large Models & Data
    arXiv:2402.11729v1 Announce Type: new Abstract: Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for machine learning models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based methods for feature attribution that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 49 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in the input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for machine learning models in complex domains.  ( 2 min )
    Learning the Topology and Behavior of Discrete Dynamical Systems
    arXiv:2402.11686v1 Announce Type: new Abstract: Discrete dynamical systems are commonly used to model the spread of contagions on real-world networks. Under the PAC framework, existing research has studied the problem of learning the behavior of a system, assuming that the underlying network is known. In this work, we focus on a more challenging setting: to learn both the behavior and the underlying topology of a black-box system. We show that, in general, this learning problem is computationally intractable. On the positive side, we present efficient learning methods under the PAC model when the underlying graph of the dynamical system belongs to some classes. Further, we examine a relaxed setting where the topology of an unknown system is partially observed. For this case, we develop an efficient PAC learner to infer the system and establish the sample complexity. Lastly, we present a formal analysis of the expressive power of the hypothesis class of dynamical systems where both the topology and behavior are unknown, using the well-known formalism of the Natarajan dimension. Our results provide a theoretical foundation for learning both the behavior and topology of discrete dynamical systems.  ( 2 min )
    Invertible Fourier Neural Operators for Tackling Both Forward and Inverse Problems
    arXiv:2402.11722v1 Announce Type: new Abstract: Fourier Neural Operator (FNO) is a popular operator learning method, which has demonstrated state-of-the-art performance across many tasks. However, FNO is mainly used in forward prediction, yet a large family of applications rely on solving inverse problems. In this paper, we propose an invertible Fourier Neural Operator (iFNO) that tackles both the forward and inverse problems. We designed a series of invertible Fourier blocks in the latent channel space to share the model parameters, efficiently exchange the information, and mutually regularize the learning for the bi-directional tasks. We integrated a variational auto-encoder to capture the intrinsic structures within the input space and to enable posterior inference so as to overcome challenges of illposedness, data shortage, noises, etc. We developed a three-step process for pre-training and fine tuning for efficient training. The evaluations on five benchmark problems have demonstrated the effectiveness of our approach.  ( 2 min )
    Interpretable Short-Term Load Forecasting via Multi-Scale Temporal Decomposition
    arXiv:2402.11664v1 Announce Type: new Abstract: Rapid progress in machine learning and deep learning has enabled a wide range of applications in the electricity load forecasting of power systems, for instance, univariate and multivariate short-term load forecasting. Though the strong capabilities of learning the non-linearity of the load patterns and the high prediction accuracy have been achieved, the interpretability of typical deep learning models for electricity load forecasting is less studied. This paper proposes an interpretable deep learning method, which learns a linear combination of neural networks that each attends to an input time feature. We also proposed a multi-scale time series decomposition method to deal with the complex time patterns. Case studies have been carried out on the Belgium central grid load dataset and the proposed model demonstrated better accuracy compared to the frequently applied baseline model. Specifically, the proposed multi-scale temporal decomposition achieves the best MSE, MAE and RMSE of 0.52, 0.57 and 0.72 respectively. As for interpretability, on one hand, the proposed method displays generalization capability. On the other hand, it can demonstrate not only the feature but also the temporal interpretability compared to other baseline methods. Besides, the global time feature interpretabilities are also obtained. Obtaining global feature interpretabilities allows us to catch the overall patterns, trends, and cyclicality in load data while also revealing the significance of various time-related features in forming the final outputs.  ( 3 min )
    Learning Conditional Invariances through Non-Commutativity
    arXiv:2402.11682v1 Announce Type: new Abstract: Invariance learning algorithms that conditionally filter out domain-specific random variables as distractors, do so based only on the data semantics, and not the target domain under evaluation. We show that a provably optimal and sample-efficient way of learning conditional invariances is by relaxing the invariance criterion to be non-commutatively directed towards the target domain. Under domain asymmetry, i.e., when the target domain contains semantically relevant information absent in the source, the risk of the encoder $\varphi^*$ that is optimal on average across domains is strictly lower-bounded by the risk of the target-specific optimal encoder $\Phi^*_\tau$. We prove that non-commutativity steers the optimization towards $\Phi^*_\tau$ instead of $\varphi^*$, bringing the $\mathcal{H}$-divergence between domains down to zero, leading to a stricter bound on the target risk. Both our theory and experiments demonstrate that non-commutative invariance (NCI) can leverage source domain samples to meet the sample complexity needs of learning $\Phi^*_\tau$, surpassing SOTA invariance learning algorithms for domain adaptation, at times by over $2\%$, approaching the performance of an oracle. Implementation is available at https://github.com/abhrac/nci.  ( 2 min )
    Self-evolving Autoencoder Embedded Q-Network
    arXiv:2402.11604v1 Announce Type: new Abstract: In the realm of sequential decision-making tasks, the exploration capability of a reinforcement learning (RL) agent is paramount for achieving high rewards through interactions with the environment. To enhance this crucial ability, we propose SAQN, a novel approach wherein a self-evolving autoencoder (SA) is embedded with a Q-Network (QN). In SAQN, the self-evolving autoencoder architecture adapts and evolves as the agent explores the environment. This evolution enables the autoencoder to capture a diverse range of raw observations and represent them effectively in its latent space. By leveraging the disentangled states extracted from the encoder generated latent space, the QN is trained to determine optimal actions that improve rewards. During the evolution of the autoencoder architecture, a bias-variance regulatory strategy is employed to elicit the optimal response from the RL agent. This strategy involves two key components: (i) fostering the growth of nodes to retain previously acquired knowledge, ensuring a rich representation of the environment, and (ii) pruning the least contributing nodes to maintain a more manageable and tractable latent space. Extensive experimental evaluations conducted on three distinct benchmark environments and a real-world molecular environment demonstrate that the proposed SAQN significantly outperforms state-of-the-art counterparts. The results highlight the effectiveness of the self-evolving autoencoder and its collaboration with the Q-Network in tackling sequential decision-making tasks.  ( 2 min )
    Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
    arXiv:2402.11592v1 Announce Type: new Abstract: In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow {in size}, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques, through a comprehensive, first-of-its-kind benchmarking study across five LLM families (Roberta, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving further memory-efficient LLM fine-tuning. Codes to reproduce all our experiments are at https://github.com/ZO-Bench/ZO-LLM .  ( 2 min )
    Large Language Model-driven Meta-structure Discovery in Heterogeneous Information Network
    arXiv:2402.11518v1 Announce Type: new Abstract: Heterogeneous information networks (HIN) have gained increasing popularity for being able to capture complex relations between nodes of diverse types. Meta-structure was proposed to identify important patterns of relations on HIN, which has been proven effective for extracting rich semantic information and facilitating graph neural networks to learn expressive representations. However, hand-crafted meta-structures pose challenges for scaling up, which draws wide research attention for developing automatic meta-structure search algorithms. Previous efforts concentrate on searching for meta-structures with good empirical prediction performance, overlooking explainability. Thus, they often produce meta-structures prone to overfitting and incomprehensible to humans. To address this, we draw inspiration from the emergent reasoning abilities of large language models (LLMs). We propose a novel REasoning meta-STRUCTure search (ReStruct) framework that integrates LLM reasoning into the evolutionary procedure. ReStruct uses a grammar translator to encode meta-structures into natural language sentences, and leverages the reasoning power of LLMs to evaluate semantically feasible meta-structures. ReStruct also employs performance-oriented evolutionary operations. These two competing forces jointly optimize for semantic explainability and empirical performance of meta-structures. We also design a differential LLM explainer that can produce natural language explanations for the discovered meta-structures, and refine the explanation by reasoning through the search history. Experiments on five datasets demonstrate ReStruct achieve SOTA performance in node classification and link recommendation tasks. Additionally, a survey study involving 73 graduate students shows that the meta-structures and natural language explanations generated by ReStruct are substantially more comprehensible.  ( 3 min )
    Graph Out-of-Distribution Generalization via Causal Intervention
    arXiv:2402.11494v1 Announce Type: new Abstract: Out-of-distribution (OOD) generalization has gained increasing attentions for learning on graphs, as graph neural networks (GNNs) often exhibit performance degradation with distribution shifts. The challenge is that distribution shifts on graphs involve intricate interconnections between nodes, and the environment labels are often absent in data. In this paper, we adopt a bottom-up data-generative perspective and reveal a key observation through causal analysis: the crux of GNNs' failure in OOD generalization lies in the latent confounding bias from the environment. The latter misguides the model to leverage environment-sensitive correlations between ego-graph features and target nodes' labels, resulting in undesirable generalization on new unseen nodes. Built upon this analysis, we introduce a conceptually simple yet principled approach for training robust GNNs under node-level distribution shifts, without prior knowledge of environment labels. Our method resorts to a new learning objective derived from causal inference that coordinates an environment estimator and a mixture-of-expert GNN predictor. The new approach can counteract the confounding bias in training data and facilitate learning generalizable predictive relations. Extensive experiment demonstrates that our model can effectively enhance generalization with various types of distribution shifts and yield up to 27.4\% accuracy improvement over state-of-the-arts on graph OOD generalization benchmarks. Source codes are available at https://github.com/fannie1208/CaNet.  ( 2 min )
    Optimal Parallelization Strategies for Active Flow Control in Deep Reinforcement Learning-Based Computational Fluid Dynamics
    arXiv:2402.11515v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) has emerged as a promising approach for handling highly dynamic and nonlinear Active Flow Control (AFC) problems. However, the computational cost associated with training DRL models presents a significant performance bottleneck. To address this challenge and enable efficient scaling on high-performance computing architectures, this study focuses on optimizing DRL-based algorithms in parallel settings. We validate an existing state-of-the-art DRL framework used for AFC problems and discuss its efficiency bottlenecks. Subsequently, by deconstructing the overall framework and conducting extensive scalability benchmarks for individual components, we investigate various hybrid parallelization configurations and propose efficient parallelization strategies. Moreover, we refine input/output (I/O) operations in multi-environment DRL training to tackle critical overhead associated with data movement. Finally, we demonstrate the optimized framework for a typical AFC problem where near-linear scaling can be obtained for the overall framework. We achieve a significant boost in parallel efficiency from around 49% to approximately 78%, and the training process is accelerated by approximately 47 times using 60 CPU cores. These findings are expected to provide valuable insights for further advancements in DRL-based AFC studies.  ( 2 min )
    A Curious Case of Searching for the Correlation between Training Data and Adversarial Robustness of Transformer Textual Models
    arXiv:2402.11469v1 Announce Type: new Abstract: Existing works have shown that fine-tuned textual transformer models achieve state-of-the-art prediction performances but are also vulnerable to adversarial text perturbations. Traditional adversarial evaluation is often done \textit{only after} fine-tuning the models and ignoring the training data. In this paper, we want to prove that there is also a strong correlation between training data and model robustness. To this end, we extract 13 different features representing a wide range of input fine-tuning corpora properties and use them to predict the adversarial robustness of the fine-tuned models. Focusing mostly on encoder-only transformer models BERT and RoBERTa with additional results for BART, ELECTRA and GPT2, we provide diverse evidence to support our argument. First, empirical analyses show that (a) extracted features can be used with a lightweight classifier such as Random Forest to effectively predict the attack success rate and (b) features with the most influence on the model robustness have a clear correlation with the robustness. Second, our framework can be used as a fast and effective additional tool for robustness evaluation since it (a) saves 30x-193x runtime compared to the traditional technique, (b) is transferable across models, (c) can be used under adversarial training, and (d) robust to statistical randomness. Our code will be publicly available.  ( 3 min )
    Attractor Memory for Long-Term Time Series Forecasting: A Chaos Perspective
    arXiv:2402.11463v1 Announce Type: new Abstract: In long-term time series forecasting (LTSF) tasks, existing deep learning models overlook the crucial characteristic that discrete time series originate from underlying continuous dynamic systems, resulting in a lack of extrapolation and evolution capabilities. Recognizing the chaotic nature of real-world data, our model, \textbf{\textit{Attraos}}, incorporates chaos theory into LTSF, perceiving real-world time series as observations from unknown high-dimensional chaotic dynamic systems. Under the concept of attractor invariance, Attraos utilizes the proposed multi-scale dynamic memory unit to memorize historical dynamics structure and predicts by a frequency-enhanced local evolution strategy. Detailed theoretical analysis and abundant empirical evidence consistently show that Attraos outperforms various LTSF methods on mainstream LTSF datasets and chaotic datasets.  ( 2 min )
    Improved Indoor Localization with Machine Learning Techniques for IoT applications
    arXiv:2402.11433v1 Announce Type: new Abstract: The rise of the Internet of Things (IoT) and mobile internet applications has spurred interest in location-based services (LBS) for commercial, military, and social applications. While the global positioning system (GPS) dominates outdoor localization, its efficacy wanes indoors due to signal challenges. Indoor localization systems leverage wireless technologies like Wi-Fi, ZigBee, Bluetooth, UWB, selecting based on context. Received signal strength indicator (RSSI) technology, known for its accuracy and simplicity, is widely adopted. This study employs machine learning algorithms in three phases: supervised regressors, supervised classifiers, and ensemble methods for RSSI-based indoor localization. Additionally, it introduces a weighted least squares technique and pseudo-linear solution approach to address non-linear RSSI measurement equations by approximating them with linear equations. An experimental testbed, utilizing diverse wireless technologies and anchor nodes, is designed for data collection, employing IoT cloud architectures. Pre-processing involves investigating filters for data refinement before algorithm training. The study employs machine learning models like linear regression, polynomial regression, support vector regression, random forest regression, and decision tree regressor across various wireless technologies. These models estimate the geographical coordinates of a moving target node, and their performance is evaluated using metrics such as accuracy, root mean square errors, precision, recall, sensitivity, coefficient of determinant, and the f1-score. The experiment's outcomes provide insights into the effectiveness of different supervised machine learning techniques in terms of localization accuracy and robustness in indoor environments.  ( 2 min )
    OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations
    arXiv:2402.11427v1 Announce Type: new Abstract: First-order optimization (FOO) algorithms are pivotal in numerous computational domains such as machine learning and signal denoising. However, their application to complex tasks like neural network training often entails significant inefficiencies due to the need for many sequential iterations for convergence. In response, we introduce first-order optimization expedited with approximately parallelized iterations (OptEx), the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck. OptEx employs kernelized gradient estimation to make use of gradient history for future gradient prediction, enabling parallelization of iterations -- a strategy once considered impractical because of the inherent iterative dependency in FOO. We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the iteration complexity of SGD-based OptEx, confirming that estimation errors diminish to zero as historical gradients accumulate and that SGD-based OptEx enjoys an effective acceleration rate of $\Omega(\sqrt{N})$ over standard SGD given parallelism of N. We also use extensive empirical studies, including synthetic functions, reinforcement learning tasks, and neural network training across various datasets, to underscore the substantial efficiency improvements achieved by OptEx.  ( 2 min )
    Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
    arXiv:2402.11411v1 Announce Type: new Abstract: Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue, tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to inject plausible hallucinations into the correct answer. Second, we distort the image to trigger the inherent hallucination behavior of the VLLM. This is an automated approach, which does not rely on human data generation or require a perfect expert, which makes it easily scalable. Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization. In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches. Our data and code are available at https://github.com/YiyangZhou/POVID.  ( 3 min )
    An Elementary Predictor Obtaining $2\sqrt{T}$ Distance to Calibration
    arXiv:2402.11410v1 Announce Type: new Abstract: Blasiok et al. [2023] proposed distance to calibration as a natural measure of calibration error that unlike expected calibration error (ECE) is continuous. Recently, Qiao and Zheng [2024] gave a non-constructive argument establishing the existence of an online predictor that can obtain $O(\sqrt{T})$ distance to calibration in the adversarial setting, which is known to be impossible for ECE. They leave as an open problem finding an explicit, efficient algorithm. We resolve this problem and give an extremely simple, efficient, deterministic algorithm that obtains distance to calibration error at most $2\sqrt{T}$.  ( 2 min )
    Random Projection Neural Networks of Best Approximation: Convergence theory and practical applications
    arXiv:2402.11397v1 Announce Type: new Abstract: We investigate the concept of Best Approximation for Feedforward Neural Networks (FNN) and explore their convergence properties through the lens of Random Projection (RPNNs). RPNNs have predetermined and fixed, once and for all, internal weights and biases, offering computational efficiency. We demonstrate that there exists a choice of external weights, for any family of such RPNNs, with non-polynomial infinitely differentiable activation functions, that exhibit an exponential convergence rate when approximating any infinitely differentiable function. For illustration purposes, we test the proposed RPNN-based function approximation, with parsimoniously chosen basis functions, across five benchmark function approximation problems. Results show that RPNNs achieve comparable performance to established methods such as Legendre Polynomials, highlighting their potential for efficient and accurate function approximation.  ( 2 min )
    Exploiting T-norms for Deep Learning in Autonomous Driving
    arXiv:2402.11362v1 Announce Type: new Abstract: Deep learning has been at the core of the autonomous driving field development, due to the neural networks' success in finding patterns in raw data and turning them into accurate predictions. Moreover, recent neuro-symbolic works have shown that incorporating the available background knowledge about the problem at hand in the loss function via t-norms can further improve the deep learning models' performance. However, t-norm-based losses may have very high memory requirements and, thus, they may be impossible to apply in complex application domains like autonomous driving. In this paper, we show how it is possible to define memory-efficient t-norm-based losses, allowing for exploiting t-norms for the task of event detection in autonomous driving. We conduct an extensive experimental analysis on the ROAD-R dataset and show (i) that our proposal can be implemented and run on GPUs with less than 25 GiB of available memory, while standard t-norm-based losses are estimated to require more than 100 GiB, far exceeding the amount of memory normally available, (ii) that t-norm-based losses improve performance, especially when limited labelled data are available, and (iii) that t-norm-based losses can further improve performance when exploited on both labelled and unlabelled data.  ( 2 min )
    Data-Driven Stochastic AC-OPF using Gaussian Processes
    arXiv:2402.11365v1 Announce Type: new Abstract: The thesis focuses on developing a data-driven algorithm, based on machine learning, to solve the stochastic alternating current (AC) chance-constrained (CC) Optimal Power Flow (OPF) problem. Although the AC CC-OPF problem has been successful in academic circles, it is highly nonlinear and computationally demanding, which limits its practical impact. The proposed approach aims to address this limitation and demonstrate its empirical efficiency through applications to multiple IEEE test cases. To solve the non-convex and computationally challenging CC AC-OPF problem, the proposed approach relies on a machine learning Gaussian process regression (GPR) model. The full Gaussian process (GP) approach is capable of learning a simple yet non-convex data-driven approximation to the AC power flow equations that can incorporate uncertain inputs. The proposed approach uses various approximations for GP-uncertainty propagation. The full GP CC-OPF approach exhibits highly competitive and promising results, outperforming the state-of-the-art sample-based chance constraint approaches. To further improve the robustness and complexity/accuracy trade-off of the full GP CC-OPF, a fast data-driven setup is proposed. This setup relies on the sparse and hybrid Gaussian processes (GP) framework to model the power flow equations with input uncertainty.  ( 2 min )
    Probabilistic Routing for Graph-Based Approximate Nearest Neighbor Search
    arXiv:2402.11354v1 Announce Type: new Abstract: Approximate nearest neighbor search (ANNS) in high-dimensional spaces is a pivotal challenge in the field of machine learning. In recent years, graph-based methods have emerged as the superior approach to ANNS, establishing a new state of the art. Although various optimizations for graph-based ANNS have been introduced, they predominantly rely on heuristic methods that lack formal theoretical backing. This paper aims to enhance routing within graph-based ANNS by introducing a method that offers a probabilistic guarantee when exploring a node's neighbors in the graph. We formulate the problem as probabilistic routing and develop two baseline strategies by incorporating locality-sensitive techniques. Subsequently, we introduce PEOs, a novel approach that efficiently identifies which neighbors in the graph should be considered for exact distance computation, thus significantly improving efficiency in practice. Our experiments demonstrate that equipping PEOs can increase throughput on a commonly utilized graph index (HNSW) by a factor of 1.6 to 2.5, and its efficiency consistently outperforms the leading-edge routing technique by 1.1 to 1.4 times.  ( 2 min )
    Fair Classification with Partial Feedback: An Exploration-Based Data-Collection Approach
    arXiv:2402.11338v1 Announce Type: new Abstract: In many predictive contexts (e.g., credit lending), true outcomes are only observed for samples that were positively classified in the past. These past observations, in turn, form training datasets for classifiers that make future predictions. However, such training datasets lack information about the outcomes of samples that were (incorrectly) negatively classified in the past and can lead to erroneous classifiers. We present an approach that trains a classifier using available data and comes with a family of exploration strategies to collect outcome data about subpopulations that otherwise would have been ignored. For any exploration strategy, the approach comes with guarantees that (1) all sub-populations are explored, (2) the fraction of false positives is bounded, and (3) the trained classifier converges to a "desired" classifier. The right exploration strategy is context-dependent; it can be chosen to improve learning guarantees and encode context-specific group fairness properties. Evaluation on real-world datasets shows that this approach consistently boosts the quality of collected outcome data and improves the fraction of true positives for all groups, with only a small reduction in predictive utility.  ( 2 min )
    Expressive Higher-Order Link Prediction through Hypergraph Symmetry Breaking
    arXiv:2402.11339v1 Announce Type: new Abstract: A hypergraph consists of a set of nodes along with a collection of subsets of the nodes called hyperedges. Higher-order link prediction is the task of predicting the existence of a missing hyperedge in a hypergraph. A hyperedge representation learned for higher order link prediction is fully expressive when it does not lose distinguishing power up to an isomorphism. Many existing hypergraph representation learners, are bounded in expressive power by the Generalized Weisfeiler Lehman-1 (GWL-1) algorithm, a generalization of the Weisfeiler Lehman-1 algorithm. However, GWL-1 has limited expressive power. In fact, induced subhypergraphs with identical GWL-1 valued nodes are indistinguishable. Furthermore, message passing on hypergraphs can already be computationally expensive, especially on GPU memory. To address these limitations, we devise a preprocessing algorithm that can identify certain regular subhypergraphs exhibiting symmetry. Our preprocessing algorithm runs once with complexity the size of the input hypergraph. During training, we randomly replace subhypergraphs identified by the algorithm with covering hyperedges to break symmetry. We show that our method improves the expressivity of GWL-1. Our extensive experiments also demonstrate the effectiveness of our approach for higher-order link prediction on both graph and hypergraph datasets with negligible change in computation.  ( 2 min )
    Debiased Offline Representation Learning for Fast Online Adaptation in Non-stationary Dynamics
    arXiv:2402.11317v1 Announce Type: new Abstract: Developing policies that can adjust to non-stationary environments is essential for real-world reinforcement learning applications. However, learning such adaptable policies in offline settings, with only a limited set of pre-collected trajectories, presents significant challenges. A key difficulty arises because the limited offline data makes it hard for the context encoder to differentiate between changes in the environment dynamics and shifts in the behavior policy, often leading to context misassociations. To address this issue, we introduce a novel approach called Debiased Offline Representation for fast online Adaptation (DORA). DORA incorporates an information bottleneck principle that maximizes mutual information between the dynamics encoding and the environmental data, while minimizing mutual information between the dynamics encoding and the actions of the behavior policy. We present a practical implementation of DORA, leveraging tractable bounds of the information bottleneck principle. Our experimental evaluation across six benchmark MuJoCo tasks with variable parameters demonstrates that DORA not only achieves a more precise dynamics encoding but also significantly outperforms existing baselines in terms of performance.  ( 2 min )
    BiasBuster: a Neural Approach for Accurate Estimation of Population Statistics using Biased Location Data
    arXiv:2402.11318v1 Announce Type: new Abstract: While extremely useful (e.g., for COVID-19 forecasting and policy-making, urban mobility analysis and marketing, and obtaining business insights), location data collected from mobile devices often contain data from a biased population subset, with some communities over or underrepresented in the collected datasets. As a result, aggregate statistics calculated from such datasets (as is done by various companies including Safegraph, Google, and Facebook), while ignoring the bias, leads to an inaccurate representation of population statistics. Such statistics will not only be generally inaccurate, but the error will disproportionately impact different population subgroups (e.g., because they ignore the underrepresented communities). This has dire consequences, as these datasets are used for sensitive decision-making such as COVID-19 policymaking. This paper tackles the problem of providing accurate population statistics using such biased datasets. We show that statistical debiasing, although in some cases useful, often fails to improve accuracy. We then propose BiasBuster, a neural network approach that utilizes the correlations between population statistics and location characteristics to provide accurate estimates of population statistics. Extensive experiments on real-world data show that BiasBuster improves accuracy by up to 2 times in general and up to 3 times for underrepresented populations.  ( 3 min )
    Aligning Large Language Models by On-Policy Self-Judgment
    arXiv:2402.11253v1 Announce Type: new Abstract: To align large language models with human preferences, existing research either utilizes a separate reward model (RM) to perform on-policy learning or simplifies the training procedure by discarding the on-policy learning and the need for a separate RM. In this paper, we present a novel alignment framework, SELF-JUDGE that is (1) on-policy learning and 2) parameter efficient, as it does not require an additional RM for evaluating the samples for on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model acting as both a policy and a judge. Specifically, we view the pairwise judgment task as a special case of the instruction-following task, choosing the better response from a response pair. Thus, the resulting model can judge preferences of on-the-fly responses from current policy initialized from itself. Experimental results show the efficacy of SELF-JUDGE, outperforming baselines in preference benchmarks. We also show that self-rejection with oversampling can improve further without an additional evaluator. Our code is available at https://github.com/oddqueue/self-judge.  ( 2 min )
    ZeroG: Investigating Cross-dataset Zero-shot Transferability in Graphs
    arXiv:2402.11235v1 Announce Type: new Abstract: With the development of foundation models such as large language models, zero-shot transfer learning has become increasingly significant. This is highlighted by the generative capabilities of NLP models like GPT-4, and the retrieval-based approaches of CV models like CLIP, both of which effectively bridge the gap between seen and unseen data. In the realm of graph learning, the continuous emergence of new graphs and the challenges of human labeling also amplify the necessity for zero-shot transfer learning, driving the exploration of approaches that can generalize across diverse graph data without necessitating dataset-specific and label-specific fine-tuning. In this study, we extend such paradigms to zero-shot transferability in graphs by introducing ZeroG, a new framework tailored to enable cross-dataset generalization. Addressing the inherent challenges such as feature misalignment, mismatched label spaces, and negative transfer, we leverage a language model to encode both node attributes and class semantics, ensuring consistent feature dimensions across datasets. We also propose a prompt-based subgraph sampling module that enriches the semantic information and structure information of extracted subgraphs using prompting nodes and neighborhood aggregation, respectively. We further adopt a lightweight fine-tuning strategy that reduces the risk of overfitting and maintains the zero-shot learning efficacy of the language model. The results underscore the effectiveness of our model in achieving significant cross-dataset zero-shot transferability, opening pathways for the development of graph foundation models. Especially, ZeroG, as a zero-shot method, can even achieve results comparable to those of semi-supervised learning on Pubmed.  ( 3 min )
    Be Persistent: Towards a Unified Solution for Mitigating Shortcuts in Deep Learning
    arXiv:2402.11237v1 Announce Type: new Abstract: Deep neural networks (DNNs) are vulnerable to shortcut learning: rather than learning the intended task, they tend to draw inconclusive relationships between their inputs and outputs. Shortcut learning is ubiquitous among many failure cases of neural networks, and traces of this phenomenon can be seen in their generalizability issues, domain shift, adversarial vulnerability, and even bias towards majority groups. In this paper, we argue that this commonality in the cause of various DNN issues creates a significant opportunity that should be leveraged to find a unified solution for shortcut learning. To this end, we outline the recent advances in topological data analysis~(TDA), and persistent homology~(PH) in particular, to sketch a unified roadmap for detecting shortcuts in deep learning. We demonstrate our arguments by investigating the topological features of computational graphs in DNNs using two cases of unlearnable examples and bias in decision-making as our test studies. Our analysis of these two failure cases of DNNs reveals that finding a unified solution for shortcut learning in DNNs is not out of reach, and TDA can play a significant role in forming such a framework.  ( 2 min )
    HEAL: Brain-inspired Hyperdimensional Efficient Active Learning
    arXiv:2402.11223v1 Announce Type: new Abstract: Drawing inspiration from the outstanding learning capability of our human brains, Hyperdimensional Computing (HDC) emerges as a novel computing paradigm, and it leverages high-dimensional vector presentation and operations for brain-like lightweight Machine Learning (ML). Practical deployments of HDC have significantly enhanced the learning efficiency compared to current deep ML methods on a broad spectrum of applications. However, boosting the data efficiency of HDC classifiers in supervised learning remains an open question. In this paper, we introduce Hyperdimensional Efficient Active Learning (HEAL), a novel Active Learning (AL) framework tailored for HDC classification. HEAL proactively annotates unlabeled data points via uncertainty and diversity-guided acquisition, leading to a more efficient dataset annotation and lowering labor costs. Unlike conventional AL methods that only support classifiers built upon deep neural networks (DNN), HEAL operates without the need for gradient or probabilistic computations. This allows it to be effortlessly integrated with any existing HDC classifier architecture. The key design of HEAL is a novel approach for uncertainty estimation in HDC classifiers through a lightweight HDC ensemble with prior hypervectors. Additionally, by exploiting hypervectors as prototypes (i.e., compact representations), we develop an extra metric for HEAL to select diverse samples within each batch for annotation. Our evaluation shows that HEAL surpasses a diverse set of baselines in AL quality and achieves notably faster acquisition than many BNN-powered or diversity-guided AL methods, recording 11 times to 40,000 times speedup in acquisition runtime per batch.  ( 2 min )
    Neural Networks with (Low-Precision) Polynomial Approximations: New Insights and Techniques for Accuracy Improvement
    arXiv:2402.11224v1 Announce Type: new Abstract: Replacing non-polynomial functions (e.g., non-linear activation functions such as ReLU) in a neural network with their polynomial approximations is a standard practice in privacy-preserving machine learning. The resulting neural network, called polynomial approximation of neural network (PANN) in this paper, is compatible with advanced cryptosystems to enable privacy-preserving model inference. Using ``highly precise'' approximation, state-of-the-art PANN offers similar inference accuracy as the underlying backbone model. However, little is known about the effect of approximation, and existing literature often determined the required approximation precision empirically. In this paper, we initiate the investigation of PANN as a standalone object. Specifically, our contribution is two-fold. Firstly, we provide an explanation on the effect of approximate error in PANN. In particular, we discovered that (1) PANN is susceptible to some type of perturbations; and (2) weight regularisation significantly reduces PANN's accuracy. We support our explanation with experiments. Secondly, based on the insights from our investigations, we propose solutions to increase inference accuracy for PANN. Experiments showed that combination of our solutions is very effective: at the same precision, our PANN is 10% to 50% more accurate than state-of-the-arts; and at the same accuracy, our PANN only requires a precision of $2^{-9}$ while state-of-the-art solution requires a precision of $2^{-12}$ using the ResNet-20 model on CIFAR-10 dataset.  ( 2 min )
    AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
    arXiv:2402.11215v1 Announce Type: new Abstract: The choice of batch sizes in stochastic gradient optimizers is critical for model training. However, the practice of varying batch sizes throughout the training process is less explored compared to other hyperparameters. We investigate adaptive batch size strategies derived from adaptive sampling methods, traditionally applied only in stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ for finding a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. Our theoretical claims are supported by numerical experiments on various image classification tasks, highlighting the enhanced adaptability of progressive batching protocols in deep learning and the potential of such adaptive batch size strategies with adaptive gradient optimizers in large-scale model training.  ( 2 min )
    Achieving Linear Speedup in Asynchronous Federated Learning with Heterogeneous Clients
    arXiv:2402.11198v1 Announce Type: new Abstract: Federated learning (FL) is an emerging distributed training paradigm that aims to learn a common global model without exchanging or transferring the data that are stored locally at different clients. The Federated Averaging (FedAvg)-based algorithms have gained substantial popularity in FL to reduce the communication overhead, where each client conducts multiple localized iterations before communicating with a central server. In this paper, we focus on FL where the clients have diverse computation and/or communication capabilities. Under this circumstance, FedAvg can be less efficient since it requires all clients that participate in the global aggregation in a round to initiate iterations from the latest global model, and thus the synchronization among fast clients and straggler clients can severely slow down the overall training process. To address this issue, we propose an efficient asynchronous federated learning (AFL) framework called Delayed Federated Averaging (DeFedAvg). In DeFedAvg, the clients are allowed to perform local training with different stale global models at their own paces. Theoretical analyses demonstrate that DeFedAvg achieves asymptotic convergence rates that are on par with the results of FedAvg for solving nonconvex problems. More importantly, DeFedAvg is the first AFL algorithm that provably achieves the desirable linear speedup property, which indicates its high scalability. Additionally, we carry out extensive numerical experiments using real datasets to validate the efficiency and scalability of our approach when training deep neural networks.  ( 3 min )
    Minimally Supervised Topological Projections of Self-Organizing Maps for Phase of Flight Identification
    arXiv:2402.11185v1 Announce Type: new Abstract: Identifying phases of flight is important in the field of general aviation, as knowing which phase of flight data is collected from aircraft flight data recorders can aid in the more effective detection of safety or hazardous events. General aviation flight data for phase of flight identification is usually per-second data, comes on a large scale, and is class imbalanced. It is expensive to manually label the data and training classification models usually faces class imbalance problems. This work investigates the use of a novel method for minimally supervised self-organizing maps (MS-SOMs) which utilize nearest neighbor majority votes in the SOM U-matrix for class estimation. Results show that the proposed method can reach or exceed a naive SOM approach which utilized a full data file of labeled data, with only 30 labeled datapoints per class. Additionally, the minimally supervised SOM is significantly more robust to the class imbalance of the phase of flight data. These results highlight how little data is required for effective phase of flight identification.  ( 2 min )
    Maintaining Adversarial Robustness in Continuous Learning
    arXiv:2402.11196v1 Announce Type: new Abstract: Adversarial robustness is essential for security and reliability of machine learning systems. However, the adversarial robustness gained by sophisticated defense algorithms is easily erased as the neural network evolves to learn new tasks. This vulnerability can be addressed by fostering a novel capability for neural networks, termed continual robust learning, which focuses on both the (classification) performance and adversarial robustness on previous tasks during continuous learning. To achieve continuous robust learning, we propose an approach called Double Gradient Projection that projects the gradients for weight updates orthogonally onto two crucial subspaces -- one for stabilizing the smoothed sample gradients and another for stabilizing the final outputs of the neural network. The experimental results on four benchmarks demonstrate that the proposed approach effectively maintains continuous robustness against strong adversarial attacks, outperforming the baselines formed by combining the existing defense strategies and continual learning methods.  ( 2 min )
    How to Make the Gradients Small Privately: Improved Rates for Differentially Private Non-Convex Optimization
    arXiv:2402.11173v1 Announce Type: new Abstract: We provide a simple and flexible framework for designing differentially private algorithms to find approximate stationary points of non-convex loss functions. Our framework is based on using a private approximate risk minimizer to "warm start" another private algorithm for finding stationary points. We use this framework to obtain improved, and sometimes optimal, rates for several classes of non-convex loss functions. First, we obtain improved rates for finding stationary points of smooth non-convex empirical loss functions. Second, we specialize to quasar-convex functions, which generalize star-convex functions and arise in learning dynamical systems and training some neural nets. We achieve the optimal rate for this class. Third, we give an optimal algorithm for finding stationary points of functions satisfying the Kurdyka-Lojasiewicz (KL) condition. For example, over-parameterized neural networks often satisfy this condition. Fourth, we provide new state-of-the-art rates for stationary points of non-convex population loss functions. Fifth, we obtain improved rates for non-convex generalized linear models. A modification of our algorithm achieves nearly the same rates for second-order stationary points of functions with Lipschitz Hessian, improving over the previous state-of-the-art for each of the above problems.  ( 2 min )
    Uncertainty Quantification of Graph Convolution Neural Network Models of Evolving Processes
    arXiv:2402.11179v1 Announce Type: new Abstract: The application of neural network models to scientific machine learning tasks has proliferated in recent years. In particular, neural network models have proved to be adept at modeling processes with spatial-temporal complexity. Nevertheless, these highly parameterized models have garnered skepticism in their ability to produce outputs with quantified error bounds over the regimes of interest. Hence there is a need to find uncertainty quantification methods that are suitable for neural networks. In this work we present comparisons of the parametric uncertainty quantification of neural networks modeling complex spatial-temporal processes with Hamiltonian Monte Carlo and Stein variational gradient descent and its projected variant. Specifically we apply these methods to graph convolutional neural network models of evolving systems modeled with recurrent neural network and neural ordinary differential equations architectures. We show that Stein variational inference is a viable alternative to Monte Carlo methods with some clear advantages for complex neural network models. For our exemplars, Stein variational interference gave similar uncertainty profiles through time compared to Hamiltonian Monte Carlo, albeit with generally more generous variance.Projected Stein variational gradient descent also produced similar uncertainty profiles to the non-projected counterpart, but large reductions in the active weight space were confounded by the stability of the neural network predictions and the convoluted likelihood landscape.  ( 3 min )
    Trust Regions for Explanations via Black-Box Probabilistic Certification
    arXiv:2402.11168v1 Announce Type: new Abstract: Given the black box nature of machine learning models, a plethora of explainability methods have been developed to decipher the factors behind individual decisions. In this paper, we introduce a novel problem of black box (probabilistic) explanation certification. We ask the question: Given a black box model with only query access, an explanation for an example and a quality metric (viz. fidelity, stability), can we find the largest hypercube (i.e., $\ell_{\infty}$ ball) centered at the example such that when the explanation is applied to all examples within the hypercube, (with high probability) a quality criterion is met (viz. fidelity greater than some value)? Being able to efficiently find such a \emph{trust region} has multiple benefits: i) insight into model behavior in a \emph{region}, with a \emph{guarantee}; ii) ascertained \emph{stability} of the explanation; iii) \emph{explanation reuse}, which can save time, energy and money by not having to find explanations for every example; and iv) a possible \emph{meta-metric} to compare explanation methods. Our contributions include formalizing this problem, proposing solutions, providing theoretical guarantees for these solutions that are computable, and experimentally showing their efficacy on synthetic and real data.  ( 2 min )
    Beyond Generalization: A Survey of Out-Of-Distribution Adaptation on Graphs
    arXiv:2402.11153v1 Announce Type: new Abstract: Distribution shifts on graphs -- the data distribution discrepancies between training and testing a graph machine learning model, are often ubiquitous and unavoidable in real-world scenarios. Such shifts may severely deteriorate the performance of the model, posing significant challenges for reliable graph machine learning. Consequently, there has been a surge in research on graph Out-Of-Distribution (OOD) adaptation methods that aim to mitigate the distribution shifts and adapt the knowledge from one distribution to another. In our survey, we provide an up-to-date and forward-looking review of graph OOD adaptation methods, covering two main problem scenarios including training-time as well as test-time graph OOD adaptation. We start by formally formulating the two problems and then discuss different types of distribution shifts on graphs. Based on our proposed taxonomy for graph OOD adaptation, we systematically categorize the existing methods according to their learning paradigm and investigate the techniques behind them. Finally, we point out promising research directions and the corresponding challenges. We also provide a continuously updated reading list at https://github.com/kaize0409/Awesome-Graph-OOD-Adaptation.git  ( 2 min )
    LiGNN: Graph Neural Networks at LinkedIn
    arXiv:2402.11139v1 Announce Type: new Abstract: In this paper, we present LiGNN, a deployed large-scale Graph Neural Networks (GNNs) Framework. We share our insight on developing and deployment of GNNs at large scale at LinkedIn. We present a set of algorithmic improvements to the quality of GNN representation learning including temporal graph architectures with long term losses, effective cold start solutions via graph densification, ID embeddings and multi-hop neighbor sampling. We explain how we built and sped up by 7x our large-scale training on LinkedIn graphs with adaptive sampling of neighbors, grouping and slicing of training data batches, specialized shared-memory queue and local gradient optimization. We summarize our deployment lessons and learnings gathered from A/B test experiments. The techniques presented in this work have contributed to an approximate relative improvements of 1% of Job application hearing back rate, 2% Ads CTR lift, 0.5% of Feed engaged daily active users, 0.2% session lift and 0.1% weekly active user lift from people recommendation. We believe that this work can provide practical solutions and insights for engineers who are interested in applying Graph neural networks at large scale.  ( 2 min )
    Knowledge Distillation Based on Transformed Teacher Matching
    arXiv:2402.11148v1 Announce Type: new Abstract: As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both teacher's logits and student's logits in KD. Motivated by some recent works, in this paper, we drop instead temperature scaling on the student side, and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of probability distribution, we show that in comparison with the original KD, TTM has an inherent R\'enyi entropy term in its objective function, which serves as an extra regularization term. Extensive experiment results demonstrate that thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance student's capability to match teacher's power transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). It is shown, by comprehensive experiments, that although WTTM is simple, it is effective, improves upon TTM, and achieves state-of-the-art accuracy performance. Our source code is available at https://github.com/zkxufo/TTM.  ( 2 min )
    Kolmogorov n-Widths for Multitask Physics-Informed Machine Learning (PIML) Methods: Towards Robust Metrics
    arXiv:2402.11126v1 Announce Type: new Abstract: Physics-informed machine learning (PIML) as a means of solving partial differential equations (PDE) has garnered much attention in the Computational Science and Engineering (CS&E) world. This topic encompasses a broad array of methods and models aimed at solving a single or a collection of PDE problems, called multitask learning. PIML is characterized by the incorporation of physical laws into the training process of machine learning models in lieu of large data when solving PDE problems. Despite the overall success of this collection of methods, it remains incredibly difficult to analyze, benchmark, and generally compare one approach to another. Using Kolmogorov n-widths as a measure of effectiveness of approximating functions, we judiciously apply this metric in the comparison of various multitask PIML architectures. We compute lower accuracy bounds and analyze the model's learned basis functions on various PDE problems. This is the first objective metric for comparing multitask PIML architectures and helps remove uncertainty in model validation from selective sampling and overfitting. We also identify avenues of improvement for model architectures, such as the choice of activation function, which can drastically affect model generalization to "worst-case" scenarios, which is not observed when reporting task-specific errors. We also incorporate this metric into the optimization process through regularization, which improves the models' generalizability over the multitask PDE problem.  ( 2 min )
    TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks
    arXiv:2402.11137v1 Announce Type: new Abstract: While tabular classification has traditionally relied on from-scratch training, a recent breakthrough called prior-data fitted networks (PFNs) challenges this approach. Similar to large language models, PFNs make use of pretraining and in-context learning to achieve strong performance on new tasks in a single forward pass. However, current PFNs have limitations that prohibit their widespread adoption. Notably, TabPFN achieves very strong performance on small tabular datasets but is not designed to make predictions for datasets of size larger than 1000. In this work, we overcome these limitations and substantially improve the performance of PFNs by developing context optimization techniques for PFNs. Specifically, we propose TuneTables, a novel prompt-tuning strategy that compresses large datasets into a smaller learned context. TuneTables scales TabPFN to be competitive with state-of-the-art tabular classification methods on larger datasets, while having a substantially lower inference time than TabPFN. Furthermore, we show that TuneTables can be used as an interpretability tool and can even be used to mitigate biases by optimizing a fairness objective.  ( 2 min )
    Disentanglement in Implicit Causal Models via Switch Variable
    arXiv:2402.11124v1 Announce Type: new Abstract: Learning causal representations from observational and interventional data in the absence of known ground-truth graph structures necessitates implicit latent causal representation learning. Implicitly learning causal mechanisms typically involves two categories of interventional data: hard and soft interventions. In real-world scenarios, soft interventions are often more realistic than hard interventions, as the latter require fully controlled environments. Unlike hard interventions, which directly force changes in a causal variable, soft interventions exert influence indirectly by affecting the causal mechanism. In this paper, we tackle implicit latent causal representation learning in a Variational Autoencoder (VAE) framework through soft interventions. Our approach models soft interventions effects by employing a causal mechanism switch variable designed to toggle between different causal mechanisms. In our experiments, we consistently observe improved learning of identifiable, causal representations, compared to baseline approaches.  ( 2 min )
    Optimizing Warfarin Dosing Using Contextual Bandit: An Offline Policy Learning and Evaluation Method
    arXiv:2402.11123v1 Announce Type: new Abstract: Warfarin, an anticoagulant medication, is formulated to prevent and address conditions associated with abnormal blood clotting, making it one of the most prescribed drugs globally. However, determining the suitable dosage remains challenging due to individual response variations, and prescribing an incorrect dosage may lead to severe consequences. Contextual bandit and reinforcement learning have shown promise in addressing this issue. Given the wide availability of observational data and safety concerns of decision-making in healthcare, we focused on using exclusively observational data from historical policies as demonstrations to derive new policies; we utilized offline policy learning and evaluation in a contextual bandit setting to establish the optimal personalized dosage strategy. Our learned policies surpassed these baseline approaches without genotype inputs, even when given a suboptimal demonstration, showcasing promising application potential.  ( 2 min )
    Private PAC Learning May be Harder than Online Learning
    arXiv:2402.11119v1 Announce Type: new Abstract: We continue the study of the computational complexity of differentially private PAC learning and how it is situated within the foundations of machine learning. A recent line of work uncovered a qualitative equivalence between the private PAC model and Littlestone's mistake-bounded model of online learning, in particular, showing that any concept class of Littlestone dimension $d$ can be privately PAC learned using $\mathrm{poly}(d)$ samples. This raises the natural question of whether there might be a generic conversion from online learners to private PAC learners that also preserves computational efficiency. We give a negative answer to this question under reasonable cryptographic assumptions (roughly, those from which it is possible to build indistinguishability obfuscation for all circuits). We exhibit a concept class that admits an online learner running in polynomial time with a polynomial mistake bound, but for which there is no computationally-efficient differentially private PAC learner. Our construction and analysis strengthens and generalizes that of Bun and Zhandry (TCC 2016-A), who established such a separation between private and non-private PAC learner.  ( 2 min )
    DART: A Principled Approach to Adversarially Robust Unsupervised Domain Adaptation
    arXiv:2402.11120v1 Announce Type: new Abstract: Distribution shifts and adversarial examples are two major challenges for deploying machine learning models. While these challenges have been studied individually, their combination is an important topic that remains relatively under-explored. In this work, we study the problem of adversarial robustness under a common setting of distribution shift - unsupervised domain adaptation (UDA). Specifically, given a labeled source domain $D_S$ and an unlabeled target domain $D_T$ with related but different distributions, the goal is to obtain an adversarially robust model for $D_T$. The absence of target domain labels poses a unique challenge, as conventional adversarial robustness defenses cannot be directly applied to $D_T$. To address this challenge, we first establish a generalization bound for the adversarial target loss, which consists of (i) terms related to the loss on the data, and (ii) a measure of worst-case domain divergence. Motivated by this bound, we develop a novel unified defense framework called Divergence Aware adveRsarial Training (DART), which can be used in conjunction with a variety of standard UDA methods; e.g., DANN [Ganin and Lempitsky, 2015]. DART is applicable to general threat models, including the popular $\ell_p$-norm model, and does not require heuristic regularizers or architectural changes. We also release DomainRobust: a testbed for evaluating robustness of UDA models to adversarial attacks. DomainRobust consists of 4 multi-domain benchmark datasets (with 46 source-target pairs) and 7 meta-algorithms with a total of 11 variants. Our large-scale experiments demonstrate that on average, DART significantly enhances model robustness on all benchmarks compared to the state of the art, while maintaining competitive standard accuracy. The relative improvement in robustness from DART reaches up to 29.2% on the source-target domain pairs considered.  ( 3 min )
    Toward Learning Latent-Variable Representations of Microstructures by Optimizing in Spatial Statistics Space
    arXiv:2402.11103v1 Announce Type: new Abstract: In Materials Science, material development involves evaluating and optimizing the internal structures of the material, generically referred to as microstructures. Microstructures structure is stochastic, analogously to image textures. A particular microstructure can be well characterized by its spatial statistics, analogously to image texture being characterized by the response to a Fourier-like filter bank. Material design would benefit from low-dimensional representation of microstructures Paulson et al. (2017). In this work, we train a Variational Autoencoders (VAE) to produce reconstructions of textures that preserve the spatial statistics of the original texture, while not necessarily reconstructing the same image in data space. We accomplish this by adding a differentiable term to the cost function in order to minimize the distance between the original and the reconstruction in spatial statistics space. Our experiments indicate that it is possible to train a VAE that minimizes the distance in spatial statistics space between the original and the reconstruction of synthetic images. In future work, we will apply the same techniques to microstructures, with the goal of obtaining low-dimensional representations of material microstructures.  ( 2 min )
    Model Editing by Pure Fine-Tuning
    arXiv:2402.11078v1 Announce Type: new Abstract: Fine-tuning is dismissed as not effective for model editing due to its poor performance compared to more specialized methods. However, fine-tuning is simple, agnostic to the architectural details of the model being edited, and able to leverage ongoing advances in standard training methods (e.g., PEFT), making it an appealing choice for a model editor. In this work, we show that pure fine-tuning can be a viable approach to model editing. We propose a slight modification of naive fine-tuning with two key ingredients. First, we optimize the conditional likelihood rather than the full likelihood. Second, we augment the data with random paraphrases and facts to encourage generalization and locality. Our experiments on ZsRE and CounterFact show that this simple modification allows fine-tuning to often match or outperform specialized editors in the edit score.  ( 2 min )
    The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
    arXiv:2402.11004v1 Announce Type: new Abstract: Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov Chain sequence modeling task in order to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form \emph{statistical induction heads} which compute accurate next-token probabilities given the bigram statistics of the context. During the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to sub-optimally predict using in-context single-token statistics (unigrams); then, there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution. We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$.  ( 2 min )
    Analysis and Mortality Prediction using Multiclass Classification for Older Adults with Type 2 Diabetes
    arXiv:2402.10999v1 Announce Type: new Abstract: Designing proper treatment plans to manage diabetes requires health practitioners to pay heed to the individuals remaining life along with the comorbidities affecting them. Older adults with Type 2 Diabetes Mellitus (T2DM) are prone to experience premature death or even hypoglycaemia. The structured dataset utilized has 68 potential mortality predictors for 275,190 diabetic U.S. military Veterans aged 65 years or older. A new target variable is invented by combining the two original target variables. Outliers are handled by discretizing the continuous variables. Categorical variables have been dummy encoded. Class balancing is achieved by random under-sampling. A benchmark regression model is built using Multinomial Logistic Regression with LASSO. Chi-Squared and Information Gain are the filter-based feature selection techniques utilized. Classifiers such as Multinomial Logistic Regression, Random Forest, Extreme Gradient Boosting (XGBoost), and One-vs-Rest classifier are employed to build various models. Contrary to expectations, all the models have constantly underperformed. XGBoost has given the highest accuracy of 53.03 percent with Chi-Squared feature selection. All the models have consistently shown an acceptable performance for Class 3 (remaining life is more than 10 years), significantly low for Class 1 (remaining life is up to 5 years), and the worst for Class 2 (remaining life is more than 5 but up to 10 years). Features analysis has deduced that almost all input variables are associated with multiple target classes. The high dimensionality of the input data after dummy encoding seems to have confused the models, leading to misclassifications. The approach taken in this study is ineffective in producing a high-performing predictive model but lays a foundation as this problem has never been viewed from a multiclass classification perspective.  ( 3 min )
    Language Models with Conformal Factuality Guarantees
    arXiv:2402.10978v1 Announce Type: new Abstract: Guaranteeing the correctness and factuality of language model (LM) outputs is a major open problem. In this work, we propose conformal factuality, a framework that can ensure high probability correctness guarantees for LMs by connecting language modeling and conformal prediction. We observe that the correctness of an LM output is equivalent to an uncertainty quantification problem, where the uncertainty sets are defined as the entailment set of an LM's output. Using this connection, we show that conformal prediction in language models corresponds to a back-off algorithm that provides high probability correctness guarantees by progressively making LM outputs less specific (and expanding the associated uncertainty sets). This approach applies to any black-box LM and requires very few human-annotated samples. Evaluations of our approach on closed book QA (FActScore, NaturalQuestions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output.  ( 2 min )
    mshw, a forecasting library to predict short-term electricity demand based on multiple seasonal Holt-Winters
    arXiv:2402.10982v1 Announce Type: new Abstract: Transmission system operators have a growing need for more accurate forecasting of electricity demand. Current electricity systems largely require demand forecasting so that the electricity market establishes electricity prices as well as the programming of production units. The companies that are part of the electrical system use exclusive software to obtain predictions, based on the use of time series and prediction tools, whether statistical or artificial intelligence. However, the most common form of prediction is based on hybrid models that use both technologies. In any case, it is software with a complicated structure, with a large number of associated variables and that requires a high computational load to make predictions. The predictions they can offer are not much better than those that simple models can offer. In this paper we present a MATLAB toolbox created for the prediction of electrical demand. The toolbox implements multiple seasonal Holt-Winters exponential smoothing models and neural network models. The models used include the use of discrete interval mobile seasonalities (DIMS) to improve forecasting on special days. Additionally, the results of its application in various electrical systems in Europe are shown, where the results obtained can be seen. The use of this library opens a new avenue of research for the use of models with discrete and complex seasonalities in other fields of application.  ( 3 min )
    Optimal feature rescaling in machine learning based on neural networks
    arXiv:2402.10964v1 Announce Type: new Abstract: This paper proposes a novel approach to improve the training efficiency and the generalization performance of Feed Forward Neural Networks (FFNNs) resorting to an optimal rescaling of input features (OFR) carried out by a Genetic Algorithm (GA). The OFR reshapes the input space improving the conditioning of the gradient-based algorithm used for the training. Moreover, the scale factors exploration entailed by GA trials and selection corresponds to different initialization of the first layer weights at each training attempt, thus realizing a multi-start global search algorithm (even though restrained to few weights only) which fosters the achievement of a global minimum. The approach has been tested on a FFNN modeling the outcome of a real industrial process (centerless grinding).  ( 2 min )
    Generative AI and Process Systems Engineering: The Next Frontier
    arXiv:2402.10977v1 Announce Type: new Abstract: This article explores how emerging generative artificial intelligence (GenAI) models, such as large language models (LLMs), can enhance solution methodologies within process systems engineering (PSE). These cutting-edge GenAI models, particularly foundation models (FMs), which are pre-trained on extensive, general-purpose datasets, offer versatile adaptability for a broad range of tasks, including responding to queries, image generation, and complex decision-making. Given the close relationship between advancements in PSE and developments in computing and systems technologies, exploring the synergy between GenAI and PSE is essential. We begin our discussion with a compact overview of both classic and emerging GenAI models, including FMs, and then dive into their applications within key PSE domains: synthesis and design, optimization and integration, and process monitoring and control. In each domain, we explore how GenAI models could potentially advance PSE methodologies, providing insights and prospects for each area. Furthermore, the article identifies and discusses potential challenges in fully leveraging GenAI within PSE, including multiscale modeling, data requirements, evaluation metrics and benchmarks, and trust and safety, thereby deepening the discourse on effective GenAI integration into systems analysis, design, optimization, operations, monitoring, and control. This paper provides a guide for future research focused on the applications of emerging GenAI in PSE.  ( 2 min )
    Robustness to Subpopulation Shift with Domain Label Noise via Regularized Annotation of Domains
    arXiv:2402.11039v1 Announce Type: new Abstract: Existing methods for last layer retraining that aim to optimize worst-group accuracy (WGA) rely heavily on well-annotated groups in the training data. We show, both in theory and practice, that annotation-based data augmentations using either downsampling or upweighting for WGA are susceptible to domain annotation noise, and in high-noise regimes approach the WGA of a model trained with vanilla empirical risk minimization. We introduce Regularized Annotation of Domains (RAD) in order to train robust last layer classifiers without the need for explicit domain annotations. Our results show that RAD is competitive with other recently proposed domain annotation-free techniques. Most importantly, RAD outperforms state-of-the-art annotation-reliant methods even with only 5% noise in the training data for several publicly available datasets.  ( 2 min )
    Training Bayesian Neural Networks with Sparse Subspace Variational Inference
    arXiv:2402.11025v1 Announce Type: new Abstract: Bayesian neural networks (BNNs) offer uncertainty quantification but come with the downside of substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either slowly introducing sparsity throughout the training or by post-training compression of dense BNNs. The dilemma of how to cut down massive training costs remains, particularly given the requirement to learn about the uncertainty. To solve this challenge, we introduce Sparse Subspace Variational Inference (SSVI), the first fully sparse BNN framework that maintains a consistently highly sparse Bayesian model throughout the training and inference phases. Starting from a randomly initialized low-dimensional sparse subspace, our approach alternately optimizes the sparse subspace basis selection and its associated parameters. While basis selection is characterized as a non-differentiable problem, we approximate the optimal solution with a removal-and-addition strategy, guided by novel criteria based on weight distribution statistics. Our extensive experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, a 10-20x compression in model size with under 3\% performance drop, and up to 20x FLOPs reduction during training compared with dense VI training. Remarkably, SSVI also demonstrates enhanced robustness to hyperparameters, reducing the need for intricate tuning in VI and occasionally even surpassing VI-trained dense BNNs on both accuracy and uncertainty metrics.  ( 2 min )
    Accelerating Semi-Asynchronous Federated Learning
    arXiv:2402.10991v1 Announce Type: new Abstract: Federated Learning (FL) is a distributed machine learning paradigm that allows clients to train models on their data while preserving their privacy. FL algorithms, such as Federated Averaging (FedAvg) and its variants, have been shown to converge well in many scenarios. However, these methods require clients to upload their local updates to the server in a synchronous manner, which can be slow and unreliable in realistic FL settings. To address this issue, researchers have developed asynchronous FL methods that allow clients to continue training on their local data using a stale global model. However, most of these methods simply aggregate all of the received updates without considering their relative contributions, which can slow down convergence. In this paper, we propose a contribution-aware asynchronous FL method that takes into account the staleness and statistical heterogeneity of the received updates. Our method dynamically adjusts the contribution of each update based on these factors, which can speed up convergence compared to existing methods.  ( 2 min )
  • Open

    Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors
    arXiv:2401.02739v2 Announce Type: replace-cross Abstract: We propose denoising diffusion variational inference (DDVI), an approximate inference algorithm for latent variable models which relies on diffusion models as flexible variational posteriors. Specifically, our method introduces an expressive class of approximate posteriors with auxiliary latent variables that perform diffusion in latent space by reversing a user-specified noising process. We fit these models by optimizing a lower bound on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. It increases the expressivity of flow-based methods via non-invertible deep recurrent architectures and avoids the instability of adversarial methods. We use DDVI on a motivating task in biology -- inferring latent ancestry from human genomes -- and we find that it outperforms strong baselines on the Thousand Genomes dataset.  ( 2 min )
    Hidden Minima in Two-Layer ReLU Networks
    arXiv:2312.16819v2 Announce Type: replace-cross Abstract: The optimization problem associated to fitting two-layer ReLU networks having $d$~inputs, $k$~neurons, and labels generated by a target network, is considered. Two types of infinite families of spurious minima, giving one minimum per $d$, were recently found. The loss at minima belonging to the first type converges to zero as $d$ increases. In the second type, the loss remains bounded away from zero. That being so, how may one avoid minima belonging to the latter type? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of hidden minima. By existing analyses, the Hessian spectrum of both types agree modulo $O(d^{-1/2})$-terms -- not promising. Thus, rather, our investigation proceeds by studying curves along which the loss is minimized or maximized, generally referred to as tangency arcs. We prove that apparently far removed group representation-theoretic considerations concerning the arrangement of subspaces invariant to the action of subgroups of $S_d$, the symmetry group over $d$ symbols, relative to ones fixed by the action yield a precise description of all finitely many admissible types of tangency arcs. The general results used for the loss function reveal that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent in previous work, indicating in particular the subtlety of the analysis. The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically sufficiently tame to enable a numerical construction of tangency arcs and so compare how minima, both types, are positioned relative to adjacent critical points.  ( 3 min )
    Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression
    arXiv:2211.10968v3 Announce Type: replace Abstract: Previous analysis of regularized functional linear regression in a reproducing kernel Hilbert space (RKHS) typically requires the target function to be contained in this kernel space. This paper studies the convergence performance of divide-and-conquer estimators in the scenario that the target function does not necessarily reside in the underlying RKHS. As a decomposition-based scalable approach, the divide-and-conquer estimators of functional linear regression can substantially reduce the algorithmic complexities in time and memory. We develop an integral operator approach to establish sharp finite sample upper bounds for prediction with divide-and-conquer estimators under various regularity conditions of explanatory variables and target function. We also prove the asymptotic optimality of the derived rates by building the mini-max lower bounds. Finally, we consider the convergence of noiseless estimators and show that the rates can be arbitrarily fast under mild conditions.  ( 2 min )
    Towards black-box parameter estimation
    arXiv:2303.15041v2 Announce Type: replace Abstract: Deep learning algorithms have recently shown to be a successful tool in estimating parameters of statistical models for which simulation is easy, but likelihood computation is challenging. But the success of these approaches depends on simulating parameters that sufficiently reproduce the observed data, and, at present, there is a lack of efficient methods to produce these simulations. We develop new black-box procedures to estimate parameters of statistical models based only on weak parameter structure assumptions. For well-structured likelihoods with frequent occurrences, such as in time series, this is achieved by pre-training a deep neural network on an extensive simulated database that covers a wide range of data sizes. For other types of complex dependencies, an iterative algorithm guides simulations to the correct parameter region in multiple rounds. These approaches can successfully estimate and quantify the uncertainty of parameters from non-Gaussian models with complex spatial and temporal dependencies. The success of our methods is a first step towards a fully flexible automatic black-box estimation framework.  ( 2 min )
    Stochastic optimal transport in Banach Spaces for regularized estimation of multivariate quantiles
    arXiv:2302.00982v2 Announce Type: replace-cross Abstract: We introduce a new stochastic algorithm for solving entropic optimal transport (EOT) between two absolutely continuous probability measures $\mu$ and $\nu$. Our work is motivated by the specific setting of Monge-Kantorovich quantiles where the source measure $\mu$ is either the uniform distribution on the unit hypercube or the spherical uniform distribution. Using the knowledge of the source measure, we propose to parametrize a Kantorovich dual potential by its Fourier coefficients. In this way, each iteration of our stochastic algorithm reduces to two Fourier transforms that enables us to make use of the Fast Fourier Transform (FFT) in order to implement a fast numerical method to solve EOT. We study the almost sure convergence of our stochastic algorithm that takes its values in an infinite-dimensional Banach space. Then, using numerical experiments, we illustrate the performances of our approach on the computation of regularized Monge-Kantorovich quantiles. In particular, we investigate the potential benefits of entropic regularization for the smooth estimation of multivariate quantiles using data sampled from the target measure $\nu$.  ( 2 min )
    Efficient Computation of Sparse and Robust Maximum Association Estimators
    arXiv:2311.17563v2 Announce Type: replace-cross Abstract: Although robust statistical estimators are less affected by outlying observations, their computation is usually more challenging. This is particularly the case in high-dimensional sparse settings. The availability of new optimization procedures, mainly developed in the computer science domain, offers new possibilities for the field of robust statistics. This paper investigates how such procedures can be used for robust sparse association estimators. The problem can be split into a robust estimation step followed by an optimization for the remaining decoupled, (bi-)convex problem. A combination of the augmented Lagrangian algorithm and adaptive gradient descent is implemented to also include suitable constraints for inducing sparsity. We provide results concerning the precision of the algorithm and show the advantages over existing algorithms in this context. High-dimensional empirical examples underline the usefulness of this procedure. Extensions to other robust sparse estimators are possible.  ( 2 min )
    Sliced Wasserstein Estimation with Control Variates
    arXiv:2305.00402v2 Announce Type: replace Abstract: The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite having various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of projected one-dimensional measures, then we utilize the closed-form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can help to reduce the variance considerably when comparing measures over images and point-clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point-clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.  ( 3 min )
    Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data
    arXiv:2306.11157v2 Announce Type: replace Abstract: The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models, in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists via a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.  ( 3 min )
    Sensitivity-Aware Amortized Bayesian Inference
    arXiv:2310.11122v4 Announce Type: replace Abstract: Sensitivity analyses reveal the influence of various modeling choices on the outcomes of statistical analyses. While theoretically appealing, they are overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions.  ( 2 min )
    A Theoretical Analysis of the Learning Dynamics under Class Imbalance
    arXiv:2207.00391v4 Announce Type: replace Abstract: Data imbalance is a common problem in machine learning that can have a critical effect on the performance of a model. Various solutions exist but their impact on the convergence of the learning dynamics is not understood. Here, we elucidate the significant negative impact of data imbalance on learning, showing that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. This slowdown is related to the imbalance ratio and can be traced back to a competition between the optimization of different classes. Our main contribution is the analysis of the convergence of full-batch (GD) and stochastic gradient descent (SGD), and of variants that renormalize the contribution of each per-class gradient. We find that GD is not guaranteed to decrease the loss for each class but that this problem can be addressed by performing a per-class normalization of the gradient. With SGD, class imbalance has an additional effect on the direction of the gradients: the minority class suffers from a higher directional noise, which reduces the effectiveness of the per-class gradient normalization. Our findings not only allow us to understand the potential and limitations of strategies involving the per-class gradients, but also the reason for the effectiveness of previously used solutions for class imbalance such as oversampling.  ( 3 min )
    From Denoising Diffusions to Denoising Markov Models
    arXiv:2211.03595v3 Announce Type: replace Abstract: Denoising diffusions are state-of-the-art generative models exhibiting remarkable empirical performance. They work by diffusing the data distribution into a Gaussian distribution and then learning to reverse this noising process to obtain synthetic datapoints. The denoising diffusion relies on approximations of the logarithmic derivatives of the noised data densities using score matching. Such models can also be used to perform approximate posterior simulation when one can only sample from the prior and likelihood. We propose a unifying framework generalising this approach to a wide class of spaces and leading to an original extension of score matching. We illustrate the resulting models on various applications.  ( 2 min )
    Accelerating Approximate Thompson Sampling with Underdamped Langevin Monte Carlo
    arXiv:2401.11665v2 Announce Type: replace Abstract: Approximate Thompson sampling with Langevin Monte Carlo broadens its reach from Gaussian posterior sampling to encompass more general smooth posteriors. However, it still encounters scalability issues in high-dimensional problems when demanding high accuracy. To address this, we propose an approximate Thompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where the latter is the go-to workhorse for simulations of high-dimensional posteriors. Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling using a specific potential function. This design improves the sample complexity for realizing logarithmic regrets from $\mathcal{\tilde O}(d)$ to $\mathcal{\tilde O}(\sqrt{d})$. The scalability and robustness of our algorithm are also empirically validated through synthetic experiments in high-dimensional bandit problems.  ( 2 min )
    On Sampling with Approximate Transport Maps
    arXiv:2302.04763v3 Announce Type: replace Abstract: Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can be reliably handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.  ( 2 min )
    Best-of-Both-Worlds Algorithms for Linear Contextual Bandits
    arXiv:2312.15433v2 Announce Type: replace-cross Abstract: We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{O}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{O}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.  ( 2 min )
    Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic
    arXiv:1910.08597v5 Announce Type: replace Abstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, the iterates are likely to bounce at around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy-to-implement and essentially does not incur additional computational cost than standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance compared favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.  ( 2 min )
    Learning from higher-order statistics, efficiently: hypothesis tests, random features, and neural networks
    arXiv:2312.14922v2 Announce Type: replace Abstract: Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p\ge 4$ cumulants of $d$-dimensional inputs. We first characterise the fundamental statistical and computational limits of recovering the spike by analysing the number of samples $n$ required to strongly distinguish between inputs from the spiked cumulant model and isotropic Gaussian inputs. We find that statistical distinguishability requires $n\gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, i.e. those covered by the low-degree conjecture. These results suggest the existence of a wide statistical-to-computational gap in this problem. Numerical experiments show that neural networks learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features are not better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap in the amount of data required by neural networks and random features to learn from higher-order cumulants.  ( 3 min )
    Tail-adaptive Bayesian shrinkage
    arXiv:2007.02192v4 Announce Type: replace-cross Abstract: Robust Bayesian methods for high-dimensional regression problems under diverse sparse regimes are studied. Traditional shrinkage priors are primarily designed to detect a handful of signals from tens of thousands of predictors in the so-called ultra-sparsity domain. However, they may not perform desirably when the degree of sparsity is moderate. In this paper, we propose a robust sparse estimation method under diverse sparsity regimes, which has a tail-adaptive shrinkage property. In this property, the tail-heaviness of the prior adjusts adaptively, becoming larger or smaller as the sparsity level increases or decreases, respectively, to accommodate more or fewer signals, a posteriori. We propose a global-local-tail (GLT) Gaussian mixture distribution that ensures this property. We examine the role of the tail-index of the prior in relation to the underlying sparsity level and demonstrate that the GLT posterior contracts at the minimax optimal rate for sparse normal mean models. We apply both the GLT prior and the Horseshoe prior to a real data problem and simulation examples. Our findings indicate that the varying tail rule based on the GLT prior offers advantages over a fixed tail rule based on the Horseshoe prior in diverse sparsity regimes.  ( 2 min )
    Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery
    arXiv:2401.05394v3 Announce Type: replace-cross Abstract: Sparse recovery is ubiquitous in machine learning and signal processing. Due to the NP-hard nature of sparse recovery, existing methods are known to suffer either from restrictive (or even unknown) applicability conditions, or high computational cost. Recently, iterative regularization methods have emerged as a promising fast approach because they can achieve sparse recovery in one pass through early stopping, rather than the tedious grid-search used in the traditional methods. However, most of those iterative methods are based on the $\ell_1$ norm which requires restrictive applicability conditions and could fail in many cases. Therefore, achieving sparse recovery with iterative regularization methods under a wider range of conditions has yet to be further explored. To address this issue, we propose a novel iterative regularization algorithm, IRKSN, based on the $k$-support norm regularizer rather than the $\ell_1$ norm. We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with $\ell_1$ norm regularizers. Additionally, we give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery. Finally, we illustrate the applicability of our algorithm on several experiments, including a support recovery experiment with a correlated design matrix.  ( 3 min )
    A General Framework for User-Guided Bayesian Optimization
    arXiv:2311.14645v2 Announce Type: replace-cross Abstract: The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.  ( 2 min )
    On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates
    arXiv:2311.13584v2 Announce Type: replace-cross Abstract: We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly log-concave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.  ( 3 min )
    On Double Descent in Reinforcement Learning with LSTD and Random Features
    arXiv:2310.05518v4 Announce Type: replace-cross Abstract: Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.  ( 3 min )
    On the Posterior Distribution in Denoising: Application to Uncertainty Quantification
    arXiv:2309.13598v2 Announce Type: replace-cross Abstract: Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (\ie the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory-efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Code and examples are available on the project webpage in https://hilamanor.github.io/GaussianDenoisingPosterior/ .  ( 2 min )
    Balanced Training of Energy-Based Models with Adaptive Flow Sampling
    arXiv:2306.00684v4 Announce Type: replace-cross Abstract: Energy-based models (EBMs) are versatile density estimation models that directly parameterize an unnormalized log density. Although very flexible, EBMs lack a specified normalization constant of the model, making the likelihood of the model computationally intractable. Several approximate samplers and variational inference techniques have been proposed to estimate the likelihood gradients for training. These techniques have shown promising results in generating samples, but little attention has been paid to the statistical accuracy of the estimated density, such as determining the relative importance of different classes in a dataset. In this work, we propose a new maximum likelihood training algorithm for EBMs that uses a different type of generative model, normalizing flows (NF), which have recently been proposed to facilitate sampling. Our method fits an NF to an EBM during training so that an NF-assisted sampling scheme provides an accurate gradient for the EBMs at all times, ultimately leading to a fast sampler for generating new data.  ( 2 min )
    Multimodal Web Navigation with Instruction-Finetuned Foundation Models
    arXiv:2305.11854v3 Announce Type: replace-cross Abstract: The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.  ( 3 min )
    Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees
    arXiv:2305.03938v2 Announce Type: replace-cross Abstract: In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel two-timescale framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noises are only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods.  ( 2 min )
    Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems
    arXiv:2303.05754v3 Announce Type: replace-cross Abstract: Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizing the residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combines the diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie's formula forms a Krylov subspace, then the CG initialized with the denoised data ensures the data consistency update to remain in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method. Code is available at https://github.com/HJ-harry/DDS  ( 3 min )
    Applications of No-Collision Transportation Maps in Manifold Learning
    arXiv:2304.00199v4 Announce Type: replace-cross Abstract: In this work, we investigate applications of no-collision transportation maps introduced in [Nurbekyan et. al., 2020] in manifold learning for image data. Recently, there has been a surge in applying transportation-based distances and features for data representing motion-like or deformation-like phenomena. Indeed, comparing intensities at fixed locations often does not reveal the data structure. No-collision maps and distances developed in [Nurbekyan et. al., 2020] are sensitive to geometric features similar to optimal transportation (OT) maps but much cheaper to compute due to the absence of optimization. In this work, we prove that no-collision distances provide an isometry between translations (respectively dilations) of a single probability measure and the translation (respectively dilation) vectors equipped with a Euclidean distance. Furthermore, we prove that no-collision transportation maps, as well as OT and linearized OT maps, do not in general provide an isometry for rotations. The numerical experiments confirm our theoretical findings and show that no-collision distances achieve similar or better performance on several manifold learning tasks compared to other OT and Euclidean-based methods at a fraction of a computational cost.  ( 2 min )
    Variance estimation in graphs with the fused lasso
    arXiv:2207.12638v3 Announce Type: replace-cross Abstract: We study the problem of variance estimation in general graph-structured problems. First, we develop a linear time estimator for the homoscedastic case that can consistently estimate the variance in general graphs. We show that our estimator attains minimax rates for the chain and 2D grid graphs when the mean signal has total variation with canonical scaling. Furthermore, we provide general upper bounds on the mean squared error performance of the fused lasso estimator in general graphs under a moment condition and a bound on the tail behavior of the errors. These upper bounds allow us to generalize for broader classes of distributions, such as sub-exponential, many existing results on the fused lasso that are only known to hold with the assumption that errors are sub-Gaussian random variables. Exploiting our upper bounds, we then study a simple total variation regularization estimator for estimating the signal of variances in the heteroscedastic case. We also provide lower bounds showing that our heteroscedastic variance estimator attains minimax rates for estimating signals of bounded variation in grid graphs, and $K$-nearest neighbor graphs, and the estimator is consistent for estimating the variances in any connected graph.  ( 2 min )
    Deep-Lock: Secure Authorization for Deep Neural Networks
    arXiv:2008.05966v2 Announce Type: replace-cross Abstract: Trained Deep Neural Network (DNN) models are considered valuable Intellectual Properties (IP) in several business models. Prevention of IP theft and unauthorized usage of such DNN models has been raised as of significant concern by industry. In this paper, we address the problem of preventing unauthorized usage of DNN models by proposing a generic and lightweight key-based model-locking scheme, which ensures that a locked model functions correctly only upon applying the correct secret key. The proposed scheme, known as Deep-Lock, utilizes S-Boxes with good security properties to encrypt each parameter of a trained DNN model with secret keys generated from a master key via a key scheduling algorithm. The resulting dense network of encrypted weights is found robust against model fine-tuning attacks. Finally, Deep-Lock does not require any intervention in the structure and training of the DNN models, making it applicable for all existing software and hardware implementations of DNN.  ( 2 min )
    A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization
    arXiv:2302.08766v3 Announce Type: replace Abstract: Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $\mathcal{O}((n+m)^{\frac12}\varepsilon^{-1})$ gradient computations to achieve $\varepsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, which is therefore optimal in terms of sample complexity.  ( 2 min )
    Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings
    arXiv:2104.08928v3 Announce Type: replace Abstract: Unstructured text provides decision-makers with a rich data source in many domains, ranging from product reviews in retail to nursing notes in healthcare. To leverage this information, words are typically translated into word embeddings -- vectors that encode the semantic relationships between words -- through unsupervised learning algorithms such as matrix factorization. However, learning word embeddings from new domains with limited training data can be challenging, because the meaning/usage may be different in the new domain, e.g., the word ``positive'' typically has positive sentiment, but often has negative sentiment in medical notes since it may imply that a patient tested positive for a disease. In practice, we expect that only a small number of domain-specific words may have new meanings. We propose an intuitive two-stage estimator that exploits this structure via a group-sparse penalty to efficiently transfer learn domain-specific word embeddings by combining large-scale text corpora (such as Wikipedia) with limited domain-specific text data. We bound the generalization error of our transfer learning estimator, proving that it can achieve high accuracy with substantially less domain-specific data when only a small number of embeddings are altered between domains. Furthermore, we prove that all local minima identified by our nonconvex objective function are statistically indistinguishable from the global minimum under standard regularization conditions, implying that our estimator can be computed efficiently. Our results provide the first bounds on group-sparse matrix factorization, which may be of independent interest. We empirically evaluate our approach compared to state-of-the-art fine-tuning heuristics from natural language processing.  ( 3 min )
    LoRA+: Efficient Low Rank Adaptation of Large Models
    arXiv:2402.12354v1 Announce Type: cross Abstract: In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$ 2X SpeedUp), at the same computational cost as LoRA.  ( 2 min )
    Generating Survival Interpretable Trajectories and Data
    arXiv:2402.12331v1 Announce Type: cross Abstract: A new model for generating survival trajectories and data based on applying an autoencoder of a specific structure is proposed. It solves three tasks. First, it provides predictions in the form of the expected event time and the survival function for a new generated feature vector on the basis of the Beran estimator. Second, the model generates additional data based on a given training set that would supplement the original dataset. Third, the most important, it generates a prototype time-dependent trajectory for an object, which characterizes how features of the object could be changed to achieve a different time to an event. The trajectory can be viewed as a type of the counterfactual explanation. The proposed model is robust during training and inference due to a specific weighting scheme incorporating into the variational autoencoder. The model also determines the censored indicators of new generated data by solving a classification task. The paper demonstrates the efficiency and properties of the proposed model using numerical experiments on synthetic and real datasets. The code of the algorithm implementing the proposed model is publicly available.  ( 2 min )
    Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
    arXiv:2402.12336v1 Announce Type: cross Abstract: Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (VLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of VLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the VLM is required. The code and robust models are available at https://github.com/chs20/RobustVLM  ( 2 min )
    Uncertainty quantification in fine-tuned LLMs using LoRA ensembles
    arXiv:2402.12264v1 Announce Type: cross Abstract: Fine-tuning large language models can improve task specific performance, although a general understanding of what the fine-tuned model has learned, forgotten and how to trust its predictions is still missing. We derive principled uncertainty quantification for fine-tuned LLMs with posterior approximations using computationally efficient low-rank adaptation ensembles. We analyze three common multiple-choice datasets using low-rank adaptation ensembles based on Mistral-7b, and draw quantitative and qualitative conclusions on their perceived complexity and model efficacy on the different target domains during and after fine-tuning. In particular, backed by the numerical experiments, we hypothesise about signals from entropic uncertainty measures for data domains that are inherently difficult for a given architecture to learn.  ( 2 min )
    Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis
    arXiv:2402.12241v1 Announce Type: cross Abstract: We analyze recurrent neural networks trained with gradient descent in the supervised learning setting for dynamical systems, and prove that gradient descent can achieve optimality \emph{without} massive overparameterization. Our in-depth nonasymptotic analysis (i) provides sharp bounds on the network size $m$ and iteration complexity $\tau$ in terms of the sequence length $T$, sample size $n$ and ambient dimension $d$, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the convergence and network width bounds characterized by a cutoff point that depends on the Lipschitz continuity of the activation function. Remarkably, this analysis reveals that an appropriately-initialized recurrent neural network trained with $n$ samples can achieve optimality with a network size $m$ that scales only logarithmically with $n$. This sharply contrasts with the prior works that require high-order polynomial dependency of $m$ on $n$ to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and establishing local smoothness properties of the hidden state with respect to the learnable parameters.  ( 2 min )
    Universal Generalization Guarantees for Wasserstein Distributionally Robust Models
    arXiv:2402.11981v1 Announce Type: cross Abstract: Distributionally robust optimization has emerged as an attractive way to train robust machine learning models, capturing data uncertainty and distribution shifts. Recent statistical analyses have proved that robust models built from Wasserstein ambiguity sets have nice generalization guarantees, breaking the curse of dimensionality. However, these results are obtained in specific cases, at the cost of approximations, or under assumptions difficult to verify in practice. In contrast, we establish, in this article, exact generalization guarantees that cover all practical cases, including any transport cost function and any loss function, potentially non-convex and nonsmooth. For instance, our result applies to deep learning, without requiring restrictive assumptions. We achieve this result through a novel proof technique that combines nonsmooth analysis rationale with classical concentration results. Our approach is general enough to extend to the recent versions of Wasserstein/Sinkhorn distributionally robust problems that involve (double) regularizations.  ( 2 min )
    Linear bandits with polylogarithmic minimax regret
    arXiv:2402.12042v1 Announce Type: cross Abstract: We study a noise model for linear stochastic bandits for which the subgaussian noise parameter vanishes linearly as we select actions on the unit sphere closer and closer to the unknown vector. We introduce an algorithm for this problem that exhibits a minimax regret scaling as $\log^3(T)$ in the time horizon $T$, in stark contrast the square root scaling of this regret for typical bandit algorithms. Our strategy, based on weighted least-squares estimation, achieves the eigenvalue relation $\lambda_{\min} ( V_t ) = \Omega (\sqrt{\lambda_{\max}(V_t ) })$ for the design matrix $V_t$ at each time step $t$ through geometrical arguments that are independent of the noise model and might be of independent interest. This allows us to tightly control the expected regret in each time step to be of the order $O(\frac1{t})$, leading to the logarithmic scaling of the cumulative regret.  ( 2 min )
    Bayesian Active Learning for Censored Regression
    arXiv:2402.11973v1 Announce Type: cross Abstract: Bayesian active learning is based on information theoretical approaches that focus on maximising the information that new observations provide to the model parameters. This is commonly done by maximising the Bayesian Active Learning by Disagreement (BALD) acquisitions function. However, we highlight that it is challenging to estimate BALD when the new data points are subject to censorship, where only clipped values of the targets are observed. To address this, we derive the entropy and the mutual information for censored distributions and derive the BALD objective for active learning in censored regression ($\mathcal{C}$-BALD). We propose a novel modelling approach to estimate the $\mathcal{C}$-BALD objective and use it for active learning in the censored setting. Across a wide range of datasets and models, we demonstrate that $\mathcal{C}$-BALD outperforms other Bayesian active learning methods in censored regression.  ( 2 min )
    Evaluating the Effectiveness of Index-Based Treatment Allocation
    arXiv:2402.11771v1 Announce Type: cross Abstract: When resources are scarce, an allocation policy is needed to decide who receives a resource. This problem occurs, for instance, when allocating scarce medical resources and is often solved using modern ML methods. This paper introduces methods to evaluate index-based allocation policies -- that allocate a fixed number of resources to those who need them the most -- by using data from a randomized control trial. Such policies create dependencies between agents, which render the assumptions behind standard statistical tests invalid and limit the effectiveness of estimators. Addressing these challenges, we translate and extend recent ideas from the statistics literature to present an efficient estimator and methods for computing asymptotically correct confidence intervals. This enables us to effectively draw valid statistical conclusions, a critical gap in previous work. Our extensive experiments validate our methodology in practical settings, while also showcasing its statistical power. We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial to evaluate different ML allocation policies in the context of a mHealth program, drawing previously invisible conclusions.  ( 2 min )
    Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance
    arXiv:2402.11742v1 Announce Type: cross Abstract: Classification models are expected to perform equally well for different classes, yet in practice, there are often large gaps in their performance. This issue of class bias is widely studied in cases of datasets with sample imbalance, but is relatively overlooked in balanced datasets. In this work, we introduce the concept of spectral imbalance in features as a potential source for class disparities and study the connections between spectral imbalance and class bias in both theory and practice. To build the connection between spectral imbalance and class gap, we develop a theoretical framework for studying class disparities and derive exact expressions for the per-class error in a high-dimensional mixture model setting. We then study this phenomenon in 11 different state-of-the-art pretrained encoders and show how our proposed framework can be used to compare the quality of encoders, as well as evaluate and combine data augmentation strategies to mitigate the issue. Our work sheds light on the class-dependent effects of learning, and provides new insights into how state-of-the-art pretrained features may have unknown biases that can be diagnosed through their spectra.  ( 2 min )
    Monte Carlo with kernel-based Gibbs measures: Guarantees for probabilistic herding
    arXiv:2402.11736v1 Announce Type: cross Abstract: Kernel herding belongs to a family of deterministic quadratures that seek to minimize the worst-case integration error over a reproducing kernel Hilbert space (RKHS). In spite of strong experimental support, it has revealed difficult to prove that this worst-case error decreases at a faster rate than the standard square root of the number of quadrature nodes, at least in the usual case where the RKHS is infinite-dimensional. In this theoretical paper, we study a joint probability distribution over quadrature nodes, whose support tends to minimize the same worst-case error as kernel herding. We prove that it does outperform i.i.d. Monte Carlo, in the sense of coming with a tighter concentration inequality on the worst-case integration error. While not improving the rate yet, this demonstrates that the mathematical tools of the study of Gibbs measures can help understand to what extent kernel herding and its variants improve on computationally cheaper methods. Moreover, we provide early experimental evidence that a faster rate of convergence, though not worst-case, is likely.  ( 2 min )
    OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations
    arXiv:2402.11427v1 Announce Type: cross Abstract: First-order optimization (FOO) algorithms are pivotal in numerous computational domains such as machine learning and signal denoising. However, their application to complex tasks like neural network training often entails significant inefficiencies due to the need for many sequential iterations for convergence. In response, we introduce first-order optimization expedited with approximately parallelized iterations (OptEx), the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck. OptEx employs kernelized gradient estimation to make use of gradient history for future gradient prediction, enabling parallelization of iterations -- a strategy once considered impractical because of the inherent iterative dependency in FOO. We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the iteration complexity of SGD-based OptEx, confirming that estimation errors diminish to zero as historical gradients accumulate and that SGD-based OptEx enjoys an effective acceleration rate of $\Omega(\sqrt{N})$ over standard SGD given parallelism of N. We also use extensive empirical studies, including synthetic functions, reinforcement learning tasks, and neural network training across various datasets, to underscore the substantial efficiency improvements achieved by OptEx.  ( 2 min )
    Data-Driven Stochastic AC-OPF using Gaussian Processes
    arXiv:2402.11365v1 Announce Type: cross Abstract: The thesis focuses on developing a data-driven algorithm, based on machine learning, to solve the stochastic alternating current (AC) chance-constrained (CC) Optimal Power Flow (OPF) problem. Although the AC CC-OPF problem has been successful in academic circles, it is highly nonlinear and computationally demanding, which limits its practical impact. The proposed approach aims to address this limitation and demonstrate its empirical efficiency through applications to multiple IEEE test cases. To solve the non-convex and computationally challenging CC AC-OPF problem, the proposed approach relies on a machine learning Gaussian process regression (GPR) model. The full Gaussian process (GP) approach is capable of learning a simple yet non-convex data-driven approximation to the AC power flow equations that can incorporate uncertain inputs. The proposed approach uses various approximations for GP-uncertainty propagation. The full GP CC-OPF approach exhibits highly competitive and promising results, outperforming the state-of-the-art sample-based chance constraint approaches. To further improve the robustness and complexity/accuracy trade-off of the full GP CC-OPF, a fast data-driven setup is proposed. This setup relies on the sparse and hybrid Gaussian processes (GP) framework to model the power flow equations with input uncertainty.  ( 2 min )
    An Elementary Predictor Obtaining $2\sqrt{T}$ Distance to Calibration
    arXiv:2402.11410v1 Announce Type: cross Abstract: Blasiok et al. [2023] proposed distance to calibration as a natural measure of calibration error that unlike expected calibration error (ECE) is continuous. Recently, Qiao and Zheng [2024] gave a non-constructive argument establishing the existence of an online predictor that can obtain $O(\sqrt{T})$ distance to calibration in the adversarial setting, which is known to be impossible for ECE. They leave as an open problem finding an explicit, efficient algorithm. We resolve this problem and give an extremely simple, efficient, deterministic algorithm that obtains distance to calibration error at most $2\sqrt{T}$.  ( 2 min )
    Fair Classification with Partial Feedback: An Exploration-Based Data-Collection Approach
    arXiv:2402.11338v1 Announce Type: cross Abstract: In many predictive contexts (e.g., credit lending), true outcomes are only observed for samples that were positively classified in the past. These past observations, in turn, form training datasets for classifiers that make future predictions. However, such training datasets lack information about the outcomes of samples that were (incorrectly) negatively classified in the past and can lead to erroneous classifiers. We present an approach that trains a classifier using available data and comes with a family of exploration strategies to collect outcome data about subpopulations that otherwise would have been ignored. For any exploration strategy, the approach comes with guarantees that (1) all sub-populations are explored, (2) the fraction of false positives is bounded, and (3) the trained classifier converges to a "desired" classifier. The right exploration strategy is context-dependent; it can be chosen to improve learning guarantees and encode context-specific group fairness properties. Evaluation on real-world datasets shows that this approach consistently boosts the quality of collected outcome data and improves the fraction of true positives for all groups, with only a small reduction in predictive utility.  ( 2 min )
    Expressive Higher-Order Link Prediction through Hypergraph Symmetry Breaking
    arXiv:2402.11339v1 Announce Type: cross Abstract: A hypergraph consists of a set of nodes along with a collection of subsets of the nodes called hyperedges. Higher-order link prediction is the task of predicting the existence of a missing hyperedge in a hypergraph. A hyperedge representation learned for higher order link prediction is fully expressive when it does not lose distinguishing power up to an isomorphism. Many existing hypergraph representation learners, are bounded in expressive power by the Generalized Weisfeiler Lehman-1 (GWL-1) algorithm, a generalization of the Weisfeiler Lehman-1 algorithm. However, GWL-1 has limited expressive power. In fact, induced subhypergraphs with identical GWL-1 valued nodes are indistinguishable. Furthermore, message passing on hypergraphs can already be computationally expensive, especially on GPU memory. To address these limitations, we devise a preprocessing algorithm that can identify certain regular subhypergraphs exhibiting symmetry. Our preprocessing algorithm runs once with complexity the size of the input hypergraph. During training, we randomly replace subhypergraphs identified by the algorithm with covering hyperedges to break symmetry. We show that our method improves the expressivity of GWL-1. Our extensive experiments also demonstrate the effectiveness of our approach for higher-order link prediction on both graph and hypergraph datasets with negligible change in computation.  ( 2 min )
    Deep adaptive sampling for surrogate modeling without labeled data
    arXiv:2402.11283v1 Announce Type: cross Abstract: Surrogate modeling is of great practical significance for parametric differential equation systems. In contrast to classical numerical methods, using physics-informed deep learning methods to construct simulators for such systems is a promising direction due to its potential to handle high dimensionality, which requires minimizing a loss over a training set of random samples. However, the random samples introduce statistical errors, which may become the dominant errors for the approximation of low-regularity and high-dimensional problems. In this work, we present a deep adaptive sampling method for surrogate modeling ($\text{DAS}^2$), where we generalize the deep adaptive sampling (DAS) method [62] [Tang, Wan and Yang, 2023] to build surrogate models for low-regularity parametric differential equations. In the parametric setting, the residual loss function can be regarded as an unnormalized probability density function (PDF) of the spatial and parametric variables. This PDF is approximated by a deep generative model, from which new samples are generated and added to the training set. Since the new samples match the residual-induced distribution, the refined training set can further reduce the statistical error in the current approximate solution. We demonstrate the effectiveness of $\text{DAS}^2$ with a series of numerical experiments, including the parametric lid-driven 2D cavity flow problem with a continuous range of Reynolds numbers from 100 to 1000.  ( 2 min )
    Learning by Reconstruction Produces Uninformative Features For Perception
    arXiv:2402.11337v1 Announce Type: cross Abstract: Input space reconstruction is an attractive representation learning paradigm. Despite interpretability of the reconstruction and generation, we identify a misalignment between learning by reconstruction, and learning for perception. We show that the former allocates a model's capacity towards a subspace of the data explaining the observed variance--a subspace with uninformative features for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90\% of the pixel variance can be solved with 45\% test accuracy. Using the bottom subspace instead, accounting for only 20\% of the pixel variance, reaches 55\% test accuracy. The features for perception being learned last explains the need for long training time, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies such as masking are indeed beneficial, others such as additive Gaussian noise are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect if a noise strategy is never beneficial regardless of the perception task.  ( 2 min )
    Functional Partial Least-Squares: Optimal Rates and Adaptation
    arXiv:2402.11134v1 Announce Type: cross Abstract: We consider the functional linear regression model with a scalar response and a Hilbert space-valued predictor, a well-known ill-posed inverse problem. We propose a new formulation of the functional partial least-squares (PLS) estimator related to the conjugate gradient method. We shall show that the estimator achieves the (nearly) optimal convergence rate on a class of ellipsoids and we introduce an early stopping rule which adapts to the unknown degree of ill-posedness. Some theoretical and simulation comparison between the estimator and the principal component regression estimator is provided.  ( 2 min )
    AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
    arXiv:2402.11215v1 Announce Type: cross Abstract: The choice of batch sizes in stochastic gradient optimizers is critical for model training. However, the practice of varying batch sizes throughout the training process is less explored compared to other hyperparameters. We investigate adaptive batch size strategies derived from adaptive sampling methods, traditionally applied only in stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ for finding a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. Our theoretical claims are supported by numerical experiments on various image classification tasks, highlighting the enhanced adaptability of progressive batching protocols in deep learning and the potential of such adaptive batch size strategies with adaptive gradient optimizers in large-scale model training.  ( 2 min )
    DART: A Principled Approach to Adversarially Robust Unsupervised Domain Adaptation
    arXiv:2402.11120v1 Announce Type: cross Abstract: Distribution shifts and adversarial examples are two major challenges for deploying machine learning models. While these challenges have been studied individually, their combination is an important topic that remains relatively under-explored. In this work, we study the problem of adversarial robustness under a common setting of distribution shift - unsupervised domain adaptation (UDA). Specifically, given a labeled source domain $D_S$ and an unlabeled target domain $D_T$ with related but different distributions, the goal is to obtain an adversarially robust model for $D_T$. The absence of target domain labels poses a unique challenge, as conventional adversarial robustness defenses cannot be directly applied to $D_T$. To address this challenge, we first establish a generalization bound for the adversarial target loss, which consists of (i) terms related to the loss on the data, and (ii) a measure of worst-case domain divergence. Motivated by this bound, we develop a novel unified defense framework called Divergence Aware adveRsarial Training (DART), which can be used in conjunction with a variety of standard UDA methods; e.g., DANN [Ganin and Lempitsky, 2015]. DART is applicable to general threat models, including the popular $\ell_p$-norm model, and does not require heuristic regularizers or architectural changes. We also release DomainRobust: a testbed for evaluating robustness of UDA models to adversarial attacks. DomainRobust consists of 4 multi-domain benchmark datasets (with 46 source-target pairs) and 7 meta-algorithms with a total of 11 variants. Our large-scale experiments demonstrate that on average, DART significantly enhances model robustness on all benchmarks compared to the state of the art, while maintaining competitive standard accuracy. The relative improvement in robustness from DART reaches up to 29.2% on the source-target domain pairs considered.  ( 3 min )
    Building Trees for Probabilistic Prediction via Scoring Rules
    arXiv:2402.11052v1 Announce Type: cross Abstract: Decision trees built with data remain in widespread use for nonparametric prediction. Predicting probability distributions is preferred over point predictions when uncertainty plays a prominent role in analysis and decision-making. We study modifying a tree to produce nonparametric predictive distributions. We find the standard method for building trees may not result in good predictive distributions and propose changing the splitting criteria for trees to one based on proper scoring rules. Analysis of both simulated data and several real datasets demonstrates that using these new splitting criteria results in trees with improved predictive properties considering the entire predictive distribution.  ( 2 min )
    Robustness to Subpopulation Shift with Domain Label Noise via Regularized Annotation of Domains
    arXiv:2402.11039v1 Announce Type: cross Abstract: Existing methods for last layer retraining that aim to optimize worst-group accuracy (WGA) rely heavily on well-annotated groups in the training data. We show, both in theory and practice, that annotation-based data augmentations using either downsampling or upweighting for WGA are susceptible to domain annotation noise, and in high-noise regimes approach the WGA of a model trained with vanilla empirical risk minimization. We introduce Regularized Annotation of Domains (RAD) in order to train robust last layer classifiers without the need for explicit domain annotations. Our results show that RAD is competitive with other recently proposed domain annotation-free techniques. Most importantly, RAD outperforms state-of-the-art annotation-reliant methods even with only 5% noise in the training data for several publicly available datasets.  ( 2 min )
    Training Bayesian Neural Networks with Sparse Subspace Variational Inference
    arXiv:2402.11025v1 Announce Type: cross Abstract: Bayesian neural networks (BNNs) offer uncertainty quantification but come with the downside of substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either slowly introducing sparsity throughout the training or by post-training compression of dense BNNs. The dilemma of how to cut down massive training costs remains, particularly given the requirement to learn about the uncertainty. To solve this challenge, we introduce Sparse Subspace Variational Inference (SSVI), the first fully sparse BNN framework that maintains a consistently highly sparse Bayesian model throughout the training and inference phases. Starting from a randomly initialized low-dimensional sparse subspace, our approach alternately optimizes the sparse subspace basis selection and its associated parameters. While basis selection is characterized as a non-differentiable problem, we approximate the optimal solution with a removal-and-addition strategy, guided by novel criteria based on weight distribution statistics. Our extensive experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, a 10-20x compression in model size with under 3\% performance drop, and up to 20x FLOPs reduction during training compared with dense VI training. Remarkably, SSVI also demonstrates enhanced robustness to hyperparameters, reducing the need for intricate tuning in VI and occasionally even surpassing VI-trained dense BNNs on both accuracy and uncertainty metrics.  ( 2 min )
    Regularization by denoising: Bayesian model and Langevin-within-split Gibbs sampling
    arXiv:2402.12292v1 Announce Type: new Abstract: This paper introduces a Bayesian framework for image inversion by deriving a probabilistic counterpart to the regularization-by-denoising (RED) paradigm. It additionally implements a Monte Carlo algorithm specifically tailored for sampling from the resulting posterior distribution, based on an asymptotically exact data augmentation (AXDA). The proposed algorithm is an approximate instance of split Gibbs sampling (SGS) which embeds one Langevin Monte Carlo step. The proposed method is applied to common imaging tasks such as deblurring, inpainting and super-resolution, demonstrating its efficacy through extensive numerical experiments. These contributions advance Bayesian inference in imaging by leveraging data-driven regularization strategies within a probabilistic framework.  ( 2 min )
    Asymptotic Gaussian Fluctuations of Eigenvectors in Spectral Clustering
    arXiv:2402.12302v1 Announce Type: new Abstract: The performance of spectral clustering relies on the fluctuations of the entries of the eigenvectors of a similarity matrix, which has been left uncharacterized until now. In this letter, it is shown that the signal $+$ noise structure of a general spike random matrix model is transferred to the eigenvectors of the corresponding Gram kernel matrix and the fluctuations of their entries are Gaussian in the large-dimensional regime. This CLT-like result was the last missing piece to precisely predict the classification performance of spectral clustering. The proposed proof is very general and relies solely on the rotational invariance of the noise. Numerical experiments on synthetic and real data illustrate the universality of this phenomenon.  ( 2 min )
    Kernel KMeans clustering splits for end-to-end unsupervised decision trees
    arXiv:2402.12232v1 Announce Type: new Abstract: Trees are convenient models for obtaining explainable predictions on relatively small datasets. Although there are many proposals for the end-to-end construction of such trees in supervised learning, learning a tree end-to-end for clustering without labels remains an open challenge. As most works focus on interpreting with trees the result of another clustering algorithm, we present here a novel end-to-end trained unsupervised binary tree for clustering: Kauri. This method performs a greedy maximisation of the kernel KMeans objective without requiring the definition of centroids. We compare this model on multiple datasets with recent unsupervised trees and show that Kauri performs identically when using a linear kernel. For other kernels, Kauri often outperforms the concatenation of kernel KMeans and a CART decision tree.  ( 2 min )
    Towards AI-Based Precision Oncology: A Machine Learning Framework for Personalized Counterfactual Treatment Suggestions based on Multi-Omics Data
    arXiv:2402.12190v1 Announce Type: new Abstract: AI-driven precision oncology has the transformative potential to reshape cancer treatment by leveraging the power of AI models to analyze the interaction between complex patient characteristics and their corresponding treatment outcomes. New technological platforms have facilitated the timely acquisition of multimodal data on tumor biology at an unprecedented resolution, such as single-cell multi-omics data, making this quality and quantity of data available for data-driven improved clinical decision-making. In this work, we propose a modular machine learning framework designed for personalized counterfactual cancer treatment suggestions based on an ensemble of machine learning experts trained on diverse multi-omics technologies. These specialized counterfactual experts per technology are consistently aggregated into a more powerful expert with superior performance and can provide both confidence and an explanation of its decision. The framework is tailored to address critical challenges inherent in data-driven cancer research, including the high-dimensional nature of the data, and the presence of treatment assignment bias in the retrospective observational data. The framework is showcased through comprehensive demonstrations using data from in-vitro and in-vivo treatment responses from a cohort of patients with ovarian cancer. Our method aims to empower clinicians with a reality-centric decision-support tool including probabilistic treatment suggestions with calibrated confidence and personalized explanations for tailoring treatment strategies to multi-omics characteristics of individual cancer patients.  ( 3 min )
    When Do Off-Policy and On-Policy Policy Gradient Methods Align?
    arXiv:2402.12034v1 Announce Type: new Abstract: Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods succeeded in many application domains, however, because of their notorious sample inefficiency their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify their objective function to be computable from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis showing conditions to reduce the on-off gap while establishing empirical evidence of shortfalls arising when these conditions are not met.  ( 2 min )
    Stochastic Hessian Fitting on Lie Group
    arXiv:2402.11858v1 Announce Type: new Abstract: This paper studies the fitting of Hessian or its inverse with stochastic Hessian-vector products. A Hessian fitting criterion, which can be used to derive most of the commonly used methods, e.g., BFGS, Gaussian-Newton, AdaGrad, etc., is used for the analysis. Our studies reveal different convergence rates for different Hessian fitting methods, e.g., sublinear rates for gradient descent in the Euclidean space and a commonly used closed-form solution, linear rates for gradient descent on the manifold of symmetric positive definite (SPL) matrices and certain Lie groups. The Hessian fitting problem is further shown to be strongly convex under mild conditions on a specific yet general enough Lie group. To confirm our analysis, these methods are tested under different settings like noisy Hessian-vector products, time varying Hessians, and low precision arithmetic. These findings are useful for stochastic second order optimizations that rely on fast, robust and accurate Hessian estimations.  ( 2 min )
    Statistical Test for Generated Hypotheses by Diffusion Models
    arXiv:2402.11789v1 Announce Type: new Abstract: The enhanced performance of AI has accelerated its integration into scientific research. In particular, the use of generative AI to create scientific hypotheses is promising and is increasingly being applied across various fields. However, when employing AI-generated hypotheses for critical decisions, such as medical diagnoses, verifying their reliability is crucial. In this study, we consider a medical diagnostic task using generated images by diffusion models, and propose a statistical test to quantify its reliability. The basic idea behind the proposed statistical test is to employ a selective inference framework, where we consider a statistical test conditional on the fact that the generated images are produced by a trained diffusion model. Using the proposed method, the statistical reliability of medical image diagnostic results can be quantified in the form of a p-value, allowing for decision-making with a controlled error rate. We show the theoretical validity of the proposed statistical test and its effectiveness through numerical experiments on synthetic and brain image datasets.  ( 2 min )
    Learning Memory Kernels in Generalized Langevin Equations
    arXiv:2402.11705v1 Announce Type: new Abstract: We introduce a novel approach for learning memory kernels in Generalized Langevin Equations. This approach initially utilizes a regularized Prony method to estimate correlation functions from trajectory data, followed by regression over a Sobolev norm-based loss function with RKHS regularization. Our approach guarantees improved performance within an exponentially weighted $L^2$ space, with the kernel estimation error controlled by the error in estimated correlation functions. We demonstrate the superiority of our estimator compared to other regression estimators that rely on $L^2$ loss functions and also an estimator derived from the inverse Laplace transform, using numerical examples that highlight its consistent advantage across various weight parameter selections. Additionally, we provide examples that include the application of force and drift terms in the equation.  ( 2 min )
    Empirical Density Estimation based on Spline Quasi-Interpolation with applications to Copulas clustering modeling
    arXiv:2402.11552v1 Announce Type: new Abstract: Density estimation is a fundamental technique employed in various fields to model and to understand the underlying distribution of data. The primary objective of density estimation is to estimate the probability density function of a random variable. This process is particularly valuable when dealing with univariate or multivariate data and is essential for tasks such as clustering, anomaly detection, and generative modeling. In this paper we propose the mono-variate approximation of the density using spline quasi interpolation and we applied it in the context of clustering modeling. The clustering technique used is based on the construction of suitable multivariate distributions which rely on the estimation of the monovariate empirical densities (marginals). Such an approximation is achieved by using the proposed spline quasi-interpolation, while the joint distributions to model the sought clustering partition is constructed with the use of copulas functions. In particular, since copulas can capture the dependence between the features of the data independently from the marginal distributions, a finite mixture copula model is proposed. The presented algorithm is validated on artificial and real datasets.  ( 2 min )
    Variational Entropy Search for Adjusting Expected Improvement
    arXiv:2402.11345v1 Announce Type: new Abstract: Bayesian optimization is a widely used technique for optimizing black-box functions, with Expected Improvement (EI) being the most commonly utilized acquisition function in this domain. While EI is often viewed as distinct from other information-theoretic acquisition functions, such as entropy search (ES) and max-value entropy search (MES), our work reveals that EI can be considered a special case of MES when approached through variational inference (VI). In this context, we have developed the Variational Entropy Search (VES) methodology and the VES-Gamma algorithm, which adapts EI by incorporating principles from information-theoretic concepts. The efficacy of VES-Gamma is demonstrated across a variety of test functions and read datasets, highlighting its theoretical and practical utilities in Bayesian optimization scenarios.  ( 2 min )
    Efficient Low-Rank Matrix Estimation, Experimental Design, and Arm-Set-Dependent Low-Rank Bandits
    arXiv:2402.11156v1 Announce Type: new Abstract: We study low-rank matrix trace regression and the related problem of low-rank matrix bandits. Assuming access to the distribution of the covariates, we propose a novel low-rank matrix estimation method called LowPopArt and provide its recovery guarantee that depends on a novel quantity denoted by B(Q) that characterizes the hardness of the problem, where Q is the covariance matrix of the measurement distribution. We show that our method can provide tighter recovery guarantees than classical nuclear norm penalized least squares (Koltchinskii et al., 2011) in several problems. To perform efficient estimation with a limited number of measurements from an arbitrarily given measurement set A, we also propose a novel experimental design criterion that minimizes B(Q) with computational efficiency. We leverage our novel estimator and design of experiments to derive two low-rank linear bandit algorithms for general arm sets that enjoy improved regret upper bounds. This improves over previous works on low-rank bandits, which make somewhat restrictive assumptions that the arm set is the unit ball or that an efficient exploration distribution is given. To our knowledge, our experimental design criterion is the first one tailored to low-rank matrix estimation beyond the naive reduction to linear regression, which can be of independent interest.  ( 2 min )
    Adaptive Split Balancing for Optimal Random Forest
    arXiv:2402.11228v1 Announce Type: new Abstract: While random forests are commonly used for regression problems, existing methods often lack adaptability in complex situations or lose optimality under simple, smooth scenarios. In this study, we introduce the adaptive split balancing forest (ASBF), capable of learning tree representations from data while simultaneously achieving minimax optimality under the Lipschitz class. To exploit higher-order smoothness levels, we further propose a localized version that attains the minimax rate under the H\"older class $\mathcal{H}^{q,\beta}$ for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. Rather than relying on the widely-used random feature selection, we consider a balanced modification to existing approaches. Our results indicate that an over-reliance on auxiliary randomness may compromise the approximation power of tree models, leading to suboptimal results. Conversely, a less random, more balanced approach demonstrates optimality. Additionally, we establish uniform upper bounds and explore the application of random forests in average treatment effect estimation problems. Through simulation studies and real-data applications, we demonstrate the superior empirical performance of the proposed methods over existing random forests.  ( 2 min )
    Doubly Robust Inference in Causal Latent Factor Models
    arXiv:2402.11652v1 Announce Type: cross Abstract: This article introduces a new framework for estimating average treatment effects under unobserved confounding in modern data-rich environments featuring large numbers of units and outcomes. The proposed estimator is doubly robust, combining outcome imputation, inverse probability weighting, and a novel cross-fitting procedure for matrix completion. We derive finite-sample and asymptotic guarantees, and show that the error of the new estimator converges to a mean-zero Gaussian distribution at a parametric rate. Simulation results demonstrate the practical relevance of the formal properties of the estimators analyzed in this article.  ( 2 min )

  • Open

    [R] Stumped on a Video Trying to Learn the Fundamentals of BackPropagation
    I was watching this video on back propagation today, which I really liked, but something in this video stumped me. I was completely on the same page and in lockstep, until the point 3:33 where he says, "when we multiply, we get..." ∂C --- = i = 1.5 * 2(a-y) = 4.5 * w - 1.5 ∂w For the life of me, I can't understand how he got 4.5 * w -1.5 in this video. I am hoping someone on this thread can help me unravel that mystery! submitted by /u/Lanky_Barnacle1130 [link] [comments]
    [N] Biscuit, the wandering piglet, arrived at the Zuse Research Institute and studied the graphics processor used for atomistic and molecular modeling.
    Hello everyone! Meet Biscuit, a toy piglet who’s traveling the world by being passed from one traveler to another. Biscuit recently joined scientists from Geneva on a visit to the Zuse Institute in Berlin to attend the conference on “Prospects and Challenges of Future High-Performance Computing for Atomistic and Molecular Modeling.” There, Biscuit attentively explored this topic by listening to all the lectures and examining the GPU used in supercomputers. Next, he will travel to Geneva and, together with scientists from CERN, visit the Large Hadron Collider. —------------------------- A little backstory: not long ago, I came up with the idea to create a toy. Its name is Biscuit, a charming piggy crafted by my wife and me. The mission of Biscuit is to travel around the world, passing from hand to hand. Through this project, I aim to connect people globally, showcasing the beauty of our planet and sharing fascinating stories and facts about various places. For this purpose, we created an Instagram page https://www.instagram.com/biscuitroams/ where all updates and adventures of Biscuit will be posted. Additionally, on Imgur and Reddit, I will compile and publish complete stories. Biscuit also has a small backpack through which participants can exchange small souvenirs and magnets from different countries! Biscuit has just begun his journey, and we currently have few volunteers to take him along. If you have friends who love traveling, perhaps they would like to take Biscuit with them! Yes, and Biscuit is quite small, standing at a full height of 18 centimeters. He easily fits into a briefcase, and there is a carabiner on his little briefcase so that he can be attached securely. submitted by /u/Dangerous-Annual-511 [link] [comments]
    [R] flight path optimisation algorithm
    I'm creating a tool to help planes fly smarter between two cities, A and B. Think of it like Google Maps, but for airplanes. When planes fly, they don't go straight from A to B; they follow a path that hops from one point to another, called waypoints. These waypoints and other details like how far the plane flies, how long it takes, and how much fuel it uses are all planned ahead and written down in what's called a flight planning report. But, the actual flight might use more fuel or take longer than expected. I've got records of past flights between these cities, so I know which paths turned out to be really good at saving fuel. My goal is to build an algorithm that looks at the planned route for an upcoming flight and checks it against these past flights. Then, it'll suggest to the pilots the best paths to take for saving fuel. It's like giving them a heads-up on which routes have been most fuel-efficient in the past." I am a complete newbie in this. Can someone please help me with the optimisation method I should opt for? submitted by /u/Desperate_Roll2360 [link] [comments]
    [D] How many images would be needed to create a facial recognition of say ~50 people?
    Would 10 good pictures of each person be enough? If I had a small number of pictures of each person (like 10) in different places/lights can i alter the pictures digitally to "fake" having more pictures to my set? Can I use an already-face-trained model and input my own pictures as its already hyper trained on noticing small facial changes? How many do you think I'd need to train a model that doesnt need to be 100% accurate, but one thats better than picking at random hah -- using any tools available, thanks :) submitted by /u/underbrownmaleroad [link] [comments]
    Any open source libraries that can help me easily switch between LLMs while building LLM applications? [D]
    I have been building open source tools which would be using LLMs and RAG, however, there is a plethora of LLM models and frameworks to choose between, including OpenAI, Huggingface, AzureOpenAI etc. Writing a new class and extensions for each of them can be difficult. I was curious if there was more easier way like a tool/framework which unifies maximum number of LLM apis under one umbrella so that I don't have to write a new class for everything? What do you usually do in these situations? submitted by /u/metalvendetta [link] [comments]
    [D] What is the best text-to-speech tool (preferably free) that takes in PDF currently?
    Hi everyone, I need a TTS tool that uses advanced ML algorithms to sound very natural. I want to use it to edit some of my YouTube videos, to turn a few books (PDF) into mp3s. I see a lot of TTS platforms around. Which do you recommend? I hope this isn't too much to ask. I would gladly appreciate it. Thanks in advance. PS: ElevenLabs won't do it since, it doesn't take PDF or text files. Also GitHub repos that aren't too complicated to use would be great too submitted by /u/HighAndInsane [link] [comments]
    [D] Kaggle datasets vs actual tabular data - bitter realization
    After years working and practising on tabular datasets on Kaggle or other platforms, I finally got to work with a tabular data from a university hospital and it was like a pool of dirt. Spent a whole day just to find proper headers and link all those inter-sheet formulae and filters. On the other hand I spent max. 30 mins for EDA on Kaggle datasets. I was told about the difference but realized what mess DS have to deal with. Always underestimated it, skipped workshops related to it and also casually made fun of it (I usually work with images and videos). submitted by /u/ade17_in [link] [comments]
    [R] Can predictive coding lives up to its promise?
    Just wonder if anyone in the community is still working on predictive coding which to my understanding started as early as 1999 by Rajesh Rao, https://www.researchgate.net/publication/13103385_Predictive_Coding_in_the_Visual_Cortex_a_Functional_Interpretation_of_Some_Extra-classical_Receptive-field_Effects. In the paper, as described in figure 1, only small error signal is passing from stage to stage, which suppose to be more energy efficient like the biological NN. Despite started from than 25 years ago, and multiple papers published along the way. I can't see much promise in terms of (1) energy efficient (2) true localized asynchronous training. Did I missing something? submitted by /u/Tasty_Road_3519 [link] [comments]
    [D] How useful is a master's degree to excel as a Machine Learning Engineer and better career prospects?
    I have an undergrad in Computer engineering with 3.99 GPA from a Texas School. I have 3 years of research experience from undergrad and also publications. Currently, I've been working as a machine learning engineer for 2 years at a leading OEM. I work with scientists with PhDs and worked across the pipeline for several projects. I feel like I am learning a ton at work. Note: - I tried searching and could not find anything targeting question specific to my circumstance, - I tried Gtech OMSCS and and the first 2 classes were garbage - so I dropped out. I also did not get the courses I wanted due to how many people are in each cohort. The lack of organization in this online program also made in very frustrating. I don't necessarily plan on being a ML Scientist but intend to pivot to management for ML teams. submitted by /u/Crackjoke [link] [comments]
    [D] Just In Time GPU Hosting?
    Hi, I've seen plenty of questions about hosting, but this is very specific. I want to be able to deploy 4 small DistilBERT based models, that are trained elsewhere. I will only need them to be "up" for an hour a day or less (for live demo's), and so I don't want to pay for monthly hosting fees. However, I need more than just some random Flask provider, I need them to run on appropriately powered GPU/TPU hardware for that hour (so I have sub 50 ms latencies). Does anyone have a low-cost, easy-to-use service they can recommend? Thanks in advance, W submitted by /u/wantondevious [link] [comments]
    [R] Does any use TTS to read papers?
    Obviously, there are far too many papers for one to read. Wondering if anyone is using a TTS platform to listen to papers while on a walk or run etc. Any recommendations would be awesome! Apologies if this is duplicate. submitted by /u/sck209 [link] [comments]
    [D] About visualizing normalized images
    For binary image segmentation, I have images with uint16 dtype and 4 channels. I splitted them into smaller patches and converted dtype into uint8 which ranges 0-255 then normalized them by dividing 255. Now it ranges between 0 and 1. I wanted to visualize my images but the visualization doesn't look like original image. How can I visualize normalized images which looks like same as original one. Here is screenshots: the original image (splitted into smaller tiles): https://preview.redd.it/6hx5y4zc8rjc1.png?width=1248&format=png&auto=webp&s=5e508fe8d8bcabefccb88dc895763b640aad9a56 Here is normalised image: What can I do? https://preview.redd.it/f0qka5ck8rjc1.png?width=946&format=png&auto=webp&s=a413854ebbdea09e97530e0497d59666e0f15116 ​ submitted by /u/NailaBaghir [link] [comments]
    [D] Why does the diffuser library change the original sampling algorithm?
    I have been recently analyzing the inner workings of the hugging face diffusers library, and I have realised what seems to be a change in the sampling method inside the step() function: ​ https://preview.redd.it/9zz4yzm37rjc1.png?width=2041&format=png&auto=webp&s=de825eb3d57bbffa103998368f60401dc9f6f304 If I have understood this correctly, first, they create a prediction of x₀ with the following fórmula: https://preview.redd.it/2cy3dp257rjc1.png?width=1483&format=png&auto=webp&s=3126d7b6ddb26743f4c0545275d0dc3a12f154a0 Next, they create xₜ₋₁ with these equations: https://preview.redd.it/72hbpzn57rjc1.png?width=1936&format=png&auto=webp&s=3f839aa6ad96d3c81812256e33573859a30b8f9b But I think that this is a huge change compared with line 4 of the original equation: https://preview.redd.it/upexate77rjc1.png?width=1060&format=png&auto=webp&s=4c6d17715f74fa4b72ecf63b4a82598634a36372 Why do they use a different way to obtain xₜ₋₁? What are the advantages of this approach? Is there any source material or paper to check where this comes from? Thank you for the help! submitted by /u/SrPinko [link] [comments]
    [D] Confused about different terms in Retrieval-Augmented Generation (RAG)
    Hi, I have a class exercises that require a literature review about Retrieval-Augmented Generation (RAG). My teacher told us to read "Benchmarking Large Language Models in Retrieval-Augmented Generation" and find some other papers that help improve Noise Robustness and Information Intergation. One paper I found is "Chain-of-note: Enhancing Robustness in Retrieval-Augmented Language Models", they mentioned the Retrieval-augmented language models (RALMs). So is this RAG and RALMs is the same or different? As I understand, the RALMs is a smaller part of RAG since it combines a language model with a system that finds information from a large group of documents. Am I correct? Also, from the Chain-of-note paper, I read another paper "Interleaving Retrieval with Chain-of-Though Reasoning for Knowledge-Intensive Multi-Step Questions" and felt like although seems the same, it actually solved a different problem than Chain-of-note. ​ submitted by /u/ma-d-ghost [link] [comments]
    [D] Picking an ML lab as an undergraduate: big, established lab or small, focused lab?
    Some background: I'm a third-year trying really hard for a PhD at a good school (a crapshoot, I know). I go to a solid school for CS and have some basics like a good GPA, plus I've been doing some applied work in healthcare at a T3, expecting a first authorship at a top medical journal. But what I'm interested in machine learning itself, not necessarily its applications in a specific field where the cleverness comes from applications rather than new fundamental ideas. To this end, I've been trying to find a new lab at my school (which will also serve as my home institution rec) to work on hardcore over this year. With help from the professor at the T3 I worked at, I shortlisted a bunch of professors, and my top two picks have either guaranteed a position in the lab (Lab 1) or strongly im…
    [D] Other popular/announced Diffusion Transformer products like Sora?
    While Sora has created quite a splash, what other known, popular/announced products use a similar model architecture ? submitted by /u/CodeComedianCat [link] [comments]
    [D] Versioning, Cataloging, and Decommissioning Data Products
    Non-Disruptive Management Of The Evolving Landscape Of Data Products: Part 2 How to keep up with evolving Data Products and run at the pace of user requirements? While 1️⃣ Versioning Data Product assets and features is a critical aspect of data product management, here are a couple more actions to look out for 👉🏻 2️⃣ Consistent cataloging (not just data, but all assets & features of the product) 3️⃣ Carefully identifying inflection points (when a DP is no longer a version change but a new DP) 4️⃣ Identifying when to decommission Data Products or specific product features And more! Dive in: https://moderndata101.substack.com/p/managing-the-evolving-data-products-landscape-p2 submitted by /u/growth_man [link] [comments]
    [P] Auto summary and reasoning bias detection to improve forum discussions
    Hey everyone! Check out Hivemind, our open-source project aiming to lift democracy to the digital age. In short, it's like Reddit, but with restricted login and voting features to ensure efficient communication between a population and its leading group. Two cool features (which Reddit is missing as well IMO) are: - Auto summarization, so you don't need to read all the comments of every thread. - Auto bias/reasoning fallacy detection and labelling, to improve the discussion quality. The LLMs of the world can do this, but the budget quality trade-off will rapidly kick in. Any ideas on how to approach this, so it's cheap, scalable and of great quality? (The project just started, so feel free to join! 🙌) ` submitted by /u/mkeySeraSera [link] [comments]
  • Open

    Who ever came up with the idea of training Language Models to be conversational?
    I just think it's such an interesting step in logic, to take a basically an advanced autocomplete model, and ask it to "autocomplete" on chat responses rather than simply arbitrary text blocks. I think it's *super* fucking cool, that you can sort of "goad" a text prediction machine into replying to you, with responses that actually take into account the semantics and logic of what came before. *Super* fucking cool. I feel like I've only ever seen that part of the whole process glossed over. submitted by /u/7ven7o [link] [comments]
    Sora explained simply with pen and paper
    Sora explained simply with pen and paper in under 5 min (based on my understanding of OpenAI's limited research blog) submitted by /u/techie_ray [link] [comments]
    Personal AI - an AI platform designed to improve human cognition
    We are the creators of Personal AI (our subreddit) - an AI platform designed to boost and improve human cognition. Personal AI was created with two missions: to build an AI for each individual and augment their biological memory to change and improve how we humans fundamentally retain, recall, and relive our own memories What is Personal AI? One core use of Personal AI is to record a person’s memories and make them readily accessible to browse and recall. For example, you can ask what the insightful thoughts are from a conversation, the name of your friend’s spouse you met the week before, or the Berkeley restaurant recommendation you got last month - pieces of information that evaporated from your memory but could be useful to you at a later time. Essentially, Personal AI creates …
    Which Ai's are best at long-form detailed outputs and editing results properly?
    Currently I'm writing a script for a presentation. ChatGPT has been very helpful brainstorming and giving feedback, but I really wish it was better at actually tweaking and modifying the script itself. I find chatGPT doesn't like to output long results or go into detail. Then when you ask it for small modifications it can change it significantly and going back to the previous one is hard it all just gets muddled, Are any other models better at this? Currently I'm writing a script for a presentation. ChatGPT has been very helpful in brainstorming and giving feedback, but I really wish it was better at actually tweaking and modifying the script itself. submitted by /u/zascar [link] [comments]
    Daily AI Quick Links (02/20/2024)
    Reddit has a new AI training deal to sell user content [1] AI determines sex of person from brain scans [2] OpenAI valued at over $80 billion in latest deal [3] FTC proposes new protections to combat AI impersonation [4] Sources: [1] https://www.theverge.com/2024/2/17/24075670/reddit-ai-training-license-deal-user-content [2] https://neurosciencenews.com/ai-gender-identification-25631/ [3] https://www.pymnts.com/cpi_posts/capital-one-in-advanced-talks-to-acquire-discover [4] https://www.ftc.gov/news-events/news/press-releases/2024/02/ftc-proposes-new-protections-combat-ai-impersonation-individuals submitted by /u/Used-Bat3441 [link] [comments]
    It looks like the common consensus is that singularity is just a matter of time. I wonder how similar is this situation with the one from over hundred years ago when apparently almost all scientists strongly believed that there was nothing to discover anymore. Can be similarly wrong this time?
    The vision that emerges from all the news is that AI is bascially that last human invention. That it will help us with every possible thing. Or could do some wrong. We can't predict that. But almost everybody agrees that it's going to disrupt everything in some way. I wonder if that's really the case. If it's just a matter of stacking more and more chips to make it better and better for which there are really no obstacles then I guess that is correct but is it really that simple? submitted by /u/bobfrutt [link] [comments]
  • Open

    Where do RL and statistics intersect ?
    I am a stats grad student and interested in RL/DRL, I want to maximize the benefit from degree and focus on stat topics (and numerical methods) that would prepare me for Phd programs in RL/DL if possible, what topics do I need to pay extra attention for ? submitted by /u/al3arabcoreleone [link] [comments]
    RL retraining
    I have a real-world robotic environment in Unity. I formulated a simpler version (mathematical model) of it and trained a TD3 agent, the results were really good. I directly applied this policy in the actual environment to see the behavior and it was actually kind of doing okaish trajectory following. I then used the trained policy to further train and fine tune it on the actual environment which has a bit more complexity. At the start the learning looked fine but after 3-4 episodes the policy was just generating super weird actions (no where near to reach the target). I am not quite sure what I’m doing wrong. I kept the learning rates and the exploration noise in the 2nd training lower than the first one so that the policy transitions smoothly with respect to the more complex environment. submitted by /u/fx619 [link] [comments]
    Any work around biped training used for any of the popular gaming engines ?
    In the old days there was natural motion Endorphin which was used in GTA5.. Then I saw a 2 minute paper on 2 bipeds boxing which was amazing.. And years have gone by but I can't find anything that I would imagine would be here by 2024 for working with trained models of Bipeds for gaming. I know about MLAGENTS for Unity and they have an example of a creature being trained and moving badly. Is there any other resources out there for this topic ? I find it really interesting submitted by /u/punkouter23 [link] [comments]
    Observation Normalization in multi-agent RL
    Hello fellow researchers, I have a question about observation normalization in MARL. When training agents in a centralized training decentralized execution (CTDE) setting, how should observation normalization be performed? I have thought of three ways to do it : normalize the observations of each agent separately (in this approach, the inputs to the critic will have different mean and std). normalize the observations of one agent and then use the mean and std for the normalization of other agents' observations, or normalize the tensor of all observations concatenated then separate observations for each corresponding agents. Thank you in advance submitted by /u/Many_Reception_4921 [link] [comments]
    Training MountainContinuousCar with PPO
    Hi! I'm currently trying to train MountrainContinuousCar with PPO.This is the type of graph I get for my average return. https://preview.redd.it/1f83xrxbzqjc1.png?width=514&format=png&auto=webp&s=28a3afb294f072506aad9a34711fa59be7cd0da5 ​ Would anyone has advices on how to make it work? This is my config btw: action_space: continuous frame_skip: 1 truncate_env_steps_at: 200 use_truncation_as_termination: true num_envs: 1 total_env_steps: 819200 agent_steps_per_env: 256 is_episodic: true agent_steps_per_batch: 256 total_agent_steps: 819200 num_batches: 3200 num_training_phases: 3200 actor: activation: Tanh linear_layers: num_layers: 2 layer_size: 64 critic: activation: Tanh linear_layers: num_layers: 2 layer_size: 64 minibatch_size: 256 num_epochs: 10 algo: Adam max_grad_norm: 5 lr: 7.77e-05 num_minibatches_per_epoch: 1 max_gradient_steps_per_rollout: 10 max_total_gradient_steps: 32000 gamma: 0.9999 lmbda: 0.9 submitted by /u/hc7Loh21BptjaT79EG [link] [comments]
  • Open

    DSC Weekly 20 February 2024
    Announcements Top Stories In-Depth The post DSC Weekly 20 February 2024 appeared first on Data Science Central.  ( 21 min )
    Custom Software Development for No-Code Tool Email Design
    Introduction The fast-paced digital world requires effective and user-friendly software solutions more than ever. Custom software development is changing business communication, especially email design. This blog explores the specialized but crucial field of-—the creation of no-code tools for email design. These tools represent the core of digital-age strategic marketing and communication, not just aesthetics. The… Read More »Custom Software Development for No-Code Tool Email Design The post Custom Software Development for No-Code Tool Email Design appeared first on Data Science Central.  ( 24 min )
    The extensive scope of knowledge graph use cases
    Image by atul prajapati from Pixabay February’s Enterprise Data Transformation Symposium, hosted by Semantic Arts, featured talks from two prominent members of pharma’s Pistoia Alliance: Martin Romacker of Roche and Ben Gardner of AstraZeneca.  It’s been evident for years now that the Pistoia Alliance, organized originally in 2008 by Pfizer, GlaxoSmithKline and Novartis for industry… Read More »The extensive scope of knowledge graph use cases The post The extensive scope of knowledge graph use cases appeared first on Data Science Central.  ( 22 min )
    The importance of cybersecurity at home and 5 tips to secure your network
    Those working in technology, security and data know protecting critical infrastructure from cybersecurity threats is a non-negotiable aspect of a functioning, modern society. However, the same mentality must apply to households. They are just as vulnerable and deserve protection. What are advanced strategies to deter threat actors and maintain privacy and data integrity? Why cybersecurity… Read More »The importance of cybersecurity at home and 5 tips to secure your network The post The importance of cybersecurity at home and 5 tips to secure your network appeared first on Data Science Central.  ( 20 min )
    Decoding different types of databases: A comparison
    In the digital landscape, the backbone of any robust application or system resides in its database architecture. Choosing the appropriate database, considering the Relational vs. Non-Relational Databases debate, is pivotal for businesses in determining performance, scalability, and the ability to handle diverse data categories. However, choosing the right database can be really tricky. There are… Read More »Decoding different types of databases: A comparison The post Decoding different types of databases: A comparison appeared first on Data Science Central.  ( 23 min )
  • Open

    A neural network to generate animations based on a certain example
    Hello! I'm making a videogame with a ton of very similar animations. For example: a 2D guy jumps 2 meters to his right. Is there a neural network who will create an animation of this very guy jumping 3 meters to his right -- based on the previous animation? Thanks! submitted by /u/SomethingsAwesome [link] [comments]
    Having trouble recognizing my own digits with a Neuronal network trained on msnit data set
    I really dont get it what i am doing wrong my image has the same shape (28,28) like the images from the Mnist dataset und when i show them with plt.imshow(img_flat) plt.show() they look exactly the same but since days i cannot figure out why my images are not recognized and the images from the data set are... I did a YT tutorial but the exact same code does not work for me ... #My own data model = tf.keras.models.load_model("handwritten.model") img = cv2.imread("Digits/digit11.png") gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # gray = tf.keras.utils.normalize(gray, axis=1) img_flat = gray.reshape(28, 28) # Form (1, 784) prediction = model.predict(np.expand_dims(img_flat, axis=0)) print(f"This digit is probably a {np.argmax(prediction)}") #MNIST Data prediction = model.predict(np.expand_dims(x_train[27], axis=0)) print(f"This digit is probably a {np.argmax(prediction)}") heres my code, and all digits from mnsit are recognized perfectly... really appreciate any help ... submitted by /u/xXSavardoXx [link] [comments]
  • Open

    Streamline diarization using AI as an assistive technology: ZOO Digital’s story
    ZOO Digital provides end-to-end localization and media services to adapt original TV and movie content to different languages, regions, and cultures. It makes globalization easier for the world’s best content creators. Trusted by the biggest names in entertainment, ZOO Digital delivers high-quality localization and media services at scale, including dubbing, subtitling, scripting, and compliance. Typical […]  ( 11 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations while a tensor can be a number, matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post   The unlimited access initiative presents a risk-free way to break into data science.     The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post      Attend the Data Science Symposium 2022 on November 8 The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2024-03-21T00:44:53.794Z osmosfeed 1.15.1